Data cleanup & check

Data cleaning in the machine learning process

A well-functioning machine learning model starts with one crucial step: clean and reliable data. Especially in the insurance industry - where decision-making directly impacts customer relationships, compliance and commercial value - data cleaning is an essential part of the AI process within Onesurance.

Why is data cleaning important?

Machine learning algorithms are sensitive to noise, errors and missing values. Failing to address these issues properly leads to:

  • Distorted or non-reproducible forecasts

  • Reduced reliability of reporting and steering information

  • Poor generalization to new customer data

  • Unwarranted conclusions and increased risk of bias

With a structured approach to data cleaning, Onesurance guarantees:

  • Transparency and explainability of each model

  • Repeatable and reproducible analyses

  • Reliable and honest output for decisions in customer management

Commonly used techniques for data cleaning

Removing or imputing missing values

  • Records with too much missing data are deleted.

  • Imputation: missing values are filled in based on means, medians or algorithmic estimates (e.g. KNN imputation).

  • Case study at Onesurance:
    Missing claim amounts are imputed based on similar claims within the same industry and segment, as sketched below.
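
For illustration, a minimal sketch of these options using pandas and scikit-learn's KNNImputer; the column names (claim_amount, policy_age_years, contact_moments) are hypothetical and not taken from a real Onesurance dataset.

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Illustrative data with missing claim amounts (column names are hypothetical)
df = pd.DataFrame({
    "claim_amount": [1200.0, None, 850.0, None, 2300.0],
    "policy_age_years": [3, 5, 2, 7, 4],
    "contact_moments": [1, 4, 2, 6, 3],
})

# Option 1: remove records with too many missing fields (keep rows with >= 2 known values)
df_dropped = df.dropna(thresh=2)

# Option 2: simple imputation with the column median
df_median = df.fillna({"claim_amount": df["claim_amount"].median()})

# Option 3: KNN imputation -- estimate each gap from the k most similar records
imputer = KNNImputer(n_neighbors=2)
df_knn = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_knn)
```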

Detection of outliers

  • Statistical methods: Z-scores, interquartile range (IQR)

  • Visualization: box plots, scatterplots

  • Model-based detection: e.g., Isolation Forest

  • Case study:
    An extremely high number of contact moments or an unusually high claim frequency may indicate data errors, fraud attempts or exceptional customer profiles; the sketch below shows how such values can be flagged.
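
The sketch below shows the three detection approaches on a single hypothetical feature; the thresholds (a Z-score of 3, an IQR factor of 1.5, a contamination rate of 0.05) are common defaults rather than Onesurance-specific settings.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical feature: number of contact moments per customer
contacts = pd.Series(
    [2, 3, 1, 4, 2, 3, 2, 3, 1, 4, 2, 3, 2, 3, 1, 4, 2, 3, 2, 48],
    name="contact_moments",
)

# Z-score: flag values more than 3 standard deviations from the mean
z_scores = (contacts - contacts.mean()) / contacts.std()
z_outliers = contacts[z_scores.abs() > 3]

# Interquartile range (IQR): flag values more than 1.5 * IQR beyond the quartiles
q1, q3 = contacts.quantile(0.25), contacts.quantile(0.75)
iqr = q3 - q1
iqr_outliers = contacts[(contacts < q1 - 1.5 * iqr) | (contacts > q3 + 1.5 * iqr)]

# Model-based: Isolation Forest marks suspected outliers with label -1
# (contamination=0.05 is an assumed setting, not an Onesurance default)
model = IsolationForest(contamination=0.05, random_state=42)
labels = model.fit_predict(contacts.to_frame())
iforest_outliers = contacts[labels == -1]

print(z_outliers, iqr_outliers, iforest_outliers, sep="\n")
```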

Cleaning up text fields

  • Lowercasing and punctuation removal

  • Removing stop words and normalization via lemmatization/stemming

  • Case study:
    In analyses of contact notes and complaint descriptions, texts are systematically cleaned for sentiment analysis and urgency classification, as in the sketch below.
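
A minimal sketch of such a cleaning step in Python; the stop word list is a tiny illustrative set, and lemmatization or stemming would in practice be handled by an NLP library such as spaCy or NLTK.

```python
import re

# Tiny illustrative stop word list; in practice a full Dutch/English list is used
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "is"}

def clean_text(text: str) -> str:
    """Lowercase a note, strip punctuation and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    # Lemmatization or stemming would typically follow here, e.g. via spaCy or NLTK
    return " ".join(tokens)

# Fictitious complaint description
note = "Customer is unhappy: the claim payout took TOO long!"
print(clean_text(note))  # -> "customer unhappy claim payout took too long"
```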

Detection of inconsistent or duplicate records

  • Check for identical customer data, duplicate policies or claims without a valid policy reference

  • Case study:
    This prevents a relation (customer) that is registered twice from being counted as two unique customers in CLV calculations and churn analysis; see the sketch below.
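
A minimal sketch of duplicate detection with pandas; the identifying columns (name, postcode, birth_date) are illustrative and not the actual matching keys used by Onesurance.

```python
import pandas as pd

# Illustrative customer records; real matching uses the broker's relation keys
customers = pd.DataFrame({
    "name":       ["Jansen BV", "Jansen BV", "De Vries", "Bakker"],
    "postcode":   ["1234AB", "1234AB", "5678CD", "9012EF"],
    "birth_date": ["1980-01-01", "1980-01-01", "1975-06-15", "1990-03-20"],
})

# Flag records that share identifying fields and are therefore likely duplicates
key_columns = ["name", "postcode", "birth_date"]
duplicates = customers[customers.duplicated(subset=key_columns, keep=False)]
print(duplicates)

# Deduplicate before CLV or churn calculations, keeping the first occurrence
unique_customers = customers.drop_duplicates(subset=key_columns, keep="first")
```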

Specific concerns with insurance data

  • Policy duration: Correct start and end dates are essential for churn and lifetime value calculations

  • Customer hierarchy: Structures within households or companies must be linked correctly to model customer value well

  • Links between datasets: Claims must always be linked to the correct policy and relation, as in the check sketched below
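
To illustrate the last point, a small sketch that flags claims without a valid policy reference; the tables and column names are hypothetical.

```python
import pandas as pd

# Hypothetical extracts; real datasets use the broker's policy and relation keys
policies = pd.DataFrame({"policy_id": [101, 102, 103], "relation_id": [1, 1, 2]})
claims = pd.DataFrame({"claim_id": [9001, 9002, 9003], "policy_id": [101, 104, 103]})

# Left-join claims onto policies to find claims without a valid policy reference
linked = claims.merge(policies, on="policy_id", how="left", indicator=True)
orphan_claims = linked[linked["_merge"] == "left_only"]

# Orphan claims (here claim 9002, pointing to unknown policy 104) are flagged for correction
print(orphan_claims[["claim_id", "policy_id"]])
```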

Additional steps at Onesurance

  • Validation: During each onboarding, data quality is validated

  • Logging & auditing: Every correction and edit is tracked for review and compliance (see the sketch after this list)

  • Feedback loop: Data cleaning is a continuous process, with input from account managers being fed back to the data team

  • Periodic data quality reports are available upon request
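
To illustrate the logging & auditing step, a minimal sketch of how a correction could be recorded; the fields and JSON format are assumptions, not Onesurance's actual audit schema.

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("data_cleaning_audit")

def log_correction(record_id: str, field: str, old_value, new_value, reason: str) -> None:
    """Write one structured audit entry per correction so every edit stays reviewable."""
    audit_log.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "record_id": record_id,
        "field": field,
        "old_value": old_value,
        "new_value": new_value,
        "reason": reason,
    }))

# Example: re-linking a claim to a valid policy after a manual check
log_correction("claim-9002", "policy_id", 104, 103, "re-linked to valid policy")
```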

Want to know more or request a data quality check?

Have questions about how your data is cleaned within Onesurance, or would you like a report on your own data quality? Contact your Customer Success Manager - who can explain which cleaning techniques are relevant to your data structure, how exceptions are handled, and how Onesurance ensures continuity and reliability of predictions.