Data cleanup & check

Data cleaning in the machine learning process

A well-functioning machine learning model starts with one crucial step: clean and reliable data. Especially in the insurance industry - where decision-making directly impacts customer relationships, compliance and commercial value - data cleaning is an essential part of the AI process within Onesurance.

Why is data cleaning important?

Machine learning algorithms are sensitive to noise, errors and missing values. Failing to address these issues properly leads to:

  • Distorted or non-reproducible forecasts

  • Reduced reliability of reporting and steering information

  • Poor generalization to new customer data

  • Unwarranted conclusions and increased risk of bias

With a structured approach to data cleaning, Onesurance guarantees:

  • Transparency and explainability of each model

  • Repeatable and reproducible analyses

  • Reliable and honest output for decisions in customer management

Commonly used techniques for data cleaning

Removing or imputing missing values

  • Records with too much missing data are deleted.

  • Imputation: missing values are filled in based on means, medians or algorithmic estimates (e.g. KNN imputation).

  • Case study at Onesurance:
    Missing claim amounts are imputed based on similar claims within the same industry and segment, as sketched below.
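
For illustration, a minimal sketch of these options using pandas and scikit-learn's KNNImputer; the column names (claim_amount, policy_age_years, contact_moments) are hypothetical and not taken from a real Onesurance dataset.

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Illustrative data with missing claim amounts (column names are hypothetical)
df = pd.DataFrame({
    "claim_amount": [1200.0, None, 850.0, None, 2300.0],
    "policy_age_years": [3, 5, 2, 7, 4],
    "contact_moments": [1, 4, 2, 6, 3],
})

# Option 1: remove records with too many missing fields (keep rows with >= 2 known values)
df_dropped = df.dropna(thresh=2)

# Option 2: simple imputation with the column median
df_median = df.fillna({"claim_amount": df["claim_amount"].median()})

# Option 3: KNN imputation -- estimate each gap from the k most similar records
imputer = KNNImputer(n_neighbors=2)
df_knn = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_knn)
```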

Detection of outliers

  • Statistical methods: Z-scores, interquartile range (IQR)

  • Visualization: box plots, scatterplots

  • Model-based detection: e.g., Isolation Forest

  • Case study:
    An extremely high number of contact moments or an unusually high claim frequency may indicate data errors, fraud attempts or exceptional customer profiles; the sketch below shows how such values can be flagged.
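
The sketch below shows the three detection approaches on a single hypothetical feature; the thresholds (a Z-score of 3, an IQR factor of 1.5, a contamination rate of 0.05) are common defaults rather than Onesurance-specific settings.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical feature: number of contact moments per customer
contacts = pd.Series(
    [2, 3, 1, 4, 2, 3, 2, 3, 1, 4, 2, 3, 2, 3, 1, 4, 2, 3, 2, 48],
    name="contact_moments",
)

# Z-score: flag values more than 3 standard deviations from the mean
z_scores = (contacts - contacts.mean()) / contacts.std()
z_outliers = contacts[z_scores.abs() > 3]

# Interquartile range (IQR): flag values more than 1.5 * IQR beyond the quartiles
q1, q3 = contacts.quantile(0.25), contacts.quantile(0.75)
iqr = q3 - q1
iqr_outliers = contacts[(contacts < q1 - 1.5 * iqr) | (contacts > q3 + 1.5 * iqr)]

# Model-based: Isolation Forest marks suspected outliers with label -1
# (contamination=0.05 is an assumed setting, not an Onesurance default)
model = IsolationForest(contamination=0.05, random_state=42)
labels = model.fit_predict(contacts.to_frame())
iforest_outliers = contacts[labels == -1]

print(z_outliers, iqr_outliers, iforest_outliers, sep="\n")
```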

Cleaning up text fields

  • Lowercasing and punctuation removal

  • Removing stop words and normalization via lemmatization/stemming

  • Case study:
    In analyses of contact notes and complaint descriptions, texts are systematically cleaned for sentiment analysis and urgency classification, as in the sketch below.
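
A minimal sketch of such a cleaning step in Python; the stop word list is a tiny illustrative set, and lemmatization or stemming would in practice be handled by an NLP library such as spaCy or NLTK.

```python
import re

# Tiny illustrative stop word list; in practice a full Dutch/English list is used
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "is"}

def clean_text(text: str) -> str:
    """Lowercase a note, strip punctuation and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    # Lemmatization or stemming would typically follow here, e.g. via spaCy or NLTK
    return " ".join(tokens)

# Fictitious complaint description
note = "Customer is unhappy: the claim payout took TOO long!"
print(clean_text(note))  # -> "customer unhappy claim payout took too long"
```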

Detection of inconsistent or duplicate records

  • Check for identical customer data, duplicate policies or claims without a valid policy reference

  • Case study:
    This prevents a relation (customer) that is registered twice from being counted as two unique customers in CLV calculations and churn analysis; see the sketch below.
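
A minimal sketch of duplicate detection with pandas; the identifying columns (name, postcode, birth_date) are illustrative and not the actual matching keys used by Onesurance.

```python
import pandas as pd

# Illustrative customer records; real matching uses the broker's relation keys
customers = pd.DataFrame({
    "name":       ["Jansen BV", "Jansen BV", "De Vries", "Bakker"],
    "postcode":   ["1234AB", "1234AB", "5678CD", "9012EF"],
    "birth_date": ["1980-01-01", "1980-01-01", "1975-06-15", "1990-03-20"],
})

# Flag records that share identifying fields and are therefore likely duplicates
key_columns = ["name", "postcode", "birth_date"]
duplicates = customers[customers.duplicated(subset=key_columns, keep=False)]
print(duplicates)

# Deduplicate before CLV or churn calculations, keeping the first occurrence
unique_customers = customers.drop_duplicates(subset=key_columns, keep="first")
```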

Specific concerns with insurance data

  • Policy duration: Correct start and end dates are essential for churn and lifetime value calculations

  • Customer hierarchy: Structures within households or companies must be linked correctly to model customer value well

  • Links between datasets: Claims must always be linked to the correct policy and relation, as in the check sketched below
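
To illustrate the last point, a small sketch that flags claims without a valid policy reference; the tables and column names are hypothetical.

```python
import pandas as pd

# Hypothetical extracts; real datasets use the broker's policy and relation keys
policies = pd.DataFrame({"policy_id": [101, 102, 103], "relation_id": [1, 1, 2]})
claims = pd.DataFrame({"claim_id": [9001, 9002, 9003], "policy_id": [101, 104, 103]})

# Left-join claims onto policies to find claims without a valid policy reference
linked = claims.merge(policies, on="policy_id", how="left", indicator=True)
orphan_claims = linked[linked["_merge"] == "left_only"]

# Orphan claims (here claim 9002, pointing to unknown policy 104) are flagged for correction
print(orphan_claims[["claim_id", "policy_id"]])
```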

Additional steps at Onesurance

  • Validation: During each onboarding, data quality is validated

  • Logging & auditing: Every correction and edit is tracked for review and compliance (see the sketch after this list)

  • Feedback loop: Data cleaning is a continuous process, with input from account managers being fed back to the data team

  • Periodic data quality reports are available upon request
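
To illustrate the logging & auditing step, a minimal sketch of how a correction could be recorded; the fields and JSON format are assumptions, not Onesurance's actual audit schema.

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("data_cleaning_audit")

def log_correction(record_id: str, field: str, old_value, new_value, reason: str) -> None:
    """Write one structured audit entry per correction so every edit stays reviewable."""
    audit_log.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "record_id": record_id,
        "field": field,
        "old_value": old_value,
        "new_value": new_value,
        "reason": reason,
    }))

# Example: re-linking a claim to a valid policy after a manual check
log_correction("claim-9002", "policy_id", 104, 103, "re-linked to valid policy")
```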

Want to know more or request a data quality check?

Have questions about how your data is cleaned within Onesurance, or would you like a report on your own data quality? Contact your Customer Success Manager - who can explain which cleaning techniques are relevant to your data structure, how exceptions are handled, and how Onesurance ensures continuity and reliability of predictions.