Data cleaning
Data cleaning in the machine learning process
A well-functioning machine learning model starts with one crucial step: clean and reliable data. In practice, raw data is often incomplete, inconsistent, and error-prone. Data cleaning is therefore an essential part of the AI process, especially in the insurance industry, where decision-making directly affects customers, compliance and commercial value.
In this article, we give an overview of common data cleaning techniques and how we apply them to the data processed at Onesurance.
Why is data cleaning important?
Machine learning algorithms are sensitive to noise, errors and missing values. If left unresolved, these can lead to:
Biased predictions
Reduced reliability
Poor generalization to new customer data
Unwarranted conclusions or unfair bias
Through structured data cleaning, we ensure that our models:
Are transparently explainable
Are repeatable and reproducible
Make reliable and fair predictions
Data cleaning techniques used
1. Removing or imputing missing values
Missing values (nulls or empty fields) are often resolved by:
Deleting records if too much data is missing
Imputation: filling in values based on means, medians, or algorithmic estimation (e.g. KNN imputation)
Application at Onesurance:
For claims data, a missing claim amount can be imputed based on similar claims in the same industry and segment.
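A minimal sketch of what such an imputation could look like in Python, assuming pandas and scikit-learn's KNNImputer; the column names and values are purely illustrative and not a description of the production pipeline:

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Illustrative claims data: one claim amount is missing.
claims = pd.DataFrame({
    "industry_code": [1, 1, 2, 2, 1],
    "segment_code":  [10, 10, 20, 20, 10],
    "claim_amount":  [1200.0, None, 800.0, 950.0, 1100.0],
})

# KNN imputation: the missing amount is estimated from the most
# similar claims based on the other numeric columns.
imputer = KNNImputer(n_neighbors=2)
claims[claims.columns] = imputer.fit_transform(claims)

print(claims)
```

Here the missing amount is filled in with the average of its two nearest neighbours, i.e. the claims with the same industry and segment codes.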
2. Detection of outliers
Outliers can greatly affect a model. They are identified through:
Statistical methods (e.g., Z-scores, IQR)
Visualization (boxplots, scatterplots)
Model-based detection (e.g., Isolation Forests)
An example:
An extremely high contact volume or a very high claim frequency may indicate data errors, exceptional customers, or possible fraud.
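A minimal sketch of both a statistical (IQR) and a model-based (Isolation Forest) check, assuming pandas and scikit-learn; the claim-frequency values and thresholds are illustrative only:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Illustrative data: claim frequency per customer, with one extreme value.
df = pd.DataFrame({"claim_frequency": [0, 1, 2, 1, 0, 3, 1, 45]})

# 1) Statistical: flag values outside 1.5 * IQR around the quartiles.
q1, q3 = df["claim_frequency"].quantile([0.25, 0.75])
iqr = q3 - q1
df["iqr_outlier"] = (df["claim_frequency"] < q1 - 1.5 * iqr) | (
    df["claim_frequency"] > q3 + 1.5 * iqr
)

# 2) Model-based: Isolation Forest labels anomalies as -1.
model = IsolationForest(contamination=0.1, random_state=0)
df["iforest_outlier"] = model.fit_predict(df[["claim_frequency"]]) == -1

print(df)
```

Flagged records can then be reviewed: data errors are corrected or removed, while genuinely exceptional customers are kept.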
3. Cleaning up text fields
Free text (such as complaint descriptions or contact notes) is cleaned up by:
Lowercasing, punctuation removal
Removing stop words
Lemmatization/stemming
An example:
In sentiment analysis on contact moments, the text is first cleaned, then classified by sentiment or urgency.
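A minimal sketch of such a cleaning step in plain Python; the stop-word list is deliberately tiny and purely illustrative, and in practice the lemmatization/stemming step would use an NLP library such as spaCy or NLTK:

```python
import re

# Illustrative mini stop-word list; a real pipeline would use a full
# stop-word list for the relevant language.
STOP_WORDS = {"the", "a", "is", "and", "of", "my"}

def clean_text(text: str) -> list[str]:
    """Lowercase the text, strip punctuation, and remove stop words."""
    text = text.lower()                           # lowercasing
    text = re.sub(r"[^\w\s]", " ", text)          # punctuation removal
    return [t for t in text.split() if t not in STOP_WORDS]

print(clean_text("The premium of my policy is TOO high!!"))
# ['premium', 'policy', 'too', 'high']
```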
4. Detection of inconsistent or duplicate records
We check for:
Customers with identical name and address information
Policies that occur multiple times
Claims without a policy reference
An example:
This prevents a customer who is registered twice from being counted as two unique customers (e.g., when calculating CLV).
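A minimal sketch of such a check with pandas; the customer records, column names, and the choice to merge duplicates by summing CLV are illustrative assumptions:

```python
import pandas as pd

# Illustrative customer records: the same relation is registered twice.
customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "name":        ["Jansen BV", "Jansen BV", "De Vries"],
    "postal_code": ["1234AB", "1234AB", "5678CD"],
    "house_no":    [10, 10, 22],
    "clv":         [5000.0, 3000.0, 7000.0],
})

# Flag records with identical name and address information.
dup_mask = customers.duplicated(
    subset=["name", "postal_code", "house_no"], keep=False
)
print(customers[dup_mask])

# Merge duplicates into a single relation so CLV is not split over
# two "unique" customers.
deduped = (
    customers.groupby(["name", "postal_code", "house_no"], as_index=False)
    .agg(customer_id=("customer_id", "min"), clv=("clv", "sum"))
)
print(deduped)
```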
Specific points of attention in insurance data
Onesurance's data models process complex and rich information. Specific attention is given to:
Maturity of policies: correct dates are essential for churn and lifetime calculations.
Customer hierarchies: family structures or business structures must be correctly linked to understand customer value.
Links between datasets: claims must be linked to the correct policy, and contact moments to the correct customer relation (see the sketch below).
This is how we ensure the continuity and reliability of all forecasts used by Onesurance.
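As an illustration of the linking checks mentioned above, a minimal sketch with pandas that flags claims whose policy reference does not occur in the policy table; the table layout and IDs are purely illustrative:

```python
import pandas as pd

# Illustrative policy and claim tables.
policies = pd.DataFrame({"policy_id": ["P1", "P2", "P3"]})
claims = pd.DataFrame({
    "claim_id":  ["C1", "C2", "C3"],
    "policy_id": ["P1", "P9", "P3"],   # "P9" does not exist
})

# Claims without a valid policy reference are flagged for correction
# before they enter any model.
orphan_claims = claims[~claims["policy_id"].isin(policies["policy_id"])]
print(orphan_claims)
```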
Do you have questions about how your data is cleaned, or would you like insight into your own data quality? Please contact your Customer Success Manager.