
Data cleaning in the machine learning process

A well-functioning machine learning model starts with one crucial step: clean and reliable data. In practice, raw data is often incomplete, inconsistent, or error-ridden. Data cleaning is therefore an essential part of the AI process, especially in the insurance industry, where decisions directly affect customers, compliance, and commercial value.

In this article, we provide an overview of common data cleaning techniques and how we apply them to the data processed at Onesurance.


Why is data cleaning important?

Machine learning algorithms are sensitive to noise, errors and missing values. If left unresolved, these can lead to:

  • Biased predictions

  • Reduced reliability

  • Poor generalization to new customer data

  • Unwarranted conclusions or unfair bias


Through structured data cleaning, we ensure that the models:

  • Are transparently explainable

  • Are repeatable and reproducible

  • Make reliable and fair predictions


Data cleaning techniques used

1. Removing or imputing missing values

Missing values (nulls or empty fields) are often resolved by:

  • Deleting records if too much data is missing

  • Imputation: filling in values based on means, medians, or algorithmic estimation (e.g., KNN imputation)


Application to Onesurance:

For claims data, a missing claim amount can be imputed based on similar claims in the same industry and segment.
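
Below is a minimal sketch of how such an imputation could look, assuming a pandas DataFrame with illustrative columns (industry, segment, claim_amount) and using scikit-learn's KNNImputer; it demonstrates the technique rather than the exact production pipeline.

    # Minimal sketch: imputing a missing claim amount with KNN imputation.
    # The column names and values are illustrative only.
    import pandas as pd
    from sklearn.impute import KNNImputer

    claims = pd.DataFrame({
        "industry": ["retail", "retail", "logistics", "retail"],
        "segment": ["sme", "sme", "corporate", "sme"],
        "claim_amount": [1200.0, None, 8500.0, 1350.0],
    })

    # Encode the categorical context so "similar" claims can be compared numerically.
    encoded = pd.get_dummies(claims, columns=["industry", "segment"])

    # KNNImputer fills the missing claim_amount with the mean of the k most similar rows.
    imputer = KNNImputer(n_neighbors=2)
    claims_imputed = pd.DataFrame(imputer.fit_transform(encoded), columns=encoded.columns)

    print(claims_imputed["claim_amount"])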


2. Detection of outliers

Outliers can greatly affect a model. They are identified through:

  • Statistical methods (e.g., Z-scores, IQR)

  • Visualization (boxplots, scatterplots)

  • Model-based detection (e.g., Isolation Forests)


An example:

An extremely high contact volume or a very high claim frequency may indicate data errors, exceptional customers, or potential fraud.
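
As an illustration, the sketch below flags outliers in a hypothetical claim_frequency column with both an IQR rule and an Isolation Forest; the column name, data, and contamination setting are assumptions for the example.

    # Minimal sketch of two outlier checks on a hypothetical claim_frequency column.
    import pandas as pd
    from sklearn.ensemble import IsolationForest

    df = pd.DataFrame({"claim_frequency": [1, 2, 1, 0, 2, 1, 25, 1, 2, 1]})

    # 1) Statistical rule: flag values outside 1.5 * IQR around the middle 50%.
    q1, q3 = df["claim_frequency"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df["iqr_outlier"] = (df["claim_frequency"] < q1 - 1.5 * iqr) | (df["claim_frequency"] > q3 + 1.5 * iqr)

    # 2) Model-based detection: Isolation Forest marks anomalous rows with -1.
    iso = IsolationForest(contamination=0.1, random_state=0)
    df["iso_outlier"] = iso.fit_predict(df[["claim_frequency"]]) == -1

    print(df)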


3. Cleaning up text fields

Free text (such as a complaint description or contact note) is cleaned up by:

  • Lowercasing, punctuation removal

  • Stop word removal

  • Lemmatization/stemming


An example:

In sentiment analysis on contact moments, the text is first cleaned, then classified by sentiment or urgency.
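
The sketch below applies these cleaning steps to an illustrative contact note, using a small hand-picked stop word list; in practice, a library such as spaCy or NLTK would typically supply the stop words and the lemmatization or stemming step.

    # Minimal sketch of text cleaning on a free-text contact note.
    # The note and the stop word list are illustrative only.
    import re

    STOP_WORDS = {"the", "a", "an", "is", "and", "to", "of", "my"}

    def clean_text(text):
        text = text.lower()                   # lowercasing
        text = re.sub(r"[^\w\s]", " ", text)  # punctuation removal
        tokens = text.split()
        return [t for t in tokens if t not in STOP_WORDS]  # stop word removal

    print(clean_text("Customer is unhappy about the handling of my claim!"))
    # ['customer', 'unhappy', 'about', 'handling', 'claim']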


4. Detection of inconsistent or duplicate records

We check for:

  • Customers with identical name and address information

  • Policies that occur multiple times

  • Claims without a policy reference


An example:

Preventing a client with a duplicate registration from being counted as two unique customers (e.g., when calculating CLV).
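
A minimal sketch of such a duplicate check, assuming illustrative name and address columns in a pandas DataFrame:

    # Minimal sketch: flagging potential duplicate customers on name and address.
    # Column names and records are illustrative only.
    import pandas as pd

    customers = pd.DataFrame({
        "customer_id": [101, 102, 103],
        "name": ["J. Jansen", "j. jansen ", "P. de Vries"],
        "address": ["Main St 1", "main st 1", "Canal 22"],
    })

    # Normalize before matching so formatting differences don't hide duplicates.
    for col in ["name", "address"]:
        customers[col + "_norm"] = customers[col].str.lower().str.strip()

    # Mark every record that shares a normalized name and address with another record.
    customers["possible_duplicate"] = customers.duplicated(
        subset=["name_norm", "address_norm"], keep=False
    )

    print(customers[["customer_id", "possible_duplicate"]])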


Specific points of attention in insurance data

Onesurance's data models process complex and rich information. Specific attention goes to:

  • Policy terms and expiration dates: correct dates are essential for churn and lifetime calculations.

  • Customer hierarchies: family structures or business structures must be correctly linked to understand customer value.

  • Links between datasets: claims must be linked to the correct policy, and contact moments to the correct client (see the sketch below).
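
As a sketch of such a link check, assuming a shared policy_id key between illustrative claims and policies tables, a left merge with an indicator column reveals claims without a valid policy reference:

    # Minimal sketch: checking that every claim references an existing policy.
    # Table contents and the policy_id key are illustrative only.
    import pandas as pd

    policies = pd.DataFrame({"policy_id": [1, 2, 3]})
    claims = pd.DataFrame({"claim_id": [10, 11, 12], "policy_id": [1, 2, 99]})

    # Rows marked "left_only" are claims whose policy_id has no match in the policies table.
    checked = claims.merge(policies, on="policy_id", how="left", indicator=True)
    orphans = checked[checked["_merge"] == "left_only"]

    print(orphans[["claim_id", "policy_id"]])  # claim 12 has no matching policy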


This is how we ensure the continuity and reliability of all forecasts used by Onesurance.

Do you have questions about how your data is cleaned, or would you like insight into your own data quality? If so, please contact your Customer Success Manager.