Accurate insurance claims prediction with Deep Learning

Lotus Labs
6 min read · Mar 31, 2021


Insurance companies have a strong interest in predicting the future: accurate predictions give them a chance to reduce financial losses. A major cause of increased costs is payment errors made by insurers while processing claims. Furthermore, because of these payment errors, claims processing accounts for a significant portion of administrative costs.

Many other sectors have long recognized the potential in self-learning software and cognitive systems. AI and Machine Learning can help companies optimize their services with higher accuracy, strengthening claims management by systematically identifying and correcting errors and providing tools for making better decisions.

Less known are the opportunities that the use of smart technology enables for insurers. Machine Learning can help insurers efficiently screen cases, evaluate them with greater precision, and make accurate cost predictions.

The conventional approach to claims management is built on rule-based algorithms. These algorithms are inflexible: once the rules are written, they tend to be applied equally to every case. Insurance companies can instead apply cognitive models to analyze and predict insurance costs and to manage claims. These models learn from historical data and find patterns that can be used to optimize services further. McKinsey has estimated that German insurers could save about 500 million euros each year just by switching to Machine Learning systems.

This article examines how machine learning algorithms developed at LotusLabs significantly improve prediction accuracy, helping insurance companies automate their decision-making processes by generalizing and learning patterns from historical examples.

To showcase how machine learning can make a difference in claim prediction, we will discuss a health insurance use case based on publicly available data.

Health Insurance Claims

The first dataset consists of 1338 anonymous records of health insurance claims with 7 features: the age of the policyholder, their gender, their body mass index (BMI), the number of children, whether they are a smoker, the region of residence, and the individual medical costs billed by the health insurance.
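As a minimal sketch of how one such record could be represented in code, the seven features above map naturally onto a small typed structure. The field names and the "yes"/"no" encoding of the smoker flag are assumptions made for illustration; the dataset's actual column names and encodings may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClaimRecord:
    """One policyholder record. Field names are assumed, not taken from the dataset."""
    age: int
    sex: str
    bmi: float
    children: int
    smoker: bool
    region: str
    charges: float  # prediction target: medical costs billed by the insurer


def parse_record(row: dict) -> ClaimRecord:
    """Convert a row of strings (e.g. from csv.DictReader) into typed fields."""
    return ClaimRecord(
        age=int(row["age"]),
        sex=row["sex"].strip().lower(),
        bmi=float(row["bmi"]),
        children=int(row["children"]),
        # Assumes the smoker flag is stored as the text "yes" or "no".
        smoker=row["smoker"].strip().lower() == "yes",
        region=row["region"].strip().lower(),
        charges=float(row["charges"]),
    )
```

Typing the fields up front separates the numeric features (age, BMI, children) from the categorical ones (gender, smoker, region), a distinction that matters later when categorical values are embedded.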

An example of what the data looks like can be seen in the following table:

If we plot the correlation between all the features, we observe some positive correlation between charges and age, BMI, and being a smoker. This makes sense, given that smoking and obesity are strong indicators of an unhealthy lifestyle. Although this is intuitive, the correlation is not strong enough to draw firm conclusions at this point.
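The correlation check described above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper name `correlation_with_target` is hypothetical, Pearson correlation is assumed as the measure, and categorical features such as the smoker flag are assumed to be encoded as 0/1 before being passed in.

```python
import numpy as np


def correlation_with_target(features: dict, target) -> dict:
    """Pearson correlation of each numeric feature with the target.

    `features` maps a feature name to a sequence of numbers; binary
    categories (such as a smoker flag) must already be encoded as 0/1.
    """
    y = np.asarray(target, dtype=float)
    return {
        name: float(np.corrcoef(np.asarray(values, dtype=float), y)[0, 1])
        for name, values in features.items()
    }
```

Calling it with the age, BMI, and smoker columns against the charges column would return one coefficient per feature, each between -1 and 1. Pearson correlation only captures linear relationships, which is one reason a modest coefficient here does not rule out a strong non-linear effect.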

The following figure takes a more in-depth look at the charges with a joint plot against BMI. There are clear non-linearities in the relationship between the two features. These non-linearities might be grouped into two or three categories, but at this point it is still difficult to draw conclusions. We will try to implement models that exploit these non-linearities, such as neural networks and tree-based models.

If we further mark data points for smokers and non-smokers, as shown in the following figure, we observe two clear trends and can easily understand what they represent. In the first group, plotted in cyan, the non-smokers show a flat trend, meaning that BMI is not linearly correlated with their charges, while the blue group, the smokers, shows a clear, strong trend. This trend shows that being obese and a smoker strongly correlates with charges: the higher the BMI, the higher the charges will be.

Figure 3. Trends in the data for smokers and non-smokers. Obesity does not influence the charges as much as smoking.
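One simple way to quantify the two trends discussed above is to fit a separate straight line of charges against BMI for each group and compare the slopes. The sketch below assumes an ordinary least-squares fit and a hypothetical helper name, `slope_by_group`; it is not necessarily how the figure was produced.

```python
import numpy as np


def slope_by_group(bmi, charges, is_smoker) -> dict:
    """Fit charges = slope * bmi + intercept separately for each group."""
    bmi = np.asarray(bmi, dtype=float)
    charges = np.asarray(charges, dtype=float)
    mask = np.asarray(is_smoker, dtype=bool)
    slopes = {}
    for label, selected in (("smoker", mask), ("non_smoker", ~mask)):
        # polyfit with deg=1 returns the coefficients as (slope, intercept).
        slope, _intercept = np.polyfit(bmi[selected], charges[selected], deg=1)
        slopes[label] = float(slope)
    return slopes
```

A slope near zero for non-smokers and a clearly positive slope for smokers would match the flat and rising trends described in the text. Each group needs at least two records with different BMI values for the fit to be defined.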

In the following sections, we will incorporate this information in machine learning models and then show how our deep learning model outperforms approaches based on manual feature engineering and data analysis.

Our Model

Now that we have visualized and understood the dataset, we can create a model that predicts the cost of claims. To do so, we create a tailored deep learning algorithm that outperforms most common machine learning models.

Figure 4. Schematics between rule-based systems, machine learning, and deep learning. Deep learning systems can produce better results with less manual input.

Deep learning is a powerful class of machine learning algorithms that use artificial neural networks to understand and leverage patterns in data. Deep learning algorithms use multiple layers to extract higher-level features from raw data progressively: this reduces the amount of feature extraction needed in other machine learning methods. The deep learning algorithm learns on its own by recognizing patterns using many layers of processing. That is why the “deep” in “deep learning” refers to the number of layers through which the data is transformed. Multiple transformations automatically extract important features from raw data.

This is the opposite of more traditional, rule-based methods, which require manual input for the data analysis, the feature extraction, and the rule creation, usually a tedious process.
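To make the idea of "multiple layers" concrete, here is a minimal sketch of a forward pass through a small fully connected network. It is not the model described in this post: the layer widths, the ReLU activation, and the random untrained weights are all illustrative assumptions.

```python
import numpy as np


def relu(x):
    """Element-wise non-linearity applied between layers."""
    return np.maximum(x, 0.0)


def forward(x, layers):
    """Apply each (weights, bias) pair in turn, with ReLU on all but the last layer."""
    h = np.asarray(x, dtype=float)
    for i, (w, b) in enumerate(layers):
        h = h @ w + b
        if i < len(layers) - 1:
            h = relu(h)
    return h


def init_layers(sizes, seed=0):
    """Random, untrained weights for layer widths such as [3, 8, 4, 1]."""
    rng = np.random.default_rng(seed)
    return [
        (rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]
```

With `init_layers([3, 8, 4, 1])`, a batch of records with three numeric inputs passes through two hidden layers and comes out as one predicted value per record. Training would adjust the weights so that each intermediate layer produces features useful for the final prediction, which is the automatic feature extraction described above.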

Figure 5. Our model schematics.

Our model's core idea is the use of entity embeddings, which means using a different set of dimensions to represent a categorical set of data.

A categorical set of inputs is a type of data where we have different categories (or types) that are unrelated to each other. Each entity is mapped to an embedding (a vector) in new dimensions, hence the term entity embedding (More on Entity Embeddings in this paper). Think of these different dimensions as different characteristics in the dataset. By applying this technique, we find a hidden (or latent) representation that works for our specific problem. A neural network learns the hidden representation during the standard supervised training process. By mapping similar values close to each other in the embedding space, the model identifies patterns in the categorical variables that would otherwise have been difficult to reveal. This means that we can find useful patterns without performing any feature engineering, i.e., no tagging of records with any features and no clustering of smokers or patients by BMI.
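The mechanics of an entity embedding can be sketched as a lookup table with one row of numbers per category. The code below is an illustration only, not the model shown in Figure 5: the class name `EntityEmbedding`, the helper `build_inputs`, and the embedding size are assumptions, and the table is filled with random placeholders where a real network would learn the values during training.

```python
import numpy as np


class EntityEmbedding:
    """Lookup table mapping each category to a dense vector."""

    def __init__(self, categories, dim, seed=0):
        self.index = {c: i for i, c in enumerate(categories)}
        rng = np.random.default_rng(seed)
        # In a real model these rows are trained by backpropagation together
        # with the rest of the network; here they are random placeholders.
        self.table = rng.normal(scale=0.1, size=(len(self.index), dim))

    def __call__(self, values):
        ids = np.array([self.index[v] for v in values])
        return self.table[ids]  # shape: (len(values), dim)


def build_inputs(numeric, categorical, embeddings):
    """Concatenate numeric columns with the embedded categorical columns."""
    parts = [np.asarray(numeric, dtype=float)]
    for name, emb in embeddings.items():
        parts.append(emb(categorical[name]))
    return np.concatenate(parts, axis=1)
```

Looking up a row is mathematically the same as multiplying a one-hot vector by the table, so the embedding acts as a trainable layer placed in front of the network. After training, categories that behave similarly with respect to the target tend to end up with nearby vectors, which is the latent representation described above.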

Now, let us compare our deep learning model against some popular machine learning algorithms (XGBoost, Random Forest) to showcase the predictive accuracy of deep learning models. The metric we choose to evaluate the regression models is the mean absolute error (MAE). As seen in the figure below, our deep learning model performs well compared to these classical machine learning models, in this case reducing the MAE by 11%.
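For reference, the MAE is the average of the absolute differences between predicted and actual charges. The sketch below shows the metric and one plausible way to express a relative improvement between two models; the exact formula behind the 11% figure is not stated here, so `relative_improvement` is an assumption about how such a number is typically computed.

```python
def mean_absolute_error(y_true, y_pred) -> float:
    """Average absolute difference between actual and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_improvement(baseline_mae: float, model_mae: float) -> float:
    """Fraction by which the model reduces the baseline's MAE (assumed definition)."""
    return (baseline_mae - model_mae) / baseline_mae
```

Because the MAE is expressed in the same unit as the charges, it reads directly as the typical size of a prediction error, and it is less sensitive to a few very large claims than a squared-error metric would be.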

In future posts, we will discuss how this type of model generalizes extremely well and can be applied to similar datasets.

How can LotusLabs help you?

Building an AI system is clearly a complex undertaking. The right conditions must be in place to ensure that the system also works reliably in day-to-day operations, performing as planned. The factors that determine whether the implementation is successful cover all levels of the insurance business.

At Lotus Labs, we are experts in Machine Learning and AI infrastructure. Our people work with your people, at all levels. Our methods help you find ways to put AI to work.

You want to see AI drive value in every corner of your business. But how do you get started? And how do you get there before your competition? LotusLabs helps you define an AI Roadmap that contains your vision. With the roadmap ready, you can focus on projects with the highest return and least risk.

Transform your business into an AI-driven enterprise by implementing machine learning models that solve complex business problems and drive real ROI on the path toward becoming a functioning AI-supported insurer.


