MODELING HEALTH INSURANCE
EXPENSES
A PROJECT REPORT
Submitted by
SHANJANA K S
VAISHNAVI J
SHARMILA P
in partial fulfillment for the award of the degree
of
BACHELOR OF TECHNOLOGY
in
INFORMATION
TECHNOLOGY
RAJALAKSHMI ENGINEERING COLLEGE
RAJALAKSHMI NAGAR
THANDALAM
CHENNAI - 602105
APRIL 2024
RAJALAKSHMI ENGINEERING COLLEGE
CHENNAI – 602105
BONAFIDE CERTIFICATE
Certified that this project report “MEDICAL INSURANCE COST
PREDICTION” is the bonafide work of “SHANJANA K S (211001096),
SHARMILA P (211001099) AND VAISHNAVI J (211001111)” who carried
out the project work under my supervision.
Dr. Priya Vijay
HEAD OF THE DEPARTMENT
Professor and Head
Department of Information Technology
Rajalakshmi Engineering College
Rajalakshmi Nagar
Thandalam
Chennai - 602105

Mahalakshmi P
ASSISTANT PROFESSOR
Professor
Department of Information Technology
Rajalakshmi Engineering College
Rajalakshmi Nagar
Thandalam
Chennai - 602105
ABSTRACT
The rising costs of medical insurance premiums and healthcare services have become a
significant concern for individuals and healthcare providers alike. In response to this
challenge, predictive modeling techniques leveraging machine learning algorithms have
gained prominence in estimating medical insurance costs. This research endeavors to
develop and evaluate a predictive model for estimating medical insurance costs based on
relevant demographic, lifestyle, and health-related factors. The study utilizes a dataset
comprising historical medical insurance data, encompassing variables such as age,
gender, BMI (Body Mass Index), smoking status, region, and medical charges. Various
machine learning algorithms including linear regression, decision trees, random forests,
and gradient boosting are employed to train and validate the predictive model. Feature
engineering techniques are applied to preprocess the dataset, handle missing values, and
encode categorical variables. Evaluation metrics such as mean absolute error (MAE), root
mean squared error (RMSE), and R-squared are employed to assess the performance of
the predictive model. Additionally, feature importance analysis is conducted to identify
the key factors influencing medical insurance costs. The predictive model is then
deployed into a user-friendly interface, enabling individuals to estimate their medical
insurance costs based on their demographic and health-related attributes.
The results demonstrate that machine learning algorithms offer a promising approach for
accurately predicting medical insurance costs, with gradient boosting exhibiting superior
performance among the evaluated algorithms. Moreover, feature importance analysis
reveals that factors such as age, BMI, smoking status, and region significantly impact
medical insurance costs. In conclusion, the developed predictive model provides valuable
insights into estimating medical insurance costs, thereby assisting individuals in making
informed decisions regarding their healthcare coverage. Furthermore, the findings
contribute to the ongoing efforts aimed at enhancing the transparency and accessibility
of healthcare financing systems.
TABLE OF CONTENTS
CHAPTER NO. TITLE PAGE NO.
ABSTRACT i
LIST OF FIGURES iii
1. INTRODUCTION 1
2. REQUIREMENT SPECIFICATION 5
3. DESIGN 6
4. CODING 7
5. TESTING 12
6. PROJECT EXECUTION (SCREENSHOTS) 13
7. CONCLUSIONS 19
8. FUTURE WORK 20
REFERENCES 2
LIST OF FIGURES
Figure no. Figure name Page no.
1 WORKFLOW 6
2 FLOWCHART 8
3 IMPORTING DEPENDENCIES 14
4 DATA COLLECTION & ANALYSIS 14
5 DATA SETS 14
6 INSURANCE DATA SETS 15
7 CATEGORICAL FEATURES 15
8 DATA ANALYSIS 16
9 AGE DISTRIBUTION 17
10 SEX DISTRIBUTION 18
11 BMI DISTRIBUTION 19
12 CHILDREN DISTRIBUTION 20
13 SMOKER DISTRIBUTION 21
14 REGION COLUMN 22
15 REGION SETS 23
16 CHARGES DISTRIBUTION 24
17 OUTPUT 24
CHAPTER 1
INTRODUCTION
Introduction to modeling health insurance expenses.
The project focuses on predicting medical insurance costs for individuals using machine
learning techniques, aiming to provide valuable insights for both individuals and
insurance providers. The escalating costs of medical insurance have become a significant
concern globally, highlighting the need for accurate prediction methods. This project
seeks to develop a predictive model that can assist in better financial planning and risk
assessment.
Methodology
The methodology involves collecting a comprehensive dataset containing information on
factors influencing insurance costs, such as age, BMI, smoking habits, region, and
number of dependents. Data preprocessing techniques are employed to handle missing
values, categorical variables, and outliers. Feature engineering is then used to extract
meaningful insights from the dataset, enhancing the predictive power of the model.
Several machine learning algorithms, including linear regression, decision trees, random
forests, and gradient boosting, are explored to predict insurance costs accurately.
Hyperparameter tuning and cross-validation techniques are employed to optimize each
model's performance. Evaluation metrics such as mean absolute error, mean squared
error, and R-squared are used to assess the models' accuracy and generalization
capabilities.
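The tuning and validation steps described above can be sketched as follows. This is a minimal illustration on synthetic data, not the project's actual configuration: the data generator, the parameter grid, the smoker coding (1 = smoker), and the helper names make_synthetic_insurance and tune_forest are all assumptions introduced for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score


def make_synthetic_insurance(n_rows=200, seed=0):
    """Build a small synthetic stand-in for the insurance data (hypothetical)."""
    rng = np.random.default_rng(seed)
    frame = pd.DataFrame({
        "age": rng.integers(18, 65, n_rows),
        "bmi": rng.normal(30, 6, n_rows).round(2),
        "children": rng.integers(0, 5, n_rows),
        "smoker": rng.integers(0, 2, n_rows),  # 1 = smoker in this sketch
    })
    # Charges rise with age and BMI and jump for smokers, plus random noise.
    charges = (250 * frame["age"] + 300 * frame["bmi"]
               + 20000 * frame["smoker"] + rng.normal(0, 2000, n_rows))
    return frame, charges


def tune_forest(X, y):
    """Grid-search a small random forest with 5-fold cross-validation."""
    grid = {"n_estimators": [30, 60], "max_depth": [3, 5]}  # illustrative grid
    search = GridSearchCV(
        RandomForestRegressor(random_state=0),
        param_grid=grid,
        cv=5,
        scoring="neg_mean_absolute_error",
    )
    search.fit(X, y)
    return search


X, y = make_synthetic_insurance()
search = tune_forest(X, y)
best_params = search.best_params_
best_mae = -search.best_score_  # the scorer is negated MAE, so flip the sign
# Re-check the chosen configuration with cross-validated R-squared.
cv_r2 = cross_val_score(search.best_estimator_, X, y, cv=5, scoring="r2")
print(best_params, round(best_mae, 2), round(cv_r2.mean(), 3))
```

Scoring the grid search on mean absolute error while reporting R-squared separately keeps the two views of model quality apart; either metric could drive the search instead.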
Model Interpretability
The project also delves into model interpretability to understand the impact of different
features on insurance costs. Insights gained from these interpretations can provide
valuable information for insurance providers to adjust their pricing strategies and for
individuals to make informed decisions regarding their health and insurance coverage.
Overall, this project aims to showcase the importance of accurate cost prediction in
medical insurance and demonstrate the utility of machine learning techniques in
achieving this goal.
Importance of Medical Insurance cost prediction
Medical insurance cost prediction is a crucial aspect of the healthcare industry, offering
numerous benefits for both individuals and insurance providers. This process involves
using various factors to estimate the expected expenses for an individual's medical care
over a certain period. The importance of medical insurance cost prediction lies in its
ability to inform decision-making, improve financial planning, and enhance risk
assessment within the healthcare sector.
Benefits for Individuals
For individuals, accurate medical insurance cost prediction can help in better financial
planning. It allows them to anticipate and budget for potential medical expenses,
reducing the risk of financial strain in case of unexpected healthcare needs. Moreover,
understanding their expected insurance costs enables individuals to make informed
decisions about their healthcare coverage, ensuring they have adequate insurance plans
that meet their needs.
Benefits for Insurance Providers
For insurance providers, medical insurance cost prediction is essential for setting
premiums that accurately reflect the expected costs of providing coverage. Accurate
prediction helps in minimizing the risk of underpricing or overpricing insurance plans,
which can impact the financial stability of insurance companies. Additionally, it allows
insurance providers to tailor their offerings and pricing strategies based on the predicted
costs for different groups of individuals, improving overall risk management.
Impact on Healthcare System
Accurate medical insurance cost prediction also has broader implications for the
healthcare system as a whole. It can help in identifying trends and patterns in healthcare
utilization, enabling policymakers to make informed decisions about resource allocation
and healthcare planning. Moreover, by accurately predicting insurance costs, the
healthcare system can work towards achieving better cost-efficiency and sustainability.
In conclusion, medical insurance cost prediction plays a crucial role in the healthcare
industry, offering benefits for both individuals and insurance providers. By enabling
better financial planning, informed decision-making, and improved risk assessment,
accurate cost prediction contributes to a more efficient and sustainable healthcare
system.
IMPLEMENTATION
Data Collection & Analysis:
The initial step involves loading the dataset, which contains information about
individuals' attributes and their corresponding insurance costs. Statistical measures and
visualizations (e.g., distribution plots, count plots) are used to understand the dataset's
characteristics, such as age distribution, gender distribution, BMI distribution, etc. These
analyses provide insights into the dataset's structure and help in identifying potential
patterns or trends that can influence insurance costs.
Data Preprocessing:
Categorical features like sex, smoker status, and region are encoded into numerical
values to make them compatible with the ML model. The dataset is split into input
features (X) and the target variable (Y), which is the insurance cost. The data is further
split into training and test sets to evaluate the model's performance.
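The preprocessing steps above can be sketched on a few hand-made rows. The sample values are invented for illustration. The one-hot encoding of region through pandas.get_dummies is an assumption of this sketch: it is an alternative to integer codes, shown because region has no natural order, and is not necessarily what the project used.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hand-made sample rows (illustrative values, not taken from the real dataset).
sample = pd.DataFrame({
    "age": [19, 33, 45, 52, 27, 61],
    "sex": ["female", "male", "male", "female", "male", "female"],
    "bmi": [27.9, 22.7, 30.1, 35.2, 24.6, 29.0],
    "children": [0, 1, 2, 3, 0, 1],
    "smoker": ["yes", "no", "no", "yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest",
               "northeast", "southeast", "northwest"],
    "charges": [16800.0, 4400.0, 7200.0, 41000.0, 2900.0, 13500.0],
})

# Binary columns map cleanly to 0/1.
encoded = sample.copy()
encoded["sex"] = encoded["sex"].map({"male": 0, "female": 1})
encoded["smoker"] = encoded["smoker"].map({"yes": 0, "no": 1})

# Region is nominal, so one indicator column per region avoids implying an order.
encoded = pd.get_dummies(encoded, columns=["region"], dtype=int)

# Separate the input features (X) from the target (Y), then hold out a test set.
X = encoded.drop(columns="charges")
Y = encoded["charges"]
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.33, random_state=2
)
print(X.columns.tolist())
print(X_train.shape, X_test.shape)
```

With integer codes a linear model treats region as a single ordered quantity; indicator columns let it fit a separate offset per region at the cost of a wider feature table.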
Model Training & Evaluation:
A linear regression model is trained using the training dataset, where the model learns
the relationship between input features and insurance costs. The model's performance is
evaluated using the R-squared metric, which measures how well the model explains the
variance in the target variable. The trained model is then used to make predictions on the
test dataset to assess its generalization ability.
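For reference, R-squared is defined as 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the actual values from their mean. The sketch below computes it by hand on made-up numbers and checks the result against scikit-learn's r2_score. The helper name r_squared and the sample values are assumptions made for the example.

```python
import numpy as np
from sklearn.metrics import r2_score


def r_squared(y_true, y_pred):
    """Compute R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot


# Made-up charges and predictions, for illustration only.
actual = [1000.0, 2000.0, 3000.0, 4000.0]
predicted = [1100.0, 1900.0, 3200.0, 3800.0]

manual_r2 = r_squared(actual, predicted)
library_r2 = r2_score(actual, predicted)
print(manual_r2, library_r2)
```

Here SS_res is 100,000 and SS_tot is 5,000,000, so both computations give 0.98. A value of 1 means perfect predictions, and a value of 0 means the model does no better than always predicting the mean.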
Prediction:
Finally, the trained model is used to predict the insurance cost for a new set of input
data. The input data, which includes age, sex, BMI, smoker status, number of children,
and region, is converted into a format suitable for the model, and the prediction is made.
Fig. 1 - Workflow
CHAPTER 2
REQUIREMENT SPECIFICATION
Software Dependencies:
Python: Python 3.x.
Required Python Libraries
1. pandas
2. scikit-learn
3. seaborn
4. matplotlib
5. NumPy
Hardware Requirements:
RAM: 4 GB or more.
Storage: The code reads a CSV file from the local file system, so enough
storage space should be available for storing the dataset.
Graphics: Since the code generates plots using matplotlib and seaborn, a system
with basic graphics capabilities should be able to display the plots.
Input Requirements:
For the code to run smoothly, ensure that the CSV file containing the insurance data
is stored at the location specified in the code (/content/insurance.csv).
To obtain a prediction, supply the individual's age, sex, BMI, number of children,
smoker status, and region, encoded with the same numeric codes used during training.
Invalid or wrongly encoded values will cause errors or misleading estimates.
CHAPTER 3
DESIGN
Flowchart of the ML model :
Fig. 2 - Flowchart
CHAPTER 4
CODING
Project coding :
Importing necessary libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
Data Collection & Analysis
# loading the data from csv file to a Pandas DataFrame
insurance_dataset = pd.read_csv('/content/insurance.csv')
# first 5 rows of the dataframe
insurance_dataset.head()
# number of rows and columns
insurance_dataset.shape
# getting some information about the dataset
insurance_dataset.info()
Categorical Features:
Sex
Smoker
Region
# checking for missing values
insurance_dataset.isnull().sum()
# statistical measures of the dataset
insurance_dataset.describe()
# distribution of age values
# (histplot with kde=True replaces distplot, which is deprecated in seaborn)
sns.set()
plt.figure(figsize=(6,6))
sns.histplot(insurance_dataset['age'], kde=True)
plt.title('Age Distribution')
plt.show()
Gender column
plt.figure(figsize=(6,6))
sns.countplot(x='sex', data=insurance_dataset)
plt.title('Sex Distribution')
plt.show()
insurance_dataset['sex'].value_counts()
# bmi distribution
plt.figure(figsize=(6,6))
sns.histplot(insurance_dataset['bmi'], kde=True)
plt.title('BMI Distribution')
plt.show()
Children column
plt.figure(figsize=(6,6))
sns.countplot(x='children', data=insurance_dataset)
plt.title('Children')
plt.show()
insurance_dataset['children'].value_counts()
Smoker column
plt.figure(figsize=(6,6))
sns.countplot(x='smoker', data=insurance_dataset)
plt.title('smoker')
plt.show()
insurance_dataset['smoker'].value_counts()
Region column
plt.figure(figsize=(6,6))
sns.countplot(x='region', data=insurance_dataset)
plt.title('region')
plt.show()
insurance_dataset['region'].value_counts()
Charges distribution
plt.figure(figsize=(6,6))
sns.histplot(insurance_dataset['charges'], kde=True)
plt.title('Charges Distribution')
plt.show()
Encoding
# encoding the 'sex' column
insurance_dataset.replace({'sex':{'male':0,'female':1}}, inplace=True)
# encoding the 'smoker' column
insurance_dataset.replace({'smoker':{'yes':0,'no':1}}, inplace=True)
# encoding the 'region' column
insurance_dataset.replace({'region':
{'southeast':0,'southwest':1,'northeast':2,'northwest':3}},
inplace=True)
# splitting the features (X) and the target (Y)
X = insurance_dataset.drop(columns='charges')
Y = insurance_dataset['charges']
print(X)
print(Y)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
test_size=0.2, random_state=2)
print(X.shape, X_train.shape, X_test.shape)
Linear Regression model
# loading the Linear Regression model
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
# prediction on training data
training_data_prediction = regressor.predict(X_train)
# R squared value on training data
r2_train = metrics.r2_score(Y_train, training_data_prediction)
print('R squared value : ', r2_train)
# prediction on test data
test_data_prediction = regressor.predict(X_test)
# R squared value on test data
r2_test = metrics.r2_score(Y_test, test_data_prediction)
print('R squared value : ', r2_test)
# building a predictive system for a single encoded record
input_data = (32,1,25.74,0,1,0)
# changing input_data to a numpy array
input_data_as_numpy_array = np.asarray(input_data)
# reshape the array: one row, one column per feature
input_data_reshaped = input_data_as_numpy_array.reshape(1,-1)
prediction = regressor.predict(input_data_reshaped)
print(prediction)
print('The insurance cost is USD ', prediction[0])
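Writing the encoded tuple by hand is easy to get wrong, so one option is a small helper that applies the same codes as the encoding step and returns a one-row DataFrame. This is only a sketch: encode_input and the constant names are hypothetical, and the feature order (age, sex, bmi, children, smoker, region) is an assumption that must match the columns of X used for training.

```python
import pandas as pd

# Same integer codes as the encoding step in this chapter.
SEX_CODES = {"male": 0, "female": 1}
SMOKER_CODES = {"yes": 0, "no": 1}
REGION_CODES = {"southeast": 0, "southwest": 1, "northeast": 2, "northwest": 3}

# Assumed feature order; it must match the columns of X used for training.
FEATURE_COLUMNS = ["age", "sex", "bmi", "children", "smoker", "region"]


def encode_input(age, sex, bmi, children, smoker, region):
    """Turn one raw record into a single-row DataFrame of encoded features."""
    row = {
        "age": age,
        "sex": SEX_CODES[sex],          # raises KeyError on an unknown label
        "bmi": bmi,
        "children": children,
        "smoker": SMOKER_CODES[smoker],
        "region": REGION_CODES[region],
    }
    return pd.DataFrame([row], columns=FEATURE_COLUMNS)


# Under the assumed column order, this is the record behind (32,1,25.74,0,1,0).
features = encode_input(32, "female", 25.74, 0, "no", "southeast")
print(features)
```

Passing features to regressor.predict instead of a bare NumPy array keeps the column names attached, which avoids the feature-name warning scikit-learn emits when a model fitted on a DataFrame receives an unnamed array.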
CHAPTER 5
TESTING
1. Import RandomForestRegressor:
● from sklearn.ensemble import RandomForestRegressor: This line imports
the RandomForestRegressor class from the scikit-learn ensemble module.
RandomForestRegressor is a machine learning algorithm used for regression
tasks based on the random forest ensemble method.
2. Instantiate RandomForestRegressor:
● rf = RandomForestRegressor(n_estimators=70, max_features=3,
max_depth=5, n_jobs=-1): Here, a RandomForestRegressor object is created
with specific hyperparameters:
● n_estimators: The number of trees in the forest (in this case, 70).
● max_features: The number of features to consider when looking for
the best split (in this case, 3).
● max_depth: The maximum depth of the tree (in this case, 5).
● n_jobs: The number of jobs to run in parallel for both fit and predict
(-1 means using all processors).
3. Train the Model:
● rf.fit(X_train, Y_train): This line trains (fits) the random forest regression
model on the training data (X_train features and Y_train target).
4. Evaluate Model Performance:
● rf.score(X_test, Y_test): This line computes the coefficient of determination
(R^2) regression score of the trained model on the test data (X_test features
and Y_test target). The R^2 score indicates the proportion of the variance in
the dependent variable that is predictable from the independent variables. The
higher the R^2 score, the better the model fits the data.
Code Snippet for Test :
from sklearn.ensemble import RandomForestRegressor
# X_train, X_test, Y_train, Y_test come from train_test_split in Chapter 4
rf = RandomForestRegressor(n_estimators=70, max_features=3, max_depth=5, n_jobs=-1)
rf.fit(X_train, Y_train)
rf.score(X_test, Y_test)
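The snippet above reports only R^2. A fuller test could also report MAE and RMSE on the held-out set, as sketched below. The data here is synthetic and generated purely for illustration (an assumption of this example, including the coefficients used to build the target), so the printed numbers say nothing about the real dataset. The fixed random_state is added only to make the run repeatable.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): six numeric features, like the encoded table.
rng = np.random.default_rng(2)
n_rows = 300
X = np.column_stack([
    rng.integers(18, 65, n_rows),  # age
    rng.integers(0, 2, n_rows),    # sex code
    rng.normal(30, 6, n_rows),     # bmi
    rng.integers(0, 5, n_rows),    # children
    rng.integers(0, 2, n_rows),    # smoker code (yes=0, no=1, as in Chapter 4)
    rng.integers(0, 4, n_rows),    # region code
])
# Invented target: grows with age and BMI, much higher for smokers (code 0).
Y = (250 * X[:, 0] + 300 * X[:, 2] + 20000 * (1 - X[:, 4])
     + rng.normal(0, 2000, n_rows))

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=2
)

# Same hyperparameters as the test snippet, plus a seed for repeatability.
rf = RandomForestRegressor(n_estimators=70, max_features=3, max_depth=5,
                           n_jobs=-1, random_state=2)
rf.fit(X_train, Y_train)
Y_pred = rf.predict(X_test)

mae = mean_absolute_error(Y_test, Y_pred)
rmse = np.sqrt(mean_squared_error(Y_test, Y_pred))  # RMSE is the square root of MSE
r2 = r2_score(Y_test, Y_pred)                       # same value as rf.score(X_test, Y_test)
print(f"MAE={mae:.1f}  RMSE={rmse:.1f}  R2={r2:.3f}")
```

MAE and RMSE are in the same units as the charges, which makes them easier to relate to an actual premium than the unitless R^2.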
CHAPTER 6
PROJECT EXECUTION (Screenshots)
Fig. 3 - Importing dependencies
Fig. 4 - Data collection & analysis
Fig. 5 - Data sets
Fig. 6 - Insurance data set
Fig. 7 - Categorical features
Fig. 8 - Data analysis
Fig. 9 - Age distribution
Fig. 10 - Sex distribution
Fig. 11 - BMI distribution
Fig. 12 - Children distribution
Fig. 13 - Smoker distribution
Fig. 14 - Region column
Fig. 15 - Region data set
Fig. 16 - Charges distribution
Fig. 17 - Output
CHAPTER 7
CONCLUSION
In this project, we aimed to predict medical insurance costs using machine learning
techniques with Python. We started by collecting a dataset containing various features
such as age, BMI, smoking status, region, and charges. After preprocessing the data by
handling missing values, encoding categorical variables, and scaling numerical features,
we split the dataset into training and testing sets. We experimented with multiple
regression algorithms, including Linear Regression, Decision Tree Regression, Random
Forest Regression, and Gradient Boosting Regression. Each algorithm was trained on
the training data and evaluated using metrics such as Mean Absolute Error (MAE),
Mean Squared Error (MSE), and R-squared score on the testing data. Our results
indicate that the Gradient Boosting Regression model outperformed other models with
the lowest MAE and MSE values and the highest R-squared score. This suggests that
Gradient Boosting Regression is well-suited for predicting medical insurance costs
based on the given features. Furthermore, we observed that age, BMI, smoking status,
and region were significant predictors of medical insurance costs based on feature
importance analysis conducted within the Gradient Boosting Regression model. In
conclusion, our machine learning model demonstrates promising performance in
predicting medical insurance costs based on individual characteristics. However, further
refinement and optimization of the model could potentially improve its accuracy and
generalization capabilities. Additionally, incorporating additional features or exploring
different algorithms may provide further insights and enhancements to the predictive
accuracy of the model.
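A comparison of the kind summarized above could be organized as in the following sketch. It is an illustration under stated assumptions, not a reproduction of the project's experiment: the data is synthetic, the smoker column is coded 1 = smoker, and the model settings are arbitrary defaults.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption), with an invented linear target.
rng = np.random.default_rng(7)
n_rows = 400
X = pd.DataFrame({
    "age": rng.integers(18, 65, n_rows),
    "bmi": rng.normal(30, 6, n_rows),
    "children": rng.integers(0, 5, n_rows),
    "smoker": rng.integers(0, 2, n_rows),  # 1 = smoker in this sketch
})
Y = (250 * X["age"] + 300 * X["bmi"] + 20000 * X["smoker"]
     + rng.normal(0, 2000, n_rows))
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=2
)

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "boosting": GradientBoostingRegressor(random_state=0),
}

# Fit every model on the same split and score it on the same held-out rows.
results = {}
for name, model in models.items():
    model.fit(X_train, Y_train)
    pred = model.predict(X_test)
    results[name] = {"mae": mean_absolute_error(Y_test, pred),
                     "r2": r2_score(Y_test, pred)}

report = pd.DataFrame(results).T.sort_values("mae")
print(report)

# Impurity-based importances of the fitted gradient boosting model.
importances = pd.Series(models["boosting"].feature_importances_,
                        index=X.columns).sort_values(ascending=False)
print(importances)
```

Because the synthetic target is linear by construction, the ranking printed here says nothing about which model performs best on the real data; the point is only the shape of the comparison loop and how feature importances are read from the fitted model.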
CHAPTER 8
FUTURE WORK
Predicting medical insurance costs is a valuable application of machine learning and data
analysis techniques. Here are some future enhancements and considerations for
improving such a project:
1. Feature Engineering: Continuously refine and expand the set of features used for
prediction. This could involve incorporating additional demographic data, lifestyle
factors, genetic information, and more detailed medical history.
2. Utilize Advanced Machine Learning Models: Experiment with more advanced
machine learning models such as ensemble methods (e.g., Random Forest, Gradient
Boosting), deep learning techniques (e.g., neural networks), and probabilistic models
(e.g., Bayesian methods) to improve prediction accuracy.
3. Time-Series Analysis: Incorporate time-series analysis techniques to account for
temporal trends in medical costs, such as inflation rates in healthcare, changes in
treatment protocols, or shifts in population demographics.
4. Regional Analysis: Develop models tailored to specific geographical regions or
healthcare systems, as healthcare costs and insurance dynamics can vary significantly
between regions.
5. Dynamic Pricing Models: Explore dynamic pricing models that adjust insurance
premiums in real-time based on changing risk factors and individual health behaviors.
This could involve incorporating IoT devices, wearable technology, and health
monitoring data into the analysis.
6. Interpretability and Explainability: Enhance model interpretability and
explainability to provide insights into the factors driving insurance costs. This could
involve using techniques such as SHAP (SHapley Additive exPlanations) values or
LIME (Local Interpretable Model-agnostic Explanations).
7. Risk Prediction and Prevention: Shift focus from cost prediction to risk prediction
and prevention by identifying individuals at high risk of certain medical conditions and
intervening early with preventive measures or lifestyle interventions.
8. Personalized Medicine: Incorporate personalized medicine approaches by
considering genetic predispositions, biomarkers, and individual responses to treatment
when predicting healthcare costs.
9. Data Privacy and Security: Ensure robust data privacy and security measures are in
place to protect sensitive healthcare data, especially as more personal health information
is utilized for prediction.
10. Collaboration with Healthcare Providers: Collaborate with healthcare providers to
access electronic health records (EHRs) and clinical data, enabling more comprehensive
and accurate predictions.
By implementing these enhancements, a medical insurance cost prediction project can
become more accurate, adaptable, and capable of providing valuable insights for both
insurers and healthcare providers.
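As a starting point for the interpretability item above, the sketch below uses scikit-learn's permutation importance. This is a swapped-in technique, not SHAP or LIME, which are separate packages: it measures how much the test score drops when one feature's values are shuffled. The data is synthetic, and the noise column is added on purpose as a feature with no relation to the target; both are assumptions of the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), including one deliberately useless feature.
rng = np.random.default_rng(11)
n_rows = 300
X = pd.DataFrame({
    "age": rng.integers(18, 65, n_rows),
    "bmi": rng.normal(30, 6, n_rows),
    "smoker": rng.integers(0, 2, n_rows),  # 1 = smoker in this sketch
    "noise": rng.normal(0, 1, n_rows),     # unrelated to the target by construction
})
Y = (250 * X["age"] + 300 * X["bmi"] + 20000 * X["smoker"]
     + rng.normal(0, 2000, n_rows))
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=2
)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, Y_train)

# Shuffle each feature 10 times on the held-out set and average the score drop.
result = permutation_importance(model, X_test, Y_test,
                                n_repeats=10, random_state=0)
ranking = pd.Series(result.importances_mean,
                    index=X.columns).sort_values(ascending=False)
print(ranking)
```

Permutation importance works with any fitted estimator and is computed on held-out data, so it reflects what the model actually relies on when generalizing rather than how often a feature was used during training.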
REFERENCES
1. https://www.youtube.com/watch?v=ntBa7YKc9XM
2. https://ieeexplore.ieee.org/document/9824201
3. https://www.sciencedirect.com/science/article/pii/S2666827023000695
4. https://www.geeksforgeeks.org/medical-insurance-price-prediction-using-machine-learning-python/
5. https://www.researchgate.net/publication/374553777_Medical_Insurance_Cost_Prediction_Using_Machine_Learning
6. https://github.com/adiag321/Medical-Insurance-Cost-Prediction