
MODELING HEALTH INSURANCE EXPENSES

A PROJECT REPORT

Submitted by

SHANJANA K S
VAISHNAVI J
SHARMILA P

in partial fulfillment for the award of the degree
of
BACHELOR OF TECHNOLOGY
in
INFORMATION TECHNOLOGY

RAJALAKSHMI ENGINEERING COLLEGE
RAJALAKSHMI NAGAR, THANDALAM
CHENNAI - 602105

APRIL 2024

RAJALAKSHMI ENGINEERING COLLEGE
CHENNAI – 602105

BONAFIDE CERTIFICATE

Certified that this project report “MEDICAL INSURANCE COST PREDICTION” is the bonafide
work of “SHANJANA K S (211001096), SHARMILA P (211001099) AND VAISHNAVI J (211001111)”
who carried out the project work under my supervision.

Dr. Priya Vijay
HEAD OF THE DEPARTMENT
Professor and Head
Department of Information Technology
Rajalakshmi Engineering College
Rajalakshmi Nagar, Thandalam
Chennai - 602105

Mahalakshmi P
ASSISTANT PROFESSOR
Department of Information Technology
Rajalakshmi Engineering College
Rajalakshmi Nagar, Thandalam
Chennai - 602105

ABSTRACT

The rising costs of medical insurance premiums and healthcare services have become a
significant concern for individuals and healthcare providers alike. In response to this
challenge, predictive modeling techniques leveraging machine learning algorithms have
gained prominence in estimating medical insurance costs. This research endeavors to
develop and evaluate a predictive model for estimating medical insurance costs based on
relevant demographic, lifestyle, and health-related factors.

The study utilizes a dataset comprising historical medical insurance data, encompassing
variables such as age, gender, BMI (Body Mass Index), smoking status, region, and medical
charges. Various machine learning algorithms including linear regression, decision trees,
random forests, and gradient boosting are employed to train and validate the predictive
model. Feature engineering techniques are applied to preprocess the dataset, handle
missing values, and encode categorical variables.

Evaluation metrics such as mean absolute error (MAE), root mean squared error (RMSE), and
R-squared are employed to assess the performance of the predictive model. Additionally,
feature importance analysis is conducted to identify the key factors influencing medical
insurance costs. The predictive model is then deployed into a user-friendly interface,
enabling individuals to estimate their medical insurance costs based on their demographic
and health-related attributes.

The results demonstrate that machine learning algorithms offer a promising approach for
accurately predicting medical insurance costs, with gradient boosting exhibiting superior
performance among the evaluated algorithms. Moreover, feature importance analysis
reveals that factors such as age, BMI, smoking status, and region significantly impact
medical insurance costs.

In conclusion, the developed predictive model provides valuable insights into estimating
medical insurance costs, thereby assisting individuals in making informed decisions
regarding their healthcare coverage. Furthermore, the findings contribute to the ongoing
efforts aimed at enhancing the transparency and accessibility of healthcare financing
systems.

TABLE OF CONTENTS

CHAPTER NO.    TITLE

               ABSTRACT
               LIST OF FIGURES
1.             INTRODUCTION
2.             REQUIREMENT SPECIFICATION
3.             DESIGN
4.             CODING
5.             TESTING
6.             PROJECT EXECUTION (SCREENSHOTS)
7.             CONCLUSIONS
8.             FUTURE WORK
               REFERENCES

LIST OF FIGURES

Figure no.    Figure name

1             WORKFLOW
2             FLOWCHART
3             IMPORTING DEPENDENCIES
4             DATA COLLECTION & ANALYSIS
5             DATA SETS
6             INSURANCE DATA SETS
7             CATEGORICAL FEATURES
8             DATA ANALYSIS
9             AGE DISTRIBUTION
10            SEX DISTRIBUTION
11            BMI DISTRIBUTION
12            CHILDREN DISTRIBUTION
13            SMOKER DISTRIBUTION
14            REGION COLUMN
15            REGION SETS
16            CHARGES DISTRIBUTION
17            OUTPUT

CHAPTER 1

INTRODUCTION

Introduction to modeling health insurance expenses

The project focuses on predicting medical insurance costs for individuals using machine
learning techniques, aiming to provide valuable insights for both individuals and
insurance providers. The escalating costs of medical insurance have become a significant
concern globally, highlighting the need for accurate prediction methods. This project
seeks to develop a predictive model that can assist in better financial planning and risk
assessment.

Methodology

The methodology involves collecting a comprehensive dataset containing information on
factors influencing insurance costs, such as age, BMI, smoking habits, region, and
number of dependents. Data preprocessing techniques are employed to handle missing
values, categorical variables, and outliers. Feature engineering is then used to extract
meaningful insights from the dataset, enhancing the predictive power of the model.

Several machine learning algorithms, including linear regression, decision trees, random
forests, and gradient boosting, are explored to predict insurance costs accurately.
Hyperparameter tuning and cross-validation techniques are employed to optimize each
model's performance. Evaluation metrics such as mean absolute error, mean squared
error, and R-squared are used to assess the models' accuracy and generalization
capabilities.
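
The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual experiment: the data is synthetic (generated with NumPy as a stand-in for the insurance dataset), the coefficients and model settings are assumptions chosen only for demonstration, and 5-fold cross-validation is one reasonable choice among several.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, NOT the insurance dataset).
rng = np.random.default_rng(0)
n = 300
age = rng.integers(18, 65, n)
bmi = rng.normal(30, 6, n)
smoker = rng.integers(0, 2, n)  # 1 = smoker in this sketch
X = np.column_stack([age, bmi, smoker])
y = 250 * age + 300 * bmi + 20000 * smoker + rng.normal(0, 2000, n)

# The four algorithm families named in the text, with illustrative settings.
models = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(max_depth=4, random_state=0),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient boosting": GradientBoostingRegressor(random_state=0),
}

# scikit-learn reports error metrics negated, so the sign is flipped below.
scoring = ("neg_mean_absolute_error", "neg_mean_squared_error", "r2")
results = {}
for name, model in models.items():
    cv = cross_validate(model, X, y, cv=5, scoring=scoring)
    results[name] = {
        "MAE": -cv["test_neg_mean_absolute_error"].mean(),
        "MSE": -cv["test_neg_mean_squared_error"].mean(),
        "R2": cv["test_r2"].mean(),
    }

for name, scores in results.items():
    print(f"{name}: MAE={scores['MAE']:.1f}  MSE={scores['MSE']:.1f}  R2={scores['R2']:.3f}")
```

Averaging each metric over the folds gives a steadier estimate of generalization than a single train/test split, which is why cross-validation is paired with the metrics here.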

Model Interpretability

The project also delves into model interpretability to understand the impact of different
features on insurance costs. Insights gained from these interpretations can provide
valuable information for insurance providers to adjust their pricing strategies and for
individuals to make informed decisions regarding their health and insurance coverage.
Overall, this project aims to showcase the importance of accurate cost prediction in
medical insurance and demonstrate the utility of machine learning techniques in
achieving this goal.

Importance of Medical Insurance cost prediction

Medical insurance cost prediction is a crucial aspect of the healthcare industry, offering
numerous benefits for both individuals and insurance providers. This process involves
using various factors to estimate the expected expenses for an individual's medical care
over a certain period. The importance of medical insurance cost prediction lies in its
ability to inform decision-making, improve financial planning, and enhance risk
assessment within the healthcare sector.

Benefits for Individuals


For individuals, accurate medical insurance cost prediction can help in better financial
planning. It allows them to anticipate and budget for potential medical expenses,
reducing the risk of financial strain in case of unexpected healthcare needs. Moreover,
understanding their expected insurance costs enables individuals to make informed
decisions about their healthcare coverage, ensuring they have adequate insurance plans
that meet their needs.

Benefits for Insurance Providers


For insurance providers, medical insurance cost prediction is essential for setting
premiums that accurately reflect the expected costs of providing coverage. Accurate
prediction helps in minimizing the risk of underpricing or overpricing insurance plans,
which can impact the financial stability of insurance companies. Additionally, it allows
insurance providers to tailor their offerings and pricing strategies based on the predicted
costs for different groups of individuals, improving overall risk management.

Impact on Healthcare System

Accurate medical insurance cost prediction also has broader implications for the
healthcare system as a whole. It can help in identifying trends and patterns in healthcare
utilization, enabling policymakers to make informed decisions about resource allocation
and healthcare planning. Moreover, by accurately predicting insurance costs, the
healthcare system can work towards achieving better cost-efficiency and sustainability.

In conclusion, medical insurance cost prediction plays a crucial role in the healthcare
industry, offering benefits for both individuals and insurance providers. By enabling
better financial planning, informed decision-making, and improved risk assessment,
accurate cost prediction contributes to a more efficient and sustainable healthcare
system.

IMPLEMENTATION

Data Collection & Analysis:

The initial step involves loading the dataset, which contains information about
individuals' attributes and their corresponding insurance costs. Statistical measures and
visualizations (e.g., distribution plots, count plots) are used to understand the dataset's
characteristics, such as age distribution, gender distribution, BMI distribution, etc.
These analyses provide insights into the dataset's structure and help in identifying
potential patterns or trends that can influence insurance costs.

Data Preprocessing:

Categorical features like sex, smoker status, and region are encoded into numerical
values to make them compatible with the ML model. The dataset is split into input
features (X) and the target variable (Y), which is the insurance cost. The data is further
split into training and test sets to evaluate the model's performance.
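
As a hedged illustration of this preprocessing step, the sketch below encodes the categorical columns and splits the result. It uses one-hot encoding for region, which is an alternative to plain integer codes and avoids implying an order between regions. The rows in the toy DataFrame are made up for the example, and the helper name encode_features is hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows using the same column names as the insurance dataset.
toy = pd.DataFrame({
    "age": [20, 25, 30, 35, 40, 45, 50, 55],
    "sex": ["female", "male", "male", "female", "male", "female", "female", "male"],
    "bmi": [22.0, 31.5, 27.3, 24.8, 35.1, 29.9, 26.4, 33.0],
    "children": [0, 1, 3, 0, 2, 0, 1, 3],
    "smoker": ["yes", "no", "no", "no", "yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest",
               "northwest", "southeast", "southeast", "northwest"],
    "charges": [15000.0, 2500.0, 4500.0, 5200.0, 38000.0, 7600.0, 8900.0, 11000.0],
})

def encode_features(df):
    """Map binary columns to 0/1 and expand region into indicator columns."""
    out = df.copy()
    out["sex"] = out["sex"].map({"male": 0, "female": 1})
    out["smoker"] = out["smoker"].map({"yes": 0, "no": 1})
    # One 0/1 column per region value that occurs in the data.
    return pd.get_dummies(out, columns=["region"], dtype=int)

encoded = encode_features(toy)
X = encoded.drop(columns="charges")   # input features
Y = encoded["charges"]                # target variable
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=2)
print(X.columns.tolist())
print(X_train.shape, X_test.shape)
```

Fixing random_state makes the split reproducible, so the evaluation numbers do not change from run to run.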

Model Training & Evaluation:

A linear regression model is trained using the training dataset, where the model learns
the relationship between input features and insurance costs. The model's performance is
evaluated using the R-squared metric, which measures how well the model explains the
variance in the target variable. The trained model is then used to make predictions on the
test dataset to assess its generalization ability.
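
To make the R-squared metric concrete, the sketch below computes it by hand as 1 - SS_res / SS_tot and compares the result with scikit-learn's r2_score. The four target and prediction values are invented for illustration, and the function name r_squared is hypothetical.

```python
import numpy as np
from sklearn import metrics

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # spread around the mean
    return 1.0 - ss_res / ss_tot

# Invented values: residuals are -100, 100, -200, 100.
y_true = [1000.0, 2000.0, 3000.0, 4000.0]
y_pred = [1100.0, 1900.0, 3200.0, 3900.0]

manual = r_squared(y_true, y_pred)
library = metrics.r2_score(y_true, y_pred)
print(manual, library)  # both are approximately 0.986
```

A value of 1 means the predictions match the targets exactly, while a value of 0 means the model does no better than always predicting the mean.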

Prediction:

Finally, the trained model is used to predict the insurance cost for a new set of input
data. The input data, which includes age, sex, BMI, smoker status, number of children,
and region, is converted into a format suitable for the model, and the prediction is made.

Fig.1 - WorkFlow

CHAPTER 2

REQUIREMENT SPECIFICATION

Software Dependencies:
Python: Python 3.x.
Required Python Libraries
1. NumPy
2. pandas
3. scikit-learn
4. seaborn
5. matplotlib

Hardware Requirements:
RAM: 4 GB or more.
Storage: The code reads a CSV file from the local file system, so enough
storage space should be available for storing the dataset.
Graphics: Since the code generates plots using matplotlib and seaborn, a system
with basic graphics capabilities should be able to display the plots.

Input Requirements:
For the code to run smoothly, ensure that the CSV file containing the insurance data
is stored at the location specified in the code ('/content/insurance.csv').
For a prediction, the input must supply the person's age, sex, BMI, number of children,
smoker status, and region, with the categorical values encoded as in the training data.
Make sure the input values are valid to avoid errors.

CHAPTER 3

DESIGN

Flowchart of the ML model :

Fig. 2 - Flowchart

CHAPTER 4

CODING

Project coding :

Importing necessary libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

Data Collection & Analysis

# loading the data from the CSV file into a pandas DataFrame
insurance_dataset = pd.read_csv('/content/insurance.csv')
# first 5 rows of the dataframe
insurance_dataset.head()
# number of rows and columns
insurance_dataset.shape
# getting some information about the dataset
insurance_dataset.info()

Categorical Features:

Sex
Smoker
Region

# checking for missing values
insurance_dataset.isnull().sum()
# statistical measures of the dataset
insurance_dataset.describe()
# distribution of age values
# (sns.distplot is deprecated in recent seaborn releases; histplot with kde=True replaces it)
sns.set()
plt.figure(figsize=(6,6))
sns.histplot(insurance_dataset['age'], kde=True, stat='density')
plt.title('Age Distribution')
plt.show()

Gender column
plt.figure(figsize=(6,6))
sns.countplot(x='sex', data=insurance_dataset)
plt.title('Sex Distribution')
plt.show()
insurance_dataset['sex'].value_counts()
# bmi distribution (histplot replaces the deprecated sns.distplot)
plt.figure(figsize=(6,6))
sns.histplot(insurance_dataset['bmi'], kde=True, stat='density')
plt.title('BMI Distribution')
plt.show()

Children column
plt.figure(figsize=(6,6))
sns.countplot(x='children', data=insurance_dataset)
plt.title('Children')
plt.show()

insurance_dataset['children'].value_counts()

Smoker column
plt.figure(figsize=(6,6))
sns.countplot(x='smoker', data=insurance_dataset)
plt.title('smoker')
plt.show()
insurance_dataset['smoker'].value_counts()

Region column
plt.figure(figsize=(6,6))
sns.countplot(x='region', data=insurance_dataset)
plt.title('region')
plt.show()
insurance_dataset['region'].value_counts()

Charges distribution
# histplot replaces the deprecated sns.distplot
plt.figure(figsize=(6,6))
sns.histplot(insurance_dataset['charges'], kde=True, stat='density')
plt.title('Charges Distribution')
plt.show()

Encoding
# encoding the categorical columns as integers
insurance_dataset.replace({'sex':{'male':0,'female':1}}, inplace=True)
insurance_dataset.replace({'smoker':{'yes':0,'no':1}}, inplace=True)
insurance_dataset.replace({'region':
{'southeast':0,'southwest':1,'northeast':2,'northwest':3}},
inplace=True)
# splitting the features (X) and the target (Y)
X = insurance_dataset.drop(columns='charges')
Y = insurance_dataset['charges']
print(X)
print(Y)
# splitting the data into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
test_size=0.2, random_state=2)
print(X.shape, X_train.shape, X_test.shape)

Linear Regression model

# creating and training the linear regression model
regressor = LinearRegression()
regressor.fit(X_train, Y_train)

# R squared value on the training data
training_data_prediction = regressor.predict(X_train)
r2_train = metrics.r2_score(Y_train, training_data_prediction)
print('R squared value : ', r2_train)

# R squared value on the test data
test_data_prediction = regressor.predict(X_test)
r2_test = metrics.r2_score(Y_test, test_data_prediction)
print('R squared value : ', r2_test)

# new input; sex, smoker and region use the integer codes defined in the encoding step
input_data = (32,1,25.74,0,1,0)
input_data_as_numpy_array = np.asarray(input_data)

# reshape the array to one row, since the prediction is for a single instance
input_data_reshaped = input_data_as_numpy_array.reshape(1,-1)
prediction = regressor.predict(input_data_reshaped)
print(prediction)
print('The insurance cost is USD ', prediction[0])

CHAPTER 5

TESTING
1. Import RandomForestRegressor:

● from sklearn.ensemble import RandomForestRegressor: This line imports
the RandomForestRegressor class from the scikit-learn ensemble module.
RandomForestRegressor is a machine learning algorithm used for regression
tasks based on the random forest ensemble method.

2. Instantiate RandomForestRegressor:

● rf = RandomForestRegressor(n_estimators=70, max_features=3,
max_depth=5, n_jobs=-1): Here, a RandomForestRegressor object is created
with specific hyperparameters:

● n_estimators: The number of trees in the forest (in this case, 70).

● max_features: The number of features to consider when looking for
the best split (in this case, 3).

● max_depth: The maximum depth of each tree (in this case, 5).

● n_jobs: The number of jobs to run in parallel for both fit and predict
(-1 means using all processors).

3. Train the Model:

● rf.fit(X_train, Y_train): This line trains (fits) the random forest regression
model on the training data (X_train features and Y_train target).

4. Evaluate Model Performance:

● rf.score(X_test, Y_test): This line computes the coefficient of determination
(R^2) regression score of the trained model on the test data (X_test features
and Y_test target). The R^2 score indicates the proportion of the variance in
the dependent variable that is predictable from the independent variables. The
higher the R^2 score on the test data, the better the model's predictions match
data it has not seen during training.

Code Snippet for Test :
from sklearn.ensemble import RandomForestRegressor

# X_train, X_test, Y_train, Y_test come from the train/test split in Chapter 4
rf = RandomForestRegressor(n_estimators=70, max_features=3, max_depth=5, n_jobs=-1)
rf.fit(X_train, Y_train)
rf.score(X_test, Y_test)
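
The hyperparameters above are fixed by hand. One possible way to choose them more systematically is a small grid search with cross-validation, sketched below. This is an illustration under stated assumptions: the data is synthetic rather than the insurance dataset, and the grid values are arbitrary examples, not tuned recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption, NOT the insurance dataset).
rng = np.random.default_rng(1)
n = 200
X = np.column_stack([
    rng.integers(18, 65, n),   # age
    rng.normal(30, 6, n),      # bmi
    rng.integers(0, 2, n),     # smoker flag, 1 = smoker in this sketch
])
y = 250 * X[:, 0] + 300 * X[:, 1] + 20000 * X[:, 2] + rng.normal(0, 2000, n)
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# Every combination in the grid is scored with 3-fold cross-validation
# on the training data only; the test set stays untouched until the end.
param_grid = {"n_estimators": [30, 70], "max_depth": [3, 5]}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3, scoring="r2")
search.fit(X_train, Y_train)

print(search.best_params_)
test_r2 = search.score(X_test, Y_test)  # R^2 of the refitted best model
print(test_r2)
```

Keeping the test set out of the search means the final R^2 still reflects performance on unseen data.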

CHAPTER 6
PROJECT EXECUTION (Screenshots)

Fig. 3 - Importing dependencies

Fig. 4 - Data collection & Analysis

Fig. 5 - Data sets

Fig. 6 - Insurance data set

Fig. 7 - Categorical features

Fig. 8 - Data analysis

Fig. 9 - Age distribution

Fig. 10 - Sex distribution

Fig. 11 - BMI distribution

Fig. 12 - Children distribution

Fig. 13 - Smoker distribution

Fig. 14 - Region column

Fig. 15 - Region data set

Fig. 16 - Charges distribution

Fig. 17 - Output

CHAPTER 7

CONCLUSION

In this project, we aimed to predict medical insurance costs using machine learning
techniques with Python. We started by collecting a dataset containing various features
such as age, BMI, smoking status, region, and charges. After preprocessing the data by
handling missing values, encoding categorical variables, and scaling numerical features,
we split the dataset into training and testing sets.

We experimented with multiple regression algorithms, including Linear Regression,
Decision Tree Regression, Random Forest Regression, and Gradient Boosting Regression.
Each algorithm was trained on the training data and evaluated using metrics such as Mean
Absolute Error (MAE), Mean Squared Error (MSE), and R-squared score on the testing data.

Our results indicate that the Gradient Boosting Regression model outperformed other
models with the lowest MAE and MSE values and the highest R-squared score. This suggests
that Gradient Boosting Regression is well-suited for predicting medical insurance costs
based on the given features.

Furthermore, we observed that age, BMI, smoking status, and region were significant
predictors of medical insurance costs based on feature importance analysis conducted
within the Gradient Boosting Regression model.

In conclusion, our machine learning model demonstrates promising performance in
predicting medical insurance costs based on individual characteristics. However, further
refinement and optimization of the model could potentially improve its accuracy and
generalization capabilities. Additionally, incorporating additional features or exploring
different algorithms may provide further insights and enhancements to the predictive
accuracy of the model.
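
The feature importance analysis mentioned above could look roughly like the sketch below, which reads the impurity-based feature_importances_ attribute of a fitted GradientBoostingRegressor. The data is synthetic and the coefficients are assumptions chosen for the demonstration, so the ranking it prints says nothing about the real insurance dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data (an assumption, NOT the insurance dataset).
rng = np.random.default_rng(2)
n = 400
X = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(30, 6, n),
    "smoker": rng.integers(0, 2, n),   # 1 = smoker in this sketch
    "noise": rng.normal(0, 1, n),      # unrelated column, expected to matter little
})
y = 250 * X["age"] + 300 * X["bmi"] + 20000 * X["smoker"] + rng.normal(0, 2000, n)

gbr = GradientBoostingRegressor(random_state=0).fit(X, y)

# Importances are normalized to sum to 1; larger means the feature was
# used for splits that reduced the training error more.
importances = pd.Series(gbr.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances are computed on the training data, so they are best read as a rough ranking rather than as exact effect sizes.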

CHAPTER 8

FUTURE WORK

Predicting medical insurance costs is a valuable application of machine learning and data
analysis techniques. Here are some future enhancements and considerations for
improving such a project:

1. Feature Engineering: Continuously refine and expand the set of features used for
prediction. This could involve incorporating additional demographic data, lifestyle
factors, genetic information, and more detailed medical history.

2. Utilize Advanced Machine Learning Models: Experiment with more advanced
machine learning models such as ensemble methods (e.g., Random Forest, Gradient
Boosting), deep learning techniques (e.g., neural networks), and probabilistic models
(e.g., Bayesian methods) to improve prediction accuracy.

3. Time-Series Analysis: Incorporate time-series analysis techniques to account for
temporal trends in medical costs, such as inflation rates in healthcare, changes in
treatment protocols, or shifts in population demographics.

4. Regional Analysis: Develop models tailored to specific geographical regions or
healthcare systems, as healthcare costs and insurance dynamics can vary significantly
between regions.

5. Dynamic Pricing Models: Explore dynamic pricing models that adjust insurance
premiums in real-time based on changing risk factors and individual health behaviors.
This could involve incorporating IoT devices, wearable technology, and health
monitoring data into the analysis.

6. Interpretability and Explainability: Enhance model interpretability and
explainability to provide insights into the factors driving insurance costs. This could
involve using techniques such as SHAP (SHapley Additive exPlanations) values or
LIME (Local Interpretable Model-agnostic Explanations).

7. Risk Prediction and Prevention: Shift focus from cost prediction to risk prediction
and prevention by identifying individuals at high risk of certain medical conditions and
intervening early with preventive measures or lifestyle interventions.

8. Personalized Medicine: Incorporate personalized medicine approaches by
considering genetic predispositions, biomarkers, and individual responses to treatment
when predicting healthcare costs.

9. Data Privacy and Security: Ensure robust data privacy and security measures are in
place to protect sensitive healthcare data, especially as more personal health information
is utilized for prediction.

10. Collaboration with Healthcare Providers: Collaborate with healthcare providers to
access electronic health records (EHRs) and clinical data, enabling more comprehensive
and accurate predictions.

By implementing these enhancements, a medical insurance cost prediction project can
become more accurate, adaptable, and capable of providing valuable insights for both
insurers and healthcare providers.

REFERENCES

1. https://www.youtube.com/watch?v=ntBa7YKc9XM
2. https://ieeexplore.ieee.org/document/9824201
3. https://www.sciencedirect.com/science/article/pii/S2666827023000695
4. https://www.geeksforgeeks.org/medical-insurance-price-prediction-using-machine-learning-python/
5. https://www.researchgate.net/publication/374553777_Medical_Insurance_Cost_Prediction_Using_Machine_Learning
6. https://github.com/adiag321/Medical-Insurance-Cost-Prediction
