Internship Document - 1
CERTIFICATE
This is to certify that the mini project on “Health Insurance Price Prediction: Machine
Learning Regression for Predicting Health Insurance Prices” is a bonafide work done
by “K. Chaitanya Abhishikth (21471A4326), K. Ramya Sri Abhitha (21471A4327), P.
Anusha (21471A4344), P. Manohar Naga Jayanth Sai (21471A4346), S. Vamakeswari
(21471A4356)” in partial fulfillment of the requirements for the award of the degree of
Bachelor of Technology in the department of CSE (ARTIFICIAL INTELLIGENCE) of
NARASARAOPETA ENGINEERING COLLEGE, NARASARAOPET, during the
academic year 2024-25.
We express our deep-felt gratitude to Dr. B. Jhansi Vazram B. Tech., M. Tech., Ph.D.,
Professor & Head of the Department of CSE (AI) and also to Project Coordinator
Dr. Shaik Mohammed Jany M. Tech., Ph.D., Asst. Prof., Department of CSE (AI), whose
unstinting encouragement enabled us to accomplish our project successfully and in time.
We extend our sincere thanks to all other teaching and non-teaching faculty of the
department for their cooperation and encouragement during our B. Tech degree. We have
no words to acknowledge the warm affection, constant inspiration, and encouragement
that we received from our parents.
We affectionately acknowledge the encouragement received from our friends and all those
who were involved in giving valuable suggestions and clarifying our doubts, which really
helped us in successfully completing our project.
By
K. Chaitanya Abhishikth 21471A4326
K. Ramya Sri Abhitha 21471A4327
P. Anusha 21471A4344
P. Manohar Naga Jayanth Sai 21471A4346
S. Vamakeswari 21471A4356
ABSTRACT
Predicting medical insurance premiums has become increasingly significant due to the rising
cost of healthcare services. Insurance companies aim to provide accurate and fair premium
estimations, which require the consideration of multiple individual factors. This project
addresses the need for a robust predictive model that considers critical variables like age, body
mass index (BMI), smoking status, number of children, gender, and region. The goal is to
identify the most influential predictors and develop a reliable system that can accurately
forecast medical insurance prices.
The project utilizes machine learning, specifically regression models, to perform the prediction
task. Among the different approaches explored, Ridge Regression combined with polynomial
feature transformation has demonstrated significant improvement in performance. Ridge
Regression helps in reducing the effect of multicollinearity and prevents overfitting by
applying regularization. Polynomial features capture non-linear relationships, thus increasing
the model's expressiveness.
The dataset used for this project is sourced from Kaggle, containing real-world anonymized
data. The data undergoes extensive preprocessing, including handling missing values, encoding
categorical variables, and scaling numeric features. Exploratory Data Analysis (EDA) plays a
key role in understanding the distribution and relationships between variables.
Initial models such as simple and multiple linear regression serve as baselines. These are
enhanced by introducing polynomial terms and Ridge regularization. Model evaluation is
conducted using metrics such as the R-squared value and Mean Squared Error (MSE). A
pipeline approach ensures streamlined transformation and model training.
The results indicate a notable improvement in predictive performance when Ridge Regression
is applied. The final model achieves a high R-squared score on the test set, proving its efficacy.
Insights obtained from the model, such as the prominent impact of smoking on premium costs,
are critical for policy formulation and risk assessment in the insurance domain.
This project not only demonstrates the technical application of machine learning in insurance
but also lays a foundation for future enhancements involving more complex models and
broader datasets. It shows how technology can assist in building fairer and more accurate
systems for premium calculation.
TABLE OF CONTENTS
1. INTRODUCTION
The domain of this project is healthcare analytics, specifically focusing on insurance cost
prediction using machine learning. In the current digital age, data plays a critical role in
decision-making across industries, and healthcare is no exception. Insurance companies are
under pressure to offer competitive and personalized pricing, which can only be achieved by
leveraging data-driven methods. Machine learning provides tools and techniques to analyse
patterns in data and make informed predictions.
Machine learning (ML) is a branch of artificial intelligence that enables systems to learn from
historical data and improve their performance over time without being explicitly programmed.
ML can be categorized into supervised, unsupervised, and reinforcement learning. In this
project, supervised learning is used since the output variable (insurance premium) is known
and the task is to predict it based on input features.
Regression is a key technique under supervised learning, where the model estimates a
continuous outcome. Linear regression is one of the simplest and most widely used methods
for this purpose. However, real-world data often exhibit complexities that cannot be captured
by a simple linear relationship. Hence, more advanced techniques like Ridge Regression and
polynomial feature expansion are required to capture these intricacies.
Ridge Regression is a regularized version of linear regression. It introduces a penalty term to
the loss function, which helps in reducing the model's complexity and improves its ability to
generalize to unseen data. This is especially useful when there is multicollinearity among the
features, a common issue in datasets with many variables.
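Concretely, Ridge Regression minimizes the ordinary least-squares loss plus an L2 penalty on the coefficient vector:

Loss(β) = Σ_i (y_i − x_iᵀβ)² + α Σ_j β_j²

Here α is the regularization strength: α = 0 recovers ordinary linear regression, while larger values of α shrink the coefficients toward zero, trading a little bias for lower variance.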
The healthcare domain is particularly suitable for regression modelling due to the continuous
nature of many health indicators. Predictive analytics in healthcare helps improve service
delivery, reduce costs, and provide better patient outcomes. This project demonstrates how ML
techniques can be applied effectively in healthcare insurance prediction, ultimately helping
insurers assess risk and allocate resources more efficiently.
Understanding the domain allows developers and data scientists to choose the right features,
preprocessing techniques, and modelling strategies. It also helps interpret results in a
meaningful way, making the final application more valuable for stakeholders.
1.1 Contribution of the work
This project contributes to the field of healthcare analytics by applying machine learning
techniques to predict insurance costs with improved accuracy and interpretability. By utilizing
real-world data and transforming it through appropriate preprocessing techniques, the project
ensures that models are trained on clean, consistent, and machine-readable datasets. Encoding
categorical variables, handling numerical distributions, and preparing the data using standard
scaling techniques allowed for effective model training and evaluation. These foundational
steps enhanced the reliability of the results and enabled the application of more sophisticated
algorithms beyond basic linear regression.
A significant contribution of the work lies in the use of Ridge Regression, which improves
upon standard linear regression by addressing the problem of multicollinearity. In real-world
healthcare datasets, input variables often exhibit correlations, which can lead to overfitting and
unstable predictions in simple linear models. Ridge Regression introduces regularization by
penalizing large coefficients, thereby controlling model complexity and improving
generalization on unseen data. This approach not only yields better prediction accuracy but
also offers more robust insights into how different factors, such as age, BMI, or smoking status,
impact insurance charges.
The project further adds value by implementing polynomial feature expansion, enabling the
model to capture nonlinear relationships between features and the insurance cost. Health-
related factors rarely affect outcomes in a purely linear fashion—for example, the impact of
BMI or age on insurance premiums might increase exponentially or interact with other
variables. By transforming the feature space, the model becomes more expressive and capable
of identifying these intricate patterns, which is critical for generating actionable insights in a
domain as complex as healthcare.
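As a minimal illustration of this transformation (a sketch using a toy two-feature matrix, not the project dataset), scikit-learn's PolynomialFeatures adds squared and interaction terms:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# toy example: two rows of [age, bmi]
X = np.array([[25, 22.0],
              [52, 31.5]])
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
# expanded columns: age, bmi, age^2, age*bmi, bmi^2
print(poly.get_feature_names_out(['age', 'bmi']))

The interaction column age*bmi captures exactly the kind of joint effect described above, which a purely linear model cannot represent.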
Another key contribution is the domain-specific interpretation and understanding of the results.
Rather than treating this project as a purely technical exercise, the approach integrates
healthcare knowledge to inform feature selection, model design, and result evaluation. This
contextual understanding ensures the insights generated are meaningful and relevant for
stakeholders such as insurance companies, healthcare providers, and policyholders. Overall,
this work demonstrates the effective use of machine learning in a real-world healthcare
scenario, offering a pathway toward more data-driven and personalized insurance pricing.
1.2 Existing System
Traditional insurance pricing systems rely largely on actuarial tables and manually defined
rating rules. Although these methods have been in use for decades and are grounded in
empirical knowledge, they lack adaptability
and fail to leverage the predictive power of modern computational tools.
Many traditional systems treat each variable independently and ignore potential interactions
between features. For example, the combined effect of age and BMI on medical costs may be
more significant than their individual effects. Without capturing such interactions, the
predictions from these models may lack accuracy and granularity.
Another limitation of conventional systems is their inability to handle multicollinearity
effectively. When input features are correlated, standard linear regression models become
unstable, resulting in high variance and unreliable predictions. Furthermore, most existing
approaches do not implement automated hyperparameter tuning or cross-validation, leading to
suboptimal models.
The increasing availability of health-related data, including electronic health records and self-
reported lifestyle information, calls for more advanced analytical methods. Machine learning
models, especially those incorporating regularization and transformation techniques, offer
better performance by adapting to data complexity and uncovering hidden patterns.
Despite the availability of open-source tools and computing resources, many insurance
companies are slow to adopt machine learning due to a lack of understanding and the perceived
complexity of these methods. However, research projects like this demonstrate that
implementing modern regression techniques can significantly improve predictive performance
with relatively modest effort.
By comparing the results of this project with baseline models, it becomes evident that
incorporating polynomial features and Ridge regularization enhances both accuracy and model
stability. This underscores the importance of transitioning from conventional to data-driven
methodologies in the insurance industry.
1.3 Proposed System
The proposed system is a comprehensive machine learning solution built to predict medical
insurance premiums with high accuracy. The system leverages data preprocessing, feature
engineering, polynomial transformation, and Ridge Regression within a unified pipeline. The
main goal is to address the shortcomings of existing systems and provide an advanced, reliable,
and interpretable prediction model.
At the core of the system is a supervised learning pipeline. The first step involves importing
the dataset and performing data cleaning. This includes handling missing values, correcting
data types, and ensuring consistency. Categorical variables such as gender, region, and smoker
status are encoded using one-hot encoding to make them suitable for model training. Numerical
features are standardized using StandardScaler to bring them to a uniform scale.
Once the data is clean and properly formatted, polynomial feature expansion is applied. This
technique generates new features by computing all polynomial combinations of the original
features up to a specified degree. It helps the model learn non-linear relationships, which are
often prevalent in real-world data. For instance, the impact of age on insurance cost might not
increase linearly but exponentially.
To avoid overfitting due to the expanded feature space, Ridge Regression is employed. Ridge
introduces L2 regularization, which penalizes large coefficients and discourages complex
models. The regularization strength, or alpha, is optimized using GridSearchCV, which tests
different values through cross-validation. This ensures the final model is both accurate and
generalizable.
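A minimal sketch of such a unified pipeline (assuming the features have already been numerically encoded, as in the implementation; the step names and alpha grid here are illustrative):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# scale -> expand -> regularized regression, tuned as one object
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('ridge', Ridge())
])
param_grid = {'ridge__alpha': [0.01, 0.1, 1, 5, 10, 100]}
search = GridSearchCV(pipe, param_grid, scoring='r2', cv=10)
# search.fit(x_train, y_train) selects the alpha with the best
# cross-validated R2 and refits the pipeline on the training set

Wrapping all three steps in one Pipeline ensures the scaler and polynomial transformer are fit only on the training folds during cross-validation, avoiding data leakage.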
After training, the model is evaluated using performance metrics such as R2 score and Mean
Squared Error. Visualization tools such as heatmaps, scatter plots, and line graphs are used to
understand feature importance and compare predicted vs. actual values. These visual aids make
the model outputs more interpretable to stakeholders.
The proposed system represents a shift from traditional heuristics to data-driven insights. It not
only improves predictive accuracy but also enhances transparency and interpretability, making
it a valuable tool for insurers aiming to adopt modern technology in risk assessment and
pricing.
2. LITERATURE SURVEY & FEASIBILITY STUDY
2.1 Literature Survey
Insurance companies typically use actuarial methods to determine premiums, relying on
historical trends and risk assessment. However, these traditional methods often do not capture
non-linear relationships or interactions among predictors. As a result, premiums might be
inaccurately assessed, leading either to customer dissatisfaction or financial loss for the insurer.
The presence of multicollinearity, outliers, and data imbalance can further complicate the
modeling process. Variables such as BMI and age may not have a linear relationship with
insurance cost, and categorical features like smoker status can disproportionately affect the
outcome. Therefore, a comprehensive modeling approach is needed to account for these issues.
This project formulates the problem as a regression task, where the dependent variable is the
insurance premium and the independent variables are the customer attributes. The goal is to
minimize the prediction error while ensuring that the model generalizes well to new, unseen
data. A significant component of the problem also includes identifying the most influential
variables that drive insurance costs.
Another key aspect of the problem is interpretability. While building an accurate model is
important, stakeholders such as insurance providers and customers need to understand how
predictions are made. Therefore, techniques like feature importance analysis and visualization
of relationships are essential to provide insights into the model’s decision-making process.
The proposed solution involves developing a machine learning pipeline that includes data
preprocessing, feature engineering, model training, and evaluation. The model of choice is
Ridge Regression enhanced with polynomial features to capture non-linear relationships
between variables. This approach allows for a more flexible and robust modeling process
compared to standard linear regression.
The pipeline starts with thorough data preprocessing. Categorical variables are encoded using
one-hot encoding, and numerical features are standardized. This ensures that the model treats
all input features equally and that the learning process is not biased by differing scales. Missing
values, if any, are handled through imputation, and outliers are examined and treated based on
their impact on the model.
Polynomial features are generated to allow the model to learn interactions between variables.
For example, the interaction between age and BMI might be significant in determining the
premium. However, introducing polynomial terms can increase the risk of overfitting. To
counter this, Ridge Regression applies L2 regularization, which penalizes large coefficients
and encourages simpler models.
To optimize the model, GridSearchCV is used to perform hyperparameter tuning. This ensures
that the regularization strength (alpha) is selected based on cross-validation performance,
thereby improving the model's generalization ability. The model's performance is evaluated
using R-squared and MSE metrics, both on training and test datasets.
The proposed solution also includes visualization techniques to assess feature correlations and
model predictions. Heatmaps, regression plots, and comparison graphs between actual and
predicted values provide a comprehensive view of the model's effectiveness. These visual tools
also help in communicating findings to non-technical stakeholders.
By implementing this solution, the project achieves a balance between model accuracy and
interpretability. It also lays the groundwork for future enhancements, such as integrating more
advanced models or deploying the solution as a software tool for insurance companies.
The proposed system provides several advantages over traditional and existing methods.
Firstly, it delivers significantly improved prediction accuracy. By leveraging polynomial
features and regularization, the model captures complex relationships while minimizing
overfitting. This results in a higher R2 score, indicating that a greater proportion of the variance
in premium prices is explained by the input features.
Another major advantage is the system’s robustness. Regularization techniques like Ridge
make the model less sensitive to noise and multicollinearity. This ensures more stable
predictions even when the data includes correlated or less informative variables. It also
contributes to better performance on test data, which is crucial for real-world applications.
The system is highly interpretable. By analyzing feature importance and visualizing
correlations, users can understand which attributes have the greatest impact on insurance
prices. For example, the model reveals that smoking status significantly influences premiums,
followed by age and BMI. These insights can guide insurance companies in designing targeted
policies and interventions.
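One simple way to inspect such importance (a sketch, assuming a Ridge model fit directly on the six standardized input features, without polynomial expansion) is to rank the magnitudes of the learned coefficients:

import pandas as pd

features = ['age', 'bmi', 'gender', 'no_of_child', 'region', 'smoker']
coefs = pd.Series(ridge_model.coef_, index=features)
# larger absolute coefficients on standardized features suggest stronger influence
print(coefs.reindex(coefs.abs().sort_values(ascending=False).index))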
Flexibility is another strength of the system. It is built using a modular pipeline that can be
easily extended to include new features, models, or datasets. This allows for continuous
improvement and adaptation to changing data environments. The use of open-source tools
ensures that the system is accessible and can be implemented without significant financial
investment.
Additionally, the system supports transparency and accountability. Every stage of the process,
from data preprocessing to model evaluation, is documented and reproducible. This is
particularly important in domains like healthcare and insurance, where ethical considerations
and compliance with regulations are critical.
Overall, the proposed system offers a balance of accuracy, robustness, interpretability, and
flexibility. These advantages make it a practical and powerful tool for modern insurance
pricing, supporting data-driven decision-making and operational efficiency.
2.2 Feasibility Study
Feasibility analysis is a critical component of any project as it determines whether the project
can be successfully implemented from operational, technical, and economic perspectives. This
project has been evaluated on all three fronts to ensure that it meets practical requirements and
resource constraints.
Operational Feasibility focuses on the usability and implementability of the system in a real-
world environment. The system is designed to be user-friendly and requires minimal input from
users after setup. It can be operated by professionals with basic knowledge of data science and
Python, making it accessible to a wide range of organizations. The use of widely adopted tools
like Jupyter Notebook and scikit-learn ensures that operational support is readily available.
Technical Feasibility evaluates the technical resources and expertise required to build and
maintain the system. This project is implemented using Python and leverages libraries such as
pandas, numpy, matplotlib, and sklearn, which are industry standards for data analysis and
machine learning. The required hardware specifications are minimal, with most operations
capable of running on a standard laptop or desktop with at least 4 GB RAM. This ensures that
technical barriers to implementation are low.
Economic Feasibility assesses the cost-effectiveness of the project. Since the tools and datasets
used are open-source, there are no licensing costs. The computational resources required are
modest, making the solution affordable even for small- to mid-sized insurance firms.
Additionally, the insights gained from improved premium prediction can lead to substantial
cost savings and increased profitability, justifying the investment in the system.
In conclusion, the feasibility study confirms that the project is practical, sustainable, and
beneficial. It requires reasonable resources, is built on accessible technology, and offers
substantial return on investment. These factors collectively support the successful deployment
and long-term viability of the system in real-world insurance applications.
3. SOFTWARE REQUIREMENTS
4. SYSTEM ANALYSIS
4.1 Specifications
The specifications of a system form the foundation for its development and implementation.
They outline the functional and non-functional requirements that the system must meet to be
considered successful. Functional requirements describe the specific behaviors and functions
the system should perform, while non-functional requirements refer to the quality attributes of
the system such as performance, usability, reliability, and maintainability.
For the Medical Insurance Price Prediction system, one of the primary functional requirements
is the ability to ingest a dataset in CSV format and preprocess it for model training. This
involves cleaning the data, encoding categorical variables, handling missing values, and
standardizing numerical variables. Another critical requirement is the implementation of
multiple regression models, including linear, polynomial, and Ridge Regression models. The
system should also include modules to evaluate these models using standard metrics such as
R2 score and MSE.
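For reference, the two metrics are defined as

MSE = (1/n) Σ_i (y_i − ŷ_i)²
R² = 1 − Σ_i (y_i − ŷ_i)² / Σ_i (y_i − ȳ)²

where y_i is the actual charge, ŷ_i the predicted charge, and ȳ the mean of the actual charges. An R² close to 1 indicates that the model explains most of the variance in charges.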
The system must allow for visualization of relationships among features using heatmaps and
scatter plots. This supports exploratory data analysis and model interpretation. Additionally,
the model should produce output in a user-readable format, including a comparison between
actual and predicted premium values. These functionalities enhance transparency and usability
of the system for end users such as insurance analysts and data scientists.
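A minimal sketch of such a comparison output (assuming y_test and the predictions yhat from the evaluation step are available):

import pandas as pd

comparison = pd.DataFrame({'actual': y_test, 'predicted': yhat.round(2)})
comparison['error'] = (comparison['actual'] - comparison['predicted']).round(2)
print(comparison.head(10))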
In terms of non-functional requirements, the system should be reliable, ensuring consistent
outputs for the same input under the same conditions. It should also be scalable, allowing for
the integration of new data or models without significant changes to the architecture. The
system must be maintainable, with modular code and clear documentation to support future
updates.
Security is another important consideration, especially if the system is deployed in
environments handling sensitive personal data. Although the current implementation does not
include security features, future enhancements may include access controls and encryption for
data protection. Overall, the system's design and specifications ensure that it meets the practical
needs of its users while remaining flexible and extendable for future development.
4.2 Modules
The system is organized into distinct modules, each responsible for a specific stage of the
machine learning workflow. This modular approach enhances clarity, maintainability, and
extensibility of the system. The key modules include data preprocessing, model development,
evaluation, and visualization.
Data Preprocessing Module: This module is responsible for preparing raw data for analysis.
It includes operations like loading the dataset, handling missing values, encoding categorical
variables using one-hot encoding, and standardizing numerical features. These steps ensure
that the dataset is clean, structured, and suitable for feeding into machine learning models.
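A minimal sketch of these steps (the column names follow the headers used in the implementation; the file is assumed to sit in the working directory, and missing-value handling as in Chapter 6 is assumed to have run first):

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('medical_insurance_dataset.csv',
                 names=['age', 'gender', 'bmi', 'no_of_child',
                        'smoker', 'region', 'charges'])
# one-hot encode the region codes into binary indicator columns
df = pd.get_dummies(df, columns=['region'], prefix='region')
# standardize the continuous features to zero mean and unit variance
scaler = StandardScaler()
df[['age', 'bmi']] = scaler.fit_transform(df[['age', 'bmi']])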
Figure 4.1 shows the dataset before preprocessing, typically used for analyzing health
insurance charges. It includes demographic and health-related attributes such as age, sex, body
mass index (BMI), number of children, smoking status, residential region, and insurance
charges. Visual summaries indicate that ages range from 18 to 64, and BMI values range from
16 to 53.1. The gender distribution is fairly balanced, with 51% male and 49% female. Around
20% of individuals are smokers, and 80% are non-smokers. The dataset also shows the
distribution of individuals across different regions, with the southeast having the highest share
at 28%. The charges column represents the insurance cost incurred by each individual. Before
using this data for machine learning or statistical modeling, preprocessing steps like encoding
categorical variables, scaling numerical features, and handling missing data would be necessary
to prepare it for analysis.
After preprocessing, one-hot encoding replaces the categorical columns with binary indicator
columns that flag whether an individual belongs to
specific regions (like southeast or northwest). Other columns like age, bmi, no_of_child, and
charges remain as numerical values, ready for use in modeling. This transformation ensures
the dataset is in a machine-readable format, allowing algorithms to interpret categorical data
effectively.
A box plot compares the distribution of insurance charges based on smoking status.
Non-smokers tend to have lower charges, with most values falling
below $15,000, though some outliers extend up to around $35,000. In contrast, smokers have
substantially higher charges, with the median around $35,000 and values reaching beyond
$60,000. The interquartile range for smokers is also much wider, indicating greater variability
in charges. This plot clearly illustrates that smoking status is a major factor affecting insurance
costs, which aligns with the correlation analysis showing a strong positive relationship between
smoking and charges. Such a visualization is useful for identifying key cost drivers in health
insurance analytics.
Each module is designed to be independent but integrated within a common pipeline. This
structure allows users to execute the entire workflow sequentially or modify individual
components as needed. The modular design also facilitates future upgrades, such as
incorporating new models or extending the dataset with additional features.
5. SYSTEM DESIGN
5.1 SYSTEM ARCHITECTURE
The system architecture is organized as a sequence of stages: Data Preprocessing cleans and
encodes the raw input, Feature Transformation applies scaling and polynomial expansion, and
Model Training fits regression techniques
like Ridge Regression with GridSearchCV. Finally, Output & Visualization presents prediction
results, performance metrics, and visual insights.
5.2 DATA FLOW DIAGRAMS (DFD)
At the first level of detail, the preprocessing stage passes cleaned data into the polynomial
transformation process, which in turn feeds the model training phase.
Further, a Detailed Level DFD expands on individual subprocesses, such as the model training
block. It breaks down the Ridge Regression training into finer components — setting up the
polynomial transformer, splitting the data, applying cross-validation, training the model, and
generating predictions. These detailed DFDs are instrumental in understanding and debugging
specific sections of the system.
Data stores are also represented in DFDs, showing where intermediate data is held between
processes. Examples include the cleaned dataset after preprocessing and the transformed
dataset post polynomial expansion. These storage blocks play an important role in modularity
and process isolation.
Overall, DFDs make the system transparent and help developers and stakeholders alike to
comprehend how data is being processed. They support better planning, easier debugging, and
smoother future enhancements by making the flow of information within the system explicit.
5.3 UNIFIED MODELING LANGUAGE (UML) DIAGRAMS
Unified Modeling Language (UML) diagrams provide a structured way to visualize a system’s
architecture, its components, and the interactions among them. In this project, UML diagrams
help model the dynamic behavior of the insurance prediction system and make the design
process more methodical and standardized.
The Use Case Diagram outlines the interactions between users (actors) and the system. The
primary actor is the data analyst or insurance officer who interacts with the system to input
data, train models, generate predictions, and visualize results. The diagram identifies key use
cases such as “Upload Dataset,” “Preprocess Data,” “Train Model,” “Tune Parameters,”
“Evaluate Model,” and “Generate Report.” These use cases help map user expectations and
guide the implementation of core functionalities.
The Sequence Diagram provides a timeline-based view of how processes interact over time. It
shows the sequence of method calls or events as the system progresses from data upload to
final output. For instance, the diagram would begin with the user invoking the data loader,
which then triggers preprocessing, followed by feature transformation, model training, and
evaluation. This representation is crucial for understanding system behavior during runtime
and for identifying bottlenecks or inefficiencies.
The Collaboration Diagram focuses on how objects in the system interact with each other to
perform a function. It emphasizes the relationships and message exchanges among components
such as the Preprocessor, Transformer, Regressor, and Visualizer classes. By visualizing object
collaboration, developers can better structure the system into loosely coupled, highly cohesive
modules.
The Activity Diagram captures the workflow of the system. It presents the logical flow of
activities such as data validation, encoding, transformation, and model training in a clear and
sequential manner. Branches and decision nodes are used to model conditional paths, such as
whether to apply standard scaling or polynomial transformation based on user input or model
choice.
These UML diagrams contribute to clearer documentation, easier collaboration among team
members, and a better understanding of both the structure and behavior of the system. They
are essential tools for planning, developing, testing, and maintaining complex machine learning
applications like this one.
6. IMPLEMENTATION
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# for regression model design
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split
url = '/Users/graceluan/Documents/Data Science Job Hunting/Medical Insurance Price Prediction/medical_insurance_dataset.csv'
df = pd.read_csv(url, header=None)
headers = ['age','gender','bmi','no_of_child','smoker','region','charges']
df.columns = headers
# region 1,2,3,4 stands for US region NW, NE, SW, SE respectively
df.dtypes #age, smoker should be integer
# checking nan, empty values
print(df['age'].unique())
print(df['smoker'].unique())
# discovered the question mark
df.replace('?',np.nan, inplace=True) #replace ? with nan, easy to convert to float
# replace ? in age with the mean
age_mean = df['age'].astype('float').mean()
df['age'].replace(np.nan, age_mean, inplace=True)
df['age'] = df['age'].astype('int')
# replace ? in smoker with the mode
smk_mode = df['smoker'].value_counts()
print(smk_mode)  # more 0s: non-smoker is the mode
df['smoker'] = df['smoker'].replace(np.nan, '0').astype('int')
# round charges to two decimal places
df['charges'] = np.round(df['charges'], 2)
df.head(3)
correlation = df.corr()
plt.figure(figsize=(5, 5))
sns.heatmap(
correlation,
xticklabels=correlation.columns.values,
yticklabels=correlation.columns.values,
annot=True,
annot_kws={'size': 8}
)
# Axis ticks size
plt.xticks(fontsize=8)
plt.yticks(fontsize=8)
plt.show()
sns.regplot(x='age',y='charges',data=df, line_kws={'color':'red'})
plt.title('Regression Plot for Price to Age')
plt.show()
sns.regplot(x='bmi',y='charges',data=df, line_kws={'color':'red'})
plt.title('Regression Plot for Price to BMI')
plt.show()
sns.boxplot(x='smoker', y='charges', data=df)
# Explore a simple linear regression with only smoker attribute with price.
X = df[['smoker']]
Y = df['charges']
lm = LinearRegression()
lm.fit(X,Y)
print('R2_score is:',lm.score(X,Y).round(3))
# Use all the attributes to fit a linear regression model.
Z = df[['age','bmi','gender','no_of_child','region','smoker']]
lm.fit(Z,Y)
print('R2_score with all attributes is:',lm.score(Z,Y).round(3))
# Continue to adjust the model with polynomial features and standard scaler. Create a training pipeline.
steps = [
    ('scale', StandardScaler()),
    ('Polynomial', PolynomialFeatures(include_bias=False)),
    ('model', LinearRegression())
]
pipe = Pipeline(steps)
Z = Z.astype('int')
pipe.fit(Z,Y)
ypipe = pipe.predict(Z)
print('R2_score with pipeline adjustment is:', r2_score(Y,ypipe).round(3))
# split the data to test and train set
x_train, x_test, y_train, y_test = train_test_split(Z, Y, test_size=0.2, random_state=1)
# initialize ridge regressor with alpha=0.1 and fit it on the training data
ridge_model = Ridge(alpha=0.1)
ridge_model.fit(x_train,y_train)
yhat = ridge_model.predict(x_test)
print('Ridge model (alpha 0.1) r2 score is:', r2_score(y_test,yhat))
# use grid search to try different alpha
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
param_grid = {'alpha':[0.01,0.1,1,5,10,100]}
scoring = make_scorer(r2_score)
grid_search = GridSearchCV(estimator=ridge_model, param_grid=param_grid, scoring=scoring, cv=10)
grid_search.fit(x_train,y_train)
best_model = grid_search.best_estimator_
yhat = best_model.predict(x_test)
print(f"Test 22 Score: {r2_score(y_test, yhat)}")
# apply polynomial transformation to the training parammeters degree=2
21
pr = PolynomialFeatures(degree=2)
x_train_pr = pr.fit_transform(x_train)
x_test_pr = pr.transform(x_test)  # transform only: the transformer is fit on training data alone
ridge_model.fit(x_train_pr, y_train)
y_hat = ridge_model.predict(x_test_pr)
print(r2_score(y_test, y_hat))
7. TESTING ANALYSIS
Testing is a vital phase in the machine learning development lifecycle, as it ensures the
accuracy, reliability, and robustness of the system. It helps identify issues in the workflow and
validates whether the implemented model is functioning as expected. In this project, testing
was carried out using a combination of black box testing, white box testing, and validation
techniques, each targeting different aspects of the system.
Black box testing focuses on the system's functionality without considering its internal
structure or workings. In this context, black box testing involved feeding the model with
different sets of input data and verifying whether the outputs (predicted insurance premiums)
were within logical and expected ranges. Edge cases, such as extreme BMI values or age
boundaries, were tested to assess how well the model handles uncommon scenarios. The output
predictions were also checked for consistency when run multiple times under the same
conditions.
White box testing, on the other hand, involves understanding the internal logic of the system.
Each function and module was tested independently to ensure correctness. For instance, data
preprocessing functions were validated by comparing expected encoded and scaled outputs
against actual ones. Similarly, the polynomial transformation module was tested by ensuring
that the correct number of additional features was generated for a given polynomial degree.
Model training steps were traced to confirm that Ridge Regression was being applied with the
correct hyperparameters.
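For example, with d input features a degree-2 expansion (excluding the bias column) produces d + d(d+1)/2 output columns; for the six inputs used here that is 6 + 21 = 27 features, a count the tests can verify directly against the transformer's output shape.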
A critical component of testing in this project was model validation, which helps assess the
model's performance on unseen data. The dataset was split into training and testing subsets in
an 80-20 ratio. The training set was used to build the model, while the testing set evaluated its
generalization capability. GridSearchCV was employed to perform cross-validation, which
further splits the training data into multiple folds to find the best value of the regularization
parameter (alpha). This technique helps prevent overfitting by ensuring that the model performs
well across multiple data subsets.
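The per-fold scores behind this procedure can also be inspected directly (a sketch, assuming x_train and y_train from the 80-20 split above):

from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# 10-fold cross-validated R2 scores for one candidate alpha
scores = cross_val_score(Ridge(alpha=1.0), x_train, y_train,
                         scoring='r2', cv=10)
print('per-fold R2:', scores.round(3))
print('mean R2:', scores.mean().round(3))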
The results from validation showed that the model performed consistently across folds, with
the R2 score on the test set reaching approximately 0.783. Mean Squared Error (MSE) values
were also calculated and monitored to assess the model's prediction error. Lower MSE and high
R2 values indicated that the model was not only accurate but also reliable across different test
scenarios. Visual analysis using residual plots confirmed that errors were randomly distributed,
which is a good sign of a well-fitted model.
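A minimal residual-plot sketch of the kind used for this check (assuming y_test and the model predictions yhat are available):

import matplotlib.pyplot as plt

residuals = y_test - yhat
plt.scatter(yhat, residuals, alpha=0.5)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted charges')
plt.ylabel('Residual (actual - predicted)')
plt.title('Residuals vs. Predicted Values')
plt.show()

Residuals scattered evenly around the zero line, with no funnel or curved pattern, support the conclusion that the model is well fitted.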
Finally, the robustness of the system was tested by simulating different operational conditions.
These included providing missing or incorrect input types, testing with large datasets, and
measuring processing time. The system responded gracefully to such inputs, with
preprocessing modules handling errors effectively. The implementation of exception handling
and modular code contributed to the system’s resilience. All these testing strategies together
ensured that the developed system is both functional and dependable in a practical setting.
8. RESULT ANALYSIS
This chapter presents the various output screens and visualizations generated during the development and
evaluation of the insurance price prediction system. These outputs provide both quantitative
and qualitative insights into the performance and behaviour of the system, facilitating better
understanding and communication of results. The outputs also help in interpreting the model's
predictions and evaluating its overall reliability.
The first significant output is the correlation heatmap, which visually represents the
relationships between different variables in the dataset. This heatmap was created using the
seaborn library and indicates how strongly each feature correlates with the insurance charges.
The most striking correlation observed was between the smoker status and premium charges,
with a coefficient close to 0.79, as shown in Figure 8.1, highlighting it as the most influential
predictor. Age and BMI also showed moderate correlations, while other features like region
and gender were less impactful. This heatmap guided the feature engineering process by
helping prioritize important variables.
The next outputs were the model evaluation metrics, where the R2 score showed that the model explains approximately
78% of the variance in the target variable. The MSE value gave a direct sense of average
prediction error. These outputs were instrumental in comparing different models during the
development phase.
Additionally, the GridSearchCV output provided insights into the best-performing model
parameters. Figure 8.2 displays the optimal value for the alpha parameter used in Ridge
Regression. This tuning output ensured that the model avoided overfitting while maintaining
accuracy. The results of GridSearchCV were logged and printed clearly, showing the scores
achieved with different alpha values. This transparency helped in validating that the chosen
model configuration was the best among the tested options.
9. CONCLUSION
The primary objective of this project was to build a predictive model capable of estimating
medical insurance premiums using personal attributes such as age, BMI, smoking status, and
more. By leveraging machine learning techniques, especially regression models, the project
successfully implemented a robust and interpretable system. Ridge Regression, coupled with
polynomial features, emerged as the most effective approach, delivering reliable performance
while minimizing overfitting through regularization and proper hyperparameter tuning.
A major highlight of the project was the identification of key predictors influencing insurance
costs. Through correlation analysis and exploratory data examination, smoking status was
found to have the strongest positive correlation with medical charges, followed by age and
BMI. These insights not only enhanced the model's predictive accuracy but also provided
valuable information for insurance providers seeking to refine their pricing strategies. Such
findings can support initiatives aimed at encouraging healthier lifestyles among policyholders
through incentives or targeted premium adjustments.
The model achieved a strong R² score on the test data, demonstrating its effectiveness in
explaining a significant portion of the variation in medical costs. Incorporating polynomial
features allowed the model to capture non-linear relationships, which are common in health-
related datasets. Additionally, Ridge Regression’s regularization ensured that the model
maintained its generalization ability without being overly influenced by outliers or noise. The
modular and open-source design of the project ensures flexibility and ease of adaptation.
Implemented in a Jupyter Notebook format, it is highly accessible and allows for transparency,
collaboration, and reproducibility. Moreover, the inclusion of meaningful visualizations
throughout the process not only supported data understanding but also improved
communication with stakeholders, fostering trust in the model’s predictions. This structure also
enables further enhancements, such as integrating new features like medical lifestyle habits.
In summary, this project is a solid demonstration of the practical use of machine learning in
the healthcare insurance domain. It exemplifies how technical solutions can be aligned with
real-world applications to support better decision-making. The developed model serves as a
valuable tool not just for insurance companies, but also for healthcare providers and researchers
looking to explore the intersection of data science and health economics. It establishes a strong
foundation for future research and development in predictive analytics for medical insurance.
10. FUTURE ENHANCEMENTS
As technology continues to evolve and more health-related data becomes accessible, there are
numerous opportunities to enhance the medical insurance price prediction system. One major
area for improvement lies in expanding the feature set used for predictions. While the current
model relies on basic attributes like age, BMI, smoking status, and region, incorporating
additional variables such as income level, exercise routines, dietary habits, and chronic disease
history can improve prediction accuracy and create a more holistic view of an individual’s
health profile. This enriched dataset would allow the model to capture deeper patterns and
provide more personalized premium estimates.
Beyond expanding the dataset, exploring more sophisticated machine learning algorithms is
another promising direction. Ridge Regression has shown strong performance in this project,
but advanced models such as Random Forest, Gradient Boosting Machines (GBMs), and
XGBoost are well-suited for handling complex, non-linear relationships in the data. These
ensemble methods improve prediction performance by combining multiple weak learners and
offer tools for identifying the importance of different variables. Fine-tuning such models using
cross-validation can result in more accurate and reliable outputs across diverse data inputs.
For real-world use, deployment and user accessibility are crucial. While the current
implementation is ideal for experimentation in Jupyter Notebooks, transitioning to a web or
mobile application using frameworks like Flask or Streamlit can make the system widely
accessible. This would allow users such as insurance agents, healthcare providers, or customers
to enter personal data and receive instant, personalized insurance estimates. A clean, interactive
interface would not only increase adoption but also improve user experience and trust in the
system.
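As an illustration of this direction, a minimal Streamlit front end might look like the sketch below. The saved pipeline file 'ridge_pipeline.pkl' and the exact input fields are assumptions rather than part of the current implementation, and a real application would expose all of the model's features:

import joblib
import pandas as pd
import streamlit as st

# hypothetical: a fitted preprocessing + Ridge pipeline saved with joblib.dump
model = joblib.load('ridge_pipeline.pkl')

st.title('Medical Insurance Premium Estimator')
age = st.number_input('Age', min_value=18, max_value=64, value=30)
bmi = st.number_input('BMI', min_value=15.0, max_value=55.0, value=25.0)
children = st.number_input('Number of children', min_value=0, max_value=5, value=0)
smoker = st.selectbox('Smoker', ['no', 'yes'])

if st.button('Estimate premium'):
    row = pd.DataFrame([{'age': age, 'bmi': bmi,
                         'no_of_child': children,
                         'smoker': 1 if smoker == 'yes' else 0}])
    st.write(f'Estimated annual charges: ${model.predict(row)[0]:,.2f}')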
Looking ahead, integrating real-time data from wearable devices and health apps could
revolutionize the system. Dynamic metrics such as heart rate, sleep quality, and activity levels
could be used to update insurance predictions on an ongoing basis, leading to personalized and
behavior-based premium assessments. Additionally, incorporating explainable AI tools like
SHAP or LIME can enhance transparency by showing how each feature influences a
prediction, building user confidence. As this system handles sensitive data, it must also comply
with privacy regulations like GDPR and HIPAA, employing encryption and secure
authentication to protect users. With these enhancements, the project can evolve into a scalable,
real-time, and trustworthy tool for modern insurance analytics.
11. BIBLIOGRAPHY
[1] Aurélien Géron – Hands-On Machine Learning with Scikit-Learn, Keras, and
TensorFlow
[2] Andreas C. Müller and Sarah Guido – Introduction to Machine Learning with Python
[3] Online tutorials and videos from Coursera, YouTube, and various blogs
[4] Ian Goodfellow, Yoshua Bengio, and Aaron Courville – Deep Learning
[5] Trevor Hastie, Robert Tibshirani, and Jerome Friedman – The Elements of Statistical
Learning
[6] Tom M. Mitchell – Machine Learning
[7] Kevin P. Murphy – Machine Learning: A Probabilistic Perspective
[8] Christopher M. Bishop – Pattern Recognition and Machine Learning
[9] Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin – Learning From
Data
[10] Sebastian Raschka and Vahid Mirjalili – Python Machine Learning
[11] Ethem Alpaydin – Introduction to Machine Learning
[12] Peter Flach – Machine Learning: The Art and Science of Algorithms that Make Sense
of Data
[13] Ron Zacharski – A Programmer’s Guide to Data Mining
[14] Francois Chollet – Deep Learning with Python
[15] Judith S. Hurwitz, Alan Nugent, Fern Halper, and Marcia Kaufman – Machine
Learning for Dummies
[16] Matt Harrison – Machine Learning Pocket Reference
[17] Steven Bird, Ewan Klein, and Edward Loper – Natural Language Processing with
Python
[18] Steven Skiena – The Data Science Design Manual