Internship Document - 1
CERTIFICATE
This is to certify that the mini project on “Health Insurance Price Prediction: Machine
Learning Regression for Predicting Health Insurance Prices” is a bonafide work done
by “K. Chaitanya Abhishikth (21471A4326), K. Ramya Sri Abhitha (21471A4327), P.
Anusha (21471A4344), P. Manohar Naga Jayanth Sai (21471A4346), S. Vamakeswari
(21471A4356)” in partial fulfillment of the requirements for the award of the degree of
Bachelor of Technology in the department of CSE (ARTIFICIAL INTELLIGENCE) of
NARASARAOPETA ENGINEERING COLLEGE, NARASARAOPET, during the
academic year 2024-25.
We express our deep-felt gratitude to Dr. B. Jhansi Vazram B. Tech., M. Tech., Ph.D.,
Professor & Head of the Department of CSE (AI) and also to Project Coordinator
Dr. Shaik Mohammed Jany M. Tech., Ph.D., Asst. Prof., Department of CSE (AI), whose
unstinting encouragement enabled us to accomplish our project successfully and in time.
We extend our sincere thanks to all other teaching and non-teaching faculty of the
department for their cooperation and encouragement during our B. Tech degree. We have
no words to acknowledge the warm affection, constant inspiration, and encouragement
that we received from our parents.
We affectionately acknowledge the encouragement received from our friends and all those
who were involved in giving valuable suggestions and clarifying our doubts, which really
helped us in successfully completing our project.
By
K. Chaitanya Abhishikth 21471A4326
K. Ramya Sri Abhitha 21471A4327
P. Anusha 21471A4344
P. Manohar Naga Jayanth Sai 21471A4346
S. Vamakeswari 21471A4356
ABSTRACT
Predicting medical insurance premiums has become increasingly significant due to the rising
cost of healthcare services. Insurance companies aim to provide accurate and fair premium
estimations, which require the consideration of multiple individual factors. This project
addresses the need for a robust predictive model that considers critical variables like age, body
mass index (BMI), smoking status, number of children, gender, and region. The goal is to
identify the most influential predictors and develop a reliable system that can accurately
forecast medical insurance prices.
The project utilizes machine learning, specifically regression models, to perform the prediction
task. Among the different approaches explored, Ridge Regression combined with polynomial
feature transformation has demonstrated significant improvement in performance. Ridge
Regression helps in reducing the effect of multicollinearity and prevents overfitting by
applying regularization. Polynomial features capture non-linear relationships, thus increasing
the model's expressiveness.
The dataset used for this project is sourced from Kaggle, containing real-world anonymized
data. The data undergoes extensive preprocessing, including handling missing values, encoding
categorical variables, and scaling numeric features. Exploratory Data Analysis (EDA) plays a
key role in understanding the distribution and relationships between variables.
Initial models such as simple and multiple linear regression serve as baselines. These are
enhanced by introducing polynomial terms and Ridge regularization. Model evaluation is
conducted using metrics such as the R-squared value and Mean Squared Error (MSE). A
pipeline approach ensures streamlined transformation and model training.
The results indicate a notable improvement in predictive performance when Ridge Regression
is applied. The final model achieves a high R-squared score on the test set, proving its efficacy.
Insights obtained from the model, such as the prominent impact of smoking on premium costs,
are critical for policy formulation and risk assessment in the insurance domain.
This project not only demonstrates the technical application of machine learning in insurance
but also lays a foundation for future enhancements involving more complex models and
broader datasets. It shows how technology can assist in building fairer and more accurate
systems for premium calculation.
TABLE OF CONTENTS
1. INTRODUCTION
The domain of this project is healthcare analytics, specifically focusing on insurance cost
prediction using machine learning. In the current digital age, data plays a critical role in
decision-making across industries, and healthcare is no exception. Insurance companies are
under pressure to offer competitive and personalized pricing, which can only be achieved by
leveraging data-driven methods. Machine learning provides tools and techniques to analyse
patterns in data and make informed predictions.
Machine learning (ML) is a branch of artificial intelligence that enables systems to learn from
historical data and improve their performance over time without being explicitly programmed.
ML can be categorized into supervised, unsupervised, and reinforcement learning. In this
project, supervised learning is used since the output variable (insurance premium) is known
and the task is to predict it based on input features.
Regression is a key technique under supervised learning, where the model estimates a
continuous outcome. Linear regression is one of the simplest and most widely used methods
for this purpose. However, real-world data often exhibit complexities that cannot be captured
by a simple linear relationship. Hence, more advanced techniques like Ridge Regression and
polynomial feature expansion are required to capture these intricacies.
Ridge Regression is a regularized version of linear regression. It introduces a penalty term to
the loss function, which helps in reducing the model's complexity and improves its ability to
generalize to unseen data. This is especially useful when there is multicollinearity among the
features, a common issue in datasets with many variables.
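Concretely, Ridge Regression minimizes the ordinary least-squares loss plus an L2 penalty on the coefficient vector:

Loss(β) = Σ_i (y_i − x_iᵀβ)² + α Σ_j β_j²

Here α is the regularization strength: α = 0 recovers ordinary linear regression, while larger values of α shrink the coefficients toward zero, trading a little bias for lower variance.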
The healthcare domain is particularly suitable for regression modelling due to the continuous
nature of many health indicators. Predictive analytics in healthcare helps improve service
delivery, reduce costs, and provide better patient outcomes. This project demonstrates how ML
techniques can be applied effectively in healthcare insurance prediction, ultimately helping
insurers assess risk and allocate resources more efficiently.
Understanding the domain allows developers and data scientists to choose the right features,
preprocessing techniques, and modelling strategies. It also helps interpret results in a
meaningful way, making the final application more valuable for stakeholders.
1.1 Contribution of the work
This project contributes to the field of healthcare analytics by applying machine learning
techniques to predict insurance costs with improved accuracy and interpretability. By utilizing
real-world data and transforming it through appropriate preprocessing techniques, the project
ensures that models are trained on clean, consistent, and machine-readable datasets. Encoding
categorical variables, handling numerical distributions, and preparing the data using standard
scaling techniques allowed for effective model training and evaluation. These foundational
steps enhanced the reliability of the results and enabled the application of more sophisticated
algorithms beyond basic linear regression.
A significant contribution of the work lies in the use of Ridge Regression, which improves
upon standard linear regression by addressing the problem of multicollinearity. In real-world
healthcare datasets, input variables often exhibit correlations, which can lead to overfitting and
unstable predictions in simple linear models. Ridge Regression introduces regularization by
penalizing large coefficients, thereby controlling model complexity and improving
generalization on unseen data. This approach not only yields better prediction accuracy but
also offers more robust insights into how different factors, such as age, BMI, or smoking status,
impact insurance charges.
The project further adds value by implementing polynomial feature expansion, enabling the
model to capture nonlinear relationships between features and the insurance cost. Health-
related factors rarely affect outcomes in a purely linear fashion—for example, the impact of
BMI or age on insurance premiums might increase exponentially or interact with other
variables. By transforming the feature space, the model becomes more expressive and capable
of identifying these intricate patterns, which is critical for generating actionable insights in a
domain as complex as healthcare.
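As a minimal illustration of this transformation (a sketch using a toy two-feature matrix, not the project dataset), scikit-learn's PolynomialFeatures adds squared and interaction terms:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# toy example: two rows of [age, bmi]
X = np.array([[25, 22.0],
              [52, 31.5]])
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
# expanded columns: age, bmi, age^2, age*bmi, bmi^2
print(poly.get_feature_names_out(['age', 'bmi']))

The interaction column age*bmi captures exactly the kind of joint effect described above, which a purely linear model cannot represent.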
Another key contribution is the domain-specific interpretation and understanding of the results.
Rather than treating this project as a purely technical exercise, the approach integrates
healthcare knowledge to inform feature selection, model design, and result evaluation. This
contextual understanding ensures the insights generated are meaningful and relevant for
stakeholders such as insurance companies, healthcare providers, and policyholders. Overall,
this work demonstrates the effective use of machine learning in a real-world healthcare
scenario, offering a pathway toward more data-driven and personalized insurance pricing.
1.2 Existing System
Traditional insurance pricing systems rely largely on actuarial tables and manually defined
rating rules. Although these methods have been in use for decades and are grounded in
empirical knowledge, they lack adaptability
and fail to leverage the predictive power of modern computational tools.
Many traditional systems treat each variable independently and ignore potential interactions
between features. For example, the combined effect of age and BMI on medical costs may be
more significant than their individual effects. Without capturing such interactions, the
predictions from these models may lack accuracy and granularity.
Another limitation of conventional systems is their inability to handle multicollinearity
effectively. When input features are correlated, standard linear regression models become
unstable, resulting in high variance and unreliable predictions. Furthermore, most existing
approaches do not implement automated hyperparameter tuning or cross-validation, leading to
suboptimal models.
The increasing availability of health-related data, including electronic health records and self-
reported lifestyle information, calls for more advanced analytical methods. Machine learning
models, especially those incorporating regularization and transformation techniques, offer
better performance by adapting to data complexity and uncovering hidden patterns.
Despite the availability of open-source tools and computing resources, many insurance
companies are slow to adopt machine learning due to a lack of understanding and the perceived
complexity of these methods. However, research projects like this demonstrate that
implementing modern regression techniques can significantly improve predictive performance
with relatively modest effort.
By comparing the results of this project with baseline models, it becomes evident that
incorporating polynomial features and Ridge regularization enhances both accuracy and model
stability. This underscores the importance of transitioning from conventional to data-driven
methodologies in the insurance industry.
1.3 Proposed System
The proposed system is a comprehensive machine learning solution built to predict medical
insurance premiums with high accuracy. The system leverages data preprocessing, feature
engineering, polynomial transformation, and Ridge Regression within a unified pipeline. The
main goal is to address the shortcomings of existing systems and provide an advanced, reliable,
and interpretable prediction model.
At the core of the system is a supervised learning pipeline. The first step involves importing
the dataset and performing data cleaning. This includes handling missing values, correcting
data types, and ensuring consistency. Categorical variables such as gender, region, and smoker
status are encoded using one-hot encoding to make them suitable for model training. Numerical
features are standardized using StandardScaler to bring them to a uniform scale.
Once the data is clean and properly formatted, polynomial feature expansion is applied. This
technique generates new features by computing all polynomial combinations of the original
features up to a specified degree. It helps the model learn non-linear relationships, which are
often prevalent in real-world data. For instance, the impact of age on insurance cost might not
increase linearly but exponentially.
To avoid overfitting due to the expanded feature space, Ridge Regression is employed. Ridge
introduces L2 regularization, which penalizes large coefficients and discourages complex
models. The regularization strength, or alpha, is optimized using GridSearchCV, which tests
different values through cross-validation. This ensures the final model is both accurate and
generalizable.
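A minimal sketch of such a unified pipeline (assuming the features have already been numerically encoded, as in the implementation; the step names and alpha grid here are illustrative):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# scale -> expand -> regularized regression, tuned as one object
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('ridge', Ridge())
])
param_grid = {'ridge__alpha': [0.01, 0.1, 1, 5, 10, 100]}
search = GridSearchCV(pipe, param_grid, scoring='r2', cv=10)
# search.fit(x_train, y_train) selects the alpha with the best
# cross-validated R2 and refits the pipeline on the training set

Wrapping all three steps in one Pipeline ensures the scaler and polynomial transformer are fit only on the training folds during cross-validation, avoiding data leakage.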
After training, the model is evaluated using performance metrics such as R2 score and Mean
Squared Error. Visualization tools such as heatmaps, scatter plots, and line graphs are used to
understand feature importance and compare predicted vs. actual values. These visual aids make
the model outputs more interpretable to stakeholders.
The proposed system represents a shift from traditional heuristics to data-driven insights. It not
only improves predictive accuracy but also enhances transparency and interpretability, making
it a valuable tool for insurers aiming to adopt modern technology in risk assessment and
pricing.
2. LITERATURE SURVEY & FEASIBILITY STUDY
2.1 Literature Survey
Insurance companies typically use actuarial methods to determine premiums, relying on
historical trends and risk assessment. However, these traditional methods often do not capture
non-linear relationships or interactions among predictors. As a result, premiums might be
inaccurately assessed, leading either to customer dissatisfaction or financial loss for the insurer.
The presence of multicollinearity, outliers, and data imbalance can further complicate the
modeling process. Variables such as BMI and age may not have a linear relationship with
insurance cost, and categorical features like smoker status can disproportionately affect the
outcome. Therefore, a comprehensive modeling approach is needed to account for these issues.
This project formulates the problem as a regression task, where the dependent variable is the
insurance premium and the independent variables are the customer attributes. The goal is to
minimize the prediction error while ensuring that the model generalizes well to new, unseen
data. A significant component of the problem also includes identifying the most influential
variables that drive insurance costs.
Another key aspect of the problem is interpretability. While building an accurate model is
important, stakeholders such as insurance providers and customers need to understand how
predictions are made. Therefore, techniques like feature importance analysis and visualization
of relationships are essential to provide insights into the model’s decision-making process.
The proposed solution involves developing a machine learning pipeline that includes data
preprocessing, feature engineering, model training, and evaluation. The model of choice is
Ridge Regression enhanced with polynomial features to capture non-linear relationships
between variables. This approach allows for a more flexible and robust modeling process
compared to standard linear regression.
The pipeline starts with thorough data preprocessing. Categorical variables are encoded using
one-hot encoding, and numerical features are standardized. This ensures that the model treats
all input features equally and that the learning process is not biased by differing scales. Missing
values, if any, are handled through imputation, and outliers are examined and treated based on
their impact on the model.
Polynomial features are generated to allow the model to learn interactions between variables.
For example, the interaction between age and BMI might be significant in determining the
premium. However, introducing polynomial terms can increase the risk of overfitting. To
counter this, Ridge Regression applies L2 regularization, which penalizes large coefficients
and encourages simpler models.
To optimize the model, GridSearchCV is used to perform hyperparameter tuning. This ensures
that the regularization strength (alpha) is selected based on cross-validation performance,
thereby improving the model's generalization ability. The model's performance is evaluated
using R-squared and MSE metrics, both on training and test datasets.
The proposed solution also includes visualization techniques to assess feature correlations and
model predictions. Heatmaps, regression plots, and comparison graphs between actual and
predicted values provide a comprehensive view of the model's effectiveness. These visual tools
also help in communicating findings to non-technical stakeholders.
By implementing this solution, the project achieves a balance between model accuracy and
interpretability. It also lays the groundwork for future enhancements, such as integrating more
advanced models or deploying the solution as a software tool for insurance companies.
The proposed system provides several advantages over traditional and existing methods.
Firstly, it delivers significantly improved prediction accuracy. By leveraging polynomial
features and regularization, the model captures complex relationships while minimizing
overfitting. This results in a higher R2 score, indicating that a greater proportion of the variance
in premium prices is explained by the input features.
Another major advantage is the system’s robustness. Regularization techniques like Ridge
make the model less sensitive to noise and multicollinearity. This ensures more stable
predictions even when the data includes correlated or less informative variables. It also
contributes to better performance on test data, which is crucial for real-world applications.
The system is highly interpretable. By analyzing feature importance and visualizing
correlations, users can understand which attributes have the greatest impact on insurance
prices. For example, the model reveals that smoking status significantly influences premiums,
followed by age and BMI. These insights can guide insurance companies in designing targeted
policies and interventions.
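One simple way to inspect such importance (a sketch, assuming a Ridge model fit directly on the six standardized input features, without polynomial expansion) is to rank the magnitudes of the learned coefficients:

import pandas as pd

features = ['age', 'bmi', 'gender', 'no_of_child', 'region', 'smoker']
coefs = pd.Series(ridge_model.coef_, index=features)
# larger absolute coefficients on standardized features suggest stronger influence
print(coefs.reindex(coefs.abs().sort_values(ascending=False).index))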
Flexibility is another strength of the system. It is built using a modular pipeline that can be
easily extended to include new features, models, or datasets. This allows for continuous
improvement and adaptation to changing data environments. The use of open-source tools
ensures that the system is accessible and can be implemented without significant financial
investment.
Additionally, the system supports transparency and accountability. Every stage of the process,
from data preprocessing to model evaluation, is documented and reproducible. This is
particularly important in domains like healthcare and insurance, where ethical considerations
and compliance with regulations are critical.
Overall, the proposed system offers a balance of accuracy, robustness, interpretability, and
flexibility. These advantages make it a practical and powerful tool for modern insurance
pricing, supporting data-driven decision-making and operational efficiency.
2.2 Feasibility Study
Feasibility analysis is a critical component of any project as it determines whether the project
can be successfully implemented from operational, technical, and economic perspectives. This
project has been evaluated on all three fronts to ensure that it meets practical requirements and
resource constraints.
Operational Feasibility focuses on the usability and implementability of the system in a real-
world environment. The system is designed to be user-friendly and requires minimal input from
users after setup. It can be operated by professionals with basic knowledge of data science and
Python, making it accessible to a wide range of organizations. The use of widely adopted tools
like Jupyter Notebook and scikit-learn ensures that operational support is readily available.
Technical Feasibility evaluates the technical resources and expertise required to build and
maintain the system. This project is implemented using Python and leverages libraries such as
pandas, numpy, matplotlib, and sklearn, which are industry standards for data analysis and
machine learning. The required hardware specifications are minimal, with most operations
capable of running on a standard laptop or desktop with at least 4 GB RAM. This ensures that
technical barriers to implementation are low.
Economic Feasibility assesses the cost-effectiveness of the project. Since the tools and datasets
used are open-source, there are no licensing costs. The computational resources required are
modest, making the solution affordable even for small- to mid-sized insurance firms.
Additionally, the insights gained from improved premium prediction can lead to substantial
cost savings and increased profitability, justifying the investment in the system.
In conclusion, the feasibility study confirms that the project is practical, sustainable, and
beneficial. It requires reasonable resources, is built on accessible technology, and offers
substantial return on investment. These factors collectively support the successful deployment
and long-term viability of the system in real-world insurance applications.
3. SOFTWARE REQUIREMENTS
4. SYSTEM ANALYSIS
4.1 Specifications
The specifications of a system form the foundation for its development and implementation.
They outline the functional and non-functional requirements that the system must meet to be
considered successful. Functional requirements describe the specific behaviors and functions
the system should perform, while non-functional requirements refer to the quality attributes of
the system such as performance, usability, reliability, and maintainability.
For the Medical Insurance Price Prediction system, one of the primary functional requirements
is the ability to ingest a dataset in CSV format and preprocess it for model training. This
involves cleaning the data, encoding categorical variables, handling missing values, and
standardizing numerical variables. Another critical requirement is the implementation of
multiple regression models, including linear, polynomial, and Ridge Regression models. The
system should also include modules to evaluate these models using standard metrics such as
R2 score and MSE.
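For reference, the two metrics are defined as

MSE = (1/n) Σ_i (y_i − ŷ_i)²
R² = 1 − Σ_i (y_i − ŷ_i)² / Σ_i (y_i − ȳ)²

where y_i is the actual charge, ŷ_i the predicted charge, and ȳ the mean of the actual charges. An R² close to 1 indicates that the model explains most of the variance in charges.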
The system must allow for visualization of relationships among features using heatmaps and
scatter plots. This supports exploratory data analysis and model interpretation. Additionally,
the model should produce output in a user-readable format, including a comparison between
actual and predicted premium values. These functionalities enhance transparency and usability
of the system for end users such as insurance analysts and data scientists.
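A minimal sketch of such a comparison output (assuming y_test and the predictions yhat from the evaluation step are available):

import pandas as pd

comparison = pd.DataFrame({'actual': y_test, 'predicted': yhat.round(2)})
comparison['error'] = (comparison['actual'] - comparison['predicted']).round(2)
print(comparison.head(10))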
In terms of non-functional requirements, the system should be reliable, ensuring consistent
outputs for the same input under the same conditions. It should also be scalable, allowing for
the integration of new data or models without significant changes to the architecture. The
system must be maintainable, with modular code and clear documentation to support future
updates.
Security is another important consideration, especially if the system is deployed in
environments handling sensitive personal data. Although the current implementation does not
include security features, future enhancements may include access controls and encryption for
data protection. Overall, the system's design and specifications ensure that it meets the practical
needs of its users while remaining flexible and extendable for future development.
4.2 Modules
The system is organized into distinct modules, each responsible for a specific stage of the
machine learning workflow. This modular approach enhances clarity, maintainability, and
extensibility of the system. The key modules include data preprocessing, model development,
evaluation, and visualization.
Data Preprocessing Module: This module is responsible for preparing raw data for analysis.
It includes operations like loading the dataset, handling missing values, encoding categorical
variables using one-hot encoding, and standardizing numerical features. These steps ensure
that the dataset is clean, structured, and suitable for feeding into machine learning models.
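A minimal sketch of these steps (the column names follow the headers used in the implementation; the file is assumed to sit in the working directory, and missing-value handling as in Chapter 6 is assumed to have run first):

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('medical_insurance_dataset.csv',
                 names=['age', 'gender', 'bmi', 'no_of_child',
                        'smoker', 'region', 'charges'])
# one-hot encode the region codes into binary indicator columns
df = pd.get_dummies(df, columns=['region'], prefix='region')
# standardize the continuous features to zero mean and unit variance
scaler = StandardScaler()
df[['age', 'bmi']] = scaler.fit_transform(df[['age', 'bmi']])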
Figure 4.1 shows the dataset before preprocessing, typically used for analyzing health
insurance charges. It includes demographic and health-related attributes such as age, sex, body
mass index (BMI), number of children, smoking status, residential region, and insurance
charges. Visual summaries indicate that ages range from 18 to 64, and BMI values range from
16 to 53.1. The gender distribution is fairly balanced, with 51% male and 49% female. Around
20% of individuals are smokers, and 80% are non-smokers. The dataset also shows the
distribution of individuals across different regions, with the southeast having the highest share
at 28%. The charges column represents the insurance cost incurred by each individual. Before
using this data for machine learning or statistical modeling, preprocessing steps like encoding
categorical variables, scaling numerical features, and handling missing data would be necessary
to prepare it for analysis.
After preprocessing, one-hot encoding replaces the categorical columns with binary indicator
columns that flag whether an individual belongs to
specific regions (like southeast or northwest). Other columns like age, bmi, no_of_child, and
charges remain as numerical values, ready for use in modeling. This transformation ensures
the dataset is in a machine-readable format, allowing algorithms to interpret categorical data
effectively.
A box plot compares the distribution of insurance charges based on smoking status.
Non-smokers tend to have lower charges, with most values falling
below $15,000, though some outliers extend up to around $35,000. In contrast, smokers have
substantially higher charges, with the median around $35,000 and values reaching beyond
$60,000. The interquartile range for smokers is also much wider, indicating greater variability
in charges. This plot clearly illustrates that smoking status is a major factor affecting insurance
costs, which aligns with the correlation analysis showing a strong positive relationship between
smoking and charges. Such a visualization is useful for identifying key cost drivers in health
insurance analytics.
Each module is designed to be independent but integrated within a common pipeline. This
structure allows users to execute the entire workflow sequentially or modify individual
components as needed. The modular design also facilitates future upgrades, such as
incorporating new models or extending the dataset with additional features.
5. SYSTEM DESIGN
5.1 SYSTEM ARCHITECTURE
The system architecture is organized as a sequence of stages: Data Preprocessing cleans and
encodes the raw input, Feature Transformation applies scaling and polynomial expansion, and
Model Training fits regression techniques
like Ridge Regression with GridSearchCV. Finally, Output & Visualization presents prediction
results, performance metrics, and visual insights.
5.2 DATA FLOW DIAGRAMS (DFD)
At the first level of detail, the preprocessing stage passes cleaned data into the polynomial
transformation process, which in turn feeds the model training phase.
Further, a Detailed Level DFD expands on individual subprocesses, such as the model training
block. It breaks down the Ridge Regression training into finer components — setting up the
polynomial transformer, splitting the data, applying cross-validation, training the model, and
generating predictions. These detailed DFDs are instrumental in understanding and debugging
specific sections of the system.
Data stores are also represented in DFDs, showing where intermediate data is held between
processes. Examples include the cleaned dataset after preprocessing and the transformed
dataset post polynomial expansion. These storage blocks play an important role in modularity
and process isolation.
Overall, DFDs make the system transparent and help developers and stakeholders alike to
comprehend how data is being processed. They support better planning, easier debugging, and
smoother future enhancements by making the flow of information within the system explicit.
5.3 UNIFIED MODELING LANGUAGE (UML) DIAGRAMS
Unified Modeling Language (UML) diagrams provide a structured way to visualize a system’s
architecture, its components, and the interactions among them. In this project, UML diagrams
help model the dynamic behavior of the insurance prediction system and make the design
process more methodical and standardized.
The Use Case Diagram outlines the interactions between users (actors) and the system. The
primary actor is the data analyst or insurance officer who interacts with the system to input
data, train models, generate predictions, and visualize results. The diagram identifies key use
cases such as “Upload Dataset,” “Preprocess Data,” “Train Model,” “Tune Parameters,”
“Evaluate Model,” and “Generate Report.” These use cases help map user expectations and
guide the implementation of core functionalities.
The Sequence Diagram provides a timeline-based view of how processes interact over time. It
shows the sequence of method calls or events as the system progresses from data upload to
final output. For instance, the diagram would begin with the user invoking the data loader,
which then triggers preprocessing, followed by feature transformation, model training, and
evaluation. This representation is crucial for understanding system behavior during runtime
and for identifying bottlenecks or inefficiencies.
The Collaboration Diagram focuses on how objects in the system interact with each other to
perform a function. It emphasizes the relationships and message exchanges among components
such as the Preprocessor, Transformer, Regressor, and Visualizer classes. By visualizing object
collaboration, developers can better structure the system into loosely coupled, highly cohesive
modules.
The Activity Diagram captures the workflow of the system. It presents the logical flow of
activities such as data validation, encoding, transformation, and model training in a clear and
sequential manner. Branches and decision nodes are used to model conditional paths, such as
whether to apply standard scaling or polynomial transformation based on user input or model
choice.
These UML diagrams contribute to clearer documentation, easier collaboration among team
members, and a better understanding of both the structure and behavior of the system. They
are essential tools for planning, developing, testing, and maintaining complex machine learning
applications like this one.
6. IMPLEMENTATION
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# for regression model design
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split
url = '/Users/graceluan/Documents/Data Science Job Hunting/Medical Insurance Price Prediction/medical_insurance_dataset.csv'
df = pd.read_csv(url, header=None)
headers = ['age','gender','bmi','no_of_child','smoker','region','charges']
df.columns = headers
# region 1,2,3,4 stands for US region NW, NE, SW, SE respectively
df.dtypes #age, smoker should be integer
# checking nan, empty values
print(df['age'].unique())
print(df['smoker'].unique())
# discovered the question mark
df.replace('?',np.nan, inplace=True) #replace ? with nan, easy to convert to float
# replace ? in age with the mean
age_mean = df['age'].astype('float').mean()
df['age'].replace(np.nan, age_mean, inplace=True)
df['age'] = df['age'].astype('int')
# replace ? in smoker with the mode
smk_mode = df['smoker'].value_counts()
print(smk_mode)  # more 0s: non-smoker is the mode
df['smoker'] = df['smoker'].replace(np.nan, '0').astype('int')
# round charges to two decimal places
df['charges'] = np.round(df['charges'], 2)
df.head(3)
correlation = df.corr()
plt.figure(figsize=(5, 5))
sns.heatmap(
correlation,
xticklabels=correlation.columns.values,
yticklabels=correlation.columns.values,
annot=True,
annot_kws={'size': 8}
)
# Axis ticks size
plt.xticks(fontsize=8)
plt.yticks(fontsize=8)
plt.show()
sns.regplot(x='age',y='charges',data=df, line_kws={'color':'red'})
plt.title('Regression Plot for Price to Age')
plt.show()
sns.regplot(x='bmi',y='charges',data=df, line_kws={'color':'red'})
plt.title('Regression Plot for Price to BMI')
plt.show()
sns.boxplot(x='smoker', y='charges', data=df)
# Explore a simple linear regression with only smoker attribute with price.
X = df[['smoker']]
Y = df['charges']
lm = LinearRegression()
lm.fit(X,Y)
print('R2_score is:',lm.score(X,Y).round(3))
# Use all the attributes to fit a linear regression model.
Z = df[['age','bmi','gender','no_of_child','region','smoker']]
lm.fit(Z,Y)
print('R2_score with all attributes is:',lm.score(Z,Y).round(3))
# Continue to adjust the model with polynomial features and standard scaler. Create a training pipeline.
steps = [
    ('scale', StandardScaler()),
    ('Polynomial', PolynomialFeatures(include_bias=False)),
    ('model', LinearRegression())
]
pipe = Pipeline(steps)
Z = Z.astype('int')
pipe.fit(Z,Y)
ypipe = pipe.predict(Z)
print('R2_score with pipeline adjustment is:', r2_score(Y,ypipe).round(3))
# split the data to test and train set
x_train, x_test, y_train, y_test = train_test_split(Z, Y, test_size=0.2, random_state=1)
# initialize ridge regressor with alpha=0.1 and fit it on the training data
ridge_model = Ridge(alpha=0.1)
ridge_model.fit(x_train,y_train)
yhat = ridge_model.predict(x_test)
print('Ridge model (alpha 0.1) r2 score is:', r2_score(y_test,yhat))
# use grid search to try different alpha
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
param_grid = {'alpha':[0.01,0.1,1,5,10,100]}
scoring = make_scorer(r2_score)
grid_search = GridSearchCV(estimator=ridge_model, param_grid=param_grid, scoring=scoring, cv=10)
grid_search.fit(x_train,y_train)
best_model = grid_search.best_estimator_
yhat = best_model.predict(x_test)
print(f"Test 22 Score: {r2_score(y_test, yhat)}")
# apply polynomial transformation to the training parammeters degree=2
21
pr = PolynomialFeatures(degree=2)
x_train_pr = pr.fit_transform(x_train)
x_test_pr = pr.transform(x_test)  # transform only: the transformer is fit on training data alone
ridge_model.fit(x_train_pr, y_train)
y_hat = ridge_model.predict(x_test_pr)
print(r2_score(y_test, y_hat))
7. TESTING ANALYSIS
Testing is a vital phase in the machine learning development lifecycle, as it ensures the
accuracy, reliability, and robustness of the system. It helps identify issues in the workflow and
validates whether the implemented model is functioning as expected. In this project, testing
was carried out using a combination of black box testing, white box testing, and validation
techniques, each targeting different aspects of the system.
Black box testing focuses on the system's functionality without considering its internal
structure or workings. In this context, black box testing involved feeding the model with
different sets of input data and verifying whether the outputs (predicted insurance premiums)
were within logical and expected ranges. Edge cases, such as extreme BMI values or age
boundaries, were tested to assess how well the model handles uncommon scenarios. The output
predictions were also checked for consistency when run multiple times under the same
conditions.
White box testing, on the other hand, involves understanding the internal logic of the system.
Each function and module was tested independently to ensure correctness. For instance, data
preprocessing functions were validated by comparing expected encoded and scaled outputs
against actual ones. Similarly, the polynomial transformation module was tested by ensuring
that the correct number of additional features was generated for a given polynomial degree.
Model training steps were traced to confirm that Ridge Regression was being applied with the
correct hyperparameters.
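For example, with d input features a degree-2 expansion (excluding the bias column) produces d + d(d+1)/2 output columns; for the six inputs used here that is 6 + 21 = 27 features, a count the tests can verify directly against the transformer's output shape.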
A critical component of testing in this project was model validation, which helps assess the
model's performance on unseen data. The dataset was split into training and testing subsets in
an 80-20 ratio. The training set was used to build the model, while the testing set evaluated its
generalization capability. GridSearchCV was employed to perform cross-validation, which
further splits the training data into multiple folds to find the best value of the regularization
parameter (alpha). This technique helps prevent overfitting by ensuring that the model performs
well across multiple data subsets.
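The per-fold scores behind this procedure can also be inspected directly (a sketch, assuming x_train and y_train from the 80-20 split above):

from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# 10-fold cross-validated R2 scores for one candidate alpha
scores = cross_val_score(Ridge(alpha=1.0), x_train, y_train,
                         scoring='r2', cv=10)
print('per-fold R2:', scores.round(3))
print('mean R2:', scores.mean().round(3))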
The results from validation showed that the model performed consistently across folds, with
the R2 score on the test set reaching approximately 0.783. Mean Squared Error (MSE) values
were also calculated and monitored to assess the model's prediction error. Lower MSE and high
R2 values indicated that the model was not only accurate but also reliable across different test
scenarios. Visual analysis using residual plots confirmed that errors were randomly distributed,
which is a good sign of a well-fitted model.
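A minimal residual-plot sketch of the kind used for this check (assuming y_test and the model predictions yhat are available):

import matplotlib.pyplot as plt

residuals = y_test - yhat
plt.scatter(yhat, residuals, alpha=0.5)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted charges')
plt.ylabel('Residual (actual - predicted)')
plt.title('Residuals vs. Predicted Values')
plt.show()

Residuals scattered evenly around the zero line, with no funnel or curved pattern, support the conclusion that the model is well fitted.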
Finally, the robustness of the system was tested by simulating different operational conditions.
These included providing missing or incorrect input types, testing with large datasets, and
measuring processing time. The system responded gracefully to such inputs, with
preprocessing modules handling errors effectively. The implementation of exception handling
and modular code contributed to the system’s resilience. All these testing strategies together
ensured that the developed system is both functional and dependable in a practical setting.
8. RESULT ANALYSIS
This chapter presents the various output screens and visualizations generated during the development and
evaluation of the insurance price prediction system. These outputs provide both quantitative
and qualitative insights into the performance and behaviour of the system, facilitating better
understanding and communication of results. The outputs also help in interpreting the model's
predictions and evaluating its overall reliability.
The first significant output is the correlation heatmap, which visually represents the
relationships between different variables in the dataset. This heatmap was created using the
seaborn library and indicates how strongly each feature correlates with the insurance charges.
The most striking correlation observed was between the smoker status and premium charges,
with a coefficient close to 0.79, as shown in Figure 8.1, highlighting it as the most influential
predictor. Age and BMI also showed moderate correlations, while other features like region
and gender were less impactful. This heatmap guided the feature engineering process by
helping prioritize important variables.
The next outputs were the model evaluation metrics, where the R2 score showed that the model explains approximately
78% of the variance in the target variable. The MSE value gave a direct sense of average
prediction error. These outputs were instrumental in comparing different models during the
development phase.
Additionally, the GridSearchCV output provided insights into the best-performing model
parameters. Figure 8.2 displays the optimal value for the alpha parameter used in Ridge
Regression. This tuning output ensured that the model avoided overfitting while maintaining
accuracy. The results of GridSearchCV were logged and printed clearly, showing the scores
achieved with different alpha values. This transparency helped in validating that the chosen
model configuration was the best among the tested options.
9. CONCLUSION
The primary objective of this project was to build a predictive model capable of estimating
medical insurance premiums using personal attributes such as age, BMI, smoking status, and
more. By leveraging machine learning techniques, especially regression models, the project
successfully implemented a robust and interpretable system. Ridge Regression, coupled with
polynomial features, emerged as the most effective approach, delivering reliable performance
while minimizing overfitting through regularization and proper hyperparameter tuning.
A major highlight of the project was the identification of key predictors influencing insurance
costs. Through correlation analysis and exploratory data examination, smoking status was
found to have the strongest positive correlation with medical charges, followed by age and
BMI. These insights not only enhanced the model's predictive accuracy but also provided
valuable information for insurance providers seeking to refine their pricing strategies. Such
findings can support initiatives aimed at encouraging healthier lifestyles among policyholders
through incentives or targeted premium adjustments.
The model achieved a strong R² score on the test data, demonstrating its effectiveness in
explaining a significant portion of the variation in medical costs. Incorporating polynomial
features allowed the model to capture non-linear relationships, which are common in health-
related datasets. Additionally, Ridge Regression’s regularization ensured that the model
maintained its generalization ability without being overly influenced by outliers or noise. The
modular and open-source design of the project ensures flexibility and ease of adaptation.
Implemented in a Jupyter Notebook format, it is highly accessible and allows for transparency,
collaboration, and reproducibility. Moreover, the inclusion of meaningful visualizations
throughout the process not only supported data understanding but also improved
communication with stakeholders, fostering trust in the model’s predictions. This structure also
enables further enhancements, such as integrating new features like medical lifestyle habits.
In summary, this project is a solid demonstration of the practical use of machine learning in
the healthcare insurance domain. It exemplifies how technical solutions can be aligned with
real-world applications to support better decision-making. The developed model serves as a
valuable tool not just for insurance companies, but also for healthcare providers and researchers
looking to explore the intersection of data science and health economics. It establishes a strong
foundation for future research and development in predictive analytics for medical insurance.
10. FUTURE ENHANCEMENTS
As technology continues to evolve and more health-related data becomes accessible, there are
numerous opportunities to enhance the medical insurance price prediction system. One major
area for improvement lies in expanding the feature set used for predictions. While the current
model relies on basic attributes like age, BMI, smoking status, and region, incorporating
additional variables such as income level, exercise routines, dietary habits, and chronic disease
history can improve prediction accuracy and create a more holistic view of an individual’s
health profile. This enriched dataset would allow the model to capture deeper patterns and
provide more personalized premium estimates.
Beyond expanding the dataset, exploring more sophisticated machine learning algorithms is
another promising direction. Ridge Regression has shown strong performance in this project,
but advanced models such as Random Forest, Gradient Boosting Machines (GBMs), and
XGBoost are well-suited for handling complex, non-linear relationships in the data. These
ensemble methods improve prediction performance by combining multiple weak learners and
offer tools for identifying the importance of different variables. Fine-tuning such models using
cross-validation can result in more accurate and reliable outputs across diverse data inputs.
For real-world use, deployment and user accessibility are crucial. While the current
implementation is ideal for experimentation in Jupyter Notebooks, transitioning to a web or
mobile application using frameworks like Flask or Streamlit can make the system widely
accessible. This would allow users such as insurance agents, healthcare providers, or customers
to enter personal data and receive instant, personalized insurance estimates. A clean, interactive
interface would not only increase adoption but also improve user experience and trust in the
system.
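As an illustration of this direction, a minimal Streamlit front end might look like the sketch below. The saved pipeline file 'ridge_pipeline.pkl' and the exact input fields are assumptions rather than part of the current implementation, and a real application would expose all of the model's features:

import joblib
import pandas as pd
import streamlit as st

# hypothetical: a fitted preprocessing + Ridge pipeline saved with joblib.dump
model = joblib.load('ridge_pipeline.pkl')

st.title('Medical Insurance Premium Estimator')
age = st.number_input('Age', min_value=18, max_value=64, value=30)
bmi = st.number_input('BMI', min_value=15.0, max_value=55.0, value=25.0)
children = st.number_input('Number of children', min_value=0, max_value=5, value=0)
smoker = st.selectbox('Smoker', ['no', 'yes'])

if st.button('Estimate premium'):
    row = pd.DataFrame([{'age': age, 'bmi': bmi,
                         'no_of_child': children,
                         'smoker': 1 if smoker == 'yes' else 0}])
    st.write(f'Estimated annual charges: ${model.predict(row)[0]:,.2f}')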
Looking ahead, integrating real-time data from wearable devices and health apps could
revolutionize the system. Dynamic metrics such as heart rate, sleep quality, and activity levels
could be used to update insurance predictions on an ongoing basis, leading to personalized and
behavior-based premium assessments. Additionally, incorporating explainable AI tools like
SHAP or LIME can enhance transparency by showing how each feature influences a
prediction, building user confidence. As this system handles sensitive data, it must also comply
with privacy regulations like GDPR and HIPAA, employing encryption and secure
authentication to protect users. With these enhancements, the project can evolve into a scalable,
real-time, and trustworthy tool for modern insurance analytics.
11. BIBLIOGRAPHY
[1] Aurélien Géron – Hands-On Machine Learning with Scikit-Learn, Keras, and
TensorFlow
[2] Andreas C. Müller and Sarah Guido – Introduction to Machine Learning with Python
[3] Online tutorials and videos from Coursera, YouTube, and various blogs
[4] Ian Goodfellow, Yoshua Bengio, and Aaron Courville – Deep Learning
[5] Trevor Hastie, Robert Tibshirani, and Jerome Friedman – The Elements of Statistical
Learning
[6] Tom M. Mitchell – Machine Learning
[7] Kevin P. Murphy – Machine Learning: A Probabilistic Perspective
[8] Christopher M. Bishop – Pattern Recognition and Machine Learning
[9] Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin – Learning From
Data
[10] Sebastian Raschka and Vahid Mirjalili – Python Machine Learning
[11] Ethem Alpaydin – Introduction to Machine Learning
[12] Peter Flach – Machine Learning: The Art and Science of Algorithms that Make Sense
of Data
[13] Ron Zacharski – A Programmer’s Guide to Data Mining
[14] Francois Chollet – Deep Learning with Python
[15] Judith S. Hurwitz, Alan Nugent, Fern Halper, and Marcia Kaufman – Machine
Learning for Dummies
[16] Matt Harrison – Machine Learning Pocket Reference
[17] Steven Bird, Ewan Klein, and Edward Loper – Natural Language Processing with
Python
[18] Steven Skiena – The Data Science Design Manual