MACHINE LEARNING
INTERNSHIP REPORT
Submitted by
SANTHOSH S
[EA*******10127]
June 2025
DIRECTORATE OF ONLINE EDUCATION
BONAFIDE CERTIFICATE
Certified that this Internship Work was carried out under my supervision along with the company mentor. Certified further that, to the best of my knowledge, the work reported herein does not form part of any other internship report or dissertation on the basis of which a degree or award was conferred on an earlier occasion for this or any other candidate.
INTERNSHIP OFFER LETTER
THIS WORK HAS BEEN CARRIED OUT IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE AWARD OF THE DEGREE.
ACKNOWLEDGEMENTS
We express our heartfelt gratitude to SRM Institute of Science and Technology for the facilities extended for the project work and the continued support. We extend our sincere thanks to the Director, Directorate of Online Education, SRM Institute of Science and Technology, Prof. Dr. Manoranjan Pon Ram, for his invaluable support. We want to convey our thanks to the Programme Coordinator, Dr. G. Babu, Directorate of Online Education, SRM Institute of Science and Technology, for his inputs during the project reviews and his support. Our inexpressible respect and thanks go to my guide, Dr. G. Babu, Assistant Professor and Programme Coordinator, Directorate of Online Education, SRM Institute of Science and Technology, who provided me with the freedom and support to explore the research topics of my interest. His passion for solving problems and making a difference in the world has always been inspiring. We sincerely thank the staff and students of the Directorate of Online Education, SRM Institute of Science and Technology, for their help during our project. Finally, we would like to thank our parents, family members, and friends for their unconditional love, constant support, and encouragement.
SANTHOSH S
TABLE OF CONTENTS
1. ABSTRACT
2. INTRODUCTION
2.1 Objective
4.3 Machine learning Algorithms and where they are used
11 CONCLUSION
REFERENCES
1. ABSTRACT
Health care costs are increasing day by day. As more and more new viruses spread among people, there is a need to predict health charges. Such predictions help governments make decisions regarding health issues, and they also help people understand the scale of health care costs. Machine learning is a field that has an impact on every domain, and health care systems use machine learning models for several health-related applications. In this report, we perform predictive analysis on medical health insurance charges. We build a model to predict the medical insurance cost of a person from features such as age, gender, BMI, smoking status, number of children, and region. We collected the dataset from Kaggle; it contains 1338 rows of data with the features age, gender, smoker, BMI, children, region, and insurance charges, covering medical information and the costs billed by health insurance companies. We applied various regression algorithms to this dataset to predict medical costs.
2. INTRODUCTION
We live on a planet that is filled with dangers and uncertainties. People, households, businesses, and properties are all vulnerable to various types of risk, and the degree of risk can differ: death, illness, and property loss are all potential threats to people and their assets. Health and happiness are the most important aspects of people's lives. However, because dangers cannot always be avoided, the world of finance has created a slew of tools to protect against them. Individuals and organizations can be protected against these dangers by employing insurance, through which they are reimbursed with financial capital. As a result, insurance is a necessity: a policy that reduces or eliminates the costs of loss incurred by various risks. Given the importance of insurance in people's lives, it is crucial for insurance firms to understand the needs of individuals and to accurately evaluate or quantify the amount covered by a policy and the insurance payments that must be paid.
2.1 Objective
To predict the future medical expenses of subjects based on certain features by building a robust machine learning model, and to identify the factors affecting the medical expenses of the subjects based on the model output.
Proposed Approach:
Data preprocessing is a set of techniques for removing problems such as missing values from the data; machine learning algorithms cannot be applied until these are handled. After removing missing values, we apply label encoding or one-hot encoding to the categorical features, which are the features whose values are labels instead of numbers. After that, we apply standardization or normalization techniques to the data; these methods are used when the attribute values are not all on the same scale.
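A minimal sketch of these steps, assuming the Kaggle insurance file used later in this report; choosing label encoding for the binary columns, one-hot encoding for region, and standardization for the numeric columns is one reasonable option, not the only one:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.read_csv('insurance.csv')   # columns: age, sex, bmi, children, smoker, region, charges
df = df.dropna()                    # remove rows with missing values

# Label encoding for the binary categorical features
for col in ['sex', 'smoker']:
    df[col] = LabelEncoder().fit_transform(df[col])

# One-hot encoding for the multi-valued 'region' feature
df = pd.get_dummies(df, columns=['region'])

# Standardization, used because the attribute values are not on the same scale
num_cols = ['age', 'bmi', 'children']
df[num_cols] = StandardScaler().fit_transform(df[num_cols])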
In regression analysis, we predict the value of a dependent variable using independent variables. We first collected a dataset and applied the data preprocessing methods described above; we then applied the following four regression models to the dataset:
i) Linear Regression
ii) Support Vector Regression
iii) Decision Tree Regression
iv) Random Forest Regression
3. SYSTEM REQUIREMENTS
RAM - 1 GB or more
Language - Python
Framework - FLASK
4. MACHINE LEARNING
Machine learning is a system that can learn from examples through self-improvement and without being explicitly coded by programmers. The breakthrough came with the idea that a machine can learn from data (i.e., examples) on its own to produce accurate results. Machine learning combines data with statistical tools to predict an output, which corporations then use to derive actionable insights. Machine learning is closely related to data mining and Bayesian predictive modeling. The machine receives data as input and uses an algorithm to formulate answers.
[Figure: Machine Learning - the relationship between data, rules, computer, and output]
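To make the contrast concrete, here is a small hypothetical sketch: in traditional programming the rule is written by hand, while in machine learning the rule (here, the coefficients of a fitted line) is derived from example data. The numbers are made up for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

# Traditional programming: the rule is hand-written by the programmer
def handwritten_rule(age):
    return 250 * age + 2000          # a guessed pricing formula

# Machine learning: the rule is learned from (data, output) pairs
ages = np.array([[19], [30], [45], [60]])
charges = np.array([6800, 9500, 13200, 17100])   # illustrative outputs
model = LinearRegression().fit(ages, charges)
print(model.coef_, model.intercept_)             # the learned "rule"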
4.2 How does Machine learning work?
Machine learning is the brain where all the learning takes place. The way the machine learns is similar to the way a human being learns: humans learn from experience, and the more we know, the more easily we can predict. By analogy, when we face an unknown situation, the likelihood of success is lower than in a known situation. Machines are trained the same way. To make an accurate prediction, the machine sees examples; when we give the machine a similar example, it can figure out the outcome. However, like a human, if it is fed a previously unseen example, the machine has difficulty predicting.
The core objectives of machine learning are learning and inference. First of all, the machine learns through the discovery of patterns, and this discovery is made thanks to the data. One crucial task of the data scientist is to choose carefully which data to provide to the machine. The list of attributes used to solve a problem is called a feature vector; you can think of a feature vector as the subset of data that is used to tackle a problem.
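In the insurance data used later in this report, for example, each person is described by a feature vector; the sketch below builds one by hand (the column order and encoded values mirror the test vector used in section 9.2):

import numpy as np

# One subject's feature vector: [age, sex, bmi, children, smoker, region]
# with the categorical values already label-encoded
feature_vector = np.array([19, 0, 27.9, 0, 1, 3])
print(feature_vector.shape)    # (6,): six attributes describe this subject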
The machine uses some fancy algorithms to simplify reality and transform this discovery into a model. The learning stage is therefore used to describe the data and summarize it into a model. For instance, suppose the machine is trying to understand the relationship between an individual's wage and the likelihood of going to a fancy restaurant. If it turns out that the machine finds a positive relationship between wage and going to a high-end restaurant, that relationship is the model.
When the model is built, it is possible to test how powerful it is on never-seen-before data. The new data are transformed into feature vectors, go through the model, and yield predictions. This is the beautiful part of machine learning: there is no need to update the rules or retrain the model. You can use the previously trained model to make inferences on new data.
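A minimal sketch of this fit-once, infer-many workflow, using hypothetical wage data for the restaurant example above:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Learning stage: summarize the wage/restaurant relationship into a model
wages = np.array([[20000], [45000], [80000], [120000]])
went_to_fancy_restaurant = np.array([0, 0, 1, 1])     # illustrative labels
model = LogisticRegression().fit(wages, went_to_fancy_restaurant)

# Inference stage: new feature vectors go through the model with no retraining
new_wages = np.array([[30000], [100000]])
print(model.predict(new_wages))                        # e.g. [0 1]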
The life of Machine Learning programs is straightforward and can be summarized in the following
points:
1. Define a question
2. Collect data
3. Visualize data
4. Train algorithm
5. Test the algorithm
6. Collect feedback
7. Refine the algorithm
8. Loop 4-7 until the results are satisfying
9. Use the model to make a prediction
Once the algorithm gets good at drawing the right conclusions, it applies that knowledge to new sets
of data.
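Steps 4 through 8 form a loop in practice; a minimal sketch of that refine-until-satisfied cycle (the parameter grid and the 0.8 threshold are arbitrary illustrative choices):

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

def train_until_satisfied(xtrain, ytrain, xtest, ytest, target_r2=0.8):
    # Steps 4-7: train, test, collect feedback (the r2 score), then refine
    for n_trees in [10, 50, 100, 200]:
        model = RandomForestRegressor(n_estimators=n_trees).fit(xtrain, ytrain)
        score = r2_score(ytest, model.predict(xtest))
        if score >= target_r2:       # step 8: stop once results are satisfying
            return model, score
    return model, score              # best attempt if the threshold is never met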
4.3 Machine learning Algorithms and where they are used
Machine learning can be grouped into two broad learning tasks: supervised and unsupervised; there are many individual algorithms within each group.
Supervised learning
An algorithm uses training data and feedback from humans to learn the relationship of given inputs to a given output. For instance, a practitioner can use marketing expense and weather forecasts as input data to predict the sales of cans.
Supervised learning is used when the output data is already known; the trained algorithm then predicts outputs for new data (see the sketch after the list below).
There are two categories of supervised learning:
● Classification task
● Regression task
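A small sketch of the two task types side by side, on hypothetical [age, smoker] inputs:

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[25, 0], [40, 1], [33, 1], [50, 0]]     # hypothetical inputs: [age, smoker]

# Classification task: the output is a discrete label
clf = DecisionTreeClassifier().fit(X, ['low risk', 'high risk', 'high risk', 'low risk'])
print(clf.predict([[45, 1]]))                # -> a class label

# Regression task: the output is a continuous value (e.g. an insurance charge)
reg = DecisionTreeRegressor().fit(X, [3200.0, 18000.0, 15500.0, 6100.0])
print(reg.predict([[45, 1]]))                # -> a number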
5. PYTHON OVERVIEW
Python is a high-level, interpreted, interactive and object-oriented scripting language. Python is designed to be highly readable. It uses English keywords frequently, whereas other languages use punctuation, and it has fewer syntactic constructions than other languages.
o Python is interpreted: Python is processed at runtime by the interpreter. You do not need
to compile your program before executing it. This is similar to PERL and PHP.
o Python is Interactive: You can actually sit at a Python prompt and interact with the
interpreter directly to write your programs.
o Python is Object-Oriented: Python supports Object-Oriented style or technique of
programming that encapsulates code within objects.
o Python is a Beginner's Language: Python is a great language for beginner-level programmers and supports the development of a wide range of applications, from simple text processing to WWW browsers to games.
GUI Programming: Python supports GUI applications that can be created and ported to many system calls, libraries, and windowing systems, such as Windows MFC, Macintosh, and the X Window system of UNIX.
Scalable: Python provides a better structure and support for large programs than shell scripting.
Apart from the above-mentioned features, Python has a long list of good features.
Python is available on a wide variety of platforms including Linux and Mac OS X. Let's understand
how to set up our Python environment.
6. ANACONDA NAVIGATOR
Anaconda Navigator is a desktop graphical user interface (GUI) included in Anaconda distribution
that allows you to launch applications and easily manage conda packages, environments and
channels without using command-line commands. Navigator can search for packages on Anaconda
Cloud or in a local Anaconda Repository. It is available for Windows, Mac OS and Linux.
Many scientific packages depend on specific versions of other packages in order to run. Data scientists often use multiple versions of many packages and use multiple environments to separate these different versions.
The command-line program conda is both a package manager and an environment manager; it helps data scientists ensure that each version of each package has all the dependencies it requires and works correctly.
Navigator is an easy, point-and-click way to work with packages and environments without needing to type conda commands in a terminal window. You can use it to find the packages you want, install them in an environment, run the packages, and update them, all inside Navigator.
6.2 What applications can I access using Navigator?
The following applications are available by default in Navigator:
● JupyterLab
● Jupyter Notebook
● Qt Console
● Spyder
● VS Code
● Glueviz
● Orange 3 App
● Rodeo
● RStudio
Advanced conda users can also build their own Navigator applications.
The simplest way to run code is with Spyder: from the Navigator Home tab, click Spyder, then write and execute your code.
You can also use Jupyter Notebooks the same way. Jupyter Notebooks are an increasingly popular system that combines your code, descriptive text, output, images, and interactive interfaces into a single notebook file that is edited, viewed, and used in a web browser.
7. SYSTEM ARCHITECTURE
[Figure: System architecture - ML modelling and evaluation of results]
7.2 Methodology
Step 1: Initialize the dataset containing the training data (the medical insurance records).
Step 2: Select all rows of the feature columns from the dataset as “x”, the independent variables.
Step 3: Select all rows of the charges column from the dataset as “y”, the dependent variable.
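In pandas these three steps look like the sketch below; the column names are those of the insurance dataset used in section 9.2:

import pandas as pd

# Step 1: initialize the dataset containing the training data
data = pd.read_csv('insurance.csv')

# Step 2: all rows of the feature columns -> x, the independent variables
x = data.loc[:, data.columns != 'charges']

# Step 3: all rows of the charges column -> y, the dependent variable
y = data['charges']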
8. System Modules
8.1 Data Ingestion
Data ingestion is the transportation of data from assorted sources to a storage medium where it can
be accessed, used, and analyzed by an organization. The destination is typically a data warehouse,
data mart, database, or a document store. Sources may be almost anything – including SaaS data,
in-house apps, databases, spreadsheets, or even information scraped from the internet. The data
ingestion layer is the backbone of any analytics architecture. Downstream reporting and analytics
systems rely on consistent and accessible data. There are different ways of ingesting data, and the
design of a particular data ingestion layer can be based on various models or architectures.
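In this project the ingestion layer is deliberately simple: a flat CSV file loaded with pandas. A minimal sketch, assuming the Kaggle file sits next to the notebook (the URL variant is a hypothetical example of another source):

import pandas as pd

# Ingest from a local CSV file (the Kaggle insurance dataset)
data = pd.read_csv('insurance.csv')
print(data.shape)    # expect (1338, 7) for this dataset

# The same call can ingest directly from other sources, e.g. a URL:
# data = pd.read_csv('https://example.com/insurance.csv')   # hypothetical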
8.2 Data Preprocessing:
Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format. The data goes through two stages. 1. Data Cleaning: it is very important for data to be error-free and free of unwanted records, so the data is cleansed before performing the next steps. Cleansing includes checking for missing values, duplicate records, and invalid formatting, and removing them. 2. Data Transformation: the datasets are transformed mathematically into appropriate forms suitable for the data mining process. This allows us to understand the data more keenly by arranging the hundreds of records in an orderly way.
Exploratory data analysis (EDA) is an approach to understanding datasets more keenly by means of visual elements such as scatter plots and bar plots. This allows us to identify trends in the data more accurately and to perform the analysis accordingly. For the insurance data, count plots of categorical features such as smoker and sex, together with plots of charges against features such as age and BMI, show how strongly each feature is associated with the billed costs.
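For the insurance data, a few seaborn calls cover this kind of EDA; the count plot matches the one in section 9.2, while the scatter plot is an additional illustrative choice:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv('insurance.csv')

# Bar-style view of a categorical feature's distribution
sns.countplot(x='smoker', data=data)
plt.show()

# Scatter plot to spot trends: charges against BMI, colored by smoker status
sns.scatterplot(x='bmi', y='charges', hue='smoker', data=data)
plt.show()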
9. SOURCE CODING
9.1 App.py
import numpy as np
import pandas as pd
from flask import Flask, request, jsonify, render_template
import pickle
app = Flask(__name__)
# Load the model trained and pickled in medical_cost_checkpoint.ipynb
model = pickle.load(open('model.pkl', 'rb'))
@app.route('/')
def home():
    return render_template('index.html')
@app.route('/predict', methods=['POST'])
def predict():
    '''
    For rendering results on HTML GUI
    '''
    # Collect the form inputs, cast them to floats, and build a 2-D array
    float_features = [float(x) for x in request.form.values()]
    final_features = [np.array(float_features)]
    prediction = model.predict(final_features)
    output = round(prediction[0], 1)
    # 'prediction_text' is assumed to be the placeholder used in index.html
    return render_template('index.html', prediction_text='Predicted charges: {}'.format(output))
if __name__ == '__main__':
    app.run(debug=True)
9.2 medical_cost_checkpoint.ipynb
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn import tree
from sklearn.svm import SVR
from sklearn.metrics import r2_score
data = pd.read_csv('insurance.csv')
data.head()
data.info()
data.shape
data.columns
## Pre-processing
# Encode the categorical columns (sex, smoker, region) as integers
lab = LabelEncoder()
data['sex'] = lab.fit_transform(data['sex'])
data['smoker'] = lab.fit_transform(data['smoker'])
data['region'] = lab.fit_transform(data['region'])
data.head()
## Data Exploration
sns.countplot(x='smoker', data=data)
sns.countplot(x='sex', data=data)
## Data Splitting
# Independent variables (all columns except charges) and the dependent variable
x = data.iloc[:, data.columns != 'charges']
y = data.iloc[:, data.columns == 'charges']
x.head()
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.3)
xtrain.head()
ytrain.head()
## RandomForestRegressor
regressor = RandomForestRegressor()
regressor.fit(xtrain, ytrain.values.ravel())
y_pred = regressor.predict(xtest)
## LinearRegression
alg = LinearRegression()
alg.fit(xtrain, ytrain)
y_predict = alg.predict(xtest)
## DecisionTreeRegressor
dt = tree.DecisionTreeRegressor()
dt.fit(xtrain, ytrain)
x_predicted = dt.predict(xtest)
print("\n\nr2_score is", r2_score(ytest, x_predicted))   # true values first, then predictions
## SVR
svr = SVR()
svr.fit(xtrain, ytrain.values.ravel())
# Predict the charge for one subject: [age, sex, bmi, children, smoker, region]
test_vector = np.reshape(np.asarray([19, 0, 27.900, 0, 1, 3]), (1, 6))
p = int(regressor.predict(test_vector)[0])
p
import pickle
# The Random Forest model gave the best results, so it is saved for deployment
pickle.dump(regressor, open('model.pkl', 'wb'))
10. RESULT
10.1 Dataset
10.3 Data Exploration
10.5 LR Model Result
10.7 Home page:
11 CONCLUSION
In this internship, we proposed a machine learning model for predicting medical costs. We applied four regression techniques: Linear Regression, Support Vector Regression, Decision Tree Regression, and Random Forest Regression. From the Random Forest model we observed that age and BMI are the features that most strongly decide the dependent variable, and out of all the experiments the Random Forest model performed best.
The best indicator of future health care costs is previous costs: additional history of health care expenses is known to improve the prediction. Based on this fact, prediction of future health care costs is better done when patients' data are known for consecutive periods; at least two years of history is needed when trying to predict the costs for one year, whereas our Kaggle dataset contains only a single record per person. For future work, we can apply a classification process to obtain a patient risk class as a first step towards improving the performance of our regression model. To continue comparing our model with even more sophisticated methods, we could try to solve the prediction of health care costs using deep learning methods, but for this to be feasible we need a larger dataset. We also plan to apply this model to other regression problems in the health care domain, for example, predicting hospital length of stay and predicting days to readmission based on each patient's diagnosis and history, which are two classic prediction problems in this domain.
REFERENCES
● https://www.ibm.com/topics/machine-learning
● https://www.analyticsvidhya.com/blog/2021/05/prediction-of-health-expense/
● https://productcoalition.com/difference-between-traditional-programming-versus-machine-learning-from-a-pm-perspective-3802b02bc7f6
● https://medium.com/@priyapareek0205/machine-learning-algorithms-and-where-they-are-used-c74de1441e1
● https://www.tutorialspoint.com/python/python_overview.htm
● https://docs.anaconda.com/free/navigator/index.html
● https://www.kaggle.com/datasets/mirichoi0218/insurance