100% found this document useful (1 vote)
412 views25 pages

Project Notes 1: Purpose of Document

This document provides an overview and initial analysis of a dataset containing information on 14,178 auto insurance policies. It describes the purpose of collecting the data and performing exploratory analysis. It then analyzes various policy attributes like claim status, premium amounts, coverage details, and driver characteristics through univariate and bivariate analyses. The goal is to build a risk analytics model to understand customer renewal potential and claim propensity.

Uploaded by

Aadesh Srivastav
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
100% found this document useful (1 vote)
412 views25 pages

Project Notes 1: Purpose of Document

This document provides an overview and initial analysis of a dataset containing information on 14,178 auto insurance policies. It describes the purpose of collecting the data and performing exploratory analysis. It then analyzes various policy attributes like claim status, premium amounts, coverage details, and driver characteristics through univariate and bivariate analyses. The goal is to build a risk analytics model to understand customer renewal potential and claim propensity.

Uploaded by

Aadesh Srivastav
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 25

Project Notes 1

Purpose of document:

 Data Report
o Understand how data was collected in terms of time, frequency, methodology
o Visual inspection of data (rows, columns, descriptive details)
o Understanding of attributes (variable info, renaming if required)
 Initial Exploratory Data Analysis
o Univariate analysis (distribution and spread for every continuous attribute, distribution of
the data in categories for categorical ones)
o Bivariate analysis (relationship between different variables, correlations)

Problem Statement:

The objective of this project is to build a “Risk Analytics model” to understand the renewal potential and
claim propensity of Existing Customers under Personal Auto Insurance Lines.
Will the policy holder initiate a claim (Yes/No) for this policy in the next policy year?

Need of this Study:

Auto insurance is important because it not only covers any physical damage that may occur in an
accident, but also any damage or injury that might be caused because of a vehicular accident or which
may be done upon oneself or one’s vehicle by another vehicle or accident – a falling tree for example.
By analyzing the data set we can understand about the various factors that result in claim renewal by
existing customer base and hence can provide important insights as to how to target existing customer
base to get auto insurance thereby improving safety and security for the existing auto-owners.

The dataset is a primary source of data collected during the year 2015-2016 for an auto insurance
company. Cars manufactured through the year 1957-2000 are recorded in this data set.
The data consists of 127 columns, 14178 rows and several missing and incorrect values

Univariate Analysis:

ClaimStatus: Indicates whether the policy holder has made a claim or not. 1 indicates a claim
and 0 indicates no claim
Initial DataType: Numeric
Type converted To: Factor

0 1
13399 778

Out of 14177 records/policies only 778 users~ 5% have claimed the insurance.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`

ClaimFrequency: Gives the number of claims claimed.

Initial DataType: Numeric


Type converted To: Factor

0 1 2 3 4 5
13399 580 144 36 16 2
From claim frequency we can say user mostly claim once or twice in a year and the no. of users claiming
for more than 2 time in is very less 2 users.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`

Premium: Premium in $1000

Type: Numeric
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 40.0 71.0 191.2 239.0 2869.0
Contains Outliers.

Most of the premiums are bought for the under 50 thousand.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
Billing Term: How often the premium is paid, i.e. once a year = 1, Three times in a year = 3, or 6 times in
a year = 6.

Initial DataType: Numeric


Type converted To: Factor

1 3 6
6724 482 6971

Looking at the data we can say people prefer premium paid 6 times and once in a year.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`

Renewed: Indicates whether the policy has been renewed or not. 1= Renewed, 0 = not renewed

Initial DataType: Numeric


Type converted To: Factor

0 1
7207 6970

Policy renewal has almost equal ratio of zeros and ones.


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

DOB1 to DOB5: Date of Birth of the main driver(policy holder), second driver till fifth driver.

DOB1 to DOB5 need to be changed in one single format.

Analysis can be done after removing impurities and getting all DOB’s in single format.

*************************************************************************************

Number of Drivers: Count of the number of drivers in the policy

Initial DataType: Numeric


Type converted To: Factor

1 2 3 4 5
8624 4636 733 150 34

No. of drivers registered into the policy ranges from 1-5 and it seems user prefer to insure only first 2
drivers in most of the case.
AgeUSdriving_1 to AgeUSdriving_5: How long the driver has been driving

Type: Numeric

AgeUSDriving_1 experience ranges from 17 to 79 years and


the distribution of data is left skewed with most of the drivers having experience of 25-35 year.

AgeUSDriving_2 to AgeUSDriving_5 contain impurities and missing values.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Amendment: Number of changes made to the policy during the year (may it is: No. of changes made to
the policy till date from the date of buying)

Initial DataType: Numeric


Type converted To: Factor

0 1 2 3 4 5 6 7 8 10
13622 346 135 42 14 11 3 1 1 2

Amendment in a year ranges from 1 to 10 amendments with mostly no amendments per year and 1-2
amendments/year

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
Coverage Liability: Coverage liability represents the state's financial responsibility law i.e. the minimum
requirement. the first figure is the amount to be paid for injuries per head, the second figure is the
amount for all injuries and the third is the amount per vehicle damaged.

25/50/25 and 20/40/15 is the preferred package by the policy holders.

table(car_insu$CoverageLiability)

20/40/15 25/50/25 30/60/25 None


7061 7077 38 1
table(car_insu$CoveragePD_1)

1000/1000 500/500 None


1 3094 11030

Coverage PD(physical damage):most of the policy holders have opted out for it however there are user
opting for 500/500 and only 1 user have opted for 1000/1000

CoverageMP(medical payment): is none though there is one user with coverage MP of 535 thousand
table(car_insu$CoverageMP)

535 None
1 14119
sum(is.na(car_insu$CoverageMP))
[1] 57

CoverageUMBI and CoverageUMPD(for uninsured or underinsured motorist) is only accepted by 316


policy holder

table(car_insu$CoverageUMBI)

Accepted None
316 13855
> table(car_insu$CoverageUMPD)

Accepted None
316 13855
sum(is.na(car_insu$CoverageUMBI))
[1] 6
sum(is.na(car_insu$CoverageUMPD))
[1] 6

Those who have opted for CoveragePIP_CDW(personal inury protection) have package of around 2569
and 2535 thousand.

table(car_insu$CoveragePIP_CDW)

2535 2569 None


1 90 14036
sum(is.na(car_insu$CoveragePIP_CDW))
[1] 50
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
Distance to work:

For driver 1 distance to work for most of the driver is within 100 miles and for 9 of the drivers distance
to work exceeds the 100 miles and ranges upto 500 miles

Same is the case with driver 2

For driver 3 most of the driver have distance to work as in range of 0-5 miles with maximum miles upto
176
Driver 4 has miles in range 0-20 miles while for driver 5 miles are in range 0-5
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

No. of driver assigned to first vehicle is max upto 2 driver followed by 3 and then 4,5
table(car_insu$DriverAssigned_1)

1 2 3 4 5
12495 1400 248 30 4

>
DOBs of driver1-3

Summary for DOB for driver (1-3)

Moreover there are large NA values for DOBs of driver 4 and 5


sum(is.na(car_insu$DOB4))
[1] 13992
sum(is.na(car_insu$DOB5))
[1] 14135

Data for Engine variable required to be clean as some of data is also provided into CC which needs to be
converted into liter
Variable : ExcludedDriverName_01 - ExcludedDriverName_20

Meaning: An excluded driver is someone you intentionally remove from your auto insurance policy. If
that person uses your car, either with or without your permission, and has an accident, your car
insurance will not provide coverage.

First Person to twentied person declared as an excluded driver.

Type: Character

Number of records in each Variable(ExcludedDriverName)


**************************************************************************

Variable : GaragedZIP_1

Meaning: Zipcode of the place where the first vehicle is parked.

Type: Numeric

**************************************************************************

Variable: Marital Status1 to Marital Status5

Meaning: Marital status of the first driver to fifth driver M or S

Type: Character

New DataType: Factor

**************************************************************************

Variable: Occupation_1 to Occupation_5

Meaning: Occupation of the first driver to fifth driver

Type: Character
New DataType: Factor

**************************************************************************

Variable: Relation_1 to Relation_5

Meaning: Relationship of the driver with the main policy holder. (Relation_1 is onlyself)

Type: Character

NewDataType: Factor

**************************************************************************

Variable: Rental_1

Meaning: First vehicle (if Rental is allowed)

Type: Numeric

NewDataType: Factor (It should be yes or no.. 0 or 1)

**************************************************************************
Variable: Sex_1 to Sex_5

Meaning: Gender of the driver one to driver five

Type: Char

NewDataType: Factor

**************************************************************************

Variable: Surcharge1Unit_1 to Surcharge2Unit_1

Meaning: A surcharge is an amount that insurance companies typically add to a person’s insurance
premium, which is the amount he or she pays for insurance, due to some event or occurrence(accident
in which that driver was at fault)

Insurance companies apply a surcharge because they assume that a ticket for a moving violation, such as
reckless driving or speeding, makes it more likely that a person is going to be involved in an accident and
injure someone

First Surcharge of the first vehicle (Y/N) or Second Surcharge of the first vehicle etc.

Type: Character

NewDataType: Factor
MaritalStatus_1 to MaritalStatus_5

Levels are only "M" and "S"

Occupation_1 to Occupation_5

levels in occupation are around 2000.

Relation_1 to Relation_5
Levels are around 200 for relations.

Sex_1 to Sex_5

Levels are only "F" and "M"


Units is a continuous variable hence value is Numerical.
Vehicle Inspected is a categorical variable hence has to be converted to Factor.
The remaining ViolPointsDriver variables has to be converted to a Factor

Units: Number of vehicles covered for that Policy


VehicleInspected_1: First Vehicle inspected or not(Categorical) 1-Inspected 0-Not Inspected
ViolPointsDriver: Nth time the mth driver is scoring a violation point categorical variable.
On analyzing variables from Units to ViolPoints8Driver_5 it is observed that majority of the
ViolPointsDriver columns are all 0’s and hence can be eliminated from the dataset.

Units: Analyzing Units we see that the maximum number of Units occur for 1
Majority of the vehicles have not been inspected

Majority of the vehicles have not been inspected and the maximum number of vehicles are inspected for
units 1 and 2.
From the above Violation Points by Drivers it is apparent that for majority of the policies violation has not
been done and maximum Violation has been done for ViolationPoints1 Driver1.

Year_1:

The variable Year_1 denotes the year in which the first unit (vehicle) was manufactured. Further looking
at the data we find that there are some impurities in the data and the range of valid values is from 1957
to 2011:
Looking at the NULLs/blanks in the data, we see :

All records have some Year_1 values and there is no Null/blanks value.

Make_1:

Make_1 in the dataset denotes the marque/brand of the first unit (vehicle) registered in the policy. This
variable is essential for the study since overall quality of the brand line is set. Some brand vehicles could
be more prone to accidents than others.

Analysing the data in the dataset, we find that:

There are many impurities in the data, the name of the brand are spelled in correctly, moreover there
are some data points with no owner as make_1. We look the data availability for the attribute to find
that 99.4% of the records have values. On a side note, after data cleansing we believe there are
approximately 48 brand in total in the dataset.

Model_1 :
Model_1 has the information pertaining to the model name of the first vehicle insured in the policy.
Looking at the data we find:

Same as the make_1 we see there are many impurities in the data. The Model_1 is particularly
important for determining the claim status since the overall quality of the design and the models being
produced have an impact on the safety factor on field. Looking at how much data we have in the column
we find that about 99 percentage of columns have data points.

Zip:
The zip variable presumably has the zip code information of the zip code where the policy is registered.
This could be important feature in the dataset to indicate how is the tendency of people in each area in
terms of claims being made. To extract meaning out of the zip codes, one reference dataset containing
the US pin code information was merged with car_insurance dataset. We see that:

All records have values for zip column.

All policies were registered in the state of Texas.

Total_Distance_To_Work:

Total_distance_to_work signifies the total distance to work of all the drivers combined in the dataset.
Looking at the dataset, we see that the distance ranges from 1 to 500 units.
All records have total distance to work values.

NoLossSigned:
No loss signed variable signifies if the no loss statement has been signed in the policy. The statement of
no loss basically states that there has been no loss in the period of lapse. Or the time the car was not
insured. This may prove to be an important variable in the model as the indicator could tell us about the
tendency of the customer.

All the records have an indicator value, and 653 of all the customers in the dataset have opted for this
declaration.

Type:
Type variable signifies the type of insurance the record signifies in the dataset. Looking at the dataset,
we observe that:
The nine type of insurance policies have been spread across all the records and there are no NULL values
in the columns. The meaning of each level is being explored for US market.

CancellationType:

The column cancellation type signifies the type of cancellation of a policy if any. Upon exploring the data
we find:

About 2.2% of policy have been cancelled in the dataset and majority have been of the type “NP”. The
significance and meaning of each level is being explored through secondary research.

You might also like