Data analysis of 911 call incidents in 2016: Pennsylvania, US.

In this research I am analyzing the 911 call dataset.

Tools: Python, NumPy, pandas, Seaborn, Matplotlib (pyplot)

Data Source: Kaggle.

The data contains the following fields (the source describes them all as String variables, although pandas infers numeric types for several of them on load):

lat: Latitude
lng: Longitude
desc: Description of the Emergency
zip: Zipcode
title: Title
timeStamp: YYYY-MM-DD HH:MM:SS
twp: Township
addr: Address
e: Dummy variable (always 1)

For the data analysis and visualisation we import all the dependencies.

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set_style('whitegrid')

plt.rcParams['figure.figsize'] = (6, 4)

Reading the data.

df = pd.read_csv('data/911.csv')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99492 entries, 0 to 99491
Data columns (total 9 columns):
lat          99492 non-null float64
lng          99492 non-null float64
desc         99492 non-null object
zip          86637 non-null float64
title        99492 non-null object
timeStamp    99492 non-null object
twp          99449 non-null object
addr         98973 non-null object
e            99492 non-null int64
dtypes: float64(3), int64(1), object(5)
memory usage: 6.8+ MB
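The non-null counts above imply gaps: out of 99492 rows, zip is missing 12855 values, addr 519, and twp 43. As a minimal sketch of how to get those numbers directly, the block below counts missing values per column on a tiny hypothetical sample (the `sample` frame and its rows are made up for illustration; on the real data the same calls would be applied to `df`).

```python
import numpy as np
import pandas as pd

# Tiny hypothetical sample reusing three column names from the 911 data.
sample = pd.DataFrame({
    'zip': [19525.0, 19446.0, np.nan, 19401.0],
    'twp': ['NEW HANOVER', 'HATFIELD TOWNSHIP', 'LOWER POTTSGROVE', None],
    'addr': ['REINDEER CT & DEAD END', None, 'CHERRYWOOD CT & DEAD END', 'HAWS AVE'],
})

# Count missing values per column, then express them as a share of all rows.
missing = sample.isna().sum()
missing_share = missing / len(sample)

print(missing)
print(missing_share)
```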

Checking the head of the DataFrame.

df.head()

	lat	lng	desc	zip	title	timeStamp	twp	addr	e
0	40.297876	-75.581294	REINDEER CT & DEAD END; NEW HANOVER; Station ...	19525.0	EMS: BACK PAINS/INJURY	2015-12-10 17:40:00	NEW HANOVER	REINDEER CT & DEAD END	1
1	40.258061	-75.264680	BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP...	19446.0	EMS: DIABETIC EMERGENCY	2015-12-10 17:40:00	HATFIELD TOWNSHIP	BRIAR PATH & WHITEMARSH LN	1
2	40.121182	-75.351975	HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-St...	19401.0	Fire: GAS-ODOR/LEAK	2015-12-10 17:40:00	NORRISTOWN	HAWS AVE	1
3	40.116153	-75.343513	AIRY ST & SWEDE ST; NORRISTOWN; Station 308A;...	19401.0	EMS: CARDIAC EMERGENCY	2015-12-10 17:40:01	NORRISTOWN	AIRY ST & SWEDE ST	1
4	40.251492	-75.603350	CHERRYWOOD CT & DEAD END; LOWER POTTSGROVE; S...	NaN	EMS: DIZZINESS	2015-12-10 17:40:01	LOWER POTTSGROVE	CHERRYWOOD CT & DEAD END	1
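The zip column shows up as a float (19525.0) because it contains missing values, and timeStamp is read as plain text. One possible way to avoid both at load time, sketched here under the assumption that ZIP codes are better treated as labels than as numbers, is to pass `dtype` and `parse_dates` to `pd.read_csv`. The inline CSV text is a made-up two-row stand-in for `data/911.csv`.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for data/911.csv (descriptions shortened).
csv_text = """lat,lng,desc,zip,title,timeStamp,twp,addr,e
40.297876,-75.581294,sample description,19525,EMS: BACK PAINS/INJURY,2015-12-10 17:40:00,NEW HANOVER,REINDEER CT & DEAD END,1
40.251492,-75.603350,sample description,,EMS: DIZZINESS,2015-12-10 17:40:01,LOWER POTTSGROVE,CHERRYWOOD CT & DEAD END,1
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'zip': str},          # keep ZIP codes as text, not floats
    parse_dates=['timeStamp'],   # parse timestamps while reading
)

print(sample.dtypes)
```

With this approach the later `pd.to_datetime` step becomes unnecessary, at the cost of a slightly longer `read_csv` call.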

Basic Analysis

Let's check out the top 5 zipcodes for calls.

df['zip'].value_counts().head(5)

19401.0    6979
19464.0    6643
19403.0    4854
19446.0    4748
19406.0    3174
Name: zip, dtype: int64

The top townships for the calls were as follows:

df['twp'].value_counts().head(5)

LOWER MERION    8443
ABINGTON        5977
NORRISTOWN      5890
UPPER MERION    5227
CHELTENHAM      4575
Name: twp, dtype: int64

With 90k+ entries, how many unique call titles are there?

df['title'].nunique()

Initial Data Wrangling for Feature Creation

Here I extract some features from the existing columns for further analysis.

Each entry in the title column carries a 'reason for call', which is the text before the colon.

The timestamp column is later split into Hour, Month, and Day of Week as well.

First, I create a 'Reason' feature for each call.

df['Reason'] = df['title'].apply(lambda x: x.split(':')[0])

df.tail()

	lat	lng	desc	zip	title	timeStamp	twp	addr	e	Reason
99487	40.132869	-75.333515	MARKLEY ST & W LOGAN ST; NORRISTOWN; 2016-08-2...	19401.0	Traffic: VEHICLE ACCIDENT -	2016-08-24 11:06:00	NORRISTOWN	MARKLEY ST & W LOGAN ST	1	Traffic
99488	40.006974	-75.289080	LANCASTER AVE & RITTENHOUSE PL; LOWER MERION; ...	19003.0	Traffic: VEHICLE ACCIDENT -	2016-08-24 11:07:02	LOWER MERION	LANCASTER AVE & RITTENHOUSE PL	1	Traffic
99489	40.115429	-75.334679	CHESTNUT ST & WALNUT ST; NORRISTOWN; Station ...	19401.0	EMS: FALL VICTIM	2016-08-24 11:12:00	NORRISTOWN	CHESTNUT ST & WALNUT ST	1	EMS
99490	40.186431	-75.192555	WELSH RD & WEBSTER LN; HORSHAM; Station 352; ...	19002.0	EMS: NAUSEA/VOMITING	2016-08-24 11:17:01	HORSHAM	WELSH RD & WEBSTER LN	1	EMS
99491	40.207055	-75.317952	MORRIS RD & S BROAD ST; UPPER GWYNEDD; 2016-08...	19446.0	Traffic: VEHICLE ACCIDENT -	2016-08-24 11:17:02	UPPER GWYNEDD	MORRIS RD & S BROAD ST	1	Traffic
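The same split can also be done with pandas string methods, which keeps the part after the colon as well. This is only a sketch on three titles copied from the tables above; the name `detail` is a hypothetical label for the text after the colon, not a column in the original analysis.

```python
import pandas as pd

# Three titles copied from the head/tail output above.
titles = pd.Series([
    'EMS: BACK PAINS/INJURY',
    'Fire: GAS-ODOR/LEAK',
    'Traffic: VEHICLE ACCIDENT -',
])

# Split once on the first colon; n=1 keeps any later colons inside the detail.
parts = titles.str.split(':', n=1, expand=True)
reason = parts[0].str.strip()
detail = parts[1].str.strip()

print(pd.DataFrame({'Reason': reason, 'detail': detail}))
```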

Next, I use the dataset to find the most common reason for 911 calls.

df['Reason'].value_counts()

EMS        48877
Traffic    35695
Fire       14920
Name: Reason, dtype: int64
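Expressed as shares, these counts come to roughly 49% EMS, 36% Traffic, and 15% Fire. The sketch below recomputes that from the three counts printed above; on the full DataFrame, `df['Reason'].value_counts(normalize=True)` should give the same proportions.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
reason_counts = pd.Series({'EMS': 48877, 'Traffic': 35695, 'Fire': 14920})

# Divide each count by the total to get the share of all calls.
shares = reason_counts / reason_counts.sum()

print((shares * 100).round(1))
```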

sns.countplot(x='Reason', data=df)

<matplotlib.axes._subplots.AxesSubplot at 0x1165ad710>

Next, I organize the time information, starting by checking the datatype of the timestamp column.

type(df['timeStamp'][0])

str

As the timestamps are still strings, I convert them to pandas datetime values so that I can extract the hour, month, and day-of-week information.

df['timeStamp'] = pd.to_datetime(df['timeStamp'])

I extract information for a single DateTime object, as follows.

time = df['timeStamp'].iloc[0]

print('Hour:',time.hour)
print('Month:',time.month)
print('Day of Week:',time.dayofweek)

Hour: 17
Month: 12
Day of Week: 3

Over here I create new features for the above pieces of information.

df['Hour'] = df['timeStamp'].dt.hour
df['Month'] = df['timeStamp'].dt.month
df['Day of Week'] = df['timeStamp'].dt.dayofweek

df.head(3)

	lat	lng	desc	zip	title	timeStamp	twp	addr	e	Reason	Hour	Month	Day of Week
0	40.297876	-75.581294	REINDEER CT & DEAD END; NEW HANOVER; Station ...	19525.0	EMS: BACK PAINS/INJURY	2015-12-10 17:40:00	NEW HANOVER	REINDEER CT & DEAD END	1	EMS	17	12	3
1	40.258061	-75.264680	BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP...	19446.0	EMS: DIABETIC EMERGENCY	2015-12-10 17:40:00	HATFIELD TOWNSHIP	BRIAR PATH & WHITEMARSH LN	1	EMS	17	12	3
2	40.121182	-75.351975	HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-St...	19401.0	Fire: GAS-ODOR/LEAK	2015-12-10 17:40:00	NORRISTOWN	HAWS AVE	1	Fire	17	12	3

The Day of the Week is an integer and it might not be instantly clear which number refers to which Day. We can map that information to a Mon-Sun string.

dmap = {0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}

df['Day of Week'] = df['Day of Week'].map(dmap)

df.tail(3)

	lat	lng	desc	zip	title	timeStamp	twp	addr	e	Reason	Hour	Month	Day of Week
99489	40.115429	-75.334679	CHESTNUT ST & WALNUT ST; NORRISTOWN; Station ...	19401.0	EMS: FALL VICTIM	2016-08-24 11:12:00	NORRISTOWN	CHESTNUT ST & WALNUT ST	1	EMS	11	8	Wed
99490	40.186431	-75.192555	WELSH RD & WEBSTER LN; HORSHAM; Station 352; ...	19002.0	EMS: NAUSEA/VOMITING	2016-08-24 11:17:01	HORSHAM	WELSH RD & WEBSTER LN	1	EMS	11	8	Wed
99491	40.207055	-75.317952	MORRIS RD & S BROAD ST; UPPER GWYNEDD; 2016-08...	19446.0	Traffic: VEHICLE ACCIDENT -	2016-08-24 11:17:02	UPPER GWYNEDD	MORRIS RD & S BROAD ST	1	Traffic	11	8	Wed
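Plain strings sort alphabetically (Fri, Mon, Sat, ...), which is also the order they take in grouped tables and plots. One way to keep the calendar order, shown here as a sketch rather than as the approach used in this analysis, is to build the labels with `dt.day_name()` and store them as an ordered categorical. The first two timestamps come from the tables above; the third is a hypothetical extra row.

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime([
    '2015-12-10 17:40:00',  # first row of the dataset (a Thursday)
    '2016-08-24 11:12:00',  # one of the last rows (a Wednesday)
    '2016-08-22 09:00:00',  # hypothetical extra row (a Monday)
]))

day_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# dt.day_name() returns full English names; keep the first three letters.
short_names = stamps.dt.day_name().str[:3]

# An ordered categorical sorts by calendar position instead of alphabetically.
day_series = pd.Series(
    pd.Categorical(short_names, categories=day_order, ordered=True)
)

print(day_series.sort_values().tolist())
```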

I now combine the newly created features to check the most common call reasons by day of the week.

sns.countplot(x='Day of Week', hue='Reason', data=df)

plt.legend(bbox_to_anchor=(1.25,1))

<matplotlib.legend.Legend at 0x116d9cbe0>

The first takeaway from the analysis is that traffic-related 911 calls tend to be lowest during the weekend, and EMS calls are also lower during the weekend.

sns.countplot(x='Month', hue='Reason', data=df)
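A count plot shows this pattern visually; a cross-tabulation gives the numbers behind it. The sketch below uses `pd.crosstab` on a small made-up sample (the six rows are hypothetical), and the same two calls could be applied to `df['Day of Week']` and `df['Reason']`.

```python
import pandas as pd

# Hypothetical sample with the two feature columns created above.
sample = pd.DataFrame({
    'Day of Week': ['Mon', 'Mon', 'Sat', 'Sat', 'Sun', 'Mon'],
    'Reason': ['Traffic', 'EMS', 'EMS', 'Traffic', 'EMS', 'Traffic'],
})

# Raw counts per day and reason.
counts = pd.crosstab(sample['Day of Week'], sample['Reason'])

# Share of each reason within a day (each row sums to 1).
row_share = pd.crosstab(
    sample['Day of Week'], sample['Reason'], normalize='index'
)

print(counts)
print(row_share.round(2))
```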

plt.legend(bbox_to_anchor=(1.25,1))

<matplotlib.legend.Legend at 0x117dd1c88>

Next, I look at how the number of calls varies by month.

byMonth = df.groupby('Month').count()

byMonth['e'].plot.line()
plt.title('Calls per Month')
plt.ylabel('Number of Calls')

<matplotlib.text.Text at 0x1031d65c0>

I use Seaborn's lmplot here to fit a linear regression of the number of calls on the month and see whether there is a clear relationship between the two.

byMonth.reset_index(inplace=True)

sns.lmplot(x='Month',y='e',data=byMonth)
plt.ylabel('Number of Calls')

<matplotlib.text.Text at 0x109aa7fd0>

After the review, it seems that there are fewer emergency calls during the holiday seasons.
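Before reading too much into monthly totals, it may be worth checking which months are actually present. The first and last rows shown earlier run from 2015-12-10 to 2016-08-24, so some calendar months may have no data at all, and the Month feature puts December 2015 on the same axis as the 2016 months. The sketch below, on four hypothetical timestamps, reindexes the monthly counts over all twelve months so that gaps appear as zeros, and also groups by year and month.

```python
import pandas as pd

# Hypothetical timestamps; the real column is df['timeStamp'].
stamps = pd.Series(pd.to_datetime([
    '2015-12-10 17:40:00', '2016-01-05 08:00:00',
    '2016-01-20 12:30:00', '2016-08-24 11:17:02',
]))

# Calls per calendar month, ignoring the year (as the Month feature does).
calls_per_month = stamps.dt.month.value_counts().sort_index()

# Reindex over all twelve months so absent months show up as 0.
full = calls_per_month.reindex(range(1, 13), fill_value=0)
missing_months = full[full == 0].index.tolist()

# Grouping by year-month keeps December 2015 apart from the 2016 months.
by_period = stamps.dt.to_period('M').value_counts().sort_index()

print(missing_months)
print(by_period)
```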

To see the behavior in more detail, I extract the date from the timestamp:

df['Date'] = df['timeStamp'].dt.date

df.head(2)

	lat	lng	desc	zip	title	timeStamp	twp	addr	e	Reason	Hour	Month	Day of Week	Date
0	40.297876	-75.581294	REINDEER CT & DEAD END; NEW HANOVER; Station ...	19525.0	EMS: BACK PAINS/INJURY	2015-12-10 17:40:00	NEW HANOVER	REINDEER CT & DEAD END	1	EMS	17	12	Thu	2015-12-10
1	40.258061	-75.264680	BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP...	19446.0	EMS: DIABETIC EMERGENCY	2015-12-10 17:40:00	HATFIELD TOWNSHIP	BRIAR PATH & WHITEMARSH LN	1	EMS	17	12	Thu	2015-12-10

Here, I am grouping and plotting the data:

df.groupby('Date').count()['e'].plot.line()

plt.legend().remove()
plt.tight_layout()
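Grouping by Date only produces rows for days that have at least one call. An alternative, sketched here on three hypothetical timestamps, is to resample on the timestamp itself, which fills empty days with zero and makes it straightforward to add a rolling average. Whether the real dataset contains any such empty days is not something this sketch establishes.

```python
import pandas as pd

# Hypothetical call timestamps with no calls on 2016-01-02.
stamps = pd.to_datetime([
    '2016-01-01 01:00:00', '2016-01-01 13:00:00',
    '2016-01-03 09:30:00',
])

# One row per call, indexed by time, so a daily sum is a daily call count.
calls = pd.Series(1, index=stamps)
daily = calls.resample('D').sum()

# Two-day rolling mean to smooth day-to-day noise (window size is arbitrary).
smoothed = daily.rolling(window=2).mean()

print(daily)
print(smoothed)
```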

I separately analyzed the data with the same plot for each reason.

df[df['Reason']=='Traffic'].groupby('Date').count().plot.line(y='e')
plt.title('Traffic')
plt.legend().remove()
plt.tight_layout()

df[df['Reason']=='Fire'].groupby('Date').count().plot.line(y='e')
plt.title('Fire')
plt.legend().remove()
plt.tight_layout()

df[df['Reason']=='EMS'].groupby('Date').count().plot.line(y='e')
plt.title('EMS')
plt.legend().remove()
plt.tight_layout()
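The three per-reason blocks repeat the same steps, so the daily counts can also be built in one pass. The sketch below groups a small hypothetical sample by Date and Reason together and unstacks the result into one column per reason; it is an alternative layout, not the code used for the plots above.

```python
import pandas as pd

# Hypothetical sample with the Reason and Date features created earlier.
sample = pd.DataFrame({
    'Reason': ['EMS', 'EMS', 'Fire', 'Traffic', 'EMS'],
    'Date': pd.to_datetime([
        '2016-01-01', '2016-01-01', '2016-01-01',
        '2016-01-02', '2016-01-02',
    ]).date,
})

# Count calls per (Date, Reason) pair, then pivot reasons into columns.
daily_by_reason = (
    sample.groupby(['Date', 'Reason']).size().unstack(fill_value=0)
)

print(daily_by_reason)
```

Calling `daily_by_reason.plot(subplots=True)` on such a table should then draw one panel per reason.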

For a better data visualization, I create a heatmap for the counts of calls on each hour, during a given day of the week.

day_hour = df.pivot_table(values='lat',index='Day of Week',columns='Hour',aggfunc='count')

day_hour

Hour	0	1	2	3	4	5	6	7	8	9	...	14	15	16	17	18	19	20	21	22	23
Day of Week
Fri	275	235	191	175	201	194	372	598	742	752	...	932	980	1039	980	820	696	667	559	514	474
Mon	282	221	201	194	204	267	397	653	819	786	...	869	913	989	997	885	746	613	497	472	325
Sat	375	301	263	260	224	231	257	391	459	640	...	789	796	848	757	778	696	628	572	506	467
Sun	383	306	286	268	242	240	300	402	483	620	...	684	691	663	714	670	655	537	461	415	330
Thu	278	202	233	159	182	203	362	570	777	828	...	876	969	935	1013	810	698	617	553	424	354
Tue	269	240	186	170	209	239	415	655	889	880	...	943	938	1026	1019	905	731	647	571	462	274
Wed	250	216	189	209	156	255	410	701	875	808	...	904	867	990	1037	894	686	668	575	490	335

7 rows × 24 columns

A HeatMap is created using the new DataFrame.

sns.heatmap(day_hour)

plt.tight_layout()

As a result, it is evident that most calls take place around the end of working hours during the week.
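Two small refinements can make this easier to read off the pivot table: putting the rows in calendar order (the table above is sorted alphabetically) and asking pandas for the busiest hour on each day. The sketch below does both on a tiny hypothetical pivot with the same layout as day_hour; the numbers in it are made up and are not taken from the table above.

```python
import pandas as pd

# Small hypothetical pivot with the same layout as day_hour.
day_hour_sample = pd.DataFrame(
    {16: [30, 25, 20], 17: [28, 31, 18]},
    index=pd.Index(['Fri', 'Mon', 'Sat'], name='Day of Week'),
)
day_hour_sample.columns.name = 'Hour'

day_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# Keep only the days that are present, in calendar order.
present = [d for d in day_order if d in day_hour_sample.index]
ordered = day_hour_sample.reindex(present)

# For each day, the hour (column label) with the highest call count.
peak_hour = ordered.idxmax(axis=1)

print(ordered)
print(peak_hour)
```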

I decided to create a clustermap to pair up similar Hours and Days.

sns.clustermap(day_hour)

<seaborn.matrix.ClusterGrid at 0x11c49f320>

Conclusions

Most calls take place around the end of working hours during the week.
There seem to be fewer emergency calls during the holiday seasons.
Fitting the number of calls against the month with Seaborn is a convenient way to check whether the two are correlated.
Traffic-related 911 calls tend to be lowest during the weekend.
EMS calls are also lower during the weekend.
