
Data analytical review of the 911 call incidents in 2016: Pennsylvania, US.

In this project I analyze the 911 call dataset.

Tools: Python, NumPy, Pandas, Matplotlib, Seaborn

Data Source: Kaggle.

The data contains the following fields. The raw CSV stores every value as text; on load, pandas infers numeric types for lat, lng, zip, and e, as the df.info() output below confirms. A loading sketch with explicit dtypes follows the list:

  • lat: Latitude
  • lng: Longitude
  • desc: Description of the emergency
  • zip: Zipcode
  • title: Title of the emergency (category and subtype)
  • timeStamp: YYYY-MM-DD HH:MM:SS
  • twp: Township
  • addr: Address
  • e: Dummy variable (always 1)
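
As a sketch (assuming pandas 1.0+ for the 'string' dtype), the columns could be loaded with explicit dtypes so nothing is left to inference; zip is kept as float64 so NaN can represent missing zipcodes. The actual load used in this notebook appears below.

import pandas as pd

dtypes = {
    'lat': 'float64', 'lng': 'float64', 'desc': 'string',
    'zip': 'float64',   # float so NaN can represent missing zipcodes
    'title': 'string', 'timeStamp': 'string',
    'twp': 'string', 'addr': 'string', 'e': 'int64',
}
df = pd.read_csv('data/911.csv', dtype=dtypes)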

For the data analysis and visualisation, we first import all the dependencies.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set_style('whitegrid')

plt.rcParams['figure.figsize'] = (6, 4)

Reading the data.

df = pd.read_csv('data/911.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99492 entries, 0 to 99491
Data columns (total 9 columns):
lat          99492 non-null float64
lng          99492 non-null float64
desc         99492 non-null object
zip          86637 non-null float64
title        99492 non-null object
timeStamp    99492 non-null object
twp          99449 non-null object
addr         98973 non-null object
e            99492 non-null int64
dtypes: float64(3), int64(1), object(5)
memory usage: 6.8+ MB

Checking the head of the dataframe.

df.head()
lat lng desc zip title timeStamp twp addr e
0 40.297876 -75.581294 REINDEER CT & DEAD END; NEW HANOVER; Station ... 19525.0 EMS: BACK PAINS/INJURY 2015-12-10 17:40:00 NEW HANOVER REINDEER CT & DEAD END 1
1 40.258061 -75.264680 BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP... 19446.0 EMS: DIABETIC EMERGENCY 2015-12-10 17:40:00 HATFIELD TOWNSHIP BRIAR PATH & WHITEMARSH LN 1
2 40.121182 -75.351975 HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-St... 19401.0 Fire: GAS-ODOR/LEAK 2015-12-10 17:40:00 NORRISTOWN HAWS AVE 1
3 40.116153 -75.343513 AIRY ST & SWEDE ST; NORRISTOWN; Station 308A;... 19401.0 EMS: CARDIAC EMERGENCY 2015-12-10 17:40:01 NORRISTOWN AIRY ST & SWEDE ST 1
4 40.251492 -75.603350 CHERRYWOOD CT & DEAD END; LOWER POTTSGROVE; S... NaN EMS: DIZZINESS 2015-12-10 17:40:01 LOWER POTTSGROVE CHERRYWOOD CT & DEAD END 1

Basic Analysis

Let's check out the top 5 zipcodes for calls.

df['zip'].value_counts().head(5)
19401.0    6979
19464.0    6643
19403.0    4854
19446.0    4748
19406.0    3174
Name: zip, dtype: int64

The top townships for the calls were as follows:

df['twp'].value_counts().head(5)
LOWER MERION    8443
ABINGTON        5977
NORRISTOWN      5890
UPPER MERION    5227
CHELTENHAM      4575
Name: twp, dtype: int64

For the 99k+ entries, how many unique call titles do we have?

df['title'].nunique()
110

Initial Data Wrangling for Feature Creation

Here I extract some new features from the existing columns for further analysis.

Each entry in the title column carries a 'reason for call', denoted by the text before the colon.

Following that, the timeStamp column is also split into Hour, Month, and Day of Week features.

First, I create a 'Reason' feature for each call.

df['Reason'] = df['title'].apply(lambda x: x.split(':')[0])
df.tail()
lat lng desc zip title timeStamp twp addr e Reason
99487 40.132869 -75.333515 MARKLEY ST & W LOGAN ST; NORRISTOWN; 2016-08-2... 19401.0 Traffic: VEHICLE ACCIDENT - 2016-08-24 11:06:00 NORRISTOWN MARKLEY ST & W LOGAN ST 1 Traffic
99488 40.006974 -75.289080 LANCASTER AVE & RITTENHOUSE PL; LOWER MERION; ... 19003.0 Traffic: VEHICLE ACCIDENT - 2016-08-24 11:07:02 LOWER MERION LANCASTER AVE & RITTENHOUSE PL 1 Traffic
99489 40.115429 -75.334679 CHESTNUT ST & WALNUT ST; NORRISTOWN; Station ... 19401.0 EMS: FALL VICTIM 2016-08-24 11:12:00 NORRISTOWN CHESTNUT ST & WALNUT ST 1 EMS
99490 40.186431 -75.192555 WELSH RD & WEBSTER LN; HORSHAM; Station 352; ... 19002.0 EMS: NAUSEA/VOMITING 2016-08-24 11:17:01 HORSHAM WELSH RD & WEBSTER LN 1 EMS
99491 40.207055 -75.317952 MORRIS RD & S BROAD ST; UPPER GWYNEDD; 2016-08... 19446.0 Traffic: VEHICLE ACCIDENT - 2016-08-24 11:17:02 UPPER GWYNEDD MORRIS RD & S BROAD ST 1 Traffic

From here, my effort is directed at finding out the most common reasons for 911 calls, using the dataset as the main source of information.

df['Reason'].value_counts()
EMS        48877
Traffic    35695
Fire       14920
Name: Reason, dtype: int64
sns.countplot(x='Reason', data=df)
<matplotlib.axes._subplots.AxesSubplot at 0x1165ad710>

[Figure: bar chart of call counts per Reason]
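
As a quick supplementary check, the same counts can be expressed as shares of all calls:

df['Reason'].value_counts(normalize=True)

This works out to roughly 49% EMS, 36% Traffic, and 15% Fire.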

The next challenge is organizing the time information, starting with a check of the timeStamp column's datatype.

type(df['timeStamp'][0])
str

As the timestamps are still strings, I convert them to DateTime objects so I can extract the hour, month, and day-of-week information.

df['timeStamp'] = pd.to_datetime(df['timeStamp'])

I extract information for a single DateTime object, as follows.

time = df['timeStamp'].iloc[0]

print('Hour:',time.hour)
print('Month:',time.month)
print('Day of Week:',time.dayofweek)
Hour: 17
Month: 12
Day of Week: 3

Here I create new features for each of these pieces of information.

df['Hour'] = df['timeStamp'].apply(lambda x: x.hour)
df['Month'] = df['timeStamp'].apply(lambda x: x.month)
df['Day of Week'] = df['timeStamp'].apply(lambda x: x.dayofweek)
df.head(3)
lat lng desc zip title timeStamp twp addr e Reason Hour Month Day of Week
0 40.297876 -75.581294 REINDEER CT & DEAD END; NEW HANOVER; Station ... 19525.0 EMS: BACK PAINS/INJURY 2015-12-10 17:40:00 NEW HANOVER REINDEER CT & DEAD END 1 EMS 17 12 3
1 40.258061 -75.264680 BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP... 19446.0 EMS: DIABETIC EMERGENCY 2015-12-10 17:40:00 HATFIELD TOWNSHIP BRIAR PATH & WHITEMARSH LN 1 EMS 17 12 3
2 40.121182 -75.351975 HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-St... 19401.0 Fire: GAS-ODOR/LEAK 2015-12-10 17:40:00 NORRISTOWN HAWS AVE 1 Fire 17 12 3
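
As a side note, the same three features can be derived with pandas' vectorized .dt accessor, which is usually faster than row-wise apply; a sketch equivalent to the code above:

# Vectorized alternative to the three apply() calls
df['Hour'] = df['timeStamp'].dt.hour
df['Month'] = df['timeStamp'].dt.month
df['Day of Week'] = df['timeStamp'].dt.dayofweek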

The Day of Week is an integer, and it might not be instantly clear which number refers to which day. We can map it to a Mon-Sun string.

dmap = {0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}
df['Day of Week'] = df['Day of Week'].map(dmap)

df.tail(3)
lat lng desc zip title timeStamp twp addr e Reason Hour Month Day of Week
99489 40.115429 -75.334679 CHESTNUT ST & WALNUT ST; NORRISTOWN; Station ... 19401.0 EMS: FALL VICTIM 2016-08-24 11:12:00 NORRISTOWN CHESTNUT ST & WALNUT ST 1 EMS 11 8 Wed
99490 40.186431 -75.192555 WELSH RD & WEBSTER LN; HORSHAM; Station 352; ... 19002.0 EMS: NAUSEA/VOMITING 2016-08-24 11:17:01 HORSHAM WELSH RD & WEBSTER LN 1 EMS 11 8 Wed
99491 40.207055 -75.317952 MORRIS RD & S BROAD ST; UPPER GWYNEDD; 2016-08... 19446.0 Traffic: VEHICLE ACCIDENT - 2016-08-24 11:17:02 UPPER GWYNEDD MORRIS RD & S BROAD ST 1 Traffic 11 8 Wed

Now I combine the newly created features to check the most common call reasons by day of the week.

sns.countplot(x='Day of Week', data=df, hue='Reason')

plt.legend(bbox_to_anchor=(1.25,1))
<matplotlib.legend.Legend at 0x116d9cbe0>

[Figure: call counts by day of week, split by Reason]
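
Note that countplot orders the bars by their first appearance in the data; passing an explicit order keeps the days in calendar order (a minimal sketch):

days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
sns.countplot(x='Day of Week', data=df, hue='Reason', order=days)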

The first takeaway from the analysis is that the number of traffic-related 911 calls tends to be lowest during the weekend, and EMS-related calls are also low on weekends.

sns.countplot(x='Month', data=df, hue='Reason')

plt.legend(bbox_to_anchor=(1.25,1))
<matplotlib.legend.Legend at 0x117dd1c88>

[Figure: call counts by month, split by Reason]

Now I check the relationship between the number of calls and the month.

byMonth = df.groupby('Month').count()
byMonth['e'].plot.line()
plt.title('Calls per Month')
plt.ylabel('Number of Calls')
<matplotlib.text.Text at 0x1031d65c0>

[Figure: line plot of calls per month]

I use Seaborn here to fit a linear model of the number of calls against the month and see whether there is any clear trend; lmplot expects Month as a regular column, hence the reset_index.

byMonth.reset_index(inplace=True)
sns.lmplot(x='Month',y='e',data=byMonth)
plt.ylabel('Number of Calls')
<matplotlib.text.Text at 0x109aa7fd0>

[Figure: lmplot linear fit of calls vs. month]
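
To put a number on the fitted trend, a plain least-squares fit gives the slope directly (a sketch using NumPy rather than Seaborn's internal fit):

# Degree-1 polynomial fit returns (slope, intercept)
slope, intercept = np.polyfit(byMonth['Month'], byMonth['e'], 1)
print('Average change per month: {:.0f} calls'.format(slope))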

From this review, it seems that there are fewer emergency calls during the holiday season.

To see the behavior in more detail, I extract the date from the timestamp.

df['Date']=df['timeStamp'].apply(lambda x: x.date())
df.head(2)
lat lng desc zip title timeStamp twp addr e Reason Hour Month Day of Week Date
0 40.297876 -75.581294 REINDEER CT & DEAD END; NEW HANOVER; Station ... 19525.0 EMS: BACK PAINS/INJURY 2015-12-10 17:40:00 NEW HANOVER REINDEER CT & DEAD END 1 EMS 17 12 Thu 2015-12-10
1 40.258061 -75.264680 BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP... 19446.0 EMS: DIABETIC EMERGENCY 2015-12-10 17:40:00 HATFIELD TOWNSHIP BRIAR PATH & WHITEMARSH LN 1 EMS 17 12 Thu 2015-12-10

Here I group and plot the data:

df.groupby('Date').count()['e'].plot.line()

plt.legend().remove()
plt.tight_layout()

[Figure: daily call counts]
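
The same daily series can also be produced without the helper Date column, by resampling on the timestamp (a sketch; it assumes timeStamp has already been converted to datetime as above):

# Daily call counts via resampling on the datetime index
df.set_index('timeStamp').resample('D')['e'].count().plot.line()
plt.tight_layout()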

I then analyze the data separately, with the same plot for each reason; a consolidated loop follows the three blocks below.

df[df['Reason']=='Traffic'].groupby('Date').count()['e'].plot.line()
plt.title('Traffic')
plt.legend().remove()
plt.tight_layout()

[Figure: daily Traffic call counts]

df[df['Reason']=='Fire'].groupby('Date').count()['e'].plot.line()
plt.title('Fire')
plt.legend().remove()
plt.tight_layout()

[Figure: daily Fire call counts]

df[df['Reason']=='EMS'].groupby('Date').count()['e'].plot.line()
plt.title('EMS')
plt.legend().remove()
plt.tight_layout()

[Figure: daily EMS call counts]
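
As mentioned above, the three per-reason plots can be consolidated into one loop (a sketch; plt.show() separates the figures):

for reason in ['Traffic', 'Fire', 'EMS']:
    # A Series line plot adds no legend by default, so no legend removal is needed
    df[df['Reason'] == reason].groupby('Date').count()['e'].plot.line()
    plt.title(reason)
    plt.tight_layout()
    plt.show()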

For a richer visualization, I create a heatmap of call counts per hour for each day of the week; a reindexing sketch follows the table below.

day_hour = df.pivot_table(values='lat',index='Day of Week',columns='Hour',aggfunc='count')

day_hour
Hour 0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
Day of Week
Fri 275 235 191 175 201 194 372 598 742 752 ... 932 980 1039 980 820 696 667 559 514 474
Mon 282 221 201 194 204 267 397 653 819 786 ... 869 913 989 997 885 746 613 497 472 325
Sat 375 301 263 260 224 231 257 391 459 640 ... 789 796 848 757 778 696 628 572 506 467
Sun 383 306 286 268 242 240 300 402 483 620 ... 684 691 663 714 670 655 537 461 415 330
Thu 278 202 233 159 182 203 362 570 777 828 ... 876 969 935 1013 810 698 617 553 424 354
Tue 269 240 186 170 209 239 415 655 889 880 ... 943 938 1026 1019 905 731 647 571 462 274
Wed 250 216 189 209 156 255 410 701 875 808 ... 904 867 990 1037 894 686 668 575 490 335

7 rows × 24 columns
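
Because pivot_table sorts the Day of Week index alphabetically (Fri, Mon, Sat, ...), reindexing restores calendar order before plotting (a minimal sketch):

day_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
day_hour = day_hour.reindex(day_order)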

A heatmap is created from the new DataFrame.

sns.heatmap(day_hour)

plt.tight_layout()

[Figure: heatmap of call counts by day of week and hour]

From the heatmap, it is evident that most calls take place around the end of working hours during the week.

Finally, I create a clustermap to pair up similar hours and days.

sns.clustermap(day_hour)
<seaborn.matrix.ClusterGrid at 0x11c49f320>

[Figure: clustermap of call counts by day of week and hour]
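
Because raw counts differ widely across hours, normalizing each column before clustering can make the structure easier to read; seaborn's built-in scaling option does this (a sketch):

sns.clustermap(day_hour, standard_scale=1)   # scale each column to the 0-1 range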

Conclusions

  1. Most calls take place around the end of working hours during the week.
  2. There appear to be fewer emergency calls during the holiday season.
  3. Fitting a linear model with Seaborn (lmplot) suggests a downward trend in monthly call volume over the period covered.
  4. The number of traffic-related 911 calls tends to be lowest during the weekend.
  5. EMS-related calls are also low during the weekend.
