In this research I am analyzing the 911 call dataset.
Tools: Python, NumPy, pandas, Seaborn, Matplotlib (pyplot)
Data Source: Kaggle.
The data contains the following fields (the dataset description declares them all as String variables, although pandas infers numeric types for some of them on load):
- lat : Latitude
- lng: Longitude
- desc: Description of the Emergency
- zip: Zipcode
- title: Title
- timeStamp: YYYY-MM-DD HH:MM:SS
- twp: Township
- addr: Address
- e: Dummy variable (always 1)
For the data analysis and visualisation we import all the dependencies.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (6, 4)
Reading the data.
df = pd.read_csv('data/911.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99492 entries, 0 to 99491
Data columns (total 9 columns):
lat 99492 non-null float64
lng 99492 non-null float64
desc 99492 non-null object
zip 86637 non-null float64
title 99492 non-null object
timeStamp 99492 non-null object
twp 99449 non-null object
addr 98973 non-null object
e 99492 non-null int64
dtypes: float64(3), int64(1), object(5)
memory usage: 6.8+ MB
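The summary above shows that zip, twp and addr have fewer non-null values than the 99492 rows. A minimal sketch of how the missing values could be counted per column follows; the `sample` frame is a hypothetical stand-in for the real data, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the real 911 data.
sample = pd.DataFrame({
    'zip': [19525.0, 19446.0, np.nan, 19401.0],
    'twp': ['NEW HANOVER', 'HATFIELD TOWNSHIP', 'LOWER POTTSGROVE', None],
    'e': [1, 1, 1, 1],
})

# Count missing values per column and keep only the columns that have any.
missing = sample.isna().sum()
missing = missing[missing > 0]
print(missing)
```

On the real frame the same two lines would be applied to `df` instead of `sample`.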
Checking the head of the dataframe
df.head()
| | lat | lng | desc | zip | title | timeStamp | twp | addr | e |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 40.297876 | -75.581294 | REINDEER CT & DEAD END; NEW HANOVER; Station ... | 19525.0 | EMS: BACK PAINS/INJURY | 2015-12-10 17:40:00 | NEW HANOVER | REINDEER CT & DEAD END | 1 |
| 1 | 40.258061 | -75.264680 | BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP... | 19446.0 | EMS: DIABETIC EMERGENCY | 2015-12-10 17:40:00 | HATFIELD TOWNSHIP | BRIAR PATH & WHITEMARSH LN | 1 |
| 2 | 40.121182 | -75.351975 | HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-St... | 19401.0 | Fire: GAS-ODOR/LEAK | 2015-12-10 17:40:00 | NORRISTOWN | HAWS AVE | 1 |
| 3 | 40.116153 | -75.343513 | AIRY ST & SWEDE ST; NORRISTOWN; Station 308A;... | 19401.0 | EMS: CARDIAC EMERGENCY | 2015-12-10 17:40:01 | NORRISTOWN | AIRY ST & SWEDE ST | 1 |
| 4 | 40.251492 | -75.603350 | CHERRYWOOD CT & DEAD END; LOWER POTTSGROVE; S... | NaN | EMS: DIZZINESS | 2015-12-10 17:40:01 | LOWER POTTSGROVE | CHERRYWOOD CT & DEAD END | 1 |
Let's check out the top 5 zipcodes for calls.
df['zip'].value_counts().head(5)
19401.0 6979
19464.0 6643
19403.0 4854
19446.0 4748
19406.0 3174
Name: zip, dtype: int64
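The zipcodes print as floats (19401.0) because the column contains missing values. One possible way to display them as whole numbers while keeping the gaps is pandas' nullable `Int64` dtype. The sketch below uses a hypothetical four-value series rather than the real column.

```python
import numpy as np
import pandas as pd

# Hypothetical zip values, including one missing entry.
zips = pd.Series([19401.0, 19464.0, np.nan, 19401.0], name='zip')

# 'Int64' (capital I) is the nullable integer dtype; NaN becomes <NA>.
zip_codes = zips.astype('Int64')

# value_counts() ignores missing values by default.
top = zip_codes.value_counts().head(5)
print(top)
```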
The top townships for the calls were as follows:
df['twp'].value_counts().head(5)
LOWER MERION 8443
ABINGTON 5977
NORRISTOWN 5890
UPPER MERION 5227
CHELTENHAM 4575
Name: twp, dtype: int64
With more than 99,000 entries, how many unique call titles are there?
df['title'].nunique()
110
Next, I derive some new features from the existing columns for further analysis.
Each entry in the title column carries a 'reason for call', given by the text before the colon (for example EMS, Fire or Traffic).
Similarly, the timeStamp column can be broken down further into Hour, Month and Day of Week.
First, I create a 'Reason' feature for each call.
df['Reason'] = df['title'].apply(lambda x: x.split(':')[0])
df.tail()
| | lat | lng | desc | zip | title | timeStamp | twp | addr | e | Reason |
|---|---|---|---|---|---|---|---|---|---|---|
| 99487 | 40.132869 | -75.333515 | MARKLEY ST & W LOGAN ST; NORRISTOWN; 2016-08-2... | 19401.0 | Traffic: VEHICLE ACCIDENT - | 2016-08-24 11:06:00 | NORRISTOWN | MARKLEY ST & W LOGAN ST | 1 | Traffic |
| 99488 | 40.006974 | -75.289080 | LANCASTER AVE & RITTENHOUSE PL; LOWER MERION; ... | 19003.0 | Traffic: VEHICLE ACCIDENT - | 2016-08-24 11:07:02 | LOWER MERION | LANCASTER AVE & RITTENHOUSE PL | 1 | Traffic |
| 99489 | 40.115429 | -75.334679 | CHESTNUT ST & WALNUT ST; NORRISTOWN; Station ... | 19401.0 | EMS: FALL VICTIM | 2016-08-24 11:12:00 | NORRISTOWN | CHESTNUT ST & WALNUT ST | 1 | EMS |
| 99490 | 40.186431 | -75.192555 | WELSH RD & WEBSTER LN; HORSHAM; Station 352; ... | 19002.0 | EMS: NAUSEA/VOMITING | 2016-08-24 11:17:01 | HORSHAM | WELSH RD & WEBSTER LN | 1 | EMS |
| 99491 | 40.207055 | -75.317952 | MORRIS RD & S BROAD ST; UPPER GWYNEDD; 2016-08... | 19446.0 | Traffic: VEHICLE ACCIDENT - | 2016-08-24 11:17:02 | UPPER GWYNEDD | MORRIS RD & S BROAD ST | 1 | Traffic |
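As an alternative to `apply` with a lambda, the same 'Reason' extraction could be written with pandas' vectorized string methods. This is only a sketch on three titles copied from the tables above; the variable names are illustrative.

```python
import pandas as pd

# Three titles taken from the rows shown above.
titles = pd.Series([
    'EMS: BACK PAINS/INJURY',
    'Fire: GAS-ODOR/LEAK',
    'Traffic: VEHICLE ACCIDENT -',
])

# Split each title once at the first colon and keep the part before it.
reason = titles.str.split(':', n=1).str[0]
print(reason.tolist())
```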
With that in place, the next step is to find the most common reason for 911 calls in the dataset.
df['Reason'].value_counts()
EMS 48877
Traffic 35695
Fire 14920
Name: Reason, dtype: int64
sns.countplot(x='Reason', data=df)
<matplotlib.axes._subplots.AxesSubplot at 0x1165ad710>
Next, I turn to the time information, starting by checking the datatype of the timeStamp column.
type(df['timeStamp'][0])
str
Since the timestamps are still strings, I convert them to pandas datetime objects so that I can extract the hour, month and day-of-week information.
df['timeStamp'] = pd.to_datetime(df['timeStamp'])
For a single timestamp, these attributes can be read as follows.
time = df['timeStamp'].iloc[0]
print('Hour:',time.hour)
print('Month:',time.month)
print('Day of Week:',time.dayofweek)
Hour: 17
Month: 12
Day of Week: 3
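Since the field list gives the timestamp layout as YYYY-MM-DD HH:MM:SS, the conversion could also state that format explicitly and turn unparseable strings into missing values instead of raising an error. The sketch below assumes that layout; the third input string is a made-up bad value for illustration.

```python
import pandas as pd

# Two timestamps from the tables above plus one deliberately invalid string.
raw = pd.Series(['2015-12-10 17:40:00', '2016-08-24 11:17:02', 'not a timestamp'])

# errors='coerce' replaces anything that does not match the format with NaT.
parsed = pd.to_datetime(raw, format='%Y-%m-%d %H:%M:%S', errors='coerce')

first = parsed.iloc[0]
print('Hour:', first.hour)
print('Month:', first.month)
print('Day of Week:', first.dayofweek)
print('Unparsed rows:', parsed.isna().sum())
```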
Over here I create new features for the above pieces of information.
# The .dt accessor reads each component for the whole column at once
df['Hour'] = df['timeStamp'].dt.hour
df['Month'] = df['timeStamp'].dt.month
df['Day of Week'] = df['timeStamp'].dt.dayofweek
df.head(3)
| | lat | lng | desc | zip | title | timeStamp | twp | addr | e | Reason | Hour | Month | Day of Week |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40.297876 | -75.581294 | REINDEER CT & DEAD END; NEW HANOVER; Station ... | 19525.0 | EMS: BACK PAINS/INJURY | 2015-12-10 17:40:00 | NEW HANOVER | REINDEER CT & DEAD END | 1 | EMS | 17 | 12 | 3 |
| 1 | 40.258061 | -75.264680 | BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP... | 19446.0 | EMS: DIABETIC EMERGENCY | 2015-12-10 17:40:00 | HATFIELD TOWNSHIP | BRIAR PATH & WHITEMARSH LN | 1 | EMS | 17 | 12 | 3 |
| 2 | 40.121182 | -75.351975 | HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-St... | 19401.0 | Fire: GAS-ODOR/LEAK | 2015-12-10 17:40:00 | NORRISTOWN | HAWS AVE | 1 | Fire | 17 | 12 | 3 |
The Day of the Week is an integer and it might not be instantly clear which number refers to which Day. We can map that information to a Mon-Sun string.
dmap = {0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}
df['Day of Week'] = df['Day of Week'].map(dmap)
df.tail(3)
| | lat | lng | desc | zip | title | timeStamp | twp | addr | e | Reason | Hour | Month | Day of Week |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 99489 | 40.115429 | -75.334679 | CHESTNUT ST & WALNUT ST; NORRISTOWN; Station ... | 19401.0 | EMS: FALL VICTIM | 2016-08-24 11:12:00 | NORRISTOWN | CHESTNUT ST & WALNUT ST | 1 | EMS | 11 | 8 | Wed |
| 99490 | 40.186431 | -75.192555 | WELSH RD & WEBSTER LN; HORSHAM; Station 352; ... | 19002.0 | EMS: NAUSEA/VOMITING | 2016-08-24 11:17:01 | HORSHAM | WELSH RD & WEBSTER LN | 1 | EMS | 11 | 8 | Wed |
| 99491 | 40.207055 | -75.317952 | MORRIS RD & S BROAD ST; UPPER GWYNEDD; 2016-08... | 19446.0 | Traffic: VEHICLE ACCIDENT - | 2016-08-24 11:17:02 | UPPER GWYNEDD | MORRIS RD & S BROAD ST | 1 | Traffic | 11 | 8 | Wed |
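The integer-to-name mapping above could also be obtained directly from the timestamps with `dt.day_name()`, which returns English day names by default. A small sketch, using two timestamps from the tables above:

```python
import pandas as pd

stamps = pd.to_datetime(pd.Series(['2015-12-10 17:40:00', '2016-08-24 11:17:02']))

# Full day names ('Thursday', 'Wednesday'), shortened to three letters.
day_abbr = stamps.dt.day_name().str[:3]
print(day_abbr.tolist())
```

This avoids maintaining a separate dictionary, at the cost of relying on the names being three-letter prefixes of the English day names.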
Now I combine the newly created features to see the most common call reasons for each day of the week.
sns.countplot(x='Day of Week', hue='Reason', data=df)
plt.legend(bbox_to_anchor=(1.25,1))
<matplotlib.legend.Legend at 0x116d9cbe0>
The first takeaway is that traffic-related 911 calls tend to be lowest at the weekend, and EMS-related calls are also lower at the weekend.
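The weekday versus weekend comparison read off the plot could also be checked numerically. The sketch below is one possible way to do it with `pd.crosstab`; the seven records are hypothetical and only illustrate the mechanics, not the real counts.

```python
import pandas as pd

# Hypothetical records; the real frame has one row per call.
calls = pd.DataFrame({
    'Day of Week': ['Mon', 'Mon', 'Tue', 'Sat', 'Sun', 'Fri', 'Sat'],
    'Reason': ['Traffic', 'EMS', 'Traffic', 'EMS', 'Traffic', 'Traffic', 'Fire'],
})

# Rows are days, columns are reasons, cells are call counts.
counts = pd.crosstab(calls['Day of Week'], calls['Reason'])

# Average count per day name, split into weekdays and weekend days.
is_weekend = counts.index.isin(['Sat', 'Sun'])
avg_per_day = pd.DataFrame({
    'weekday': counts[~is_weekend].mean(),
    'weekend': counts[is_weekend].mean(),
})
print(avg_per_day)
```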
The same breakdown by month:
sns.countplot(x='Month', hue='Reason', data=df)
plt.legend(bbox_to_anchor=(1.25,1))
<matplotlib.legend.Legend at 0x117dd1c88>
Next, I look at how the number of calls varies by month.
byMonth = df.groupby('Month').count()
byMonth['e'].plot.line()
plt.title('Calls per Month')
plt.ylabel('Number of Calls')
<matplotlib.text.Text at 0x1031d65c0>
I then use Seaborn to fit a linear model of the number of calls against the month, to see whether there is any clear linear relationship between the two.
byMonth.reset_index(inplace=True)
sns.lmplot(x='Month',y='e',data=byMonth)
plt.ylabel('Number of Calls')
<matplotlib.text.Text at 0x109aa7fd0>
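The regression plot shows the fitted line but not its numbers. As a rough sketch, the slope, intercept and Pearson correlation could be computed with NumPy and pandas; the monthly counts below are hypothetical placeholders, so the printed values say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly call counts, shaped like byMonth after reset_index().
monthly = pd.DataFrame({
    'Month': [1, 2, 3, 4, 5, 6],
    'e': [1320, 1240, 1210, 1150, 1090, 1060],
})

# Least-squares straight line: e is approximately slope * Month + intercept.
slope, intercept = np.polyfit(monthly['Month'], monthly['e'], deg=1)

# Pearson correlation between month number and call count.
r = monthly['Month'].corr(monthly['e'])

print(f'slope={slope:.1f}, intercept={intercept:.1f}, r={r:.3f}')
```

A correlation computed on a handful of monthly points is a weak basis for conclusions, so it is best read alongside the plot rather than instead of it.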
After the review, it seems that there are fewer emergency calls during the holiday seasons.
To see the behavior in more detail, I extract the date from the timestamp.
df['Date']=df['timeStamp'].apply(lambda x: x.date())
df.head(2)
| | lat | lng | desc | zip | title | timeStamp | twp | addr | e | Reason | Hour | Month | Day of Week | Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40.297876 | -75.581294 | REINDEER CT & DEAD END; NEW HANOVER; Station ... | 19525.0 | EMS: BACK PAINS/INJURY | 2015-12-10 17:40:00 | NEW HANOVER | REINDEER CT & DEAD END | 1 | EMS | 17 | 12 | Thu | 2015-12-10 |
| 1 | 40.258061 | -75.264680 | BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP... | 19446.0 | EMS: DIABETIC EMERGENCY | 2015-12-10 17:40:00 | HATFIELD TOWNSHIP | BRIAR PATH & WHITEMARSH LN | 1 | EMS | 17 | 12 | Thu | 2015-12-10 |
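With a Date column available, the number of calls per day is a simple group size. A minimal sketch on three hypothetical timestamps (the names `calls` and `daily` are illustrative):

```python
import pandas as pd

# Three hypothetical calls spread over two days.
calls = pd.DataFrame({'timeStamp': pd.to_datetime([
    '2015-12-10 17:40:00',
    '2015-12-10 17:40:01',
    '2015-12-11 08:05:00',
])})

# .dt.date yields plain datetime.date objects, one per row.
calls['Date'] = calls['timeStamp'].dt.date

# size() counts rows per group regardless of missing values in other columns.
daily = calls.groupby('Date').size()
print(daily)
```

Unlike `count()`, `size()` returns a single Series of row counts, so there is no need to pick a fully populated column such as 'e' afterwards.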
Here, I am grouping and plotting the data:
df.groupby('Date').count()['e'].plot.line()
plt.legend().remove()
plt.tight_layout()
I then repeat the same plot separately for each reason.
df[df['Reason']=='Traffic'].groupby('Date').count().plot.line(y='e')
plt.title('Traffic')
plt.legend().remove()
plt.tight_layout()
df[df['Reason']=='Fire'].groupby('Date').count().plot.line(y='e')
plt.title('Fire')
plt.legend().remove()
plt.tight_layout()
df[df['Reason']=='EMS'].groupby('Date').count().plot.line(y='e')
plt.title('EMS')
plt.legend().remove()
plt.tight_layout()
For a clearer view, I create a heatmap of the call counts for each hour of each day of the week.
day_hour = df.pivot_table(values='lat',index='Day of Week',columns='Hour',aggfunc='count')
day_hour
| Day of Week / Hour | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fri | 275 | 235 | 191 | 175 | 201 | 194 | 372 | 598 | 742 | 752 | ... | 932 | 980 | 1039 | 980 | 820 | 696 | 667 | 559 | 514 | 474 |
| Mon | 282 | 221 | 201 | 194 | 204 | 267 | 397 | 653 | 819 | 786 | ... | 869 | 913 | 989 | 997 | 885 | 746 | 613 | 497 | 472 | 325 |
| Sat | 375 | 301 | 263 | 260 | 224 | 231 | 257 | 391 | 459 | 640 | ... | 789 | 796 | 848 | 757 | 778 | 696 | 628 | 572 | 506 | 467 |
| Sun | 383 | 306 | 286 | 268 | 242 | 240 | 300 | 402 | 483 | 620 | ... | 684 | 691 | 663 | 714 | 670 | 655 | 537 | 461 | 415 | 330 |
| Thu | 278 | 202 | 233 | 159 | 182 | 203 | 362 | 570 | 777 | 828 | ... | 876 | 969 | 935 | 1013 | 810 | 698 | 617 | 553 | 424 | 354 |
| Tue | 269 | 240 | 186 | 170 | 209 | 239 | 415 | 655 | 889 | 880 | ... | 943 | 938 | 1026 | 1019 | 905 | 731 | 647 | 571 | 462 | 274 |
| Wed | 250 | 216 | 189 | 209 | 156 | 255 | 410 | 701 | 875 | 808 | ... | 904 | 867 | 990 | 1037 | 894 | 686 | 668 | 575 | 490 | 335 |
7 rows × 24 columns
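The pivot table lists the days alphabetically (Fri, Mon, Sat, ...), which carries over to the heatmap axis. One option is to reindex the rows into calendar order first. The sketch below uses only the 17:00 and 18:00 columns copied from the table above; `day_hour_small` and `ordered` are illustrative names.

```python
import pandas as pd

# Excerpt of the pivot table above: hours 17 and 18 only.
day_hour_small = pd.DataFrame(
    {
        17: [980, 997, 757, 714, 1013, 1019, 1037],
        18: [820, 885, 778, 670, 810, 905, 894],
    },
    index=pd.Index(['Fri', 'Mon', 'Sat', 'Sun', 'Thu', 'Tue', 'Wed'],
                   name='Day of Week'),
)

# Put the rows into Monday-to-Sunday order.
order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
ordered = day_hour_small.reindex(order)
print(ordered)
```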
I then create a heatmap from the new DataFrame.
sns.heatmap(day_hour)
plt.tight_layout()
The heatmap shows that most calls take place around the end of working hours during the week.
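The busiest hour for each day can also be read from the pivot table directly with `idxmax`. The sketch below uses only the 16:00 to 18:00 columns copied from the table above, so it finds the peak within that excerpt; on the full `day_hour` frame the same call would consider all 24 hours.

```python
import pandas as pd

# Excerpt of the pivot table above: hours 16, 17 and 18 only.
excerpt = pd.DataFrame(
    {
        16: [1039, 989, 848, 663, 935, 1026, 990],
        17: [980, 997, 757, 714, 1013, 1019, 1037],
        18: [820, 885, 778, 670, 810, 905, 894],
    },
    index=pd.Index(['Fri', 'Mon', 'Sat', 'Sun', 'Thu', 'Tue', 'Wed'],
                   name='Day of Week'),
)

# For each day (row), return the column label (hour) with the highest count.
peak_hour = excerpt.idxmax(axis=1)
print(peak_hour)
```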
I also create a clustermap to group similar hours and days together.
sns.clustermap(day_hour)
<seaborn.matrix.ClusterGrid at 0x11c49f320>
In summary:
- Most calls take place around the end of working hours during the week.
- There appear to be fewer emergency calls during the holiday seasons.
- Fitting the number of calls against the month with Seaborn is a convenient way to check for a linear relationship between the two.
- Traffic-related 911 calls tend to be lowest at the weekend.
- EMS-related calls are also lower at the weekend.