Chapter4 3

This document provides an introduction to using pandas for data analysis, focusing on efficient iteration over DataFrames. It covers various methods for calculating statistics, such as win percentages and run differentials, and compares the performance of different iteration techniques including .iloc, .iterrows(), and .itertuples(). The document emphasizes the importance of vectorization and using NumPy for optimal performance in data manipulation.

Uploaded by

Ousmane
Intro to pandas DataFrame iteration
WRITING EFFICIENT PYTHON CODE

Logan Thomas
Senior Data Scientist, Protection Engineering Consultants
pandas recap
See pandas overview in Intermediate Python for Data Science

Library used for data analysis

Main data structure is the DataFrame


Tabular data with labeled rows and columns

Built on top of the NumPy array structure

Chapter Objective:
Best practice for iterating over a pandas DataFrame

Baseball stats
import pandas as pd

baseball_df = pd.read_csv('baseball_stats.csv')
print(baseball_df.head())

  Team League  Year   RS   RA   W    G  Playoffs
0  ARI     NL  2012  734  688  81  162         0
1  ATL     NL  2012  700  600  94  162         1
2  BAL     AL  2012  712  705  93  162         1
3  BOS     AL  2012  734  806  69  162         0
4  CHC     NL  2012  613  759  61  162         0
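
The file baseball_stats.csv is not included with these slides. As a stand-in for following along, the five rows printed above can be typed into a small DataFrame by hand. This is only a sketch built from the printed output, not the full dataset.

```python
import pandas as pd

# The five rows shown in the slide output, entered by hand.
# This is a stand-in for pd.read_csv('baseball_stats.csv'), not the full file.
baseball_df = pd.DataFrame({
    'Team': ['ARI', 'ATL', 'BAL', 'BOS', 'CHC'],
    'League': ['NL', 'NL', 'AL', 'AL', 'NL'],
    'Year': [2012] * 5,
    'RS': [734, 700, 712, 734, 613],   # runs scored
    'RA': [688, 600, 705, 806, 759],   # runs allowed
    'W': [81, 94, 93, 69, 61],         # wins
    'G': [162] * 5,                    # games played
    'Playoffs': [0, 1, 1, 0, 0],
})

print(baseball_df.shape)
print(list(baseball_df.columns))
```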

Calculating win percentage
import numpy as np

def calc_win_perc(wins, games_played):
    win_perc = wins / games_played
    return np.round(win_perc, 2)

win_perc = calc_win_perc(50, 100)
print(win_perc)

0.5
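
Since calc_win_perc uses only division and np.round, both of which work elementwise on NumPy arrays, the same function should also accept whole arrays rather than single numbers. The sketch below assumes the five win totals from the sample output and 162 games each; the array names are made up for illustration.

```python
import numpy as np

def calc_win_perc(wins, games_played):
    win_perc = wins / games_played
    return np.round(win_perc, 2)

# Hypothetical arrays holding the five sample rows' wins and games played.
wins = np.array([81, 94, 93, 69, 61])
games = np.array([162, 162, 162, 162, 162])

# One call computes every team's win percentage at once.
win_percs = calc_win_perc(wins, games)
print(win_percs)
```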

Adding win percentage to DataFrame
win_perc_list = []

for i in range(len(baseball_df)):
    row = baseball_df.iloc[i]

    wins = row['W']
    games_played = row['G']

    win_perc = calc_win_perc(wins, games_played)

    win_perc_list.append(win_perc)

baseball_df['WP'] = win_perc_list

Adding win percentage to DataFrame
print(baseball_df.head())

  Team League  Year   RS   RA   W    G  Playoffs    WP
0  ARI     NL  2012  734  688  81  162         0  0.50
1  ATL     NL  2012  700  600  94  162         1  0.58
2  BAL     AL  2012  712  705  93  162         1  0.57
3  BOS     AL  2012  734  806  69  162         0  0.43
4  CHC     NL  2012  613  759  61  162         0  0.38

Iterating with .iloc
%%timeit
win_perc_list = []

for i in range(len(baseball_df)):
    row = baseball_df.iloc[i]

    wins = row['W']
    games_played = row['G']

    win_perc = calc_win_perc(wins, games_played)

    win_perc_list.append(win_perc)

baseball_df['WP'] = win_perc_list

183 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Iterating with .iterrows()
win_perc_list = []

for i, row in baseball_df.iterrows():
    wins = row['W']
    games_played = row['G']

    win_perc = calc_win_perc(wins, games_played)

    win_perc_list.append(win_perc)

baseball_df['WP'] = win_perc_list

Iterating with .iterrows()
%%timeit
win_perc_list = []

for i, row in baseball_df.iterrows():
    wins = row['W']
    games_played = row['G']

    win_perc = calc_win_perc(wins, games_played)

    win_perc_list.append(win_perc)

baseball_df['WP'] = win_perc_list

95.3 ms ± 3.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
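
%%timeit is an IPython magic, so it is not available in a plain Python script. A rough equivalent can be sketched with the standard library's timeit module. The DataFrame below is synthetic (random win totals, 162 games each), so the timings it prints will not match the figures on the slides.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical values, not the baseball file).
rng = np.random.default_rng(0)
df = pd.DataFrame({'W': rng.integers(40, 120, size=500), 'G': 162})

def with_iloc(frame):
    """Look up each row by integer position."""
    out = []
    for i in range(len(frame)):
        row = frame.iloc[i]
        out.append(round(row['W'] / row['G'], 2))
    return out

def with_iterrows(frame):
    """Let pandas hand back (index, row) pairs."""
    out = []
    for _, row in frame.iterrows():
        out.append(round(row['W'] / row['G'], 2))
    return out

# Both loops should produce identical results.
same = with_iloc(df) == with_iterrows(df)

# Total seconds for 3 calls of each function.
t_iloc = timeit.timeit(lambda: with_iloc(df), number=3)
t_iterrows = timeit.timeit(lambda: with_iterrows(df), number=3)
print(f".iloc: {t_iloc:.3f} s, .iterrows(): {t_iterrows:.3f} s, same results: {same}")
```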

Practice DataFrame iterating with .iterrows()

Another iterator method: .itertuples()

Team wins data
print(team_wins_df)

  Team  Year   W
0  ARI  2012  81
1  ATL  2012  94
2  BAL  2012  93
3  BOS  2012  69
4  CHC  2012  61
...

for row_tuple in team_wins_df.iterrows():
    print(row_tuple)
    print(type(row_tuple[1]))

(0, Team     ARI
Year    2012
W         81
Name: 0, dtype: object)
<class 'pandas.core.series.Series'>

(1, Team     ATL
Year    2012
W         94
Name: 1, dtype: object)
<class 'pandas.core.series.Series'>
...

Iterating with .itertuples()
for row_namedtuple in team_wins_df.itertuples():
    print(row_namedtuple)

Pandas(Index=0, Team='ARI', Year=2012, W=81)
Pandas(Index=1, Team='ATL', Year=2012, W=94)
...

print(row_namedtuple.Index)

1

print(row_namedtuple.Team)

ATL
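
As a sketch of how the earlier win-percentage loop might look with .itertuples(), the fields of each namedtuple can be read as attributes. The three-row DataFrame below is a hand-typed stand-in taken from the sample output, and the built-in round() is used in place of calc_win_perc to keep the example short.

```python
import pandas as pd

# Hand-typed stand-in with three of the sample rows.
df = pd.DataFrame({
    'Team': ['ARI', 'ATL', 'BAL'],
    'W': [81, 94, 93],
    'G': [162, 162, 162],
})

win_perc_list = []
for row in df.itertuples():
    # Columns are available as attributes on the namedtuple.
    win_perc_list.append(round(row.W / row.G, 2))

df['WP'] = win_perc_list
print(df)
```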

Comparing methods
%%timeit
for row_tuple in team_wins_df.iterrows():
    print(row_tuple)

527 ms ± 41.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
for row_namedtuple in team_wins_df.itertuples():
    print(row_namedtuple)

7.48 ms ± 243 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

for row_tuple in team_wins_df.iterrows():
    print(row_tuple[1]['Team'])

ARI
ATL
...

for row_namedtuple in team_wins_df.itertuples():
    print(row_namedtuple['Team'])

TypeError: tuple indices must be integers or slices, not str

for row_namedtuple in team_wins_df.itertuples():
    print(row_namedtuple.Team)

ARI
ATL
...
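
Because each row from .itertuples() is a namedtuple, a field can also be reached by integer position or with getattr() when the column name is held in a variable. The sketch below uses a hand-typed two-row stand-in for team_wins_df and also shows the index and name parameters of .itertuples(), which return plain tuples without the index when set to False and None.

```python
import pandas as pd

# Two-row stand-in for team_wins_df, typed from the sample output.
team_wins_df = pd.DataFrame({
    'Team': ['ARI', 'ATL'],
    'Year': [2012, 2012],
    'W': [81, 94],
})

first = next(team_wins_df.itertuples())

by_attr = first.Team              # attribute lookup
by_pos = first[1]                 # position 0 holds the index, so Team is at 1
by_name = getattr(first, 'Team')  # useful when the column name is in a variable

# index=False drops the index field; name=None yields plain tuples.
plain = next(team_wins_df.itertuples(index=False, name=None))

print(by_attr, by_pos, by_name)
print(plain)
```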

Let's keep iterating!

pandas alternative to looping

print(baseball_df.head())

  Team League  Year   RS   RA   W    G  Playoffs
0  ARI     NL  2012  734  688  81  162         0
1  ATL     NL  2012  700  600  94  162         1
2  BAL     AL  2012  712  705  93  162         1
3  BOS     AL  2012  734  806  69  162         0
4  CHC     NL  2012  613  759  61  162         0

def calc_run_diff(runs_scored, runs_allowed):
    run_diff = runs_scored - runs_allowed
    return run_diff

Run differentials with a loop
run_diffs_iterrows = []

for i, row in baseball_df.iterrows():
    run_diff = calc_run_diff(row['RS'], row['RA'])
    run_diffs_iterrows.append(run_diff)

baseball_df['RD'] = run_diffs_iterrows
print(baseball_df)

  Team League  Year   RS   RA   W    G  Playoffs   RD
0  ARI     NL  2012  734  688  81  162         0   46
1  ATL     NL  2012  700  600  94  162         1  100
2  BAL     AL  2012  712  705  93  162         1    7
...

pandas .apply() method
Takes a function and applies it to a DataFrame
Specify the axis to apply along (0 passes each column to the function and is the default; 1 passes each row)

Can be used with anonymous functions (lambda functions)

Example:

baseball_df.apply(
    lambda row: calc_run_diff(row['RS'], row['RA']),
    axis=1
)
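
To make the axis argument concrete, the sketch below applies one function along each axis of a small hand-typed DataFrame holding the first three RS and RA values from the sample output. With axis=0 the function receives each column as a Series; with axis=1 it receives each row.

```python
import pandas as pd

# First three RS/RA pairs from the sample output, typed by hand.
df = pd.DataFrame({'RS': [734, 700, 712], 'RA': [688, 600, 705]})

# axis=0: the lambda is called once per column.
col_totals = df.apply(lambda col: col.sum(), axis=0)

# axis=1: the lambda is called once per row.
run_diffs = df.apply(lambda row: row['RS'] - row['RA'], axis=1)

print(col_totals)
print(run_diffs)
```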

Run differentials with .apply()
run_diffs_apply = baseball_df.apply(
    lambda row: calc_run_diff(row['RS'], row['RA']),
    axis=1)

baseball_df['RD'] = run_diffs_apply
print(baseball_df)

  Team League  Year   RS   RA   W    G  Playoffs   RD
0  ARI     NL  2012  734  688  81  162         0   46
1  ATL     NL  2012  700  600  94  162         1  100
2  BAL     AL  2012  712  705  93  162         1    7
...

Comparing approaches
%%timeit
run_diffs_iterrows = []

for i, row in baseball_df.iterrows():
    run_diff = calc_run_diff(row['RS'], row['RA'])
    run_diffs_iterrows.append(run_diff)

baseball_df['RD'] = run_diffs_iterrows

86.8 ms ± 3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Comparing approaches
%%timeit
run_diffs_apply = baseball_df.apply(
    lambda row: calc_run_diff(row['RS'], row['RA']),
    axis=1)

baseball_df['RD'] = run_diffs_apply

30.1 ms ± 1.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Let's practice using pandas .apply() method!

Optimal pandas iterating

pandas internals
Eliminating loops applies to using pandas as well

pandas is built on NumPy


Take advantage of NumPy array efficiencies

print(baseball_df)

  Team League  Year   RS   RA   W    G  Playoffs
0  ARI     NL  2012  734  688  81  162         0
1  ATL     NL  2012  700  600  94  162         1
2  BAL     AL  2012  712  705  93  162         1
...

wins_np = baseball_df['W'].values

print(type(wins_np))

<class 'numpy.ndarray'>

print(wins_np)

[ 81 94 93 ...]
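
The slides use the .values attribute. Current pandas documentation recommends the .to_numpy() method for the same purpose, and for a plain integer column like this one both should give the same array. A minimal sketch, assuming a hand-typed three-row column:

```python
import numpy as np
import pandas as pd

# Three win totals from the sample output, typed by hand.
df = pd.DataFrame({'W': [81, 94, 93]})

wins_values = df['W'].values   # attribute used on the slides
wins_np = df['W'].to_numpy()   # method recommended by the pandas docs

print(type(wins_np))
print(np.array_equal(wins_values, wins_np))
```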

Power of vectorization
Broadcasting (vectorizing) is extremely efficient!

baseball_df['RS'].values - baseball_df['RA'].values

array([ 46, 100, 7, ..., 188, 110, -117])

Run differentials with arrays
run_diffs_np = baseball_df['RS'].values - baseball_df['RA'].values

baseball_df['RD'] = run_diffs_np
print(baseball_df)

  Team League  Year   RS   RA   W    G  Playoffs    RD
0  ARI     NL  2012  734  688  81  162         0    46
1  ATL     NL  2012  700  600  94  162         1   100
2  BAL     AL  2012  712  705  93  162         1     7
3  BOS     AL  2012  734  806  69  162         0   -72
4  CHC     NL  2012  613  759  61  162         0  -146
...

Comparing approaches
%%timeit
run_diffs_np = baseball_df['RS'].values - baseball_df['RA'].values

baseball_df['RD'] = run_diffs_np

124 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
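
The three run-differential approaches can be checked against each other and timed outside IPython with the standard library's timeit module. The sketch below runs on synthetic data (random RS and RA values, a made-up 1000 rows), so its timings are only indicative and will differ from the numbers on the slides.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical values, not the baseball file).
rng = np.random.default_rng(1)
df = pd.DataFrame({
    'RS': rng.integers(500, 900, size=1000),
    'RA': rng.integers(500, 900, size=1000),
})

def loop_version(frame):
    return [row['RS'] - row['RA'] for _, row in frame.iterrows()]

def apply_version(frame):
    return frame.apply(lambda row: row['RS'] - row['RA'], axis=1)

def array_version(frame):
    return frame['RS'].values - frame['RA'].values

# All three approaches should agree before their speed is compared.
expected = array_version(df)
agree = (np.array_equal(loop_version(df), expected)
         and np.array_equal(apply_version(df).values, expected))
print("results agree:", agree)

for name, fn in [('iterrows', loop_version),
                 ('apply', apply_version),
                 ('arrays', array_version)]:
    seconds = timeit.timeit(lambda: fn(df), number=3)
    print(f"{name:>8}: {seconds / 3 * 1000:.3f} ms per call")
```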

Let's put our skills into practice!

Congratulations!

What you have learned
The definition of efficient and Pythonic code

How to use Python's powerful built-in library

The advantages of NumPy arrays

Some handy magic commands to profile code

How to deploy efficient solutions with zip(), itertools, collections, and set theory

The cost of looping and how to eliminate loops

Best practices for iterating with pandas DataFrames

Well done!