1. Demonstrate how to construct a multiply indexed Pandas Series using a
straightforward dictionary-based method and explicit MultiIndex constructors such as
from_arrays, from_tuples, and from_product, as well as the levels-and-codes form of the
pd.MultiIndex constructor. Include code examples for each method and explain. —10M
The most straightforward way to construct a multiply indexed Series or DataFrame is to simply
pass a list of two or more index arrays to the constructor. Similarly, if you pass a dictionary
with appropriate tuples as keys, Pandas will automatically recognize this and use a MultiIndex
by default. For example:
In [13]: data = {('California', 2010): 37253956,
('California', 2020): 39538223,
('New York', 2010): 19378102,
('New York', 2020): 20201249,
('Texas', 2010): 25145561,
('Texas', 2020): 29145505}
pd.Series(data)
Out[13]: California 2010 37253956
2020 39538223
New York 2010 19378102
2020 20201249
Texas 2010 25145561
2020 29145505
dtype: int64
Nevertheless, it is sometimes useful to explicitly create a MultiIndex; we’ll look at a
couple of methods for doing this next.
Explicit MultiIndex Constructors
For more flexibility in how the index is constructed, you can instead use the constructor
methods available in the pd.MultiIndex class. For example, as we did before,
you can construct a MultiIndex from a simple list of arrays giving the index values
within each level:
In [14]: pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
Out[14]: MultiIndex([('a', 1),
('a', 2),
('b', 1),
('b', 2)],
)
Or you can construct it from a list of tuples giving the multiple index values of each
point:
In [15]: pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
Out[15]: MultiIndex([('a', 1),
('a', 2),
('b', 1),
('b', 2)],
)
You can even construct it from a Cartesian product of single indices:
In [16]: pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
Out[16]: MultiIndex([('a', 1),
('a', 2),
('b', 1),
('b', 2)],
)
Similarly, you can construct a MultiIndex directly using its internal encoding by
passing levels (a list of lists containing available index values for each level) and
codes (a list of lists that reference these labels):
In [17]: pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
codes=[[0, 0, 1, 1], [0, 1, 0, 1]])
Out[17]: MultiIndex([('a', 1),
('a', 2),
('b', 1),
('b', 2)],
)
Any of these objects can be passed as the index argument when creating a Series or
DataFrame, or be passed to the reindex method of an existing Series or DataFrame.
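As a brief illustration (a minimal sketch; the values are arbitrary), here is a MultiIndex supplied directly as the index of a new Series, and then imposed on an existing Series via reindex:
idx = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
s = pd.Series([10, 20, 30, 40], index=idx)      # new Series built on the MultiIndex
s2 = pd.Series([10, 40], index=pd.MultiIndex.from_tuples([('a', 1), ('b', 2)]))
s2.reindex(idx)                                 # missing combinations filled with NaN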
2. Develop a multiply indexed Pandas DataFrame with hierarchical row and column
indices to represent mock medical data. Make use of MultiIndex.from_product for
both axes, and utilize Pandas indexing tools like loc, iloc, and IndexSlice to
experiment with retrieving specific slices of the data. —10M
In a DataFrame, the rows and columns are completely symmetric, and just as the rows
can have multiple levels of indices, the columns can have multiple levels as well. Consider
the following, which is a mock-up of some (somewhat realistic) medical data:
In [19]: # hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'],
['HR', 'Temp']],
names=['subject', 'type'])
# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37
# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data
Out[19]:
subject Bob Guido Sue
type HR Temp HR Temp HR Temp
year visit
2013 1 30.0 38.0 56.0 38.3 45.0 35.8
2 47.0 37.1 27.0 36.0 37.0 36.4
2014 1 51.0 35.9 24.0 36.7 32.0 36.2
2 49.0 36.3 48.0 39.2 31.0 35.7
Indexing and slicing a multiply indexed DataFrame work in a similar manner to the Series
case. Consider our toy medical DataFrame from before:
In [28]: health_data
Out[28]: subject Bob Guido Sue
type HR Temp HR Temp HR Temp
year visit
2013 1 30.0 38.0 56.0 38.3 45.0 35.8
2 47.0 37.1 27.0 36.0 37.0 36.4
2014 1 51.0 35.9 24.0 36.7 32.0 36.2
2 49.0 36.3 48.0 39.2 31.0 35.7
Remember that columns are primary in a DataFrame, and the syntax used for multiply
indexed Series applies to the columns. For example, we can recover Guido’s heart
rate data with a simple operation:
In [29]: health_data['Guido', 'HR']
Out[29]: year visit
2013 1 56.0
2 27.0
2014 1 24.0
2 48.0
Name: (Guido, HR), dtype: float64
Also, as with the single-index case, we can use the loc and iloc indexers introduced
in Chapter 14. For example:
In [30]: health_data.iloc[:2, :2]
Out[30]: subject Bob
type HR Temp
year visit
2013 1 30.0 38.0
2 47.0 37.1
These indexers provide an array-like view of the underlying two-dimensional data,
but each individual index in loc or iloc can be passed a tuple of multiple indices. For
example:
In [31]: health_data.loc[:, ('Bob', 'HR')]
Out[31]: year visit
2013 1 30.0
2 47.0
2014 1 51.0
2 49.0
Name: (Bob, HR), dtype: float64
Working with slices within these index tuples is not especially convenient; trying to
create a slice within a tuple will lead to a syntax error:
In [32]: health_data.loc[(:, 1), (:, 'HR')]
SyntaxError: invalid syntax (3311942670.py, line 1)
You could get around this by building the desired slice explicitly using Python's built-in
slice function (sketched after the next example), but a better way in this context is to use an IndexSlice object,
which Pandas provides for precisely this situation. For example:
In [33]: idx = pd.IndexSlice
health_data.loc[idx[:, 1], idx[:, 'HR']]
Out[33]: subject Bob Guido Sue
type HR HR HR
year visit
2013 1 30.0 56.0 45.0
2014 1 51.0 24.0 32.0
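For completeness, here is a minimal sketch of the built-in slice approach mentioned above; slice(None) plays the role of : inside each index tuple:
health_data.loc[(slice(None), 1), (slice(None), 'HR')]  # same result as the IndexSlice version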
3. Build a hierarchical Pandas Series using state and year tuples as a multi-level index.
Utilize unstack(), stack(), reset_index(), and set_index() methods to experiment
with reshaping the data between Series and DataFrame formats. —10M
In [2]: index = [('California', 2010), ('California', 2020),
('New York', 2010), ('New York', 2020),
('Texas', 2010), ('Texas', 2020)]
populations = [37253956, 39538223,
19378102, 20201249,
25145561, 29145505]
In [3]: pop = pd.Series(populations, index=index)
In [5]: index = pd.MultiIndex.from_tuples(index)
In [6]: pop = pop.reindex(index)
pop
Out[6]: California 2010 37253956
2020 39538223
New York 2010 19378102
2020 20201249
Texas 2010 25145561
2020 29145505
dtype: int64
After giving the index levels names (pop.index.names = ['state', 'year']), it is possible to
convert the dataset from a stacked multi-index to a simple two-dimensional representation,
optionally specifying the level to use:
In [38]: pop.unstack(level=1)
Out[38]:
year 2010 2020
state
California 37253956 39538223
New York 19378102 20201249
Texas 25145561 29145505
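Choosing level=0 instead moves the state labels into the columns; a quick sketch with the same data (the commented output is what we would expect):
pop.unstack(level=0)
# state  California  New York     Texas
# year
# 2010     37253956  19378102  25145561
# 2020     39538223  20201249  29145505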
The opposite of unstack is stack, which here can be used to recover the original
series:
In [40]: pop.unstack().stack()
Out[40]:
state year
California 2010 37253956
2020 39538223
New York 2010 19378102
2020 20201249
Texas 2010 25145561
2020 29145505
dtype: int64
Another way to rearrange hierarchical data is to turn the index labels into columns;
this can be accomplished with the reset_index method. Calling this on the population
Series will result in a DataFrame with state and year columns holding the
information that was formerly in the index. For clarity, we can optionally specify the
name of the data for the column representation:
In [41]: pop_flat = pop.reset_index(name='population')
pop_flat
Out[41]: state year population
0 California 2010 37253956
1 California 2020 39538223
2 New York 2010 19378102
3 New York 2020 20201249
4 Texas 2010 25145561
5 Texas 2020 29145505
A common pattern is to build a MultiIndex from the column values. This can be
done with the set_index method of the DataFrame, which returns a multiply indexed
DataFrame:
In [42]: pop_flat.set_index(['state', 'year'])
Out[42]: population
state year
California 2010 37253956
2020 39538223
New York 2010 19378102
2020 20201249
Texas 2010 25145561
2020 29145505
In practice, this type of reindexing is one of the more useful patterns when exploring
real-world datasets.
4. Develop a Pandas-based analysis using the Titanic dataset from the Seaborn library.
Construct group-based summaries to explore survival rates by sex and passenger class
using groupby and pivot_table. Make use of aggregation and reshaping methods to
simplify complex groupings, and identify which method offers clearer insight or
readability when analyzing multidimensional patterns in the data. —10M
For the examples in this section, we'll use the database of passengers on the Titanic,
available through the Seaborn library:
In [1]: import numpy as np
import pandas as pd
import seaborn as sns
titanic = sns.load_dataset('titanic')
In [2]: titanic.head()
Out[2]: survived pclass sex age sibsp parch fare embarked class \
0 0 3 male 22.0 1 0 7.2500 S Third
1 1 1 female 38.0 1 0 71.2833 C First
2 1 3 female 26.0 0 0 7.9250 S Third
3 1 1 female 35.0 1 0 53.1000 S First
4 0 3 male 35.0 0 0 8.0500 S Third
who adult_male deck embark_town alive alone
0 man True NaN Southampton no False
1 woman False C Cherbourg yes False
2 woman False NaN Southampton yes True
3 woman False C Southampton yes False
4 man True NaN Southampton no True
As the output shows, this contains a number of data points on each passenger on that
ill-fated voyage, including sex, age, class, fare paid, and much more.
Pivot Tables by Hand
To start learning more about this data, we might begin by grouping according to sex,
survival status, or some combination thereof. If you read the previous chapter, you
might be tempted to apply a groupby operation—for example, let’s look at survival
rate by sex:
In [3]: titanic.groupby('sex')[['survived']].mean()
Out[3]: survived
sex
female 0.742038
male 0.188908
This gives us some initial insight: overall, three of every four females on board survived,
while only one in five males survived!
This is useful, but we might like to go one step deeper and look at survival rates by
both sex and, say, class. Using the vocabulary of groupby, we might proceed using a
process like this: we first group by class and sex, then select survival, apply a mean
aggregate, combine the resulting groups, and finally unstack the hierarchical index to
reveal the hidden multidimensionality. In code:
In [4]: titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()
Out[4]: class First Second Third
sex
female 0.968085 0.921053 0.500000
male 0.368852 0.157407 0.135447
This gives us a better idea of how both sex and class affected survival, but the code is
starting to look a bit garbled. While each step of this pipeline makes sense in light of
the tools we’ve previously discussed, the long string of code is not particularly easy to
read or use. This two-dimensional groupby is common enough that Pandas includes
a convenience routine, pivot_table, which succinctly handles this type of multidimensional
aggregation.
Here is the equivalent to the preceding operation using the DataFrame.pivot_table
method:
In [5]: titanic.pivot_table('survived', index='sex', columns='class', aggfunc='mean')
Out[5]: class First Second Third
sex
female 0.968085 0.921053 0.500000
male 0.368852 0.157407 0.135447
This is eminently more readable than the manual groupby approach, and produces
the same result.
5. Construct a multi-dimensional analysis of the Titanic dataset using pivot tables in
Pandas. Utilize pd.cut() and pd.qcut() to group passengers by age and fare
brackets, and develop pivot tables that display survival rates across combinations of
sex, class, age, and fare. Apply different aggregation functions using the aggfunc
parameter and experiment with the margins option to include totals in your analysis.
—10M
In [1]: import numpy as np
import pandas as pd
import seaborn as sns
titanic = sns.load_dataset('titanic')
In [2]: titanic.head()
Out[2]: survived pclass sex age sibsp parch fare embarked class \
0 0 3 male 22.0 1 0 7.2500 S Third
1 1 1 female 38.0 1 0 71.2833 C First
2 1 3 female 26.0 0 0 7.9250 S Third
3 1 1 female 35.0 1 0 53.1000 S First
4 0 3 male 35.0 0 0 8.0500 S Third
who adult_male deck embark_town alive alone
0 man True NaN Southampton no False
1 woman False C Cherbourg yes False
2 woman False NaN Southampton yes True
3 woman False C Southampton yes False
4 man True NaN Southampton no True
Just as in a groupby, the grouping in pivot tables can be specified with multiple levels
and via a number of options. For example, we might be interested in looking at age as
a third dimension. We’ll bin the age using the pd.cut function:
In [6]: age = pd.cut(titanic['age'], [0, 18, 80])
titanic.pivot_table('survived', ['sex', age], 'class')
Out[6]: class First Second Third
sex age
female (0, 18] 0.909091 1.000000 0.511628
(18, 80] 0.972973 0.900000 0.423729
male (0, 18] 0.800000 0.600000 0.215686
(18, 80] 0.375000 0.071429 0.133663
We can apply the same strategy when working with the columns as well; let’s add info
on the fare paid, using pd.qcut to automatically compute quantiles:
In [7]: fare = pd.qcut(titanic['fare'], 2)
titanic.pivot_table('survived', ['sex', age], [fare, 'class'])
Out[7]: fare (-0.001, 14.454] (14.454, 512.329] \
class First Second Third First
sex age
female (0, 18] NaN 1.000000 0.714286 0.909091
(18, 80] NaN 0.880000 0.444444 0.972973
male (0, 18] NaN 0.000000 0.260870 0.800000
(18, 80] 0.0 0.098039 0.125000 0.391304
fare
class Second Third
sex age
female (0, 18] 1.000000 0.318182
(18, 80] 0.914286 0.391304
male (0, 18] 0.818182 0.178571
(18, 80] 0.030303 0.192308
The result is a four-dimensional aggregation with hierarchical indices.
The aggfunc keyword controls what type of aggregation is applied, which is a mean
by default. As with groupby, the aggregation specification can be a string representing
one of several common choices ('sum', 'mean', 'count', 'min', 'max', etc.) or a
function that implements an aggregation (e.g., np.sum(), min(), sum(), etc.). Additionally,
it can be specified as a dictionary mapping a column to any of the desired
options:
In [8]: titanic.pivot_table(index='sex', columns='class',
aggfunc={'survived':sum, 'fare':'mean'})
Out[8]: fare survived
class First Second Third First Second Third
sex
female 106.125798 21.970121 16.118810 91 70 72
male 67.226127 19.741782 12.661633 45 17 47
Notice also here that we’ve omitted the values keyword; when specifying a mapping
for aggfunc, this is determined automatically.
At times it’s useful to compute totals along each grouping. This can be done via the
margins keyword:
In [9]: titanic.pivot_table('survived', index='sex', columns='class', margins=True)
Out[9]: class First Second Third All
sex
female 0.968085 0.921053 0.500000 0.742038
male 0.368852 0.157407 0.135447 0.188908
All 0.629630 0.472826 0.242363 0.383838
Here, this automatically gives us information about the class-agnostic survival rate by
sex, the sex-agnostic survival rate by class, and the overall survival rate of 38%. The
margin label can be specified with the margins_name keyword; it defaults to "All".
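The following sketch shows the margins_name keyword in action (the label 'Total' is an arbitrary choice):
titanic.pivot_table('survived', index='sex', columns='class',
                    margins=True, margins_name='Total')
# same table as above, but the margin row and column are labeled 'Total' rather than 'All'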
6. Build a Pandas Series containing different names. Utilize various vectorized string
methods such as .str.lower(), .str.len(), .str.startswith(), and
.str.split() to explore and transform the Series. Apply these methods to examine the
data and identify the type of result each method returns. —10M
Using the following Series object:
In [5]: monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
'Eric Idle', 'Terry Jones', 'Michael Palin'])
Nearly all of Python's built-in string methods are mirrored by a Pandas vectorized
string method. These methods have a variety of return types.
Some, like lower, return a series of strings:
In [6]: monte.str.lower()
Out[6]: 0 graham chapman
1 john cleese
2 terry gilliam
3 eric idle
4 terry jones
5 michael palin
dtype: object
But some others return numbers:
In [7]: monte.str.len()
Out[7]: 0 14
1 11
2 13
3 9
4 11
5 13
dtype: int64
Or Boolean values:
In [8]: monte.str.startswith('T')
Out[8]: 0 False
1 False
2 True
3 False
4 True
5 False
dtype: bool
Still others return lists or other compound values for each element:
In [9]: monte.str.split()
Out[9]: 0 [Graham, Chapman]
1 [John, Cleese]
2 [Terry, Gilliam]
3 [Eric, Idle]
4 [Terry, Jones]
5 [Michael, Palin]
dtype: object
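To make these return types explicit, a small sketch like the following prints the dtype each method produces (the method list is just for illustration):
for name, result in [('lower', monte.str.lower()),
                     ('len', monte.str.len()),
                     ('startswith', monte.str.startswith('T')),
                     ('split', monte.str.split())]:
    print(name, '->', result.dtype)
# lower -> object, len -> int64, startswith -> bool, split -> object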
7. Discuss how Pandas string methods can be applied to perform various operations on
a dataset. Explain how you can apply string access, slicing, and indicator methods,
with examples. —10M
Table 22-2 lists miscellaneous methods that enable other convenient operations.
Table 22-2. Other Pandas string methods
Method Description
get Indexes each element
slice Slices each element
slice_replace Replaces slice in each element with the passed value
cat Concatenates strings
repeat Repeats values
normalize Returns Unicode form of strings
pad Adds whitespace to left, right, or both sides of strings
wrap Splits long strings into lines with length less than a given width
join Joins strings in each element of the Series with the passed separator
get_dummies Extracts dummy variables as a DataFrame
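A brief sketch exercising a few of these methods on the monte Series from earlier (outputs abbreviated in the comments):
monte.str.pad(20)         # right-align each name in a field of width 20
monte.str.repeat(2)       # 'Graham ChapmanGraham Chapman', ...
monte.str.cat(sep=', ')   # one string: 'Graham Chapman, John Cleese, ...'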
Vectorized item access and slicing
The get and slice operations, in particular, enable vectorized element access from
each entry. For example, we can get a slice of the first three characters of each entry
using str.slice(0, 3). This behavior is also available through Python's normal
indexing syntax; for example, monte.str.slice(0, 3) is equivalent to monte.str[0:3]:
In [12]: monte.str[0:3]
Out[12]: 0 Gra
1 Joh
2 Ter
3 Eri
4 Ter
5 Mic
dtype: object
Indexing via monte.str.get(i) and monte.str[i] is likewise equivalent.
These indexing methods also let you access elements of arrays returned by split. For
example, to extract the last name of each entry, combine split with str indexing:
In [13]: monte.str.split().str[-1]
Out[13]: 0 Chapman
1 Cleese
2 Gilliam
3 Idle
4 Jones
5 Palin
dtype: object
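Equivalently, get pulls a single element from each list; a one-line sketch extracting first names:
monte.str.split().str.get(0)   # 'Graham', 'John', 'Terry', ...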
Indicator variables
Another method that requires a bit of extra explanation is the get_dummies method.
This is useful when your data has a column containing some sort of coded indicator.
For example, we might have a dataset that contains information in the form of codes,
such as A = “born in America,” B = “born in the United Kingdom,” C = “likes cheese,”
D = “likes spam”:
In [14]: full_monte = pd.DataFrame({'name': monte,
'info': ['B|C|D', 'B|D', 'A|C',
'B|D', 'B|C', 'B|C|D']})
full_monte
Out[14]: name info
0 Graham Chapman B|C|D
1 John Cleese B|D
2 Terry Gilliam A|C
3 Eric Idle B|D
4 Terry Jones B|C
5 Michael Palin B|C|D
The get_dummies routine lets us split out these indicator variables into a DataFrame:
In [15]: full_monte['info'].str.get_dummies('|')
Out[15]:    A  B  C  D
         0  0  1  1  1
         1  0  1  0  1
         2  1  0  1  0
         3  0  1  0  1
         4  0  1  1  0
         5  0  1  1  1
With these operations as building blocks, you can construct an endless range of string
processing procedures when cleaning your data.
8.Utilize Pandas’ datetime functionalities to develop a time-indexed Series using
Timestamp and DatetimeIndex. Apply vectorized operations on date ranges, and
construct slices of time series data using both specific dates and partial string indexing.
Build an example that demonstrates how time-based indexing enhances data selection in
Pandas. —10M
Pandas builds upon all the tools just discussed to provide a Timestamp object, which
combines the ease of use of datetime and dateutil with the efficient storage and
vectorized interface of numpy.datetime64. From a group of these Timestamp objects,
Pandas can construct a DatetimeIndex that can be used to index data in a Series or
DataFrame.
For example, we can use Pandas tools to repeat the demonstration from earlier. We
can parse a flexibly formatted string date and use format codes to output the day of
the week, as follows:
In [9]: import numpy as np
import pandas as pd
date = pd.to_datetime("4th of July, 2021")
date
Out[9]: Timestamp('2021-07-04 00:00:00')
In [10]: date.strftime('%A')
Out[10]: 'Sunday'
Additionally, we can do NumPy-style vectorized operations directly on this same
object:
In [11]: date + pd.to_timedelta(np.arange(12), 'D')
Out[11]: DatetimeIndex(['2021-07-04', '2021-07-05', '2021-07-06', '2021-07-07',
'2021-07-08', '2021-07-09', '2021-07-10', '2021-07-11',
'2021-07-12', '2021-07-13', '2021-07-14', '2021-07-15'],
dtype='datetime64[ns]', freq=None)
In the next section, we will take a closer look at manipulating time series data with
the tools provided by Pandas.
Pandas Time Series: Indexing by Time
The Pandas time series tools really become useful when you begin to index data by
timestamps. For example, we can construct a Series object that has time-indexed
data:
In [12]: index = pd.DatetimeIndex(['2020-07-04', '2020-08-04',
'2021-07-04', '2021-08-04'])
data = pd.Series([0, 1, 2, 3], index=index)
data
Out[12]: 2020-07-04 0
2020-08-04 1
2021-07-04 2
2021-08-04 3
dtype: int64
And now that we have this data in a Series, we can make use of any of the Series
indexing patterns we discussed in previous chapters, passing values that can be
coerced into dates:
In [13]: data['2020-07-04':'2021-07-04']
Out[13]: 2020-07-04 0
2020-08-04 1
2021-07-04 2
dtype: int64
There are additional special date-only indexing operations, such as passing a year to
obtain a slice of all data from that year:
In [14]: data['2021']
Out[14]: 2021-07-04 2
2021-08-04 3
dtype: int64
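To see how this helps at scale, here is a sketch with a longer daily series (the dates and values are arbitrary) showing month-level partial string indexing alongside an explicit date slice:
ts = pd.Series(np.arange(365),
               index=pd.date_range('2021-01-01', periods=365, freq='D'))
ts['2021-07']                  # all of July 2021 via partial string indexing
ts['2021-03-15':'2021-03-20']  # an explicit date-range slice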
9. Apply Pandas functions such as pd.date_range, pd.period_range, and
pd.timedelta_range to construct regular sequences of dates, periods, and time deltas.
Build examples that show how frequency codes and offset aliases can be used to
develop flexible and customized time-based ranges. Utilize frequency combinations and
the pd.tseries.offsets module to model business-specific time intervals. —10M
To make creation of regular date sequences more convenient, Pandas offers a few
functions for this purpose: pd.date_range for timestamps, pd.period_range for
periods, and pd.timedelta_range for time deltas. We’ve seen that Python’s range and
NumPy’s np.arange take a start point, end point, and optional step size and return a
sequence. Similarly, pd.date_range accepts a start date, an end date, and an optional
frequency code to create a regular sequence of dates:
In [18]: pd.date_range('2015-07-03', '2015-07-10')
Out[18]: DatetimeIndex(['2015-07-03', '2015-07-04', '2015-07-05', '2015-07-06',
'2015-07-07', '2015-07-08', '2015-07-09', '2015-07-10'],
dtype='datetime64[ns]', freq='D')
Alternatively, the date range can be specified not with a start and end point, but with
a start point and a number of periods:
In [19]: pd.date_range('2015-07-03', periods=8)
Out[19]: DatetimeIndex(['2015-07-03', '2015-07-04', '2015-07-05', '2015-07-06',
'2015-07-07', '2015-07-08', '2015-07-09', '2015-07-10'],
dtype='datetime64[ns]', freq='D')
The spacing can be modified by altering the freq argument, which defaults to D. For
example, here we construct a range of hourly timestamps:
In [20]: pd.date_range('2015-07-03', periods=8, freq='H')
Out[20]: DatetimeIndex(['2015-07-03 00:00:00', '2015-07-03 01:00:00',
'2015-07-03 02:00:00', '2015-07-03 03:00:00',
'2015-07-03 04:00:00', '2015-07-03 05:00:00',
'2015-07-03 06:00:00', '2015-07-03 07:00:00'],
dtype='datetime64[ns]', freq='H')
To create regular sequences of Period or Timedelta values, the similar
pd.period_range and pd.timedelta_range functions are useful. Here are some
monthly periods:
In [21]: pd.period_range('2015-07', periods=8, freq='M')
Out[21]: PeriodIndex(['2015-07', '2015-08', '2015-09',
'2015-10', '2015-11', '2015-12',
'2016-01', '2016-02'],
dtype='period[M]')
And a sequence of durations increasing by an hour:
In [22]: pd.timedelta_range(0, periods=6, freq='H')
Out[22]: TimedeltaIndex(['0 days 00:00:00', '0 days 01:00:00', '0 days 02:00:00',
'0 days 03:00:00', '0 days 04:00:00', '0 days 05:00:00'],
dtype='timedelta64[ns]', freq='H')
All of these require an understanding of Pandas frequency codes. These codes can be
combined with numbers to specify other frequencies. For example, for a frequency of
2 hours and 30 minutes, we can combine the hour (H) and minute (T) codes as follows:
In [23]: pd.timedelta_range(0, periods=6, freq="2H30T")
Out[23]: TimedeltaIndex(['0 days 00:00:00', '0 days 02:30:00', '0 days 05:00:00',
'0 days 07:30:00', '0 days 10:00:00', '0 days 12:30:00'],
dtype='timedelta64[ns]', freq='150T')
All of these short codes refer to specific instances of Pandas time series offsets, which
can be found in the pd.tseries.offsets module. For example, we can create a business
day offset directly as follows:
In [24]: from pandas.tseries.offsets import BDay
pd.date_range('2015-07-01', periods=6, freq=BDay())
Out[24]: DatetimeIndex(['2015-07-01', '2015-07-02', '2015-07-03', '2015-07-06',
'2015-07-07', '2015-07-08'],
dtype='datetime64[ns]', freq='B')
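Other offsets in pd.tseries.offsets can model further business-specific intervals; for instance, a sketch using business month ends (the start date is arbitrary):
from pandas.tseries.offsets import BMonthEnd
pd.date_range('2015-01-01', periods=4, freq=BMonthEnd())
# DatetimeIndex(['2015-01-30', '2015-02-27', '2015-03-31', '2015-04-30'], ...)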
10. Apply your knowledge of Pandas to identify and explain key time series data
structures. Build a DatetimeIndex using pd.to_datetime() from a list of mixed-format
dates. Construct examples to convert it into a PeriodIndex, and experiment with date
subtraction to develop a TimedeltaIndex. Make use of appropriate code snippets to
model each concept clearly.
Pandas Time Series Data Structures
This section will introduce the fundamental Pandas data structures for working with
time series data:
• For timestamps, Pandas provides the Timestamp type. As mentioned before, this
is essentially a replacement for Python’s native datetime, but it’s based on the
more efficient numpy.datetime64 data type. The associated Index structure is
DatetimeIndex.
• For time periods, Pandas provides the Period type. This encodes a fixed-frequency
interval based on numpy.datetime64. The associated index structure is
PeriodIndex.
• For time deltas or durations, Pandas provides the Timedelta type. Timedelta is a
more efficient replacement for Python’s native datetime.timedelta type, and is
based on numpy.timedelta64. The associated index structure is TimedeltaIndex.
The most fundamental of these date/time objects are the Timestamp and DatetimeIndex
objects. While these class objects can be invoked directly, it is more common to
use the pd.to_datetime function, which can parse a wide variety of formats. Passing
a single date to pd.to_datetime yields a Timestamp; passing a series of dates by
default yields a DatetimeIndex, as you can see here:
In [15]: from datetime import datetime
dates = pd.to_datetime([datetime(2021, 7, 3), '4th of July, 2021',
'2021-Jul-6', '07-07-2021', '20210708'])
dates
Out[15]: DatetimeIndex(['2021-07-03', '2021-07-04', '2021-07-06', '2021-07-07',
'2021-07-08'],
dtype='datetime64[ns]', freq=None)
Any DatetimeIndex can be converted to a PeriodIndex with the to_period function,
with the addition of a frequency code; here we’ll use 'D' to indicate daily frequency:
In [16]: dates.to_period('D')
Out[16]: PeriodIndex(['2021-07-03', '2021-07-04', '2021-07-06', '2021-07-07',
'2021-07-08'],
dtype='period[D]')
A TimedeltaIndex is created, for example, when a date is subtracted from another:
In [17]: dates - dates[0]
Out[17]: TimedeltaIndex(['0 days', '1 days', '3 days', '4 days', '5 days'],
dtype='timedelta64[ns]', freq=None)
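To round out the picture, a PeriodIndex can be converted back to a DatetimeIndex with to_period's counterpart, to_timestamp; a minimal sketch using the dates object above:
dates.to_period('D').to_timestamp()   # back to a DatetimeIndex of daily timestamps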
Marking rubric:
Criteria                                         | Max Marks | Description
Identify and Explain Time Series Data Structures | 2         | Correctly identifies and explains Timestamp, Period, Timedelta, and their corresponding index structures.
Apply pd.to_datetime() to Create DatetimeIndex   | 2         | Demonstrates creating a DatetimeIndex from a list of dates using pd.to_datetime().
Construct and Convert to PeriodIndex             | 2         | Uses to_period() with appropriate frequency to convert DatetimeIndex to PeriodIndex.
Develop TimedeltaIndex through Date Arithmetic   | 2         | Demonstrates date subtraction and shows the resulting TimedeltaIndex with correct interpretation.
Organize Examples Clearly with Proper Syntax     | 2         | Provides clear, correctly formatted code examples with brief explanations of the output and purpose.
11. Identify and explain how pandas.eval() enhances performance in DataFrame
operations. Apply and experiment with pd.eval() to construct arithmetic, comparison,
and bitwise operations using string expressions. Utilize examples to model efficient
computations on large datasets, and make use of object attribute access within
evaluation strings. —10M
pandas.eval for Efficient Operations
The eval function in Pandas uses string expressions to efficiently compute operations
on DataFrame objects. For example, consider the following data:
In [6]: import numpy as np
import pandas as pd
rng = np.random.default_rng(42)
nrows, ncols = 100000, 100
df1, df2, df3, df4 = (pd.DataFrame(rng.random((nrows, ncols)))
for i in range(4))
To compute the sum of all four DataFrames using the typical Pandas approach, we can
just write the sum:
In [7]: %timeit df1 + df2 + df3 + df4
Out[7]: 73.2 ms ± 6.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The same result can be computed via pd.eval by constructing the expression as a
string:
In [8]: %timeit pd.eval('df1 + df2 + df3 + df4')
Out[8]: 34 ms ± 4.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The eval version of this expression runs in about half the time (and uses much less
memory), while giving the same result:
In [9]: np.allclose(df1 + df2 + df3 + df4,
pd.eval('df1 + df2 + df3 + df4'))
Out[9]: True
pd.eval supports a wide range of operations. To demonstrate these, we’ll use the following
integer data:
In [10]: df1, df2, df3, df4, df5 = (pd.DataFrame(rng.integers(0, 1000, (100, 3)))
for i in range(5))
Here’s a summary of the operations pd.eval supports:
Arithmetic operators
pd.eval supports all arithmetic operators. For example:
In [11]: result1 = -df1 * df2 / (df3 + df4) - df5
result2 = pd.eval('-df1 * df2 / (df3 + df4) - df5')
np.allclose(result1, result2)
Out[11]: True
Comparison operators
pd.eval supports all comparison operators, including chained expressions:
In [12]: result1 = (df1 < df2) & (df2 <= df3) & (df3 != df4)
result2 = pd.eval('df1 < df2 <= df3 != df4')
np.allclose(result1, result2)
Out[12]: True
Bitwise operators
pd.eval supports the & and | bitwise operators:
In [13]: result1 = (df1 < 0.5) & (df2 < 0.5) | (df3 < df4)
result2 = pd.eval('(df1 < 0.5) & (df2 < 0.5) | (df3 < df4)')
np.allclose(result1, result2)
Out[13]: True
In addition, it supports the use of the literal and and or in Boolean expressions:
In [14]: result3 = pd.eval('(df1 < 0.5) and (df2 < 0.5) or (df3 < df4)')
np.allclose(result1, result3)
Out[14]: True
Object attributes and indices
pd.eval supports access to object attributes via the obj.attr syntax and indexes
via the obj[index] syntax:
In [15]: result1 = df2.T[0] + df3.iloc[1]
result2 = pd.eval('df2.T[0] + df3.iloc[1]')
np.allclose(result1, result2)
Out[15]: True
Other operations
Other operations, such as function calls, conditional statements, loops, and other
more involved constructs are currently not implemented in pd.eval. If you’d like
to execute these more complicated types of expressions, you can use the
NumExpr library itself.
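For reference, here is a minimal sketch of calling NumExpr directly (this assumes the numexpr package is installed; the arrays are arbitrary):
import numexpr
import numpy as np
a = np.random.rand(1000000)
b = np.random.rand(1000000)
result = numexpr.evaluate('2*a + 3*b')   # compiled elementwise expression, no large temporaries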
12. Using the DataFrame.eval() method in Pandas, explain how it simplifies
column-wise operations compared to traditional methods. Provide an example to
demonstrate the calculation of an expression such as (A + B) / (C - 1) using
eval(), and compare it with the traditional approach. Additionally, explain how the
inplace parameter works when assigning values to columns and illustrate the use of the
@ symbol to reference local Python variables in a DataFrame.eval() expression. —10M
DataFrame.eval for Column-Wise Operations
Just as Pandas has a top-level pd.eval function, DataFrame objects have an eval
method that works in similar ways. The benefit of the eval method is that columns
can be referred to by name. We’ll use this labeled array as an example:
In [16]: df = pd.DataFrame(rng.random((1000, 3)), columns=['A', 'B', 'C'])
df.head()
Out[16]: A B C
0 0.850888 0.966709 0.958690
1 0.820126 0.385686 0.061402
2 0.059729 0.831768 0.652259
3 0.244774 0.140322 0.041711
4 0.818205 0.753384 0.578851
Using pd.eval as in the previous section, we can compute expressions with the three
columns like this:
In [17]: result1 = (df['A'] + df['B']) / (df['C'] - 1)
result2 = pd.eval("(df.A + df.B) / (df.C - 1)")
np.allclose(result1, result2)
Out[17]: True
The DataFrame.eval method allows much more succinct evaluation of expressions
with the columns:
In [18]: result3 = df.eval('(A + B) / (C - 1)')
np.allclose(result1, result3)
Out[18]: True
Notice here that we treat column names as variables within the evaluated expression,
and the result is what we would wish.
Assignment in DataFrame.eval
In addition to the options just discussed, DataFrame.eval also allows assignment to
any column. Let’s use the DataFrame from before, which has columns 'A', 'B', and
'C':
In [19]: df.head()
Out[19]: A B C
0 0.850888 0.966709 0.958690
1 0.820126 0.385686 0.061402
2 0.059729 0.831768 0.652259
3 0.244774 0.140322 0.041711
4 0.818205 0.753384 0.578851
We can use df.eval to create a new column 'D' and assign to it a value computed
from the other columns:
In [20]: df.eval('D = (A + B) / C', inplace=True)
df.head()
Out[20]: A B C D
0 0.850888 0.966709 0.958690 1.895916
1 0.820126 0.385686 0.061402 19.638139
2 0.059729 0.831768 0.652259 1.366782
3 0.244774 0.140322 0.041711 9.232370
4 0.818205 0.753384 0.578851 2.715013
In the same way, any existing column can be modified:
In [21]: df.eval('D = (A - B) / C', inplace=True)
df.head()
Out[21]: A B C D
0 0.850888 0.966709 0.958690 -0.120812
1 0.820126 0.385686 0.061402 7.075399
2 0.059729 0.831768 0.652259 -1.183638
3 0.244774 0.140322 0.041711 2.504142
4 0.818205 0.753384 0.578851 0.111982
Local Variables in DataFrame.eval
The DataFrame.eval method supports an additional syntax that lets it work with
local Python variables. Consider the following:
In [22]: column_mean = df.mean(axis=1)
result1 = df['A'] + column_mean
result2 = df.eval('A + @column_mean')
np.allclose(result1, result2)
Out[22]: True
The @ character here marks a variable name rather than a column name, and lets you
efficiently evaluate expressions involving the two “namespaces”: the namespace of
columns, and the namespace of Python objects. Notice that this @ character is only
supported by the DataFrame.eval method, not by the pandas.eval function, because
the pandas.eval function only has access to the one (Python) namespace.
13. How would you apply the DataFrame.query method in Pandas to solve data filtering
problems? Build a comparison between this method and traditional filtering techniques.
Choose an appropriate scenario to model its use, and explain how to select the best
approach for performance. Identify the advantages of utilizing the query method and
organize your explanation by discussing its benefits in terms of memory efficiency and
readability. —10M
The DataFrame.query Method
The DataFrame has another method based on evaluated strings, called query. Consider
the following:
In [23]: result1 = df[(df.A < 0.5) & (df.B < 0.5)]
result2 = pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]')
np.allclose(result1, result2)
Out[23]: True
As with the example used in our discussion of DataFrame.eval, this is an expression
involving columns of the DataFrame. However, it cannot be expressed using the
DataFrame.eval syntax! Instead, for this type of filtering operation, you can use the query
method:
In [24]: result2 = df.query('A < 0.5 and B < 0.5')
np.allclose(result1, result2)
Out[24]: True
In addition to being a more efficient computation, compared to the masking expression
this is much easier to read and understand. Note that the query method also
accepts the @ flag to mark local variables:
In [25]: Cmean = df['C'].mean()
result1 = df[(df.A < Cmean) & (df.B < Cmean)]
result2 = df.query('A < @Cmean and B < @Cmean')
np.allclose(result1, result2)
Out[25]: True
Performance: When to Use These Functions
When considering whether to use eval and query, there are two considerations: computation
time and memory use. Memory use is the most predictable aspect. As already
mentioned, every compound expression involving NumPy arrays or Pandas Data
Frames will result in implicit creation of temporary arrays. For example, this:
In [26]: x = df[(df.A < 0.5) & (df.B < 0.5)]
is roughly equivalent to this:
In [27]: tmp1 = df.A < 0.5
tmp2 = df.B < 0.5
tmp3 = tmp1 & tmp2
x = df[tmp3]
If the size of the temporary DataFrames is significant compared to your available system
memory (typically several gigabytes), then it’s a good idea to use an eval or
query expression. You can check the approximate size of your array in bytes using
this:
In [28]: df.values.nbytes
Out[28]: 32000
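A convenient alternative, which also accounts for the index and per-column dtypes, is DataFrame.memory_usage (a quick sketch):
df.memory_usage(deep=True).sum()   # total bytes, including the index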
On the performance side, eval can be faster even when you are not maxing out your
system memory. The issue is how your temporary objects compare to the size of the
L1 or L2 CPU cache on your system (typically a few megabytes); if they are much bigger,
then eval can avoid some potentially slow movement of values between the different
memory caches. In practice, I find that the difference in computation time
between the traditional methods and the eval/query method is usually not significant—
if anything, the traditional method is faster for smaller arrays! The benefit of
eval/query is mainly in the saved memory, and the sometimes cleaner syntax they
offer.