Cheat Sheet: The Pandas DataFrame Object
I: Preliminaries – Get Your Data Into A DataFrame

[figure: a DataFrame is a set of Series columns ("series of data") sharing a common row index]

Series object: an ordered, one-dimensional array of data with an index. All the data in a Series is of the same data type. Series arithmetic is vectorised after first aligning the Series index for each of the operands.
s1 = Series(range(0, 4)) # -> 0, 1, 2, 3
s2 = Series(range(1, 5)) # -> 1, 2, 3, 4
s3 = s1 + s2             # -> 1, 3, 5, 7

Get a DataFrame from a SQL database
from sqlalchemy import create_engine
engine = create_engine( # e.g. a MySQL URL
    'mysql+pymysql://'
    + 'USER:PASSWORD@HOST/DATABASE')
df = pd.read_sql_table('table', engine)

Get a DataFrame from a Python dictionary
# default --- assume data is in columns
df = DataFrame({
    'col0' : [1.0, 2.0, 3.0, 4.0],
    'col1' : [100, 200, 300, 400]
})
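A minimal sketch of the index alignment described above, with made-up labels: labels present in only one operand come back as NaN.

```python
import pandas as pd

# two Series with partially overlapping indexes
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# addition aligns on index labels first; labels
# found in only one operand yield NaN
s3 = s1 + s2
# 'a' -> NaN, 'b' -> 12.0, 'c' -> 23.0, 'd' -> NaN
```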
Version 15 January 2017 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter]
Get a DataFrame from data in a Python dictionary
# --- use the helper method for data in rows
df = DataFrame.from_dict({ # data by row
    # rows as python dictionaries
    'row0' : {'col0':0, 'col1':'A'},
    'row1' : {'col0':1, 'col1':'B'}
}, orient='index')

df = DataFrame.from_dict({ # data by row
    # rows as python lists
    'row0' : [1, 1+1j, 'A'],
    'row1' : [2, 2+2j, 'B']
}, orient='index')

Create play/fake data (useful for testing)
# --- simple - default integer indexes
df = DataFrame(np.random.rand(50, 5))

# --- with a time-stamp row index:
df = DataFrame(np.random.rand(500, 5))
df.index = pd.date_range('1/1/2005',
    periods=len(df), freq='M')

# --- with alphabetic row and col indexes
#     and a "groupable" variable
import string
import random
r = 52 # note: min r is 1; max r is 52
c = 5
df = DataFrame(np.random.randn(r, c),
    columns=['col' + str(i)
        for i in range(c)],
    index=list((string.ascii_uppercase +
        string.ascii_lowercase)[0:r]))
df['group'] = list(
    ''.join(random.choice('abcde')
        for _ in range(r)))

Apply numpy mathematical functions to columns
df['log_data'] = np.log(df['col1'])
Note: there are many more numpy math functions.
Hint: prefer pandas math over numpy where you can.

Working with the whole DataFrame

Peek at the DataFrame contents/structure
df.info()           # index & data types
dfh = df.head(i)    # get first i rows
dft = df.tail(i)    # get last i rows
dfs = df.describe() # summary stats for cols
top_left_corner_df = df.iloc[:4, :4]

DataFrame non-indexing attributes
dfT = df.T     # transpose rows and cols
l = df.axes    # list of row and col indexes
(r_idx, c_idx) = df.axes # from above
s = df.dtypes  # Series of column data types
b = df.empty   # True for an empty DataFrame
i = df.ndim    # number of axes (it is 2)
t = df.shape   # (row-count, column-count)
i = df.size    # row-count * column-count
a = df.values  # get a numpy array for df

DataFrame utility methods
df = df.copy() # copy a DataFrame
df = df.rank() # rank each col (default)
df = df.sort_values(by=col)
df = df.sort_values(by=[col1, col2])
df = df.sort_index()
df = df.astype(dtype) # type conversion

DataFrame iteration methods
df.iteritems() # (col-index, Series) pairs
df.iterrows()  # (row-index, Series) pairs

# example ... iterating over columns
for (name, series) in df.iteritems():
    print('Col name: ' + str(name))
    print('First value: ' +
        str(series.iat[0]) + '\n')

Test if column index values are unique/monotonic
if df.columns.is_unique: pass # ...
b = df.columns.is_monotonic_increasing
b = df.columns.is_monotonic_decreasing
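A runnable sketch combining the dictionary constructor and column iteration above; DataFrame.items() is the newer spelling of iteritems().

```python
import pandas as pd

# default dict constructor: keys become columns
df = pd.DataFrame({'col0': [1.0, 2.0],
                   'col1': [100, 200]})

# iterate columns: (col-name, Series) pairs
names = [name for name, series in df.items()]
assert names == ['col0', 'col1']
assert df.shape == (2, 2)  # (rows, cols)
```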
Working with rows

Get the row index and labels
idx = df.index        # get row index
label = df.index[0]   # first row label
label = df.index[-1]  # last row label
l = df.index.tolist() # get as a list
a = df.index.values   # get as an array

Change the (row) index
df.index = idx        # new ad hoc index
df = df.set_index('A')        # col A new index
df = df.set_index(['A', 'B']) # MultiIndex
df = df.reset_index() # replace old w new
# note: old index stored as a col in df
df.index = range(len(df)) # set with list
df = df.reindex(index=range(len(df)))
df = df.set_index(keys=['r1', 'r2', 'etc'])
df.rename(index={'old':'new'}, inplace=True)

Adding rows
df = original_df.append(more_rows_in_df)
Hint: convert the row to a DataFrame and then append. Both DataFrames should have the same column labels.

Dropping rows (by name)
df = df.drop('row_label')
df = df.drop(['row1','row2']) # multi-row

Boolean row selection by values in a column
df = df[df['col2'] >= 0.0]
df = df[(df['col3'] >= 1.0) | (df['col1'] < 0.0)]
df = df[df['col'].isin([1, 2, 5, 7, 11])]
df = df[~df['col'].isin([1, 2, 5, 7, 11])]
df = df[df['col'].str.contains('hello')]
Trap: the bitwise "or", "and" and "not" operators (i.e. | & ~) are co-opted as Boolean operators on a Series of Booleans.
Trap: you need parentheses around comparisons.

Selecting rows using isin over multiple columns
# fake up some data
data = {1:[1,2,3], 2:[1,4,9], 3:[1,8,27]}
df = DataFrame(data)

# multi-column isin
lf = {1:[1, 3], 3:[8, 27]} # look for
f = df[df[list(lf)].isin(lf).all(axis=1)]

Selecting rows using an index
idx = df[df['col'] >= 2].index
print(df.loc[idx])

Select a slice of rows by integer position
[inclusive-from : exclusive-to [: step]]
default start is 0; default end is len(df)
df = df[:]    # copy the DataFrame
df = df[0:2]  # rows 0 and 1
df = df[-1:]  # the last row
df = df[2:3]  # row 2 (the third row)
df = df[:-1]  # all but the last row
df = df[::2]  # every 2nd row (0 2 ..)
Trap: a single integer without a colon is a column label for integer numbered columns.

Select a slice of rows by label/index
[inclusive-from : inclusive-to [: step]]
df = df['a':'c'] # rows 'a' through 'c'
Trap: doesn't work on integer labelled rows

Append a row of column totals to a DataFrame
# Option 1: use a dictionary comprehension
sums = {col: df[col].sum() for col in df}
sums_df = DataFrame(sums, index=['Total'])
df = df.append(sums_df)

# Option 2: all done with pandas
df = df.append(DataFrame(df.sum(),
    columns=['Total']).T)

Iterating over DataFrame rows
for (index, row) in df.iterrows(): pass
Trap: row data type may be coerced.

Sorting DataFrame rows by values
df = df.sort_values(by=df.columns[0],
    ascending=False)
df.sort_values(by=['col1', 'col2'],
    inplace=True)

Sort DataFrame by its row index
df.sort_index(inplace=True) # sort by row
df = df.sort_index(ascending=False)

Random selection of rows
import random as r
k = 20 # pick a number
selection = r.sample(range(len(df)), k)
df_sample = df.iloc[selection, :]
Note: this sample is not sorted

Drop duplicates in the row index
df['index'] = df.index # 1 create new col
df = df.drop_duplicates(subset='index',
    keep='last')       # 2 use new col
del df['index']        # 3 del the col
df.sort_index(inplace=True) # 4 tidy up

Test if two DataFrames have the same row index
len(a)==len(b) and all(a.index==b.index)

Get the integer position of a row or col index label
i = df.index.get_loc('row_label')
Trap: index.get_loc() returns an integer for a unique match. If not a unique match, it may return a slice or mask.

Get integer position of rows that meet condition
a = np.where(df['col'] >= 2) # numpy array

Test if the row index values are unique/monotonic
if df.index.is_unique: pass # ...
b = df.index.is_monotonic_increasing
b = df.index.is_monotonic_decreasing

Find row index duplicates
if df.index.has_duplicates:
    print(df.index.duplicated())
Note: similar works for column label duplicates.
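A minimal demonstration of the iterrows() coercion trap noted above, using made-up columns: the yielded row Series takes a single common dtype, so ints arrive as floats.

```python
import pandas as pd

# a frame mixing int and float columns
df = pd.DataFrame({'i': [1, 2], 'f': [1.5, 2.5]})

for index, row in df.iterrows():
    # the row Series is all-float here, so the
    # ints have been coerced
    assert isinstance(row['i'], float)

# the column itself is still integer typed
assert df['i'].dtype.kind == 'i'
```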
Working with cells

Selecting a cell by row and column labels
value = df.at['row', 'col']
value = df.loc['row', 'col']
value = df['col'].at['row'] # tricky
Note: .at[] is the fastest label-based scalar lookup

Setting a cell by row and column labels
df.at['row', 'col'] = value
df.loc['row', 'col'] = value
df['col'].at['row'] = value # tricky

Selecting and slicing on labels
df = df.loc['row1':'row3', 'col1':'col3']
Note: the "to" on this slice is inclusive.

Setting a cross-section by labels
df.loc['A':'C', 'col1':'col3'] = np.nan
df.loc[1:2, 'col1':'col2'] = np.zeros((2,2))
df.loc[1:2, 'A':'C'] = othr.loc[1:2, 'A':'C']
Remember: inclusive "to" in the slice

Selecting a cell by integer position
value = df.iat[9, 3]  # [row, col]
value = df.iloc[0, 0] # [row, col]
value = df.iloc[len(df)-1, len(df.columns)-1]

Selecting a range of cells by int position
df = df.iloc[2:4, 2:4] # subset of the df
df = df.iloc[:5, :5]   # top left corner
s = df.iloc[5, :]      # return row as Series
df = df.iloc[5:6, :]   # return row as row
Note: exclusive "to" – same as python list slicing.

Summary: selecting using the DataFrame index

Using the DataFrame index to select columns
s = df['col_label']    # returns Series
df = df[['col_label']] # returns DataFrame
df = df[['L1', 'L2']]  # select with list
df = df[index]         # select with index
df = df[s]             # select with Series
Note: the difference in return type between the first two examples above is based on the argument type (scalar vs list).

Using the DataFrame index to select rows
df = df['from':'inc_to'] # label slice
df = df[3:7]             # integer slice
df = df[df['col'] > 0.5] # Boolean Series
df = df.loc['label']     # single label
df = df.loc[container]   # lab list/Series
df = df.loc['from':'to'] # inclusive slice
df = df.loc[bs]          # Boolean Series
df = df.iloc[0]          # single integer
df = df.iloc[container]  # int list/Series
df = df.iloc[0:5]        # exclusive slice
Trap: .ix (label access with integer fallback) is deprecated; use .loc or .iloc instead.

Using the DataFrame index to select a cross-section
# r and c can be scalar, list or slice
df.loc[r, c]  # label accessor (row, col)
df.iloc[r, c] # integer accessor
df[c].iloc[r] # chained – also for .loc

Using the DataFrame index to select a cell
# r and c must be label or integer
df.at[r, c]  # fast scalar label accessor
df.iat[r, c] # fast scalar int accessor
df[c].iat[r] # chained – also for .at
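A quick sketch, on toy data, contrasting the inclusive label slice (.loc) with the exclusive integer slice (.iloc):

```python
import pandas as pd

df = pd.DataFrame({'col': [10, 20, 30, 40]},
                  index=['a', 'b', 'c', 'd'])

# label slices are inclusive of the "to" label ...
assert list(df.loc['a':'c', 'col']) == [10, 20, 30]

# ... while integer-position slices are exclusive
assert list(df.iloc[0:2]['col']) == [10, 20]
```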
Joining/Combining DataFrames

Three ways to join two DataFrames:
• merge (a database/SQL-like join operation)
• concat (stack side by side or one on top of the other)
• combine_first (splice the two together, choosing values from one over the other)

Merge on indexes
df_new = pd.merge(left=df1, right=df2,
    how='outer', left_index=True,
    right_index=True)
How: 'left', 'right', 'outer', 'inner'
How: outer=union/all; inner=intersection

Merge on columns
df_new = pd.merge(left=df1, right=df2,
    how='left', left_on='col1',
    right_on='col2')
Trap: when joining on columns, the indexes of the passed DataFrames are ignored.
Trap: many-to-many merges on a column can result in an explosion of associated data.

Join on indexes (another way of merging)
df_new = df1.join(other=df2, on='col1',
    how='outer')
df_new = df1.join(other=df2, on=['a','b'],
    how='outer')
Note: DataFrame.join() joins on indexes by default. DataFrame.merge() joins on common columns by default.

Simple concatenation is often the best
df = pd.concat([df1,df2], axis=0) # top/bottom
df = df1.append([df2, df3])       # top/bottom
df = pd.concat([df1,df2], axis=1) # left/right
Trap: you can end up with duplicate rows or cols
Note: concat has an ignore_index parameter

Combine_first
df = df1.combine_first(other=df2)

# multi-combine with python reduce()
from functools import reduce # python 3
df = reduce(lambda x, y: x.combine_first(y),
    [df1, df2, df3, df4, df5])
Uses the non-null values from df1. The index of the combined DataFrame will be the union of the indexes from df1 and df2.

Grouping
gb = df.groupby('cat')       # by one column
gb = df.groupby(['c1','c2']) # by 2 cols
gb = df.groupby(level=0)     # multi-index gb
gb = df.groupby(level=['a','b']) # mi gb
print(gb.groups)
Note: groupby() returns a pandas groupby object
Note: the groupby object attribute .groups contains a dictionary mapping of the groups.
Trap: NaN values in the group key are automatically dropped – there will never be a NA group.

Iterating groups – usually not needed
for name, group in gb:
    print(name)
    print(group)

Selecting a group
dfa = df.groupby('cat').get_group('a')
dfb = df.groupby('cat').get_group('b')

Applying an aggregating function
# apply to a single column ...
s = df.groupby('cat')['col1'].sum()
s = df.groupby('cat')['col1'].agg(np.sum)
# apply to every column in the DataFrame
s = df.groupby('cat').agg(np.sum)
df_summary = df.groupby('cat').describe()
df_row_1s = df.groupby('cat').head(1)
Note: aggregating functions reduce the dimension by one – they include: mean, sum, size, count, std, var, sem, describe, first, last, min, max

Applying multiple aggregating functions
gb = df.groupby('cat')
# apply multiple functions to one column
dfx = gb['col2'].agg([np.sum, np.mean])
# apply multiple fns to multiple cols
dfy = gb.agg({
    'cat': np.count_nonzero,
    'col1': [np.sum, np.mean, np.std],
    'col2': [np.min, np.max]
})
Note: gb['col2'] above is shorthand for df.groupby('cat')['col2'], without the need for regrouping.

Transforming functions
# transform to group z-scores, which have
# a group mean of 0, and a std dev of 1.
zscore = lambda x: (x - x.mean()) / x.std()
dfz = df.groupby('cat').transform(zscore)
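A small sketch of the NaN-group trap noted above, on toy data: the row whose group key is NaN silently disappears from the aggregation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'cat': ['a', 'a', np.nan, 'b'],
                   'val': [1, 2, 3, 4]})

s = df.groupby('cat')['val'].sum()

# the NaN-keyed row (val=3) was dropped
assert sorted(s.index) == ['a', 'b']
assert s['a'] == 3 and s['b'] == 4
```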
Group by a row index (non-hierarchical index)
df = df.set_index(keys='cat')
s = df.groupby(level=0)['col1'].sum()
dfg = df.groupby(level=0).sum()

Pivot Tables: working with long and wide data

These features work with, and often create, hierarchical or multi-level indexes (the pandas MultiIndex is powerful and complex).

Pivot, unstack, stack and melt
Pivot tables move from long format to wide format data.
# Let's start with data in long format
from io import StringIO # python 3
# from StringIO import StringIO # python 2.7
data = """Date,Pollster,State,Party,Est
13/03/2014, Newspoll, NSW, red, 25
13/03/2014, Newspoll, NSW, blue, 28
13/03/2014, Newspoll, Vic, red, 24
13/03/2014, Newspoll, Vic, blue, 23
13/03/2014, Galaxy, NSW, red, 23
13/03/2014, Galaxy, NSW, blue, 24
13/03/2014, Galaxy, Vic, red, 26
13/03/2014, Galaxy, Vic, blue, 25
13/03/2014, Galaxy, Qld, red, 21
13/03/2014, Galaxy, Qld, blue, 27"""
df = pd.read_csv(StringIO(data),
    header=0, skipinitialspace=True)

# pivot to wide format on the 'Party' column
# 1st: set up a MultiIndex for the other cols
df1 = df.set_index(['Date', 'Pollster',
    'State'])
# 2nd: do the pivot
wide1 = df1.pivot(columns='Party')

# unstack to wide format on State / Party
# 1st: MultiIndex all but the Values col
df2 = df.set_index(['Date', 'Pollster',
    'State', 'Party'])
# 2nd: unstack a column to go wide on it
wide2 = df2.unstack('State')
wide3 = df2.unstack() # pop last index

# use stack() to get back to long format
long1 = wide1.stack()
# then use reset_index() to remove the MultiIndex
long2 = long1.reset_index()

# or melt() back to long format
# 1st: flatten the column index
wide1.columns = ['_'.join(col).strip()
    for col in wide1.columns.values]
# 2nd: remove the MultiIndex
wdf = wide1.reset_index()
# 3rd: melt away
long3 = pd.melt(wdf,
    value_vars=['Est_blue', 'Est_red'],
    var_name='Party',
    id_vars=['Date', 'Pollster', 'State'])

Note: see the documentation – there are many arguments to these methods.

Working with dates, times and their indexes

Dates and time – points and spans
With its focus on time-series data, pandas has a suite of tools for managing dates and time: either as a point in time (a Timestamp) or as a span of time (a Period).
t = pd.Timestamp('2013-01-01')
t = pd.Timestamp('2013-01-01 21:15:06')
t = pd.Timestamp('2013-01-01 21:15:06.7')
p = pd.Period('2013-01-01', freq='M')
Note: Timestamps must fall between the years 1678 and 2261 (check Timestamp.min and Timestamp.max).

A Series of Timestamps or Periods
ts = ['2015-04-01 13:17:27',
      '2014-04-02 13:17:29']

# Series of Timestamps (good)
s = pd.to_datetime(pd.Series(ts))

# Series of Periods (often not so good)
s = pd.Series([pd.Period(x, freq='M')
    for x in ts])
s = pd.Series(
    pd.PeriodIndex(ts, freq='S'))
Note: while Periods make a very useful index, they may be less useful in a Series.

From non-standard strings to Timestamps
t = ['09:08:55.7654-JAN092002',
     '15:42:02.6589-FEB082016']
s = pd.Series(pd.to_datetime(t,
    format="%H:%M:%S.%f-%b%d%Y"))
Also: %B = full month name; %m = numeric month; %y = year without century; and more …

Dates and time – stamps and spans as indexes
An index of Timestamps is a DatetimeIndex.
An index of Periods is a PeriodIndex.
date_strs = ['2014-01-01', '2014-04-01',
             '2014-07-01', '2014-10-01']

dti = pd.DatetimeIndex(date_strs)

pid = pd.PeriodIndex(date_strs, freq='D')
pim = pd.PeriodIndex(date_strs, freq='M')
piq = pd.PeriodIndex(date_strs, freq='Q')

print(pid[1] - pid[0]) # 90 days
print(pim[1] - pim[0]) # 3 months
print(piq[1] - piq[0]) # 1 quarter

time_strs = ['2015-01-01 02:10:40.12345',
             '2015-01-01 02:10:50.67890']
pis = pd.PeriodIndex(time_strs, freq='U')

df.index = pd.period_range('2015-01',
    periods=len(df), freq='M')

dti = pd.to_datetime(['04-01-2012'],
    dayfirst=True) # Australian date format
pi = pd.period_range('1960-01-01',
    '2015-12-31', freq='M')

Hint: unless you are working in less than seconds, prefer PeriodIndex over DatetimeIndex.
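The long-to-wide round trip described above can be sketched with a toy subset of the polling numbers (unstack to go wide, stack to come back):

```python
import pandas as pd

# long format: one row per (state, party) observation
long_df = pd.DataFrame({
    'State': ['NSW', 'NSW', 'Vic', 'Vic'],
    'Party': ['red', 'blue', 'red', 'blue'],
    'Est':   [25, 28, 24, 23]})

# long -> wide: one column per party
wide = (long_df.set_index(['State', 'Party'])['Est']
               .unstack('Party'))
assert list(wide.columns) == ['blue', 'red']
assert wide.loc['NSW', 'red'] == 25

# wide -> long again with stack()
long_again = wide.stack()
assert long_again['Vic', 'blue'] == 23
```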
Period frequency constants (not a complete list)                    Upsampling and downsampling
Name                Description                                     # upsample from quarterly to monthly
U                   Microsecond                                     pi = pd.period_range('1960Q1',
L                   Millisecond                                             periods=220, freq='Q')
S                   Second                                          df = DataFrame(np.random.rand(len(pi),5),
                                                                            index=pi)
T                   Minute
                                                                    dfm = df.resample('M', convention='end')
H                  Hour
D                  Calendar day
B                  Business day
W-{MON, TUE, …}    Week ending on …
MS                 Calendar start of month
M                  Calendar end of month
QS-{JAN, FEB, …}   Quarter start with year starting (QS – December)
Q-{JAN, FEB, …}    Quarter end with year ending (Q – December)
AS-{JAN, FEB, …}   Year start (AS – December)
A-{JAN, FEB, …}    Year end (A – December)

# use ffill or bfill to fill with values

# downsample from monthly to quarterly
dfq = dfm.resample('Q').sum()

From DatetimeIndex to Python datetime objects
dti = pd.DatetimeIndex(pd.date_range(
   start='1/1/2011', periods=4, freq='M'))
s = Series([1,2,3,4], index=dti)
na = dti.to_pydatetime()     # numpy array
na = s.index.to_pydatetime() # numpy array

From Timestamps to Python dates or times
df['date'] = [x.date() for x in df['TS']]
df['time'] = [x.time() for x in df['TS']]
Note: converts to datetime.date or datetime.time, but
does not convert to datetime.datetime.

Time zones
t = ['2015-06-30 00:00:00',
     '2015-12-31 00:00:00']
dti = pd.to_datetime(t
   ).tz_localize('Australia/Canberra')
dti = dti.tz_convert('UTC')
ts = pd.Timestamp('now',
   tz='Europe/London')
# get a list of all time zones
import pytz
for tz in pytz.all_timezones:
    print(tz)
Note: by default, Timestamps are created without time
zone information.

Row selection with a time-series index
# start with the play data above
idx = pd.period_range('2015-01',
     periods=len(df), freq='M')
df.index = idx
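A minimal, runnable sketch of the resampling and time-zone idioms above; the frame dfm and its monthly values are made up for illustration, and resample(...).sum() replaces the older how='sum' argument, which was removed in later pandas releases.

```python
import numpy as np
import pandas as pd

# a made-up monthly series covering 2015
dfm = pd.DataFrame(
    {'obs': np.arange(12.0)},
    index=pd.date_range('2015-01-31', periods=12, freq='M'))

# downsample from monthly to quarterly
dfq = dfm.resample('Q').sum()   # four rows, one per quarter

# localise naive timestamps, then convert between zones
dti = pd.to_datetime(['2015-06-30 00:00:00',
                      '2015-12-31 00:00:00'])
dti = dti.tz_localize('Australia/Canberra').tz_convert('UTC')
```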
Line plot
df1 = df.cumsum()
ax = df1.plot()
# followed by the same save/close code as above
plt.close(fig)

Box plot
ax = df.plot.box(vert=False)
# followed by the same code as above
ax = df.boxplot(column='c1', by='c2')

Histogram
ax = df['A'].plot.hist(bins=20)

Multiple histograms
ax = df.plot.hist(bins=25, alpha=0.5)
ax = df.plot.hist(bins=25, stacked=True)

Density plot
ax = df.plot.kde()

Scatter plot
ax = df.plot.scatter(x='A', y='C')

Pie chart
s = pd.Series(data=[10, 20, 30],
     index=['dogs', 'cats', 'birds'])
ax = s.plot.pie(autopct='%.1f')
ax.set_title('Pie Chart')
ax.set_aspect(1)  # make it round
ax.set_ylabel('') # remove default label
fig = ax.figure
fig.set_size_inches(8, 3)
fig.savefig('filename.png', dpi=125)
plt.close(fig)
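The snippets above assume an ax whose figure is then sized and saved; a self-contained sketch of that pattern (using the non-interactive Agg backend so it runs without a display, and an illustrative filename) looks like:

```python
import matplotlib
matplotlib.use('Agg')  # render without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# fake data: three random walks
df = pd.DataFrame(np.random.randn(50, 3), columns=list('ABC'))
df1 = df.cumsum()

ax = df1.plot()  # line plot
fig = ax.figure
fig.set_size_inches(8, 3)
fig.tight_layout(pad=1)
fig.savefig('line_plot.png', dpi=125)  # illustrative filename
plt.close(fig)
```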
Change the range plotted
ax.set_xlim([-5, 5])
lower, upper = ax.get_ylim()
ax.set_ylim([lower-1, upper+1])

Add a footnote
# after the fig.tight_layout(pad=1)
fig.text(0.99, 0.01, 'Footnote',
    ha='right', va='bottom',
    fontsize='x-small',
    fontstyle='italic', color='#999999')

A line and bar on the same chart
In matplotlib, bar charts visualise categorical or discrete
data. Line charts visualise continuous data. This makes
it hard to get bars and lines on the same chart. Typically
combined charts either have too many labels, and/or the
lines and bars are misaligned or missing. You need to
trick matplotlib a bit … pandas makes this tricking easier.

# start with fake percentage growth data
s = pd.Series(np.random.normal(
      1.02, 0.015, 40))
s = s.cumprod()
dfg = (pd.concat([s / s.shift(1),
      s / s.shift(4)], axis=1) * 100) - 100
dfg.columns = ['Quarter', 'Annual']
dfg.index = pd.period_range('2010-Q1',
      periods=len(dfg), freq='Q')

# reindex with integers from 0; keep old
old = dfg.index
dfg.index = range(len(dfg))

# plot the line from pandas
ax = dfg['Annual'].plot(color='blue',
    label='Year/Year Growth')

# plot the bars from pandas
dfg['Quarter'].plot.bar(ax=ax,
    label='Q/Q Growth', width=0.8)

# relabel the x-axis more appropriately
ticks = dfg.index[(dfg.index % 4) == 0]
labs = pd.Series(old[ticks]).astype(str)
ax.set_xticks(ticks)
ax.set_xticklabels(labs.str.replace('Q',
    '\nQ'), rotation=0)

# fix the range of the x-axis … skip 1st
ax.set_xlim([0.5, len(dfg)-0.5])

# add the legend
l = ax.legend(loc='best', fontsize='small')

# finish off and plot in the usual manner
ax.set_title('Fake Growth Data')
ax.set_xlabel('Quarter')
ax.set_ylabel('Per cent')

fig = ax.figure
fig.set_size_inches(8, 3)
fig.tight_layout(pad=1)
fig.savefig('filename.png', dpi=125)
plt.close(fig)

Working with missing and non-finite data

Working with missing data
Pandas uses the not-a-number construct (np.nan and
float('nan')) to indicate missing data. The Python None
can arise in data as well; it is also treated as missing
data, as is the pandas not-a-time construct (pandas.NaT).

Missing data in a Series
s = Series([8, None, float('nan'), np.nan])
            #[8,    NaN,   NaN,   NaN]
s.isnull() #[False, True,  True,  True]
s.notnull()#[True,  False, False, False]
s.fillna(0)#[8,     0,     0,     0]

Missing data in a DataFrame
df = df.dropna() # drop all rows with NaN
df = df.dropna(axis=1) # same for columns
df = df.dropna(how='all') # drop all-NaN rows
df = df.dropna(thresh=2) # keep rows with 2+ non-NaN
# only drop row if NaN in a specified col
df = df[df['col'].notnull()]

Recoding missing data
df.fillna(0, inplace=True) # np.nan → 0
s = df['col'].fillna(0)    # np.nan → 0
df = df.replace(r'\s+', np.nan,
      regex=True) # white space → np.nan

Non-finite numbers
With floating point numbers, pandas provides for
positive and negative infinity.
s = Series([float('inf'), float('-inf'),
      np.inf, -np.inf])
Pandas treats integer comparisons with plus or minus
infinity as expected.

Testing for finite numbers
(using the data from the previous example)
b = np.isfinite(s)
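A small sketch of the missing-data idioms above, exercised on a made-up frame:

```python
import numpy as np
import pandas as pd

# made-up frame with scattered NaNs
df = pd.DataFrame({'col1': [1.0, np.nan, 3.0],
                   'col2': [np.nan, np.nan, 6.0]})

complete = df.dropna()            # keep only rows with no NaN at all
filled = df.fillna(0)             # every NaN becomes 0
keep = df[df['col1'].notnull()]   # drop rows where col1 is NaN
```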
          Working with Categorical Data                                              Working with strings
Renaming categories
s = Series(list('abc'), dtype='category')
s.cat.categories = [1, 2, 3] # in place
s = s.cat.rename_categories([4,5,6])
# using a comprehension ...
s.cat.categories = ['Group ' + str(i)
     for i in s.cat.categories]
Trap: categories must be uniquely named
Removing categories
s = s.cat.remove_categories([4])
s = s.cat.remove_unused_categories() # returns a new Series
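A runnable sketch of the renaming and removal idioms above, using the returned-copy style (rename_categories) that survives in later pandas releases where the in-place s.cat.categories = … setter was removed; the group names are made up:

```python
import pandas as pd

s = pd.Series(list('abca'), dtype='category')
s = s.cat.rename_categories(['Group a', 'Group b', 'Group c'])
s = s.cat.remove_categories(['Group c'])  # the 'c' rows become NaN
s = s.cat.remove_unused_categories()      # returns a new Series
```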
                    Basic Statistics
Summary statistics
s = df['col1'].describe()
df1 = df.describe()
Value counts
s = df['col1'].value_counts()
Histogram binning
count, bins = np.histogram(df['col1'])
count, bins = np.histogram(df['col1'],
                  bins=5)
count, bins = np.histogram(df['col1'],
                  bins=[-3,-2,-1,0,1,2,3,4])
Regression
import statsmodels.formula.api as sm
result = sm.ols(formula='col1 ~ col2 + col3',
           data=df).fit()
print (result.params)
print (result.summary())
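The summary-statistics calls above, exercised on a tiny made-up column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 2, 3, 3, 3]})

desc = df['col1'].describe()      # count, mean, std, quartiles, ...
vc = df['col1'].value_counts()    # most frequent value listed first
count, bins = np.histogram(df['col1'], bins=3)
```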
Cautionary note