Pandas Worksheets ALL
Usage Patterns
Use pd.read_clipboard() for one-off data extractions.
Use the other pd.read_* methods in scripts for repeatable analyses.
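For example, a quick interactive check might use read_clipboard, while a script pins the data source explicitly. This is only a sketch; the file name and column name are hypothetical:
import pandas as pd

# One-off: parse whatever table is currently on the clipboard
df_quick = pd.read_clipboard()

# Repeatable: read the same file every time the script runs
df = pd.read_csv('measurements.csv', parse_dates=['timestamp'])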
Writing Data Structures to Disk
Write data structures to disk:
> s_df.to_csv(filename)
> s_df.to_excel(filename)
Write multiple DataFrames to a single Excel file:
> writer = pd.ExcelWriter(filename)
> df1.to_excel(writer, sheet_name='First')
> df2.to_excel(writer, sheet_name='Second')
> writer.save()

Reading and Writing Data Structures from and to a Database
Read, using SQLAlchemy. Supports multiple databases:
> from sqlalchemy import create_engine
> engine = create_engine(database_url)
> conn = engine.connect()
> df = pd.read_sql(query_str_or_table_name, conn)
Write:
> df.to_sql(table_name, conn)
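A minimal round trip, assuming an in-memory SQLite database (any SQLAlchemy URL works the same way); the 'people' table and its columns are only for illustration:
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite keeps the example self-contained
engine = create_engine('sqlite://')
df = pd.DataFrame({'name': ['Cary', 'Lynn'], 'age': [32, 18]})

df.to_sql('people', engine, index=False)   # write a table
back = pd.read_sql('people', engine)       # read it back by table name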
s     A Series, which maps an index to values. It is like an ordered
      dictionary, or a Numpy array with row labels and a name.
df    A DataFrame, which maps index and column labels to values. It
      is like a dictionary of Series (columns) sharing the same index,
      or like a 2D Numpy array with row and column labels.
s_df  Applies to both Series and DataFrames.
Manipulations of Pandas objects usually return copies.

Indexing and Slicing
Use these attributes on Series and DataFrames for indexing, slicing,
and assignments:
s_df.loc[ ]            Refers only to the index labels
s_df.iloc[ ]           Refers only to the integer location,
                       similar to lists or Numpy arrays
s_df.xs(key, level=L)  Select rows with label key in level L
                       of an object with MultiIndex.

Creating Series and DataFrames
Series
> pd.Series(values, index=index, name=name)
> pd.Series({'idx1' : val1, 'idx2' : val2})
Where values, index, and name are sequences or arrays.
DataFrame
> pd.DataFrame(values,
.       index=index, columns=col_names)
> pd.DataFrame({'col1' : series1_or_seq,
.       'col2' : series2_or_seq})
Where values is a sequence of sequences or a 2D array.

Masking and Boolean Indexing
Create masks with comparisons:
mask = df['X'] < 0
Or isin, for membership mask:
mask = df['X'].isin(list_of_valid_values)
Use masks for indexing:
df.loc[mask] = 0
Combine multiple masks with bitwise operators, and (&), or (|), xor (^),
and group them with parentheses:
mask = (df['X'] < 0) & (df['Y'] == 0)
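Putting creation, label/position indexing, and masking together in one sketch; the data mirrors the Cary/Lynn/Sam example and is only illustrative:
import pandas as pd

df = pd.DataFrame({'Age': [32, 18, 26], 'Gender': ['M', 'F', 'M']},
                  index=['Cary', 'Lynn', 'Sam'])

df.loc['Lynn', 'Age']      # label-based: 18
df.iloc[0, 0]              # position-based: 32

mask = (df['Age'] < 30) & (df['Gender'] == 'M')
df.loc[mask, 'Age'] = 27   # assign Age where the mask is True (Sam)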
Differences Between Pandas Objects and Numpy Arrays
When it comes to Pandas objects and Numpy arrays, aligning objects
on the index (or columns) before calculations might be the most
important difference. There are built-in methods for most common
statistical operations, such as mean or sum, and they apply across
one dimension at a time. To apply custom functions, use one of
three methods to do tablewise (pipe), row- or column-wise (apply),
or elementwise (applymap) operations.

Apply a Function to Each Value
Apply a function to each value in a Series or DataFrame:
s.apply(value_to_value) → Series
df.applymap(value_to_value) → DataFrame

Apply a Function to Each Series
Apply a function that receives a Series and returns a value or a
Series to every column:
df.apply(series_to_series) → DataFrame
To apply the function to every row (across columns), set axis=1:
df.apply(series_to_series, axis=1)

Apply a Function to a DataFrame
Apply a function that receives a DataFrame and returns a Series,
a DataFrame, or a single value:
df.pipe(df_to_series) → Series
df.pipe(df_to_df) → DataFrame
df.pipe(df_to_value) → Value
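A sketch of the three levels, using throwaway lambdas in place of the value_to_value, series_to_series, and df_to_* functions named above:
import pandas as pd

df = pd.DataFrame({'X': [1, 2, 3], 'Y': [4, 5, 6]})

df.applymap(lambda v: v ** 2)                  # elementwise -> DataFrame
df.apply(lambda col: col.max() - col.min())    # column-wise -> Series
df.apply(lambda row: row.sum(), axis=1)        # row-wise -> Series
df.pipe(lambda d: d / d.sum().sum())           # whole table -> DataFrame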
Rule 1: Operations between multiple Pandas objects implement
auto-alignment based on the index first. Labels present in only one
of the objects produce NaN; use add, sub, mul, and div to set a
fill value instead.

Rule 2: Mathematical operators (+ - * / exp, log, ...) apply
element by element on the values:
df + 1    df.abs()    np.log(df)
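A minimal sketch of Rule 1's alignment (the labels and values are made up):
import pandas as pd

s1 = pd.Series({'a': 1, 'b': 2})
s2 = pd.Series({'b': 4, 'c': 5})

s1 + s2                    # a: NaN, b: 6, c: NaN (unmatched labels -> NaN)
s1.add(s2, fill_value=0)   # a: 1.0, b: 6.0, c: 5.0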
Rule 3: Reduction operations (mean, std, skew, kurt, sum, prod, ...)
are applied column by column by default:
> df.sum() → Series
They operate across rows by default (axis=0, or axis='rows').
Operate across columns with axis=1 or axis='columns'.

count: Number of non-null observations
sum: Sum of values
mean: Mean of values
mad: Mean absolute deviation
median: Arithmetic median of values
min: Minimum
max: Maximum
mode: Mode
prod: Product of values
std: Bessel-corrected sample standard deviation
var: Unbiased variance
sem: Standard error of the mean
skew: Sample skewness (3rd moment)
kurt: Sample kurtosis (4th moment)
quantile: Sample quantile (value at %)
value_counts: Count of unique values

What Happens with Missing Values?
Missing values are represented by NaN (not a number) or NaT
(not a time).
• They propagate in operations across Pandas objects
  (1 + NaN → NaN).
• They are ignored in a "sensible" way in computations;
  they equal 0 in sum, they're ignored in mean, etc.
• They stay NaN with mathematical operations such as
  np.log(NaN) → NaN.
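A short sketch of Rule 3 and of NaN handling (the values are arbitrary):
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': [1.0, np.nan, 3.0], 'Y': [4.0, 5.0, 6.0]})

df.sum()          # column by column: X 4.0, Y 15.0 (NaN counts as 0)
df.mean(axis=1)   # across columns; NaN is ignored in each row's mean
df['X'] + 1       # NaN propagates: [2.0, NaN, 4.0]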
With a Series, Pandas plots values against the index:
> ax = s.plot()
With a DataFrame, Pandas creates one line per column:
> ax = df.plot()
Use Matplotlib to override or add annotations:
> ax.set_xlabel('Time')
> ax.set_ylabel('Value')
> ax.set_title('Experiment A')
Pass labels if you want to override the column names and set the
legend location:
> ax.legend(labels, loc='best')
Note: When plotting the results of complex manipulations with groupby,
it's often useful to stack/unstack the resulting DataFrame to fit the
one-line-per-column assumption.

Kinds of Plots
df.plot.scatter(x, y)    df.plot.bar()    df.plot.hist()    df.plot.box()
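A minimal plotting sketch, assuming Matplotlib is installed; the column names and dates are made up:
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 2], 'B': [2, 2, 4]},
                  index=pd.date_range('2000-01-01', periods=3))

ax = df.plot()                 # one line per column
ax.set_xlabel('Time')
ax.set_ylabel('Value')
ax.legend(['Series A', 'Series B'], loc='best')
plt.show()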
Frequency Offsets
Used by date_range, period_range and resample:
• BM: Business month end
• U, us: Microseconds

Save Yourself Some Pain: Use ISO 8601 Format
To be consistent and minimize the risk of error or confusion, use ISO
8601 format:
YES > pd.to_datetime('2000-01-13')  # 13th January
Merging and Concatenating
Use on if merging on the same column in both DataFrames, otherwise
use left_on and right_on:
> pd.merge(left, right, on=col)
pd.concat "stacks" DataFrames on top of each other:
> pd.concat(list_of_dfs)

Split/Apply/Combine
Split rows into groups with groupby, apply a function to each group,
and combine the results. See the sketch below.
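A sketch of the split/apply/combine pattern; the 'Team' and 'Score' columns are hypothetical:
import pandas as pd

df = pd.DataFrame({'Team': ['A', 'A', 'B', 'B'],
                   'Score': [1, 3, 2, 4]})

df.groupby('Team')['Score'].mean()                  # split, apply mean, combine
df.groupby('Team').agg({'Score': ['mean', 'max']})  # several aggregations at once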
MultiIndex: A Multi-Level Hierarchical Index
Often created as a result of:
> df.groupby(list_of_columns)
> df.set_index(list_of_columns)
Contiguous labels are displayed together but apply to each row. The
concept is similar to multi-level columns.
A MultiIndex allows indexing and slicing one or multiple levels at once.
Using a long-format example with a (Year, Month) MultiIndex:
long.loc[1900]                   All 1900 rows
long.loc[(1900, 'March')]        Value 2
long.xs('March', level='Month')  All March rows
Simpler than using boolean indexing, for example:
> long[long.Month == 'March']

Long to Wide Format and Back with stack() and unstack()
Pivot column level to index, i.e. "stacking the columns" (wide to long):
> df.stack()
Pivot index level to columns, "unstack the columns" (long to wide):
> df.unstack()
If there are multiple indices or column levels, use the level number or
name to stack/unstack:
> df.unstack(1) or > df.unstack('Month')
A common use case for unstacking: plotting group data vs index
after groupby:
> (df.groupby(['A', 'B'])['relevant'].mean()
.    .unstack().plot())
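A sketch of the Long example referred to above, assuming a (Year, Month) MultiIndex and arbitrary values:
import pandas as pd

long = pd.DataFrame(
    {'Value': [1, 2, 3, 4]},
    index=pd.MultiIndex.from_tuples(
        [(1900, 'January'), (1900, 'March'),
         (2000, 'January'), (2000, 'March')],
        names=['Year', 'Month']))

long.loc[1900]                          # all 1900 rows
long.xs('March', level='Month')         # all March rows
wide = long['Value'].unstack('Month')   # long -> wide: Month becomes columns
wide.stack()                            # back to long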
Pivot Tables
> pd.pivot_table(df,
.      index=cols,        keys to group by for index
.      columns=cols2,     keys to group by for columns
.      values=cols3,      columns to aggregate
.      aggfunc='mean')    what to do with repeated values
Omitting index, columns, or values will use all remaining columns
of df. You can "pivot" a table manually using groupby, stack,
and unstack.

From Wide to Long with melt
Specify which columns are identifiers (id_vars, values will be repeated
for each row) and which are "measured variables" (value_vars, will
become values in the variable column; all remaining columns by default):
> pd.melt(df, id_vars=id_cols, value_vars=value_columns)
> pd.melt(team, id_vars=['Color'],
.         value_vars=['A', 'B', 'C'],
.         var_name='Team',
.         value_name='Score')
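A round-trip sketch using a hypothetical team table with a Color column and one column per team, matching the melt call above:
import pandas as pd

team = pd.DataFrame({'Color': ['red', 'blue'],
                     'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})

# Wide -> long: one row per (Color, Team) pair
long = pd.melt(team, id_vars=['Color'],
               value_vars=['A', 'B', 'C'],
               var_name='Team', value_name='Score')

# Long -> wide again, aggregating repeated values with the mean
pd.pivot_table(long, index='Color', columns='Team',
               values='Score', aggfunc='mean')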
df.pivot() vs pd.pivot_table
df.pivot()        Does not deal with repeated values in the index.
                  It's a declarative form of stack and unstack.
pd.pivot_table()  Use if you have repeated values in the index
                  (specify the aggfunc argument).
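For example, with repeated (index, column) pairs df.pivot raises an error, while pd.pivot_table aggregates them; the data here is arbitrary:
import pandas as pd

df = pd.DataFrame({'row': ['r1', 'r1', 'r2'],
                   'col': ['c1', 'c1', 'c2'],
                   'val': [1, 3, 5]})

# df.pivot(index='row', columns='col', values='val')  # ValueError: duplicate entries
pd.pivot_table(df, index='row', columns='col',
               values='val', aggfunc='mean')          # (r1, c1) -> 2.0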