Introducing DataFrames
DATA MANIPULATION WITH PANDAS
Richie Cotton
Data Evangelist at DataCamp
What's the point of pandas?
  Data Manipulation skill track
  Data Visualization skill track
Course outline
Chapter 1: DataFrames
  Sorting and subsetting
  Creating new columns
Chapter 2: Aggregating Data
  Summary statistics
  Counting
  Grouped summary statistics
Chapter 3: Slicing and Indexing Data
  Subsetting using slicing
  Indexes and subsetting using indexes
Chapter 4: Creating and Visualizing Data
  Plotting
  Handling missing data
  Reading data into a DataFrame
pandas is built on NumPy and Matplotlib
     pandas is popular
1   https://pypistats.org/packages/pandas
Rectangular data
Name     Breed        Color  Height (cm)  Weight (kg)  Date of Birth
Bella    Labrador     Brown  56           25           2013-07-01
Charlie  Poodle       Black  43           23           2016-09-16
Lucy     Chow Chow    Brown  46           22           2014-08-25
Cooper   Schnauzer    Gray   49           17           2011-12-11
Max      Labrador     Black  59           29           2017-01-20
Stella   Chihuahua    Tan    18           2            2015-04-20
Bernie   St. Bernard  White  77           74           2018-02-27
pandas DataFrames
print(dogs)
      name         breed    color   height_cm   weight_kg date_of_birth
0    Bella      Labrador    Brown         56          24    2013-07-01
1   Charlie       Poodle    Black         43          24    2016-09-16
2     Lucy     Chow Chow    Brown         46          24    2014-08-25
3   Cooper     Schnauzer    Gray          49          17    2011-12-11
4      Max      Labrador    Black         59          29    2017-01-20
5   Stella     Chihuahua     Tan          18           2    2015-04-20
6   Bernie    St. Bernard   White         77          74    2018-02-27
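The slides print `dogs` without showing where it comes from. As a minimal sketch, assuming the data is typed in by hand rather than read from a file, a dictionary of columns passed to `pd.DataFrame` gives a DataFrame with the same columns and values as the printout above:

```python
import pandas as pd

# Assumption: the course loads `dogs` from a file that is not shown here;
# building it from a dict of lists is just one way to get the same table.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer",
              "Labrador", "Chihuahua", "St. Bernard"],
    "color": ["Brown", "Black", "Brown", "Gray", "Black", "Tan", "White"],
    "height_cm": [56, 43, 46, 49, 59, 18, 77],
    "weight_kg": [24, 24, 24, 17, 29, 2, 74],
    "date_of_birth": ["2013-07-01", "2016-09-16", "2014-08-25", "2011-12-11",
                      "2017-01-20", "2015-04-20", "2018-02-27"],
})

# Each dict key becomes a column; rows get a default 0..6 integer index.
print(dogs)
```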
Exploring a DataFrame: .head()
dogs.head()
      name       breed    color   height_cm   weight_kg date_of_birth
0    Bella    Labrador    Brown         56          24    2013-07-01
1   Charlie     Poodle    Black         43          24    2016-09-16
2     Lucy    Chow Chow   Brown         46          24    2014-08-25
3   Cooper    Schnauzer   Gray          49          17    2011-12-11
4      Max    Labrador    Black         59          29    2017-01-20
Exploring a DataFrame: .info()
dogs.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 6 columns):
 #    Column          Non-Null Count   Dtype
 --   ------          --------------   -----
 0    name            7 non-null       object
 1    breed           7 non-null       object
 2    color           7 non-null       object
 3    height_cm       7 non-null       int64
 4    weight_kg       7 non-null       int64
 5    date_of_birth   7 non-null       object
dtypes: int64(2), object(4)
memory usage: 464.0+ bytes
Exploring a DataFrame: .shape
dogs.shape
(7, 6)
Exploring a DataFrame: .describe()
dogs.describe()
        height_cm   weight_kg
count   7.000000    7.000000
mean    49.714286   27.428571
std     17.960274   22.292429
min     18.000000   2.000000
25%     44.500000   19.500000
50%     49.000000   23.000000
75%     57.500000   27.000000
max     77.000000   74.000000
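The result of `.describe()` is itself a DataFrame, so individual statistics can be looked up by row and column label. The sketch below uses only the heights from the slides; the variable names `heights`, `summary`, and `median_height` are illustrative and not from the course.

```python
import pandas as pd

# Illustrative two-column frame: one text column, one numeric column.
heights = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
    "height_cm": [56, 43, 46, 49, 59, 18, 77],
})

# By default, describe() summarizes only the numeric columns,
# so the text column "name" does not appear in the result.
summary = heights.describe()
print(summary)

# Pull out single statistics with .loc[row_label, column_label].
median_height = summary.loc["50%", "height_cm"]
tallest = summary.loc["max", "height_cm"]
print(median_height, tallest)
```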
Components of a DataFrame: .values
dogs.values
array([['Bella', 'Labrador', 'Brown', 56, 24, '2013-07-01'],
      ['Charlie', 'Poodle', 'Black', 43, 24, '2016-09-16'],
      ['Lucy', 'Chow Chow', 'Brown', 46, 24, '2014-08-25'],
      ['Cooper', 'Schnauzer', 'Gray', 49, 17, '2011-12-11'],
      ['Max', 'Labrador', 'Black', 59, 29, '2017-01-20'],
      ['Stella', 'Chihuahua', 'Tan', 18, 2, '2015-04-20'],
      ['Bernie', 'St. Bernard', 'White', 77, 74, '2018-02-27']],
     dtype=object)
Components of a DataFrame: .columns and .index
dogs.columns
Index(['name', 'breed', 'color', 'height_cm', 'weight_kg', 'date_of_birth'],
      dtype='object')
dogs.index
RangeIndex(start=0, stop=7, step=1)
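To see how the three components fit together, the sketch below takes a small two-row frame apart and rebuilds it from its values, column labels, and row labels. It uses `.to_numpy()`, which returns the same array as `.values`; the name `rebuilt` is illustrative.

```python
import pandas as pd

# A cut-down version of the dogs data, just enough to show the pieces.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie"],
    "height_cm": [56, 43],
})

values = dogs.to_numpy()   # 2-D NumPy array of the data (like dogs.values)
columns = dogs.columns     # Index of column labels
index = dogs.index         # RangeIndex of row labels

# Reassemble a DataFrame from the three components.
# Note: because the array mixes text and numbers, the rebuilt
# columns all have the generic object dtype.
rebuilt = pd.DataFrame(values, columns=columns, index=index)
print(rebuilt)
```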
     pandas Philosophy
       There should be one -- and preferably only one -- obvious way to do it.
        - The Zen of Python by Tim Peters, Item 13
1   https://www.python.org/dev/peps/pep-0020/
    Let's practice!
Sorting and subsetting
Sorting
dogs.sort_values("weight_kg")
      name         breed    color   height_cm   weight_kg date_of_birth
5   Stella     Chihuahua     Tan          18           2    2015-04-20
3   Cooper     Schnauzer    Gray          49          17    2011-12-11
0    Bella      Labrador    Brown         56          24    2013-07-01
1   Charlie       Poodle    Black         43          24    2016-09-16
2     Lucy     Chow Chow    Brown         46          24    2014-08-25
4      Max      Labrador    Black         59          29    2017-01-20
6   Bernie    St. Bernard   White         77          74    2018-02-27
Sorting in descending order
dogs.sort_values("weight_kg", ascending=False)
      name         breed    color   height_cm   weight_kg date_of_birth
6   Bernie    St. Bernard   White         77          74    2018-02-27
4      Max      Labrador    Black         59          29    2017-01-20
0    Bella      Labrador    Brown         56          24    2013-07-01
1   Charlie       Poodle    Black         43          24    2016-09-16
2     Lucy     Chow Chow    Brown         46          24    2014-08-25
3   Cooper     Schnauzer    Gray          49          17    2011-12-11
5   Stella     Chihuahua     Tan          18           2    2015-04-20
Sorting by multiple variables
dogs.sort_values(["weight_kg", "height_cm"])
      name         breed    color   height_cm   weight_kg date_of_birth
5   Stella     Chihuahua     Tan          18           2    2015-04-20
3   Cooper     Schnauzer    Gray          49          17    2011-12-11
1   Charlie       Poodle    Black         43          24    2016-09-16
2     Lucy     Chow Chow    Brown         46          24    2014-08-25
0    Bella      Labrador    Brown         56          24    2013-07-01
4      Max      Labrador    Black         59          29    2017-01-20
6   Bernie    St. Bernard   White         77          74    2018-02-27
Sorting by multiple variables
dogs.sort_values(["weight_kg", "height_cm"], ascending=[True, False])
      name         breed    color   height_cm   weight_kg date_of_birth
5   Stella     Chihuahua     Tan          18           2    2015-04-20
3   Cooper     Schnauzer    Gray          49          17    2011-12-11
0    Bella      Labrador    Brown         56          24    2013-07-01
2     Lucy     Chow Chow    Brown         46          24    2014-08-25
1   Charlie       Poodle    Black         43          24    2016-09-16
4      Max      Labrador    Black         59          29    2017-01-20
6   Bernie    St. Bernard   White         77          74    2018-02-27
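The multi-column sort above can be reproduced on a cut-down copy of the data. `sort_values()` returns a new DataFrame and keeps the original row labels, so the sketch also shows `reset_index(drop=True)` as one optional way to renumber the rows afterwards; that step is an addition here, not something the slides do.

```python
import pandas as pd

dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
    "height_cm": [56, 43, 46, 49, 59, 18, 77],
    "weight_kg": [24, 24, 24, 17, 29, 2, 74],
})

# Lightest first; among dogs of equal weight, tallest first.
by_weight_then_height = dogs.sort_values(
    ["weight_kg", "height_cm"], ascending=[True, False]
)
print(by_weight_then_height)

# Optional: replace the shuffled row labels with a fresh 0..n-1 index.
tidy = by_weight_then_height.reset_index(drop=True)
print(tidy)
```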
Subsetting columns
dogs["name"]
0      Bella
1    Charlie
2       Lucy
3     Cooper
4        Max
5     Stella
6     Bernie
Name: name, dtype: object
Subsetting multiple columns
dogs[["breed", "height_cm"]]
         breed    height_cm
0     Labrador          56
1       Poodle          43
2    Chow Chow          46
3    Schnauzer          49
4     Labrador          59
5    Chihuahua          18
6   St. Bernard         77
cols_to_subset = ["breed", "height_cm"]
dogs[cols_to_subset]
         breed    height_cm
0     Labrador          56
1       Poodle          43
2    Chow Chow          46
3    Schnauzer          49
4     Labrador          59
5    Chihuahua          18
6   St. Bernard         77
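The difference between single and double square brackets is easy to miss: a single column name returns a Series, while a list of names returns a DataFrame, even when the list has only one entry. A short sketch on a three-row sample (variable names are illustrative):

```python
import pandas as pd

dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "breed": ["Labrador", "Poodle", "Chow Chow"],
    "height_cm": [56, 43, 46],
})

# Single brackets with a string: one column, returned as a Series.
breed_series = dogs["breed"]

# Double brackets (a list inside the subsetting brackets): a DataFrame.
breed_frame = dogs[["breed"]]
breed_and_height = dogs[["breed", "height_cm"]]

print(type(breed_series).__name__)   # Series
print(type(breed_frame).__name__)    # DataFrame
```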
Subsetting rows
dogs["height_cm"] > 50
0     True
1    False
2    False
3    False
4     True
5    False
6     True
Name: height_cm, dtype: bool
Subsetting rows
dogs[dogs["height_cm"] > 50]
     name         breed    color   height_cm   weight_kg date_of_birth
0   Bella      Labrador    Brown         56          24    2013-07-01
4     Max      Labrador    Black         59          29    2017-01-20
6   Bernie   St. Bernard   White         77          74    2018-02-27
Subsetting based on text data
dogs[dogs["breed"] == "Labrador"]
     name        breed   color   height_cm   weight_kg date_of_birth
0   Bella     Labrador   Brown         56          24    2013-07-01
4     Max     Labrador   Black         59          29    2017-01-20
Subsetting based on dates
dogs[dogs["date_of_birth"] < "2015-01-01"]
     name       breed    color   height_cm   weight_kg date_of_birth
0   Bella    Labrador    Brown         56          24    2013-07-01
2    Lucy    Chow Chow   Brown         46          24    2014-08-25
3   Cooper   Schnauzer   Gray          49          17    2011-12-11
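The comparison above works on text because dates written as year-month-day sort the same way alphabetically as they do chronologically. If the dates were stored in another format, that would no longer hold. One possible safeguard, sketched here as an optional extra step that the slides do not take, is to convert the column with `pd.to_datetime()` and compare against a `pd.Timestamp`:

```python
import pandas as pd

dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
    "date_of_birth": ["2013-07-01", "2016-09-16", "2014-08-25", "2011-12-11",
                      "2017-01-20", "2015-04-20", "2018-02-27"],
})

# Convert the text column to real datetime values.
dogs["date_of_birth"] = pd.to_datetime(dogs["date_of_birth"])

# Compare against a Timestamp so both sides are true dates.
cutoff = pd.Timestamp("2015-01-01")
born_before_2015 = dogs[dogs["date_of_birth"] < cutoff]
print(born_before_2015)
```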
Subsetting based on multiple conditions
is_lab = dogs["breed"] == "Labrador"
is_brown = dogs["color"] == "Brown"
dogs[is_lab & is_brown]
     name        breed    color   height_cm   weight_kg date_of_birth
0   Bella     Labrador    Brown         56          24    2013-07-01
dogs[ (dogs["breed"] == "Labrador") & (dogs["color"] == "Brown") ]
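Alongside `&` ("and"), boolean Series can be combined with `|` ("or") and inverted with `~` ("not"). In the one-line form, the parentheses around each condition are required because `&` and `|` bind more tightly than `==` in Python. A small sketch (the names `lab_or_brown` and `not_lab` are illustrative):

```python
import pandas as pd

dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer",
              "Labrador", "Chihuahua", "St. Bernard"],
    "color": ["Brown", "Black", "Brown", "Gray", "Black", "Tan", "White"],
})

is_lab = dogs["breed"] == "Labrador"
is_brown = dogs["color"] == "Brown"

# Rows where both conditions hold.
lab_and_brown = dogs[is_lab & is_brown]

# Rows where at least one condition holds.
lab_or_brown = dogs[is_lab | is_brown]

# Rows where the condition does not hold.
not_lab = dogs[~is_lab]

print(lab_or_brown)
```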
Subsetting using .isin()
is_black_or_brown = dogs["color"].isin(["Black", "Brown"])
dogs[is_black_or_brown]
      name       breed    color   height_cm   weight_kg date_of_birth
0    Bella    Labrador    Brown         56          24    2013-07-01
1   Charlie     Poodle    Black         43          24    2016-09-16
2     Lucy    Chow Chow   Brown         46          24    2014-08-25
4      Max    Labrador    Black         59          29    2017-01-20
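The mask returned by `.isin()` is an ordinary boolean Series, so it can be inverted with `~` to keep the rows that are not in the list. This inverted filter is an extension of the slide's example, shown as a sketch:

```python
import pandas as pd

dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
    "color": ["Brown", "Black", "Brown", "Gray", "Black", "Tan", "White"],
})

is_black_or_brown = dogs["color"].isin(["Black", "Brown"])

# Dogs whose color is in the list.
black_or_brown = dogs[is_black_or_brown]

# Dogs whose color is NOT in the list.
other_colors = dogs[~is_black_or_brown]
print(other_colors)
```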
    Let's practice!
                                  New columns
Adding a new column
dogs["height_m"] = dogs["height_cm"] / 100
print(dogs)
      name         breed    color   height_cm   weight_kg date_of_birth   height_m
0    Bella      Labrador    Brown         56          24    2013-07-01       0.56
1   Charlie       Poodle    Black         43          24    2016-09-16       0.43
2     Lucy     Chow Chow    Brown         46          24    2014-08-25       0.46
3   Cooper     Schnauzer    Gray          49          17    2011-12-11       0.49
4      Max      Labrador    Black         59          29    2017-01-20       0.59
5   Stella     Chihuahua     Tan          18           2    2015-04-20       0.18
6   Bernie    St. Bernard   White         77          74    2018-02-27       0.77
Doggy mass index
BMI = weight in kg / (height in m)^2
dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2
print(dogs.head())
      name       breed    color    height_cm   weight_kg date_of_birth   height_m         bmi
0    Bella    Labrador    Brown          56          24    2013-07-01       0.56    76.530612
1   Charlie     Poodle    Black          43          24    2016-09-16       0.43    129.799892
2     Lucy    Chow Chow   Brown          46          24    2014-08-25       0.46    113.421550
3   Cooper    Schnauzer   Gray           49          17    2011-12-11       0.49    70.803832
4      Max    Labrador    Black          59          29    2017-01-20       0.59    83.309394
Multiple manipulations
bmi_lt_100 = dogs[dogs["bmi"] < 100]
bmi_lt_100_height = bmi_lt_100.sort_values("height_cm", ascending=False)
bmi_lt_100_height[["name", "height_cm", "bmi"]]
     name    height_cm        bmi
4     Max           59   83.309394
0   Bella           56   76.530612
3   Cooper          49   70.803832
5   Stella          18   61.728395
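The same steps (add columns, filter rows, sort, select columns) can also be written as one chained expression instead of using intermediate variables. This is a stylistic alternative rather than the form the slides use; the name `result` is illustrative.

```python
import pandas as pd

dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
    "height_cm": [56, 43, 46, 49, 59, 18, 77],
    "weight_kg": [24, 24, 24, 17, 29, 2, 74],
})

# New columns: height in metres, then BMI = weight / height squared.
dogs["height_m"] = dogs["height_cm"] / 100
dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2

# Filter, sort, and select in a single chain.
result = (
    dogs[dogs["bmi"] < 100]
    .sort_values("height_cm", ascending=False)
    [["name", "height_cm", "bmi"]]
)
print(result)
```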
    Let's practice!