Pandas plots
Pandas also provides options to visualize data directly from a DataFrame. Here are some examples:
Syntax:
    df.plot(x, y, marker, kind)
    x = column to plot on the x-axis
    y = column to plot on the y-axis
    marker = marker shape, used in specific plots such as a scatter plot
    kind = type of plot
A scatter plot to visualize the trend of acceleration in different years.
   df.plot(x = 'model_year', y = 'acceleration', marker = 'o',
           kind = 'scatter');
A bar plot to visualize mean acceleration in different years.
   df.groupby('model_year').mean()[['acceleration']].plot(kind = 'bar');
A histogram to visualize the frequency distribution of cylinders.
   df['cylinders'].plot(kind = 'hist')
A scatter plot to visualize the relationship between weight and mpg.
   df.plot(x = 'weight', y = 'mpg', kind = 'scatter')
A bar plot to visualize the sorted mean values of acceleration with respect to the number of
cylinders.
   df.groupby('cylinders').mean().sort_values('acceleration')[['acceleration']].plot(kind = 'bar')
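The snippets above assume a df already holding the auto-mpg data. As a minimal, self-contained sketch, the small made-up sample below stands in for the real dataset (the values are illustrative, not actual auto-mpg records):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script runs headless
import pandas as pd

# Small made-up sample standing in for the auto-mpg dataset
df = pd.DataFrame({
    'model_year':   [70, 70, 71, 71, 72],
    'acceleration': [12.0, 11.5, 13.2, 14.1, 13.8],
    'cylinders':    [8, 8, 4, 6, 4],
    'weight':       [3504, 3693, 2372, 2833, 2130],
    'mpg':          [18.0, 15.0, 24.0, 20.0, 27.0],
})

# Scatter plot of acceleration against model year; df.plot returns
# the matplotlib Axes it drew on, which can be customized further
ax = df.plot(x='model_year', y='acceleration', marker='o', kind='scatter')
print(type(ax).__name__)
```

The returned Axes object can be passed to further matplotlib calls (titles, labels, saving the figure).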
Pandas Exercise
Problem Statement:
Use the rainfall dataset provided for this exercise. The data contains region (district)-wise
rainfall across India.
   1.   Import the data into Python environment as a Pandas DataFrame.
   2.   Check for missing values, if any and drop the corresponding rows.
   3.   Find the district that gets the highest annual rainfall.
   4.   Display the top 5 states that get the highest annual rainfall.
   5.   Drop the columns 'Jan-Feb', 'Mar-May', 'Jun-Sep', 'Oct-Dec'.
   6.   Display the state-wise mean rainfall for all the months using a pivot table.
   7.   Display the count of districts in each state.
   8.   For each state, display the district that gets the highest rainfall in May. Also display the
        recorded rainfall.
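A few of these tasks can be sketched on a tiny made-up frame. The column names (STATE, DISTRICT, MAY, ANNUAL) and all values below are assumptions standing in for the real rainfall dataset:

```python
import pandas as pd

# Hypothetical columns and values standing in for the rainfall dataset
df = pd.DataFrame({
    'STATE':    ['Kerala', 'Kerala', 'Goa', 'Goa'],
    'DISTRICT': ['Idukki', 'Kozhikode', 'North Goa', 'South Goa'],
    'MAY':      [180.5, 220.1, 90.3, 85.7],
    'ANNUAL':   [3200.0, 2950.5, 2800.2, 2750.9],
})

df = df.dropna()  # task 2: drop rows with missing values

# Task 3: district with the highest annual rainfall
print(df.loc[df['ANNUAL'].idxmax(), 'DISTRICT'])

# Task 4 (adapted to the dummy data): states ranked by total annual rainfall
print(df.groupby('STATE')['ANNUAL'].sum().nlargest(5))

# Task 8: for each state, the district with the highest rainfall in May
print(df.loc[df.groupby('STATE')['MAY'].idxmax(), ['STATE', 'DISTRICT', 'MAY']])
```

The same calls apply unchanged once the real dataset is loaded with pd.read_csv().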
Exercise
Problem Statement:
Given below is a DataFrame df:
    import pandas as pd

    df = pd.DataFrame([[0.23, 'f1'], [5.36, 'f2']],
                      index = list('pq'),
                      columns = list('ab'))
Do the following:
    1. Rename the column 'a' to 'A'.
    2. Add a new column 'c' filled with random values.
    3. Change the datatype of column 'A' values to complex.
    4. Display the rows in which any element matches any element of the given list:
           lst = ['f30', 'f50', 'f2', 'f0']
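One possible solution sketch for these four steps (rename, astype, and isin are standard pandas calls; the random values make column 'c' differ between runs):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0.23, 'f1'], [5.36, 'f2']],
                  index = list('pq'),
                  columns = list('ab'))

# 1. Rename column 'a' to 'A'
df = df.rename(columns={'a': 'A'})

# 2. Add a new column 'c' filled with random values
df['c'] = np.random.randn(len(df))

# 3. Change the datatype of column 'A' to complex
df['A'] = df['A'].astype(complex)

# 4. Rows in which any element matches any element of lst
lst = ['f30', 'f50', 'f2', 'f0']
print(df[df.isin(lst).any(axis=1)])   # row 'q' matches via 'f2'
```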
Merging
A few more commonly used Pandas operations are listed below:
        Merging
        Reshaping
        Pivot Tables
        Grouping
Let us understand their importance through some illustrations, starting with merging datasets.
Suppose we are given two datasets from an experiment, each holding its own set of features. Our
task is to form a single dataset that combines all the features for each observation. To do so, we
can use the concat() function.
   import pandas as pd

   data1 = pd.DataFrame([[15, 12, -3],
                         [33, 54, 21],
                         [10, 32, 22]],
                         columns = list('ABC'))
   data2 = pd.DataFrame([[10, 1, 3],
                         [33, -54, 2],
                         [10, 0.32, 2]],
                         columns = list('DEF'))
   print(data1)
   #     A   B   C
   # 0  15  12  -3
   # 1  33  54  21
   # 2  10  32  22

   print(data2)
   #     D      E  F
   # 0  10   1.00  3
   # 1  33 -54.00  2
   # 2  10   0.32  2

   print(pd.concat([data1, data2], axis = 1))
   #     A   B   C   D      E  F
   # 0  15  12  -3  10   1.00  3
   # 1  33  54  21  33 -54.00  2
   # 2  10  32  22  10   0.32  2
Suppose an individual recorded the same 3 features on two separate occasions and now wants to
combine these samples into a single dataset. We can achieve this with the same concat() function,
this time changing the axis.
   import numpy as np
   import pandas as pd

   data1 = pd.DataFrame(np.random.randn(9).reshape(3,3),
                        columns = list('ABC'))
   data2 = pd.DataFrame(np.arange(9).reshape(3,3),
                        columns = list('ABC'))
   print(data1)          # Random values
   #           A         B         C
   # 0  1.957218  0.433266  1.214950
   # 1 -0.143500 -0.092030 -0.823898
   # 2  0.481486 -0.024111 -0.769195

   print(data2)
   #    A  B  C
   # 0  0  1  2
   # 1  3  4  5
   # 2  6  7  8

   print(pd.concat([data1, data2], axis = 0))
   #           A         B         C
   # 0  1.957218  0.433266  1.214950
   # 1 -0.143500 -0.092030 -0.823898
   # 2  0.481486 -0.024111 -0.769195
   # 0  0.000000  1.000000  2.000000
   # 1  3.000000  4.000000  5.000000
   # 2  6.000000  7.000000  8.000000
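Note that after stacking along axis 0 the row labels 0, 1, 2 repeat. If a fresh index is preferred, concat() accepts an ignore_index flag; a small sketch:

```python
import numpy as np
import pandas as pd

data1 = pd.DataFrame(np.zeros((3, 3)), columns=list('ABC'))
data2 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=list('ABC'))

# ignore_index=True discards the original row labels and renumbers 0..5
stacked = pd.concat([data1, data2], axis=0, ignore_index=True)
print(list(stacked.index))   # [0, 1, 2, 3, 4, 5]
```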
Pivot Tables
Reshaping a dataset is a good starting point for quick text-based visualization. Along similar lines,
pivot tables are often even more effective at producing a readable summary.
To understand pivot tables, consider the following dataframe, which adds a 'Score' feature.
    import pandas as pd

    df = pd.DataFrame([
                       ['IND', 'Gold', 'Game1', '9.9'],
                       ['IND', 'Bronze', 'Game2', '8'],
                       ['USA', 'Silver', 'Game1', '9.5'],
                       ['USA', 'Gold', 'Game2', '8.6'],
                       ], columns = ['Country', 'Medal',
                                     'Game', 'Score'],
                       index = ['Year1', 'Year2', 'Year1', 'Year2'])

    print(df)
    #       Country   Medal   Game Score
    # Year1     IND    Gold  Game1   9.9
    # Year2     IND  Bronze  Game2     8
    # Year1     USA  Silver  Game1   9.5
    # Year2     USA    Gold  Game2   8.6
Pivot tables come in handy when we have to break a large dataset (in terms of features) down to
fewer features for quick visualization, for example, finding which medal is common to both IND
and USA, or listing the game(s) in which India won Silver.
We apply a pivot table to the given dataset:
    # Listing all the features
    print(df.pivot(index = 'Country', columns = 'Medal'))
    #            Game                 Score
    # Medal    Bronze   Gold Silver  Bronze Gold Silver
    # Country
    # IND       Game2  Game1   None       8  9.9   None
    # USA        None  Game2  Game1    None  8.6    9.5

    # Listing only the Score feature
    print(df.pivot(index = 'Country', columns = 'Medal',
                   values = 'Score'))
    # Medal   Bronze Gold Silver
    # Country
    # IND          8  9.9   None
    # USA       None  8.6    9.5
Pivot table limitations
The previous dataset doesn't include any duplicate samples. Suppose, for instance, that two people
from India win gold and silver medals in Year1 of Game1. In that case, pivot() breaks because of
the resulting ambiguity in the Game column. To overcome this limitation, we have another
function, pivot_table(), whose aggfunc argument lets us resolve the ambiguity. Let's
understand it through an example.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame([
                       ['IND', 'Gold', 'Game1', '9.9'],
                       ['IND', 'Silver', 'Game1', '9.5'],
                       ['IND', 'Bronze', 'Game2', '8'],
                       ['USA', 'Bronze', 'Game1', '9.0'],
                       ['USA', 'Silver', 'Game2', '8.6'],
                       ], columns = ['Country', 'Medal',
                                     'Game', 'Score'],
                       index = ['Year1', 'Year1', 'Year2', 'Year1', 'Year2'])

    print(df)
    #       Country   Medal   Game Score
    # Year1     IND    Gold  Game1   9.9
    # Year1     IND  Silver  Game1   9.5
    # Year2     IND  Bronze  Game2     8
    # Year1     USA  Bronze  Game1   9.0
    # Year2     USA  Silver  Game2   8.6

    print(df.pivot_table(index = 'Country',
                         columns = 'Game',
                         values = 'Score',
                         aggfunc = np.max))
    # Game    Game1 Game2
    # Country
    # IND       9.9     8
    # USA       9.0   8.6
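To see the ambiguity pivot() cannot handle, try pivoting this same dataframe on the Game column: IND has two Game1 rows, so pandas raises an error, while pivot_table() resolves the clash with an aggregation. A sketch (the exact error message may vary across pandas versions):

```python
import pandas as pd

df = pd.DataFrame([
                   ['IND', 'Gold', 'Game1', '9.9'],
                   ['IND', 'Silver', 'Game1', '9.5'],
                   ['IND', 'Bronze', 'Game2', '8'],
                   ['USA', 'Bronze', 'Game1', '9.0'],
                   ['USA', 'Silver', 'Game2', '8.6'],
                   ], columns = ['Country', 'Medal', 'Game', 'Score'])

# (Country='IND', Game='Game1') occurs twice, so pivot() cannot place
# a single Score value in that cell and raises a ValueError.
try:
    df.pivot(index='Country', columns='Game', values='Score')
except ValueError as e:
    print('pivot failed:', e)

# pivot_table() resolves the clash with an aggregation function instead
print(df.pivot_table(index='Country', columns='Game',
                     values='Score', aggfunc='max'))
```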
Grouping
To understand grouping, a concept similar to the one found in databases, consider a dataset of
laptop and desktop sales. Observations for a particular category can be repeated with different
sale prices. If we need to calculate the total sales of each category, we can group similar rows
and apply an aggregation function.
To do so, we create a small dummy dataset and sum up the total sales for each category.
    import pandas as pd

    df = pd.DataFrame([["Laptop", 1000],
                       ["Laptop", 2520],
                       ["Desktop", 3000],
                       ["Desktop", 400]], columns = ['Category', 'Sales'])

    print(df)
    #   Category  Sales
    # 0   Laptop   1000
    # 1   Laptop   2520
    # 2  Desktop   3000
    # 3  Desktop    400

    print(df.groupby(['Category'], sort = False).sum())
    #           Sales
    # Category
    # Laptop     3520
    # Desktop    3400
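Beyond sum(), a grouped object can apply several aggregations at once through agg(); a small sketch on the same dummy data:

```python
import pandas as pd

df = pd.DataFrame([["Laptop", 1000],
                   ["Laptop", 2520],
                   ["Desktop", 3000],
                   ["Desktop", 400]], columns=['Category', 'Sales'])

# One pass computing total, mean, and count of sales per category
summary = df.groupby('Category')['Sales'].agg(['sum', 'mean', 'count'])
print(summary)
```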
Exercise
Problem Statement:
Given a dataframe df with three attributes: set_name (system names), spd_per_day (the day on
which the speed was recorded), and speed (network speed in MBps).
   import pandas as pd

   sys = ['s1', 's1', 's1', 's1',
          's2', 's2', 's2', 's2']
   net_day = ['d1', 'd1', 'd2', 'd2',
              'd1', 'd1', 'd2', 'd2']
   spd = [1.3, 11.4, 5.6, 12.3,
          6.2, 1.1, 20.0, 8.8]
   df = pd.DataFrame({'set_name': sys,
                      'spd_per_day': net_day,
                      'speed': spd})
Do the following:
   1. Construct a dataframe new_df in which the given dataset is grouped by system
      (s1 and s2) and day (d1 and d2), with the median speed for each day per system.
      Also, give the speed attribute a secondary name 'Median'.
   2. Sort the dataframe new_df in ascending order of the median speed.
Hints/Answers:
1. Answer:
      set_name spd_per_day  speed
                           Median
   0        s1          d1   6.35
   1        s1          d2   8.95
   2        s2          d1   3.65
   3        s2          d2  14.40
2. Answer:
      set_name spd_per_day  speed
                           Median
   2        s2          d1   3.65
   0        s1          d1   6.35
   1        s1          d2   8.95
   3        s2          d2  14.40
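One way to produce the outputs above (a sketch; building the secondary 'Median' name with a MultiIndex over the columns is one of several possibilities):

```python
import pandas as pd

sys_names = ['s1', 's1', 's1', 's1', 's2', 's2', 's2', 's2']
net_day = ['d1', 'd1', 'd2', 'd2', 'd1', 'd1', 'd2', 'd2']
spd = [1.3, 11.4, 5.6, 12.3, 6.2, 1.1, 20.0, 8.8]
df = pd.DataFrame({'set_name': sys_names,
                   'spd_per_day': net_day,
                   'speed': spd})

# 1. Median speed per system per day; as_index=False keeps the group
#    keys as ordinary columns rather than an index
new_df = df.groupby(['set_name', 'spd_per_day'], as_index=False).median()

# Attach a secondary name 'Median' to the speed column
new_df.columns = pd.MultiIndex.from_tuples(
    [('set_name', ''), ('spd_per_day', ''), ('speed', 'Median')])
print(new_df)

# 2. Sort in ascending order of the median speed
print(new_df.sort_values(('speed', 'Median')))
```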