Pandas - Jupyter Notebook
Pandas Series
A Pandas Series is a one-dimensional labeled array capable of holding data of any type
(integer, string, float, Python objects, etc.).
The labels are collectively called the index.
A Pandas Series can be thought of as a single column of an Excel spreadsheet, where each entry
in the series corresponds to an individual row in the spreadsheet.
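The cells that produced the two outputs below are not captured in this export. A minimal sketch of how such a Series can be created from a Python list and from a NumPy array (the values match the outputs shown):

import pandas as pd
import numpy as np

# create a Series from a Python list
prices_list = [55, 25, 75, 40, 90]
pd.Series(prices_list)

# create a Series from a NumPy array
prices_array = np.array([55, 25, 75, 40, 90])
pd.Series(prices_array)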
0 55
1 25
2 75
3 40
4 90
dtype: int64
0 55
1 25
2 75
3 40
4 90
dtype: int64
We can see that both the list and the array have been converted to Pandas Series objects.
We also see that each series has automatically been assigned default integer index labels. Let's see how these can be
modified.
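The original cell is not visible in this export; a sketch of one way to attach custom index labels, using the medicine names and prices shown in the output below:

# assign medicine names as the index labels of the Series
medicines = ['Omeprazole', 'Azithromycin', 'Metformin', 'Ibuprofen', 'Cetirizine']
prices = pd.Series([55, 25, 75, 40, 90], index=medicines)
prices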
Omeprazole 55
Azithromycin 25
Metformin 75
Ibuprofen 40
Cetirizine 90
dtype: int64
The price of each medicine was increased by $2.5. Let's add this to the existing price.
A new price list was released by vendors for each medicine. Let's find the difference
between the new price and the old price.
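The cells performing these arithmetic operations are not captured in this export. A sketch of both element-wise operations on the prices Series defined above; the new_prices values here are assumed purely for illustration:

# add a scalar to every element of the Series (broadcasting)
prices = prices + 2.5

# hypothetical new price list released by the vendors (values assumed)
new_prices = pd.Series([77.0, 45.5, 100.0, 50.0, 80.0], index=medicines)

# element-wise difference between the new and the old prices
new_prices - prices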
Omeprazole 77.0
Azithromycin 45.5
Metformin 100.0
Ibuprofen 50.0
Cetirizine 80.0
dtype: float64
Pandas DataFrame
A Pandas DataFrame is a two-dimensional tabular data structure with labeled axes (rows and
columns).
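The cell that created the data frame shown below is missing from this export; a sketch of how such a single-column DataFrame can be built from a list (the variable name df_students is assumed):

# create a DataFrame with a single labeled column from a list of names
students = ['Mary', 'Peter', 'Susan', 'Toby', 'Vishal']
df_students = pd.DataFrame(students, columns=['Student'])
df_students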
Out[66]: Student
0 Mary
1 Peter
2 Susan
3 Toby
4 Vishal
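The next output shows a second column of grades; the original cell is not visible in the export, and the column name 'Grade' is assumed here:

# add a second column of grades (column name 'Grade' is assumed)
df_students['Grade'] = ['B-', 'A+', 'A-', 'B+', 'C']
df_students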
0 Mary B-
1 Peter A+
2 Susan A-
3 Toby B+
4 Vishal C
Data on the total energy consumption of the U.S. was collected from 2012 to 2018. Let's see
how this data can be presented in the form of a data frame.
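The original cell is not captured; a sketch of building this data frame from a dictionary of lists (the column names are assumed, the figures are taken from the output below):

# build the DataFrame from a dictionary of lists (column names assumed)
energy = pd.DataFrame({'Year': [2012, 2013, 2014, 2015, 2016, 2017, 2018],
                       'Consumption': [2152, 2196, 2217, 2194, 2172, 2180, 2258]})
energy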
0 2012 2152
1 2013 2196
2 2014 2217
3 2015 2194
4 2016 2172
5 2017 2180
6 2018 2258
For encryption purposes, a web browser company wants to generate random values with
mean equal to 0 and variance equal to 1. They want 5 randomly generated numbers in 2
different trials.
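The generating cell is not shown in this export; a sketch using NumPy's standard normal generator (mean 0, variance 1). The variable name is assumed and no seed is fixed, so the exact numbers will differ from the output below:

# 5 standard-normal samples (mean 0, variance 1) for each of 2 trials
random_values = pd.DataFrame(np.random.randn(5, 2))
random_values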
0 0.956411 0.292236
1 -0.292173 0.730664
2 0.107673 1.493363
3 -1.156195 0.269528
4 0.091713 0.153680
Accessing Series
The revenue (in billion dollars) of different telecommunication operators in the U.S. was collected for
the year 2020. The following lists consist of the names of the telecommunication operators
and their respective revenues (in billion dollars).
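The cells that created and accessed this Series are missing from the export. A sketch with hypothetical operator names and revenues (only the 171.76 figure is confirmed by the outputs below), showing access by integer position and by index label:

# hypothetical operator names and revenues (in billion dollars)
operators = ['AT&T', 'Verizon', 'T-Mobile', 'Lumen']
revenue = [171.76, 128.29, 68.40, 20.71]
revenue_series = pd.Series(revenue, index=operators)

# access by integer position
revenue_series.iloc[0]

# access by index label
revenue_series['AT&T']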
Out[71]: 171.76
Out[75]: 171.76
Accessing DataFrames
The data of the customers visiting 24/7 Stores from different locations was collected. The data
includes Customer ID, location of the store, gender of the customer, type of product purchased,
quantity of products purchased, and total bill amount. Let's create the dataset and see how to
access different entries of it.
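The cell that builds store_data is not visible in this export. A sketch reconstructed from values that appear in later outputs; the gender and bill column names, and some CustomerID values, are assumptions. The last line selects the location column, matching the output below:

# reconstruct the store_data DataFrame (some column names and IDs are assumed)
store_data = pd.DataFrame({
    'CustomerID': ['CustID00', 'CustID01', 'CustID02', 'CustID03', 'CustID04'],
    'location': ['Chicago', 'Boston', 'Seattle', 'San Francisco', 'Austin'],
    'gender': ['M', 'M', 'F', 'M', 'F'],
    'type': ['Electronics', 'Food&Beverages', 'Food&Beverages', 'Medicine', 'Beauty'],
    'quantity': [1, 3, 4, 2, 1],
    'total_bill': [100, 75, 125, 50, 80]
})

# access a single column by its name
store_data['location']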
Out[80]: 0 Chicago
1 Boston
2 Seattle
3 San Francisco
4 Austin
Name: location, dtype: object
loc method
loc is a label-based indexer used to access rows and columns of pandas objects. When using the loc
method on a dataframe, we specify which rows and which columns we want by using the
following format:
dataframe.loc[row selection, column selection]
DataFrame.loc[] accepts only index labels and returns a row (as a Series) or a
dataframe if the index labels exist in the data frame.
In [83]: # accessing first index value using loc method (indexing starts from 0 in pandas)
store_data.loc[1]
In [84]: # accessing 1st and 4th index values along with location and type columns
store_data.loc[[1,4],['location','type']]
  location            type
1   Boston  Food&Beverages
4   Austin          Beauty
iloc method
The iloc indexer for a Pandas DataFrame is used for integer position-based
indexing/selection. When using the iloc method on a dataframe, we specify
which rows and which columns we want by using the following format:
dataframe.iloc[row selection, column selection]
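The cell producing the next output is missing from this export; a sketch of positional selection with iloc, assuming that column positions 0 and 2 correspond to CustomerID and gender:

# accessing the 1st and 4th index rows and two columns by integer position
store_data.iloc[[1, 4], [0, 2]]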
1 CustID01 M
4 CustID04 F
loc is label-based, which means that you have to specify rows and columns based on their
row and column labels.
iloc is integer position-based, so you have to specify rows and columns by their integer
position values (0-based integer position).
In [89]: # accessing selected rows and columns using iloc method
store_data.iloc[[1,4],['location','type']]

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-89-53acc0d7ec5b> in <module>()
      1 # accessing selected rows and columns using iloc method
----> 2 store_data.iloc[[1,4],['location','type']]

/usr/local/lib/python3.7/dist-packages/pandas/core/indexing.py in __getitem__(self, key)
    871                 # AttributeError for IntervalTree get_value
    872                 pass
--> 873             return self._getitem_tuple(key)
    874         else:
    875             # we by definition only have the 0th axis

/usr/local/lib/python3.7/dist-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
   1441     def _getitem_tuple(self, tup: Tuple):
   1442
-> 1443         self._has_valid_tuple(tup)
   1444         try:
   1445             return self._getitem_lowerdim(tup)

/usr/local/lib/python3.7/dist-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
    700                 raise IndexingError("Too many indexers")
    701             try:
--> 702                 self._validate_key(k, i)
    703             except ValueError as err:
    704                 raise ValueError(

/usr/local/lib/python3.7/dist-packages/pandas/core/indexing.py in _validate_key(self, key, axis)
   1361             # check that the key has a numeric dtype
   1362             if not is_numeric_dtype(arr.dtype):
-> 1363                 raise IndexError(f".iloc requires numeric indexers, got {arr}")
   1364
   1365             # check that the key does not exceed the maximum size of the index
In [90]: print(store_data.loc[4,'type'])
store_data.loc[4,'type'] = 'Electronics'
Electronics
In [91]: store_data
In [93]: store_data['quantity']>1
Out[93]: 0 False
1 True
2 True
3 True
4 False
Name: quantity, dtype: bool
Wherever the condition of greater than 1 is satisfied in the quantity column, True is returned.
Let's retrieve the original rows wherever the condition is satisfied.
In [94]: store_data.loc[store_data['quantity']>1]
Wherever the condition is satisfied we get the original values, and wherever the condition is
not satisfied we do not get those records in the output.
In [95]: store_data
In [96]: # adding a new column in data frame store_data which is a rating (out of 5)
store_data['rating'] = [2,5,3,4,4]
store_data
The CustomerID column is a unique identifier of each customer. This unique identifier will
not help 24/7 Stores in getting useful insights about their customers. So, they have decided
to remove this column from the data frame.
In [97]: store_data.drop('CustomerID',axis=1)
1 Boston M Food&Beverages 3 75 5
4 Austin F Beauty 1 80 4
We successfully removed the 'CustomerID' column from the dataframe. But this change is not
permanent in the dataframe; let's have a look at store_data again.
In [98]: store_data
In [99]: store_data.drop('CustomerID',axis=1,inplace=True)
store_data
1 Boston M Food&Beverages 3 75 5
4 Austin F Beauty 1 80 4
Now the column has been permanently removed from the dataframe.
1 Boston M Food&Beverages 3 75 5
4 Austin F Beauty 1 80 4
0 M Electronics 1 100
1 M Food&Beverages 3 75
2 F Food&Beverages 4 125
3 M Medicine 2 50
4 F Beauty 1 80
1 Boston M Food&Beverages 3 75 5
4 Austin F Beauty 1 80 4
In [103]: store_data.drop(1,axis=0)
4 Austin F Beauty 1 80 4
In [104]: store_data
1 Boston M Food&Beverages 3 75 5
4 Austin F Beauty 1 80 4
Notice that we used axis=0 to drop a row from a data frame, while we were using
axis=1 for dropping a column from the data frame.
Also, to make permanent changes to the data frame we will have to use inplace=True
parameter.
We also see that the index is no longer sequential now that a row has been removed. So, we will
have to reset the index of the data frame. Let's see how this can be done.
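The reset cell is not captured in this export; a sketch of resetting the index (note that, by default, the old index is kept as a new column):

# reset the index; the old index is preserved as a regular column
store_data.reset_index()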
4 Austin F Beauty 1 80 4
3 4 Austin F Beauty 1 80 4
We see that the index of the data frame has now been reset, but the old index has become a
column in the data frame. We do not need the old index as a column, so we can simply
set the parameter drop=True in the reset_index() function.
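A sketch of the corresponding cell, discarding the old index instead of keeping it as a column (inplace=True would be needed to make the change permanent):

# reset the index and discard the old index entirely
store_data.reset_index(drop=True)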
3 Austin F Beauty 1 80 4
There are three main ways of combining data frames in pandas:
1. concat
2. join
3. merge
In [109]: data_cust
1 102 Medium no 52
In [110]: data_cust_new
4 101 12 123
5 103 9 214
6 104 44 663
7 105 21 331
In [111]: pd.concat([data_cust,data_cust_new],axis=0)
In [112]: pd.concat([data_cust,data_cust_new],axis=1)
In [115]: pd.merge(data_cust,data_cust_new,how='right',on='customerID')
In [117]: data_quarters
Out[117]: Q1 Q2
I0 101 201
I1 102 202
I2 103 203
In [118]: data_quarters_new
Out[118]: Q3 Q4
I0 301 401
I2 302 402
I3 303 403
join behaves just like merge, except that instead of using the values of one of the columns to
combine data frames, it uses the index labels.
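The cell calling join is not captured in this export; a sketch using the two data frames defined above (by default, join performs a left join on the index labels):

# combine the two data frames on their index labels
data_quarters.join(data_quarters_new)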
Out[119]: Q1 Q2 Q3 Q4
Note
In real-life scenarios, we deal with much larger datasets that have thousands of rows and
multiple columns. It will not be feasible for us to create such datasets using multiple lists, especially as
the number of columns and rows increases.
So, it is clear we need a more efficient way of handling the data simultaneously at the column
and row levels. In Python, we can import datasets from our local system, from links, or from
databases and work on them directly instead of creating our own dataset.
When the data file and jupyter notebook are in the same folder.
In [120]: # Using the pd.read_csv() function will work without any path if the notebook and the data file are in the same folder
# data = pd.read_csv('StockData.csv')
Once we have access, we can load files from Google Drive using the read_csv() function.
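The cells that mount the drive and read the file are not shown in this export; a sketch of the typical Colab-style workflow (the file path is assumed):

# mount Google Drive (in Colab) and read a CSV file from it; the path is assumed
from google.colab import drive
drive.mount('/content/drive')

data = pd.read_csv('/content/drive/MyDrive/StockData.csv')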
In [ ]: # head() function helps us to see the first 5 rows of the data
data.head()
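The cell that created data_excel is missing from this export; a sketch using pandas' read_excel (the file name is assumed):

# read an Excel file into a DataFrame; the file name is assumed
data_excel = pd.read_excel('StockData.xlsx')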
In [ ]: data_excel.head()
In jupyter notebook, the dataset will be saved in the folder where the jupyter notebook is
located.
We can also save the dataset to a desired folder by providing the path/location of the folder.
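The saving cells are not visible in this export; a sketch using to_csv, first into the notebook's own folder and then into an assumed target path:

# save in the same folder as the notebook
data.to_csv('StockData_saved.csv', index=False)

# save to a specific folder by giving the full path (the path here is assumed)
data.to_csv('C:/Users/username/Desktop/StockData_saved.csv', index=False)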
In [ ]: data.head()
In [ ]: data.tail()
In [ ]: data.shape
In [ ]: data.info()
The price column is numeric in nature while the stock and date columns are of object types.
In [ ]: data['price'].min()
In [ ]: data['price'].max()
unique() - to check the unique values that are present in a column
In [ ]: data['stock'].unique()
value_counts() - to check the number of occurrences of each unique value in a
column
In [ ]: data['stock'].value_counts()
In [ ]: data['stock'].value_counts(normalize=True)
Statistical Functions
In [ ]: data['price'].mean()
In [ ]: data['price'].median()
In [ ]: data['stock'].mode()
To access a particular mode when the dataset has more than 1 mode
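The corresponding cell is not captured in this export; mode() returns a Series of all modes, so a particular one can be picked by position, as sketched below:

# pick the first mode when a column has more than one mode
data['stock'].mode()[0]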
Group By function
Pandas dataframe.groupby() function is used to split the data into groups based on some
criteria.
In [ ]: data.groupby(['stock'])['price'].mean()
Here the groupby function is used to split the data into the 4 stocks that are present in the
dataset, and then the mean price of each of the 4 stocks is calculated.
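The median cell is not captured in this export; a sketch of the corresponding group-wise median:

# median price of each stock group
data.groupby(['stock'])['price'].median()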
Here the groupby function is used to split the data into the 4 stocks that are present in the
dataset, and then the median price of each of the 4 stocks is calculated.
The Pandas apply() function lets you manipulate columns and rows in a DataFrame.
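The definition of the profit function used in the next cell is not visible in this export; a minimal hypothetical version (the 10% margin is purely illustrative):

# hypothetical helper applied to every value of the price column
def profit(price):
    return price * 0.10  # assume a 10% profit on each price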
In [ ]: data['price'].apply(profit)
We observe that the date column is of object type, whereas it should be of datetime data
type.
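The conversion cell is missing from this export; the standard way to do this in pandas is with to_datetime, sketched below:

# convert the date column from object (string) type to datetime
data['date'] = pd.to_datetime(data['date'])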
In [ ]: data.info()
We observe that the date column has been converted to datetime format
In [ ]: data.head()
The 'date' column is now in datetime format. Now we can change the format of the date
to any other format.
In [ ]: data['date'].dt.strftime('%m/%d/%Y')
In [ ]: data['date'].dt.strftime('%m-%d-%y')
In [ ]: data['date'].dt.year
Creating a new column and adding the extracted year values into the dataframe.
In [ ]: data['date'].dt.month
Creating a new column and adding the extracted month values into the dataframe.
In [ ]: data['date'].dt.day
Creating a new column and adding the extracted day values into the dataframe.
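The cells that create these new columns are not visible in this export; a sketch of all three assignments (the column names year, month, and day match the note after the next output):

# store the extracted date parts as new columns in the DataFrame
data['year'] = data['date'].dt.year
data['month'] = data['date'].dt.month
data['day'] = data['date'].dt.day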
In [ ]: data.head()
We can see that year, month, and day columns have been added in the dataset.