Experiment No: 2
Title: Study of Python Libraries for ML application such as Pandas and Matplotlib, Keras and
TensorFlow
Objective:
• To understand data preprocessing and analysis using Pandas library
• To understand data visualization in the form of 2D graphs and plots using Matplotlib library
Theory/Description:
List important ML libraries
o Python Libraries for Machine Learning
▪ Numpy
▪ Scipy
▪ Scikit-learn
▪ Theano
▪ TensorFlow
▪ Keras
▪ PyTorch
▪ Pandas
▪ Matplotlib
Importance of Pandas library
Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data
structures and data analysis tools for the Python programming language.
Pandas makes importing, analyzing, and visualizing data much easier. It builds on packages like
NumPy and matplotlib to give you a single, convenient, place to do most of your data analysis
and visualization work.
Advantages of Pandas Library
There are many benefits of the Python Pandas library; listing them all would probably take longer than learning the library itself. These, therefore, are the core advantages of using the Pandas library:
1. Data representation
Pandas provides extremely streamlined forms of data representation. This helps to analyze and understand data better. Simpler data representation facilitates better results for data science projects.
2. Less writing and more work done
This is one of the best advantages of Pandas. What would have taken multiple lines of plain Python without any support libraries can be achieved in one or two lines with Pandas.
Thus, using Pandas helps to shorten the procedure of handling data. With the time saved, we can focus more on data analysis algorithms.
3. An extensive set of features
Pandas is really powerful. It provides a huge set of important commands and features for analyzing your data easily. We can use Pandas to perform various tasks such as filtering data according to certain conditions, or segmenting and segregating the data according to preference.
4. Efficiently handles large data
Wes McKinney, the creator of Pandas, designed the library primarily to handle large datasets efficiently. Pandas helps to save a lot of time by importing large amounts of data very fast.
5. Makes data flexible and customizable
Pandas provides a huge feature set to apply to your data so that you can customize, edit, and pivot it according to your needs. This helps to bring the most out of your data.
6. Made for Python
Python has become one of the most sought-after programming languages in the world, with its extensive feature set and the sheer productivity it provides. Being able to use Pandas in Python therefore lets you tap into the power of the many other features and libraries you use with Python, such as NumPy, SciPy, and Matplotlib.
Pandas Library
The primary two components of pandas are the Series and DataFrame.
A Series is essentially a column, and a DataFrame is a multi-dimensional table made up of a
collection of Series.
DataFrames and Series are quite similar in that many operations that you can do with
one you can do with the other, such as filling in null values and calculating the mean.
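As a quick illustration (with made-up names and values), a Series and a DataFrame can be built by hand, and a column extracted from a DataFrame is itself a Series:

```python
import pandas as pd

# A Series is a single labeled column of values.
ages = pd.Series([25, 32, 19], name="age")

# A DataFrame is a table: a collection of Series sharing an index.
people = pd.DataFrame({
    "name": ["Asha", "Ravi", "Mei"],
    "age": ages,
})

print(type(people["age"]))   # each column of a DataFrame is a Series
print(people["age"].mean())  # many operations work on both types
```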
❖ Reading data from CSVs
With CSV files all you need is a single line to load in the data:
df = pd.read_csv('purchases.csv')
df
Let's load in the IMDB movies dataset to begin:
movies_df = pd.read_csv("IMDB-Movie-Data.csv", index_col="Title")
We're loading this dataset from a CSV and designating the movie titles to be our index.
❖ Viewing your data
The first thing to do when opening a new dataset is print out a few rows to keep as a visual
reference. We accomplish this with .head():
movies_df.head()
Another fast and useful attribute is .shape, which outputs just a tuple of (rows, columns):
movies_df.shape
Note that .shape has no parentheses and is a simple tuple of format (rows, columns). So we have 1000 rows and 11 columns in our movies DataFrame.
You'll reach for .shape a lot when cleaning and transforming data. For example, you might filter some rows based on some criteria and then want to know quickly how many rows were removed.
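For instance, a quick sketch with a toy DataFrame (the column name and values are invented here) shows how .shape reveals how many rows a filter removed:

```python
import pandas as pd

df = pd.DataFrame({"rating": [8.1, 5.2, 9.0, 6.7]})
before = df.shape[0]

# keep only highly rated rows
high = df[df["rating"] >= 7.0]
removed = before - high.shape[0]

print(high.shape)  # (2, 1)
print(removed)     # 2
```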
❖ Handling duplicates
This dataset does not have duplicate rows, but it is always important to verify you
aren't aggregating duplicate rows.
To demonstrate, let's simply double up our movies DataFrame by concatenating it with itself (DataFrame.append() was removed in pandas 2.0, so we use pd.concat()):
temp_df = pd.concat([movies_df, movies_df])
temp_df.shape
Out:
(2000, 11)
pd.concat() returns a new copy without affecting the original DataFrame. We capture this copy in temp_df so we aren't working with the real data.
Notice that calling .shape quickly proves our DataFrame rows have doubled.
Now we can try dropping duplicates:
temp_df = temp_df.drop_duplicates()
temp_df.shape
Out:
(1000, 11)
By default, the drop_duplicates() method also returns a copy of your DataFrame, this time with duplicates removed. Calling .shape confirms we're back to the 1000 rows of our original dataset.
It's a little verbose to keep assigning DataFrames to the same variable like in this
example. For this reason, pandas has the inplace keyword argument on many of
its methods. Using inplace=True will modify the DataFrame object in place:
temp_df.drop_duplicates(inplace=True)
Now our temp_df will have the transformed data automatically.
Another important argument for drop_duplicates() is keep, which has three
possible options:
• first: (default) Drop duplicates except for the first occurrence.
• last: Drop duplicates except for the last occurrence.
• False: Drop all duplicates.
Since we didn't define the keep argument in the previous example, it defaulted to first. This means that if two rows are the same, pandas will drop the second row and keep the first row. Using last has the opposite effect: the first row is dropped.
False, on the other hand, drops all duplicates. If two rows are the same, both will be dropped. Watch what happens to temp_df:
temp_df = pd.concat([movies_df, movies_df]) # make a new copy
temp_df.drop_duplicates(inplace=True, keep=False)
temp_df.shape
Out:
(0, 11)
Since all rows were duplicates, keep=False dropped them all, resulting in zero rows being left over. If you're wondering why you would want to do this, one reason is that it allows you to locate all duplicates in your dataset using a conditional selection.
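One way to locate, rather than drop, every duplicated row is the .duplicated() method with keep=False, sketched here on a toy DataFrame (the column names and values are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["A", "B", "A", "C"],
    "year": [2001, 2002, 2001, 2003],
})

# keep=False marks every member of a duplicate group as True
mask = df.duplicated(keep=False)
print(df[mask])  # both copies of the "A" row
```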
❖ Column cleanup
Many times, datasets will have verbose column names with symbols, upper- and lowercase words, spaces, and typos. To make selecting data by column name easier, we can spend a little time cleaning up their names.
Here's how to print the column names of our dataset:
movies_df.columns
Out:
Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year',
'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
'Metascore'],
dtype='object')
Not only does .columns come in handy if you want to rename columns by allowing for simple copy and paste, it's also useful if you need to understand why you are receiving a KeyError when selecting data by column.
We can use the .rename() method to rename certain or all columns via a dict. We
don't want parentheses, so let's rename those:
movies_df.rename(columns={
'Runtime (Minutes)': 'Runtime',
'Revenue (Millions)': 'Revenue_millions'
}, inplace=True)
movies_df.columns
Out:
Index(['Rank', 'Genre', 'Description', 'Director', 'Actors',
'Year', 'Runtime',
'Rating', 'Votes', 'Revenue_millions', 'Metascore'],
dtype='object')
Excellent. But what if we want to lowercase all names? Instead of using .rename()
we could also set a list of names to the columns like so:
movies_df.columns = ['rank', 'genre', 'description', 'director',
'actors', 'year', 'runtime',
'rating', 'votes', 'revenue_millions', 'metascore']
movies_df.columns
Out:
Index(['rank', 'genre', 'description', 'director', 'actors',
'year', 'runtime',
'rating', 'votes', 'revenue_millions', 'metascore'],
dtype='object')
But that's too much work. Instead of just renaming each column manually we can do a
list comprehension:
movies_df.columns = [col.lower() for col in movies_df]
movies_df.columns
Out:
Index(['rank', 'genre', 'description', 'director', 'actors',
'year', 'runtime',
'rating', 'votes', 'revenue_millions', 'metascore'],
dtype='object')
list (and dict) comprehensions come in handy a lot when working with pandas
and data in general.
It's a good idea to lowercase, remove special characters, and replace spaces
with underscores if you'll be working with a dataset for some time.
❖ How to work with missing values
When exploring data, you'll most likely encounter missing or null values, which are essentially placeholders for non-existent values. Most commonly you'll see Python's None or NumPy's np.nan, each of which is handled differently in some situations.
There are two options in dealing with nulls:
1. Get rid of rows or columns with nulls
2. Replace nulls with non-null values, a technique known as imputation
Let's calculate the total number of nulls in each column of our dataset. The first step is to check which cells in our DataFrame are null:
movies_df.isnull()
Notice isnull() returns a DataFrame where each cell is either True or False depending on that
cell's null status.
To count the number of nulls in each column, we use an aggregate function for summing:
movies_df.isnull().sum()
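A minimal sketch of both strategies for dealing with nulls, using a toy column with one missing value (the data is invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"revenue": [10.0, np.nan, 30.0]})

print(df.isnull().sum())  # revenue: 1 null cell

# Option 1: drop rows containing nulls
dropped = df.dropna()

# Option 2: impute nulls, here with the column mean
imputed = df.fillna(df["revenue"].mean())
print(imputed["revenue"].tolist())  # [10.0, 20.0, 30.0]
```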
❖ DataFrame slicing, selecting, extracting
Up until now we've focused on some basic summaries of our data. We've learned about simple
column extraction using single brackets, and we imputed null values in a column using fillna().
Below are the other methods of slicing, selecting, and extracting you'll need to use constantly.
It's important to note that, although many methods are the same, DataFrames and Series have different attributes, so you'll need to be sure which type you are working with or else you will receive attribute errors.
Let's look at working with columns first.
By column
You already saw how to extract a column using square brackets like this:
genre_col = movies_df['genre']
type(genre_col)
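Selecting with single brackets returns a Series, while passing a list of column names returns a DataFrame; a small sketch with invented data:

```python
import pandas as pd

movies = pd.DataFrame({"genre": ["Action", "Drama"], "rating": [7.5, 8.2]})

genre_col = movies["genre"]    # single brackets: a Series
genre_df = movies[["genre"]]   # double brackets: a DataFrame

print(type(genre_col).__name__)  # Series
print(type(genre_df).__name__)   # DataFrame
```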
Importance of Matplotlib library
To make necessary statistical inferences, it becomes necessary to visualize your data and
Matplotlib is one such solution for the Python users. It is a very powerful plotting library
useful for those working with Python and NumPy. The most used module of Matplotib is
Pyplot which provides an interface like MATLAB but instead, it uses Python and it is
open source.
❖ General Concepts
A Matplotlib figure can be categorized into several parts as below:
1. Figure: It is the whole figure, which may contain one or more axes (plots). You can think of a Figure as a canvas which contains plots.
2. Axes: It is what we generally think of as a plot. A Figure can contain many Axes. Each Axes contains two (or three, in the case of 3D) Axis objects and has a title, an x-label, and a y-label.
3. Axis: These are the number-line-like objects that take care of generating the graph limits.
4. Artist: Everything one can see on the figure is an artist, such as Text objects, Line2D objects, and collection objects. Most Artists are tied to Axes.
Matplotlib Library
Pyplot is a module of Matplotlib which provides simple functions to add plot
elements like lines, images, text, etc. to the current axes in the current figure.
❖ Make a simple plot
import matplotlib.pyplot as plt
import numpy as np
List of all the methods as they appear:
• plot(x-axis values, y-axis values) — plots a simple line graph with x-axis values against y-axis values
• show() — displays the graph
• title("string") — sets the title of the plot as specified by the string
• xlabel("string") — sets the label for the x-axis as specified by the string
• ylabel("string") — sets the label for the y-axis as specified by the string
• figure() — used to control figure-level attributes
• subplot(nrows, ncols, index) — adds a subplot to the current figure
• suptitle("string") — adds a common title to the figure specified by the string
• subplots(nrows, ncols, figsize) — a convenient way to create subplots in a single call. It returns a tuple of a figure and an array of axes.
• set_title("string") — an axes-level method used to set the title of subplots in a figure
• bar(categorical variables, values, color) — used to create vertical bar graphs
• barh(categorical variables, values, color) — used to create horizontal bar graphs
• legend(loc) — used to make a legend for the graph
• xticks(index, categorical variables) — gets or sets the current tick locations and labels of the x-axis
• pie(values, categorical variables) — used to create a pie chart
• hist(values, number of bins) — used to create a histogram
• xlim(start value, end value) — used to set the limit of values of the x-axis
• ylim(start value, end value) — used to set the limit of values of the y-axis
• scatter(x-axis values, y-axis values) — plots a scatter plot with x-axis values against y-axis values
• axes() — adds an axes to the current figure
• set_xlabel("string") — axes-level method used to set the x-label of the plot specified as a string
• set_ylabel("string") — axes-level method used to set the y-label of the plot specified as a string
• scatter3D(x-axis values, y-axis values) — plots a three-dimensional scatter plot with x-axis values against y-axis values
• plot3D(x-axis values, y-axis values) — plots a three-dimensional line graph with x-axis values against y-axis values
Here we import Matplotlib's Pyplot module and the NumPy library, as most of the data we will be working with will be in the form of arrays.
We pass two arrays as input arguments to Pyplot's plot() method and use the show() method to display the required plot. Note that the first array appears on the x-axis and the second array appears on the y-axis of the plot. Now that our first plot is ready, let us add the title and label the x-axis and y-axis using the methods title(), xlabel() and ylabel() respectively.
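Putting those steps together, a minimal sketch (the values here are chosen arbitrarily):

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.array([1, 2, 3, 4])
y = x ** 2

plt.plot(x, y)            # first array on the x-axis, second on the y-axis
plt.title("Simple plot")  # title of the plot
plt.xlabel("x values")    # label for the x-axis
plt.ylabel("y values")    # label for the y-axis
plt.show()
```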
We can also specify the size of the figure using the figure() method, passing a (width, height) tuple in inches to the figsize argument.
With every X and Y argument, you can also pass an optional third argument in the form of a string which indicates the colour and line type of the plot. The default format is 'b-', which means a solid blue line. In the figure below we use 'go', which means green circles. Likewise, we can make many such combinations to format our plot.
We can also plot multiple sets of data by passing in multiple sets of arguments of X and
Y axis in the plot() method as shown.
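Both ideas can be sketched together, assuming arbitrary data; 'go' draws green circles and the second x, y pair adds a second dataset to the same plot:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1, 5)

# one call can draw several datasets, each with its own format string:
# "go" = green circles, "r--" = red dashed line
plt.plot(x, x, "go", x, x ** 2, "r--")
plt.show()
```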
❖ Multiple plots in one figure:
We can use the subplot() method to add more than one plot in one figure. In the image below, we used this method to separate the two graphs which we plotted on the same axes in the previous example. The subplot() method takes three arguments: nrows, ncols and index. They indicate the number of rows, the number of columns and the index number of the sub-plot. For instance, in our example, we want to create two sub-plots in one figure such that they come in one row and two columns, and hence we pass the arguments (1,2,1) and (1,2,2) to the subplot() method. Note that we have separately used the title() method for both subplots. We use the suptitle() method to make a centralized title for the figure.
If we want our sub-plots in two rows and single column, we can pass arguments
(2,1,1) and (2,1,2)
The above way of creating subplots becomes a bit tedious when we want many subplots in our figure. A more convenient way is to use the subplots() method; notice the extra 's' in the name. This method takes two arguments, nrows and ncols, as the number of rows and number of columns respectively. It creates two objects, a figure and an array of axes, which we store in the variables fig and ax and can use to change figure-level and axes-level attributes respectively. Note that these variable names are chosen arbitrarily.
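A sketch of the subplots() call described above, with invented data and a one-row, two-column layout:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1, 5)

# one row, two columns: fig holds figure-level attributes,
# ax is an array of two Axes objects
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))

ax[0].plot(x, x)
ax[0].set_title("Linear")

ax[1].plot(x, x ** 2)
ax[1].set_title("Square")

fig.suptitle("Two subplots in one figure")
plt.show()
```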
Keras API
Keras is a high-level neural network API, written in Python. It is powerful and easy to use for developing and evaluating deep learning models, and it runs seamlessly on CPU and GPU. Keras can use TensorFlow, Theano, MXNet, and CNTK (Microsoft) as backends.
❖ Why Keras?
➢ Allows easy and fast prototyping
➢ Supports convolutional networks, recurrent networks, and combinations of the two
➢ Provides clear and actionable feedback for user error
➢ Follows best practices for reducing cognitive load
❖ Installation of Keras
⮚ Install Keras in virtualenv:
A. pip3 install keras
⮚ Install Keras from the GitHub source:
A. Clone Keras using git:
i. git clone https://github.com/keras-team/keras.git
B. cd to the Keras folder and run the install command:
i. cd keras
ii. sudo python setup.py install
❖ Creating a Keras Model
➢ Architecture Definition: Number of layers, number of nodes in layers, and the activation function to be used
➢ Compile: Defines the loss function and details about how optimization works
➢ Fit: Trains the model through back propagation and optimization of weights on the input data
➢ Predict: Makes predictions with the trained model
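The steps above can be sketched as follows; the layer sizes and the random data are invented for illustration, and this assumes TensorFlow's bundled Keras:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# toy data: 100 samples, 4 features, binary labels (invented)
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=(100,))

# 1. Architecture definition: layers, nodes, activations
model = keras.Sequential([
    keras.Input(shape=(4,)),
    layers.Dense(8, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

# 2. Compile: loss function and optimizer
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# 3. Fit: train via back propagation
model.fit(X, y, epochs=2, verbose=0)

# 4. Predict with the trained model
preds = model.predict(X[:5], verbose=0)
print(preds.shape)  # (5, 1)
```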
TensorFlow Library
Tensor: A multidimensional array
Flow: A graph of operations
A popular open-source library for deep learning and machine learning, developed by the Google Brain Team and released in 2015. TensorFlow uses a dataflow graph to represent computation; dataflow is a common programming model for parallel computing. TensorFlow is used mainly for classification, perception, understanding, discovery, prediction, and creation.
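A tiny sketch of the "tensor" part: multidimensional arrays flowing through operations (the values here are arbitrary):

```python
import tensorflow as tf

# a tensor is a multidimensional array
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[1.0, 0.0], [0.0, 1.0]])  # identity matrix

# operations on tensors form the computation graph
c = tf.matmul(a, b)  # matrix product; multiplying by identity returns a
print(c.numpy())     # [[1. 2.] [3. 4.]]
```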
❖ Why TensorFlow?
➢ Flexibility (highly efficient C++ implementations of ML operations, with the flexibility to create all sorts of computations)
➢ Parallel Computation (Supports Distributed Computing)
➢ Multiple Environment Friendly (Linux, macOS, iOS, Android, Raspberry Pi, Windows)
➢ Large Community (Popular and growing)
❖ Installation of TensorFlow
➢ TensorFlow 2 packages require a pip version >19.0.
A. pip install --upgrade pip
Conclusion: In this experiment, we learned to use Python libraries for ML applications such as Pandas, Matplotlib, Keras, and TensorFlow.