PRACTICAL 2
Aim: To understand and implement data exploration and visualization techniques using
Pandas Library.
Prerequisite:
Python programming; the pandas, NumPy, Matplotlib, and Seaborn libraries.
Outcome: After successful completion of this experiment, students will be able to:
Find null values and replace them.
Handle duplicate values.
Understand and implement class label encoding.
Understand and implement one-hot encoding.
Use different Python libraries for plotting data.
Plot data using different types of plots.
Theory:
Missing data occurs when no information is provided for one or more items or for a
whole unit. It is a significant problem in real-life scenarios. In pandas, missing values
are also referred to as NA (Not Available) values. Many datasets arrive with missing
data, either because the information exists but was not collected or because it never
existed. For example, some users in a survey may choose not to share their income, and
others may choose not to share their address, so those values are missing from the
dataset.
In pandas, missing data is represented by two values:
None: None is a Python singleton object that is often used for missing data in
Python code.
NaN: NaN (short for Not a Number) is a special floating-point value recognized by
all systems that use the standard IEEE floating-point representation.
Pandas treats None and NaN as essentially interchangeable for indicating missing or null
values. To support this convention, there are several useful functions for detecting,
removing, and replacing null values in a pandas DataFrame:
isnull()
notnull()
dropna()
fillna()
replace()
interpolate()
To check for missing values in a pandas DataFrame, we use the functions isnull() and
notnull(). Both return a Boolean mask with the same shape as the input: isnull() marks
missing values as True, and notnull() marks present values as True.
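As a minimal sketch of the two detection functions, assuming a small made-up survey table (the column names and values are hypothetical, not taken from any dataset in this practical):

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; None and np.nan both mark missing entries.
df = pd.DataFrame({
    "name": ["Asha", "Ben", None, "Dev"],
    "income": [52000.0, np.nan, 61000.0, np.nan],
})

missing_mask = df.isnull()               # True where a value is missing
present_mask = df.notnull()              # True where a value is present
missing_per_column = df.isnull().sum()   # number of missing values per column

print(missing_mask)
print(missing_per_column)
```

Summing the Boolean mask works because True counts as 1, which gives a quick per-column count of missing values.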
The fillna() method replaces missing (NaN or None) values with a specified value. It
returns a new DataFrame object unless the inplace parameter is set to True, in which case
it modifies the original DataFrame and returns None.
Syntax: dataframe.fillna(value, method, axis, inplace, limit, downcast)
Note: in recent pandas releases (2.x) the method and downcast parameters are deprecated;
ffill() and bfill() can be used in place of method.
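The behaviour described above can be sketched as follows, assuming a small made-up table. The column names, the choice of the mean as a fill value, and the "Unknown" placeholder are illustrative assumptions, not a recommended policy:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing entries in both columns.
df = pd.DataFrame({
    "income": [52000.0, np.nan, 61000.0, np.nan],
    "city": ["Pune", None, "Mumbai", "Pune"],
})

# fillna returns a new DataFrame; df itself is left unchanged.
# A dict gives a different fill value for each column.
filled = df.fillna({"income": df["income"].mean(), "city": "Unknown"})

# dropna removes every row that contains at least one missing value.
dropped = df.dropna()

print(filled)
print(dropped)
```

Filling keeps all rows at the cost of inventing values, while dropping keeps only observed values at the cost of losing rows; which is appropriate depends on the dataset.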
Find duplicate columns in a DataFrame: To find duplicate columns, we can write a helper
function that iterates through all columns of the DataFrame and, for each column, checks
whether any other column in the DataFrame has the same contents. If so, that column's
name is stored in a collection of duplicate columns. In the end, the helper function
returns the list of duplicate column names.
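One possible version of such a helper is sketched below. The function name get_duplicate_columns and the sample data are hypothetical, and the sketch assumes that column names are unique:

```python
import pandas as pd

def get_duplicate_columns(df):
    """Return names of columns whose contents repeat an earlier column."""
    duplicates = []
    columns = list(df.columns)
    for i, first in enumerate(columns):
        if first in duplicates:
            continue  # already known to be a copy of an earlier column
        for other in columns[i + 1:]:
            # Series.equals compares values (and dtype), not column names.
            if other not in duplicates and df[first].equals(df[other]):
                duplicates.append(other)
    return duplicates

# Hypothetical data: "a_copy" repeats the contents of "a".
df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "a_copy": [1, 2, 3],
})
duplicate_columns = get_duplicate_columns(df)
print(duplicate_columns)

# The duplicates can then be removed with drop().
df_unique = df.drop(columns=duplicate_columns)
```

The pairwise comparison is quadratic in the number of columns, which is usually acceptable for datasets with a modest number of columns.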
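The pandas drop_duplicates() method removes duplicate rows from a DataFrame. By
default it keeps the first occurrence of each duplicated row. The related duplicated()
method returns a Boolean mask that marks the repeated rows.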
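A short sketch of both methods, assuming a small made-up table (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical data: row 2 repeats row 0 exactly.
df = pd.DataFrame({
    "make": ["BMW", "Audi", "BMW", "Audi"],
    "year": [2011, 2015, 2011, 2016],
})

duplicate_mask = df.duplicated()        # True for a row that repeats an earlier row
deduplicated = df.drop_duplicates()     # drops fully identical rows, keeps the first

# subset restricts the comparison to the listed columns.
one_per_make = df.drop_duplicates(subset=["make"], keep="first")

print(duplicate_mask.sum(), "duplicate row(s) found")
print(deduplicated)
print(one_per_make)
```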
Encoding: In many practical data science activities, the dataset contains categorical
variables. Some machine learning algorithms can handle categorical values without further
manipulation, but many others cannot. The analyst therefore has to decide how to turn
these text attributes into numerical values for further processing. The Python libraries
pandas and scikit-learn provide several approaches for transforming categorical data into
suitable numeric values.
Label encoding: Label encoding simply converts each distinct value in a column to a
number. It has the advantage of being straightforward, but the disadvantage that the
numeric values can be “misinterpreted” by algorithms as having an order or magnitude.
For example, the value 0 is obviously less than the value 4, but does that ordering
really correspond to anything in the real-life data?
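Two common ways to label-encode a column with pandas alone are sketched below. The column name, the category values, and the chosen ordering are hypothetical; scikit-learn's LabelEncoder is another option that is not shown here:

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"size": ["Compact", "Large", "Midsize", "Compact"]})

# Option 1: let pandas assign codes. Categories are sorted alphabetically,
# so the codes carry no real-world meaning (Compact=0, Large=1, Midsize=2).
df["size_code"] = df["size"].astype("category").cat.codes

# Option 2: an explicit mapping, useful when the categories do have an
# order that the numbers should reflect (the order here is an assumption).
size_order = {"Compact": 0, "Midsize": 1, "Large": 2}
df["size_ordinal"] = df["size"].map(size_order)

print(df)
```

The two columns differ for "Large" and "Midsize", which illustrates the concern above: automatically assigned codes impose an order that may not match the data.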
One-hot encoding: The basic idea of one-hot encoding is to create one new variable per
category, taking the values 0 and 1, to represent the original categorical values. We use
one-hot encoding when the categorical feature is nominal (not ordinal), so that no
artificial order is imposed on the categories. This technique is very useful, but it can
cause the number of columns to grow greatly if a column has many unique values.
The pandas get_dummies() function is one of the easiest ways to implement one-hot
encoding, and it has several useful parameters (for example columns, prefix, and
drop_first).
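A minimal sketch of one-hot encoding with get_dummies(), assuming a made-up column (the column name, values, and prefix are hypothetical):

```python
import pandas as pd

# Hypothetical nominal column with two categories.
df = pd.DataFrame({"transmission": ["MANUAL", "AUTOMATIC", "MANUAL"]})

# columns selects which columns to encode; prefix names the new columns;
# dtype=int produces 0/1 instead of the default Boolean values.
encoded = pd.get_dummies(df, columns=["transmission"], prefix="trans", dtype=int)

print(encoded)
```

Each category becomes its own column (here trans_AUTOMATIC and trans_MANUAL), and the original column is removed from the result.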
Visualization:
Python is one of the most popular programming languages for data analytics as well as
data visualization. Several libraries are available for creating attractive and complex
data visualizations. These libraries are popular because they let analysts and
statisticians build visualizations to their own specifications through a convenient
interface, with the data visualization tools all in one place.
Matplotlib is a 2-D plotting and data visualization library for Python, and it is one of
the most popular and widely used plotting libraries in the Python community. It provides
an interactive environment across multiple platforms. Matplotlib can be used in Python
scripts, the Python and IPython shells, the Jupyter notebook, web application servers,
etc. We can use Matplotlib to create line plots, bar charts, pie charts, histograms,
scatterplots, error charts, and other visualization charts.
Plotly is a free, open-source graphing library that can be used to create data visualizations.
Plotly (plotly.py) is built on top of the Plotly JavaScript library (plotly.js) and can be used
to create web-based data visualizations that can be displayed in Jupyter notebooks or web
applications using Dash or saved as individual HTML files. Plotly provides more than 40
unique chart types like scatter plots, histograms, line charts, bar charts, pie charts, error
bars, box plots, multiple axes, sparklines, dendrograms, 3-D charts, etc.
Seaborn is a Python data visualization library that is based on Matplotlib and closely
integrated with NumPy and pandas data structures. Seaborn has various dataset-oriented
plotting functions that operate on data frames and arrays containing whole datasets, and
it internally performs the statistical aggregation and mapping needed to create
informative plots. It is a high-level interface for creating attractive and informative
statistical graphics that are integral to exploring and understanding data. Seaborn
graphics include bar charts, histograms, scatterplots, box plots, heatmaps, etc. (Seaborn
has no dedicated pie chart function; pie charts are drawn with Matplotlib.) Seaborn also
has various tools for choosing color palettes that can reveal patterns in the data.
Scatter Plot:
A scatter plot is a type of data visualization that shows the relationship between two
numerical variables. To draw a scatter plot with pandas, use the plot accessor of the
DataFrame class. Calling the scatter() method on the plot accessor draws a plot of one
column of the DataFrame against another.
Syntax: DataFrame.plot.scatter(x, y, s=None, c=None)
Parameters:
x: column name to be used as horizontal coordinates for each point
y: column name to be used as vertical coordinates for each point
s: size of dots
c: color of dots
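The call can be sketched as follows, using made-up mileage values. The column names, the numbers, and the output file name scatter.png are hypothetical; title is an extra keyword accepted by the pandas plotting interface:

```python
import pandas as pd

# Hypothetical mileage data; plotting with pandas requires Matplotlib.
df = pd.DataFrame({
    "city_mpg": [18, 22, 25, 30, 16],
    "highway_mpg": [26, 30, 34, 38, 23],
})

# x and y are column names; s sets the marker size and c the marker color.
ax = df.plot.scatter(
    x="city_mpg",
    y="highway_mpg",
    s=40,
    c="tab:blue",
    title="Highway vs. city mileage (made-up data)",
)

# Save the figure; in a Jupyter notebook the plot is also shown inline.
ax.figure.savefig("scatter.png")
```

The method returns a Matplotlib Axes object, so axis labels, limits, and other details can be adjusted afterwards with the usual Matplotlib calls.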
Correlation:
Given two variables, if the value of one variable tends to change with the value of the
other, we say the variables are related. The statistical measure of the relationship
between two variables is called "correlation".
Correlation means association; it measures the extent to which two variables are related.
1. Positive Correlation: The two variables increase together and decrease together.
2. Negative Correlation: One variable increases as the other decreases, and vice
versa.
To find the pairwise correlation between the numeric columns of a DataFrame, the corr()
method is used. A heatmap is an effective way of plotting the resulting correlation
matrix: it shows the correlation of every pair of variables as a colored grid. The
heatmap() function is provided by the Seaborn library.
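A minimal sketch of computing and plotting a correlation matrix, assuming made-up numeric columns (the column names, values, color map, and output file name are hypothetical):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical numeric data: mileage falls as engine power rises.
df = pd.DataFrame({
    "engine_hp": [120, 150, 200, 300, 400],
    "city_mpg": [32, 28, 24, 18, 14],
    "highway_mpg": [40, 36, 31, 25, 20],
})

# corr() computes the pairwise Pearson correlation by default.
corr_matrix = df.corr()

# Draw the matrix as a heatmap; annot=True writes each coefficient in its cell,
# and vmin/vmax pin the color scale to the full correlation range.
fig, ax = plt.subplots()
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation matrix (made-up data)")
fig.savefig("correlation_heatmap.png")

print(corr_matrix)
```

In this made-up data, engine_hp is negatively correlated with both mileage columns, while the two mileage columns are positively correlated with each other.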
(TO BE COMPLETED BY STUDENTS)
Roll No. Name:
Class: Batch:
Date of Practical: Date of Submission:
Grade:
Perform exploratory data analysis (EDA) on Car dataset and write the inferences for each
question.
1. Read the Car.csv file into a DataFrame.
2. Explore size, shape, data types of each column in the dataset.
3. List down the columns of dataset
4. Display the info of dataset and state your observations.
5. Rename the column “Engine Fuel Type” to “Fuel Type” and “No. of Doors” to
“Doors”
6. Find out ‘Fuel Type’ for the 4th row.
7. Find out value of second column for the 4th row.
8. Select all rows for column “Fuel Type”
9. Select all rows for columns “Make” and “Transmission Type”
10. Display 1 to 5 rows for columns 2 to 4 (excluding row 5 and column 4)
11. Identify unique values for columns “Make”, “Transmission Type” and “Vehicle
Size”
12. Create a new data frame by replacing “??” with NaN
13. Check whether there are any duplicate values in the data set and delete those rows
14. Identify the total number of null values in each column of the data set
15. Drop rows with null values
16. Convert data types of columns “Doors”, “Engine HP” and “Engine Cylinder” to int
17. Check the numerical values in the data set and normalize if required.
18. Replace the categorical values in the “Vehicle Size” column with its corresponding
numeric value using label encoding and that of column “Transmission Type” by one
hot encoding method.
19. Identify the total number of cars that run on “diesel” and “electric”.
20. Find the mean of the “highway MPG” column for the cars that run on “diesel”
21. Plot a scatter plot of “highway MPG” vs “city mpg” using the color
and title options. Write your inference.
22. Plot a Correlation matrix and write your inference.