Practical 2 fKs4RPadH3

This document outlines a practical exercise aimed at understanding data exploration and visualization techniques using the Pandas library in Python. It covers handling missing data, encoding categorical variables, and utilizing various libraries for data visualization, including Matplotlib, Plotly, and Seaborn. The practical tasks involve performing exploratory data analysis on a car dataset, including data cleaning, transformation, and plotting results.

Uploaded by

spotifyuserforphone

PRACTICAL 2

Aim: To understand and implement data exploration and visualization techniques using
Pandas Library.

Prerequisite:
• Python programming and the pandas, NumPy, Matplotlib, and Seaborn libraries

Outcome: After successful completion of this experiment, students will be able to:
• Find null values and replace them.
• Handle duplicate values.
• Understand and implement class label encoding.
• Understand and implement one-hot encoding.
• Use different Python libraries for plotting data.
• Plot data using different types of plots.

Theory:
Missing data occurs when no information is provided for one or more items or for a
whole unit. It is a very common problem in real-life scenarios. In pandas, missing values
are also referred to as NA (Not Available) values. Many datasets simply arrive with missing
data, either because the data exists but was not collected or because it never existed.
For example, some users in a survey may choose not to share their income and others may
choose not to share their address, so those values are missing from the dataset.
In pandas, missing data is represented by two values:
• None: a Python singleton object that is often used for missing data in Python code.
• NaN: NaN (an acronym for Not a Number) is a special floating-point value defined by
the IEEE 754 floating-point standard.
Pandas treats None and NaN as essentially interchangeable for indicating missing or null
values. In keeping with this convention, it provides several useful functions for
detecting, removing, and replacing null values in a DataFrame:
• isnull()
• notnull()
• dropna()
• fillna()
• replace()
• interpolate()
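The detection and removal functions listed above can be tried on a small example. The sketch below is only illustrative: the `survey` DataFrame and its values are made up for this purpose and are not part of the practical's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: None and np.nan both mark missing entries.
survey = pd.DataFrame({
    "name": ["Asha", "Ben", None],
    "income": [52000.0, np.nan, 61000.0],
})

missing_mask = survey.isnull()               # True where a value is missing
present_mask = survey.notnull()              # exact opposite of isnull()
missing_per_column = survey.isnull().sum()   # number of missing values per column
rows_without_gaps = survey.dropna()          # keeps only rows with no missing value

print(missing_per_column)
print(rows_without_gaps)
```

Both the None in the "name" column and the NaN in the "income" column are reported as missing, which is the interchangeability described above.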

• To check for missing values in a pandas DataFrame, we use the functions isnull() and
notnull(). isnull() returns True where a value is missing (None or NaN), and notnull()
returns True where a value is present.
• The fillna() method replaces null values with a specified value. It returns a new
DataFrame object unless the inplace parameter is set to True, in which case the
replacement is done in the original DataFrame instead.

Syntax: dataframe.fillna(value, method, axis, inplace, limit, downcast)

Note: the method and downcast parameters are deprecated in recent pandas releases. For
forward or backward filling, use the ffill() or bfill() methods instead.
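A minimal sketch of fillna() follows, assuming a made-up `scores` DataFrame. It shows the default behaviour (a new DataFrame is returned), filling each column with its own value through a dict, and the effect of inplace=True.

```python
import numpy as np
import pandas as pd

# Hypothetical marks with one missing value per column.
scores = pd.DataFrame({
    "maths": [80.0, np.nan, 70.0],
    "physics": [np.nan, 65.0, 75.0],
})

# Default behaviour: a new DataFrame is returned and `scores` keeps its NaN values.
filled_zero = scores.fillna(0)

# A dict fills each column with its own value, here the column mean.
filled_mean = scores.fillna({
    "maths": scores["maths"].mean(),
    "physics": scores["physics"].mean(),
})

# inplace=True changes the object the method is called on.
scores_copy = scores.copy()
scores_copy.fillna(0, inplace=True)

print(filled_mean)
```

Filling with the mean is only one possible choice; the median or a fixed value may suit other columns better.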

• Find duplicate columns in a DataFrame: To find duplicate columns, a user-written
function iterates through all columns of the DataFrame and, for each column, checks
whether another column with the same contents already exists. If it does, that column
name is stored in a set of duplicate columns. At the end, the function returns the list
of duplicate column names.
• Remove duplicate rows: The pandas drop_duplicates() method removes duplicate rows
from a DataFrame.
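The column search described above might be sketched as follows. The helper name `find_duplicate_columns` and the small `cars` DataFrame are hypothetical, and the sketch assumes that column names are unique.

```python
import pandas as pd

def find_duplicate_columns(df):
    """Return the names of columns whose contents repeat an earlier column."""
    duplicates = set()
    columns = list(df.columns)
    for i, first in enumerate(columns):
        if first in duplicates:
            continue  # already known to be a copy of an earlier column
        for other in columns[i + 1:]:
            # Series.equals compares values and dtype; NaNs in the same
            # position are treated as equal.
            if df[first].equals(df[other]):
                duplicates.add(other)
    return sorted(duplicates)

# Hypothetical data: "brand" repeats "make", and rows 0 and 2 are identical.
cars = pd.DataFrame({
    "make": ["Audi", "BMW", "Audi"],
    "brand": ["Audi", "BMW", "Audi"],
    "doors": [4, 2, 4],
})

duplicate_columns = find_duplicate_columns(cars)
deduplicated = cars.drop_duplicates()   # keeps the first of each set of identical rows

print(duplicate_columns)
print(deduplicated)
```

Note that drop_duplicates() works on rows, while the helper works on columns, so the two address different kinds of duplication.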

Encoding: In many practical data science activities, the dataset contains categorical
variables. Many machine learning algorithms can support categorical values without further
manipulation, but many more cannot. The analyst is therefore faced with the challenge of
turning these text attributes into numerical values for further processing. The Python
libraries pandas and scikit-learn provide several approaches for transforming categorical
data into suitable numeric values.
Label encoding: Label encoding simply converts each value in a column to a number. It has
the advantage of being straightforward, but the disadvantage that an algorithm can
"misinterpret" the numeric values as having an order or magnitude. For example, the value
0 is obviously less than the value 4, but does that ordering really correspond to the
categories in real life?
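Two common ways of label encoding with pandas are sketched below. The category values ("Compact", "Midsize", "Large") and the chosen order are assumptions made for illustration; the actual values in the dataset may differ.

```python
import pandas as pd

# Hypothetical category values for the "Vehicle Size" column.
sizes = pd.DataFrame({"Vehicle Size": ["Compact", "Large", "Midsize", "Compact"]})

# Option 1: an explicit mapping, so the codes follow the real-world order.
size_order = {"Compact": 0, "Midsize": 1, "Large": 2}
sizes["size_code"] = sizes["Vehicle Size"].map(size_order)

# Option 2: let pandas assign the codes. Categories are sorted alphabetically,
# so the codes (Compact=0, Large=1, Midsize=2) do not reflect size.
sizes["auto_code"] = sizes["Vehicle Size"].astype("category").cat.codes

print(sizes)
```

The second option illustrates the risk mentioned above: "Large" receives a smaller code than "Midsize", which an algorithm could read as an ordering.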
One-hot encoding: The basic idea of one-hot encoding is to create new variables that take
the values 0 and 1 to represent the original categorical values. We use one-hot encoding
when the categorical feature is not ordinal, that is, when its values have no natural
order. This technique is very useful, but it can greatly expand the number of columns if
a column has very many unique values.
The pandas function get_dummies() is one of the easiest ways to implement one-hot
encoding, and it has several useful parameters, such as columns, prefix, and drop_first.
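A short sketch of get_dummies() is given below, assuming a made-up DataFrame with two transmission values. The prefix "trans" is an arbitrary choice for the example.

```python
import pandas as pd

# Hypothetical data; the real column may contain more categories.
cars = pd.DataFrame({
    "Transmission Type": ["MANUAL", "AUTOMATIC", "MANUAL"],
    "doors": [2, 4, 4],
})

# One new 0/1 column is created per category; the original column is dropped.
encoded = pd.get_dummies(cars, columns=["Transmission Type"], prefix="trans", dtype=int)

print(encoded)
```

Passing drop_first=True would drop one of the new columns, which is sometimes preferred because the remaining columns already determine it.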

Visualization:
Python is one of the most popular programming languages for data analytics as well as
data visualization. Several libraries are available that create attractive and complex
data visualizations. These libraries are popular because they let analysts and
statisticians build visualizations to their own specifications through a single,
convenient interface.
Matplotlib is a data visualization and 2-D plotting library for Python, and it is one of
the most popular and widely used plotting libraries in the Python community. It works
interactively across multiple platforms and can be used in Python scripts, the Python and
IPython shells, Jupyter notebooks, web application servers, etc. We can use Matplotlib to
create line plots, bar charts, pie charts, histograms, scatter plots, error charts, and
other charts.
Plotly is a free, open-source graphing library that can be used to create data
visualizations. Plotly (plotly.py) is built on top of the Plotly JavaScript library
(plotly.js) and can be used to create web-based data visualizations that can be displayed
in Jupyter notebooks, embedded in web applications using Dash, or saved as individual HTML
files. Plotly provides more than 40 unique chart types, such as scatter plots, histograms,
line charts, bar charts, pie charts, error bars, box plots, multiple axes, sparklines,
dendrograms, and 3-D charts.
Seaborn is a Python data visualization library that is based on Matplotlib and closely
integrated with NumPy and pandas data structures. Seaborn has various dataset-oriented
plotting functions that operate on DataFrames and arrays containing whole datasets. It
internally performs the necessary statistical aggregation and mapping to create
informative plots. It is a high-level interface for creating attractive and informative
statistical graphics that are integral to exploring and understanding data. Seaborn
graphics include bar charts, histograms, scatter plots, box plots, heatmaps, etc. Seaborn
has no dedicated pie chart function; pie charts are drawn with Matplotlib. Seaborn also
has various tools for choosing color palettes that can reveal patterns in the data.
Scatter Plot:
A scatter plot is a type of data visualization that shows the relationship between two
numerical variables. To draw a scatter plot with pandas, we use the plot member of the
DataFrame class. Calling the scatter() method on the plot member draws a plot of two
variables, that is, two columns of the DataFrame.

Syntax: DataFrame.plot.scatter(x, y, s=None, c=None)

Parameters:
x: column name to be used as horizontal coordinates for each point
y: column name to be used as vertical coordinates for each point
s: size of dots
c: color of dots
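The call can be sketched as follows, using a small made-up `mileage` DataFrame. The Agg backend and the output file name are assumptions that let the example run without a display; in a notebook the plot would simply be shown inline.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

# Hypothetical mileage figures, not taken from the practical's dataset.
mileage = pd.DataFrame({
    "city mpg": [18, 22, 25, 30, 35],
    "highway MPG": [26, 30, 33, 38, 42],
})

# s sets the dot size, c the dot color; title is passed on to the plot.
ax = mileage.plot.scatter(
    x="city mpg",
    y="highway MPG",
    s=40,
    c="tab:blue",
    title="Highway vs city mileage",
)
ax.figure.savefig("mpg_scatter.png")
```

The method returns a Matplotlib Axes object, so labels and limits can be adjusted afterwards with the usual Matplotlib calls.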

Correlation:
Given two variables, if the value of one variable depends on the value of the other, we
say the variables are related. The statistical measure of the relationship between two
variables is called "correlation".
Correlation means association; it is a measure of the extent to which two variables are
related.
1. Positive correlation: the two variables increase together and decrease together.
2. Negative correlation: one variable increases as the other decreases, and vice versa.
To find the correlation between variables, the corr() method of a DataFrame is used. By
default it computes the Pearson correlation coefficient, which ranges from -1 to 1. A
heatmap is an efficient way of plotting a correlation matrix. It shows the correlation for
every pair of variables, and the heatmap() function is provided by the Seaborn library.
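A minimal sketch of a correlation matrix and its heatmap is shown below. The `specs` DataFrame is made up so that the correlations are exactly +1 and -1; real data will give values in between. The Agg backend and the file name are assumptions for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Hypothetical, perfectly linear data chosen to make the result easy to read.
specs = pd.DataFrame({
    "engine_hp": [100, 150, 200, 250, 300],
    "city_mpg": [35, 30, 25, 20, 15],
    "highway_mpg": [45, 40, 35, 30, 25],
})

corr_matrix = specs.corr()   # Pearson correlation by default

# annot=True writes each coefficient in its cell; vmin/vmax fix the color scale.
ax = sns.heatmap(corr_matrix, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
ax.figure.savefig("correlation_heatmap.png")

print(corr_matrix)
```

Here engine power is negatively correlated with both mileage columns, while the two mileage columns are positively correlated with each other.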
(TO BE COMPLETED BY STUDENTS)

Roll No. Name:

Class: Batch:
Date of Practical: Date of Submission:
Grade:

Perform exploratory data analysis (EDA) on Car dataset and write the inferences for each
question.

1. Read the Car.csv file into a DataFrame.

2. Explore size, shape, data types of each column in the dataset.
3. List down the columns of dataset
4. Display the info of dataset and state your observations.
5. Rename the column “Engine Fuel Type” to “Fuel Type” and “No. of Doors” to
“Doors”
6. Find out ‘Fuel Type’ for the 4th row.
7. Find out value of second column for the 4th row.
8. Select all rows for column “Fuel Type”
9. Select all rows for columns “Make” and “Transmission Type”
10. Display 1 to 5 rows for columns 2 to 4 (excluding row 5 and column 4)
11. Identify unique values for columns “Make”, “Transmission Type” and “Vehicle
Size”
12. Create a new DataFrame by replacing “??” with NaN
13. Check whether there are any duplicate values in the data set and delete those rows
14. Identify the total number of null values in each column of the data set
15. Drop rows with null values
16. Convert data types of columns “Doors”, “Engine HP” and “Engine Cylinder” to int
17. Check the numerical values in the data set and normalize if required.
18. Replace the categorical values in the “Vehicle Size” column with its corresponding
numeric value using label encoding and that of column “Transmission Type” by one
hot encoding method.
19. Identify the total number of cars that run on “diesel” and “electric”.
20. Identify the mean of the “highway MPG” column for the cars that run on “diesel”
21. Plot a scatter plot for columns “highway MPG” vs “city mpg” using the color
and title options. Write your inference.
22. Plot a Correlation matrix and write your inference.
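
A few of the operations named in the tasks above (selection by label and by position, replacing a placeholder with NaN, a filtered mean, and min-max normalization) are sketched below on a small made-up DataFrame. This is an illustration only, not a model answer: the values are hypothetical stand-ins for Car.csv, and min-max scaling is just one possible way to normalize.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Car.csv with a "??" placeholder in one cell.
cars = pd.DataFrame({
    "Make": ["BMW", "Audi", "FIAT", "Tesla", "Volvo"],
    "Fuel Type": ["diesel", "??", "regular unleaded", "electric", "diesel"],
    "Engine HP": [300.0, 220.0, 160.0, 400.0, 250.0],
    "highway MPG": [36, 30, 34, 100, 40],
})

fuel_fourth_row = cars.loc[3, "Fuel Type"]   # label-based: the 4th row has label 3
second_col_fourth_row = cars.iloc[3, 1]      # position-based: row 4, column 2
block = cars.iloc[0:4, 1:3]                  # rows 1-4, columns 2-3; end positions excluded

cleaned = cars.replace("??", np.nan)         # returns a new DataFrame; `cars` is unchanged
is_diesel = cleaned["Fuel Type"] == "diesel"
diesel_mean_mpg = cleaned.loc[is_diesel, "highway MPG"].mean()

# Min-max normalization rescales a numeric column to the range [0, 1].
hp = cleaned["Engine HP"]
cleaned["Engine HP scaled"] = (hp - hp.min()) / (hp.max() - hp.min())

print(block)
print(diesel_mean_mpg)
```

With the real dataset, the same calls apply after reading the file with pd.read_csv().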
