DSBDA1

The document outlines the process of importing libraries, loading, and preprocessing an automobile dataset from Kaggle, which includes various car characteristics. It details steps such as checking for missing values, obtaining descriptive statistics, and converting categorical variables into quantitative formats. The final cleaned dataset is saved as 'cleaned_autodata.csv' for further analysis.

Uploaded by

naitikpawar22

In [28]: # 1. Import all the required Python Libraries

import pandas as pd
import numpy as np

In [29]: # 2. Dataset Description

description = """
Dataset Name: Automobile Dataset
Source: https://www.kaggle.com/datasets/toramky/automobile-dataset

This dataset contains various characteristics of cars such as fuel type, number of doors,
engine size, horsepower, and more. It is commonly used for machine learning projects
like regression, classification, and price prediction.

You uploaded the dataset locally as 'autodata.csv'.

"""
print(description)

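Dataset Name: Automobile Dataset
Source: https://www.kaggle.com/datasets/toramky/automobile-dataset

This dataset contains various characteristics of cars such as fuel type, number of doors,
engine size, horsepower, and more. It is commonly used for machine learning projects
like regression, classification, and price prediction.

You uploaded the dataset locally as 'autodata.csv'.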

In [31]: # 3. Load the Dataset into pandas dataframe

df = pd.read_csv("C:/Users/prajw/Desktop/Indexs/DSBDA print/GROUP A/Assignment 1 (Data Wranglin I)/autodata.csv")
df.head()

Out[31]:
   Unnamed: 0  symboling  normalized-losses         make  aspiration  num-of-doors   body-style  drive-wheels  engine-location  wheel-base  ...  compression-ratio  horsepower  peak-rpm
0           0          3                122  alfa-romero         std           two  convertible           rwd            front        88.6  ...                9.0       111.0      500…
1           1          3                122  alfa-romero         std           two  convertible           rwd            front        88.6  ...                9.0       111.0      500…
2           2          1                122  alfa-romero         std           two    hatchback           rwd            front        94.5  ...                9.0       154.0      500…
3           3          2                164         audi         std          four        sedan           fwd            front        99.8  ...               10.0       102.0      550…
4           4          2                164         audi         std          four        sedan           4wd            front        99.4  ...                8.0       115.0      550…

5 rows × 30 columns
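A minimal sketch of a more defensive load, assuming a raw copy of the file marks missing entries with a placeholder such as "?" (an assumption; the preview above does not show one). The inline CSV text and its values are made up for illustration and only stand in for autodata.csv; `na_values` is a standard `pd.read_csv` parameter.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for autodata.csv
sample_csv = io.StringIO(
    "make,horsepower,price\n"
    "audi,102,13950\n"
    "bmw,?,16430\n"
    "mazda,68,?\n"
)

# Treat "?" as missing while reading, so numeric columns are parsed as numbers
sample = pd.read_csv(sample_csv, na_values="?")

print(sample.dtypes)
print(sample.isnull().sum())
```

Handling the placeholder at read time means a later `pd.to_numeric(..., errors='coerce')` pass has nothing left to convert.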

 

In [ ]: # 4. Data Preprocessing

In [33]: # Check for missing values

print("Missing values per column:\n", df.isnull().sum())

Missing values per column:
Unnamed: 0 0
symboling 0
normalized-losses 0
make 0
aspiration 0
num-of-doors 0
body-style 0
drive-wheels 0
engine-location 0
wheel-base 0
length 0
width 0
height 0
curb-weight 0
engine-type 0
num-of-cylinders 0
engine-size 0
fuel-system 0
bore 0
stroke 4
compression-ratio 0
horsepower 2
peak-rpm 2
city-mpg 0
highway-mpg 0
price 0
city-L/100km 0
horsepower-binned 2
diesel 0
gas 0
dtype: int64
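With 30 columns, most of the counts above are zero. A small sketch of a more compact summary, assuming a made-up frame named `demo` in place of `df`: it reports the count and percentage of gaps and keeps only the columns that have any.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few gaps, standing in for df
demo = pd.DataFrame({
    "stroke": [2.68, np.nan, 3.40, 3.40],
    "horsepower": [111.0, 154.0, np.nan, 115.0],
    "price": [13495.0, 16500.0, 13950.0, 17450.0],
})

missing_count = demo.isnull().sum()
missing_pct = demo.isnull().mean() * 100  # share of rows that are missing

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
# Keep only the columns that actually have gaps, largest share first
summary = summary[summary["missing"] > 0].sort_values("percent", ascending=False)
print(summary)
```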

In [34]: # Get initial statistics

print("\nDescriptive statistics:\n", df.describe(include="all"))

Descriptive statistics:
Unnamed: 0 symboling normalized-losses make aspiration \
count 201.000000 201.000000 201.00000 201 201
unique NaN NaN NaN 22 2
top NaN NaN NaN toyota std
freq NaN NaN NaN 32 165
mean 100.000000 0.840796 122.00000 NaN NaN
std 58.167861 1.254802 31.99625 NaN NaN
min 0.000000 -2.000000 65.00000 NaN NaN
25% 50.000000 0.000000 101.00000 NaN NaN
50% 100.000000 1.000000 122.00000 NaN NaN
75% 150.000000 2.000000 137.00000 NaN NaN
max 200.000000 3.000000 256.00000 NaN NaN

num-of-doors body-style drive-wheels engine-location wheel-base ... \

count 201 201 201 201 201.000000 ...
unique 2 5 3 2 NaN ...
top four sedan fwd front NaN ...
freq 115 94 118 198 NaN ...
mean NaN NaN NaN NaN 98.797015 ...
std NaN NaN NaN NaN 6.066366 ...
min NaN NaN NaN NaN 86.600000 ...
25% NaN NaN NaN NaN 94.500000 ...
50% NaN NaN NaN NaN 97.000000 ...
75% NaN NaN NaN NaN 102.400000 ...
max NaN NaN NaN NaN 120.900000 ...

compression-ratio horsepower peak-rpm city-mpg highway-mpg \

count 201.000000 199.000000 199.000000 201.000000 201.000000
unique NaN NaN NaN NaN NaN
top NaN NaN NaN NaN NaN
freq NaN NaN NaN NaN NaN
mean 10.164279 103.396985 5117.587940 25.179104 30.686567
std 4.004965 37.553843 480.521824 6.423220 6.815150
min 7.000000 48.000000 4150.000000 13.000000 16.000000
25% 8.600000 70.000000 4800.000000 19.000000 25.000000
50% 9.000000 95.000000 5200.000000 24.000000 30.000000
75% 9.400000 116.000000 5500.000000 30.000000 34.000000
max 23.000000 262.000000 6600.000000 49.000000 54.000000

price city-L/100km horsepower-binned diesel gas

count 201.000000 201.000000 199 201.000000 201.000000
unique NaN NaN 3 NaN NaN
top NaN NaN Low NaN NaN
freq NaN NaN 151 NaN NaN
mean 13207.129353 9.944145 NaN 0.099502 0.900498
std 7947.066342 2.534599 NaN 0.300083 0.300083
min 5118.000000 4.795918 NaN 0.000000 0.000000
25% 7775.000000 7.833333 NaN 0.000000 1.000000
50% 10295.000000 9.791667 NaN 0.000000 1.000000
75% 16500.000000 12.368421 NaN 0.000000 1.000000
max 45400.000000 18.076923 NaN 1.000000 1.000000

[11 rows x 30 columns]
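Because `include="all"` mixes numeric and text columns, roughly half the cells in the table above are NaN. One possible alternative, sketched on a made-up frame named `demo`, is to describe the two groups separately with the `include` and `exclude` parameters of `describe`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with both numeric and text columns
demo = pd.DataFrame({
    "make": ["audi", "audi", "bmw", "mazda"],
    "body-style": ["sedan", "sedan", "sedan", "hatchback"],
    "price": [13950.0, 17450.0, 16430.0, 5195.0],
    "city-mpg": [24, 18, 23, 30],
})

# count / mean / std / quartiles for numeric columns only
numeric_stats = demo.describe(include=[np.number])

# count / unique / top / freq for everything that is not numeric
categorical_stats = demo.describe(exclude=[np.number])

print(numeric_stats)
print(categorical_stats)
```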

In [35]: # Check dimensions

print("\nShape of the dataset:", df.shape)

Shape of the dataset: (201, 30)

In [36]: # Column names

print("\nColumns in dataset:\n", df.columns)

Columns in dataset:
Index(['Unnamed: 0', 'symboling', 'normalized-losses', 'make', 'aspiration',
'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',
'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',
'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
'highway-mpg', 'price', 'city-L/100km', 'horsepower-binned', 'diesel',
'gas'],
dtype='object')
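The first column, 'Unnamed: 0', runs from 0 to 200 over 201 rows in the statistics above. This is consistent with a row index that was written into the CSV and read back as data, although the notebook does not confirm it. A sketch of two ways to handle such a column, using made-up CSV text:

```python
import io
import pandas as pd

# Hypothetical CSV text that was saved with its row index (empty first header)
csv_text = ",symboling,make\n0,3,alfa-romero\n1,3,alfa-romero\n2,1,alfa-romero\n"

as_loaded = pd.read_csv(io.StringIO(csv_text))
print(as_loaded.columns.tolist())  # the unnamed index shows up as a data column

# Option 1: drop the leftover column after loading
dropped = as_loaded.drop(columns=["Unnamed: 0"])

# Option 2: tell read_csv that the first column is the index
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(dropped.columns.tolist())
print(indexed.columns.tolist())
```

Removing the column before encoding keeps a meaningless counter out of any later model features.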

In [39]: # 5. Data Formatting and Normalization

# Check data types of each column

print("\nData types before conversion:\n", df.dtypes)

# Example conversions: ensure numeric columns are numeric

# (force conversion if needed using errors='coerce')
df["price"] = pd.to_numeric(df["price"], errors='coerce')
df["horsepower"] = pd.to_numeric(df["horsepower"], errors='coerce')
df["peak-rpm"] = pd.to_numeric(df["peak-rpm"], errors='coerce')

# Drop rows where the target column (price) is missing; filling the target
# with its own mean first would leave nothing for this step to drop
df.dropna(subset=["price"], axis=0, inplace=True)

# Fill missing values in numeric feature columns with the column mean
df["horsepower"] = df["horsepower"].fillna(df["horsepower"].mean())
df["peak-rpm"] = df["peak-rpm"].fillna(df["peak-rpm"].mean())

# Fill missing values in 'num-of-doors' with the mode (most frequent value)
df["num-of-doors"] = df["num-of-doors"].fillna(df["num-of-doors"].mode()[0])

print("\nData types after conversion:\n", df.dtypes)

Data types before conversion:

Unnamed: 0 int64
symboling int64
normalized-losses int64
make object
aspiration object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore float64
stroke float64
compression-ratio float64
horsepower float64
peak-rpm float64
city-mpg int64
highway-mpg int64
price float64
city-L/100km float64
horsepower-binned object
diesel int64
gas int64
dtype: object

Data types after conversion:
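The cell above fills horsepower and peak-rpm, but the earlier missing-value count also lists 4 gaps in stroke, and the cell title mentions normalization without applying any. A minimal sketch of both steps, assuming mean imputation and min-max scaling are acceptable choices here (the notebook does not say which scaling was intended); `demo` and its values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for df
demo = pd.DataFrame({
    "stroke": [2.68, np.nan, 3.40, 3.19],
    "horsepower": [111.0, 154.0, 102.0, np.nan],
    "make": ["alfa-romero", "alfa-romero", "audi", "audi"],
})

# Numeric columns only; text columns are left untouched
numeric_cols = demo.select_dtypes(include=[np.number]).columns

# Replace each numeric gap with that column's mean
demo[numeric_cols] = demo[numeric_cols].fillna(demo[numeric_cols].mean())

# Min-max scaling: (x - min) / (max - min) maps each column onto [0, 1].
# A constant column would divide by zero and produce NaN, so check for that first.
col_min = demo[numeric_cols].min()
col_range = demo[numeric_cols].max() - col_min
scaled = demo.copy()
scaled[numeric_cols] = (demo[numeric_cols] - col_min) / col_range

print(scaled)
```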

In [40]: # 6. Turn categorical variables into quantitative variables

# Select object (categorical) columns

categorical_columns = df.select_dtypes(include=['object']).columns
print("Categorical columns:\n", categorical_columns)

# One-hot encode categorical columns

df_encoded = pd.get_dummies(df, columns=categorical_columns)
df_encoded.head()

Categorical columns:
Index(['make', 'aspiration', 'num-of-doors', 'body-style', 'drive-wheels',
'engine-location', 'engine-type', 'num-of-cylinders', 'fuel-system',
'horsepower-binned'],
dtype='object')
Out[40]:
   Unnamed: 0  symboling  normalized-losses  wheel-base    length     width  height  curb-weight  engine-size  bore  ...  fuel-system_2bbl  fuel-system_4bbl  fuel-system_id…
0           0          3                122        88.6  0.811148  0.890278    48.8         2548          130  3.47  ...             False             False            False
1           1          3                122        88.6  0.811148  0.890278    48.8         2548          130  3.47  ...             False             False            False
2           2          1                122        94.5  0.822681  0.909722    52.4         2823          152  2.68  ...             False             False            False
3           3          2                164        99.8  0.848630  0.919444    54.3         2337          109  3.19  ...             False             False            False
4           4          2                164        99.4  0.848630  0.922222    54.3         2824          136  3.19  ...             False             False            False

5 rows × 80 columns
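The encoded columns above hold True/False values, and horsepower-binned still had 2 missing entries in the earlier count. With the default settings, a row whose category is missing gets False in every dummy column of that group. A sketch of two `pd.get_dummies` options that address both points, on a made-up frame named `demo`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing category
demo = pd.DataFrame({
    "price": [13495.0, 16500.0, 13950.0],
    "horsepower-binned": ["Low", "Medium", np.nan],
})

# dtype=int gives 0/1 columns instead of True/False;
# dummy_na=True adds an explicit indicator column for missing categories
encoded = pd.get_dummies(
    demo, columns=["horsepower-binned"], dtype=int, dummy_na=True
)
print(encoded)
```

Whether to keep a separate missing-value indicator or to impute the category beforehand depends on the later analysis; the notebook itself uses the defaults.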

 

In [41]: # Save final cleaned and encoded dataset (optional)

df_encoded.to_csv("cleaned_autodata.csv", index=False)
print("✅ Cleaned dataset saved as 'cleaned_autodata.csv'")

✅ Cleaned dataset saved as 'cleaned_autodata.csv'
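A quick round-trip check can confirm that the saved file reads back as expected. The sketch below assumes a small made-up frame named `encoded` and writes to a temporary folder rather than the working directory.

```python
import os
import tempfile
import pandas as pd

# Hypothetical encoded frame standing in for df_encoded
encoded = pd.DataFrame({
    "price": [13495.0, 16500.0],
    "make_audi": [False, True],
    "make_bmw": [True, False],
})

# Write to a temporary folder so the sketch does not touch real files
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_autodata.csv")
    encoded.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# index=False means no extra index column is written to the file
print(reloaded.shape == encoded.shape)
print(list(reloaded.columns) == list(encoded.columns))
print(int(reloaded.isnull().sum().sum()))
```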
