Enda Practical 3 Explanation One

This script performs data preprocessing by importing necessary libraries, loading a CSV file into a pandas DataFrame, and displaying various sections of the dataset. It handles missing values and duplicates, encodes categorical variables into numerical values, standardizes numerical data, and splits the dataset into training and testing sets. The process concludes with a completion message indicating that data preprocessing is finished.


Let's break this down step by step:

---

## **1. Importing Necessary Libraries**

```python
import pandas as pd  # Importing pandas for data manipulation
import numpy as np  # Importing numpy for numerical computations
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
```

- **pandas (`pd`)**: A library for data manipulation and analysis. It provides data structures such as `DataFrame` and `Series` to store and process tabular data.

- **numpy (`np`)**: A library for numerical computations, providing support for arrays, matrices, and mathematical functions.

- **sklearn.preprocessing**:

  - `StandardScaler`: Standardizes data by removing the mean and scaling to unit variance.
  - `MinMaxScaler`: Scales data to a fixed range, usually [0, 1].
  - `LabelEncoder`: Converts categorical labels into numeric values.
  - `OneHotEncoder`: Encodes categorical variables as binary vectors.

- **sklearn.model_selection**:
  - `train_test_split`: Splits data into training and testing sets.
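
The difference between the two scalers imported above can be seen on a toy array (made-up values, not the practical's dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

data = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # toy single-feature column

standardized = StandardScaler().fit_transform(data)  # mean 0, unit variance
minmax = MinMaxScaler().fit_transform(data)          # rescaled into [0, 1]

print(standardized.ravel())  # centered values, roughly -1.41 ... 1.41
print(minmax.ravel())        # 0, 0.25, 0.5, 0.75, 1
```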

---

## **2. Loading the Dataset**

```python
df = pd.read_csv('/content/food-price-index-september-2023-weighted-average-prices.csv')
```

- Reads a CSV (Comma-Separated Values) file into a pandas `DataFrame`.

- `df` is now a tabular dataset.

### **Displaying Data**

```python
print("Original Data:")
print(df.head())    # Displays the first 5 rows
print(df.head())    # Again displays the first 5 rows (redundant)
print(df.tail())    # Displays the last 5 rows
print(df.tail(15))  # Displays the last 15 rows
print(df.head(30))  # Displays the first 30 rows
print(df.head(15))  # Displays the first 15 rows
```

- `df.head(n)`: Shows the first `n` rows (default: 5).


- `df.tail(n)`: Shows the last `n` rows (default: 5).

- The repeated `head()`/`tail()` calls are redundant; one of each would suffice, so they are likely unintentional.

---

## **3. Handling Missing Values**

```python
print("\nChecking for missing values:")
print(df.isnull().sum())  # Count missing values per column
df = df.dropna()          # Drop rows with missing values
```

- `df.isnull().sum()`: Checks how many missing values each column has.

- `df.dropna()`: Removes rows with missing values.

_(Alternative: `df.fillna(value)` fills missing values with a specified value.)_
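
The two strategies can be contrasted on a hypothetical toy frame (column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in each column
toy = pd.DataFrame({"price": [1.5, np.nan, 2.0],
                    "item": ["milk", "bread", None]})

dropped = toy.dropna()  # keeps only rows with no missing values
filled = toy.fillna({"price": toy["price"].mean(),  # fill numeric gap with the column mean
                     "item": "unknown"})            # fill text gap with a placeholder

print(len(dropped))                 # 1 row survives dropna
print(filled.isnull().sum().sum())  # 0 missing values remain after fillna
```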

---

## **4. Handling Duplicates**

```python
print("\nChecking for duplicates:")
print(df.duplicated().sum())  # Count duplicate rows
df = df.drop_duplicates()     # Remove duplicate rows
```
- `df.duplicated().sum()`: Counts the number of duplicate rows.

- `df.drop_duplicates()`: Removes duplicate rows.
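
The same two calls can be checked on a toy frame with one deliberate duplicate row (made-up data):

```python
import pandas as pd

# Toy frame where the first two rows are identical
toy = pd.DataFrame({"item": ["milk", "milk", "bread"],
                    "price": [1.5, 1.5, 2.0]})

print(toy.duplicated().sum())    # 1 duplicate row detected
deduped = toy.drop_duplicates()  # the first occurrence is kept
print(len(deduped))              # 2 rows remain
```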

---

## **5. Encoding Categorical Variables**

```python
print("\nEncoding categorical variables:")
categorical_cols = df.select_dtypes(include=['object']).columns  # Select categorical columns
label_encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])  # Apply label encoding
    label_encoders[col] = le             # Keep the encoder for later inverse transforms
print(df.head())  # Display the transformed dataset
```

- `df.select_dtypes(include=['object']).columns`: Finds all categorical columns.

- **Label Encoding**:
  - Converts categorical data (text) into numbers.
  - Example: `['Apple', 'Banana', 'Cherry'] → [0, 1, 2]`.
  - The encoded values replace the original categorical values.
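
The fruit example above can be reproduced directly; note that `LabelEncoder` assigns codes in alphabetical order of the classes:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["Apple", "Banana", "Cherry", "Banana"])

print(list(codes))                         # [0, 1, 2, 1] -- classes sorted alphabetically
print(list(le.inverse_transform([2, 0])))  # ['Cherry', 'Apple'] -- decoding back to labels
```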


---

## **6. Feature Scaling**

```python
print("\nApplying feature scaling:")
numeric_cols = df.select_dtypes(include=[np.number]).columns  # Select numeric columns
scaler = StandardScaler()  # Initialize the standard scaler
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])  # Standardize numerical features
print(df.head())  # Display the scaled dataset
```

- `df.select_dtypes(include=[np.number]).columns`: Finds all numeric columns.

- `StandardScaler()`:
  - Standardizes data: `(value - mean) / standard deviation`
  - Ensures all features have a mean of 0 and standard deviation of 1.
- **Effect**: Prevents large numerical values from dominating smaller ones.
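
The formula can be verified by hand on a toy array; `StandardScaler` uses the population standard deviation, which matches NumPy's default `std()`:

```python
import numpy as np

values = np.array([10.0, 20.0, 30.0, 40.0])  # toy feature column
standardized = (values - values.mean()) / values.std()  # (value - mean) / std

print(standardized.mean())  # ~0 (centered)
print(standardized.std())   # ~1 (unit variance)
```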

---

## **7. Splitting Dataset into Training and Testing Sets**

```python
print("\nSplitting dataset into training and testing sets:")
X = df.drop(columns=['Series_reference'])  # Features (assuming 'Series_reference' is the target)
y = df['Series_reference']                 # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set size: {X_train.shape}, Testing set size: {X_test.shape}")
```

- `X = df.drop(columns=['Series_reference'])`: Features (independent variables).

- `y = df['Series_reference']`: Target variable (dependent variable).

- `train_test_split(X, y, test_size=0.2, random_state=42)`:
  - Splits the dataset into:
    - `80%` training (`X_train`, `y_train`)
    - `20%` testing (`X_test`, `y_test`)
  - `random_state=42` ensures reproducibility.
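
A toy split with ten samples illustrates the 80/20 sizes (made-up arrays, not the practical's features):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 toy samples with 2 features each
y = np.arange(10)                 # 10 toy targets

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2) -- 80% train, 20% test
```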

---

## **8. Completion Message**

```python

print("\nData Preprocessing Completed!")

```

- Confirms that data preprocessing is finished.

---
## **Summary**

### **What does this script do?**

1. **Imports necessary libraries** for data manipulation, encoding, scaling, and splitting.

2. **Loads a CSV file** into a `pandas` DataFrame.

3. **Displays different sections** of the dataset.

4. **Handles missing values** by removing rows with `NaN`s.

5. **Removes duplicate rows** to avoid redundant data.

6. **Encodes categorical variables** into numerical values.

7. **Standardizes numerical data** for consistency.

8. **Splits the dataset** into training and testing sets.
