Let's break this down step by step:
---
## **1. Importing Necessary Libraries**
```python
import pandas as pd # Importing pandas for data manipulation
import numpy as np # Importing numpy for numerical computations
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
```
- **pandas (`pd`)**: A library for data manipulation and analysis. It provides data
structures such as `DataFrame` and `Series` to store and process tabular data.
- **numpy (`np`)**: A library for numerical computations, providing support for arrays,
matrices, and mathematical functions.
- **sklearn.preprocessing**:
- `StandardScaler`: Standardizes data by removing the mean and scaling to unit
variance.
- `MinMaxScaler`: Scales data to a fixed range, usually [0,1].
- `LabelEncoder`: Converts categorical labels into numeric values.
- `OneHotEncoder`: Encodes categorical variables as binary vectors.
- **sklearn.model_selection**:
- `train_test_split`: Splits data into training and testing sets.
---
## **2. Loading the Dataset**
```python
df = pd.read_csv('/content/food-price-index-september-2023-weighted-average-prices.csv')
```
- Reads a CSV (Comma-Separated Values) file into a pandas `DataFrame`.
- `df` is now a tabular dataset.
### **Displaying Data**
```python
print("Original Data:")
print(df.head()) # Displays the first 5 rows
print(df.head()) # Again displays the first 5 rows
print(df.tail()) # Displays the last 5 rows
print(df.tail(15)) # Displays the last 15 rows
print(df.head(30)) # Displays the first 30 rows
print(df.head(15)) # Displays the first 15 rows
```
- `df.head(n)`: Shows the first `n` rows (default: 5).
- `df.tail(n)`: Shows the last `n` rows (default: 5).
- The repeated `head()`/`tail()` calls print overlapping views of the data and are likely unintentional.
---
## **3. Handling Missing Values**
```python
print("\nChecking for missing values:")
print(df.isnull().sum()) # Count missing values per column
df = df.dropna() # Drops rows with missing values
```
- `df.isnull().sum()`: Checks how many missing values each column has.
- `df.dropna()`: Removes rows with missing values.
_(Alternative: `df.fillna(value)` fills missing values with a specified value instead of dropping rows; a sketch follows.)_
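A minimal sketch of the `fillna` route, assuming you would rather keep rows than drop them; the median/`'Unknown'` fill choices here are illustrative, not part of the original script:
```python
# Hedged alternative to dropna(): fill numeric gaps with the column median
# and text gaps with a placeholder label (both choices are assumptions).
df_filled = df.copy()
for col in df_filled.columns:
    if pd.api.types.is_numeric_dtype(df_filled[col]):
        df_filled[col] = df_filled[col].fillna(df_filled[col].median())
    else:
        df_filled[col] = df_filled[col].fillna('Unknown')
```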
---
## **4. Handling Duplicates**
```python
print("\nChecking for duplicates:")
print(df.duplicated().sum()) # Counts duplicate rows
df = df.drop_duplicates() # Removes duplicate rows
```
- `df.duplicated().sum()`: Counts the number of duplicate rows.
- `df.drop_duplicates()`: Removes duplicate rows.
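A tiny self-contained demo of how these two calls behave; by default, `drop_duplicates()` keeps the first occurrence of each repeated row:
```python
import pandas as pd

demo = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})
print(demo.duplicated().sum())  # prints 1; the second row repeats the first
print(demo.drop_duplicates())   # keeps rows 0 and 2
```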
---
## **5. Encoding Categorical Variables**
```python
print("\nEncoding categorical variables:")
categorical_cols = df.select_dtypes(include=['object']).columns  # Select categorical columns
label_encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])  # Apply label encoding to this column
    label_encoders[col] = le  # Keep the fitted encoder to reverse the mapping later
print(df.head()) # Display transformed dataset
```
- `df.select_dtypes(include=['object']).columns`: Finds all categorical columns.
- **Label Encoding**:
- Converts categorical data (text) into numbers.
- Example: `['Apple', 'Banana', 'Cherry'] → [0, 1, 2]`.
- The encoded values replace the original categorical values.
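Note that `OneHotEncoder` is imported but never used. Unlike label encoding, one-hot encoding avoids implying an artificial order between categories (0 < 1 < 2). A minimal sketch of swapping it in, assuming scikit-learn >= 1.2 for the `sparse_output` argument (older versions call it `sparse`); the `fruit` column is hypothetical:
```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

fruits = pd.DataFrame({'fruit': ['Apple', 'Banana', 'Cherry']})
ohe = OneHotEncoder(sparse_output=False)        # return a dense array
encoded = ohe.fit_transform(fruits[['fruit']])  # one binary column per category
print(ohe.get_feature_names_out())  # ['fruit_Apple' 'fruit_Banana' 'fruit_Cherry']
print(encoded)                      # 3x3 binary matrix, a single 1 per row
```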
---
## **6. Feature Scaling**
```python
print("\nApplying feature scaling:")
numeric_cols = df.select_dtypes(include=[np.number]).columns  # Select numeric columns
scaler = StandardScaler()  # Initialize the standard scaler
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])  # Standardize numeric features
print(df.head()) # Display scaled dataset
```
- `df.select_dtypes(include=[np.number]).columns`: Finds all numeric columns.
- `StandardScaler()`:
- Standardizes data: `(value - mean) / standard deviation`
- Ensures all features have a mean of 0 and standard deviation of 1.
- **Effect**: Prevents features with large numeric ranges from dominating features with smaller ranges (see the `MinMaxScaler` comparison below).
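`MinMaxScaler` is the other imported-but-unused class; a minimal side-by-side sketch on a toy column shows how the two transforms differ:
```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [4.0]])
print(StandardScaler().fit_transform(x).ravel())  # mean 0, unit variance
print(MinMaxScaler().fit_transform(x).ravel())    # [0.0, 0.333..., 1.0], all in [0, 1]
```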
---
## **7. Splitting Dataset into Training and Testing Sets**
```python
print("\nSplitting dataset into training and testing sets:")
X = df.drop(columns=['Series_reference'])  # Assuming 'Series_reference' is the target variable
y = df['Series_reference']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set size: {X_train.shape}, Testing set size: {X_test.shape}")
```
- `X = df.drop(columns=['Series_reference'])`: Features (independent variables).
- `y = df['Series_reference']`: Target variable (dependent variable).
- `train_test_split(X, y, test_size=0.2, random_state=42)`:
- Splits the dataset into:
- `80%` training (`X_train`, `y_train`)
- `20%` testing (`X_test`, `y_test`)
- `random_state=42` ensures reproducibility.
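One caveat: the script fits the scaler on the full dataset *before* splitting, so test-set statistics leak into training. A common refinement (a sketch, not part of the original script) is to split first and fit the scaler on the training portion only:
```python
# Sketch: split first, then fit the scaler on the training data only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)    # statistics come from X_train only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # test set reuses training statistics
```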
---
## **8. Completion Message**
```python
print("\nData Preprocessing Completed!")
```
- Confirms that data preprocessing is finished.
---
## **Summary**
### **What does this script do?**
1. **Imports necessary libraries** for data manipulation, encoding, scaling, and splitting.
2. **Loads a CSV file** into a `pandas` DataFrame.
3. **Displays different sections** of the dataset.
4. **Handles missing values** by removing rows with `NaN`s.
5. **Removes duplicate rows** to avoid redundant data.
6. **Encodes categorical variables** into numerical values.
7. **Standardizes numerical data** for consistency.
8. **Splits the dataset** into training and testing sets.