Nims Institute of Engineering & Technology
Nims University Rajasthan, Jaipur
LAB MANUAL
FOR
Machine Learning (CSC601B)
NIMS UNIVERSITY RAJASTHAN, JAIPUR
Jaipur-Delhi Highway
Jaipur - 303121, Rajasthan, India
Website: www.nimsuniversity.org
Contents
Syllabus
MACHINE LEARNING LAB
Practical 1: Predict housing prices based on features like area, number of bedrooms, and location using linear regression.
Practical 2: Classify a dataset using a k-Nearest Neighbors (kNN) classifier.
Practical 3: Implement a decision tree algorithm to classify email spam based on keywords and sender information.
Practical 4: Cluster customer data based on purchase history using k-means clustering.
Practical 5: Predict future stock prices using a time series forecasting model (e.g., ARIMA).
Practical 6: Develop a sentiment analysis model to classify movie reviews as positive, negative, or neutral.
Practical 7: Explore dimensionality reduction techniques like Principal Component Analysis (PCA) to visualize high-dimensional data.
Practical 8: Train a support vector machine (SVM) to classify data.
Syllabus
MACHINE LEARNING LAB
Course Objectives:
1. Understand the concept of learning in computer science.
2. Compare and contrast different paradigms for learning (supervised, unsupervised, etc.).
3. Design experiments to evaluate and compare different machine learning techniques on
real-world problems.
Experiments
1. Predict housing prices based on features like area, number of bedrooms, and
location using linear regression.
2. Classify a dataset using a k-Nearest Neighbors (kNN) classifier.
3. Implement a decision tree algorithm to classify email spam based on keywords
and sender information.
4. Cluster customer data based on purchase history using k-means clustering.
5. Predict future stock prices using a time series forecasting model (e.g., ARIMA).
6. Develop a sentiment analysis model to classify movie reviews as positive,
negative, or neutral.
7. Explore dimensionality reduction techniques like Principal Component Analysis
(PCA) to visualize high-dimensional data.
8. Train a support vector machine (SVM) to classify data.
Course Outcomes:
1. Implement and analyse existing learning algorithms, including well-studied methods for
classification, regression, and clustering.
2. Apply evaluation metrics for various algorithms.
3. Identify and implement solutions to real-world problems.
Practical 1
Aim: 1. Predict housing prices based on features like area, number of
bedrooms, and location using linear regression.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Create a small dataset
data = {
    "Area": [1500, 1800, 2400, 3000, 3500],
    "Bedrooms": [3, 4, 3, 5, 4],
    "Location": [1, 2, 1, 3, 2],  # Encoding for location (e.g., 1: City A, 2: City B, 3: City C)
    "Price": [300000, 400000, 350000, 500000, 450000]
}
# Convert to DataFrame
df = pd.DataFrame(data)
# Features and target
X = df[["Area", "Bedrooms", "Location"]]
y = df["Price"]
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Model evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Model Coefficients:", model.coef_)
print("Model Intercept:", model.intercept_)
print("Mean Squared Error:", mse)
print("R2 Score:", r2)
# Visualize actual vs predicted
plt.scatter(range(len(y_test)), y_test, color="blue", label="Actual Prices")
plt.scatter(range(len(y_pred)), y_pred, color="red", label="Predicted Prices")
plt.xlabel("Sample Index")
plt.ylabel("Price")
plt.legend()
plt.title("Actual vs Predicted Prices")
plt.show()
Output:
How the Code Works
Dataset: A small dataset with the features Area, Bedrooms, and Location, and the target Price.
Location is encoded as an integer.
Splitting Data: Divides the data into training (80%) and testing (20%) sets.
Linear Regression: Uses LinearRegression from scikit-learn to build the model.
Evaluation: Computes Mean Squared Error (MSE) and the R² score to evaluate the model's
performance (a short numerical sketch follows below).
Visualization: Compares actual vs. predicted prices.
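To make the evaluation step concrete, the short sketch below recomputes MSE and the R² score directly from their definitions, reusing the y_test and y_pred variables from the script above. Note that with the 5-row toy dataset the test split contains a single sample, so R² is not meaningful until a larger dataset is used.
import numpy as np

# Recompute the two metrics by hand (equivalent to sklearn's functions).
y_true = np.asarray(y_test, dtype=float)
y_hat = np.asarray(y_pred, dtype=float)

mse_manual = np.mean((y_true - y_hat) ** 2)        # average of squared errors
ss_res = np.sum((y_true - y_hat) ** 2)             # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares

print("Manual MSE:", mse_manual)
if ss_tot > 0:
    print("Manual R2:", 1 - ss_res / ss_tot)       # R^2 = 1 - SS_res / SS_tot
else:
    print("R2 is undefined for a single test sample.")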
Reading Data from an Excel File instead of an In-Code Dictionary
First, create the dataset and save it as 1_Housing.xlsx.
Upload this file to Google Drive so that the notebook can read it.
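If the notebook runs in Google Colab and the file is kept in Google Drive, the drive must be mounted before pandas can read the file. A minimal sketch; the path under /content/drive/MyDrive is an assumption about where 1_Housing.xlsx was uploaded.
from google.colab import drive

drive.mount('/content/drive')  # authorize access to Google Drive

# Adjust this path to wherever 1_Housing.xlsx sits in your Drive (assumed location).
data_path = "/content/drive/MyDrive/1_Housing.xlsx"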
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Load data from an Excel file
data_path = "1_Housing.xlsx" # Replace with your Excel file path
df = pd.read_excel(data_path)
# Features and target
X = df[["Area", "Bedrooms", "Location"]]
y = df["Price"]
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Model evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Model Coefficients:", model.coef_)
print("Model Intercept:", model.intercept_)
print("Mean Squared Error:", mse)
print("R2 Score:", r2)
# Visualize actual vs predicted
plt.scatter(range(len(y_test)), y_test, color="blue", label="Actual Prices")
plt.scatter(range(len(y_pred)), y_pred, color="red", label="Predicted Prices")
plt.xlabel("Sample Index")
plt.ylabel("Price")
plt.legend()
plt.title("Actual vs Predicted Prices")
plt.show()
Output:
Reading the data from an Excel file and then predicting the price based on Area, Number of
Bedrooms, and Location.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Load the dataset from an Excel file
df = pd.read_excel('1_Housing.xlsx') # Replace with your own file path if it differs
# Features and target
X = df[["Area", "Bedrooms", "Location"]]
y = df["Price"]
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Model evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Model Coefficients:", model.coef_)
print("Model Intercept:", model.intercept_)
print("Mean Squared Error:", mse)
print("R2 Score:", r2)
# Visualize actual vs predicted
plt.scatter(range(len(y_test)), y_test, color="blue", label="Actual Prices")
plt.scatter(range(len(y_pred)), y_pred, color="red", label="Predicted Prices")
plt.xlabel("Sample Index")
plt.ylabel("Price")
plt.legend()
plt.title("Actual vs Predicted Prices")
plt.show()
# Predict price for new data
new_data = pd.DataFrame({
    "Area": [float(input("Enter Area: "))],
    "Bedrooms": [int(input("Enter Bedrooms: "))],
    "Location": [int(input("Enter Location (e.g., 1 for City A, 2 for City B): "))]
})
predicted_price = model.predict(new_data)
print(f"Predicted Price: {predicted_price[0]}")
Practical 2
Aim: 2. Classify a dataset using a k-Nearest Neighbors (kNN) classifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Load dataset
data = load_iris()
X = data.data
y = data.target
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Train kNN classifier
k=3
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, y_train)
# Make predictions
y_pred = knn.predict(X_test)
# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
Output:
Adding some visualization to the code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Load dataset
data = load_iris()
X = data.data
y = data.target
feature_names = data.feature_names
class_names = data.target_names
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Train kNN classifier
k=3
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, y_train)
# Make predictions
y_pred = knn.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
# Visualizations
# 1. Pairplot of features
df = pd.DataFrame(X, columns=feature_names)
df['target'] = y
sns.pairplot(df, hue='target', diag_kind='hist', palette='Set2')
plt.suptitle('Feature Distributions and Pairwise Scatter Plots', y=1.02)
plt.show()
# 2. Confusion Matrix Heatmap
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=class_names,
yticklabels=class_names)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.show()
# 3. Decision Boundaries (for 2 features only, e.g., first two features)
if X.shape[1] == 2:  # Only possible for datasets with 2 features
    X_plot = X[:, :2]  # Use the first two features
    X_train_plot, X_test_plot, y_train_plot, y_test_plot = train_test_split(
        X_plot, y, test_size=0.2, random_state=42)
    X_train_plot = scaler.fit_transform(X_train_plot)
    X_test_plot = scaler.transform(X_test_plot)
    knn_2d = KNeighborsClassifier(n_neighbors=k)
    knn_2d.fit(X_train_plot, y_train_plot)
    # Create a mesh grid
    x_min, x_max = X_train_plot[:, 0].min() - 1, X_train_plot[:, 0].max() + 1
    y_min, y_max = X_train_plot[:, 1].min() - 1, X_train_plot[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01),
                         np.arange(y_min, y_max, 0.01))
    Z = knn_2d.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot decision boundary
    plt.figure(figsize=(8, 6))
    plt.contourf(xx, yy, Z, alpha=0.8, cmap='Set3')
    scatter = plt.scatter(X_test_plot[:, 0], X_test_plot[:, 1], c=y_test_plot, edgecolor='k', cmap='viridis')
    plt.title('kNN Decision Boundary (2D)')
    plt.xlabel(feature_names[0])
    plt.ylabel(feature_names[1])
    legend = plt.legend(handles=scatter.legend_elements()[0], labels=list(class_names))
    plt.show()
else:
    print("Decision boundary visualization is only possible for datasets with 2 features.")
Output:
How does kNN work?
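kNN classifies a new sample by finding the k training samples closest to it (typically by Euclidean distance) and taking a majority vote of their labels. The following from-scratch sketch illustrates that voting logic; it is not part of the original manual and assumes the standardized X_train, y_train, and X_test arrays defined above.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_test, k=3):
    """Predict labels for X_test by majority vote among the k nearest training points."""
    predictions = []
    for x in X_test:
        distances = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to every training point
        nearest = np.argsort(distances)[:k]              # indices of the k closest points
        votes = Counter(y_train[nearest])                # count class labels among the neighbours
        predictions.append(votes.most_common(1)[0][0])   # majority label wins
    return np.array(predictions)

# Should closely match knn.predict(X_test) from scikit-learn above.
manual_pred = knn_predict(X_train, y_train, X_test, k=3)
print("Manual kNN predictions:", manual_pred)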
Practical 3
Aim: 3. Implement a decision tree algorithm to classify email spam
based on keywords and sender information.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score, classification_report
# Sample dataset
data = {
    'contains_free': [1, 0, 1, 0, 1, 0, 1, 0],
    'contains_offer': [0, 1, 1, 0, 0, 1, 1, 0],
    'sender_known': [0, 1, 1, 1, 0, 1, 0, 0],
    'spam': [1, 0, 1, 0, 1, 0, 1, 0]
}

# Create a DataFrame
df = pd.DataFrame(data)
# Features and target
X = df[['contains_free', 'contains_offer', 'sender_known']]
y = df['spam']
# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# Initialize the Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)
# Train the model
clf.fit(X_train, y_train)
# Make predictions
y_pred = clf.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# Display the decision tree
tree_rules = export_text(clf, feature_names=list(X.columns))
print("\nDecision Tree Rules:")
print(tree_rules)
Output:
Explanation of the Code
1. Dataset:
The dataset contains three features:
o contains_free: Indicates if the email contains the keyword "free" (1 for yes,
0 for no).
o contains_offer: Indicates if the email contains the keyword "offer" (1 for
yes, 0 for no).
o sender_known: Indicates if the sender is known (1 for yes, 0 for no).
The spam column is the target (1 for spam, 0 for not spam).
2. Splitting the Dataset:
The dataset is split into training and testing sets using train_test_split.
3. Model Training:
A decision tree classifier is initialized and trained on the training set.
4. Model Evaluation:
Predictions are made on the testing set, and the accuracy and classification report are
printed.
5. Visualizing the Tree:
The export_text function generates human-readable decision rules for the tree; a graphical plot_tree sketch follows after the example output below.
Output Example
Accuracy: 100.00%
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 2
1 1.00 1.00 1.00 2
accuracy 1.00 4
macro avg 1.00 1.00 1.00 4
weighted avg 1.00 1.00 1.00 4
Decision Tree Rules:
|--- contains_free <= 0.50
| |--- sender_known <= 0.50
| | |--- class: 0
| |--- sender_known > 0.50
| | |--- class: 0
|--- contains_free > 0.50
| |--- class: 1
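For a graphical view of the same tree, scikit-learn's plot_tree can be used instead of (or alongside) export_text; a minimal sketch, assuming the clf and X defined above and the class names "not spam"/"spam" used in this practical:
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(8, 6))
plot_tree(clf,
          feature_names=list(X.columns),
          class_names=["not spam", "spam"],  # 0 = not spam, 1 = spam
          filled=True)
plt.title("Decision Tree for Spam Classification")
plt.show()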
Notes
1. Customization: You can replace the sample dataset with a larger and more realistic
email dataset.
2. Feature Engineering: Add more features like the length of the email, frequency of
certain words, etc.
3. Model Tuning: Adjust the parameters of DecisionTreeClassifier (e.g., max_depth,
min_samples_split) for better performance; a small grid-search sketch follows below.
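A minimal tuning sketch with GridSearchCV, reusing the X and y defined above; the parameter grid and the small cv value are illustrative choices for this tiny 8-row dataset, not prescribed settings.
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Illustrative grid; widen or narrow it for a real dataset.
param_grid = {
    "max_depth": [2, 3, 4, None],
    "min_samples_split": [2, 3, 4],
}

grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid, cv=2, scoring="accuracy")
grid.fit(X, y)

print("Best parameters:", grid.best_params_)
print("Best cross-validated accuracy:", grid.best_score_)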
Practical 4
Aim: 4. Cluster customer data based on purchase history using k-means
clustering.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Simulated customer data
data = {
    'customer_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'total_spent': [500, 1500, 300, 800, 2500, 200, 1000, 1800, 400, 700],
    'frequency': [5, 20, 2, 8, 30, 1, 12, 25, 3, 6],
    'average_purchase_value': [100, 75, 150, 100, 83, 200, 83, 72, 133, 117],
}
# Create a DataFrame
df = pd.DataFrame(data)
# Features for clustering
features = df[['total_spent', 'frequency', 'average_purchase_value']]
# Standardize the features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)
# Apply KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(scaled_features)
# Add cluster labels to the DataFrame
df['cluster'] = clusters
# Visualize the clusters using PCA (2D projection)
pca = PCA(n_components=2)
pca_features = pca.fit_transform(scaled_features)
plt.figure(figsize=(8, 6))
for cluster in np.unique(clusters):
    plt.scatter(
        pca_features[clusters == cluster, 0],
        pca_features[clusters == cluster, 1],
        label=f'Cluster {cluster}'
    )
# Add centroids to the plot
centroids = pca.transform(kmeans.cluster_centers_)
plt.scatter(centroids[:, 0], centroids[:, 1], s=200, c='black', marker='X', label='Centroids')
plt.title('Customer Clusters')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.legend()
plt.grid()
plt.show()
# Print clustered data
print("Clustered Customer Data:")
print(df)
Output:
Key Components:
1. Dataset:
o total_spent: Total amount spent by a customer.
o frequency: Number of purchases.
o average_purchase_value: Average value of each purchase.
2. Standardization:
o Used StandardScaler to normalize features for better clustering
performance.
3. K-Means:
o Specified n_clusters=3 (can be optimized using the elbow method or
silhouette score).
4. Visualization:
o Used PCA for 2D visualization of high-dimensional data.
o Plotted clusters with their centroids.
Notes:
1. Elbow Method: Use the elbow method to determine the optimal number of clusters
by plotting the inertia for different cluster counts (a sketch follows after this list).
2. Feature Engineering: Add more features like customer lifetime value, recency, etc.,
for better clustering.
3. Real Data: Replace the simulated data with real customer data.
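A minimal elbow-method sketch, reusing the scaled_features array from the script above; the range of cluster counts is an illustrative choice for this 10-customer dataset.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = []
cluster_range = range(1, 8)  # keep the range small: there are only 10 customers
for n in cluster_range:
    km = KMeans(n_clusters=n, random_state=42, n_init=10)
    km.fit(scaled_features)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

plt.plot(list(cluster_range), inertias, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.title("Elbow Method for Choosing k")
plt.grid()
plt.show()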
Practical 5
Aim: 5. Predict future stock prices using a time series forecasting model
(e.g., ARIMA)
import pandas as pd
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
import matplotlib.pyplot as plt
# Step 1: Create a small dataset for stock prices
data = {
    "Date": pd.date_range(start="2024-01-01", periods=10, freq="D"),
    "Stock_Price": [150 + i * 2 + np.random.uniform(-3, 3) for i in range(10)]  # Simulated stock prices with noise
}
stock_data = pd.DataFrame(data)
stock_data.set_index("Date", inplace=True)
# Step 2: Fit an ARIMA model
stock_prices = stock_data['Stock_Price']
model = ARIMA(stock_prices, order=(1, 1, 1))  # ARIMA parameters (p, d, q)
fitted_model = model.fit()
# Step 3: Forecast future stock prices
forecast_steps = 5
forecast = fitted_model.forecast(steps=forecast_steps)
# Step 4: Plot the actual and forecasted stock prices
plt.figure(figsize=(10, 6))
plt.plot(stock_prices, label="Actual Stock Prices", marker="o")
forecast_index = pd.date_range(start=stock_data.index[-1] + pd.Timedelta(days=1),
                               periods=forecast_steps, freq='D')
plt.plot(forecast_index, forecast, label="Forecasted Stock Prices",
         marker="x", linestyle="--", color="red")
plt.xlabel("Date")
plt.ylabel("Stock Price")
plt.title("Stock Price Forecast using ARIMA")
plt.legend()
plt.grid()
plt.show()
# Step 5: Display the forecasted values
print("Forecasted Stock Prices:")
forecast_df = pd.DataFrame({"Date": forecast_index,
                            "Forecasted Price": forecast.values})
print(forecast_df)
Output:
Here’s a step-by-step explanation of the code:
Step 1: Create a Small Dataset for Stock Prices
data = {
    "Date": pd.date_range(start="2024-01-01", periods=10, freq="D"),
    "Stock_Price": [150 + i * 2 + np.random.uniform(-3, 3) for i in range(10)]  # Simulated stock prices with noise
}
stock_data = pd.DataFrame(data)
stock_data.set_index("Date", inplace=True)
1. pd.date_range: Creates a range of 10 consecutive dates starting from "2024-01-01".
2. Stock_Price formula: Simulates prices starting at 150 and incrementing by 2 per day,
with random noise added using np.random.uniform(-3, 3).
3. pd.DataFrame: Stores the generated data in a pandas DataFrame.
4. set_index: Sets the Date column as the index, which is essential for time series
analysis.
Step 2: Fit an ARIMA Model
stock_prices = stock_data['Stock_Price']
model = ARIMA(stock_prices, order=(1, 1, 1)) # ARIMA parameters (p, d, q)
fitted_model = model.fit()
1. ARIMA: A popular time series forecasting model with three parameters:
o p: Autoregressive order (how past values influence current values).
o d: Degree of differencing (removes trends in the data).
o q: Moving average order (models residuals/errors).
2. Fit the model: The fit() method trains the ARIMA model on the dataset.
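Choosing (p, d, q) is usually guided by a stationarity test and by the ACF/PACF plots rather than fixed in advance; the order (1, 1, 1) above is a simple illustrative choice. A minimal sketch with statsmodels, reusing the stock_prices series (the small maxlag and lags values are assumptions made because the toy series has only 10 points):
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Augmented Dickey-Fuller test: a large p-value suggests the series is non-stationary,
# so differencing (d >= 1) is appropriate.
adf_stat, p_value, *_ = adfuller(stock_prices, maxlag=2)
print("ADF statistic:", adf_stat, "p-value:", p_value)

# ACF and PACF of the differenced series are commonly used to suggest q and p.
diff = stock_prices.diff().dropna()
plot_acf(diff, lags=3)
plot_pacf(diff, lags=3)
plt.show()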
Step 3: Forecast Future Stock Prices
forecast_steps = 5
forecast = fitted_model.forecast(steps=forecast_steps)
1. forecast_steps: Specifies the number of future days to predict (5 in this case).
2. fitted_model.forecast: Generates the predicted values for the specified number of
steps.
Step 4: Plot Actual and Forecasted Stock Prices
plt.figure(figsize=(10, 6))
plt.plot(stock_prices, label="Actual Stock Prices", marker="o")
forecast_index = pd.date_range(start=stock_data.index[-1] + pd.Timedelta(days=1),
                               periods=forecast_steps, freq='D')
plt.plot(forecast_index, forecast, label="Forecasted Stock Prices",
         marker="x", linestyle="--", color="red")
plt.xlabel("Date")
plt.ylabel("Stock Price")
plt.title("Stock Price Forecast using ARIMA")
plt.legend()
plt.grid()
plt.show()
1. Plot actual data: plt.plot displays the historical stock prices (stock_prices).
2. Create forecast dates: pd.date_range generates future dates starting from the day
after the last date in the dataset.
3. Plot forecast: Plots the predicted stock prices on the same graph.
4. Styling: Adds labels, title, legend, and grid for better visualization.
Step 5: Display Forecasted Values
forecast_df = pd.DataFrame({"Date": forecast_index,
                            "Forecasted Price": forecast.values})
print(forecast_df)
1. Create DataFrame: Combines the forecasted dates and predicted prices into a new
DataFrame.
2. Print results: Displays the forecasted values for better clarity.
Output
• Graph: Shows both the actual stock prices and the forecasted values with clear
markers.
• Table: Displays the forecasted stock prices in tabular form.
Practical 6
Aim: 6. Develop a sentiment analysis model to classify movie reviews
as positive, negative, or neutral.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
# Step 1: Create a small dataset of movie reviews
data = {
    "Review": [
        "The movie was fantastic! I loved the characters and the plot.",
        "What a terrible movie. It was a complete waste of time.",
        "The movie was okay, not too good, not too bad.",
        "Absolutely loved it! One of the best movies I've seen this year.",
        "The plot was predictable, but the acting was decent.",
        "Horrible! I couldn't even finish it.",
        "It was just fine. Nothing special, nothing terrible.",
        "A masterpiece. Beautifully directed and acted.",
        "Worst movie ever. Do not watch this.",
        "Pretty average. Had some good moments but also some flaws."
    ],
    "Sentiment": [
        "Positive",
        "Negative",
        "Neutral",
        "Positive",
        "Neutral",
        "Negative",
        "Neutral",
        "Positive",
        "Negative",
        "Neutral"
    ]
}

# Convert to DataFrame
df = pd.DataFrame(data)
# Step 2: Split the data into training and test sets
X = df['Review']
y = df['Sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Step 3: Create a pipeline for vectorization and classification
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),  # Converts text into numerical features
    ('classifier', MultinomialNB())     # Naive Bayes classifier
])
# Step 4: Train the model
pipeline.fit(X_train, y_train)
# Step 5: Evaluate the model
y_pred = pipeline.predict(X_test)
print("Classification Report:")
print(classification_report(y_test, y_pred))
# Step 6: Test with new reviews
new_reviews = [
    "What an amazing film! I would watch it again.",
    "It was boring and predictable. Not worth my time.",
    "An average movie. Nothing stood out."
]
predictions = pipeline.predict(new_reviews)

# Display predictions
for review, sentiment in zip(new_reviews, predictions):
    print(f"Review: {review}\nPredicted Sentiment: {sentiment}\n")
Output:
Explanation
1. Dataset:
o A small dataset with 10 movie reviews labeled as Positive, Negative, or
Neutral.
2. Splitting Data:
o The dataset is split into training (70%) and testing (30%) sets to evaluate the
model's performance.
3. Pipeline:
o CountVectorizer: Converts text into a numerical format using word
frequency.
o MultinomialNB: A Naive Bayes classifier, effective for text classification
tasks. (A TF-IDF variant of the same pipeline is sketched after this list.)
4. Training:
o The model is trained on the training dataset using the pipeline.
5. Evaluation:
o The model is tested on unseen reviews (test set) using
classification_report.
6. Predictions:
o The trained model predicts sentiments for new reviews.
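To see the effect of the vectorization step, and as a common variant, the sketch below swaps CountVectorizer for TfidfVectorizer inside the same pipeline; this is an illustrative alternative rather than part of the original script, and it reuses the X_train, X_test, y_train, and y_test splits from above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Same structure as before, but words are weighted by TF-IDF instead of raw counts.
tfidf_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', MultinomialNB())
])
tfidf_pipeline.fit(X_train, y_train)
print("Test accuracy with TF-IDF:", tfidf_pipeline.score(X_test, y_test))

# Inspect part of the learned vocabulary (word -> column index in the feature matrix).
vocab = tfidf_pipeline.named_steps['vectorizer'].vocabulary_
print(list(vocab.items())[:10])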
Output
• Classification Report: Displays precision, recall, and F1 scores.
• Predictions: Shows the predicted sentiment for new reviews.
Practical 7
Aim: 7. Explore dimensionality reduction techniques like Principal
Component Analysis (PCA) to visualize high-dimensional data.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
import seaborn as sns
# Step 1: Generate high-dimensional data
X, y = make_classification(
    n_samples=500,       # Number of samples
    n_features=10,       # Number of features
    n_informative=5,     # Number of informative features
    n_redundant=2,       # Number of redundant features
    n_classes=3,         # Number of classes
    random_state=42
)
# Step 2: Apply PCA
pca = PCA(n_components=2) # Reduce to 2 components for visualization
X_pca = pca.fit_transform(X)
# Step 3: Create a DataFrame for visualization
pca_df = pd.DataFrame(X_pca, columns=["PCA1", "PCA2"])
pca_df["Target"] = y
# Step 4: Visualize the PCA results
plt.figure(figsize=(10, 6))
sns.scatterplot(data=pca_df, x="PCA1", y="PCA2", hue="Target", palette="Set2", s=70)
plt.title("PCA Visualization of High-Dimensional Data", fontsize=14)
plt.xlabel("Principal Component 1", fontsize=12)
plt.ylabel("Principal Component 2", fontsize=12)
plt.legend(title="Class")
plt.grid()
plt.show()
# Step 5: Explain variance captured by PCA components
explained_variance = pca.explained_variance_ratio_
print("Explained Variance by Each Component:")
for i, variance in enumerate(explained_variance, 1):
    print(f"Component {i}: {variance:.2f}")
Explanation
1. Data Generation:
o make_classification creates a synthetic dataset with 10 features and 3
classes.
o n_informative and n_redundant specify the number of informative and
redundant features.
2. PCA Application:
o PCA(n_components=2) reduces the dimensionality to 2 components for easy
visualization.
o fit_transform applies PCA to the data.
3. Visualization:
o Seaborn scatterplot: Displays data points in a 2D space (PCA1 vs. PCA2)
colored by their class labels.
4. Variance Explained:
o explained_variance_ratio_: Provides the proportion of variance explained
by each principal component; a cumulative-variance sketch follows below.
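To decide how many components are worth keeping in general, one commonly plots the cumulative explained variance; a minimal sketch, reusing the X array generated above (fitting PCA with all 10 components here is an illustrative choice):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Fit PCA with every component to see how quickly variance accumulates.
pca_full = PCA(n_components=10)
pca_full.fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

plt.plot(range(1, 11), cumulative, marker="o")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.title("Choosing the Number of PCA Components")
plt.grid()
plt.show()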
Practical 8
Aim: 8. Train a support vector machine (SVM) to classify data.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

# Step 1: Generate a synthetic dataset with two classes and four features
X, y = make_classification(
    n_samples=500,       # Number of samples
    n_features=4,        # Number of features
    n_informative=2,     # Number of informative features
    n_redundant=0,       # Remaining features are noise
    n_classes=2,         # Number of classes
    random_state=42
)

# Step 2: Split the data into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 3: Train a Support Vector Machine with a linear kernel
svm_clf = SVC(kernel='linear', random_state=42)
svm_clf.fit(X_train, y_train)

# Step 4: Make predictions on the test set
y_pred = svm_clf.predict(X_test)

# Step 5: Evaluate the model
print("Classification Report:")
print(classification_report(y_test, y_pred))

conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.show()

# Step 6: Visualize the decision boundary using the first two features
svm_2d = SVC(kernel='linear', random_state=42)
svm_2d.fit(X_train[:, :2], y_train)

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))
Z = svm_2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='coolwarm', edgecolor='k', label='Train')
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap='coolwarm', marker='s', edgecolor='k', label='Test')
plt.title('SVM Decision Boundary (first two features)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()
Output:
Explanation
1. Dataset Creation:
o make_classification: Generates a synthetic dataset with two classes and
four features.
o 2 features are informative, and the rest are noise.
2. Train-Test Split:
o Splits data into 70% training and 30% testing sets.
3. Training the SVM:
o SVC: Trains a Support Vector Machine with a linear kernel to classify data (a kernel and C tuning sketch follows at the end of this practical).
4. Predictions:
o Predictions are made on the testing data.
5. Evaluation:
o classification_report: Displays precision, recall, F1-score, and accuracy.
o Confusion matrix: Provides a visual representation of true and false
predictions.
6. Visualization:
o Decision boundaries are plotted for the first two features to demonstrate how
the SVM separates the classes.
Output
1. Classification Report:
o Precision, recall, F1-score, and accuracy for each class.
2. Confusion Matrix:
o A heatmap displaying the confusion matrix.
3. Decision Boundary Plot:
o Visualization of the decision boundaries and the training/testing points.
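Beyond the linear kernel used above, SVM performance often depends on the kernel and the regularization parameter C. A minimal tuning sketch with GridSearchCV, reusing the X_train, y_train, X_test, and y_test arrays from the script above; the grid values are illustrative, and gamma only matters for the RBF kernel.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative search space; adjust for your own data.
param_grid = {
    "kernel": ["linear", "rbf"],
    "C": [0.1, 1, 10],
    "gamma": ["scale", "auto"],
}

grid = GridSearchCV(SVC(random_state=42), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best cross-validated accuracy:", grid.best_score_)
print("Test accuracy:", grid.score(X_test, y_test))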