VI Semester B.C.A.
Examination, July/August - 2024
Machine Learning
SECTION-A
Answer any Four of the following questions.
1. Define machine learning.
Ans: Refer MQP 2 Q 1
2. What is a dataset?
Ans: Refer MQP 2 Q 2
3. Define regression. Give an example.
Ans: Refer MQP 1 Q 2
4. Define clustering. Mention one application.
Ans: Refer MQP 1 Q 6
5. Mention any two tools used for machine learning.
Ans: Refer MQP 2 Q 4
6. What is data splitting?
Ans: Refer MQP 2 Q 3
SECTION-B
Answer any Four of the following questions.
7. Explain types of machine learning with examples.
Ans: Refer MQP 1 Q 7
8. Explain exploratory data analysis and data cleaning.
Ans: Refer MQP 1 Q 11
9. Explain Bayes' theorem with an example.
Ans: Refer MQP 2 Q 7
10. Explain K-means clustering for image segmentation.
Ans: Refer MQP 1 Q 15
11. Explain how DBSCAN works.
Ans: Refer MQP 2 Q 14
12. Write and explain K-Nearest Neighbour Algorithm.
Ans: Refer MQP 1 Q 8
K-NN Algorithm:
Step-1: Choose the Number of Neighbors (K):
The first step in the K-NN algorithm is to select the number of neighbors (K) that will be considered
when making predictions for a new data point. The value of K is a hyperparameter that needs to be
specified before running the algorithm.
Step-2: Calculate Distance:
Compute the distance between the new data point and all the data points in the training set. The
distance metric, commonly Euclidean distance, measures the similarity or proximity between data
points in the feature space.
Step-3: Sort and Select Nearest Neighbors:
After calculating the distances, sort them in ascending order and select the K data points with the smallest distances to the new data point. These K data points are the nearest neighbors to the new data point in the feature space.
Step-4: For Classification:
In the classification task, assign a class label to the new data point based on the majority class among
the K nearest neighbors. The class with the highest frequency among the K neighbors is chosen as the
predicted class for the new data point.
Step-5: For Regression:
In regression tasks, compute the average (or weighted average) of the target values of the K nearest
neighbors. This average value serves as the predicted target value for the new data point in regression
analysis.
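A minimal sketch of the steps above, implemented with NumPy for the classification case; the training arrays and query point below are illustrative placeholders, not data from the question.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, K=3):
    # Step 2: Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the K smallest distances (the nearest neighbours)
    nearest = np.argsort(distances)[:K]
    # Step 4: majority vote among the K neighbours (classification)
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [6.0, 7.0], [7.0, 8.0]])
y_train = np.array([0, 0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([2.5, 3.0]), K=3))  # predicts class 0

For regression (Step 5), the majority vote would be replaced by the mean of the K neighbours' target values.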
SECTION-C
Answer any Four of the following questions.
13. Explain main challenges of Machine Learning.
Ans: Refer MQP 2 Q 11
14. Explain how to prepare the data for Machine Learning Algorithms.
Ans: Data preparation is a fundamental aspect of the Machine Learning workflow, essential for
optimizing the data before model training. It involves a series of steps such as cleaning, transforming,
and structuring the data to make it well-suited for the specific algorithm being used. Each step in this
process serves a unique purpose and employs specific techniques to enhance the quality and usability
of the dataset. Proper data preparation is crucial for ensuring the accuracy and effectiveness of the
machine learning model during training and evaluation.
1. Data Cleaning: Data cleaning, also known as data cleansing, is the process of identifying and
correcting errors, inconsistencies, and missing values in a dataset to improve its quality and reliability
for analysis and modelling. Data cleaning is a crucial step in data preprocessing as it ensures that the
data is accurate, complete, and consistent.
i. Handling Missing Values: Missing values in data refer to the absence of information or data points for certain observations or attributes in a dataset. Handling missing values is crucial in data preprocessing to ensure the quality and reliability of the machine learning model.
Example: The Titanic passengers dataset has missing values in the Age and Cabin columns. The passenger information was extracted from various historical sources, and in this case the missing values could not be recovered from those sources.
ii. Handling Outliers: Outliers are data points that significantly differ from other observations in a dataset. These data points can skew statistical analyses and machine learning models, leading to
inaccurate results. Outliers can occur due to various reasons such as measurement errors, data entry
mistakes, or genuine extreme values in the data.
Example: In a dataset containing information about individuals, such as their age, it is common to
encounter outliers, such as ages above 100 years. While some individuals may indeed be over 100
years old, extreme ages can impact statistical analyses and machine learning models if not handled
appropriately.
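A short sketch of both cleaning steps with pandas; the tiny DataFrame and the median/IQR strategies below are illustrative assumptions, not the only possible choices.

import pandas as pd

df = pd.DataFrame({"Age": [22, None, 38, 150, 29], "Fare": [7.25, 71.3, 8.05, 512.3, None]})

# Missing values: fill numeric gaps with the column median (one common strategy)
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Fare"] = df["Fare"].fillna(df["Fare"].median())

# Outliers: clip values outside 1.5 * IQR to the nearest boundary
q1, q3 = df["Age"].quantile([0.25, 0.75])
iqr = q3 - q1
df["Age"] = df["Age"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
print(df)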
2. Data Transformation: Data transformation is a fundamental process in data preprocessing that
involves modifying the original data to make it more suitable for analysis or modeling. This
transformation can help improve the quality of the data, address issues like skewness or outliers, and
enhance the performance of machine learning algorithms.
i. Normalization: Normalization is a type of data transformation that scales the values of numerical
features to a standard range, typically between 0 and 1.
ii. Standardization: Standardization is another data transformation technique that centres the data
around a mean of 0 and scales it to have a standard deviation of 1. It is like converting heights and
weights to z-scores.
iii. Log Transformation: Log transformation is applied to skewed data, like converting income values
to their logarithmic form to handle extreme values.
iv. Encoding Categorical Variables: Converting categorical variables into numerical representations
through techniques like one-hot encoding or label encoding is a form of data transformation.
One-hot Encoding: Creates a new binary column for each category level.
Label Encoding: Assigns a unique integer based on the alphabetical ordering of the categories.
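A brief sketch of these four transformations using scikit-learn and pandas; the column names and values are made up for illustration.

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder

df = pd.DataFrame({"income": [30000, 45000, 1200000], "city": ["A", "B", "A"]})

df["income_norm"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()   # scale to [0, 1]
df["income_std"] = StandardScaler().fit_transform(df[["income"]]).ravel()  # mean 0, std 1
df["income_log"] = np.log1p(df["income"])                                  # log transform for skewed values

one_hot = pd.get_dummies(df["city"], prefix="city")                        # one-hot encoding
df["city_label"] = LabelEncoder().fit_transform(df["city"])                # label encoding
print(pd.concat([df, one_hot], axis=1))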
3. Data Reduction: Data reduction is a critical step in preparing data for efficient analysis, especially
in contexts involving large datasets or complex models. The process of data reduction involves
diminishing the amount of data that needs to be processed and analyzed without significantly
sacrificing valuable information.
i. Feature Creation: This involves creating new variables from existing data to provide additional
insight to the models. This might involve combining features, deriving new metrics from existing
data, or aggregating data over time or space.
ii. Feature Transformation: Transforming features to enhance their predictive power or making
them more suitable for models. Common transformations include normalization, scaling,
applying mathematical functions like logarithms or exponentials, and more.
4. Feature Engineering: Feature engineering is a fundamental process in the field of machine
learning where raw data is transformed into formatted datasets that machine learning algorithms can
work with more effectively. This process involves creating new features from existing data,
transforming data into more useful formats, or enhancing the quality of data to improve the accuracy
and efficiency of predictive models.
5. Data Splitting: Data splitting in machine learning is the process of dividing the data into separate
subsets to be used at different stages of model building and evaluation. The primary goal of data
splitting is to ensure that the model trained on one set of data can generalize well to new, unseen data.
This helps avoid problems like overfitting, where a model performs well on the training data but
poorly on new data.
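A minimal splitting sketch with scikit-learn; X and y here are placeholder arrays.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y = np.arange(10)                  # 10 target values

# 80% for training, 20% for testing; a fixed random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)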
15. Explain confusion matrix and performance evaluation metrics in classification.
Ans: Refer MQP 2 Q 15. The main performance evaluation metrics are:
1. Accuracy: Accuracy is the most commonly used metric for evaluating classification models. It
measures the proportion of correct predictions made by the model out of the total number of
predictions. It is calculated as the ratio of the number of correct predictions to the total number of
predictions. Accuracy = (TP + TN) / (TP + FP + TN + FN)
A high accuracy score indicates that the model is making correct predictions most of the time.
However, accuracy can be misleading when the class distribution is imbalanced.
2. Precision: Precision measures the proportion of true positives among the instances that the model predicted as positive. It is calculated as the ratio of the number of true positives to the total number of
instances predicted as positive. Precision = TP / (TP + FP)
A high precision score indicates that the model is making fewer false positive predictions. It is useful
when the cost of false positives is high.
3. Recall: Recall measures the proportion of true positives among the instances that are actually
positive. It is calculated as the ratio of the number of true positives to the total number of actual
positive instances. Recall = TP / (TP + FN)
A high recall score indicates that the model is capturing a majority of the actual positive instances. It
is useful when the cost of false negatives is high.
4. F1 score: F1 score is the harmonic mean of precision and recall. It provides a balance between the
two metrics and is particularly useful when the class distribution is imbalanced.
F1 score = 2 * (precision * recall) / (precision + recall)
A high F1 score indicates that the model has both good precision and recall. It is useful when both false positives and false negatives are equally important.
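A short sketch computing the confusion matrix and all four metrics with scikit-learn; the y_true and y_pred labels below are illustrative.

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))   # [[TN FP], [FN TP]]
print(accuracy_score(y_true, y_pred))     # (TP + TN) / total
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall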
16. Explain any four unsupervised learning techniques.
Ans: Refer MQP 2 Q 16
17. Explain:
a) Scikit-learn and pandas.
Ans: Refer MQP 2 Q 13
b) Explain the steps to select and train a model.
Ans: Process of Selecting and Training a Machine Learning Model
Step 1: Model Selection
The choice of machine learning model is crucial and depends on the nature of the problem at hand.
Understanding the problem type, whether it involves regression, classification, clustering, or other tasks, is essential for selecting the most appropriate model that can effectively address the specific requirements and characteristics of the data. For a task like predicting student performance (numeric
score prediction), regression models are typically suitable.
Example: For predicting student performance, a regression task could start with simpler models like
linear regression but may require more complex models like Random Forest Regressors if the
relationships between features and the target are non-linear. Given the initial analysis suggesting non-
linear patterns, we opt for a Random Forest Regressor due to its ability to handle complex data
structures and provide robustness against overfitting.
Step 2: Model Training
Model training is a critical step in the machine learning pipeline where the selected model is exposed
to the prepared dataset to learn patterns and relationships between the input features and the target
variable. During this phase, the model adjusts its internal parameters based on the training data to
minimize prediction error and improve its performance.
Example: The Random Forest model is trained using features such as study hours, attendance
records, and historical grades. This model does not require setting many hyperparameters initially but
does involve decisions about the number of trees and their depth, which we initially set to default
values for a baseline model.
Step 3: Model Evaluation
The model's performance is assessed using a validation set, a portion of the data held out from training that the model has not seen before. This evaluation helps in assessing the model's learning capability and its ability to generalize to new data.
Example: Evaluate the initial Random Forest model by calculating its Root Mean Square Error
(RMSE) on the validation set. If the performance is unsatisfactory, it suggests the need for tuning
hyperparameters or possibly revisiting the feature engineering step.
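A sketch of Steps 2 and 3 together: training a Random Forest Regressor with default hyperparameters and computing RMSE on a validation split. The synthetic "student" features below (study hours, attendance, past grade) are invented for illustration.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))                       # study hours, attendance, past grade
y = 5 * X[:, 0] + 2 * X[:, 1] + 3 * X[:, 2] + rng.normal(0, 2, 200)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(random_state=42)              # default hyperparameters as a baseline
model.fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
print(f"Validation RMSE: {rmse:.2f}")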
Step 4: Hyperparameter Tuning
Hyperparameters are parameters that are set before the learning process begins. They control the
learning process and model behavior but are not learned from the data. Examples include learning
rate, regularization strength, number of hidden layers in a neural network, and kernel type in Support
Vector Machines.
Fine-tuning the model's hyperparameters is crucial to enhance its performance. Techniques like grid
search or random search can be employed to systematically explore different hyperparameter
combinations.
Example: Adjusting hyperparameters such as the number of trees or tree depth in a Random Forest
model through grid search can help minimize RMSE on the validation set, thereby improving model
accuracy.
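A minimal grid-search sketch for the same Random Forest setup; the parameter values in the grid are arbitrary choices for illustration.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))                       # same synthetic student features as above
y = 5 * X[:, 0] + 2 * X[:, 1] + 3 * X[:, 2] + rng.normal(0, 2, 200)

param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                      scoring="neg_root_mean_squared_error", cv=5)
search.fit(X, y)
print(search.best_params_, "RMSE:", -search.best_score_)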
Step 5: Cross-Validation
Cross-validation ensures the model's stability across various data subsets. By repeatedly splitting the
data into training and validation sets, training on each subset, and averaging the results, the model's
robustness is assessed.
Example: Implementing 10-fold cross-validation on a Random Forest model involves dividing the
training data into 10 subsets, using each as a validation set once, and training on the remaining 9
subsets. The average RMSE across all validations provides a reliable performance estimate.
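A sketch of 10-fold cross-validation with scikit-learn, again on synthetic illustrative data.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 5 * X[:, 0] + 2 * X[:, 1] + 3 * X[:, 2] + rng.normal(0, 2, 200)

scores = cross_val_score(RandomForestRegressor(random_state=42), X, y,
                         scoring="neg_root_mean_squared_error", cv=10)
print("Average RMSE over 10 folds:", round(-scores.mean(), 2))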
Step 6: Final Model Training
After identifying the best model and hyperparameters, the final model is trained on the entire training
dataset to leverage all available data for optimal learning.
Example: The optimized Random Forest model, with fine-tuned hyperparameters from cross-
validation, is trained on the complete student dataset to maximize its predictive capabilities.
Step 7: Model Testing
The trained model is tested on a separate dataset that was not used during training or validation to
evaluate its performance and real-world applicability.
Example: The final test for the Random Forest model involves predicting exam scores for new
students based on their study habits and past performance. The model's predictions are compared
against actual scores to calculate the final RMSE, assessing its effectiveness.
18. Write note on:
a) Entropy and information gain.
Ans: Entropy and information gain are key concepts in information theory, data science, and machine learning. Entropy is a measure of uncertainty or unpredictability, whereas information gain is the amount of uncertainty removed by a particular decision or split. In data science, entropy can be used to assess the variety or unpredictability of a dataset, while information gain helps identify the attributes that are most useful to include in a model. The key distinction is that entropy quantifies the impurity of a dataset, whereas information gain quantifies how much that impurity is reduced when the data is split on a feature.
1. Entropy: The term "entropy" comes from the study of thermodynamics, and it describes how
chaotic or unpredictable a system is. Entropy is a measurement of a data set's impurity in the context
of machine learning. In essence, it is a method of calculating the degree of uncertainty in a given
dataset.
2. Information Gain: Information gain is a statistical metric used to assess a feature's usefulness for splitting a dataset. It is an important idea in machine learning and is frequently utilized in decision tree algorithms. Information gain is estimated by comparing the dataset's entropy before and after splitting on a feature; the higher a feature's information gain, the more relevant it is to classifying the data.
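A small sketch computing entropy and the information gain of one candidate split; the class labels and the split below are illustrative.

import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(parent, left, right):
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted          # entropy before split minus weighted entropy after

parent = ["yes", "yes", "yes", "no", "no", "no", "yes", "no"]
left, right = parent[:4], parent[4:]           # a candidate split on some feature
print(entropy(parent), information_gain(parent, left, right))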
b) Partitioning clustering and hierarchical clustering.
Ans:
1. Partitioning Clustering
Definition:
Partitioning clustering divides the dataset into distinct, non-overlapping groups (clusters) based on a
specific criterion. Each data point belongs to exactly one cluster.
Key Characteristics:
• Flat Structure: The result is a flat partition of the data into clusters.
• Number of Clusters: The number of clusters (k) must be specified in advance (e.g., in k-
means clustering).
• Centroid-Based: Many partitioning methods, like k-means, use centroids to represent
clusters. The algorithm iteratively assigns points to the nearest centroid and updates the
centroids based on the assigned points.
• Efficiency: Generally faster and more efficient for large datasets compared to hierarchical
methods.
• Sensitivity to Initialization: The results can vary based on the initial placement of centroids,
especially in k-means.
Common Algorithms:
• K-Means Clustering: Partitions data into k clusters by minimizing the variance within each
cluster.
• K-Medoids (PAM): Similar to k-means but uses actual data points (medoids) as cluster
centres.
Use Cases: Market segmentation, Image compression, Document clustering
2. Hierarchical Clustering
Definition:
Hierarchical clustering creates a tree-like structure (dendrogram) that represents the nested grouping
of data points. It can be agglomerative (bottom-up) or divisive (top-down).
Key Characteristics:
• Tree Structure: Produces a hierarchy of clusters, allowing for different levels of granularity.
• No Predefined Number of Clusters: The number of clusters does not need to be specified in
advance; it can be determined by cutting the dendrogram at a desired level.
• Agglomerative vs. Divisive:
o Agglomerative: Starts with each data point as its own cluster and merges them based
on similarity until one cluster remains.
o Divisive: Starts with one cluster containing all data points and splits it into smaller
clusters.
• Distance Metrics: The choice of distance metric (e.g., Euclidean, Manhattan) and linkage
criteria (e.g., single, complete, average) significantly affect the clustering results.
Common Algorithms:
• Agglomerative Clustering: Merges clusters based on distance metrics.
• Divisive Clustering: Splits clusters based on distance metrics.
Use Cases: Gene expression analysis, Social network analysis, Document clustering.
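A brief sketch contrasting the two families on the same toy data, using scikit-learn's k-means (partitioning) and agglomerative (hierarchical) implementations; the points are made up.

import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

X = np.array([[1, 1], [1.5, 2], [1, 2], [8, 8], [8, 9], [9, 8.5]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)     # k must be chosen in advance
agg = AgglomerativeClustering(n_clusters=2, linkage="average").fit(X)  # cut the hierarchy at 2 clusters

print("k-means labels:      ", kmeans.labels_)
print("agglomerative labels:", agg.labels_)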
Machine Learning
Model Questions
Short Answer Questions
1. Applications of machine learning.
Ans: The main applications of machine learning include:
1. Image Recognition
2. Speech Recognition
3. Medical Diagnosis
4. Traffic Prediction
5. Product Recommendations
6. Online Fraud Detection
7. Self-Driving Cars
8. Email Spam Filtering
9. Automatic Language Translation
10. Virtual Personal Assistants
11. Stock Market Trading
2. Scikit-learn with its features.
Ans: Scikit-learn (sklearn) is a popular Python library used for machine learning tasks. It provides a
wide range of tools for various stages of the machine learning process, including:
Features:
• Simple and Efficient: Easy-to-use interface for building and evaluating machine learning models.
• Comprehensive: Supports a wide range of supervised and unsupervised learning algorithms, including regression, classification, clustering, and dimensionality reduction.
• Model Selection: Tools for model selection and evaluation, such as cross-validation, grid search, and performance metrics.
• Data Preprocessing: Includes tools for data preprocessing like scaling, normalization, encoding categorical variables, handling missing values, etc.
• Integration: Seamless integration with other scientific Python libraries like NumPy, SciPy, and matplotlib.
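A small sketch that exercises several of these features at once (preprocessing, a classifier, and cross-validation) on scikit-learn's built-in Iris dataset.

from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))  # preprocessing + model
print(cross_val_score(pipe, X, y, cv=5).mean())                           # mean accuracy over 5 folds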
3. Visualization of data.
Ans: Data visualization is the graphical representation of information and data. By using visual
elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and
understand trends, outliers, and patterns in data. Steps involved are:
1. Define the Purpose
2. Collect and Prepare Data
3. Choose the Right Visualization Type
4. Design the Visualization
5. Interpret and Share
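A minimal sketch of two common plots with matplotlib; the data below is synthetic.

import numpy as np
import matplotlib.pyplot as plt

values = np.random.default_rng(0).normal(50, 10, 500)

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(values, bins=20)                  # distribution of a single variable
axes[0].set_title("Histogram")
axes[1].scatter(values[:-1], values[1:], s=5)  # relationship between two variables
axes[1].set_title("Scatter plot")
plt.tight_layout()
plt.show()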
4. List the different clustering techniques
Ans: Refer MQP 1 Q 9
5. Different linear models
Ans: The different types of linear models include:
1. Simple Linear Regression
2. Multiple Linear Regression
3. Polynomial Regression
4. Ridge Regression
5. Lasso Regression
6. Robust Regression
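A short sketch fitting three of the listed models on the same toy data to compare their coefficients; the data is generated for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, 100)

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))   # regularization shrinks coefficients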
Long answer Questions
6. Decision tree algorithm with its advantages and disadvantages.
Ans: Refer MQP 1 Q 18
7. Steps in Data Preparation process
Ans: Data Preparation is a crucial step in the machine learning pipeline that involves transforming
raw data into a suitable format for analysis and modeling. The process ensures the data is clean,
consistent, and ready for use in training models.
1. Data Collection:
• Sources: Gather data from various sources such as databases, APIs, surveys, or files.
• Integration: Combine data from multiple sources, ensuring consistency and completeness.
2. Data Cleaning:
• Handling Missing Values: Address missing data using imputation methods like mean
substitution, median substitution, or filling with a default value.
• Outlier Detection and Removal: Identify and handle outliers that can skew analysis, using
statistical methods or domain knowledge.
• Removing Duplicates: Detect and eliminate duplicate records to ensure data integrity.
3. Data Transformation:
• Normalization: Scale numerical features to a standard range, such as [0, 1] or [-1, 1], to ensure consistent input for models.
• Encoding Categorical Variables: Convert categorical data into numerical format using
techniques like one-hot encoding or label encoding.
• Feature Engineering: Create new features that capture relevant information from existing
data to enhance model performance.
4. Data Reduction:
• Dimensionality Reduction: Reduce the number of features using techniques like PCA to
simplify the dataset while retaining essential information.
• Feature Selection: Identify and retain the most relevant features, removing those that do
not contribute significantly to the model.
5. Data Splitting:
• Training, Validation, and Test Sets: Split the dataset into subsets to train, validate, and test
the model, ensuring unbiased evaluation of model performance.
6. Data Augmentation:
• Synthetic Data Generation: Create additional data samples to balance classes or increase
the diversity of the dataset, especially in image and text data.
7. Data Integration and Finalization:
• Combining Data Sources: Merge data from different sources into a unified dataset.
• Final Checks: Perform final validation and ensure the data is correctly formatted and ready
for modeling.
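Relating to step 4 above (dimensionality reduction), a minimal PCA sketch on scikit-learn's built-in Iris dataset; reducing to 2 components is an arbitrary illustrative choice.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
X_reduced = PCA(n_components=2).fit_transform(X)   # keep the 2 strongest components
print(X.shape, "->", X_reduced.shape)              # (150, 4) -> (150, 2)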
8. Naïve Bayes classification algorithm with an example
Ans: Refer MQP 2 Q 7
9. K means Clustering Algorithms
Ans: Refer MQP 1 Q 15
10. DBSCAN algorithm
Ans: Refer MQP 2 Q 14
11. Differentiate between supervised and unsupervised learning algorithms, listing a few
algorithms under each type.
Ans: Refer MQP 2 Q 9