ML 1

Data Encoding is a crucial pre-processing step in Machine Learning that converts categorical or textual data into numerical format for algorithmic processing. Common encoding methods include One-Hot Encoding, Dummy Encoding, Label Encoding, and Target Encoding, each serving different purposes in handling categorical variables. Data Preparation, which includes cleaning, transforming, and organizing data, is essential for improving model performance and ensuring accurate results.

What is Data Encoding?

Data Encoding is an important pre-processing step in Machine Learning. It refers to the process of converting categorical or textual data into numerical format so that it can be used as input for algorithms to process. The reason for encoding is that most machine learning algorithms work with numbers, not with text or categorical variables.

Q2] Explain Different Methods for Encoding Categorical Variables.

Categorical data encoding is essential for converting non-numeric data into a numerical format suitable for machine learning algorithms. Below are some common encoding techniques:

1. One-Hot Encoding
2. Dummy Encoding
3. Label Encoding
4. Ordinal Encoding
5. Binary Encoding
6. Count Encoding
7. Target Encoding
1. One-Hot Encoding:

• One of the most common encoding methods.


• It creates separate binary columns for each unique category in the variable.
• If a category is present, the respective column is marked as 1, while others remain 0.
• Example: If a variable has categories {A, B, C}, the encoded values for B would be [0,1,0].
2. Dummy Encoding:

• Similar to One-Hot Encoding but uses N-1 binary features for N categories to avoid
redundancy.
• Advantage: Reduces the number of features compared to One-Hot Encoding.
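
A minimal pandas sketch of both approaches, assuming a small hypothetical "color" column (the column and prefix names are illustrative):

```python
import pandas as pd

# Hypothetical data with one categorical column
df = pd.DataFrame({"color": ["A", "B", "C", "B"]})

# One-Hot Encoding: one binary column per category (A, B, C)
one_hot = pd.get_dummies(df["color"], prefix="color")

# Dummy Encoding: drop the first category, keeping N-1 columns
dummy = pd.get_dummies(df["color"], prefix="color", drop_first=True)

print(one_hot)   # columns: color_A, color_B, color_C
print(dummy)     # columns: color_B, color_C (A becomes the implicit baseline)
```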
3. Label Encoding:

• Assigns a unique integer to each category.


• Example: Categories {Red, Green, Blue} could be encoded as {0,1,2}.
• Drawback: The model may misinterpret these numbers as having an ordinal relationship
when they don’t.
4. Ordinal Encoding:

• Used when categories have a natural order.


• Example: {Low, Medium, High} can be encoded as {1, 2, 3}.
• Ensures the algorithm understands the inherent ranking.
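
A short scikit-learn sketch of Label and Ordinal Encoding, using the toy category lists from the examples above (note that scikit-learn's OrdinalEncoder numbers categories starting from 0 rather than 1):

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Label Encoding: arbitrary integer per category (no intended order)
colors = ["Red", "Green", "Blue", "Green"]
label_enc = LabelEncoder()
print(label_enc.fit_transform(colors))       # [2 1 0 1] (categories sorted alphabetically)

# Ordinal Encoding: integers that respect an explicit ranking
sizes = [["Low"], ["High"], ["Medium"]]
ordinal_enc = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
print(ordinal_enc.fit_transform(sizes))      # [[0.], [2.], [1.]]
```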
5. Binary Encoding:

• Converts categories into binary format and stores them in multiple columns.
• Example: If there are 4 categories, the numbers (0, 1, 2, 3) are converted into binary (00,
01, 10, 11).
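
A rough pandas sketch of the idea, assuming a hypothetical column with four categories; in practice the third-party category_encoders library provides a ready-made BinaryEncoder:

```python
import pandas as pd

# Hypothetical column with 4 categories -> integer codes 0..3 -> 2 binary digits
s = pd.Series(["A", "B", "C", "D", "B"])
codes = s.astype("category").cat.codes.to_numpy()   # A=0, B=1, C=2, D=3

n_bits = 2                                           # ceil(log2(4)) binary columns
binary_cols = pd.DataFrame(
    {f"bit_{i}": (codes >> i) & 1 for i in reversed(range(n_bits))}
)
print(binary_cols)   # A -> 0 0, B -> 0 1, C -> 1 0, D -> 1 1
```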
6. Count Encoding

• Also known as Frequency Encoding, this method replaces each category with the count of its occurrences in the dataset.

• Example:
If a "City" column contains {Mumbai, Delhi, Mumbai, Pune, Delhi, Mumbai}, the encoded values will be:

  City     Count Encoding
  Mumbai   3
  Delhi    2
  Pune     1

• Advantages:
✅ Reduces dimensionality compared to One-Hot Encoding.
✅ Retains useful information about category frequency.

• Disadvantages:
❌ May not work well if counts are too skewed.
❌ Doesn't capture category relationships with the target variable.
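
A minimal pandas sketch of Count/Frequency Encoding for the "City" example above:

```python
import pandas as pd

df = pd.DataFrame({"City": ["Mumbai", "Delhi", "Mumbai", "Pune", "Delhi", "Mumbai"]})

# Replace each city with how often it appears in the column
counts = df["City"].value_counts()            # Mumbai=3, Delhi=2, Pune=1
df["City_count"] = df["City"].map(counts)
print(df)
```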

7. Target Encoding

• This method replaces categorical values with the mean of the target variable (used mostly in supervised learning).

• Example: Suppose we have a "City" column predicting house prices, and the average prices for each city are:

  City     Target Encoding (Avg. Price)
  Mumbai   80 Lakhs
  Delhi    70 Lakhs
  Pune     60 Lakhs

• Advantages:
✅ Captures category-target relationship effectively.
✅ Reduces dimensionality compared to One-Hot Encoding.
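
A minimal sketch of Target Encoding using made-up prices for the "City" example; in practice the per-category means should be computed on the training split only, to avoid target leakage:

```python
import pandas as pd

df = pd.DataFrame({
    "City":  ["Mumbai", "Delhi", "Mumbai", "Pune", "Delhi", "Mumbai"],
    # Hypothetical house prices (in lakhs) used as the target variable
    "Price": [85, 72, 78, 60, 68, 77],
})

# Replace each city with the mean target value observed for that city
city_means = df.groupby("City")["Price"].mean()
df["City_target_enc"] = df["City"].map(city_means)
print(df)
```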
• Banking & Finance:
  ◦ Helps in credit scoring and loan approvals.
  ◦ Enables mobile check deposits using image processing.
• Commute Estimation:
  ◦ Google Maps uses ML for real-time traffic estimation.
  ◦ Ridesharing apps like Uber use ML for fare estimation and route planning.
  ◦ GPS navigation predicts travel time based on past traffic data.

3. Explain Some Key Applications of Machine Learning.

Answer:
Machine Learning is widely used in different domains, including:

• Face Detection & Speech Recognition: Used in security systems and virtual assistants
(e.g., Face ID, Siri, Google Assistant).
• Stock Prediction: Helps in forecasting stock prices based on historical trends.
• Spam Email Detection: ML filters spam messages from important emails using
classification algorithms.

• Machine Translation: Google Translate uses ML to translate languages in real time.
• Recommender Systems: Netflix, Amazon, and YouTube suggest movies, products, and
videos based on past behavior.
• Self-Parking Cars: Autonomous vehicles like Tesla use ML for automatic parking and
driving assistance.
________________________________________________________________________________
1. What is Data Preparation in Machine Learning?

Answer:
Data Preparation is the process of cleaning, transforming, and organizing raw data into a suitable
format for Machine Learning models. It ensures that the data is accurate, complete, and ready for
analysis.

Definition from PDF:


"Data preprocessing is a process of preparing the raw data and making it suitable for a machine
learning model. It is the first and crucial step in creating a machine-learning model."

Key Aspects of Data Preparation:

• Handling missing values and noisy data.


• Converting categorical data into numerical format (encoding).
• Splitting data into training and testing sets.
• Applying feature selection and scaling techniques.
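
A minimal end-to-end sketch of these aspects, assuming a small hypothetical DataFrame with a numeric "income" column, a categorical "city" column, and a "label" target:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical raw data: a numeric column with a missing value and a categorical column
df = pd.DataFrame({
    "income": [50000, None, 62000, 48000],
    "city":   ["Pune", "Delhi", "Pune", "Mumbai"],
    "label":  [0, 1, 0, 1],
})

df["income"] = df["income"].fillna(df["income"].median())   # handle missing values
df = pd.get_dummies(df, columns=["city"])                   # encode categorical data

X = df.drop(columns=["label"])
y = df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
```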

2. Why is Data Preparation Important in Machine Learning?

Answer:
Data preparation is essential because real-world data often contains inconsistencies, errors, and
missing values that can negatively affect model performance.

From PDF:
• Real-world data contains noise and missing values that cannot be directly used in ML
models.
• Increases accuracy and efficiency of ML models by providing structured data.
• Reduces errors and improves model generalization.
• Helps in removing duplicate content to avoid redundant calculations.
• Improves decision-making capabilities by refining input features.
• Saves time and cost by automating data cleaning and transformation.

3. What are the Steps in the Data Preparation Process?

Answer:
Data Preparation involves multiple steps to ensure high-quality data for Machine Learning models.

Steps from PDF:

1. Understand the Problem:

◦ Define the objective and analyze the type of data required.


◦ Identify whether it is a classification, regression, or clustering problem.
2. Data Collection:

◦ Gather data from various sources such as databases, APIs, or third-party vendors.
◦ Ensure data is diverse and not biased.
3. Data Exploration and Profiling:

◦ Identify trends, outliers, and missing values.


◦ Perform statistical analysis to understand feature distributions.
4. Data Cleaning and Validation:

◦ Handle missing values (imputation, deletion).


◦ Detect and remove outliers using visualization techniques.
5. Data Formatting and Transformation:

◦ Convert data into a structured format (e.g., converting date formats).


◦ Normalize or standardize numerical features.
6. Feature Engineering and Selection:

◦ Create new features from existing data.


◦ Apply dimensionality reduction techniques like PCA to reduce redundant features.

Imputation:
• Feature imputation is the technique used to fill incomplete fields in a dataset.
• It is essential because most machine learning models do not work when there is missing data in the dataset.
• The missing-values problem can be reduced by using techniques such as single-value imputation, multiple-value imputation, K-Nearest Neighbors, or deleting the row.
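
A short scikit-learn sketch of single-value (mean) and K-Nearest Neighbors imputation on a toy array with missing entries:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])

# Single-value imputation: fill each column's missing entries with the column mean
mean_imputer = SimpleImputer(strategy="mean")
print(mean_imputer.fit_transform(X))

# K-Nearest Neighbors imputation: fill using the values of the most similar rows
knn_imputer = KNNImputer(n_neighbors=1)
print(knn_imputer.fit_transform(X))
```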
Encoding:
• Feature encoding is the method of converting string values into numeric form.
• This is important because ML models require all values to be in numeric format.
• Feature encoding includes Label Encoding and One-Hot Encoding (available in pandas as get_dummies).
Conclusion:
Proper data preparation ensures the success of ML models by improving their performance,
reliability, and efficiency.

________________________________________________________________________________

What are the Challenges in the Data Preparation Process?


Answer:
The data preparation process can be complicated by several issues that affect the quality and
usability of data. These challenges must be addressed to ensure accurate and efficient Machine
Learning models.

Challenges in Data Preparation (From PDF):


1. Missing or Incomplete Records:

◦ Some data points may be empty or have NULL values, making analysis difficult.
◦ Example: A customer survey dataset where age or income values are missing.
◦ Solution: Use imputation techniques (mean, median, mode) or remove incomplete
records.
2. Outliers or Anomalies:

◦ Unexpected values that skew model predictions.


◦ Example: A house price dataset where one house is priced at $100 million, affecting
averages.
◦ Solution: Detect and handle outliers using visualization (box plots), the Z-score method, or the IQR method; see the sketch after this list.
3. Unstructured Data Format:

◦ Data from different sources may be in inconsistent formats (e.g., dates in "DD/MM/
YYYY" vs. "MM-DD-YYYY").
◦ Example: A retail dataset where currency values are in different units (USD,
EUR, INR).
◦ Solution: Apply data transformation techniques to standardize formats.
4. Limited or Sparse Features/Attributes:

◦ Datasets may have too few features, making ML models less effective.
◦ Example: A fraud detection system may lack customer location or transaction
history, affecting accuracy.
◦ Solution: Collect more relevant data or use feature engineering to generate new
attributes.
5. Understanding Feature Engineering:

◦ Selecting the right features is crucial for improving model performance.


◦ Example: In a healthcare dataset, choosing the correct biometric indicators (blood
pressure, cholesterol levels) is essential.
◦ Solution: Use techniques like Principal Component Analysis (PCA) and
Recursive Feature Elimination (RFE) for feature selection.
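
Illustrating challenge 2 above, a brief sketch of outlier detection with the IQR and Z-score methods on a made-up price array:

```python
import numpy as np

prices = np.array([38, 40, 41, 39, 42, 45, 43, 40, 41, 39, 500])   # one extreme value

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(prices, [25, 75])
iqr = q3 - q1
iqr_outliers = (prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)

# Z-score method: flag points more than 3 standard deviations from the mean
z_scores = (prices - prices.mean()) / prices.std()
z_outliers = np.abs(z_scores) > 3

print(prices[iqr_outliers], prices[z_outliers])   # both flag the extreme value 500
```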

Q] 1. What is Data Pre-Processing in Machine Learning? Why is it Important?


Answer:
Data Pre-Processing is the process of cleaning and transforming raw data into a suitable format for
machine learning models. It is a crucial step before applying ML algorithms.

From PDF:

• "Data preprocessing is a data mining technique that involves transforming raw data into an
understandable format."
• It ensures that data is clean, consistent, and structured for better model accuracy.
• Helps in removing missing values, duplicate records, and noise from data.
Importance of Data Pre-Processing:
✅ Improves model performance and accuracy.
✅ Ensures that data is error-free and complete.
✅ Reduces overfitting by handling noisy data.
✅ Helps in scaling features for better comparisons.

2. What is Data Cleaning in Machine Learning? Explain its Steps.

Answer:
Data Cleaning is the process of identifying and handling incorrect, incomplete, or irrelevant data.

Steps in Data Cleaning (From PDF):

1. Handling Missing Data

◦ Filling missing values using mean, median, or mode.


◦ Removing rows/columns with too many missing values.
2. Removing Noise

◦ Identifying and correcting errors in data.


◦ Filtering out irrelevant information.
3. Handling Outliers

◦ Detecting extreme values using visualization methods (e.g., box plots).


◦ Using techniques like Winsorization to cap extreme values.
4. Removing Duplicate Records

◦ Identifying and deleting repeated data entries to avoid bias.


5. Fixing Data Inconsistencies

◦ Standardizing data formats (e.g., date formats, text standardization).


Conclusion:
Data Cleaning improves data quality and ensures that models are trained on reliable and error-
free data.
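
A brief sketch of the Winsorization idea mentioned in step 3, capping extreme values at chosen percentiles (the 5th/95th limits are arbitrary; scipy.stats.mstats.winsorize offers a ready-made version):

```python
import numpy as np

values = np.array([38, 40, 41, 39, 42, 45, 43, 40, 41, 39, 500])

# Winsorization: cap values at chosen percentiles instead of dropping them
low, high = np.percentile(values, [5, 95])
winsorized = np.clip(values, low, high)
print(winsorized)   # the extreme 500 is pulled down to the 95th percentile
```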
3. What is Data Integration? Explain its Role in Machine Learning.

Answer:
Data Integration is the process of combining data from multiple sources into a single, unified
dataset.

Role in Machine Learning (From PDF):

• ML models require large, diverse datasets for better training.


• Integration helps in merging data from databases, APIs, and external sources.
• Ensures that redundant and inconsistent data is handled properly.
Techniques of Data Integration:

1. Schema Integration – Merging data based on common fields (e.g., joining tables).
2. Data Cleaning – Ensuring consistent data formats from multiple sources.
3. Entity Resolution – Identifying and merging duplicate records.
4. Data Transformation – Converting data into a common format.
Example:
A hospital combines patient records from multiple branches into a single dataset for ML-based
diagnosis.
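
A minimal pandas sketch of the hospital example, stacking branch records and resolving duplicate patients (column names are hypothetical):

```python
import pandas as pd

# Hypothetical records from two hospital branches sharing the same schema
branch_a = pd.DataFrame({"patient_id": [1, 2], "age": [34, 51]})
branch_b = pd.DataFrame({"patient_id": [2, 3], "age": [51, 47]})

# Schema integration: stack records that share the same columns
combined = pd.concat([branch_a, branch_b], ignore_index=True)

# Entity resolution (simplified): drop duplicate patient records
combined = combined.drop_duplicates(subset="patient_id")
print(combined)
```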

4. What is Data Transformation? Explain Different Data Transformation Techniques.

Answer:
Data Transformation converts raw data into a usable format for ML models.

From PDF:

• It helps in making data consistent, structured, and readable for ML algorithms.


Common Data Transformation Techniques:

1. Normalization (Min-Max Scaling):

◦ Rescales numerical features to a fixed range, usually [0, 1], using x' = (x − min) / (max − min).

2. Encoding Categorical Variables:

◦ Converts non-numeric data into numerical form.
◦ Methods: One-Hot Encoding, Label Encoding, Ordinal Encoding.

3. Feature Engineering:

◦ Creating new features from existing data (e.g., extracting the year from a date column).
Example:
Converting text-based customer reviews into numerical sentiment scores for an ML model.
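
A short sketch of Min-Max Scaling with scikit-learn on a toy "age" column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

ages = np.array([[18], [25], [40], [60]])

# Min-Max Scaling: x' = (x - min) / (max - min), mapping values into [0, 1]
scaler = MinMaxScaler()
print(scaler.fit_transform(ages))   # approximately [[0.], [0.167], [0.524], [1.]]
```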

5. What is Data Reduction? Explain Different Data Reduction Techniques.

Answer:
Data Reduction is the process of reducing the size of datasets while maintaining key information.
From PDF:

• Reducing data size improves computational efficiency and speeds up model training.
Techniques of Data Reduction:

1. Dimensionality Reduction (PCA, t-SNE):

◦ Principal Component Analysis (PCA) removes correlated features while preserving variance.
◦ t-SNE (t-Distributed Stochastic Neighbor Embedding) is used for visualizing high-dimensional data.

2. Data Compression:

◦ Uses algorithms to reduce storage size (e.g., wavelet transformation in images).

3. Feature Selection:

◦ Removing irrelevant or redundant features using Recursive Feature Elimination (RFE).

4. Sampling Techniques:

◦ Selecting a subset of data while maintaining statistical significance (e.g., stratified sampling).
Example:
Reducing 1000 features to 50 using PCA for an image recognition model.
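
A minimal PCA sketch with scikit-learn, using synthetic correlated data so the reduction is visible (the 95% variance threshold is an arbitrary choice):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 3))          # 3 underlying factors
X = latent @ rng.normal(size=(3, 10))       # 10 correlated features derived from them
X += 0.01 * rng.normal(size=X.shape)        # small noise

# Keep enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)       # far fewer columns than the original 10
```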

6. What is Data Discretization? Explain Its Techniques.

Answer:
Data Discretization converts continuous numerical data into categorical bins or intervals.

From PDF:

• It helps in simplifying data, improving interpretability, and enhancing model performance.

Techniques of Data Discretization:

1. Equal-Width Binning:

◦ Divides data into equal-sized intervals.


◦ Example: Age groups (0-18, 19-35, 36-60, etc.).
2. Equal-Frequency Binning:

◦ Each bin has the same number of data points.


◦ Example: Grouping exam scores into quartiles.
3. Clustering-Based Discretization (K-Means Clustering):

◦ Groups data into clusters and assigns discrete labels.


4. Decision Tree-Based Discretization:

◦ Splits data using decision tree rules.


Example:
Converting customer incomes into categories like Low, Medium, and High.
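
A short pandas sketch of equal-width and equal-frequency binning on made-up income values:

```python
import pandas as pd

incomes = pd.Series([12000, 25000, 40000, 80000, 150000, 300000])

# Equal-width binning: intervals of equal size
equal_width = pd.cut(incomes, bins=3, labels=["Low", "Medium", "High"])

# Equal-frequency binning: each bin gets the same number of data points
equal_freq = pd.qcut(incomes, q=3, labels=["Low", "Medium", "High"])

print(equal_width.tolist())
print(equal_freq.tolist())
```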

______________________________________________________________________________
———————————————————————————————————————-

2. What is a Feature in Machine Learning?

Answer:

• Generally, all machine learning algorithms take input data to generate the output.
• The input data remains in a tabular form, consisting of rows (instances or observations)
and columns (variables or attributes).
• These attributes are often known as features, which help in training the model for
predictions.

3. Explain the different processes involved in Feature Engineering.

Answer:

1. Feature Creation:
◦ Feature creation is the process of finding the most useful variables to be used in a
predictive model.
2. Feature Transformation:

◦ This step involves adjusting the predictor variables to improve the accuracy and
performance of the model.
3. Feature Extraction:

◦ Feature extraction is an automated feature engineering process that generates new variables by extracting them from the raw data.
◦ The main aim of this step is to reduce the volume of data so that it can be easily used and managed for data modeling.
◦ Feature extraction methods include cluster analysis, text analytics, edge detection algorithms, and Principal Component Analysis (PCA).
4. Feature Selection:

◦ Feature selection is the process of selecting the most relevant features from the
original feature set by removing redundant, irrelevant, or noisy features.

4. What are the steps involved in Feature Engineering?

Answer:

1. Data Preparation:
◦ In this step, raw data acquired from different sources is prepared and formatted for
use in ML models.
◦ The data preparation process may include data cleaning, data augmentation, data
fusion, ingestion, or loading.
2. Exploratory Analysis:
◦ This step involves analyzing datasets and summarizing the main characteristics of
data.
◦ Different data visualization techniques are used to better understand the
manipulation of data sources, identify trends, and find the best statistical methods for
analysis.
3. Benchmarking:

◦ Benchmarking is the process of setting a standard baseline for accuracy, against which all the variables are compared.
◦ The benchmarking process helps improve the predictability of the model and reduce the error rate.

5. What are different Feature Engineering Techniques?

Answer:

1. Imputation:
◦ Imputation is responsible for handling irregularities within the dataset.
◦ For numerical data, missing values can be filled using mean or median.
◦ For categorical data, missing values can be replaced with the most frequently
occurring value in the column.
2. Handling Outliers:

◦ This technique identifies outliers and removes them if necessary.


◦ Standard deviation can be used to detect outliers.
◦ Z-score can also be used to identify extreme values in the dataset.
3. Log Transformation:

◦ Log transformation helps in handling skewed data and makes the distribution closer to normal (see the sketch after this list).
4. Binning:

◦ Binning can be used to normalize noisy data.


◦ This process involves segmenting features into bins to reduce the effect of minor
observation differences.
5. Feature Split:

◦ Feature splitting is the process of dividing features into two or more parts to create
new useful features.
6. One-Hot Encoding:

◦ One-Hot Encoding is a technique that converts categorical data into numerical form
so that it can be easily processed by machine learning algorithms.
◦ This method ensures that categorical variables do not introduce any misleading
numerical relationships.
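
Illustrating the log-transformation technique above (item 3), a minimal NumPy sketch on made-up skewed income values:

```python
import numpy as np

# Skewed values (e.g., incomes); log1p compresses the long right tail
incomes = np.array([20_000, 35_000, 50_000, 120_000, 1_000_000])
log_incomes = np.log1p(incomes)
print(log_incomes.round(2))   # values now lie in a much narrower range
```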
________________________________________________________________________________
————————————————————————————————————————
Explain the Different Types of Learning in Machine Learning.
Answer:
Machine Learning techniques can be broadly categorized into several types, each with its own
methodology and applications:

1. Supervised Learning:
◦ Definition: The algorithm is trained on a labeled dataset, meaning each input is
paired with a correct output.
◦ Processes:
▪ Classification: Predicting discrete labels (e.g., spam vs. non-spam emails).
▪ Regression: Predicting continuous values (e.g., forecasting stock prices or
used car prices).
◦ Example (from PDF): For a financial institution, a classifier may use customer
savings and income to determine a risk category (high risk vs. low risk).
2. Unsupervised Learning:

◦ Definition: The algorithm works with unlabeled data and tries to find hidden patterns
or intrinsic structures.
◦ Processes:
▪ Clustering: Grouping similar data points together, such as segmenting
customers based on purchasing behavior.
▪ Association: Discovering rules that describe large portions of your data, like
market basket analysis.
◦ Example (from PDF): Identifying clusters of similar documents or grouping
customers without prior labels.
3. Semi-supervised Learning:

◦ Definition: Combines a small amount of labeled data with a large amount of unlabeled data during training.
◦ Purpose: This approach is beneficial when obtaining labeled data is expensive or time-consuming.
◦ Example (from PDF): Using a few labeled images along with many unlabeled images to improve classification accuracy.

4. Reinforcement Learning:

◦ Definition: An agent learns to make decisions by interacting with an environment and receiving rewards (or penalties) for its actions.
◦ Process: The algorithm explores and exploits actions to maximize cumulative rewards over time.
◦ Example (from PDF): Training a self-driving car, where the system learns to navigate by receiving feedback from successful or failed maneuvers.

________________________________________________________________________________
————————————————————————————————————————

What are the Key Elements of Machine Learning?

Answer:
Machine Learning consists of three key elements that define how an algorithm learns, evaluates, and
optimizes its performance.
1. Representation (How to Represent Knowledge)

• This refers to how knowledge is structured within the model.


• The type of model used in ML determines how data patterns are learned.
• Common ML Representations:
◦ Decision Trees – Used for classification problems.
◦ Rule-based Systems – If-Else rules for decision-making.
◦ Neural Networks – Used in deep learning for complex pattern recognition.
◦ Graphical Models – Bayesian Networks for probabilistic relationships.
◦ Support Vector Machines (SVMs) – Used for classification and regression.

2. Evaluation (How to Evaluate the Model’s Performance)

• Evaluating a model is necessary to measure its accuracy and effectiveness.


• Performance Metrics Used in Evaluation:
◦ Accuracy – Measures correct predictions out of total predictions.
◦ Precision & Recall – Used in classification tasks to check how well a model
distinguishes between different categories.
◦ Mean Squared Error (MSE) – Measures error in regression tasks.
◦ Log-likelihood & Entropy – Used in probabilistic models to measure uncertainty.

3. Optimization (How to Improve Learning)

• Optimization is the process of improving model performance by tuning parameters.


• Involves techniques to minimize errors and improve predictions.
• Common Optimization Techniques:
◦ Gradient Descent – Adjusts model weights to reduce error.
◦ Hyperparameter Tuning – Selecting the best model parameters.
◦ Regularization (L1/L2) – Prevents overfitting by penalizing large weights.
◦ Ensemble Methods (Bagging, Boosting) – Combines multiple models for better
accuracy.
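
A minimal sketch of gradient descent from the optimization list above, fitting a simple linear regression (y = 2x + 1) from toy data by repeatedly adjusting the weights in the direction that reduces the mean squared error:

```python
import numpy as np

# Toy data generated by the true relationship y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

w, b, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    y_pred = w * x + b
    error = y_pred - y
    w -= lr * 2 * np.mean(error * x)   # gradient of MSE with respect to w
    b -= lr * 2 * np.mean(error)       # gradient of MSE with respect to b

print(round(w, 2), round(b, 2))        # converges to approximately 2.0 and 1.0
```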

Aspects of developing a learning system:

• Training data, concept representation, function approximation.
• For training and testing our model, we need to split the dataset into three distinct sets: a training set, a validation set, and a testing set.
• Training set:
  ◦ A set of data used to train the model.
  ◦ It is used to fit the model; the model sees and learns from this data.
  ◦ Later, the trained model can be deployed and used to accurately predict on new data that it has not seen before.
  ◦ Labeled data is used.

1. Training Set

• The training set is the portion of the dataset used to train the machine learning model.
• The model learns patterns and relationships between input features and output labels from
this data.
• Example: In a house price prediction model, the training set consists of historical data on
house prices, square footage, and location.

2. Validation Set

• The validation set is a separate portion of the dataset used to tune hyperparameters and
avoid overfitting.
• It helps in model selection by providing feedback on how well the model is performing.
• Example: In deep learning, validation data is used to determine the optimal number of
layers in a neural network.

3. Test Set

• The test set is used to evaluate the final performance of the trained model.
• It consists of unseen data that the model has never encountered before.
• Helps in assessing the generalization ability of the model.
• Example: If a model is trained to classify emails as spam or not spam, the test set contains
new emails to check the model’s accuracy.

4. Data Splitting in Machine Learning

• To ensure a fair evaluation of the model, the dataset is split into different sets:

  Dataset          Purpose                               Typical Split Ratio
  Training Set     Used to train the model               60-80%
  Validation Set   Used to fine-tune the model           10-20%
  Test Set         Used to evaluate model performance    10-20%

• Best Practice: The dataset should be randomly shuffled before splitting to ensure an
unbiased representation of data.
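
A short scikit-learn sketch of a three-way split obtained by applying train_test_split twice (the exact ratios below are illustrative):

```python
from sklearn.model_selection import train_test_split

X, y = list(range(100)), [i % 2 for i in range(100)]

# First carve out 20% as the test set, then 20% of the remainder as validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)
print(len(X_train), len(X_val), len(X_test))   # 64 / 16 / 20
```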
