1 Asdfadgaf

The document outlines a simulation exercise for data extraction and preprocessing techniques using MATLAB or Python. It includes objectives, procedures, and programming examples for handling missing values, normalization, encoding, outlier detection, and data smoothing. Additionally, it features pre-lab and post-lab questions to assess understanding and application of the techniques learned.

FLOWCHART:

Ex. No: 1
DATE:
SIMULATE THE DATA EXTRACTION FROM THE DATABASE AND VARIOUS DATA PRE-PROCESSING
TECHNIQUES FOR A GIVEN DATASET

OBJECTIVES:
To apply data extraction and preprocessing techniques to the given dataset.
AIM:
To simulate data extraction from a database and apply specific preprocessing techniques,
such as handling missing values, normalization, encoding, and outlier detection, using
MATLAB/PYTHON to enhance data quality.
SOFTWARE REQUIRED:
MATLAB R2022a / OpenCV / Google Colab
PROCEDURE FOR MATLAB:

1. Click on the MATLAB icon on the desktop.
2. Click on the 'FILE' menu on the menu bar.
3. Click on NEW M-File in the File menu.
4. Save the file in the working directory.
5. Click on DEBUG on the menu bar and click Run.
6. Open the Command Window or Figure window to view the output.

THEORY:
In machine learning, preprocessing techniques collectively improve the quality of the
dataset, making it more suitable for further analysis or machine learning applications. They are:
Data Extraction:
Simulated by reading a data file: In this step, we simulate extracting data from a database
by reading it from a file, such as a CSV file or the Excel workbook used in the program. This
involves using MATLAB functions like readtable or readmatrix (the older csvread is no longer
recommended) to load the data into the workspace. The extracted data can then be processed and
analyzed within MATLAB.
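Since the exercise also permits Python, the extraction step can be sketched with the standard library alone. This is a minimal illustration, not the prescribed program: the CSV text, the column names, and the helper names read_records and to_float_or_none are hypothetical, and an in-memory string stands in for a file on disk.

```python
import csv
import io

# Hypothetical CSV content standing in for an exported database table.
CSV_TEXT = """Name,Dept,Marks
Asha,ECE,78
Ravi,CSE,
Meena,ECE,91
"""

def read_records(text):
    """Parse CSV text into a list of dicts, one dict per row."""
    return list(csv.DictReader(io.StringIO(text)))

def to_float_or_none(cell):
    """Convert a cell to float; an empty cell becomes None (missing)."""
    cell = cell.strip()
    return float(cell) if cell else None

records = read_records(CSV_TEXT)
marks = [to_float_or_none(row["Marks"]) for row in records]
print(marks)  # [78.0, None, 91.0]
```

To read from an actual file, the same csv.DictReader call can be given a file object opened with open(path, newline="") in place of the io.StringIO wrapper.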
PROGRAM:

% Read the Excel file
filename = 'Student_Details.xlsx'; % Ensure the file is in the working directory
data = readtable(filename);

% Find and replace missing values with the column mean
numericCols = varfun(@isnumeric, data, 'OutputFormat', 'uniform'); % Identify numeric columns
for i = find(numericCols)
    colData = data{:, i};
    meanValue = mean(colData, 'omitnan'); % Compute mean ignoring NaN
    colData(isnan(colData)) = meanValue; % Replace NaN with mean
    data{:, i} = colData;
end

% Save the updated data to a new Excel file
newFilename = 'Student_Details_Filled.xlsx';
writetable(data, newFilename);
disp('Missing values filled and saved successfully.');
Handling Missing Values:
Replaces missing values with the mean of the column: Missing data can cause issues in
analysis. To address this, we replace missing values with the mean value of their respective
columns. This can be done with MATLAB's fillmissing function (for example, by supplying the
column mean as a constant fill value) or by direct assignment as in the program above. The
dataset then remains complete, and the mean of each column is unchanged.
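As a rough Python counterpart to mean imputation, the sketch below works on a single column held as a list, with None marking a missing entry. The function name fill_missing_with_mean and the sample marks are illustrative assumptions, not part of the original exercise.

```python
from statistics import fmean

def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to average; leave the column unchanged
    mean_value = fmean(observed)
    return [mean_value if v is None else v for v in values]

marks = [78.0, None, 91.0, 65.0]  # hypothetical column with one gap
filled = fill_missing_with_mean(marks)
print(filled)  # [78.0, 78.0, 91.0, 65.0]
```

The mean is computed from the observed values only, which mirrors the 'omitnan' option used in the MATLAB program.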
Normalization:
Scales the data to the range [0, 1]: Normalization adjusts the values in the dataset to a
common scale, typically between 0 and 1. This is crucial when the features have different units or
ranges. MATLAB's normalize function with the 'range' method (or the rescale function) can be used
to perform this scaling; called without a method, normalize computes z-scores instead. Scaling
helps to improve the performance of machine learning algorithms that are sensitive to feature
magnitude.
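Min-max scaling maps each value x to (x - min) / (max - min). A minimal Python sketch follows, assuming a numeric column with no missing entries; the name min_max_scale and the sample values are hypothetical.

```python
def min_max_scale(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every entry to 0.0 by convention.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([50.0, 60.0, 75.0, 100.0])
print(scaled)  # [0.0, 0.2, 0.5, 1.0]
```

The guard for a constant column is a design choice made here to avoid division by zero; other conventions are possible.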
Standardization:
Standardizes the data to have zero mean and unit variance: Standardization transforms the
data so that it has a mean of zero and a standard deviation of one. This is done using the zscore
function in MATLAB (or normalize with its default method). Standardization is particularly useful
when features have varying scales and for algorithms that are sensitive to feature scale. It does
not change the shape of the distribution, so skewed data remains skewed after standardization.
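Standardization computes z = (x - mean) / std for each value. The Python sketch below uses the sample standard deviation (n - 1 divisor) and assumes at least two values with no missing entries; the function name standardize and the sample data are illustrative.

```python
from statistics import fmean, stdev

def standardize(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    mu = fmean(values)
    sigma = stdev(values)  # sample standard deviation, n - 1 divisor
    if sigma == 0:
        # A constant column has no spread; return zeros to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

z_scores = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(round(fmean(z_scores), 6), round(stdev(z_scores), 6))
```

After the transform, the mean of z_scores is zero and its sample standard deviation is one, up to floating-point rounding.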
Encoding Categorical Variables:
Converts categorical variables to numerical values: Categorical variables must be
converted to numerical values for use in most machine learning models. This can be done by
converting categories to integers using MATLAB's categorical and double functions, ensuring that
the data is in a suitable format for analysis. Note that integer codes imply an ordering between
categories, which may not be meaningful for unordered labels.
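Integer encoding can be sketched in Python as follows. The codes start at 1 and follow the sorted order of the distinct labels, which is an assumption chosen to resemble the categorical-then-double approach described above; the function name encode_categories and the department labels are hypothetical.

```python
def encode_categories(labels):
    """Map each distinct label to an integer code (1-based, sorted order)."""
    categories = sorted(set(labels))
    code_of = {label: index + 1 for index, label in enumerate(categories)}
    codes = [code_of[label] for label in labels]
    return codes, categories

codes, categories = encode_categories(["ECE", "CSE", "MECH", "CSE"])
print(categories)  # ['CSE', 'ECE', 'MECH']
print(codes)       # [2, 1, 3, 1]
```

Returning the category list alongside the codes makes it possible to decode later with categories[code - 1].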
Outlier Detection and Removal:
Removes rows with outliers based on the z-score: Outliers can skew results and affect
model performance. We detect outliers using the z-score, which measures the number of standard
deviations a data point is from the mean. Data points with z-scores beyond a certain threshold
(e.g., 3) are considered outliers and can be removed to clean the dataset.
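A z-score filter over rows can be sketched in Python as below. The rows are dicts, the column name "Marks" and the sample values are hypothetical, and the threshold of 3 follows the text. One caveat worth knowing: with the sample standard deviation, no z-score in a column of n values can exceed (n - 1) / sqrt(n), so a threshold of 3 can only flag anything when the column has more than about ten values.

```python
from statistics import fmean, stdev

def remove_outlier_rows(rows, column, threshold=3.0):
    """Keep only rows whose value in `column` lies within `threshold` z-scores."""
    values = [row[column] for row in rows]
    mu = fmean(values)
    sigma = stdev(values)  # sample standard deviation, n - 1 divisor
    if sigma == 0:
        return list(rows)  # constant column: nothing can be an outlier
    return [row for row in rows if abs((row[column] - mu) / sigma) <= threshold]

# Hypothetical data: fifteen values near 50 and one extreme entry.
raw = [50, 52, 49, 51, 48, 50, 53, 47, 50, 51, 49, 52, 50, 48, 51, 500]
rows = [{"Marks": float(v)} for v in raw]
kept = remove_outlier_rows(rows, "Marks")
print(len(rows), len(kept))  # 16 15
```

Because the extreme value inflates both the mean and the standard deviation, a single pass may miss milder outliers; repeating the filter or using a median-based score are common alternatives.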
Data Smoothing:
Applies a moving average filter to smooth the data: Smoothing helps to reduce noise and
fluctuations in the data, making patterns more apparent. This can be achieved using a moving
average filter, implemented in MATLAB with the movmean function. By averaging data points
within a defined window, we produce a smoother dataset that is easier to analyze.
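A centered moving average can be sketched in plain Python as follows. The window is assumed to be odd, and near the ends of the series it shrinks to the available points rather than padding; the function name moving_average and the sample series are illustrative.

```python
from statistics import fmean

def moving_average(values, window=3):
    """Centered moving average; the window shrinks at the ends of the series."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - half)
        stop = min(len(values), i + half + 1)
        smoothed.append(fmean(values[start:stop]))
    return smoothed

smoothed = moving_average([1.0, 2.0, 3.0, 4.0, 5.0], window=3)
print(smoothed)  # [1.5, 2.0, 3.0, 4.0, 4.5]
```

A wider window gives a smoother result but blurs short-lived features, so the window length is usually chosen by inspecting the plot before and after smoothing.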
PRELAB QUESTIONS:

1. What is the purpose of data extraction in the context of this experiment?
2. Which MATLAB function is used to read data from a CSV file?
3. Why is it important to handle missing values in a dataset?
4. Explain how normalization differs from standardization.
5. Describe the purpose of applying a moving average filter to data.
6. What are the potential impacts of outliers on data analysis and model performance?
7. How can normalization improve the performance of machine learning algorithms?
8. What are some common techniques for handling missing data, and why is replacing with
the mean a valid approach?

POSTLAB QUESTIONS:

1. Which preprocessing technique had the most significant impact on the dataset, and why?
2. Were there any challenges encountered during data extraction or preprocessing? How were
they addressed?
3. Evaluate the effectiveness of the moving average filter in smoothing the data. Did it help
reveal underlying patterns?
4. Based on your results, what additional preprocessing steps would you recommend to
further enhance the dataset's quality?
RESULT:

CORE COMPETENCY:

MARKS ALLOCATION:

Details                                        Marks Allotted   Marks Awarded
                                                                BOOPATHI V       DHARANESH P
                                                                (73772213110)    (73772213116)
Preparation                                    20
Conducting                                     20
Calculation / Graphs                           15
Results                                        10
Basic understanding (Core competency learned)  15
Viva                                           10
Record                                         10
Total                                          100

Signature of faculty
FLOW CHART:
