Ex. No: 1
DATE:
SIMULATE THE DATA EXTRACTION FROM THE DATABASE AND VARIOUS DATA
PRE-PROCESSING TECHNIQUES FOR A GIVEN DATASET
OBJECTIVES:
To perform data extraction and preprocessing techniques on the given dataset.
AIM:
To simulate data extraction from a database and apply specific preprocessing techniques,
such as handling missing values, normalization, encoding, and outlier detection, using
MATLAB/PYTHON to enhance data quality.
SOFTWARE REQUIRED:
MATLAB R2022a / OpenCV / Google Colab
PROCEDURE FOR MATLAB:
1. Click on the MATLAB icon on the desktop.
2. Click on the ‘FILE’ menu on the menu bar.
3. Click on NEW M-File from the FILE menu.
4. Save the file in the working directory.
5. Click on DEBUG on the menu bar and click Run.
6. Open the Command Window / Figure window to view the output.
THEORY:
In machine learning, preprocessing techniques collectively improve the quality of the
dataset, making it more suitable for further analysis or machine learning applications. The
techniques used in this experiment are:
Data Extraction:
Simulated by reading a CSV file: In this step, we simulate extracting data from a database
by reading it from a CSV file. This involves using MATLAB functions such as readtable (or
readmatrix for purely numeric files; the older csvread is no longer recommended) to load the
data into the workspace. The extracted data can then be processed and analyzed within MATLAB.
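For the Python route mentioned in the aim, the extraction step can be sketched as below. This is a minimal illustration using only the standard library; the CSV text, the column names (Name, Dept, Marks) and the helper load_rows are hypothetical and are not taken from the actual dataset.

```python
import csv
import io

# Hypothetical CSV content standing in for a database export.
# Column names and values are made up for illustration only.
CSV_TEXT = """Name,Dept,Marks
Asha,ECE,78
Ravi,CSE,
Meena,ECE,91
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one dict per record."""
    reader = csv.DictReader(io.StringIO(text))
    return list(reader)


rows = load_rows(CSV_TEXT)
print(len(rows), "rows; columns:", list(rows[0].keys()))
```

Every field is read as text, and an empty field (Ravi's marks here) comes back as an empty string, so numeric conversion and missing-value handling remain separate later steps.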
PROGRAM:
% Read the Excel file
filename = 'Student_Details.xlsx'; % Ensure the file is in the working directory
data = readtable(filename);
% Find and replace missing values with column mean
% Identify numeric columns
numericCols = varfun(@isnumeric, data, 'OutputFormat', 'uniform');
for i = find(numericCols)
    colData = data{:, i};
    meanValue = mean(colData, 'omitnan'); % Compute mean ignoring NaN
    colData(isnan(colData)) = meanValue;  % Replace NaN with mean
    data{:, i} = colData;
end
% Save the updated data back to a new Excel file
newFilename = 'Student_Details_Filled.xlsx';
writetable(data, newFilename);
disp('Missing values filled and saved successfully.');
Handling Missing Values:
Replaces missing values with the mean of the column: Missing data can cause issues in
analysis. To address this, we replace missing values with the mean value of their respective
columns. This can be done using MATLAB's fillmissing function, or manually with mean and the
'omitnan' option as in the program above. Imputation keeps the dataset complete, so no rows
need to be discarded, although mean imputation reduces the variance of the affected column.
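The mean-imputation step described above can be sketched in Python as follows. This is an illustrative sketch only: the helper fill_missing_with_mean and the sample marks are hypothetical, and missing entries are assumed to appear as None or NaN.

```python
import math
from statistics import fmean


def fill_missing_with_mean(values):
    """Replace None/NaN entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None and not math.isnan(v)]
    if not observed:
        return list(values)  # nothing to impute from; return unchanged copy
    mean_value = fmean(observed)
    return [mean_value if v is None or math.isnan(v) else v for v in values]


# Hypothetical marks column with two missing entries.
marks = [78.0, None, 91.0, float("nan"), 65.0]
filled = fill_missing_with_mean(marks)
print(filled)
```

The mean is computed from the observed values only, which mirrors the 'omitnan' option used in the MATLAB program.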
Normalization:
Scales the data to the range [0, 1]: Normalization adjusts the values in the dataset to a
common scale, typically between 0 and 1, by computing (x - min) / (max - min) for each
feature. This is crucial when the features have different units or ranges. MATLAB's normalize
function with the 'range' method can be used to perform this scaling (its default method is
the z-score), helping to improve the performance of machine learning algorithms that are
sensitive to feature scale.
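A minimal Python sketch of scaling to [0, 1] is given below. The helper name min_max_scale and the sample values are hypothetical; mapping a constant column to all zeros is a choice made here for illustration, not something the experiment prescribes.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: the range is zero, so map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_scale([50, 60, 75, 100])
print(scaled)
```

Because the minimum and maximum set the scale, a single extreme value compresses all other values toward 0, which is one reason to inspect outliers before normalizing.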
Standardization:
Standardizes the data to have zero mean and unit variance: Standardization transforms the
data so that it has a mean of zero and a standard deviation of one. This is done using the zscore
function in MATLAB. Standardization is particularly useful when the features have varying scales
and benefits algorithms that are sensitive to feature scale. It shifts and rescales the data but
does not change the shape of its distribution.
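Standardization can be sketched in Python as below. The helper standardize and the sample values are hypothetical, and the sample standard deviation (n - 1 denominator) is an assumption made for this sketch.

```python
from statistics import fmean, stdev


def standardize(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    mu = fmean(values)
    sigma = stdev(values)  # sample standard deviation (divides by n - 1)
    if sigma == 0:
        # Constant column: no spread, so every z-score is taken as 0.0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


z = standardize([2.0, 4.0, 4.0, 6.0])
print(z)
```

After this transformation the values have mean 0 and standard deviation 1, but their relative spacing is unchanged.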
Encoding Categorical Variables:
Converts categorical variables to numerical values: Categorical variables must be
converted to numerical values for use in machine learning models. This can be done by converting
categories to integers using MATLAB's categorical and double functions, ensuring that the data is
in a suitable format for analysis.
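Integer encoding of categories can be sketched in Python as follows. The helper label_encode and the department labels are hypothetical; assigning codes from 1 in sorted category order is an assumption made for this sketch.

```python
def label_encode(labels):
    """Map each distinct label to an integer code, starting from 1."""
    categories = sorted(set(labels))  # fixed, reproducible category order
    code_of = {cat: i + 1 for i, cat in enumerate(categories)}
    return [code_of[x] for x in labels], categories


codes, categories = label_encode(["ECE", "CSE", "ECE", "MECH"])
print(categories)
print(codes)
```

Integer codes imply an ordering between categories, so they suit ordinal data or models that do not interpret the magnitude of the code; otherwise one-hot encoding is a common alternative.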
Outlier Detection and Removal:
Removes rows with outliers based on the z-score: Outliers can skew results and affect
model performance. We detect outliers using the z-score, which measures the number of standard
deviations a data point is from the mean. Data points whose absolute z-score exceeds a chosen
threshold (e.g., 3) are considered outliers and can be removed to clean the dataset.
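The z-score rule described above can be sketched in Python for a single column as below. The helper remove_outliers and the sample readings are hypothetical; the threshold of 3 follows the example in the text, and the sample standard deviation is an assumption.

```python
from statistics import fmean, stdev


def remove_outliers(values, threshold=3.0):
    """Keep only the values whose absolute z-score is within the threshold."""
    mu = fmean(values)
    sigma = stdev(values)  # sample standard deviation (divides by n - 1)
    if sigma == 0:
        return list(values)  # no spread, so nothing can be an outlier
    return [v for v in values if abs((v - mu) / sigma) <= threshold]


# Hypothetical readings: nineteen values near 10 and one extreme value.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11,
            10, 9, 12, 10, 11, 10, 9, 11, 10, 100]
cleaned = remove_outliers(readings)
print(len(readings), "->", len(cleaned))
```

With very small samples no point can reach a z-score of 3, because the outlier itself inflates the standard deviation, so this rule needs a reasonably large column to be useful.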
Data Smoothing:
Applies a moving average filter to smooth the data: Smoothing helps to reduce noise and
fluctuations in the data, making patterns more apparent. This can be achieved using a moving
average filter, implemented in MATLAB with the movmean function. By averaging data points
within a defined window, we produce a smoother dataset that is easier to analyze.
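A centered moving average can be sketched in Python as below. The helper moving_average and the sample signal are hypothetical; shrinking the window at the two ends of the series is an assumption made here so that the output has the same length as the input.

```python
def moving_average(values, window=3):
    """Centered moving average; the window shrinks near the ends."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)                 # clip window at the start
        hi = min(len(values), i + half + 1)   # clip window at the end
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


smooth = moving_average([1, 2, 9, 4, 5], window=3)
print(smooth)
```

A wider window gives a smoother curve but also flattens genuine short-lived peaks, so the window size is a trade-off between noise reduction and detail.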
PRELAB QUESTIONS:
1. What is the purpose of data extraction in the context of this experiment?
2. Which MATLAB function is used to read data from a CSV file?
3. Why is it important to handle missing values in a dataset?
4. Explain how normalization differs from standardization.
5. Describe the purpose of applying a moving average filter to data.
6. What are the potential impacts of outliers on data analysis and model performance?
7. How can normalization improve the performance of machine learning algorithms?
8. What are some common techniques for handling missing data, and why is replacing with
the mean a valid approach?
POSTLAB QUESTIONS:
1. Which preprocessing technique had the most significant impact on the dataset, and why?
2. Were there any challenges encountered during data extraction or preprocessing? How were
they addressed?
3. Evaluate the effectiveness of the moving average filter in smoothing the data. Did it help
reveal underlying patterns?
4. Based on your results, what additional preprocessing steps would you recommend to
further enhance the dataset's quality?
RESULT:
CORE COMPETENCY:
MARKS ALLOCATION:
Details                                          Marks      Marks Awarded
                                                 Allotted   BOOPATHI V       DHARANESH P
                                                            (73772213110)    (73772213116)
Preparation                                      20
Conducting                                       20
Calculation / Graphs                             15
Results                                          10
Basic understanding (Core competency learned)    15
Viva                                             10
Record                                           10
Total                                            100
Signature of faculty
FLOW CHART: