Data Science Methodology Class 12 Notes
Data Science Methodology
A data science methodology is a structured approach to solving problems. A methodology gives the data
scientist a framework for designing an AI project. The framework will help the team to decide on the methods,
processes, and strategies that will be employed to obtain the correct output required from the AI project.
Definition: Data Science Methodology is a process with a prescribed sequence of iterative steps that data scientists
follow to approach a problem and find a solution.
The Data Science Methodology was introduced by John Rollins, a data scientist at IBM Analytics. It consists
of 10 steps.
The methodology is broken down into five modules, each of which covers two stages and explains why each is
necessary.
1. From Problem to Approach
2. From Requirements to Collection
3. From Understanding to Preparation
4. From Modelling to Evaluation
5. From Deployment to Feedback
1. Business understanding
In the first stage, we have to understand the problem and comprehend exactly what the business requires.
This is also known as problem scoping and defining. The team can use the 5W1H Problem Canvas to understand
the issue in depth. This stage also involves using the Design Thinking Framework.
To solve a problem, it’s crucial to understand the customer’s needs. This can be achieved by asking relevant
questions and engaging in discussions with all stakeholders.
2. Analytic Approach
In this stage, the data scientist identifies the questions and clarifications that are required from the
stakeholders for the analysis. Asking the stakeholders further questions helps the AI project team decide on
the correct approach to solve the problem.
To solve a particular problem, there are four main types of data analytics.
1. Descriptive Analytics
2. Diagnostic Analytics
3. Predictive Analytics
4. Prescriptive Analytics
Descriptive Analytics: Descriptive analytics summarizes past data to identify trends and patterns. It uses
tools like graphs and charts, and statistical measures like mean, median, and mode, to understand the data.
For example: calculating the average marks of students in an exam, or analyzing sales data from the previous
year.
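The "average marks" example can be sketched in a few lines of Python. This is a minimal illustration using the standard `statistics` module; the list of marks is hypothetical data invented for the example.

```python
import statistics

# Hypothetical exam marks for ten students (illustrative data only).
marks = [72, 85, 90, 66, 85, 78, 92, 85, 70, 77]

mean_mark = statistics.mean(marks)      # arithmetic average of the marks
median_mark = statistics.median(marks)  # middle value of the sorted marks
mode_mark = statistics.mode(marks)      # most frequently occurring mark

print(f"Mean: {mean_mark}, Median: {median_mark}, Mode: {mode_mark}")
```

These three numbers only describe what has already happened; they do not explain why or predict anything.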
Diagnostic Analytics: Diagnostic analytics seeks to understand why something happened. It analyzes past data
using techniques like root cause analysis, hypothesis testing, correlation analysis, etc. For example, if the
sales of a company dropped, diagnostic analysis will help to find the cause by examining questions like
“Is it due to poor customer service?” or “Is it due to low product quality?”
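As a rough sketch of correlation analysis, the code below computes the Pearson correlation coefficient between two columns. The function name `pearson_correlation` and the monthly complaint and sales figures are hypothetical, chosen only to illustrate the sales-drop example.

```python
import math


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance_sum = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance_sum / math.sqrt(spread_x * spread_y)


# Hypothetical monthly figures (illustrative data only).
customer_complaints = [5, 8, 12, 15, 20, 24]
monthly_sales = [100, 95, 88, 80, 72, 65]

r = pearson_correlation(customer_complaints, monthly_sales)
print(f"Correlation between complaints and sales: {r:.3f}")
```

A value close to -1 suggests that sales fall as complaints rise. Correlation alone does not prove that poor service caused the drop, so it is normally combined with the other diagnostic techniques.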
Predictive Analytics: Predictive analytics uses past data to make predictions about future events or trends,
using techniques like regression, classification, clustering, etc. The main purpose is to foresee future
outcomes and make informed decisions. For example, a company can use predictive analytics to forecast its
sales, demand, inventory, customer purchase patterns, etc., based on previous sales data.
Prescriptive Analytics: Prescriptive analytics is a data-driven approach that uses machine learning and
statistical algorithms to recommend actions that can improve business outcomes. The techniques used in
prescriptive analytics are optimization, simulation, decision analysis, etc. For example, a company can design
the right strategy to increase sales during the festival season by analyzing past data and then optimizing
pricing, marketing, production, etc.
We can summarize each of these analytics as follows:
Descriptive Analytics
Focus: Questions on summarizing historical data
Purpose: Identify patterns, trends, and anomalies in past data
Diagnostic Analytics
Focus: Questions on understanding why certain events occurred
Purpose: Uncover root causes and factors contributing to specific outcomes
Predictive Analytics
Focus: Questions on predicting future outcomes based on historical data patterns
Purpose: Forecast future events or behaviors
Prescriptive Analytics
Focus: Questions on determining the best course of action
Purpose: Recommend specific actions or interventions based on predictive insights. May indirectly influence
classification through recommendations
3. Data requirements
In this stage, the 5W1H questioning method is used to identify the data requirements and the purpose of the
data. This involves understanding the steps in the processes that create, read, update, or delete data, and
determining the correct use of the data.
Determining the specific information needed for our analysis or project includes:
1. Identifying the types of data required, such as numbers, words, or images.
2. Considering the structure in which the data should be organized, whether it is in a table, text file, or
database.
3. Identifying the sources from which we can collect the data.
4. Identifying any necessary cleaning or organization steps required before beginning the analysis.
4. Data collection
Data collection is a process where the data is collected from different sources; it is a fundamental step in
data science. At this stage, the data requirements are revisited and decision-makers decide whether more or
less data is needed. There are mainly two sources of data collection:
Primary data source: Primary data is raw and unprocessed data that is collected from the original source, like
direct observation, experimentation, surveys, interviews, or other methods.
Secondary data source: Secondary data is ready-to-use data that has already been collected and stored
elsewhere. It can be obtained from databases, social media data, satellite data, etc., or gathered through
web scraping.
5. Data Understanding
Data understanding is the process of checking whether the collected data can solve the problem. We check the
relevance of the data and confirm that it can address the specific problem or question under evaluation.
6. Data preparation
This stage covers all the activities to build the set of data that will be used in the modelling step. Data
is transformed into a state where it is easier to work with.
Data preparation includes
1. Cleaning of data (dealing with invalid or missing values, removal of duplicate values and assigning a
suitable format)
2. Combine data from multiple sources (archives, tables and platforms)
3. Transform data into meaningful input variables
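The three preparation activities listed above can be sketched in plain Python. This is only an illustration under stated assumptions: the two data sources, the field names, and the helper functions `clean`, `combine`, and `transform` are all hypothetical.

```python
# Hypothetical raw records from two sources (illustrative data only).
marks_source = [
    {"student_id": 1, "marks": "78"},
    {"student_id": 2, "marks": ""},      # missing value
    {"student_id": 3, "marks": "91"},
    {"student_id": 3, "marks": "91"},    # duplicate row
    {"student_id": 4, "marks": "abc"},   # invalid value
]
attendance_source = {1: 0.95, 2: 0.80, 3: 0.88, 4: 0.70}


def clean(records):
    """Step 1: drop duplicates and rows whose marks are missing or invalid."""
    seen_ids = set()
    cleaned = []
    for record in records:
        student_id = record["student_id"]
        if student_id in seen_ids:
            continue  # duplicate row
        try:
            marks = int(record["marks"])  # assign a suitable (numeric) format
        except ValueError:
            continue  # missing or invalid value
        seen_ids.add(student_id)
        cleaned.append({"student_id": student_id, "marks": marks})
    return cleaned


def combine(records, attendance):
    """Step 2: attach attendance values from the second source."""
    return [
        dict(record, attendance=attendance[record["student_id"]])
        for record in records
        if record["student_id"] in attendance
    ]


def transform(records):
    """Step 3: derive an input variable that a model could use."""
    return [dict(record, high_attendance=record["attendance"] >= 0.9) for record in records]


prepared = transform(combine(clean(marks_source), attendance_source))
print(prepared)
```

Here invalid rows are simply dropped. Depending on the project, missing values might instead be filled in, which is a choice the sketch does not cover.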
7. AI modelling
AI modeling is a method of creating algorithms or models that can learn and make intelligent decisions without
human intervention. The modeling stage uses the initial version of the prepared dataset and focuses on
developing models according to the analytic approach previously defined.
Data modeling focuses on developing models that are either descriptive or predictive.
Descriptive Modeling: It is a concept in data science and statistics that focuses on summarizing and
understanding the characteristics of a dataset without making predictions or decisions. This includes
summarizing the main characteristics, patterns, and trends that are present in the data.
Common Descriptive Techniques:
Summary Statistics: These include measures like mean (average), median, mode, standard deviation, variance,
range (difference between the highest and lowest values), and percentiles (e.g., quartiles).
Visualizations: Graphs and charts that represent the data, such as bar charts, histograms, pie charts, box
plots, and scatter plots.
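The measures of spread named above can be computed with Python's standard `statistics` module. The sketch below uses a small hypothetical dataset and the population versions of variance and standard deviation; sample versions (`statistics.variance`, `statistics.stdev`) would give slightly different values.

```python
import statistics

# Hypothetical dataset (illustrative values only).
data = [2, 4, 4, 4, 5, 5, 7, 9]

variance = statistics.pvariance(data)        # population variance
std_dev = statistics.pstdev(data)            # population standard deviation
value_range = max(data) - min(data)          # highest value minus lowest value
quartiles = statistics.quantiles(data, n=4)  # three cut points: Q1, Q2 (median), Q3

print(f"Variance: {variance}, Standard deviation: {std_dev}")
print(f"Range: {value_range}, Quartiles: {quartiles}")
```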
Predictive modeling: It involves using data and statistical algorithms to identify patterns and trends in order
to predict future outcomes or values. It relies on historical data and uses it to create a model that can predict
future behavior or trends or forecast what might happen next. It involves techniques like regression,
classification, and time-series forecasting, and can be applied in a variety of fields, from predicting exam
scores to forecasting weather or stock prices.
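As a minimal sketch of regression, the simplest predictive technique mentioned above, the code below fits a straight line by least squares and uses it to predict an exam score. The function `fit_line` and the hours-studied data are hypothetical; a real project would normally use a machine learning library and far more data.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical historical data: hours studied and the exam score obtained.
hours_studied = [1, 2, 3, 4, 5]
exam_scores = [52, 58, 65, 70, 75]

slope, intercept = fit_line(hours_studied, exam_scores)

# Predict the score for a student who studies 6 hours (an unseen value).
predicted_score = intercept + slope * 6
print(f"Predicted score for 6 hours of study: {predicted_score:.1f}")
```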
8. Evaluation
Evaluation in an AI project cycle is the process of assessing how well a model performs after training. It
involves using test data to measure metrics like accuracy, precision, recall, or F1 score. This helps determine
if the model is reliable and effective before deploying it in real-world situations.
Model evaluation can have two main phases.
First phase – Diagnostic measures
Diagnostic measures are used to ensure the model is working as intended. If the model is a predictive model, a
decision tree can be used to evaluate the output of the model and check whether it is aligned to the initial
design or requires any adjustments.
Second phase – Statistical significance test
Statistical significance testing can be applied to the model to verify that it accurately processes and
interprets the data. This is designed to avoid unnecessary second guessing when the answer is revealed.
9. Deployment
Deployment refers to the stage where the trained AI model is made available to the users in real-world
applications. Once the model is evaluated and the data scientist is confident it will work, it is deployed and
put to the ultimate test.
10. Feedback
The last stage in the data science methodology is feedback. Feedback from the users will help to refine the
model and assess it for performance and impact. Feedback from users can be received in many ways.
Model Validation
Model validation is a process that evaluates the performance and reliability of a model. It offers a
systematic approach to measuring the accuracy and reliability of the model, providing insights into how well
it generalizes to new, unseen data. The benefits of model validation include:
1. Enhancing the model quality.
2. Reducing the risk of errors.
3. Helping to detect and prevent overfitting and underfitting.
Model Validation Techniques
The commonly used validation techniques are Train-Test Split, K-Fold Cross Validation, Leave-One-Out Cross
Validation, Time Series Cross Validation, etc. Two of these, Train-Test Split and K-Fold Cross Validation, are
described below.
Train Test Split
The train-test split is a technique for evaluating the performance of a machine learning algorithm. It can be
used for classification or regression problems and can be used for any supervised learning algorithm.
The procedure involves taking a dataset and dividing it into two subsets:
1. Train dataset: the first subset, used to fit (train) the machine learning model.
2. Test dataset: the second subset, used to evaluate the fitted machine learning model.
How to Configure the Train-Test Split
The main configuration parameter is the size of the train and test datasets, normally expressed as
percentages. For example, if 67% of the data is allocated for training, then the remaining 33% is reserved for
testing. The appropriate split depends on the project goal.
Common split percentages include:
● Train: 80%, Test: 20%
● Train: 70%, Test: 30%
● Train: 67%, Test: 33%
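A train-test split can be sketched with the Python standard library alone. The function name `split_train_test` and its `test_size` and `seed` parameters are hypothetical choices for this illustration; machine learning libraries provide their own equivalents.

```python
import random


def split_train_test(data, test_size=0.2, seed=42):
    """Shuffle a copy of the data and split it into (train, test) lists."""
    if not 0 < test_size < 1:
        raise ValueError("test_size must be between 0 and 1")
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical dataset of 100 data points, split 80% train and 20% test.
data = list(range(1, 101))
train_set, test_set = split_train_test(data, test_size=0.2)
print(len(train_set), len(test_set))
```

Shuffling before splitting is an assumption that suits independent data points; for time-ordered data the order is usually preserved instead.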
K-Fold Cross Validation
K-Fold cross-validation is a technique that splits a dataset into K subsets, or folds, and uses each fold in
turn as the test set to evaluate the model’s performance.
For example, suppose you have 100 data points you want to evaluate using K-Fold cross-validation.
Step 1: Divide the 100 data points into 5 equal parts (folds), each containing 20 data points.
Step 2: Use the 1st fold as the test set and the remaining 4 folds as the training set.
Step 3: Use the 2nd fold as the test set, and the remaining 4 will be the training set.
Step 4: Continue the above steps until each fold has been used as the test set once.
Step 5: Compute a performance metric such as accuracy or F1 score on each test fold, then average these values
to get the overall model performance.
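The 100-point, 5-fold example above can be sketched as follows. The generator `k_fold_indices`, the labels, and the placeholder "model" (which always predicts the majority class of the training folds) are hypothetical and exist only to show how the folds rotate and how the scores are averaged.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k equal folds."""
    if n_samples % k != 0:
        raise ValueError("this sketch assumes n_samples is divisible by k")
    fold_size = n_samples // k
    indices = list(range(n_samples))
    for fold in range(k):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        test_idx = indices[start:stop]                # the current fold is the test set
        train_idx = indices[:start] + indices[stop:]  # the other k-1 folds are the training set
        yield train_idx, test_idx


# Hypothetical labels for 100 data points: every 4th point belongs to class 0.
labels = [0 if i % 4 == 0 else 1 for i in range(100)]

fold_accuracies = []
for train_idx, test_idx in k_fold_indices(len(labels), k=5):
    # Placeholder "model": always predict the majority class seen in training.
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    correct = sum(1 for i in test_idx if labels[i] == majority)
    fold_accuracies.append(correct / len(test_idx))

mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)
print(fold_accuracies, mean_accuracy)
```

In practice the data is often shuffled before the folds are formed; the sketch keeps the original order so the fold boundaries match the steps above.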
Difference between Train-Test Split and Cross Validation
Train-Test Split
● Normally applied on large datasets.
● Divides the data into a training dataset and a testing dataset.
● Clear demarcation between training data and testing data.
Cross Validation
● Normally applied on small datasets.
● Divides a dataset into subsets (folds), trains the model on some folds, and evaluates its performance on the
remaining data.
● Every data point at some stage could be in either the testing or the training dataset.
MODEL PERFORMANCE – EVALUATION METRICS
Evaluation metrics are used to check the performance and effectiveness of a machine learning model. They also
help to compare different models and identify the best-performing one for a specific task. Evaluation metrics
are categorized according to whether the task is a classification problem or a regression problem.
Classification Problems: The target variable is divided into distinct classes. Metrics include accuracy,
precision, recall, F1-score, and AUC-ROC.
Regression Problems: The target variable is continuous. Metrics include mean squared error (MSE), mean
absolute error (MAE), and R-squared.
Evaluation Metrics for Classification
Confusion Matrix
A confusion matrix is used to evaluate the performance of a classification model. It summarizes the
predictions against the actual outcomes in an N × N matrix, where N is the number of classes or categories to
be predicted. For a binary classification problem, N = 2 (Yes/No), so the result is a 2×2 matrix with the
following four cells:
True Positives: It is the case where the model predicted Yes and the real output was also yes.
True Negatives: It is the case where the model predicted No and the real output was also No.
False Positives: It is the case where the model predicted Yes but it was actually No.
False Negatives: It is the case where the model predicted No but it was actually Yes.
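The four cells can be counted directly from a list of actual outcomes and a list of predictions. The helper `confusion_counts` and the Yes/No data below are hypothetical, written only to illustrate the definitions.

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Count TP, TN, FP and FN for a binary (Yes/No) classification."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for real, guess in zip(actual, predicted):
        if guess == positive and real == positive:
            counts["TP"] += 1  # predicted Yes, actually Yes
        elif guess != positive and real != positive:
            counts["TN"] += 1  # predicted No, actually No
        elif guess == positive:
            counts["FP"] += 1  # predicted Yes, actually No
        else:
            counts["FN"] += 1  # predicted No, actually Yes
    return counts


# Hypothetical outcomes for ten cases (illustrative data only).
actual = ["Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No", "No", "Yes"]
predicted = ["Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "No", "No"]

counts = confusion_counts(actual, predicted)
print(counts)
```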
Precision measures “What proportion of predicted Positives is truly Positive?” Precision should be as high as
possible.
Precision = (TP)/(TP+FP)
Recall measures “What proportion of actual Positives is correctly classified?”
Recall = (TP)/(TP+FN)
F1-score
The F1 score is the harmonic mean of precision and recall. A good F1 score means that you have low false
positives and low false negatives, so you’re correctly identifying real threats, and you are not disturbed by
false alarms.
An F1 score is considered perfect when it is 1, while the model is a total failure when it is 0.
F1 = 2* (precision * recall)/(precision + recall)
Accuracy
Accuracy = Number of correct predictions / Total number of predictions
Accuracy = (TP+TN)/(TP+FP+FN+TN)
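The four formulas above translate directly into code. In this sketch the function `classification_metrics` and the example counts are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention rather than a fixed rule.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts: 3 true positives, 4 true negatives, 1 false positive, 2 false negatives.
metrics = classification_metrics(tp=3, tn=4, fp=1, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```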
Evaluation Metrics for Regression
MAE (Mean Absolute Error)
Mean Absolute Error is the average of the absolute differences between predictions and actual values. A value
of 0 indicates no error, that is, perfect predictions.
MAE = (sum of |actual - predicted|) / n
MSE (Mean Square Error)
Mean Square Error is one of the most commonly used metrics to evaluate the performance of a regression model.
MSE is the mean (average) of the squared differences between the actual values of the target variable and the
predicted values.
MSE = (sum of (actual - predicted)^2) / n
RMSE (Root Mean Square Error)
Root Mean Square Error is the square root of the MSE and can be interpreted as the standard deviation of the
residuals (prediction errors). RMSE is often preferred over MSE because it is easier to interpret, since it is
in the same units as the target variable.
RMSE = sqrt(MSE)
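The three regression metrics can be sketched in a few lines of Python. The function `regression_metrics` and the actual and predicted values are hypothetical, chosen so the arithmetic is easy to follow by hand.

```python
import math


def regression_metrics(actual, predicted):
    """Return (MAE, MSE, RMSE) for paired actual and predicted values."""
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n  # mean of absolute errors
    mse = sum(e ** 2 for e in errors) / n  # mean of squared errors
    rmse = math.sqrt(mse)                  # same units as the target variable
    return mae, mse, rmse


# Hypothetical values (illustrative data only).
actual_values = [3, 5, 7, 9]
predicted_values = [2, 5, 8, 12]

mae, mse, rmse = regression_metrics(actual_values, predicted_values)
print(f"MAE: {mae}, MSE: {mse}, RMSE: {rmse:.4f}")
```

The errors here are 1, 0, -1 and -3. Squaring gives the largest error (3) much more weight in MSE and RMSE than it has in MAE.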