Notes Unit 1

Uploaded by

Ak Kumar

Data Science Methodology: Class 12 Notes

Data Science Methodology

A data science methodology is a structured approach to solving problems. A methodology gives the data
scientist a framework for designing an AI project. The framework will help the team to decide on the methods,
processes, and strategies that will be employed to obtain the correct output required from the AI project.

Definition: Data Science Methodology is a process with a prescribed sequence of iterative steps that data scientists
follow to approach a problem and find a solution.

The Data Science Methodology was introduced by John Rollins, a data scientist at IBM Analytics. It consists
of 10 steps (stages).

The methodology is broken down into five modules, each of which covers two stages and explains why each
stage is necessary.

1. From Problem to Approach

2. From Requirements to Collection
3. From Understanding to Preparation
4. From Modelling to Evaluation
5. From Deployment to Feedback

1. Business understanding

In the first stage, we have to understand the problem and comprehend exactly what the business requires.
This is also known as problem scoping and defining. The team can use the 5W1H Problem Canvas to
understand the issue in depth. This stage also involves using the Design Thinking Framework.

To solve a problem, it’s crucial to understand the customer’s needs. This can be achieved by asking relevant
questions and engaging in discussions with all stakeholders.

2. Analytic Approach

In this stage, the data scientist identifies the questions that must be answered and collects any
clarifications from the stakeholders that are required for the analysis. Asking stakeholders further
questions helps the AI project team decide on the correct approach to solve the problem.

To solve a particular problem, there are four main types of data analytics.
1. Descriptive Analytics
2. Diagnostic Analytics
3. Predictive Analytics
4. Prescriptive Analytics

Descriptive Analytics: Descriptive analytics summarizes past data to identify trends and patterns. It uses
tools like graphs and charts, and statistical measures like mean, median, and mode, to understand the data.
For example: calculating the average marks of students in an exam, or analyzing sales data from the previous
year.
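
The average-marks example above can be sketched in a few lines of Python. This is only an illustration: the marks are made-up values, and the variable names are assumptions rather than anything taken from the notes.

```python
import statistics

# Hypothetical exam marks for a small class (illustrative values only).
marks = [72, 85, 85, 90, 64, 78, 85]

mean_mark = statistics.mean(marks)      # arithmetic average of all marks
median_mark = statistics.median(marks)  # middle value once the marks are sorted
mode_mark = statistics.mode(marks)      # most frequently occurring mark

print("Mean:", round(mean_mark, 2))
print("Median:", median_mark)
print("Mode:", mode_mark)
```

Here the median and mode are both 85, while the mean is pulled slightly lower by the weaker scores, which is the kind of summary descriptive analytics is meant to give.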

Diagnostic Analytics: Diagnostic analytics tries to understand why something happened. It analyzes past
data using techniques like root cause analysis, hypothesis testing, correlation analysis, etc. For example,
if the sales of a company dropped, diagnostic analysis helps to find the cause by examining questions
like “Is it due to poor customer service?” or “Is it due to low product quality?”
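
As a rough sketch of the correlation analysis mentioned above, the snippet below computes the Pearson correlation between customer complaints and sales. The data and the function name `pearson` are hypothetical, chosen only to mirror the falling-sales example.

```python
import math

def pearson(xs, ys):
    """Return the Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the two means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical monthly figures (made up for illustration).
complaints = [5, 8, 12, 15, 20]
sales = [200, 180, 150, 130, 100]

r = pearson(complaints, sales)
print("Correlation between complaints and sales:", round(r, 3))
```

A value close to -1 suggests that sales fall as complaints rise. Correlation alone does not prove the cause, so it is normally combined with the other diagnostic techniques listed above.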

Predictive Analytics: Predictive analytics uses past data to make predictions about future events or trends,
using techniques like regression, classification, clustering, etc. The main purpose is to foresee future
outcomes and make informed decisions. For example, a company can use predictive analytics to forecast its
sales, demand, inventory, customer purchase patterns, etc., based on previous sales data.
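
To make the sales-forecast example concrete, here is a minimal sketch of regression: a straight line fitted by ordinary least squares and then used to predict the next month. The figures and the helper name `fit_line` are assumptions for illustration; real projects would usually rely on a library and far more data.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical past sales: month number and units sold (made-up values).
months = [1, 2, 3, 4, 5]
sales = [100, 120, 140, 160, 180]

slope, intercept = fit_line(months, sales)
forecast_month_6 = slope * 6 + intercept  # prediction for the next, unseen month
print("Forecast for month 6:", forecast_month_6)
```

Because the made-up sales rise by exactly 20 units a month, the fitted line continues that trend to month 6.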

Prescriptive Analytics: Prescriptive analytics is a data-driven approach that uses machine learning and
statistical algorithms to recommend actions that can improve business outcomes. The techniques used in
prescriptive analytics are optimization, simulation, decision analysis, etc. For example, designing the right
strategy to increase sales during the festival season by analyzing past data and optimizing pricing,
marketing, production, etc.

We can summarize each of these analytics as given in the table below.

Type | Focus | Purpose
Descriptive Analytics | Questions on summarizing historical data | Identify patterns, trends, and anomalies in past data
Diagnostic Analytics | Questions on understanding why certain events occurred | Uncover root causes and factors contributing to specific outcomes
Predictive Analytics | Questions on predicting future outcomes based on historical data patterns | Forecast future events or behaviors
Prescriptive Analytics | Questions on determining the best course of action | Recommend specific actions or interventions based on predictive insights. May indirectly influence classification through recommendations

3. Data requirements

In this stage, the 5W1H questioning method is used to identify what data is required and for what purpose.
It also involves understanding the steps in the processes that create, read, update, or delete data, so that
the correct use of the data can be determined.

Determining the specific information needed for our analysis or project includes:

● Identifying the types of data required, such as numbers, words, or images.
● Considering the structure in which the data should be organized, whether it is in a table, text file, or
database.
● Identifying the sources from which we can collect the data.
● Identifying any necessary cleaning or organization steps required before beginning the analysis.

4. Data collection

Data collection is the process of gathering data from different sources; it is a fundamental step in data
science. At this point the data requirements are revisited, and decision-makers decide whether the collected
data is sufficient or whether more or less data is required. There are mainly two sources of data collection:

Primary data source: Primary data is raw and unprocessed data that is collected from the original source, like
direct observation, experimentation, surveys, interviews, or other methods.

Secondary data source: Secondary data is ready-to-use data that has already been collected and stored
elsewhere. Examples include databases, social media data, satellite data, and data gathered through web
scraping.

5. Data Understanding

Data understanding is the process of checking whether the collected data can solve the problem. We check
the relevance of the data and confirm that it can address the specific problem or question being evaluated.

6. Data preparation

This stage covers all the activities needed to build the data set that will be used in the modelling step.
Data is transformed into a state where it is easier to work with.

Data preparation includes

1. Cleaning of data (dealing with invalid or missing values, removal of duplicate values and assigning a
suitable format)
2. Combine data from multiple sources (archives, tables and platforms)
3. Transform data into meaningful input variables
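
The three preparation activities above can be sketched in plain Python. Everything here is hypothetical: the records, the field names, the `prepare` helper, and the pass mark of 33 are assumptions made up for the illustration.

```python
# Hypothetical raw records collected from two sources (made-up values).
source_a = [
    {"name": "Asha", "marks": "78"},
    {"name": "Ravi", "marks": ""},      # missing value
    {"name": "Asha", "marks": "78"},    # duplicate of the first row
]
source_b = [
    {"name": "Meena", "marks": "91"},
    {"name": "John", "marks": "abc"},   # invalid value
]

def prepare(records):
    """Drop rows with missing or invalid marks, remove duplicates, convert marks to int."""
    seen = set()
    cleaned = []
    for row in records:
        raw = row["marks"].strip()
        if not raw.isdigit():           # empty or non-numeric marks are skipped
            continue
        key = (row["name"], raw)
        if key in seen:                 # exact duplicate already kept once
            continue
        seen.add(key)
        cleaned.append({"name": row["name"], "marks": int(raw)})
    return cleaned

combined = source_a + source_b          # step 2: combine data from multiple sources
prepared = prepare(combined)            # step 1: clean the combined data

# Step 3: derive a meaningful input variable (the pass mark of 33 is an assumption).
for row in prepared:
    row["passed"] = row["marks"] >= 33

print(prepared)
```

Dropping invalid rows is only one possible policy; depending on the project, missing values might instead be filled in with an estimate.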

7. AI modelling

AI modelling is a method of creating algorithms or models that can learn and make intelligent decisions
without human intervention. The modelling stage uses the initial version of the prepared dataset and focuses
on developing models according to the analytic approach defined previously.

Data modeling focuses on developing models that are either descriptive or predictive.

Descriptive Modeling: It is a concept in data science and statistics that focuses on summarizing and
understanding the characteristics of a dataset without making predictions or decisions. This includes
summarizing the main characteristics, patterns, and trends that are present in the data.

Common Descriptive Techniques:

Summary Statistics: This includes measures like:
● Mean (average), Median, Mode
● Standard deviation, Variance
● Range (difference between the highest and lowest values)
● Percentiles (e.g., quartiles)

Visualizations: Graphs and charts to represent the data, such as:
● Bar charts
● Histograms
● Pie charts
● Box plots
● Scatter plots
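
A short sketch of the spread-related summary statistics listed above, using Python's standard `statistics` module. The data values are made up, and the population versions of standard deviation and variance are used here as an assumption; the sample versions (`statistics.stdev`, `statistics.variance`) would give slightly larger numbers.

```python
import statistics

# Hypothetical dataset (illustrative values only).
values = [4, 8, 15, 16, 23, 42]

variance = statistics.pvariance(values)        # population variance
std_dev = statistics.pstdev(values)            # population standard deviation
value_range = max(values) - min(values)        # highest value minus lowest value
quartiles = statistics.quantiles(values, n=4)  # three cut points: Q1, Q2 (median), Q3

print("Variance:", round(variance, 2))
print("Standard deviation:", round(std_dev, 2))
print("Range:", value_range)
print("Quartiles:", quartiles)
```

The standard deviation is simply the square root of the variance, which is why it is expressed in the same units as the data.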

Predictive modeling: It involves using data and statistical algorithms to identify patterns and trends in order
to predict future outcomes or values. It relies on historical data and uses it to create a model that can predict
future behavior or trends or forecast what might happen next. It involves techniques like regression,
classification, and time-series forecasting, and can be applied in a variety of fields, from predicting exam
scores to forecasting weather or stock prices.

8. Evaluation

Evaluation in an AI project cycle is the process of assessing how well a model performs after training. It
involves using test data to measure metrics like accuracy, precision, recall, or F1 score. This helps
determine if the model is reliable and effective before deploying it in real-world situations.

Model evaluation can have two main phases.

First phase – Diagnostic measures

This phase is used to ensure the model is working as intended. If the model is predictive, a decision tree
can be used to evaluate its output and check whether it is aligned with the initial design or requires any
adjustments.

Second phase – Statistical significance test

This type of evaluation can be applied to the model to verify that it accurately processes and interprets the
data. This is designed to avoid unnecessary second guessing when the answer is revealed.

9. Deployment

Deployment is the stage where the trained AI model is made available to users in real-world applications.
Once the model is evaluated and the data scientist is confident it will work, it is deployed and put to the
ultimate test.

10. Feedback

The last stage in the data science methodology is feedback. Feedback from the users helps to refine the
model and to assess its performance and impact. Feedback from users can be received in many ways.

Model Validation

Model validation is a process that evaluates the performance and reliability of a model. It offers a
systematic approach to measuring the accuracy and reliability of the model, providing insights into how
well it generalizes to new, unseen data. The benefits of model validation include:

● Enhancing the model quality.
● Reducing the risk of errors.
● Helping to detect and prevent overfitting and underfitting.

Model Validation Techniques

The commonly used validation techniques are Train-Test Split, K-Fold Cross Validation, Leave-One-Out Cross
Validation, Time Series Cross Validation, etc. Two of them are described here: Train-Test Split and K-Fold
Cross Validation.

Train Test Split

The train-test split is a technique for evaluating the performance of a machine learning algorithm. It can be
used for classification or regression problems and can be used for any supervised learning algorithm.

The procedure involves taking a dataset and dividing it into two subsets:

1. The first subset is used to train the model and is referred to as the training dataset.
2. The second subset is used to test the model and is referred to as the test dataset.

Train Dataset: Used to fit the machine learning model.

Test Dataset: Used to evaluate the fitted machine learning model.

How to Configure the Train-Test Split

The main configuration parameter is the size of the train and test datasets, normally expressed as a
percentage of the whole dataset. For example, if 67% of the data is allocated for training, then the
remaining 33% is reserved for testing. The right split depends on the project goal.

Common split percentages include:
● Train: 80%, Test: 20%
● Train: 70%, Test: 30%
● Train: 67%, Test: 33%
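
The split described above can be sketched without any external library. The function name `train_test_split`, its parameters, and the fixed seed are assumptions for this illustration (a similarly named function exists in scikit-learn, but this is not that implementation).

```python
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    """Shuffle a copy of `data` and return (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(data)                    # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)    # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# 100 hypothetical data points with an 80% / 20% split.
data = list(range(100))
train, test = train_test_split(data, test_fraction=0.2)

print("Training points:", len(train))
print("Testing points:", len(test))
```

Shuffling before splitting avoids an accidental ordering in the data (for example, all high scores at the end) ending up entirely in the test set.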

K-Fold Cross Validation

K-Fold cross-validation is a technique that splits a dataset into K subsets, or folds, and evaluates the
model K times, each time testing on a different fold and training on the remaining folds.

For example, suppose you have 100 data points you want to evaluate using K-Fold cross-validation.

Step 1: Divide the 100 data points into 5 equal parts (folds), each containing 20 data points.

Step 2: Use the 1st fold as the test set and the remaining 4 folds as the training set.

Step 3: Use the 2nd fold as the test set, and the remaining 4 will be the training set.

Step 4: Continue the above steps until each fold has been used as the test set once.

Finally, compute a performance metric such as accuracy or F1 score on each test fold, then average the
values across all folds to get the overall model performance.
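
The steps above can be sketched as follows for 100 data points and 5 folds. The helper `k_fold_indices` is a hypothetical name, and the per-fold scores are made-up numbers standing in for whatever metric a real model would produce on each fold.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) once for each of the k folds.

    Assumes n_samples is divisible by k, as in the 100-point example;
    any leftover points would otherwise never appear in a test fold.
    """
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for fold in range(k):
        start = fold * fold_size
        stop = start + fold_size
        test_idx = indices[start:stop]                # current fold is the test set
        train_idx = indices[:start] + indices[stop:]  # all other folds form the training set
        yield train_idx, test_idx

folds = list(k_fold_indices(100, 5))
for number, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Fold {number}: train={len(train_idx)} points, test={len(test_idx)} points")

# Hypothetical accuracy obtained on each test fold (made-up values).
fold_scores = [0.80, 0.85, 0.78, 0.82, 0.90]
average_score = sum(fold_scores) / len(fold_scores)
print("Average accuracy:", round(average_score, 2))
```

Each data point lands in a test fold exactly once, which is why the averaged score is usually a steadier estimate than a single train-test split on a small dataset.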

Difference between Train-Test Split and Cross Validation

Train-Test Split | Cross Validation
Normally applied on large datasets. | Normally applied on small datasets.
Divides the data into a training dataset and a testing dataset. | Divides a dataset into subsets (folds), trains the model on some folds, and evaluates its performance on the remaining data.
Clear demarcation between training data and testing data. | Every data point at some stage could be in either the testing or the training dataset.

MODEL PERFORMANCE – EVALUATION METRICS

Evaluation metrics are used to check the performance and effectiveness of a machine learning model. They
help to compare different models and identify the best-performing one for a specific task. Evaluation
metrics are categorized according to the type of problem: classification or regression.

Classification Problems: The target variable is divided into distinct classes. Metrics include accuracy,
precision, recall, F1-score, and AUC-ROC.

Regression Problems: The target variable is continuous. Metrics include mean squared error (MSE), mean
absolute error (MAE), and R-squared.

Evaluation Metrics for Classification

Confusion Matrix

A Confusion Matrix is used to evaluate the performance of a classification model. It summarizes the
predictions against the actual outcomes in an N×N matrix, where N is the number of classes or categories
to be predicted. For a binary classification problem, N=2 (Yes/No), so the result is a 2×2 matrix.

True Positives: It is the case where the model predicted Yes and the real output was also Yes.

True Negatives: It is the case where the model predicted No and the real output was also No.

False Positives: It is the case where the model predicted Yes but it was actually No.

False Negatives: It is the case where the model predicted No but it was actually Yes.
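
The four cells of the 2×2 matrix can be counted directly from paired actual and predicted labels. The labels below are made up for illustration, and `confusion_counts` is a hypothetical helper name.

```python
# Hypothetical actual outcomes and model predictions (made-up values).
actual    = ["Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No"]
predicted = ["Yes", "No", "No",  "Yes", "Yes", "No", "Yes", "No"]

def confusion_counts(actual, predicted, positive="Yes"):
    """Return (TP, TN, FP, FN) for a binary classification problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1        # predicted Yes, actually Yes
        elif p != positive and a != positive:
            tn += 1        # predicted No, actually No
        elif p == positive:
            fp += 1        # predicted Yes, actually No
        else:
            fn += 1        # predicted No, actually Yes
    return tp, tn, fp, fn

tp, tn, fp, fn = confusion_counts(actual, predicted)
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
```

The four counts always add up to the total number of predictions, which is a quick sanity check on any confusion matrix.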

Precision measures “What proportion of predicted Positives is truly Positive?” Precision should be as high as
possible.

Precision = (TP)/(TP+FP)

Recall measures “What proportion of actual Positives is correctly classified?”

Recall = (TP)/(TP+FN)

F1-score

The F1 score combines precision and recall into a single number (their harmonic mean). A good F1 score
means that you have low false positives and low false negatives, so you’re correctly identifying real
threats, and you are not disturbed by false alarms.

An F1 score is considered perfect when it is 1, while the model is a total failure when it is 0.

F1 = 2* (precision * recall)/(precision + recall)

Accuracy

Accuracy = Number of correct predictions / Total number of predictions

Accuracy = (TP+TN)/(TP+FP+FN+TN)
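
The four formulas above can be checked with a small worked example. The confusion-matrix counts are made up, `classification_metrics` is a hypothetical helper, and returning 0.0 when a denominator is zero is an assumption of this sketch rather than a universal convention.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * (precision * recall) / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts: 40 TP, 45 TN, 10 FP, 5 FN (100 predictions in total).
metrics = classification_metrics(tp=40, tn=45, fp=10, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, precision is 40/50 = 0.8 and accuracy is 85/100 = 0.85; recall (40/45) is higher than precision because the model makes more false alarms than misses.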

Evaluation Metrics for Regression

MAE (Mean Absolute Error)

Mean Absolute Error is the average of the absolute differences between predictions and actual values. A
value of 0 indicates no error, that is, perfect predictions.

MSE (Mean Square Error)

Mean Square Error (MSE) is one of the most commonly used metrics to evaluate the performance of a
regression model. MSE is the mean (average) of the squared differences between the actual values of the
target variable and the predicted values.

RMSE (Root Mean Square Error)

Root Mean Square Error (RMSE) is the square root of the MSE and can be read as the standard deviation of
the residuals (prediction errors). RMSE is often preferred over MSE because it is easier to interpret since
it is in the same units as the target variable.
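
A small worked example of the three regression metrics, using made-up actual and predicted values. The helper name `regression_errors` is an assumption for this sketch.

```python
import math

def regression_errors(actual, predicted):
    """Return (MAE, MSE, RMSE) for paired actual and predicted values."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)   # mean of absolute errors
    mse = sum(e * e for e in errors) / len(errors)    # mean of squared errors
    rmse = math.sqrt(mse)                             # back in the units of the target
    return mae, mse, rmse

# Hypothetical actual values and model predictions (made-up values).
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]

mae, mse, rmse = regression_errors(actual, predicted)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", round(rmse, 3))
```

The errors here are 1, 0, -1 and -2, so MAE is 4/4 = 1.0 and MSE is 6/4 = 1.5. MSE exceeds MAE because squaring gives the single largest error more weight.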
