MODULE # 1 : INTRODUCTION
IDS Course Team
BITS Pilani
The instructor gratefully acknowledges the authors who made their course materials freely available online.
1 COURSE LOGISTICS
2 FUNDAMENTALS OF DATA SCIENCE
3 DATA SCIENCE REAL WORLD APPLICATIONS
4 DATA SCIENCE CHALLENGES
5 DATA SCIENCE TEAMS
6 SOFTWARE ENGINEERING FOR DATA SCIENCE
7 FURTHER READING
CO1 Gain a basic understanding of the role of Data Science in real-world scenarios across business, industry, and government.
CO2 Understand various roles and stages in a Data Science Project and ethical issues to be
considered.
CO3 Explore the processes, tools and technologies for collection and analysis of
structured and unstructured data.
CO4 Appreciate the importance of techniques like data visualization and storytelling with data for effective presentation of outcomes to stakeholders.
Part IV: Modeling & Evaluation
Module 6: Classification
Module 7: Association Mining
Module 8: Clustering
Module 9: Anomaly Detection
Platform
) Python / Jupyter Notebook / Google Colab
Dataset
) Datasets as we deem appropriate.
Webinar
) 4 webinars
) Either Lab modules will be explained or numerical problems will be solved.
) As per schedule
1 COURSE LOGISTICS
2 FUNDAMENTALS OF DATA SCIENCE
3 DATA SCIENCE REAL WORLD APPLICATIONS
4 DATA SCIENCE CHALLENGES
5 DATA SCIENCE TEAMS
6 SOFTWARE ENGINEERING FOR DATA SCIENCE
7 FURTHER READING
) Statistician
) The most important aspect of data science is interpreting the results of the analysis in the context of the problem.
[Venn diagram: Data Science at the intersection of Math & Statistics, CS/IT (software development, machine learning), and Domain/Business knowledge, with research at the overlap.]
Artificial Intelligence
) AI involves making machines capable of mimicking human behavior, particularly
cognitive functions like facial recognition, automated driving, sorting mail based on
postal code.
Machine Learning
) Considered a sub-field of or one of the tools of AI.
) Involves providing machines with the capability of learning from experience.
) Experience for machines comes in the form of data.
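As a minimal sketch of "learning from experience," the snippet below fits a classifier on labelled examples and then generalizes to a new case; it assumes scikit-learn is available, and the feature values are purely illustrative.

```python
# "Experience" for the machine is the labelled data (X, y).
from sklearn.tree import DecisionTreeClassifier

X = [[25, 50000], [35, 80000], [45, 120000], [22, 30000]]  # age, income (illustrative)
y = ["no", "yes", "yes", "no"]                             # bought the product?

model = DecisionTreeClassifier().fit(X, y)   # learn from experience
print(model.predict([[40, 90000]]))          # generalize to unseen data
```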
Data Science
) Data science is the application of machine learning, artificial intelligence, and other
quantitative fields like statistics, visualization, and mathematics to uncover insights from
data to enable better decision making.
https://www.sciencedirect.com/topics/physics-and-astronomy/artificial-intelligence
TABLE OF CONTENTS
1 COURSE LOGISTICS
2 FUNDAMENTALS OF DATA SCIENCE
3 DATA SCIENCE REAL WORLD APPLICATIONS
4 DATA SCIENCE CHALLENGES
5 DATA SCIENCE TEAMS
6 SOFTWARE ENGINEERING FOR DATA SCIENCE
7 FURTHER READING
DataFlair
DATA SCIENCE IN FACEBOOK
Social Analytics
Utilizes quantitative research to gain insights about the social interactions among people.
Makes use of deep learning, facial recognition, and text analysis.
In facial recognition, it uses powerful neural networks to classify faces in the
photographs.
In text analysis, it uses “DeepText” to understand people’s interests and align photographs with texts.
It uses deep learning for targeted advertising.
Using the insights gained from data, it clusters users based on their preferences and
provides them with the advertisements that appeal to them.
DATA SCIENCE IN AIRBNB
Detecting bounce rates
) Use of demographic analytics to analyze bounce rates from their websites.
Providing ideal lodgings and localities
) Uses knowledge graphs where the user’s preferences are matched with the various parameters to provide ideal lodgings and localities.
DATA SCIENCE IN SPOTIFY
Spotify uses data science to gain insights about which universities had the highest percentage of party playlists and which ones spent the most time on them.
“Spotify Insights” publishes information about the ongoing trends in music.
Spotify’s Niland, an API-based product, uses machine learning to provide better searches and recommendations to its users.
Spotify analyzes listening habits of its users to predict the Grammy Award Winners.
DataFlair
APPLICATIONS OF DATA SCIENCE
edureka.co
TABLE OF CONTENTS
1 COURSE LOGISTICS
2 FUNDAMENTALS OF DATA SCIENCE
3 DATA SCIENCE REAL WORLD APPLICATIONS
4 DATA SCIENCE CHALLENGES
5 DATA SCIENCE TEAMS
6 SOFTWARE ENGINEERING FOR DATA SCIENCE
7 FURTHER READING
Cognitive Biases are the distortions of reality because of the lens through which we
view the world.
Each of us sees things differently based on our preconceptions, past experiences,
cultural, environmental, and social factors. This doesn’t necessarily mean that the
way we think or feel about something is truly representative of reality.
1 COURSE LOGISTICS
2 FUNDAMENTALS OF DATA SCIENCE
3 DATA SCIENCE REAL WORLD APPLICATIONS
4 DATA SCIENCE CHALLENGES
5 DATA SCIENCE TEAMS
6 SOFTWARE ENGINEERING FOR DATA SCIENCE
7 FURTHER READING
https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
ROLES IN DATA SCIENCE TEAM [2/6]
https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
ROLES IN DATA SCIENCE TEAM [3/6]
3 Business analyst
) A business analyst basically realizes a CAO’s functions but on the operational level.
) This implies converting business expectations into data analysis.
) If your core data scientist lacks domain expertise, a business analyst bridges this gulf.
4 Data scientist
) A data scientist is a person who solves business tasks using machine learning and data
mining techniques.
) The role can be narrowed down to data preparation and cleaning with further model training and evaluation.
https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
ROLES IN DATA SCIENCE TEAM [4/6]
Job of a data scientist is often divided into two roles
[4A] Machine Learning Engineer
) A machine learning engineer combines software engineering and modeling skills by
determining which model to use and what data should be used for each model.
) Probability and statistics are also their forte.
) Training, monitoring, and maintaining a model.
) Preferred skills: SQL, Python, R, Scala, Carto, D3, QGIS, Tableau
https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
ROLES IN DATA SCIENCE TEAM [5/6]
5 Data architect
) Working with Big Data.
) This role is critical to warehouse the data, define database architecture, centralize data,
and ensure integrity across different sources.
) Preferred skills: SQL, noSQL, XML, Hive, Pig, Hadoop, Spark
6 Data engineer
) Data engineers implement, test, and maintain infrastructural components that data
architects design.
) Realistically, the role of an engineer and the role of an architect can be combined in one
person.
) Preferred skills: SQL, noSQL, Hive, Pig, Matlab, SAS, Python, Java, Ruby, C++, Perl
https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
ROLES IN DATA SCIENCE TEAM [6/6]
https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
SKILLSET FOR A DATA SCIENTIST
PROGRAMMING: The most fundamental of a data scientist’s skills. Programming improves your statistics skills, helps you “analyze large datasets,” and gives you the ability to create your own tools.
QUANTITATIVE ANALYSIS: Improve your ability to run experimental analysis, scale your data
strategy and help you implement machine learning.
PRODUCT INTUITION: Understanding products will help you perform quantitative analysis. It
will also help you predict system behavior, establish metrics and improve
debugging skills.
COMMUNICATION: Strong communication skills will help you “leverage all of the previous
skills listed.”
TEAMWORK: It requires being selfless, embracing feedback and sharing your knowledge
with your team.
SKILLS REQUIRED FOR A DATA SCIENTIST
[Diagram: qualities of a data scientist — communicative, curious, creative, qualitative, technical, skeptical.]
Tools: R, SQL, Python, Scala, SAS, Hadoop, Julia, Tableau, Weka.
Techniques: logistic regression, linear regression, K-means clustering, decision trees, SVM, ANN.
https://towardsdatascience.com/why-team-building-is-important-to-data-scientists-a8fa74dbc09b
TABLE OF CONTENTS
1 COURSE LOGISTICS
2 FUNDAMENTALS OF DATA SCIENCE
3 DATA SCIENCE REAL WORLD APPLICATIONS
4 DATA SCIENCE CHALLENGES
5 DATA SCIENCE TEAMS
6 SOFTWARE ENGINEERING FOR DATA SCIENCE
7 FURTHER READING
https://www.altexsoft.com/blog/dataops-essentials/
DATAOPS
DataOps puts data pipelines into a CI/CD paradigm.
Development – involve building a new pipeline, changing a data model or redesigning
a dashboard.
Testing – checking the most minor update for data accuracy, potential deviation, and
errors.
Deployment – moving data jobs between environments, pushing them to the next
stage, or deploying the entire pipeline in production.
Monitoring – allows data professionals to identify bottlenecks, catch abnormal
patterns, and measure adoption of changes.
Orchestration – automates moving data between different stages, monitoring
progress, triggering autoscaling, and operations related to data flow management.
https://www.altexsoft.com/blog/dataops-essentials/
DATAOPS TEAM
[Venn diagram: MLOps at the intersection of Machine Learning, Data Engineering, and DevOps.]
The real challenge isn’t building an ML model, but building an integrated ML system and continuously operating it in production.
To deploy and maintain ML systems in production reliably and efficiently.
Automating continuous integration (CI), continuous delivery (CD), and continuous
training (CT) for machine learning (ML) systems.
Frameworks
) Kubeflow and Cloud Build
) Amazon AWS MLOps
) Microsoft Azure MLOps
https://ml-ops.org/content/mlops-principles
MLOPS
https://builtin.com/machine-learning/mlops
DATAOPS AND MLOPS
1 COURSE LOGISTICS
2 FUNDAMENTALS OF DATA SCIENCE
3 DATA SCIENCE REAL WORLD APPLICATIONS
4 DATA SCIENCE CHALLENGES
5 DATA SCIENCE TEAMS
6 SOFTWARE ENGINEERING FOR DATA SCIENCE
7 FURTHER READING
[2] Functional
) Resource allocation driven by a functional
agenda rather than an enterprise agenda.
) Analysts are located in the functions where most of the analytics activity happens.
[3] Consulting
) Resources allocated based on availability on
a first-come first-served basis without
necessarily aligning to enterprise objectives
) Analysts work together in a central group
[6] Federated
) Same as “Center of Excellence” model with
need-based operational involvement to
provide SME support.
) A centralized group of advanced analysts is deployed to business units as needed.
https://data-flair.training/blogs/data-science-use-cases/
https://www.northeastern.edu/graduate/blog/what-does-a-data-scientist-do/
https://www.visual-paradigm.com/guide/software-development-process/what-is-a-software-process-model/
https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
https://www.cio.com/article/3217026/what-is-a-data-scientist-a-key-data-analytics-role-and-a-lucrative-career.html
https://atlan.com/what-is-dataops/
THANK YOU
INTRODUCTION TO DATA SCIENCE
MODULE # 2 : DATA ANALYTICS
IDS Course Team
BITS Pilani
The instructor gratefully acknowledges the authors who made their course materials freely available online.
TABLE OF CONTENTS
1 ANALYTICS
2 DATA ANALYTICS
3 DATA ANALYTICS METHODOLOGIES
CRISP-DM
Big Data Life-cycle
SEMMA
SMAM
4 FURTHER READING
DEFINITION OF ANALYTICS – DICTIONARY
Source: Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti
DEFINITION OF ANALYTICS – WEBSITES
Source: Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti
GOALS OF DATA ANALYTICS
To predict something
) whether a transaction is a fraud or not
) whether it will rain on a particular day
) whether a tumour is benign or malignant
To find patterns in the data
) finding the top 10 coldest days in the year
) which pages are visited the most on a particular website
) finding the most searched celebrity in a particular year
To find relationships in the data
) finding similar news articles
) finding similar patients in an electronic health record system
) finding related products on an e-commerce website
) finding similar images
) finding correlation between news items and stock prices
TABLE OF CONTENTS
1 ANALYTICS
2 DATA ANALYTICS
3 DATA ANALYTICS METHODOLOGIES
CRISP-DM
Big Data Life-cycle
SEMMA
SMAM
4 FURTHER READING
DATA ANALYTICS
DESCRIPTIVE ANALYTICS
Techniques:
) Descriptive Statistics - histogram, correlation
) Data Visualization
) Exploratory Analysis
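A minimal sketch of these descriptive techniques with pandas and matplotlib; the data frame and column names are illustrative.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"age": [23, 35, 31, 52, 46, 23, 37],
                   "income": [30, 55, 48, 90, 75, 28, 60]})

print(df.describe())          # summary statistics of each attribute
print(df.corr())              # correlation matrix
df["age"].plot.hist(bins=5)   # histogram of one attribute
plt.show()
```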
PREDICTIVE ANALYTICS
Techniques / Algorithms:
) Regression
) Classification
) ML algorithms like Linear regression, Logistic regression, SVM
) Deep Learning techniques
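A minimal predictive-analytics sketch, assuming scikit-learn: a logistic regression classifier is trained and scored on held-out synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression().fit(X_tr, y_tr)   # learn from past data
print("accuracy:", clf.score(X_te, y_te))    # predict unseen cases
```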
DIAGNOSTIC ANALYTICS
DIAGNOSTIC ANALYTICS EXAMPLE
What is the effect of global warming on the Southwest monsoon?
PRESCRIPTIVE ANALYTICS
PRESCRIPTIVE ANALYTICS EXAMPLE
How can we improve crop production?
COGNITIVE ANALYTICS
Cognitive Analytics – What Don’t I Know?
https://www.10xds.com/blog/cognitive-analytics-to-reinvent-business/
1 ANALYTICS
2 DATA ANALYTICS
3 DATA ANALYTICS METHODOLOGIES
CRISP-DM
Big Data Life-cycle
SEMMA
SMAM
4 FURTHER READING
DATA ANALYTICS METHODOLOGIES
NEED FOR A STANDARD PROCESS
DATA SCIENCE METHODOLOGY
10 Questions the process aims to answer
Problem to Approach
1 What is the problem that you are trying to solve?
2 How can you use data to answer the questions?
Working with Data
3 What data do you need to answer the question?
4 Where is the data coming from? Identify all Sources. How will you acquire it?
5 Is the data that you collected representative of the problem to be solved?
6 What additional work is required to manipulate and work with the data?
Delivering the Answer
7 In what way can the data be visualized to get to the answer that is required?
8 Does the model used really answer the initial question or does it need to be adjusted?
9 Can you put the model into practice?
10 Can you get constructive feedback into answering the question?
CRISP-DM
CRISP-DM PHASES
Business Understanding
) Understand project objectives and requirements.
) Data mining problem definition.
Data Understanding
) Initial data collection and familiarization.
) Identify data quality issues.
) Identify initial obvious results.
Data Preparation
) Record and attribute selection.
) Data cleansing.
CRISP-DM PHASES
Modeling
) Run the data mining tools.
Evaluation
) Determine if results meet business objectives.
) Identify business issues that should have been addressed earlier.
Deployment
) Put the resulting models into practice.
) Set up for continuous mining of the data.
CRISP-DM PHASES AND TASKS
WHY CRISP-DM?
The data mining process must be reliable and repeatable, even by people with little data mining expertise.
CRISP-DM provides a uniform framework for
) guidelines.
) experience documentation.
CRISP-DM is flexible to account for differences.
) Different business/agency problems.
) Different data
BIG DATA LIFE-CYCLE
Data Acquisition
) Acquiring information from a rich and varied data environment.
Data Awareness
) Connecting data from different sources into a coherent whole, including modeling
content, establishing context, and ensuring searchability.
Data Analytics
) Using contextual data to answer questions about the state of your organization.
Data Governance
) Establishing a framework for providing for the provenance, infrastructure and disposition
of that data.
BIG DATA LIFE-CYCLE
Phase 1: Foundations
Phase 2: Acquisition
Phase 3: Preparation
Phase 4: Input and Access
Phase 5: Processing
Phase 6: Output and Interpretation
Phase 7: Storage
Phase 8: Integration
Phase 9: Analytics and Visualization
Phase 10: Consumption
Phase 11: Retention, Backup, and Archival
Phase 12: Destruction
BIG DATA LIFE-CYCLE
Phase 1: Foundations
) Understanding and validating data requirements, solution scope, roles and
responsibilities, data infrastructure preparation, technical and non-technical
considerations, and understanding data rules in an organization.
Phase 2: Data Acquisition
) Data Acquisition refers to collecting data.
) Data sets can be obtained from various sources, both internal and external to the
business organizations.
) Data sources can be in
– structured forms, such as data transferred from a data warehouse, a data mart, or various transaction systems.
– semi-structured sources, such as Weblogs and system logs.
– unstructured sources, such as media files consisting of videos, audios, and pictures.
BIG DATA LIFE-CYCLE
Phase 10: Data Consumption
) Data is turned into information ready for consumption by the internal or external users,
including customers of the business organization.
) Data consumption requires architectural input for policies, rules, regulations, principles,
and guidelines.
Phase 11: Retention, Backup, and Archival
) Use established data backup strategies, techniques, methods, and tools.
) Identify, document, and obtain approval for the retention, backup, and archival decisions.
Phase 12: Data Destruction
) There may be regulatory requirements to destroy a particular type of data after a certain amount of time.
) Confirm the destruction requirements with the data governance team in business
organizations.
SEMMA
Developed by the SAS Institute.
Five stages: Sample, Explore, Modify, Model, Assess.
SEMMA STAGES
1 Sample
) Sampling the data by extracting a portion of a large data set big enough to contain the
significant information, yet small enough to manipulate quickly.
) Optional stage
2 Explore
) Exploration of the data by searching for unanticipated trends and anomalies in order to
gain understanding and ideas.
3 Modify
) Modification of the data by creating, selecting, and transforming the variables to focus
the model selection process.
SEMMA STAGES
4 Model
) Modeling the data by allowing the software to search automatically for a combination of
data that reliably predicts a desired outcome.
5 Assess
) Assessing the data by evaluating the usefulness and reliability of the findings from the data mining process and estimating how well it performs.
SEMMA
“SEMMA is not a data mining methodology but rather a logical organization of the
functional tool set of SAS Enterprise Miner for carrying out the core tasks of data
mining.
Enterprise Miner can be used as part of any iterative data mining methodology
adopted by the client. Naturally steps such as formulating a well defined business or
research problem and assembling quality representative data sources are critical to
the overall success of any data mining project.
SEMMA is focused on the model development aspects of data mining.”
SMAM
Standard Methodology for Analytics Models
SMAM PHASES
Phase – Description
Use-case identification – Selection of the ideal approach from a list of candidates
Model requirements gathering – Understanding the conditions required for the model to function
Data preparation – Getting the data ready for the modeling
Modeling experiments – Scientific experimentation to solve the business question
Insight creation – Visualization and dash-boarding to provide insight
Proof of Value: ROI – Running the model in a small-scale setting to prove the value
Operationalization – Embedding the analytical model in operational systems
Model life-cycle – Governance around model lifetime and refresh
TABLE OF CONTENTS
1 ANALYTICS
2 DATA ANALYTICS
3 DATA ANALYTICS METHODOLOGIES
CRISP-DM
Big Data Life-cycle
SEMMA
SMAM
4 FURTHER READING
DESCRIPTIVE ANALYTICS – EXAMPLE # 1
Problem Statement:
“Market research team at Aqua Analytics Pvt. Ltd is assigned a task to identify the profile of a typical customer for a digital fitness band that is offered by Titanic Corp. The market research team decides to investigate whether there are differences across the usage patterns and product lines with respect to customer characteristics.”
Data captured:
) Gender
) Age (in years)
) Education (in years)
) Relationship Status (Single or Partnered)
) Annual household income
) Average number of times the customer tracks activity each week
) Number of miles the customer expects to walk each week
) Self-rated fitness on a scale 1–5, where 1 is poor shape and 5 is excellent
) Model of the product purchased – IQ75, MZ65, DX87
DIAGNOSTIC ANALYTICS – EXAMPLE # 1
Problem Statement:
“During the 1980s, General Electric was selling different products to its customers, such as light bulbs, jet engines, windmills, and other related products. They also sold parts and services separately: GE would sell you a product, you would use it until it needed repair (because of normal wear and tear, or because it broke), and you would come back to GE, which would then sell you parts and services to fix it. The model for GE focused on how much GE was selling, in sales of operational equipment and in sales of parts and services, and on what GE needed to do to drive up those sales.”
https://medium.com/parrotai/understand-data-analytics-framework-with-a-case-study-in-the-business-world-15bfb421028d
https://www.sganalytics.com/blog/change-management-analytics-adoption/
PREDICTIVE ANALYTICS – EXAMPLE # 1
Google launched Google Flu Trends (GFT), to collect predictive analytics regarding
the outbreaks of flu. It’s a great example of seeing big data analytics in action.
So, did Google manage to predict influenza activity in real-time by aggregating search
engine queries with this big data and adopting predictive analytics?
Even with a wealth of big data analytics on search queries, GFT overestimated the
prevalence of flu by over 50% in 2012-2013 and 2011-2012.
They matched the search engine terms conducted by people in different regions of the world. When these queries were compared with traditional flu surveillance systems, Google found that the predictive analytics of the flu season pointed towards a correlation with higher search engine traffic for certain phrases.
https://www.slideshare.net/VasileiosLampos/user-generated-content-collective-and-personalised-inference-tasks
PREDICTIVE ANALYTICS – EXAMPLE # 2
Colleen Jones applied predictive analytics to FootSmart (a niche online catalog
retailer) on a content marketing product. It was called the FootSmart Health
Resource Center (FHRC) and it consisted of articles, diagrams, quizzes and the like.
On analyzing the data around increased search engine visibility, FHRC was found
to help FootSmart reach more of the right kind of target customers.
They were receiving more traffic, primarily consisting of people that cared about foot
health conditions and their treatments.
FootSmart decided to push more content at FHRC and also improve its
merchandising of the product.
The result of such informed data-driven decision making? A 36% increase in weekly sales.
https://www.footsmart.com/pages/health-resource-center
PRESCRIPTIVE ANALYTICS – EXAMPLE # 1
A health insurance company analyses its data and determines that many of its diabetic
patients also suffer from retinopathy.
With this information, the provider can now use predictive analytics to get an idea of how
many more ophthalmology claims it might receive during the next year.
Then, using prescriptive analytics, the company can look at scenarios where the reimbursement costs for ophthalmology increase, decrease, or hold steady. These
scenarios then allow them to make an informed decision about how to proceed in a way that’s
both cost-effective and beneficial to their customers.
PRESCRIPTIVE ANALYTICS – EXAMPLE # 2
Whenever you go to Amazon, the site recommends dozens and dozens of products to
you. These are based not only on your previous shopping history (reactive), but also
based on what you’ve searched for online, what other people who’ve shopped for the
same things have purchased, and about a million other factors (proactive).
Amazon and other large retailers are taking deductive, diagnostic, and predictive data
and then running it through a prescriptive analytics system to find products that you
have a higher chance of buying.
Every bit of data is broken down and examined with the end goal of helping the
company suggest products you may not have even known you wanted.
https://accent-technologies.com/2020/06/18/examples-of-prescriptive-analytics/
HEALTHCARE ANALYTICS – CASE STUDY
Self study
https://integratedmp.com/4-key-healthcare-analytics-sources-is-your-practice-using-them/
https://www.youtube.com/watch?v=olpuyn6kemg
REFERENCES
Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti
https://www.kdnuggets.com/2014/10/crisp-dm-top-methodology-analytics-data-mining-data-science-projects.html
https://www.datasciencecentral.com/profiles/blogs/crisp-dm-a-standard-methodology-to-ensure-a-good-outcome
https://documentation.sas.com/?docsetId=emref&docsetTarget=n061bzurmej4j3n1jnj8bbjjm1a2.htm&docsetVersion=14.3&locale=en
http://jesshampton.com/2011/02/16/semma-and-crisp-dm-data-mining-methodologies/
https://www.kdnuggets.com/2015/08/new-standard-methodology-analytical-models.html
https://medium.com/illumination-curated/big-data-lifecycle-management-629dfe16b78d
https://www.esadeknowledge.com/view/7-challenges-and-opportunities-in-data-based-decision-making-193560
THANK YOU
INTRODUCTION TO DATA SCIENCE
MODULE # 3 : DATA
IDS Course Team
BITS Pilani
The instructor gratefully acknowledges the authors who made their course materials freely available online.
1 DATA
2 DATA-SETS
3 DATA QUALITY
4 DATA MODELS
5 ANALYSIS IN DATA SCIENCE
6 DATA PIPELINES AND PATTERNS
7 FURTHER READING
[Attribute taxonomy: Data splits into Numerical — e.g., Interval attributes such as calendar dates and temperature — and Categorical — e.g., Nominal attributes such as ID numbers, eye color, and zip codes.]
TYPES OF ATTRIBUTES
Continuous Attribute
) Real numbers as attribute values.
) temperature, height, or weight
) Continuous attributes are typically represented as floating-point variables.
Asymmetric Attribute
) Only the presence of a non-zero attribute value is considered.
) For a specific student, an attribute has a value of 1 if the student took the course
associated with that attribute and a value of 0 otherwise
) Asymmetric binary attributes.
Exercise: Identify whether each attribute in the given data is discrete or continuous.
1 DATA
2 DATA-SETS
3 DATA QUALITY
4 DATA MODELS
5 ANALYSIS IN DATA SCIENCE
6 DATA PIPELINES AND PATTERNS
7 FURTHER READING
Web Click-Stream
1 Public data
) Data that has been collected and preprocessed for academic or research purposes and
made public.
) https://archive.ics.uci.edu/
2 Private data
) Data that is specific to an organization.
) Privacy rules like the IT Act 2000 and GDPR apply.
https://mozanunal.com/2019/11/img2sh/
DIGITAL COLOUR IMAGE
https://www.analyticsvidhya.com/blog/2021/03/grayscale-and-rgb-format-for-storing-images/
DIGITAL COLOUR IMAGE
https://www.mathworks.com/help/matlab/creating_plots/image-types.html
GRAPH DATA EXAMPLE
https://lod-cloud.net/
TABLE OF CONTENTS
1 DATA
2 DATA-SETS
3 DATA QUALITY
4 DATA MODELS
5 ANALYSIS IN DATA SCIENCE
6 DATA PIPELINES AND PATTERNS
7 FURTHER READING
https://www.deltapartnersgroup.com/
DATA QUALITY ISSUES
Missing data
) Data that is not filled / available intentionally or otherwise.
) Attributes of interest may not always be available, such as customer information for sales
transaction data.
) Some data were not considered important at the time of entry.
) Data may be missing because of equipment malfunctions.
Duplicate data
Orphaned data
Text encoding errors
Data that is biased
1 DATA
2 DATA-SETS
3 DATA QUALITY
4 DATA MODELS
5 ANALYSIS IN DATA SCIENCE
6 DATA PIPELINES AND PATTERNS
7 FURTHER READING
Apply the basic epicycle of analysis to the formal modelling portion of data analysis.
1 Setting expectations.
– Develop a primary model that represents your best sense of what provides the answer to your question. This model is chosen based on whatever information you have currently available.
2 Collecting information.
– Create a set of secondary models that challenge the primary model in some way.
3 Revising expectations.
– If our secondary models are successful in challenging our primary model and put the primary model’s conclusions in some doubt, then we may need to adjust or modify the primary model to better reflect what we have learned from the secondary models.
Conduct a survey of 20 people to ask them how much they’d be willing to spend on a
product you’re developing.
The survey response
25, 20, 15, 5, 30, 7, 5, 10, 12, 40, 30, 30, 10, 25, 10, 20, 10, 10, 25, 5
The goal is to develop a benchmark model that serves as a baseline, against which we’ll measure the performance of a better and more attuned algorithm.
Benchmarking requires experiments to be comparable, measurable, and reproducible.
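A minimal benchmark sketch for the survey above: predict the sample mean for every future respondent, and record the in-sample error so later models can be compared against this baseline.

```python
import numpy as np

responses = np.array([25, 20, 15, 5, 30, 7, 5, 10, 12, 40,
                      30, 30, 10, 25, 10, 20, 10, 10, 25, 5])

baseline = responses.mean()                # 17.2 — the benchmark prediction
mae = np.abs(responses - baseline).mean()  # mean absolute error of the baseline
print(baseline, mae)
```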
1 DATA
2 DATA-SETS
3 DATA QUALITY
4 DATA MODELS
5 ANALYSIS IN DATA SCIENCE
6 DATA PIPELINES AND PATTERNS
7 FURTHER READING
Example of a sequential pattern: customers tend to purchase first a laptop, followed by a digital camera, and then a memory card.
) A substructure can refer to different structural forms (e.g., graphs, trees, or lattices) that may be combined with itemsets or subsequences.
The term prediction refers to both numeric prediction and class label
prediction.
Classification and regression may need to be preceded by relevance analysis, which
attempts to identify attributes that are significantly relevant to the classification and
regression process.
Classification is the process of finding a model (or function) that describes and
distinguishes data classes or concepts.
The model is derived based on the analysis of a set of training data (i.e., data objects for which the class labels are known).
The model is used to predict the class label of objects for which the class label is unknown.
The derived model may be represented as classification rules (i.e., IF-THEN rules), decision trees, mathematical formulae, or neural networks; other classifiers include naive Bayesian classification, support vector machines, and k-nearest-neighbor classification.
Classification predicts categorical (discrete, unordered) labels.
1 DATA
2 DATA-SETS
3 DATA QUALITY
4 DATA MODELS
5 ANALYSIS IN DATA SCIENCE
6 DATA PIPELINES AND PATTERNS
7 FURTHER READING
Data pipelines are sets of processes that move and transform data from various
sources to a destination where new value can be derived.
In their simplest form, pipelines may extract only data from one source such as a
REST API and load to a destination such as a SQL table in a data warehouse.
In practice, data pipelines consist of multiple steps including data extraction, data
preprocessing, data validation, and at times training or running a machine learning
model before delivering data to its final destination.
Data engineers specialize in building and maintaining the data pipelines.
For every dashboard and insight that a data analyst generates and for each predictive
model developed by a data scientist, there are data pipelines working behind the
scenes.
A single dashboard, or a single metric may be derived from data originating in
multiple source systems.
Data pipelines extract data from sources and load them into simple database tables
or flat files for analysts to use. Raw data is refined along the way to clean, structure,
normalize, combine, aggregate, and anonymize or secure it.
) A shared network file system or cloud storage bucket containing logs, comma-separated value (CSV) files, and other flat files can also serve as a data source.
Data ingestion is traditionally both the extract and load steps of an ETL or ELT
process.
SIMPLE PIPELINE
E – extract step
) gathers data from various sources in preparation for loading and transforming.
L – load step
) brings either the raw data (in the case of ELT) or the fully transformed data (in the case of
ETL) into the final destination.
) load data into the data warehouse, data lake, or other destination.
T – transform step
) raw data from each source system is combined and formatted in such a way that it’s useful to analysts and visualization tools.
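A minimal extract-transform-load sketch with pandas and SQLite; the file name, column names, and table name are all illustrative.

```python
import pandas as pd
import sqlite3

raw = pd.read_csv("events.csv")                 # E: extract from a source (illustrative file)
raw["day"] = pd.to_datetime(raw["ts"]).dt.date  # T: clean and derive fields
daily = raw.groupby("day").size().reset_index(name="events")

with sqlite3.connect("warehouse.db") as con:    # L: load into the destination
    daily.to_sql("daily_events", con, if_exists="replace", index=False)
```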
Orchestration ensures that the steps in a pipeline are run in the correct order and that
dependencies between steps are managed properly.
Pipeline steps (tasks) are always directed, meaning they start with a task or multiple
tasks and end with a specific task or tasks. This is required to guarantee a path of
execution.
Pipeline graphs must also be acyclic, meaning that a task cannot point back to a
previously completed task.
Pipelines are implemented as DAGs (Directed Acyclic Graphs).
Orchestration tool – Apache Airflow
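A minimal Airflow DAG sketch, assuming Apache Airflow 2.x: three placeholder tasks run in a directed, acyclic order, with the `>>` operator defining the edges.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(dag_id="simple_pipeline", start_date=datetime(2024, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    # Task bodies are placeholders; real tasks would call pipeline code.
    extract = PythonOperator(task_id="extract", python_callable=lambda: None)
    load = PythonOperator(task_id="load", python_callable=lambda: None)
    transform = PythonOperator(task_id="transform", python_callable=lambda: None)

    extract >> load >> transform   # an ELT-style ordering; edges form the DAG
```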
1 DATA
2 DATA-SETS
3 DATA QUALITY
4 DATA MODELS
5 ANALYSIS IN DATA SCIENCE
6 DATA PIPELINES AND PATTERNS
7 FURTHER READING
A data lake is where data is stored, but without the structure or query
optimization of a data warehouse.
It will contain a high volume of data as well as a variety of data types.
It is not optimized for querying such data in the interest of reporting and analysis.
Eg: a single data lake might contain a collection of blog posts stored as text files, flat
file extracts from a relational database, and JSON objects containing events
generated by sensors in an industrial system.
THANK YOU
TABLE OF CONTENTS
1 STATISTICAL DESCRIPTIONS OF DATA (SELF STUDY)
2 DATA PREPARATION
3 DATA AGGREGATION, SAMPLING
4 DATA SIMILARITY & DISSIMILARITY MEASURES
Statistical Descriptions of Data
MEAN
Common and effective numeric measure of the ”center” of a set of data is the
(arithmetic) mean.
$\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_N}{N} = \dfrac{1}{N}\sum_{i=1}^{N} x_i$
Weighted average
) Sometimes, each value xi in a set may be associated with a weight wi .
) The weights reflect the significance, importance, or occurrence frequency attached to
their respective values.
$\bar{x} = \dfrac{w_1 x_1 + w_2 x_2 + \cdots + w_N x_N}{w_1 + w_2 + \cdots + w_N} = \dfrac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i}$
Issue: Mean is sensitive to extreme (e.g., outlier) values.
Issue: For skewed (asymmetric) data, a better measure of the center of data is the
median.
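A quick NumPy check of these measures, using the data set from the worked example later in this module; note how the median is less affected by the outlier 110 than the mean.

```python
import numpy as np

x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])
w = np.ones(len(x))                 # equal weights reduce to the plain mean

print(np.mean(x))                   # 58.0
print(np.average(x, weights=w))     # 58.0 — weighted mean
print(np.median(x))                 # 54.0 — robust to the extreme value 110
```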
MEDIAN
If N is odd, then the median is the middle value of the ordered set.
If N is even, then the median is not unique; it is the two middlemost values and any
value in between.
If X is a numeric attribute, the median is taken as the average of the two middlemost
values.
MODE
Mode for a set of data is the value that occurs most frequently in the set.
Mode can be determined for qualitative and quantitative attributes.
Data sets with one, two, or three modes are respectively called unimodal, bimodal,
and trimodal. In general, a data set with two or more modes is multimodal.
SYMMETRIC DATA AND SKEWED DATA
In a unimodal frequency curve with perfect symmetric data distribution, the
mean, median, and mode are all at the same center value.
MIDRANGE
EXAMPLE
X = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
mean = (30 + 36 + 47 + 50 + 52 + 52 + 56 + 60 + 63 + 70 + 70 + 110) / 12 = 58
median = (52 + 56) / 2 = 54
mode = 52, 70
midrange = (30 + 110) / 2 = 70
DATA DISPERSION MEASURES
Range
Quartiles, and interquartile range
Five-number summary and boxplots
Variance and standard deviation
RANGE
The range of the set is the difference between the largest and smallest values.
QUANTILES
Quantiles are points taken at regular intervals of a data distribution, dividing it into
essentially equal-size consecutive sets.
The k th q-quantile for a given data distribution is the value x such that at most k / q of
the data values are less than x and at most (q − k )/q of the data values are more
than x, where k is an integer such that 0 < k < q.
There are q − 1 q-quantiles.
QUARTILES OR PERCENTILES
Three data points that split the data distribution into four equal parts
Each part represents one-fourth of the data distribution.
Q1 is the 25th percentile and Q3 is the 75th percentile
Quartiles give an indication of a distribution’s center, spread, and shape
INTERQUARTILE RANGE (IQR)
IQR = Q3 − Q1
Identifying outliers as values falling at least 1.5 × IQR above the third quartile or
below the first quartile.
FIVE-NUMBER SUMMARY
The five-number summary of a distribution consists of the median (Q2), the quartiles
Q1 and Q3 , and the smallest and largest individual observations.
Written in the order: Minimum, Q1, Median (Q2), Q3, Maximum.
Exercise
1. Sort: 10, 11, 11, 11, 12, 12, 13, 14, 14, 15, 17, 22
2. Median: (12 + 13)/2 = 12.5 = Q2
3. Q1 = 11 (25th percentile)
4. Q3 = 14.5 (75th percentile)
5. IQR = Q3 − Q1 = 3.5
6. Lower fence = Q1 − 1.5 × IQR = 5.75
7. Upper fence = Q3 + 1.5 × IQR = 19.75
Outlier: 22
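A quick check with NumPy. Note that quantile conventions differ: NumPy's default linear interpolation gives Q3 = 14.25 rather than the 14.5 from the hand method above, but 22 is flagged as an outlier either way.

```python
import numpy as np

x = np.array([10, 11, 11, 11, 12, 12, 13, 14, 14, 15, 17, 22])
q1, q2, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(q1, q2, q3, iqr)                # 11.0 12.5 14.25 3.25
print(x[(x < lower) | (x > upper)])   # outliers -> [22]
```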
Variance and standard deviation indicate how spread out a data distribution is.
STANDARD DEVIATION
Standard deviation σ of the observations is the square root of the variance σ².
A low standard deviation means that the data observations tend to be very close to
the mean.
A high standard deviation indicates that the data are spread out over a large range of
values.
σ measures spread about the mean and should be considered only when the mean is
chosen as the measure of center.
σ = 0 only when there is no spread, that is, when all observations have the same
value. Otherwise, σ > 0.
EXAMPLE
X = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
Q1 = 47 (3rd value), Q2 = 52 (6th value), Q3 = 63 (9th value)
IQR = 63 − 47 = 16
$\sigma^2 = \frac{1}{12}\,(30^2 + 36^2 + 47^2 + \cdots + 110^2) - 58^2 \approx 379.17$
$\sigma = \sqrt{379.17} \approx 19.47$
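The same values can be verified with NumPy, whose default `ddof=0` matches the population variance used above.

```python
import numpy as np

x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])
print(np.var(x))   # ≈ 379.17 (population variance, ddof=0)
print(np.std(x))   # ≈ 19.47
```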
TABLE OF CONTENTS
1 STATISTICAL DESCRIPTIONS OF DATA (SELF STUDY)
2 DATA PREPARATION
3 DATA AGGREGATION, SAMPLING
4 DATA SIMILARITY & DISSIMILARITY MEASURES
DATA PREPARATION
DATA CLEANSING
Focuses on removing errors in your data so your data becomes a true and consistent
representation of the processes it originates from.
Two types of errors
) Interpretation error
Age < 100
Height of a person is less than 7 feet.
Price is positive.
) Inconsistencies between data sources or against your company’s standardized values.
Female and F
Feet and meter
Dollars and Pounds
DATA CLEANSING
Missing values
MISSING VALUES
Use the attribute mean or median for all samples belonging to the same class as the
given tuple.
) For example, if classifying customers according to credit risk, we may replace the
missing value with the mean income value for customers in the same credit risk category
as that of the given tuple.
) If the data distribution for a given class is skewed, the median value is a better choice.
Use the most probable value to fill in the missing value.
) This may be determined with regression, inference-based tools using a Bayesian
formalism, or decision tree induction.
) For example, using the other customer attributes in the data set, we may construct a
decision tree to predict the missing values for income.
) Most popular strategy.
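A minimal imputation sketch, assuming scikit-learn: missing values are filled with the column mean (switch to `strategy="median"` for skewed attributes, per the note above).

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[25.0, 50000.0],
              [np.nan, 62000.0],   # missing age
              [40.0, np.nan]])     # missing income

imp = SimpleImputer(strategy="mean")
print(imp.fit_transform(X))        # NaNs replaced by column means
```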
DATA CLEANSING
Noisy data
Introduction to Data Mining, by Tan, Steinbach and Vipin Kumar (T1)
The Art of Data Science by Roger D Peng and Elizabeth Matsui (R1)
Data Mining: Concepts and Techniques, Third Edition by Jiawei Han and Micheline
Kamber Morgan Kaufmann Publishers, 2006 (T4)
On Being a Data Skeptic, O’Reilly Media, Inc., ISBN 9781449374310
THANK YOU
INTRODUCTION TO DATA SCIENCE
MODULE # 4 : DATA WRANGLING (CONTD.)
IDS Course Team
BITS Pilani
The instructor gratefully acknowledges the authors who made their course materials freely available online.
T4:Chapter 2.4
MEASURING DATA SIMILARITY AND DISSIMILARITY
Various proximity measures
• Data Matrix versus Dissimilarity Matrix
• Proximity Measures for Nominal Attributes
• Proximity Measures for Binary Attributes
• Symmetric Binary Attributes
• Asymmetric Binary Attributes
• Proximity Measures for Ordinal Attributes
• Proximity Measures for Numeric Data
• Proximity Measures for Mixed Types
• Cosine Similarity
Dissimilarity matrix
– n data points, but registers only the distance
– A triangular matrix
– Single mode
Calculate the dissimilarity matrix and similarity matrix for the ordinal
attributes
– where q is the number of attributes that equal 1 for both objects i and j,
– r is the number of attributes that equal 1 for object i but equal 0 for object j,
– s is the number of attributes that equal 0 for object i but equal 1 for object j,
– t is the number of attributes that equal 0 for both objects i and j.
– The total number of attributes is p, where p = q+r+s+t .
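A small sketch computing the counts q, r, s, t for two binary objects and the resulting proximities; the vectors are illustrative. The simple matching coefficient treats 0/0 matches as meaningful (symmetric attributes), while the Jaccard coefficient ignores them (asymmetric attributes).

```python
import numpy as np

i = np.array([1, 0, 1, 1, 0, 0, 1])
j = np.array([1, 0, 0, 1, 0, 1, 1])

q = np.sum((i == 1) & (j == 1))   # both 1
r = np.sum((i == 1) & (j == 0))   # 1 in i, 0 in j
s = np.sum((i == 0) & (j == 1))   # 0 in i, 1 in j
t = np.sum((i == 0) & (j == 0))   # both 0

smc = (q + t) / (q + r + s + t)   # symmetric binary similarity
jaccard = q / (q + r + s)         # asymmetric: 0/0 matches ignored
print(smc, jaccard)
```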
Where is given by
Visualization.ipynb
Techniques are
Discretization – Convert numeric data into discrete categories
Binarization – Convert numeric data into binary categories
Normalization – Scale numeric data to a specific range
Smoothing
• which works to remove noise from the data. Techniques include binning, regression, and
clustering.
• random method, simple moving average, random walk, simple exponential, and
exponential moving average (Will learn in ISM)
T4:Chapter 3.5
DISCRETIZATION
Unsupervised discretization
) Binning [ Equal-interval, Equal-frequency] (Top-down split)
) Histogram analysis (Top-down split)
) Clustering analysis (Top-down split or Bottom-up merge)
Supervised discretization
) Entropy-based discretization (Top-down split)
T1: Chapter 2.3.6
UNSUPERVISED DISCRETIZATION
1 Equal Width (interval) binning
) width = interval = (max − min) / #bins
) Highly sensitive to outliers.
) If outliers are present, the width of each bin is large, resulting in skewed data.
2 Equal Depth (frequency) binning
) Specify the number of values that have to be stored in each bin.
) Number of entries in each bin are equal.
) Some values can be stored in different bins.
T4: Chapter 3.4.6
BINNING EXAMPLE
Discretize the following data into 3 discrete categories using the binning technique:
70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 80, 81, 53, 56, 57, 63, 66, 67, 67, 67, 68, 69, 70, 70.
Sorted data: 53, 56, 57, 63, 66, 67, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 80, 81

Equal width: width = (81 − 53)/3 = 28/3 ≈ 9.33
) Bin 1 = [53, 62): 53, 56, 57
) Bin 2 = [62, 72): 63, 66, 67, 67, 67, 68, 69, 70, 70, 70, 70
) Bin 3 = [72, 81]: 72, 73, 75, 75, 76, 76, 78, 79, 80, 81

Equal depth: depth = 24/3 = 8
) Bin 1: 53, 56, 57, 63, 66, 67, 67, 67
) Bin 2: 68, 69, 70, 70, 70, 70, 72, 73
) Bin 3: 75, 75, 76, 76, 78, 79, 80, 81
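The same example can be reproduced with pandas: `cut()` gives equal-width bins and `qcut()` gives equal-depth (frequency) bins.

```python
import pandas as pd

data = [53, 56, 57, 63, 66, 67, 67, 67, 68, 69, 70, 70,
        70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 80, 81]

equal_width = pd.cut(data, bins=3)   # 3 intervals of equal width
equal_depth = pd.qcut(data, q=3)     # 3 bins with ~equal counts

print(pd.Series(equal_width).value_counts(sort=False))
print(pd.Series(equal_depth).value_counts(sort=False))
```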
Binning.ipynb
T1:Chapter 2.3.7
SIMPLE FUNCTIONAL TRANSFORMATION
Variable transformations should be applied with caution since they change the nature
of the data.
For instance, the transformation 1/x reduces the magnitude of values that are 1 or larger, but increases the magnitude of values between 0 and 1.
To understand the effect of a transformation, it is important to ask questions such as:
) Does the order need to be maintained?
) Does the transformation apply to all values, especially negative values and 0?
) What is the effect of the transformation on the values between 0 and 1?
Features with bigger magnitude dominate over the features with smaller magnitudes.
Good practice to have all variables within a similar scale.
Euclidean distances are sensitive to feature magnitude.
Gradient descent converges faster when all the variables are in the similar scale.
Feature scaling helps decrease the time of finding support vectors.
Scale the feature magnitude to a standard range like [0, 1] or [−1, +1] or any other range. Techniques:
) Min-Max normalization
) z-score normalization
) Normalization by decimal scaling
T4:Chapter 3.5.2
MIN-MAX SCALING
Min-max scaling squeezes (or stretches) all feature values to be within the range of
[0, 1].
Min-Max normalization preserves the relationships among the original data values.
It will encounter an ”out-of-bounds” error if a future input case for normalization falls
outside of the original data range for X .
Suppose that the minimum and maximum values for the attribute income are $12,000 and
$98,000, respectively. The new range is [0.0,1.0]. Apply min-max normalization to value of
$73,600.
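A worked value for this example, as a quick sketch:

```python
# Min-max scaling of $73,600 from the range [12000, 98000] into [0, 1].
x, x_min, x_max = 73600, 12000, 98000
x_scaled = (x - x_min) / (x_max - x_min)
print(round(x_scaled, 3))   # 0.716
```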
$\hat{x} = \dfrac{x - \mu(x)}{\sigma(x)}$
z-score normalization is useful when the actual minimum and maximum of attribute X
are unknown, or when there are outliers that dominate the min-max normalization.
Suppose that the mean and standard deviation of the values for the attribute income are
$54,000 and $16,000, respectively. Apply z-score normalization to value of $73,600.
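A worked value for this example, as a quick sketch:

```python
# z-score of $73,600 given mean $54,000 and standard deviation $16,000.
x, mu, sigma = 73600, 54000, 16000
print((x - mu) / sigma)   # 1.225
```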
Example 1 (CGPA): 2 → 2/10 = 0.2; 3 → 3/10 = 0.3
Example 2 (Bonus): 450 → 450/1000 = 0.45; 310 → 310/1000 = 0.31
Example 3 (Salary): 48000 → 48000/100000 = 0.48; 67000 → 67000/100000 = 0.67
Normalization.ipynb
CATEGORICAL ENCODING TECHNIQUES
One-hot encoding
) Disadvantages:
– Expands the feature space.
– Does not add extra information while encoding.
– Many dummy variables may be identical, introducing redundant information.
Label Encoding
) Disadvantages:
– Does not add extra information while encoding.
– Not suitable for linear models.
– Does not handle new categories in the test set automatically.
) Used for features which have multiple values in the domain, e.g., colour, protocol type.
LABEL ENCODING EXAMPLE
Assume an ordinal attribute for representing service of a restaurant: (Awful, Poor, OK,
Good, Great)
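A minimal sketch of both encodings with pandas: one-hot for a nominal attribute, and ordered integer codes for the ordinal "service" attribute above. The data frame is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "red", "blue"],
                   "service": ["Poor", "Great", "OK", "Good"]})

one_hot = pd.get_dummies(df["colour"], prefix="colour")   # one-hot encoding

order = ["Awful", "Poor", "OK", "Good", "Great"]          # ordinal label encoding
df["service_code"] = pd.Categorical(df["service"],
                                    categories=order, ordered=True).codes
print(one_hot)
print(df)   # Poor -> 1, Great -> 4, OK -> 2, Good -> 3
```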
Encoding.ipynb
Create a matrix where each column consists of a token and the cells show the counts
of the number of times a token appears.
Each token is now an attribute in standard data science parlance and each document
is an example (record).
Unstructured raw data is now transformed into a format that is recognized by machine
learning algorithms for training.
The matrix / table is referred to as Document Vector or Term Document Matrix (TDM)
As more new statements are added that have little in common, we end up with a very
sparse matrix.
We could also choose to use the term frequencies (TF) for each token instead of
simply counting the number of occurrences.
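A small term-document matrix can be built with scikit-learn's CountVectorizer (assuming scikit-learn ≥ 1.0 for `get_feature_names_out`); the documents are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["this is a book on data science",
        "data science uses machine learning",
        "machine learning learns from data"]

vec = CountVectorizer()
tdm = vec.fit_transform(docs)           # documents x tokens, sparse matrix
print(vec.get_feature_names_out())      # the token attributes
print(tdm.toarray())                    # token counts per document
```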
TERM DOCUMENT MATRIX – EXAMPLE
There are common words such as ”a,” ”this,” ”and,” and other similar
terms. They do not really convey specific meaning.
Most parts of speech such as articles, conjunctions, prepositions, and pronouns need
to be filtered before additional analysis is performed.
Such terms are called stop words.
Stop word filtering is usually the second step that follows immediately after
tokenization.
The document vector gets reduced significantly after applying standard English stop
word filtering.
Lexical substitution is the process of finding an alternative for a word in the context
of a clause.
It is used to align all the terms to the same term based on the field or subject which is
being analyzed.
This is especially important in areas with specific jargon, e.g., in clinical settings.
Example: common salt, NaCl, sodium chloride can be replaced by NaCl.
Domain specific
STEMMING
The most common stemming technique for text mining in English is the Porter
Stemming method.
Porter stemming works on a set of rules where the basic idea is to remove and/or
replace the suffix of words.
) Replace all terms which end in ’ies’ by ’y,’ such as replacing the term ”anomalies” with
”anomaly.”
) Stem all terms ending in ”s” by removing the ”s,” as in ”algorithms” to ”algorithm.”
While the Porter stemmer is extremely efficient, it can make mistakes that could prove
costly.
) ”arms” and ”army” would both be stemmed to ”arm,” which would result in somewhat
different contextual meanings.
Lemmatization convert a word to its root form, in a more grammatically sensitive way.
) While both stemming and lemmatization would reduce ”cars” to ”car,” lemmatization can
also bring back conjugated verbs to their unconjugated forms such as ”are” to ”be.”
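A small NLTK sketch contrasting the two (assumes nltk is installed and the 'wordnet' corpus has been fetched via `nltk.download("wordnet")`):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stem = PorterStemmer()
lemma = WordNetLemmatizer()

print(stem.stem("algorithms"))          # 'algorithm' — suffix stripped
print(stem.stem("cars"))                # 'car'
print(lemma.lemmatize("cars"))          # 'car'
print(lemma.lemmatize("are", pos="v"))  # 'be' — grammar-aware root form
```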
Lemmatization uses POS Tagging (Part of Speech Tagging) heavily.
POS Tagging is the process of attributing a grammatical label to every part of a
sentence.
) Eg: ”Game of Thrones is a television series.”
) POS Tagging:
({”game”:”NN”},{”of”:”IN”},{”thrones”:”NNS”},{”is”:”VBZ”},{”a”:”DT”},
{”television”:”NN”},{”series”:”NN”})
where: NN = noun, IN = preposition, NNS = noun in its plural form, VBZ = third-person
singular verb, and DT = determiner.
DEMO CODE
NLP.ipynb
Table of Contents
Learning Experience
Supervised Learning
Supervised Learning – Classification
Unsupervised Learning
Train / Validate / Test Sets
• Training set
• Approx 70 to 90% of the actual dataset is used for training the
algorithm.
• Used to learn the parameters of the model.
• Validation set
• Approx 10 to 20% of the training dataset is used for validating
the algorithm.
• Used to tune the parameters of the model.
• Testing set
• Approx 10 to 20% of the actual dataset is used for testing the
algorithm.
• Used to test against new data.
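A minimal sketch of a 70/15/15 split using two calls to scikit-learn's `train_test_split`; the data and proportions are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(200).reshape(100, 2), np.arange(100)

# First split off 30% as a temporary pool, then halve it into val/test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30,
                                                  random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50,
                                                random_state=0)
print(len(X_train), len(X_val), len(X_test))   # 70 15 15
```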
Classification
Classification Algorithms
• Decision Tree
• Random Forest [Discuss in ML course]
• Logistic Regression [Discuss in ML course]
• Naive Bayes Classifier [Discuss in ML course]
• Support Vector Machine [Discuss in ML course]
• Neural Network [Discuss in DL course]
Decision Tree Algorithm
Decision Trees
• Decision tree
• A flow-chart-like tree structure
• Internal node denotes a test on an attribute
• Branch represents an outcome of the test
• Leaf nodes represent class labels or class distribution
• Decision tree generation consists of two phases
• Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
• Tree pruning
• Identify and remove branches that reflect noise or outliers
• Use of decision tree: Classifying an unknown sample
• Test the attribute values of the sample against the decision tree
Decision Tree Construction: Hunt’s Algorithm
• Construct a tree T from a training set D.
• If all the records in D belong to class C or if D is sufficiently pure,
then the node is a leaf node and assigned class label C.
• Purity of a node is defined as the probability of corresponding class.
• If an attribute A does not partition D in a sufficiently pure manner,
then choose another attribute A’ and partition D according to A’
values.
• Recursively construct tree and sub-trees until
• All leaf nodes satisfy the minimum purity threshold.
• Tree cannot be further split.
• Maximum depth of tree is achieved.
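As a rough sketch (not the slide's pseudocode verbatim), the stopping criteria above correspond to hyper-parameters of scikit-learn's DecisionTreeClassifier; the dataset and parameter values are illustrative only:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(
    criterion="entropy",         # impurity measure used to choose splits
    max_depth=4,                 # maximum depth of the tree
    min_samples_split=5,         # do not split nodes with too few records
    min_impurity_decrease=0.01,  # minimum impurity reduction required to split
)
clf.fit(X, y)
print(clf.get_depth(), clf.score(X, y))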
Decision Tree Construction:
Example
Design Decisions
Design Decisions - Splitting Methods
• Binary Attributes
• Nominal Attributes
• Ordinal Attributes
• Continuous Attributes
Design Decisions - Selecting the best split
• p(i|t): fraction of records associated with node t belonging to class i
• The best split is selected based on the degree of impurity of the child
nodes
• Class distribution (0, 1) has the highest purity (zero impurity)
• Class distribution (0.5, 0.5) has the lowest purity (highest impurity)
Design Decisions - Selecting the best split

Entropy(t) = -\sum_{i=1}^{c} p(i \mid t)\, \log_2 p(i \mid t)

Gini(t) = 1 - \sum_{i=1}^{c} p(i \mid t)^2

Classification error (the third measure asked for below) is
Error(t) = 1 - \max_i\, p(i \mid t)

Exercise (for a given class distribution): Gini = ? Entropy = ? Error = ?
Design Decisions - Information Gain
• In general the different impurity measures are consistent
• Gain of a test condition: compare the impurity of the parent node with
the weighted impurity of the child nodes:

\Delta = I(\text{parent}) - \sum_{j=1}^{k} \frac{N(v_j)}{N}\, I(v_j)

• I(.) is the impurity measure of a given node, N is the total number of
records at the parent node, k is the number of attribute values, and
N(v_j) is the number of records associated with the child node v_j
Example: attribute B splits 12 records into node N1 with class counts (1, 4)
and node N2 with class counts (5, 2).
Gini for N1 = 1 - (1/5)^2 - (4/5)^2 = 1 - 1/25 - 16/25 = 0.32
Gini for N2 = 1 - (5/7)^2 - (2/7)^2 = 1 - 25/49 - 4/49 = 0.408
Weighted avg Gini = 0.32 x 5/12 + 0.408 x 7/12 = 0.37 (reproduced in code below)
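A small sketch of these measures written directly from the definitions (plain NumPy, no particular library API); it reproduces the worked example above:

import numpy as np

def entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log2(p)).sum())

def gini(counts):
    p = np.asarray(counts, dtype=float) / sum(counts)
    return float(1.0 - (p ** 2).sum())

def gain(parent_counts, child_counts, impurity=gini):
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * impurity(c) for c in child_counts)
    return impurity(parent_counts) - weighted

print(gini([1, 4]))                    # 0.32   (node N1)
print(gini([5, 2]))                    # ~0.408 (node N2)
# Parent counts (6, 6) follow from summing the two children.
print(gain([6, 6], [[1, 4], [5, 2]]))  # 0.5 - 0.37 = 0.13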
Design Decisions - Example –
Splitting Nominal Attributes
Design Decisions - Example –
Splitting Continuous Attributes
• Brute force method – high complexity
• Sort training records:
• Based on their annual income - O(N log N) complexity
• Candidate split positions are identified by taking the midpoints
between two adjacent sorted values
• Measure Gini index for each split position, and choose the one that
gives the lowest value
• Further optimization: consider only candidate split positions located
between two adjacent records with different class labels (a sketch of this
search follows)
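A sketch of the midpoint search described above, using the Gini index; the income values and labels are illustrative:

import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float((p ** 2).sum())

def best_split(values, labels):
    order = np.argsort(values)                 # O(N log N) sort
    v, y = np.asarray(values)[order], np.asarray(labels)[order]
    best_t, best_g = None, float("inf")
    for i in range(1, len(v)):
        if v[i] == v[i - 1]:
            continue                           # no valid cut between equal values
        t = (v[i] + v[i - 1]) / 2              # candidate midpoint
        w = i / len(v)
        g = w * gini(y[:i]) + (1 - w) * gini(y[i:])  # weighted child Gini
        if g < best_g:
            best_t, best_g = t, g
    return best_t, best_g

income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cls    = ["N", "N", "N", "Y", "Y", "Y", "N", "N", "N", "N"]
print(best_split(income, cls))                 # best threshold and its Gini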
Summary so far - Decision Tree
• Build a model (from past data) in the form of a tree structure that
predicts the value of the output variable based on the input variables
in the feature vector
• Each internal node (decision node) of a decision tree corresponds to one
feature in the feature vector
• Node types: Root node, Branch node, Leaf node
• Building a Decision Tree - Recursive partitioning
• Splits data into multiple subsets on the basis of feature values
• Root node – entire dataset
• First selects the feature which predicts the target class most strongly
• Splits the dataset into multiple partitions
• Stopping criteria
• All or most of the examples at a particular node have the same class
• All features have been used up in the partitioning
• The tree has grown to a pre-defined threshold limit
Example
Example
• Outcome = false for
all the cases where
Aptitude = Low,
irrespective of other
conditions
• So feature Aptitude
can be taken up as
the first node of the
decision tree
26
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example
• For Aptitude = HIGH, job offer condition is TRUE for all the cases
where Communication = Good.
• For cases where Communication = Bad, job offer condition is TRUE
for all the cases where CGPA = HIGH
• Use the decision tree below to predict the outcome for (Aptitude = High,
Communication = Bad, and CGPA = High)
Example: Entropy & Information
Gain Calculation - Level 1
Entropy & Information Gain
Calculation - Level 2
• We will have only one branch to navigate: Aptitude =
High
Entropy & Information Gain
Calculation - Level 2
Entropy values at the end of Level 2:
• 0.85 before the split
• 0.33 when CGPA is used for split
• 0.30 when Communication is used for split
• 0.80 when Programming skill is used for split
Information Gains
• After split with CGPA = 0.52
• After split with Communication = 0.55
• After split with Programming skill = 0.05
Entropy & Information Gain
Calculation - Level 3
Entropy values at the end of Level 3:
• 0.81 before the split
• 0 when CGPA is used for the split
• 0.50 when Programming skill is used for the split
Entropy & Information Gain
Calculation - Level 2
Problems with information gain
Decision Tree Algorithms
• Iterative Dichotomiser 3 (ID3)
– Entropy-based splitting criterion
– Gives an exhaustive decision tree
– Handles categorical inputs
• C4.5
– Entropy-based splitting criterion
– Handles missing data
– Handles categorical and continuous inputs
– Uses tree pruning to address the over-fitting problem of ID3
• CART (Classification and Regression Tree)
– Uses the Gini index
– Handles categorical and continuous inputs
Backup
Issues in Decision Tree
Learning
• Handling training examples with missing data, attributes
with differing costs
• Model overfitting
• Causes of model overfitting
• Estimating generalization error
• Handling overfitting
• Evaluating classifier performance
Alternative measures for selecting
attributes – E.g.: Gain Ratio
• Impurity measures such as entropy and Gini index tend to favor
attributes that have a large number of distinct values
• Even in a less extreme situation, a test condition that results in a large
number of outcomes may not be desirable because the number of
records associated with each partition is too small to enable us to
make any reliable predictions
• Solution 1: restrict the test conditions to binary splits only (CART)
• Solution 2: modify the splitting criterion to take into account the
number of outcomes produced by the attribute test condition
• In the C4.5 decision tree algorithm, a splitting criterion known as gain
ratio is used to determine the goodness of a split (defined below)
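For reference, a math sketch of the standard C4.5 definition, in the notation of the gain formula above (Δ is the information gain and P(v_i) = N(v_i)/N is the fraction of records sent to child v_i):

\mathrm{Gain\ Ratio} = \frac{\Delta}{\mathrm{Split\ Info}},
\qquad
\mathrm{Split\ Info} = -\sum_{i=1}^{k} P(v_i)\, \log_2 P(v_i)

A large number of small partitions inflates Split Info, so the ratio penalizes attributes with many distinct values.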
Evaluation Metrics
• Accuracy is the fraction of predictions our model got right. Formally:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + FN + FP + TN)

where the confusion matrix is:

                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL    Class=Yes    a (TP)      b (FN)
CLASS     Class=No     c (FP)      d (TN)

Example A:
                       Class=Yes   Class=No
ACTUAL    Class=Yes    0           10
CLASS     Class=No     0           990
Accuracy: 990/1000 = 99%

Example B:
                       Class=Yes   Class=No
ACTUAL    Class=Yes    10          0
CLASS     Class=No     500         490
Accuracy: 500/1000 = 50%

Model A never predicts a single positive case yet scores 99%: with imbalanced
classes, accuracy alone is misleading (recomputed in the sketch below).
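A sketch that recomputes the two examples; the counts are taken directly from models A and B above:

def accuracy(tp, fn, fp, tn):
    return (tp + tn) / (tp + fn + fp + tn)

print(accuracy(tp=0, fn=10, fp=0, tn=990))    # model A: 0.99
print(accuracy(tp=10, fn=0, fp=500, tn=490))  # model B: 0.50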
• ROC Curve
• ROC curve can be used to select a threshold for a classifier, which maximizes the
true positives and in turn minimizes the false positives.
• ROC Curves help determine the exact trade-off between the true positive rate
and false-positive rate for a model using different measures of probability
thresholds.
• ROC curves are more appropriate to be used when the observations present are
balanced between each class.
[Figure: overlapping distributions of test results for healthy and diseased
patients. Moving the decision threshold ("call these patients negative /
positive") trades true positives and true negatives against false positives
and false negatives.]
[Figure: ROC curve with True Positive Rate (Recall) on the y-axis, from 0% to
100%, and False Positive Rate (1 - specificity) on the x-axis.]
Feature Engineering for Text Data
N-Grams
Example
Term Frequency-Inverse Document Frequency
• Consider a web search problem where the user types in some keywords
and the search engine extracts all the documents (essentially, web pages)
that contain these keywords.
• How does the search engine know which web pages to serve up?
• In addition to using network rank or page rank, the search engine also runs
some form of text mining to identify the most relevant web pages.
– Example: the user types in the following keywords: "RapidMiner books
that describe text mining."
• In this case, the search engine runs on the following basic logic:
– Give a high weightage to keywords that are relatively rare.
– Give a high weightage to web pages that contain a large number of
instances of the rare keywords.
(A code sketch of this weighting follows.)
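A minimal sketch with scikit-learn's TfidfVectorizer (the toy documents are made up). It implements exactly this logic — under the common definition tf-idf(t, d) = tf(t, d) × log(N / df(t)), rare terms get a high IDF weight, and documents dense in those rare terms score highest:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "RapidMiner books that describe text mining",
    "text mining with machine learning",
    "cooking recipes and kitchen tips",
]
vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)          # one row per document
for term, weight in zip(vec.get_feature_names_out(), tfidf.toarray()[0]):
    if weight > 0:
        print(f"{term}: {weight:.3f}")   # rarer terms get higher weights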
Example
[Worked TF-IDF example figures]
Backup
How to Construct an ROC curve
Ten instances, sorted by predicted probability P(+|x); 5 are positive (+) and 5 negative (-):

Class        +     -     +     -     -     -     +     -     +     +
P(+|x)      0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95

Sweep the threshold over the sorted scores and count TP, FP, TN, FN at each cut:

Threshold >= 0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP            5     4     4     3     3     3     3     2     2     1     0
FP            5     5     4     4     3     2     1     1     0     0     0
TN            0     0     1     1     2     3     4     4     5     5     5
FN            0     1     1     2     2     2     2     3     3     4     5
TPR           1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
FPR           1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

(TPR = TP/(TP+FN); FPR = FP/(FP+TN). The FPR row, implied by the FP and TN
counts above, is added here for completeness. A code sketch follows.)
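A sketch that rebuilds the TPR/FPR columns from the ten (class, score) pairs above; scikit-learn's roc_curve performs the same threshold sweep (the exact set of thresholds it reports can differ slightly across versions):

import numpy as np
from sklearn.metrics import auc, roc_curve

y_true  = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]   # the + / - labels above
y_score = [0.25, 0.43, 0.53, 0.76, 0.85, 0.85, 0.85, 0.87, 0.93, 0.95]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(np.c_[thresholds, tpr, fpr])          # one row per distinct threshold
print("AUC =", auc(fpr, tpr))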
ROC Curve:
[Figure: four ROC curves of True Positive Rate vs. False Positive Rate.
AUC = 100% (perfect classifier), AUC = 50% (random guessing, the diagonal),
and two intermediate models with AUC = 90% and AUC = 65%.]
TABLE OF CONTENTS
1 FEATURE ENGINEERING
2 FEATURE SELECTION
3 FILTER METHODS
Pearson’s Correlation Coefficient
Chi-Squared Statistic
Information Theory Metrics
Gini Index
4 WRAPPER METHODS
5 EVALUATION OF FEATURE SELECTION
6 FEATURE ENGINEERING FOR TEXT
FEATURES

Building Area   Common Area   Type of Flooring       Distance From   Sale Price per
                                                     Bus Depot       square feet
11345           350           Marble                 16503.22        6,715
2000            1334          Vitrified Tiles        16321.19        3,230
2544            924           Wood Vitrified Tiles   15619.92        6,588
FEATURE ENGINEERING
MOTIVATION FOR FEATURE ENGINEERING
HUGHES PHENOMENON
Given a fixed number of data points, the performance of a regressor or a classifier first
increases but later decreases as the number of dimensions of the data increases.
FEATURE CREATION
Create new attributes that can capture the important information in a dataset much more
efficiently than the original attributes.
Two general methodologies:
) Feature Extraction
) Feature Construction
  • Create dummy features
  • Create derived features
FEATURE EXTRACTION
Examples of extracted features:
) Bag of Words
) Image Features
) Facial Landmarks
FEATURE CONSTRUCTION

Customer ID   Gender   Payment Method
C001          Female   Online banking
C002          Male     Online banking
C003          Female   Credit card
C004          Male     Debit Card

(A pandas sketch of dummy encoding follows.)
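A sketch of turning this table's categorical columns into dummy (one-hot) features with pandas:

import pandas as pd

df = pd.DataFrame({
    "Customer ID": ["C001", "C002", "C003", "C004"],
    "Gender": ["Female", "Male", "Female", "Male"],
    "Payment Method": ["Online banking", "Online banking", "Credit card", "Debit Card"],
})
dummies = pd.get_dummies(df, columns=["Gender", "Payment Method"])
print(dummies)  # one indicator column per category value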
FEATURE SELECTION
Feature selection is the process of separating relevant and important features from
irrelevant or redundant ones.
It intends to select the subset of attributes or features that makes the most meaningful
contribution to a machine learning activity.
FACTORS AFFECTING FEATURE SELECTION
Feature Relevance
) In supervised algorithms, each feature should contribute information towards
predicting the class label; otherwise it is irrelevant.
) Need to distinguish strongly relevant, moderately relevant, and weakly relevant features.
) In unsupervised algorithms there is no labelled data; the algorithm identifies the
irrelevant features during the grouping process.
Feature Redundancy
) A feature may contribute information that is similar to the information contributed by
one or more other features.
) All features with potential redundancy are candidates for rejection in the final feature
subset.
) If two features X1 and X2 are highly correlated, then one of them is redundant, since
by the correlation measure they carry the same information.
FEATURE SUBSET SELECTION
Given: an initial set of D features F = {f_1, f_2, f_3, ..., f_D} and a target class label T.
Find: the minimum subset F' = {f'_1, f'_2, f'_3, ..., f'_M} that achieves maximum
classification performance, where F' ⊆ F.
There are 2^D possible subsets.
Need a criterion to decide which subset is the best:
) The classifier based on these M features has the lowest probability of error of all such
classifiers.
Evaluating all 2^D subsets is time consuming and expensive (see the brute-force sketch below).
Use heuristics to reduce the search space.
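To make the 2^D blow-up concrete, a brute-force sketch (the dataset and classifier are arbitrary illustrative choices; this is feasible only for tiny D):

from itertools import combinations

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
D = X.shape[1]
best_score, best_subset = 0.0, None
for k in range(1, D + 1):
    for subset in combinations(range(D), k):  # all 2^D - 1 non-empty subsets
        score = cross_val_score(
            LogisticRegression(max_iter=500), X[:, list(subset)], y, cv=3).mean()
        if score > best_score:
            best_score, best_subset = score, subset
print(best_score, best_subset)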
STEPS IN FEATURE SELECTION
F E AT U R E S E L E C T I O N A P P R OA C H E S
• Filter approaches: Features are selected before the data mining algorithm is run,
using some approach that is independent of the data mining task. For example, we
might select sets of attributes whose pairwise correlation is as low as possible.
• Wrapper approaches: These methods use the target data mining algorithm as a
black box to find the best subset of attributes, in a way similar to that of the ideal
algorithm described above, but typically without enumerating all possible subsets.
FILTER METHODS
The predictive power of each individual feature is evaluated:
Rank each feature according to some uni-variate metric and select the
highest-ranking features.
Compute a score for each feature.
The score should reflect the discriminative power of each feature.
Advantages
) Fast
) Provide a generically useful feature set.
Disadvantages
) Tend to cause higher error than wrapper methods.
) A feature that is not useful by itself can be very useful when combined with others;
filter methods can miss it.
FILTER METHODS
Algorithm
Given input: a large feature set F.
1 Identify a candidate subset S ⊆ F.
2 While not stop_criterion():
  1 Evaluate the utility function J using S.
  2 Adapt S.
3 Return S.
(A scikit-learn sketch of this pattern follows.)
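A sketch of this pattern in scikit-learn: score each feature with a uni-variate statistic (here the ANOVA F-statistic, one of the filters listed below) and keep the top k:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.scores_)        # one discriminative-power score per feature
print(selector.get_support())  # mask of the k selected features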
TYPES OF FILTERS
Correlation-based
) Pearson correlation
) Spearman rank correlation
) Kendall concordance
Information-theoretic metrics
) Mutual Information (Information Gain)
) Gain Ratio
Statistical/probabilistic independence metrics
) Chi-square statistic
) F-statistic
) Welch’s statistic
Others
) Gini index
) Fisher score
) Cramer’s V
WHICH FILTER?
PEARSON’S CORRELATION COEFFICIENT
Used to measure the strength of association between two continuous features.
Both positive and negative correlations are useful.
We use the Pearson correlation to compute the correlation matrix or heat map.
Steps
1 Compute the Pearson correlation coefficient for each feature.
2 Sort according to the score.
3 Retain the highest-ranked features, discard the lowest-ranked.
Limitations
Pearson assumes all features are independent of one another.
Pearson identifies only linear correlations:
) Positive linear relationship – in children, as height increases, weight also increases.
) Negative linear relationship – as a vehicle increases its speed, the time taken to
travel a fixed distance decreases.
(A pandas sketch follows.)
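A sketch of Pearson-based scoring with pandas; the small dataset is made up to mimic the ice cream / sunglasses example that follows:

import pandas as pd

df = pd.DataFrame({
    "temperature":      [21, 24, 27, 30, 33, 36],
    "ice_cream_sales":  [110, 135, 160, 195, 230, 270],
    "sunglasses_sales": [40, 52, 61, 75, 88, 102],
})
corr = df.corr(method="pearson")   # full correlation matrix (heat-map input)
print(corr)
print(corr.loc["ice_cream_sales", "sunglasses_sales"])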
INTERPRETATION OF THE PEARSON CORRELATION
−1 ≤ r_{A,B} ≤ +1
If r_{A,B} > 0
) A and B are positively correlated.
) The values of A increase as the values of B increase.
) The higher the value, the stronger the correlation (i.e., the more each attribute implies
the other).
If r_{A,B} < 0
) A and B are negatively correlated.
) The values of one attribute increase as the values of the other attribute decrease.
If r_{A,B} = 0
) A and B are uncorrelated: there is no linear relationship between them (this does not
by itself imply independence).
If r_{A,B} = −1 or +1
) The linear fit is perfect: all data points lie on one line.
Use a scatter plot for visualizing.
I N T R OD U CT ION TO D AT A S C I E N C E 32 / 83
P E A R S O N ’ S C O R R E L AT I O N C O E F F I C I E N T
P E A R S O N ’ S C O R R E L AT I O N E X A M P L E
Check whether sale of ice creams and sun glasses are related?
I N T R OD U CT ION TO D AT A S C I E N C E 34 / 83
P E A R S O N ’ S C O R R E L AT I O N E X A M P L E
I N T R OD U CT ION TO D AT A S C I E N C E 35 / 83
P E A R S O N ’ S C O R R E L AT I O N E X A M P L E
I N T R OD U CT ION TO D AT A S C I E N C E 36 / 83
D EMO C ODE
PearsonExample.py
CorrelationCoeffecient.ipynb
PearsonCorrelation Covid Data.ipynb
χ² STATISTIC
The chi-square test of independence allows us to see whether two categorical
variables are related.
The test statistic is compared against the χ² distribution with r degrees of freedom (df),
whose probability density function is

f(x) = \frac{x^{r/2 - 1}\, e^{-x/2}}{2^{r/2}\, \Gamma(r/2)}, \quad x > 0.
χ 2 S T AT I S T I C E X A M P L E
Let’s say you want to know if gender has anything to do with political party preference. You
poll 440 voters in a simple random sample to find out which political party they prefer. The
results of the survey are shown in the table below:
χ² STATISTIC EXAMPLE
A group of customers was classified in terms of personality (introvert, extrovert or normal)
and in terms of color preference (red, yellow or green), with the purpose of seeing whether
there is an association (relationship) between personality and color preference.
Data was collected from 400 customers and presented in the 3 (rows) × 3 (cols)
contingency table below.
χ² STATISTIC EXAMPLE
Step 1:
Set up the hypotheses and determine the level of significance.
Null hypothesis (H0): Color preference is independent of personality.
Alternative hypothesis (HA): Color preference is dependent on personality.
Level of significance: specifies the probability of error; generally set at 5%:
α = 0.05
Assume that H0 is true unless the evidence suggests otherwise, in which case we
reject H0 and accept HA.
χ² STATISTIC EXAMPLE
Step 2:
Compute the expected count for each cell (under H0):
E = (row total × column total) / grand total
χ² STATISTIC EXAMPLE
Step 3:
Compute the chi-squared statistic over all cells, where O is the observed count and E
the expected count:
χ² = Σ (O − E)² / E
χ² STATISTIC EXAMPLE
Step 4:
Compute the degrees of freedom:
df = (r − 1)(c − 1)
where r is the number of categories in one variable and c is the number of categories
in the other.
df = (3 − 1) × (3 − 1) = 4 (for the 3 × 3 contingency table)
Step 5:
From the χ² table, the critical value = 9.488 (df = 4, α = 0.05).
Since the calculated value of χ² > the critical value of χ², H0 is rejected and HA is accepted.
Interpretation: there is sufficient evidence to say that color preference depends on
personality.
(A SciPy sketch of this test follows.)
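A sketch of the same test with SciPy's chi2_contingency; the 3×3 counts below are placeholders (the slide's actual observed counts are not recoverable from the export):

from scipy.stats import chi2_contingency

observed = [
    [20, 30, 50],   # introvert: red, yellow, green
    [60, 40, 30],   # extrovert
    [70, 50, 50],   # normal (placeholder counts summing to 400)
]
chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p, dof)   # reject H0 when p < 0.05
print(expected)       # the Step-2 expected counts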
D EMO C ODE
ChiSquareGeneral.ipynb
ChiSquareCovidExample.ipynb
INFORMATION THEORY METRICS
INFORMATION GAIN
Compute the Information Gain for the attribute Travel Cost wrt Transport Mode.
Gender Car Ownership Travel Cost Income Level Transport Mode
Male 0 Cheap Low Bus
Male 1 Cheap Medium Bus
Female 0 Cheap Low Bus
Male 1 Cheap Medium Bus
Female 1 Expensive High Car
Male 2 Expensive Medium Car
Female 2 Expensive High Car
Female 1 Cheap Medium Train
Male 0 Standard Medium Train
Female 1 Standard Medium Train
Step 1: Compute the entropy of the target. Class counts:
Transport Mode:   Bus = 4, Car = 3, Train = 3   (10 records)
Entropy(T) = −(0.4 log₂ 0.4 + 0.3 log₂ 0.3 + 0.3 log₂ 0.3) ≈ 1.571
Step 2: Compute the entropy of the target given one feature.

Travel Cost    Bus   Train   Car
Cheap           4      1      0
Expensive       0      0      3
Standard        0      2      0
Step 3: Weight the conditional entropies by partition size and subtract.
Entropy(T | Travel Cost) = (5/10) × 0.722 + (3/10) × 0 + (2/10) × 0 ≈ 0.361
Information Gain = 1.571 − 0.361 ≈ 1.210
D EMO C ODE
InformationGainCovidData.ipynb
GINI INDEX
Compute the Gini Index for the feature Travel Cost wrt Transport Mode, using the same
ten-record dataset as in the Information Gain example above.
Gini(Cheap)     = 1 − (4/5)² − (1/5)² = 0.32
Gini(Expensive) = 1 − (3/3)² = 0
Gini(Standard)  = 1 − (2/2)² = 0
Weighted Gini(Travel Cost) = (5/10) × 0.32 + (3/10) × 0 + (2/10) × 0 = 0.16
GINI INDEX INTERPRETATION
A Gini index of 0 indicates a pure node; the lower the weighted Gini of a split, the
better the feature.
WRAPPER METHODS
Wrappers require some method to search the space of all possible subsets of
features, assessing their quality by learning and evaluating a classifier with that
feature subset.
The feature selection process is based on a specific machine learning algorithm that
we are trying to fit on a given dataset.
It typically follows a greedy search approach, evaluating candidate combinations of
features against the evaluation criterion.
The wrapper methods usually result in better predictive accuracy than filter methods.
WRAPPER METHODS
Greedy-search-based algorithms.
Performance depends on the machine learning model chosen.
Sequential feature selection algorithms add or remove one feature at a time, based on
the classifier performance, until a desired criterion is met.
Two methods
) Sequential Forward Selection (SFS)
) Sequential Backward Selection (SBS)
Advantages
) Highest performance
Disadvantages
) Computationally expensive
) Memory intensive
WRAPPER METHODS - TYPES
Forward selection
) Starts with one predictor and adds more iteratively.
) At each subsequent iteration, the best of the remaining original predictors is added,
based on performance criteria.
) SequentialFeatureSelector class from mlxtend
Backward elimination
) Starts with all predictors and eliminates them one by one iteratively.
) One of the most popular algorithms is Recursive Feature Elimination (RFE), which
eliminates less important predictors based on a feature-importance ranking.
) RFE class from sklearn (see the sketch below)
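A sketch of backward elimination with the RFE class mentioned above; the dataset and estimator are illustrative:

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
rfe = RFE(estimator=LogisticRegression(max_iter=500), n_features_to_select=2)
rfe.fit(X, y)
print(rfe.support_)   # mask of the surviving features
print(rfe.ranking_)   # 1 = selected; larger = eliminated earlier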
SEQUENTIAL FORWARD SELECTION
SFS EXAMPLE – WINE DATA
SEQUENTIAL BACKWARD SELECTION
SBS EXAMPLE – WINE DATA
EMBEDDED METHODS
Introduction to Data Mining, by Tan, Steinbach and Vipin Kumar (T1)
Data Mining: Concepts and Techniques, Third Edition, by Jiawei Han and Micheline
Kamber, Morgan Kaufmann Publishers, 2006 (T4)
T HANK YOU