
INTRODUCTION TO DATA SCIENCE

MODULE # 1 : INTRODUCTION
IDS Course Team
BITS Pilani
The instructor gratefully acknowledges the authors who made their course materials freely available online.



TABLE OF CONTENTS

1 COURSE LOGISTICS
2 FUNDAMENTALS OF DATA SCIENCE
3 DATA SCIENCE REAL WORLD APPLICATIONS
4 DATA SCIENCE CHALLENGES
5 DATA SCIENCE TEAMS
6 SOFTWARE ENGINEERING FOR DATA SCIENCE
7 FURTHER READING



COURSE OBJECTIVES

CO1 Gain a basic understanding of the role of Data Science in real-world scenarios across business, industry, and government.

CO2 Understand various roles and stages in a Data Science Project and ethical issues to be
considered.

CO3 Explore the processes, tools and technologies for collection and analysis of
structured and unstructured data.

CO4 Appreciate the importance of techniques like data visualization and storytelling with data for effective presentation of outcomes to stakeholders.

CO5 Understand techniques of preparing real-world data for data analytics.


CO6 Implement data analytic techniques for discovering interesting patterns from data.
COURSE STRUCTURE
M1 Introduction to Data Science
M2 Data Analytics
M3 Data and Data Models
M4 Data Wrangling
M5 Feature Engineering
M6 Classification and Prediction
M7 Association Analysis
M8 Clustering
M9 Anomaly Detection
M10 Storytelling with Data
M11 Ethics for Data Science
MODULES OVERVIEW
Part I (Data Science): Module 2: Process & Analytics; Module 10: Storytelling; Module 11: Ethics
Part II (Data): Module 3: Data & Sources; Module 3: Data Pipelines
Part III (Preprocessing): Module 4: Data Wrangling; Module 5: Feature Engg
Part IV (Modeling & Evaluation): Module 6: Classification; Module 7: Association Mining; Module 8: Clustering; Module 9: Anomaly Detection


TEXT BOOKS

T1 Introduction to Data Mining, by Tan, Steinbach and Vipin Kumar


T2 Introducing Data Science by Cielen, Meysman and Ali
T3 Storytelling with Data: A Data Visualization Guide for Business Professionals, by Cole Nussbaumer Knaflic; Wiley
T4 Data Mining: Concepts and Techniques, Third Edition, by Jiawei Han, Micheline Kamber and Jian Pei; Morgan Kaufmann Publishers, 2011



REFERENCE BOOKS

R1 The Art of Data Science by Roger D Peng and Elizabeth Matsui


R2 Ethics and Data Science by DJ Patil, Hilary Mason, Mike Loukides
R3 Python Data Science Handbook: Essential tools for working with data by Jake
VanderPlas
R4 KDD, SEMMA and CRISP-DM: A Parallel Overview, Ana Azevedo and M.F. Santos, IADIS-DM, 2008



EVALUATION SCHEDULE

No   Name                 Type    Duration      Weight  Remarks
EC1  Quiz I               Online  1 hr          5%      Average of both quizzes
     Quiz II              Online  1 hr          5%
     Assignment Part I    Online  4 weeks       10%     Sum of both assignments
     Assignment Part II   Online  4 weeks       15%
EC2  Mid-sem              Online  As announced  30%
EC3  Compre-sem Regular   Online  As announced  40%


LEARNING PLATFORM

The most relevant and up-to-date information is on Canvas:


Handout
Schedule for Webinar, Quiz, and Assignments.
Session Slide Deck
Demo Lab Sheets
Quiz-I, Quiz-II
Assignment I, Assignment II

Video recordings will be available on the lecture delivery platform.



PLATFORM / DATASET

Platform
) Python / Jupyter Notebook / Google Colab
Dataset
) Datasets as we deem appropriate.
Webinar
) 4 webinars
) Either Lab modules will be explained or numerical problems will be solved.
) As per schedule





WHY DATA SCIENCE?

”Data Science is the sexiest job in the 21st century” – IBM.


Data Science is one of the fastest growing fields in the world.
According to the U.S. Bureau of Labor Statistics, 11.5 million new data science jobs will be created by the year 2026.
Even amid the COVID-19 situation and the ongoing talent shortage, data science is unlikely to see a dip as a career option.



WHY DATA SCIENCE?

In India, the average salary of a data scientist as of January 2020 is Rs. 10L/yr. – Glassdoor, 2020.
The rise of data science as a career choice in 2020 also brings growth in its various job roles:
) Data Engineer
) Data Administrator
) Machine Learning Engineer

) Statistician

) Data and Analytics Manager



DATA SCIENCE

Data Science is the study of data.


Data Science is an art of uncovering insights and trends that are hiding behind the
data.
Data Science helps to translate data into a story. Storytelling helps in uncovering insights, and the insights help in making decisions or strategic choices.
Data Science is the process of using data to understand different things.
) Requires a major effort of preparing, cleaning, scrubbing, or standardizing the data.
) Algorithms are then applied to crunch the pre-processed data.
) This process is iterative and requires analysts’ awareness of the best practices.
) The most important aspect of data science is interpreting the results of the analysis in order to make decisions.



DATA SCIENCE – MULTIPLE DISCIPLINES

Data Science draws on multiple disciplines: math and statistics, software development, CS/IT and machine learning, domain and business knowledge, and research.


NEED OF DATA SCIENCE

Data deluge, tons of data.


Powerful algorithms.
Open software and tools.
Computational speed, accuracy and cost.
Data storage in terms of capacity and cost.



DATA SCIENCE, AI AND ML

Artificial Intelligence
) AI involves making machines capable of mimicking human behavior, particularly
cognitive functions like facial recognition, automated driving, sorting mail based on
postal code.
Machine Learning
) Considered a sub-field of or one of the tools of AI.
) Involves providing machines with the capability of learning from experience.
) Experience for machines comes in the form of data.

Data Science
) Data science is the application of machine learning, artificial intelligence, and other
quantitative fields like statistics, visualization, and mathematics to uncover insights from
data to enable better decision making.



DATA SCIENCE, AI AND ML

[Figure: relationship between AI, ML, and Data Science]
https://www.sciencedirect.com/topics/physics-and-astronomy/artificial-intelligence


USE CASES OF DATA SCIENCE

[Figure: use cases of data science. Source: DataFlair]
DATA SCIENCE IN FACEBOOK
Social Analytics
Utilizes quantitative research to gain insights about the social interactions among people.
Makes use of deep learning, facial recognition, and text analysis.
In facial recognition, it uses powerful neural networks to classify faces in the
photographs.
In text analysis, it uses “DeepText” to understand people’s interest and aligns
photographs with texts.
It uses deep learning for targeted advertising.
Using the insights gained from data, it clusters users based on their preferences and
provides them with the advertisements that appeal to them.
DATA SCIENCE IN AMAZON

Improving E-Commerce Experience


Personalized recommendation
) Predictive analytics (a personalized recommender system) to increase customer
satisfaction.
) Purchase history of customers, other customer suggestions, and user ratings are analyzed to recommend products.


Anticipatory shipping model
) Predict the products that are most likely to be purchased by its users.
) Analyzes pattern of customer purchases and keeps products in the nearest warehouse
which the customers may utilize in the future.

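To make the idea concrete, here is a minimal sketch of item-based similarity, one notion underlying such recommenders. The ratings matrix is made up; Amazon's actual system is far richer and proprietary.

# Toy item-based recommender: suggest items similar to what a user rated highly.
# Hypothetical data; real systems use far richer signals (purchases, browsing, etc.).
import numpy as np

# Rows = users, columns = items; 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
], dtype=float)

def cosine_sim(a, b):
    """Cosine similarity between two rating vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return a @ b / denom if denom else 0.0

n_items = ratings.shape[1]
# Item-item similarity matrix computed from the rating columns.
sim = np.array([[cosine_sim(ratings[:, i], ratings[:, j])
                 for j in range(n_items)] for i in range(n_items)])

user = 0
liked = int(ratings[user].argmax())          # item this user rated highest
candidates = np.argsort(sim[liked])[::-1]    # most similar items first
recs = [i for i in candidates if i != liked and ratings[user, i] == 0]
print(f"Because user {user} liked item {liked}, recommend items: {recs}")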


DATA SCIENCE IN AMAZON – CONTD...

Improving E-Commerce Experience


Price discounts
) Using parameters such as the user activity, order history, prices offered by the
competitors, product availability, etc., Amazon provides discounts on popular items and
earns profits on less popular items.
Fraud Detection
) Detect fraud sellers and fraudulent purchases.
Improving Packaging Efficiency
) Optimizes packaging of products in warehouses and increases efficiency of packaging lines through the data collected from the workers.



DATA SCIENCE IN UBER
Improving Rider Experience
Uber maintains large database of drivers, customers, and several other records.
Makes extensive use of Big Data and crowdsourcing to derive insights and provide
best services to its customers.
Dynamic pricing
) Use of Big Data and data science to calculate fares based on specific parameters.
) Uber matches customer profile with the most suitable driver and charges them based on the time it takes to cover the distance rather than the distance itself.
) The time of travel is calculated using algorithms that make use of data related to traffic density and weather conditions.


) When the demand is higher (more riders) than the supply (fewer drivers), the price of the ride goes up.

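A toy sketch of the idea: the fare grows with estimated travel time (itself a function of traffic and weather) and with the demand/supply ratio. All names and numbers are illustrative, not Uber's actual model.

# Hypothetical surge-pricing sketch: fare grows with the demand/supply imbalance
# and with estimated travel time (which depends on traffic and weather).

def estimated_minutes(distance_km: float, traffic: float, weather: float) -> float:
    """Travel-time estimate; traffic/weather factors >= 1.0 slow the trip down."""
    base_speed_kmph = 30.0
    return distance_km / base_speed_kmph * 60 * traffic * weather

def fare(distance_km, riders, drivers, traffic=1.0, weather=1.0,
         base=50.0, per_minute=5.0):
    minutes = estimated_minutes(distance_km, traffic, weather)
    surge = max(1.0, riders / max(drivers, 1))   # demand > supply => price up
    return (base + per_minute * minutes) * surge

print(fare(10, riders=120, drivers=80, traffic=1.4))  # surge multiplier = 1.5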


DATA SCIENCE IN BANK OF AMERICA
Improving Customer Experience
Erica – a virtual financial assistant (BoA)
) Erica serves as a customer advisor to over 45 million users around the world.
) Erica makes use of Speech Recognition to take customer inputs.
Fraud detection
) Uses data science and predictive analytics to detect frauds in payments, insurance,
credit cards, and customer information.
Risk modeling
) Use data science for risk modeling to regulate financial activities.
Customer segmentation
) Segments customers into high-value and low-value segments.
) Data scientists make use of clustering, logistic regression, and decision trees to help the bank understand the Customer Lifetime Value (CLV) and group customers into the appropriate segments.
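A minimal segmentation sketch with k-means on synthetic customer features (annual spend, products held); illustrative only, not the bank's actual pipeline.

# Customer segmentation with k-means on synthetic data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Features per customer: [annual spend, number of products held]
low_value  = rng.normal(loc=[1_000, 1.5], scale=[300, 0.5], size=(100, 2))
high_value = rng.normal(loc=[20_000, 4.0], scale=[5_000, 1.0], size=(100, 2))
X = np.vstack([low_value, high_value])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for label in range(2):
    seg = X[km.labels_ == label]
    print(f"segment {label}: {len(seg)} customers, "
          f"mean spend {seg[:, 0].mean():,.0f}")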
DATA SCIENCE IN AIRBNB

Improving Customer Experience


Providing better search results
) Uses big data of customer and host information, homestays and lodge records, and
website traffic.
) Uses data science to provide better search results to its customers and find compatible hosts.
Detecting bounce rates
) Use of demographic analytics to analyze bounce rates from their websites.
Providing ideal lodgings and localities
) Uses knowledge graphs where the user’s preferences are matched with various parameters to provide ideal lodgings and localities.



DATA SCIENCE IN SPOTIFY

Improving Customer Experience and recommendation


Providing better music streaming experience
) Provide personalized music recommendations.
) Uses over 600 GBs of daily data generated by the users to build its algorithms to boost
user experience.
Improving experience for artists and managers
) Spotify for Artists application allows the artists and managers to analyze their streams,
fan approval and the hits they are generating through Spotify’s playlists.



DATA SCIENCE IN SPOTIFY... CONTD..

Spotify uses data science to gain insights about which universities had the highest
percentage of party playlists and which ones spent the most time on it.
”Spotify Insights” publishes information about the ongoing trends in the music.
Spotify’s Niland, an API based product, uses machine learning to provide better
searches and recommendations to its users.
Spotify analyzes listening habits of its users to predict the Grammy Award Winners.



APPLICATIONS OF DATA SCIENCE

[Figure: applications of data science. Source: DataFlair]
APPLICATIONS OF DATA SCIENCE

[Figure: applications of data science. Source: edureka.co]


DATA SCIENCE CHALLENGES

Data science challenges can be categorized as:


Data related
Organization related
Technology related
People related
Skill related



CHALLENGES IN DATA SCIENCE

Complexity of data reality
Identifying the problem
Access to the right data (data quantity)
Data cleansing (data quality)
Data security
Granularity, consistency, and availability of data
Lack of domain expertise
Cognitive bias
Content and source bias


COGNITIVE BIAS

Cognitive Biases are the distortions of reality because of the lens through which we
view the world.
Each of us sees things differently based on our preconceptions, past experiences,
cultural, environmental, and social factors. This doesn’t necessarily mean that the
way we think or feel about something is truly representative of reality.





ROLES IN DATA SCIENCE TEAM [1/6]

[1] Chief Analytics Officer / Chief Data Officer
) CAO, a “business translator,” bridges the gap between data science and domain expertise, acting both as a visionary and a technical lead.
) Preferred skills: data science and analytics, programming skills, domain expertise, leadership and visionary abilities.

https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
ROLES IN DATA SCIENCE TEAM [2/6]

[2] Data analyst
) The data analyst role implies proper data collection and interpretation activities.
) An analyst ensures that collected data is relevant and exhaustive while also interpreting the analytics results.
) Some companies (e.g., IBM or HP) may require data analysts to have visualization skills to convert alienating numbers into tangible insights through graphics.
) Preferred skills: R, Python, JavaScript, C/C++, SQL

https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
ROLES IN DATA SCIENCE TEAM [3/6]
[3] Business analyst
) A business analyst basically realizes a CAO’s functions but on the operational level.
) This implies converting business expectations into data analysis.
) If your core data scientist lacks domain expertise, a business analyst bridges this gulf.
) Preferred skills: data visualization, business intelligence, SQL

[4] Data scientist
) A data scientist is a person who solves business tasks using machine learning and data mining techniques.
) The role can be narrowed down to data preparation and cleaning with further model training and evaluation.
) Preferred skills: R, SAS, Python, Matlab, SQL, NoSQL, Hive, Pig, Hadoop, Spark

https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
ROLES IN DATA SCIENCE TEAM [4/6]
The job of a data scientist is often divided into two roles:
[4A] Machine Learning Engineer
) A machine learning engineer combines software engineering and modeling skills by determining which model to use and what data should be used for each model.
) Probability and statistics are also their forte.
) Training, monitoring, and maintaining a model.
) Preferred skills: R, Python, Scala, Julia, Java

[4B] Data Journalist
) Data journalists help make sense of data output by putting it in the right context.
) Articulating business problems and shaping analytics results into compelling stories.
) Present the idea to stakeholders and represent the data team with those unfamiliar with statistics.
) Preferred skills: SQL, Python, R, Scala, Carto, D3, QGIS, Tableau

https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
ROLES IN DATA SCIENCE TEAM [5/6]
[5] Data architect
) Working with Big Data.
) This role is critical to warehouse the data, define database architecture, centralize data, and ensure integrity across different sources.
) Preferred skills: SQL, NoSQL, XML, Hive, Pig, Hadoop, Spark

[6] Data engineer
) Data engineers implement, test, and maintain infrastructural components that data architects design.
) Realistically, the role of an engineer and the role of an architect can be combined in one person.
) Preferred skills: SQL, NoSQL, Hive, Pig, Matlab, SAS, Python, Java, Ruby, C++, Perl

https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
ROLES IN DATA SCIENCE TEAM [6/6]

[7] Application/data visualization engineer
) This role is only necessary for a specialized data science model.
) An application engineer or other developers from front-end units will oversee end-user data visualization.
) Preferred skills: programming, JavaScript (for visualization), SQL, NoSQL

https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
SKILLSET FOR A DATA SCIENTIST
PROGRAMMING: The most fundamental of a data scientist’s skills. Programming improves your statistics skills, helps you “analyze large datasets,” and gives you the ability to create your own tools.
QUANTITATIVE ANALYSIS: Improves your ability to run experimental analysis, scale your data strategy, and helps you implement machine learning.
PRODUCT INTUITION: Understanding products will help you perform quantitative analysis. It will also help you predict system behavior, establish metrics, and improve debugging skills.
COMMUNICATION: Strong communication skills will help you “leverage all of the previous skills listed.”
TEAMWORK: Requires being selfless, embracing feedback, and sharing your knowledge with your team.
SKILLS REQUIRED FOR A DATA SCIENTIST

A data scientist should be: communicative, qualitative, curious, technical, creative, and skeptical.


SKILLSET OF A DATA SCIENTIST

[Figure: skill set of a data scientist]


TOOLS AVAILABLE TO A DATA SCIENTIST

Tools: R, SQL, Python, Scala, SAS, Hadoop, Julia, Tableau, Weka


ALGORITHMS FOR A DATA SCIENTIST

Algorithms: Linear Regression, Logistic Regression, K-means clustering, PCA, Apriori, Decision Tree, SVM, ANN


DATA SCIENCE TEAM BUILDING

Get to know each other for better communication


Foster team cohesion and teamwork
Encourage collaboration to boost team productivity and performance.

https://towardsdatascience.com/why-team-building-is-important-to-data-scientists-a8fa74dbc09b


SOFTWARE ENGINEERING
In general,
Software engineering is an engineering discipline that is concerned with all aspects of
software production.
Software includes computer programs, all associated documentation, and
configuration data that are needed for software to work correctly.
Waterfall model, Iterative models, Agile models



DATA SCIENCE PROCESS

[Figure: the data science process]


DATA SCIENCE VS. SOFTWARE ENGINEERING

Data Science: involves analyzing huge amounts of data, with some aspects of programming and development.
Software Engineering: focuses on creating software that serves a specific purpose.

Data Science: uses a methodology involving various phases beginning from requirements specification through model deployment to better decision making.
Software Engineering: uses a methodology involving various phases beginning from requirements specification through software deployment into production.


DATA SCIENCE VS. SOFTWARE ENGINEERING

Data Science: involves collecting and analyzing data.
Software Engineering: concerned with creating useful applications.

Data Science: data scientists utilize the ETL (Extract, Transform, Load) process.
Software Engineering: software engineers use the SDLC process.

Data Science: more process-oriented.
Software Engineering: uses frameworks like Waterfall, Agile, and Spiral.

Data Science: uses tools like Amazon S3, MongoDB, Hadoop, and MySQL.
Software Engineering: uses tools like Rails, Django, Flask, and Vue.js.

Data Science: skills include machine learning, statistics, and data visualization.
Software Engineering: skills are focused on coding languages.
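As a concrete illustration of the ETL process mentioned above, a minimal sketch in Python: extract from a CSV, transform with pandas, load into SQLite. File, table, and column names are hypothetical.

# Minimal ETL sketch: extract from CSV, transform with pandas, load into SQLite.
import sqlite3
import pandas as pd

# Extract
df = pd.read_csv("sales_raw.csv")            # hypothetical input file

# Transform: drop duplicates, normalize column names, derive a field
df = df.drop_duplicates()
df.columns = [c.strip().lower() for c in df.columns]
df["revenue"] = df["quantity"] * df["unit_price"]   # assumed columns

# Load
with sqlite3.connect("warehouse.db") as conn:
    df.to_sql("sales_clean", conn, if_exists="replace", index=False)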
DATAOPS

DATAOPS AS DEFINED BY GARTNER


DataOps is a collaborative data management practice, really focused on improving
communication, integration, and automation of data flow between managers and
consumers of data within an organization.



DATAOPS
DataOps applies Agile development, DevOps and lean manufacturing to data
analytics development and operations.
Agile governs analytics development.
DevOps optimizes code verification, builds and delivery of new analytics.
The lean-manufacturing tool statistical process control (SPC) orchestrates, monitors, and validates the data factory.



DATAOPS

[Figure: DataOps overview]


DATAOPS

Data analytics pipeline


1 Data ingestion – Data, extracted from various sources, is explored, validated, and
loaded into a downstream system.
2 Data transformation – Data is cleansed and enriched. Initial data models are
designed to meet business needs.
3 Data analysis – produce insights using different data analysis techniques.
4 Data visualization/reporting – Data insights are represented in the form of reports or
interactive dashboards.

https://www.altexsoft.com/blog/dataops-essentials/
DATAOPS
DataOps puts data pipelines into a CI/CD paradigm.
Development – involve building a new pipeline, changing a data model or redesigning
a dashboard.
Testing – checking the most minor update for data accuracy, potential deviation, and
errors.
Deployment – moving data jobs between environments, pushing them to the next
stage, or deploying the entire pipeline in production.
Monitoring – allows data professionals to identify bottlenecks, catch abnormal
patterns, and measure adoption of changes.
Orchestration – automates moving data between different stages, monitoring
progress, triggering autoscaling, and operations related to data flow management.

https://www.altexsoft.com/blog/dataops-essentials/
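For the testing stage above, a minimal sketch of the automated data-quality checks a DataOps CI job might run (e.g., under pytest); the file, column names, and thresholds are illustrative.

# Data-quality checks of the kind a DataOps "testing" stage might run in CI.
import pandas as pd

def test_orders_table():
    df = pd.read_csv("orders_latest.csv")     # hypothetical pipeline output
    assert not df.empty, "pipeline produced no rows"
    assert df["order_id"].is_unique, "duplicate order ids"
    assert df["amount"].ge(0).all(), "negative amounts found"
    # Deviation check: today's mean should stay near a historical baseline.
    baseline_mean = 118.0                     # illustrative historical value
    assert abs(df["amount"].mean() - baseline_mean) / baseline_mean < 0.25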
DATAOPS TEAM

[Figure: DataOps team structure]


TECHNOLOGIES TO RUN DATAOPS

Git for version control


Jenkins for CI/CD practices
Docker for containerization and Kubernetes for managing containers
Tableau for data visualizations
Apache Airflow for data pipeline tools
Automated testing and monitoring tools
DataOps Platforms
) DataKitchen
) Saagie
) StreamSets



MLOPS

MLOps is an ML engineering culture and practice that aims at unifying ML system development (Dev) and ML system operation (Ops).

MLOps lies at the intersection of Machine Learning, Data Engineering, and DevOps.


MLOPS

The real challenge isn’t building an ML model, but building an integrated ML system and continuously operating it in production.
To deploy and maintain ML systems in production reliably and efficiently.
Automating continuous integration (CI), continuous delivery (CD), and continuous
training (CT) for machine learning (ML) systems.
Frameworks
) Kubeflow and Cloud Build
) Amazon AWS MLOps
) Microsoft Azure MLOps

https://ml-ops.org/content/mlops-principles
MLOPS

https://builtin.com/machine-learning/mlops
MLOPS

Same data transformations, but different implementations.
) E.g., the training pipeline usually runs over batch files that contain all features, while the serving pipeline often runs online and receives only part of the features in the requests.
) The two pipelines must be kept consistent, enabling code reuse and data reuse.
) Each trained model needs to be tied to the exact versions of code, data, and hyperparameters that were used.
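A bare-bones sketch of recording that lineage: hash the training data, capture the git commit, and store both alongside the hyperparameters. Real setups typically use tools like MLflow; the names here are illustrative.

# Tie a trained model to the exact code, data, and hyperparameters used.
import hashlib, json, subprocess

def dataset_hash(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def record_run(model_name, data_path, hyperparams):
    meta = {
        "model": model_name,
        # Assumes this runs inside a git repository.
        "git_commit": subprocess.check_output(
            ["git", "rev-parse", "HEAD"]).decode().strip(),
        "data_sha256": dataset_hash(data_path),
        "hyperparams": hyperparams,
    }
    with open(f"{model_name}.meta.json", "w") as f:
        json.dump(meta, f, indent=2)

record_run("churn_model_v3", "train.csv", {"lr": 0.1, "max_depth": 6})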

https://builtin.com/machine-learning/mlops
DATAOPS AND MLOPS

[Figure: DataOps and MLOps]




DATA SCIENCE VS. BUSINESS INTELLIGENCE

[Figure: data science vs. business intelligence]


DATA SCIENCE VS. BUSINESS INTELLIGENCE

Perspective – Data Science: looking forward. Business Intelligence: looking backward.
Analysis – Data Science: predictive, explorative. Business Intelligence: descriptive, comparative.
Data – Data Science: same data, new analysis; listens to data; distributed. Business Intelligence: new data, same analysis; speaks for data; warehoused.
Scope – Data Science: specific to a business question. Business Intelligence: unlimited.
Expertise – Data Science: data scientist. Business Intelligence: business analyst.
Deliverable – Data Science: insight or story. Business Intelligence: table or report.
Applicability – Data Science: future, correction for influences. Business Intelligence: historic, confounding factors.
DATA SCIENTIST VS. BUSINESS ANALYST

[Figure: data scientist vs. business analyst]


DATA SCIENCE VS. STATISTICS
Type of problem – Data Science: semi-structured or unstructured. Statistics: well structured.
Inference model – Data Science: explicit inference. Statistics: no inference.
Analysis objective – Data Science: need not be well formed. Statistics: well-formed objective.
Type of analysis – Data Science: explorative. Statistics: confirmative.
Data collection – Data Science: not linked to the objective. Statistics: data collected based on the objective.
Size of dataset – Data Science: large, heterogeneous. Statistics: small, homogeneous.
Paradigm – Data Science: theory and heuristic (deductive & inductive). Statistics: theory based (deductive).
ORGANISATION OF DATA SCIENCE TEAM
[1] Decentralized
) Data scientists report into specific business units (e.g., Marketing) or functional units (e.g., Product Recommendations) within a company.
) Resources are allocated only to projects within their silos, with no view of analytics activities or priorities outside their function or business unit.
) Analytics are scattered across the organization in different functions and business units.
) Little to no coordination.
) Drawback: can lead to isolated teams.


ORGANISATION OF DATA SCIENCE TEAM

[2] Functional
) Resource allocation driven by a functional agenda rather than an enterprise agenda.
) Analysts are located in the functions where the most analytical activity takes place, but may also provide services to the rest of the corporation.
) Little coordination.


ORGANISATION OF DATA SCIENCE TEAM

[3] Consulting
) Resources allocated based on availability on a first-come first-served basis, without necessarily aligning to enterprise objectives.
) Analysts work together in a central group but act as internal consultants who charge “clients” (business units) for their services.
) No centralized coordination.


ORGANISATION OF DATA SCIENCE TEAM
[4] Centralized
) Data scientists are members of a core group, reporting to a head of data science or analytics.
) Stronger ownership and management of resource allocation and project prioritization within a central pool.
) Analysts reside in the central group, where they serve a variety of functions and business units and work on diverse projects.
) Coordination by a central analytic unit.
) Challenge: hard to assess and meet demands for incoming data science projects (especially in smaller teams).
ORGANISATION OF DATA SCIENCE TEAM

[5] Center of Excellence
) Better alignment of analytics initiatives and resource allocation to enterprise priorities without operational involvement.
) Analysts are allocated to units throughout the organization, and their activities are coordinated by a central entity.
) Flexible model with the right balance of centralized and distributed coordination.


ORGANISATION OF DATA SCIENCE TEAM

[6] Federated
) Same as the “Center of Excellence” model, with need-based operational involvement to provide SME support.
) A centralized group of advanced analysts is strategically deployed to enterprise-wide initiatives.
) Flexible model with the right balance of centralized and distributed coordination.


REFERENCES

Introducing Data Science by Cielen, Meysman and Ali
The Art of Data Science by Roger D Peng and Elizabeth Matsui
https://data-flair.training/blogs/data-science-use-cases/
https://www.northeastern.edu/graduate/blog/what-does-a-data-scientist-do/
https://www.visual-paradigm.com/guide/software-development-process/what-is-a-software-process-model/
Building an Analytics-Driven Organization, Accenture


REFERENCES

https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
https://www.cio.com/article/3217026/what-is-a-data-scientist-a-key-data-analytics-role-and-a-lucrative-career.html
https://atlan.com/what-is-dataops/

THANK YOU
INTRODUCTION TO DATA SCIENCE
MODULE # 2 : DATA ANALYTICS
IDS Course Team
BITS Pilani
The instructor gratefully acknowledges the authors who made their course materials freely available online.
TABLE OF CONTENTS

1 ANALYTICS
2 DATA ANALYTICS
3 DATA ANALYTICS METHODOLOGIES
CRISP-DM
Big Data Life-cycle
SEMMA
SMAM
4 FURTHER READING
DEFINITION OF ANALYTICS – DICTIONARY

OXFORD: Analytics is the systematic computational analysis of data or statistics.
CAMBRIDGE: Analytics is a process in which a computer examines information using mathematical methods in order to find useful patterns.
DICTIONARY.COM: Analytics is the analysis of data, typically large sets of business data, by the use of mathematics, statistics, and computer software.

Analytics is treated as both a noun and a verb.

Source: Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti
DEFINITION OF ANALYTICS – WEBSITES

ORACLE: Analytics is the process of discovering, interpreting, and communicating significant patterns in data and using tools to empower your entire organization to ask any question of any data in any environment on any device.
EDUREKA: Data Analytics refers to the techniques used to analyze data to enhance productivity and business gain.
INFORMATICA: Data analytics is the pursuit of extracting meaning from raw data using specialized computer systems.

Source: Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti
GOALS OF DATA ANALYTICS
To predict something
) whether a transaction is a fraud or not
) whether it will rain on a particular day
) whether a tumour is benign or malignant
To find patterns in the data
) finding the top 10 coldest days in the year
) which pages are visited the most on a particular website
) finding the most searched celebrity in a particular year
To find relationships in the data
) finding similar news articles
) finding similar patients in an electronic health record system
) finding related products on an e-commerce website
) finding similar images
) finding correlation between news items and stock prices
DATA ANALYTICS

Data analysis is defined as a process of cleaning, transforming, and modelling data to


discover useful information for business decision-making.
4 different types of analytics
1 Descriptive Analytics
2 Diagnostic Analytics
3 Predictive Analytics
4 Prescriptive Analytics

DATA ANALYTICS

[Figure: the four types of data analytics]
DESCRIPTIVE ANALYTICS

Answers the question of what happened.


Summarize past data usually in the form of dashboards.
Insights into the past.
Also known as statistical analysis.
Raw data from multiple data sources.

DESCRIPTIVE ANALYTICS EXAMPLE

[Figure: descriptive analytics example]
DESCRIPTIVE ANALYTICS

Techniques:
) Descriptive Statistics - histogram, correlation
) Data Visualization
) Exploratory Analysis

PREDICTIVE ANALYTICS

Answers the question of what is likely to happen.


Predict future trends.
Being able to predict allows one to make better decisions.
Analysis based on machine or deep learning.
Accuracy of the forecasting or prediction highly depends on data quality and stability
of the situation.

PREDICTIVE ANALYTICS EXAMPLE

[Figure: predictive analytics example]
PREDICTIVE ANALYTICS

Techniques / Algorithms:
) Regression
) Classification
) ML algorithms like Linear regression, Logistic regression, SVM
) Deep Learning techniques

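A minimal predictive-analytics sketch with scikit-learn: fit a logistic regression on synthetic "historical" data and measure how well it predicts unseen cases.

# Predict a future label (e.g., fraud / not fraud) from past examples.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))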
DIAGNOSTIC ANALYTICS

Answers the question of why something happened.


Gives in-depth insights into data.
Identifies relationships in the data and patterns of behaviour.

DIAGNOSTIC ANALYTICS EXAMPLE
What is the effect of global warming on the Southwest monsoon?
DIAGNOSTIC ANALYTICS

Pattern recognition to identify patterns.


Linear / Logistic regression to identify relationship.
Neural Network
Deep Learning techniques

PRESCRIPTIVE ANALYTICS

Answers the question of what action should be taken.


Data-driven decision making and corrective actions, recommendations and
suggestions
Prescribe what action to take to eliminate a future problem or take full advantage of a
promising trend.
Need historical internal data and external information like trends.
Analysis based on machine or deep learning, business rules.
Use of AI to improve decision making.

PRESCRIPTIVE ANALYTICS EXAMPLE
How can we improve crop production?
COGNITIVE ANALYTICS
Cognitive Analytics – What Don’t I Know?

https://www.10xds.com/blog/cognitive-analytics-to-reinvent-business/
COGNITIVE ANALYTICS

Next level of Analytics


Human cognition is based on the context and reasoning.
Cognitive systems mimic how humans reason and process.
Cognitive systems analyse information and draw inferences using probability.
They continuously learn from data and reprogram themselves.
According to one source:
”The essential distinction between cognitive platforms and artificial
intelligence systems is that you want an AI to do something for you. A
cognitive platform is something you turn to for collaboration or for advice.”

https://interestingengineering.com/cognitive-computing-more-human-than-artificial-intelligence
COGNITIVE ANALYTICS
Involves Semantics, AI, Machine learning, Deep Learning, Natural Language
Processing, and Neural Networks.
Simulates human thought process to learn from the data and extract the hidden
patterns from data.
Uses all types of data: audio, video, text, images in the analytics process.
Although this is the top tier of analytics maturity, Cognitive Analytics can be used in
the prior levels.
According to Jean Francois Puget:
”It extends the analytics journey to areas that were unreachable with more
classical analytics techniques like business intelligence, statistics, and
operations research.”

https://www.ecapitaladvisors.com/blog/analytics-maturity/
https://www.xenonstack.com/insights/what-is-cognitive-analytics/
DATA ANALYTICS METHODOLOGIES

Use a standard methodology to ensure a good outcome.


1 CRISP-DM
2 Big Data Life-cycle
3 SEMMA
4 SMAM

NEED FOR A STANDARD PROCESS

Framework for recording experience.


) Allows projects to be replicated
Aid to project planning and management.
“Comfort factor” for new adopters
) Demonstrates maturity of Data Mining
) Reduces dependency on “stars”
Encourage best practices and help to obtain better results.

DATA SCIENCE METHODOLOGY
10 Questions the process aims to answer
Problem to Approach
1 What is the problem that you are trying to solve?
2 How can you use data to answer the questions?
Working with Data
3 What data do you need to answer the question?
4 Where is the data coming from? Identify all Sources. How will you acquire it?
5 Is the data that you collected representative of the problem to be solved?
6 What additional work is required to manipulate and work with the data?
Delivering the Answer
7 In what way can the data be visualized to get to the answer that is required?
8 Does the model used really answer the initial question or does it need to be adjusted?
9 Can you put the model into practice?
10 Can you get constructive feedback into answering the question?
CRISP-DM

Cross Industry Standard Process for Data Mining
Conceived around 1996
6 high-level phases
Used in the IBM SPSS Modeler tool
Iterative approach to the development of analytical models.

[Figure: CRISP-DM phases]
CRISP-DM PHASES

Business Understanding
) Understand project objectives and requirements.
) Data mining problem definition.
Data Understanding
) Initial data collection and familiarization.
) Identify data quality issues.
) Identify initial obvious results.
Data Preparation
) Record and attribute selection.
) Data cleansing.

CRISP-DM PHASES

Modeling
) Run the data mining tools.
Evaluation
) Determine if results meet business objectives.
) Identify business issues that should have been addressed earlier.
Deployment
) Put the resulting models into practice.
) Set up for continuous mining of the data.

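A schematic sketch of one CRISP-DM iteration as plain Python functions; each phase is a stub standing in for much richer work, and the loop reflects the methodology's iterative nature.

# CRISP-DM as a skeleton: every function below is an illustrative stub.
def business_understanding():  return {"goal": "reduce churn"}   # objectives, problem definition
def data_understanding(goal):  return "raw_data"                 # collect, assess quality
def data_preparation(raw):     return "clean_data"               # select records/attributes, cleanse
def modeling(clean):           return "model"                    # run the data mining tools
def evaluation(model, goal):   return True                       # do results meet business objectives?
def deployment(model):         print("model deployed")           # put the model into practice

ctx = business_understanding()
while True:
    raw = data_understanding(ctx["goal"])
    model = modeling(data_preparation(raw))
    if evaluation(model, ctx["goal"]):
        deployment(model)
        break   # otherwise revisit earlier phases with what was learned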
CRISP-DM PHASES AND TASKS

[Figure: CRISP-DM phases and their generic tasks]
WHY CRISP-DM?

The data mining process must be reliable and repeatable by people with little data
mining skills.
CRISP-DM provides a uniform framework for
) guidelines.
) experience documentation.
CRISP-DM is flexible to account for differences.
) Different business/agency problems.
) Different data

BIG DATA LIFE-CYCLE

Data Acquisition
) Acquiring information from a rich and varied data environment.
Data Awareness
) Connecting data from different sources into a coherent whole, including modeling content, establishing context, and ensuring searchability.
Data Analytics
) Using contextual data to answer questions about the state of your organization.
Data Governance
) Establishing a framework for providing for the provenance, infrastructure and disposition
of that data.

BIG DATA LIFE-CYCLE

Phase 1: Foundations
Phase 2: Acquisition
Phase 3: Preparation
Phase 4: Input and Access
Phase 5: Processing
Phase 6: Output and Interpretation
Phase 7: Storage
Phase 8: Integration
Phase 9: Analytics and Visualization
Phase 10: Consumption
Phase 11: Retention, Backup, and Archival
Phase 12: Destruction

PS: Some phases may overlap and can be done in parallel.
BIG DATA LIFE-CYCLE

[Figure: big data life-cycle]
BIG DATA LIFE-CYCLE

Phase 1: Foundations
) Understanding and validating data requirements, solution scope, roles and
responsibilities, data infrastructure preparation, technical and non-technical
considerations, and understanding data rules in an organization.
Phase 2: Data Acquisition
) Data Acquisition refers to collecting data.
) Data sets can be obtained from various sources, both internal and external to the
business organizations.
) Data sources can be in
2 structured forms such as transferred from a data warehouse, a data mart, various
transaction systems.
2 semi-structured sources such as Weblogs, system logs.
2 unstructured sources such as media files consisting of videos, audios, and pictures.

BIG DATA LIFE-CYCLE

Phase 3: Data Preparation


) Collected data (Raw Data) is rigorously checked for inconsistencies, errors, and
duplicates.
) Redundant, duplicated, incomplete, and incorrect data are removed.
) The objective is to have clean and useable data sets.
Phase 4: Data Input and Access
) Data input refers to sending data to planned target data repositories, systems, or
applications.
) Data can be stored in CRM (Customer Relationship Management) application, a data
lake or a data warehouse.
) Data access refers to accessing data using various methods.
) NoSQL is widely used to access big data.

BIG DATA LIFE-CYCLE

Phase 5: Data Processing


) Processing the raw form of data.
) Convert data into a readable format giving it the form and the context.
) Interpret the data using the selected data analytics tools such as Hadoop MapReduce,
Impala, Hive, Pig, and Spark SQL.
) Data processing also includes activities
2 Data annotation – refers to labeling the data.
2 Data integration – aims to combine data existing in different sources, and provide a unified
view of data to the data consumers.
2 Data representation – refers to the way data is processed, transmitted, and stored.
2 Data aggregation – aims to compile data from databases to combined data-sets to be used
for data processing.

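A minimal sketch of the processing phase using one of the tools named above (Spark SQL via PySpark); the path and column names are illustrative.

# Read raw data, give it structure, and query it with Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("processing").getOrCreate()

raw = spark.read.csv("hdfs:///raw/events.csv", header=True, inferSchema=True)
raw.createOrReplaceTempView("events")          # make the data queryable

daily = spark.sql("""
    SELECT event_date, COUNT(*) AS n_events
    FROM events
    GROUP BY event_date
    ORDER BY event_date
""")
daily.show()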
BIG DATA LIFE-CYCLE

Phase 6: Data Output and Interpretation


) In the data output phase, the data is in a format which is ready for consumption by the
business users.
) Transform data into usable formats such as plain text, graphs, processed images, or
video files.
) This phase is also called the data ingestion.
) Common Big Data ingestion tools are Sqoop, Flume, and Spark streaming.
) Interpreting the ingested data requires analyzing ingested data and extract information
or meaning out of it to answer the questions related to the Big Data business solutions.

BIG DATA LIFE-CYCLE

Phase 7: Data Storage


) Store data in designed and designated storage units.
) Storage infrastructure can consist of storage area networks (SAN), network-attached
storage (NAS), or direct access storage (DAS) formats.
Phase 8: Data Integration
) Integration of stored data to different systems for various purposes.
) Integration of data lakes with a data warehouse or data marts.
Phase 9: Data Analytics and Visualization
) Integrated data can be useful and productive for data analytics and visualization.
) Business value is gained in this phase.

BIG DATA LIFE-CYCLE
Phase 10: Data Consumption
) Data is turned into information ready for consumption by the internal or external users,
including customers of the business organization.
) Data consumption require architectural input for policies, rules, regulations, principles,
and guidelines.
Phase 11: Retention, Backup, and Archival
) Use established data backup strategies, techniques, methods, and tools.
) Identify, document, and obtain approval for the retention, backup, and archival decisions.
Phase 12: Data Destruction
) There may be regulatory requirements to destroy a particular type of data after a certain amount of time.
) Confirm the destruction requirements with the data governance team in business
organizations.
SEMMA

SAS Institute
Sample, Explore, Modify, Model,
Assess
5 stages

SEMMA STAGES

1 Sample
) Sampling the data by extracting a portion of a large data set big enough to contain the
significant information, yet small enough to manipulate quickly.
) Optional stage
2 Explore
) Exploration of the data by searching for unanticipated trends and anomalies in order to
gain understanding and ideas.
3 Modify
) Modification of the data by creating, selecting, and transforming the variables to focus
the model selection process.

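The Sample stage in a few pandas lines: draw a subset small enough to manipulate quickly but still representative. The file, fraction, and column name are illustrative.

# Sampling a large data set, simple and stratified variants.
import pandas as pd

big = pd.read_csv("transactions.csv")          # hypothetical large data set

# Simple random sample of 5%:
sample = big.sample(frac=0.05, random_state=1)

# Or stratified by an important class column, keeping class proportions:
stratified = big.groupby("label", group_keys=False).apply(
    lambda g: g.sample(frac=0.05, random_state=1))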
SEMMA STAGES

4 Model
) Modeling the data by allowing the software to search automatically for a combination of data that reliably predicts a desired outcome.
5 Assess
) Assessing the data by evaluating the usefulness and reliability of the findings from the data mining process and estimating how well it performs.
SEMMA

“SEMMA is not a data mining methodology but rather a logical organization of the
functional tool set of SAS Enterprise Miner for carrying out the core tasks of data
mining.
Enterprise Miner can be used as part of any iterative data mining methodology
adopted by the client. Naturally steps such as formulating a well defined business or
research problem and assembling quality representative data sources are critical to
the overall success of any data mining project.
SEMMA is focused on the model development aspects of data mining.”

SMAM

Standard Methodology for Analytics Models
SMAM PHASES

Use-case identification: selection of the ideal approach from a list of candidates
Model requirements gathering: understanding the conditions required for the model to function
Data preparation: getting the data ready for the modeling
Modeling experiments: scientific experimentation to solve the business question
Insight creation: visualization and dash-boarding to provide insight
Proof of Value (ROI): running the model in a small-scale setting to prove the value
Operationalization: embedding the analytical model in operational systems
Model life-cycle: governance around model lifetime and refresh
DESCRIPTIVE ANALYTICS – EXAMPLE # 1

Problem Statement:
“Market research team at Aqua Analytics Pvt. Ltd is assigned a task to identify the profile of a typical customer for a digital fitness band that is offered by Titanic Corp. The market research team decides to investigate whether there are differences across the usage patterns and product lines with respect to customer characteristics.”

Data captured:
Gender
Age (in years)
Education (in years)
Relationship status (single or partnered)
Annual household income
Average number of times the customer tracks activity each week
Number of miles the customer expects to walk each week
Self-rated fitness on a scale 1–5, where 1 is poor shape and 5 is excellent
Model of the product purchased: IQ75, MZ65, DX87

https://medium.com/@ashishpahwa7/first-case-study-in-descriptive-analytics-a744140c39a4
DESCRIPTIVE ANALYTICS – EXAMPLE # 1

[Figure: descriptive analytics example]
DIAGNOSTIC ANALYTICS – EXAMPLE # 1

Problem Statement :
“During the 1980s General Electric was selling different products to its customers such as
light bulbs, jet engines, windmills, and other related products. Also, they separately sell
parts and services this means they would sell you a certain product you would use it until it
needs repair either because of normal wear and tear or because it’s broken. And you would
come back to GE and then GE would sell you parts and services to fix it. Model for GE was
focusing on how much GE was selling, in sales of operational equipment, and in sales of
parts and services. And what does GE need to do to drive up those sales?”

https://medium.com/parrotai/understand-data-analytics-framework-with-a-case-study-in-the-business-world-15bfb421028d
DIAGNOSTIC ANALYTICS – EXAMPLE # 1

[Figure: diagnostic analytics example]
https://www.sganalytics.com/blog/change-management-analytics-adoption/
PREDICTIVE ANALYTICS – EXAMPLE # 1

Google launched Google Flu Trends (GFT), to collect predictive analytics regarding
the outbreaks of flu. It’s a great example of seeing big data analytics in action.
So, did Google manage to predict influenza activity in real-time by aggregating search
engine queries with this big data and adopting predictive analytics?
Even with a wealth of big data analytics on search queries, GFT overestimated the
prevalence of flu by over 50% in 2012-2013 and 2011-2012.
They matched the search engine terms conducted by people in different regions of the world. And, when these queries were compared with traditional flu surveillance systems, Google found that the predictive analytics of the flu season pointed towards a correlation with higher search engine traffic for certain phrases.

PREDICTIVE ANALYTICS – EXAMPLE # 1

https://www.slideshare.net/VasileiosLampos/usergenerated-content-collective-and-personalised-inference-tasks
PREDICTIVE ANALYTICS – EXAMPLE # 2
Colleen Jones applied predictive analytics to FootSmart (a niche online catalog
retailer) on a content marketing product. It was called the FootSmart Health
Resource Center (FHRC) and it consisted of articles, diagrams, quizzes and the like.
On analyzing the data around increased search engine visibility, FHRC was found
to help FootSmart reach more of the right kind of target customers.
They were receiving more traffic, primarily consisting of people that cared about foot
health conditions and their treatments.
FootSmart decided to push more content at FHRC and also improve its
merchandising of the product.
The result of such informed data-driven decision making? A 36% increase in weekly sales.

https://www.footsmart.com/pages/health-resource-center
PREDICTIVE ANALYTICS – EXAMPLE # 2

Predictive Policing (Self study)

https://www.brennancenter.org/our-work/research-reports/predictive-policing-explained
https://www.youtube.com/watch?v=YxvyeaL7NEM
PRESCRIPTIVE ANALYTICS – EXAMPLE # 1
A health insurance company analyses its data and determines that many of its diabetic
patients also suffer from retinopathy.

With this information, the provider can now use predictive analytics to get an idea of how
many more ophthalmology claims it might receive during the next year.

Then, using prescriptive analytics, the company can look at scenarios where the
reimbursement costs for ophthalmology increases, decreases, or holds steady. These
scenarios then allow them to make an informed decision about how to proceed in a way that’s
both cost-effective and beneficial to their customers.

Analysing data on patients, treatments, appointments, surgeries, and even radiologic


techniques can ensure hospitals are properly staffed, the doctors are devising tests and
treatments based on probability rather than gut instinct, and the facility can save costs on
everything from medical supplies to transport fees to food budgets.

PRESCRIPTIVE ANALYTICS – EXAMPLE # 2

Whenever you go to Amazon, the site recommends dozens and dozens of products to
you. These are based not only on your previous shopping history (reactive), but also
based on what you’ve searched for online, what other people who’ve shopped for the
same things have purchased, and about a million other factors (proactive).
Amazon and other large retailers are taking deductive, diagnostic, and predictive data
and then running it through a prescriptive analytics system to find products that you
have a higher chance of buying.
Every bit of data is broken down and examined with the end goal of helping the
company suggest products you may not have even known you wanted.

https://accent-technologies.com/2020/06/18/examples-of-prescriptive-analytics/
HEALTHCARE ANALYTICS – CASE STUDY

Self study
https://integratedmp.com/4-key-healthcare-analytics-sources-is-your-practice-using-them/
https://www.youtube.com/watch?v=olpuyn6kemg
REFERENCES

Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti
https://www.kdnuggets.com/2014/10/crisp-dm-top-methodology-analytics-data-mining-data-science-projects.html
https://www.datasciencecentral.com/profiles/blogs/crisp-dm-a-standard-methodology-to-ensure-a-good-outcome
https://documentation.sas.com/?docsetId=emref&docsetTarget=n061bzurmej4j3n1jnj8bbjjm1a2.htm&docsetVersion=14.3&locale=en
http://jesshampton.com/2011/02/16/semma-and-crisp-dm-data-mining-methodologies/
https://www.kdnuggets.com/2015/08/new-standard-methodology-analytical-models.html
https://medium.com/illumination-curated/big-data-lifecycle-management-629dfe16b78d
https://www.esadeknowledge.com/view/7-challenges-and-opportunities-in-data-based-decision-making-193560

THANK YOU
INTRODUCTION TO DATA SCIENCE
MODULE # 3 : DATA
IDS Course Team
BITS Pilani
The instructor is gratefully acknowledging
the authors who made their course
materials freely available online.

INTRODUCTION TO DATA SCIENCE 2 / 79


TABLE OF CONTENTS

1 DATA
2 DATA-SETS
3 DATA QUALITY

4 DATA MODELS
5 ANALYSIS IN DATA SCIENCE
6 DATA PIPELINES AND PATTERNS
7 FURTHER READING

INTRODUCTION TO DATA SCIENCE 3 / 79


DATA

Data is a collection of data objects and their attributes.
The type of data determines which tools
and techniques can be used to analyze the
data.

INTRODUCTION TO DATA SCIENCE 4 / 79


DATA
Data is a collection of data objects and their
attributes.
An attribute is a property or characteristic of
an object.
Examples: eye color of a person,
temperature
Attribute is also known as variable, field,
characteristic, or feature.
A collection of attributes describe an object.
Object is also known as record, point, case,
sample, entity, or instance.

INTRODUCTION TO DATA SCIENCE 5 / 79


ATTRIBUTE / FEATURE

An attribute is a property or characteristic of an object.
) eye color of a person, temperature
Attribute is also known as variable, field,
characteristic, or feature.
The values used to represent an attribute
may have properties that are not properties
of the attribute itself.
) The average age of employees may have a meaning, whereas it makes no sense to
talk about the average employee ID.

INTRODUCTION TO DATA SCIENCE 6 / 79


ATTRIBUTE / FEATURE

The type of an attribute should tell us what properties of the attribute are reflected in
the values used to measure it.
) For the age attribute, the properties of the
integers used to represent age are very
much the properties of the attribute. Even
so, ages have a maximum while integers
do not.
) The ID attribute is distinct. The only valid operation for employee IDs is to test
whether they are equal.

INTRODUCTION TO DATA SCIENCE 7 / 79


PROPERTIES OF ATTRIBUTES

Specify the type of an attribute by identifying the properties that correspond to
underlying properties of the attribute.
Properties include
) Distinctiveness: =, ≠
) Order: <, >, ≥, ≤
) Addition: +, −
) Multiplication: ∗, /
Based on these properties, we define four types of attributes: nominal, ordinal,
interval, and ratio.
Each attribute type possesses all of the properties and operations of the attribute
types above it.

INTRODUCTION TO DATA SCIENCE 8 / 79


TYPES OF ATTRIBUTES
Data
  Numerical
    Ratio – length, time, counts
    Interval – calendar dates, temperature
  Categorical
    Ordinal – grades, shirt size
    Nominal – ID numbers, eye color, zip codes
INTRODUCTION TO DATA SCIENCE 9 / 79
TYPES OF ATTRIBUTES

INTRODUCTION TO DATA SCIENCE 10 / 79


TYPES OF ATTRIBUTES

INTRODUCTION TO DATA SCIENCE 11 / 79


ATTRIBUTES AND TRANSFORMATIONS

(Table figure omitted; source: Introduction to Data Mining by Tan)
INTRODUCTION TO DATA SCIENCE 12 / 79
ATTRIBUTES BY THE NUMBER OF VALUES
Discrete Attribute
) only a finite or countable infinite set of values.
) zip codes, counts, or the set of words in a collection of documents
) Often represented as integer variables.

) Note: binary attributes are a special case of discrete attributes

Continuous Attribute
) Real numbers as attribute values.
) temperature, height, or weight
) Continuous attributes are typically represented as floating-point variables.

Asymmetric Attribute
) only the presence of a non-zero attribute value is considered.
) For a specific student, an attribute has a value of 1 if the student took the course
associated with that attribute and a value of 0 otherwise
) Asymmetric binary attributes.

INTRODUCTION TO DATA SCIENCE 13 / 79


TYPES OF ATTRIBUTES EXAMPLE

Identify the types of attributes in the given data.

ID Age Gender Course ID Percentage Grade


19001 24 Female CS 104 74 Good
19002 23 Male CS 102 75 Good
19003 25 Female CS 103 67 Fair
19004 24 Female CS 104 79 Good
19005 23 Male CS 102 75 Good
19006 24 Female CS 103 87 Excellent
19007 26 Male CS 105 70 Good

INTRODUCTION TO DATA SCIENCE 14 / 79


TYPES OF ATTRIBUTES EXAMPLE

Identify the types of attributes in the given data.

ID Age Gender Course ID Percentage Grade


19001 24 Female CS 104 74 Good
19002 23 Male CS 102 75 Good
19003 25 Female CS 103 67 Fair
19004 24 Female CS 104 79 Good
19005 23 Male CS 102 75 Good
19006 24 Female CS 103 87 Excellent
19007 26 Male CS 105 70 Good
Nominal Ratio Nominal Nominal Ratio Ordinal

INTRODUCTION TO DATA SCIENCE 15 / 79


TYPES OF ATTRIBUTES EXAMPLE

Identify whether the attribute is discrete and continuous in the given data.

ID Age Gender Course ID Percentage Grade


19001 24 Female CS 104 74 Good
19002 23 Male CS 102 75 Good
19003 25 Female CS 103 67 Fair
19004 24 Female CS 104 79 Good
19005 23 Male CS 102 75 Good
19006 24 Female CS 103 87 Excellent
19007 26 Male CS 105 70 Good

INTRODUCTION TO DATA SCIENCE 16 / 79


TYPES OF ATTRIBUTES EXAMPLE

Identify whether the attribute is discrete and continuous in the given data.

ID Age Gender Course ID Percentage Grade


19001 24 Female CS 104 74 Good
19002 23 Male CS 102 75 Good
19003 25 Female CS 103 67 Fair
19004 24 Female CS 104 79 Good
19005 23 Male CS 102 75 Good
19006 24 Female CS 103 87 Excellent
19007 26 Male CS 105 70 Good
Discrete Continuous Discrete Discrete Continuous Discrete

INTRODUCTION TO DATA SCIENCE 17 / 79


TABLE OF CONTENTS

1 DATA
2 DATA-SETS
3 DATA QUALITY

4 DATA MODELS
5 ANALYSIS IN DATA SCIENCE
6 DATA PIPELINES AND PATTERNS
7 FURTHER READING

INTRODUCTION TO DATA SCIENCE 18 / 79


TYPES OF DATA-SETS
1 Structured data
) Data containing a defined data type, format and structure.
) Example: transaction data, online analytical processing (OLAP) data cubes, traditional
RDBMS, CSV files and spreadsheets.
2 Semi structured data
) Textual data file with discernible pattern that enables parsing
) Example: XML data file, HTML of a web page
3 Quasi structured data
) Textual data with erratic data format that can be formatted with effort, tools and time
) Example: Web click-stream data
4 Unstructured data
) Data that has no inherent structure.
) Example: text document, PDF, images and video, email
INTRODUCTION TO DATA SCIENCE 19 / 79
STRUCTURED DATA
RDBMS Data

INTRODUCTION TO DATA SCIENCE 20 / 79


SEMI-STRUCTURED DATA
JSON Data

INTRODUCTION TO DATA SCIENCE 21 / 79


QUASI-STRUCTURED DATA

Web Click-Stream

INTRODUCTION TO DATA SCIENCE 22 / 79


TYPES OF DATA-SETS

INTRODUCTION TO DATA SCIENCE 23 / 79


DATA-SETS

1 Public data
) Data that has been collected and preprocessed for academic or research purposes and
made public.
) https://archive.ics.uci.edu/
2 Private data
) Data that is specific to an organization.
) Privacy rules like the IT Act 2000 and GDPR apply.

INTRODUCTION TO DATA SCIENCE 24 / 79


RECORD DATA
Record data – flat file (CSV), RDBMS
Transaction data – set of items – banking, retail, e-commerce
Data Matrix – record data with only numeric attributes. – SPSS data matrix
Sparse Data Matrix – binary asymmetric data. 0/1 entries.
Document term matrix – Frequency of terms that appears in documents

INTRODUCTION TO DATA SCIENCE 25 / 79


ORDERED DATA EXAMPLE
Sequential data or temporal data – Record data + time. Eg: Money transfer
transaction in Banking
Sequence data – positions instead of time stamps. Eg: DNA sequence bases (G, T, A,
C)
Time series data – temporal autocorrelation

INTRODUCTION TO DATA SCIENCE 26 / 79


TEXT DATA

Text is considered as 1-D data


Eg: Email body, PDF document, word document

INTRODUCTION TO DATA SCIENCE 27 / 79


AUDIO DATA

Audio is considered as 1-D time series data


Eg: Speech, Music

INTRODUCTION TO DATA SCIENCE 28 / 79


IMAGE DATA

Images are considered as 2-D data in Euclidean space


Digital Images are stored in a matrix or grid form where the intensity or colour
information is stored in the (x,y) position.
Black and white image – intensity is represented as 0 and 1 respectively
Greyscale image – intensity is represented as an integer between 0 and 255, where 0
represents black, 255 represents white, and intermediate values are shades of grey.
Colour image – contains 3 bands or channels – Red, Green and Blue – each colour is
represented as an integer between 0 and 255.
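
A minimal sketch of this representation, assuming NumPy is available (the arrays
below are made-up toy images, not real photographs):

import numpy as np

gray = np.array([[0, 125, 255],
                 [60, 180, 90]], dtype=np.uint8)  # 2x3 greyscale image: 0=black, 255=white
print(gray.shape)                                 # (2, 3) -> (rows, columns)

rgb = np.zeros((2, 3, 3), dtype=np.uint8)         # 3 channels (bands): R, G, B
rgb[0, 0] = [255, 0, 0]                           # top-left pixel set to pure red
print(rgb.shape)                                  # (2, 3, 3) -> (rows, columns, channels)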

INTRODUCTION TO DATA SCIENCE 29 / 79


DIGITAL GRAYSCALE IMAGE

Pixel intensities = I(x, y )

https://mozanunal.com/2019/11/img2sh/
INTRODUCTION TO DATA SCIENCE 30 / 79
DIGITAL COLOUR IMAGE

https://www.analyticsvidhya.com/blog/2021/03/grayscale-and-rgb-format-for-storing-images/
INTRODUCTION TO DATA SCIENCE 31 / 79
DIGITAL COLOUR IMAGE

https://www.mathworks.com/help/matlab/creating_plots/image-types.html
INTRODUCTION TO DATA SCIENCE 32 / 79
GRAPH DATA EXAMPLE

Data with relationships among objects – Web pages


Data with objects as graphs – chemical compound

https://lod-cloud.net/
INTRODUCTION TO DATA SCIENCE 33 / 79
TABLE OF CONTENTS

1 DATA
2 DATA-SETS
3 DATA QUALITY

4 DATA MODELS
5 ANALYSIS IN DATA SCIENCE
6 DATA PIPELINES AND PATTERNS
7 FURTHER READING

INTRODUCTION TO DATA SCIENCE 34 / 79


DATA QUALITY INDEX

https://www.deltapartnersgroup.com/
INTRODUCTION TO DATA SCIENCE 35 / 79
DATA QUALITY ISSUES

Missing data
) Data that is not filled / available intentionally or otherwise.
) Attributes of interest may not always be available, such as customer information for sales
transaction data.
) Some data were not considered important at the time of entry.

) Relevant data may not be recorded due to a misunderstanding or because of equipment
malfunctions.
Duplicate data
Orphaned data
Text encoding errors
Data that is biased

INTRODUCTION TO DATA SCIENCE 36 / 79


DATA QUALITY ISSUES

Noise and outliers
) Noise is a random error or variance in a measured data object.
) Data objects with behaviors that are very different from expectation are called outliers or
anomalies.
Inaccurate data
) Inaccurate data – data having incorrect attribute values
) Caused by, faulty data collection instruments, human or computer errors occurring at data
entry, users may purposely submit incorrect data values, errors in data transmission
Inconsistent data
) inconsistencies in naming conventions or data codes, or inconsistent formats for input
fields (e.g., date).

INTRODUCTION TO DATA SCIENCE 37 / 79


EXAMPLE: DATA QUALITY ISSUES

Find the issues in the given data.

Name Age Date of Birth Course ID Percentage


Amy 24 01-Jan-1995 CS 104 74
Ben 23 Dec-01-1996 CS 102 75
Cathy 25 01-Nov-1994 67
Diana 24 Oct-01-1995 CS 104 79
Ben 23 Dec-01-1996 CS 102 75
Eden 24 CS 103 175
Fischer 01-01-1959 CS 105 70

INTRODUCTION TO DATA SCIENCE 38 / 79


EXAMPLE: DATA QUALITY ISSUES

Missing data – age, date of birth, course ID


Inconsistent data – date of birth
Duplicate data – Ben is duplicated
Conformity (out-of-range value) – Percentage = 175 (checks for all four issues are sketched below)
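
These checks can be automated; a sketch assuming pandas, with the table above
re-typed as a hypothetical DataFrame:

import pandas as pd

df = pd.DataFrame({
    "Name": ["Amy", "Ben", "Cathy", "Diana", "Ben", "Eden", "Fischer"],
    "Age": [24, 23, 25, 24, 23, 24, None],
    "CourseID": ["CS 104", "CS 102", None, "CS 104", "CS 102", "CS 103", "CS 105"],
    "Percentage": [74, 75, 67, 79, 75, 175, 70],
})

print(df.isna().sum())                        # missing values per column
print(df[df.duplicated()])                    # exact duplicate rows (Ben)
print(df[~df["Percentage"].between(0, 100)])  # conformity: out-of-range values (175)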

INTRODUCTION TO DATA SCIENCE 39 / 79


TABLE OF CONTENTS

1 DATA
2 DATA-SETS
3 DATA QUALITY

4 DATA MODELS
5 ANALYSIS IN DATA SCIENCE
6 DATA PIPELINES AND PATTERNS
7 FURTHER READING

INTRODUCTION TO DATA SCIENCE 40 / 79


FORMAL DATA MODELS

Model is something we construct to help us understand the real world.


One key goal of formal modelling is to develop a precise specification of your question
and how your data can be used to answer that question.
Formal models allow you to identify clearly what you are trying to infer from data and
what form the relationships between features of the population take.

INTRODUCTION TO DATA SCIENCE 41 / 79


GENERAL FRAMEWORK FOR MODELLING

Apply the basic epicycle of analysis to the formal modelling portion of data analysis.
1 Setting expectations.
  ) Develop a primary model that represents your best sense of what provides the answer
    to your question. This model is chosen based on whatever information you have
    currently available.
2 Collecting information.
  ) Create a set of secondary models that challenge the primary model in some way.
3 Revising expectations.
  ) If our secondary models are successful in challenging our primary model and put the
    primary model's conclusions in some doubt, then we may need to adjust or modify
    the primary model to better reflect what we have learned from the secondary models.

INTRODUCTION TO DATA SCIENCE 42 / 79


STATISTICAL MODEL

A statistical model serves two key purposes in a data analysis,


) quantitative summary of data.
) impose a specific structure on the population from which the data were sampled.
A statistic is any summary of the data.
The sample mean, median, the standard deviation, the maximum, the minimum, and
the range are statistics.

INTRODUCTION TO DATA SCIENCE 43 / 79


MODELS AS EXPECTATIONS

A statistical model must impose some structure on the data.


A statistical model provides a description of how the world works and how the
data were generated.
The model is essentially an expectation of the relationships between various factors in
the real world and in your dataset.
Mimics the population behavior, realized through a sample of data.
A statistical model allows for some randomness in generating the data.

INTRODUCTION TO DATA SCIENCE 44 / 79


DATA MODEL - CASE STUDY

Conduct a survey of 20 people to ask them how much they’d be willing to spend on a
product you’re developing.
The survey response

25, 20, 15, 5, 30, 7, 5, 10, 12, 40, 30, 30, 10, 25, 10, 20, 10, 10, 25, 5

What do the data say?


Note: The example is hypothetical; generally we select a larger sample size for
modelling.

INTRODUCTION TO DATA SCIENCE 45 / 79


STEP 1: SETTING EXPECTATIONS

The sample data represents the overall population likely to purchase the product.
Mean = $17.2 and Standard Deviation = $10.39
The expectation under the Normal model is that the distribution of prices
that people are willing to pay is bell-shaped around this mean.
According to the model, about 68% of the
population would be willing to pay somewhere
between $6.81 and $27.59 for this new product.

INTRODUCTION TO DATA SCIENCE 46 / 79


STEP 2: COMPARING MODEL EXPECTATIONS WITH REALITY

Given the parameters, our expectation under the Normal model is that the distribution
of prices that people are willing to pay looks like a bell-shaped curve.
E.g. Normal curve on top of the histogram of the
20 data points of the amount people say they are
willing to pay. The histogram has a large spike
around 10.
Normal distribution allows for negative values on
the left-hand side of the plot, but there are no
data points in that region of the plot.

INTRODUCTION TO DATA SCIENCE 47 / 79


STEP 3: REFINING OUR EXPECTATIONS

When the model and the data don't match very well:
) Get a different model.
) Get different data.
) Do both.

E.g. Choose a different statistical model to represent the population, such as the
Gamma distribution, which has the feature that it only allows positive values.

INTRODUCTION TO DATA SCIENCE 48 / 79


STEP 3: REFINING OUR EXPECTATIONS

Normal vs Gamma Distribution – which model to choose?
Qn 1: What percentage of the population is
willing to pay at least $30 for this product?
) Normal distribution – 11% would pay $30 or
more
) Gamma distribution – 7% would pay $30 or more

Based on which model suits the problem at hand, choose the appropriate model.
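
A minimal sketch of this comparison, assuming SciPy; the exact percentages
depend on the fitting method used, so they only approximate the slide's figures:

from scipy import stats

x = [25, 20, 15, 5, 30, 7, 5, 10, 12, 40, 30, 30, 10, 25, 10, 20, 10, 10, 25, 5]

mu, sigma = stats.norm.fit(x)               # Normal model: mean ~17.2, sd ~10.39
a, loc, scale = stats.gamma.fit(x, floc=0)  # Gamma model (positive support only)

print("P(X >= 30 | Normal):", 1 - stats.norm.cdf(30, mu, sigma))                # close to the slide's 11%
print("P(X >= 30 | Gamma): ", 1 - stats.gamma.cdf(30, a, loc=loc, scale=scale)) # close to the slide's 7%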

INTRODUCTION TO DATA SCIENCE 49 / 79


DEVELOPING A BENCHMARK MODEL

The goal is to develop a benchmark model that serves as a baseline upon which we'll
measure the performance of a better and more attuned algorithm.
Benchmarking requires experiments to be comparable, measurable, and reproducible.

INTRODUCTION TO DATA SCIENCE 50 / 79


TABLE OF CONTENTS

1 DATA
2 DATA-SETS
3 DATA QUALITY

4 DATA MODELS
5 ANALYSIS IN DATA SCIENCE
6 DATA PIPELINES AND PATTERNS
7 FURTHER READING

INTRODUCTION TO DATA SCIENCE 51 / 79


CLASS OR CONCEPT DESCRIPTIONS

Class or Concept Descriptions describe individual classes and concepts in


summarized, concise, and yet precise terms.
Concept descriptions can be derived using
1 data characterization, by summarizing the data of the class under study
2 data discrimination, by comparison of the target class with one or a set of comparative
classes
3 both data characterization and discrimination.

INTRODUCTION TO DATA SCIENCE 52 / 79


DATA CHARACTERIZATION

Data characterization is a summarization of the general characteristics or


features of a target class of data.
Methods for data characterization
) data summaries based on statistical measures and plots
) data cube-based OLAP roll-up operation
) attribute-oriented induction technique

Output of data characterization


) bar charts, curves, multidimensional data cubes, and multidimensional tables.
) generalized relations or in rule form called characteristic rules.

INTRODUCTION TO DATA SCIENCE 53 / 79


DATA DISCRIMINATION

Data discrimination is a comparison of the general features of the target class


data objects against the general features of objects from one or multiple
contrasting classes.
The target and contrasting classes can be specified by a user, and the corresponding
data objects can be retrieved through database queries.
The methods used for data discrimination are similar to those used for data
characterization.
Output presentation
) Discrimination descriptions expressed in the form of rules are referred to as
discriminant rules.

INTRODUCTION TO DATA SCIENCE 54 / 79


ASSOCIATION ANALYSES

Frequent patterns are patterns that occur frequently in data.


Many kinds of frequent patterns
) A frequent itemset refers to a set of items that often appear together in a transactional
data set. E.g: milk and bread
) A frequently occurring subsequence is a (frequent) sequential pattern. Eg:

customers tend to purchase first a laptop, followed by a digital camera, and then a
memory card.
) A substructure can refer to different structural forms (e.g., graphs, trees, or lattices) that
may be combined with itemsets or subsequences. If a substructure occurs frequently,
it is called a (frequent) structured pattern.
) Mining frequent patterns leads to the discovery of interesting associations and
correlations within data.

INTRODUCTION TO DATA SCIENCE 55 / 79


PREDICTION ANALYSES

The term prediction refers to both numeric prediction and class label
prediction.
Classification and regression may need to be preceded by relevance analysis, which
attempts to identify attributes that are significantly relevant to the classification and
regression process.

INTRODUCTION TO DATA SCIENCE 56 / 79


CLASSIFICATION ANALYSES

Classification is the process of finding a model (or function) that describes and
distinguishes data classes or concepts.
The model is derived based on the analysis of a set of training data (i.e., data
objects for which the class labels are known).
The model is used to predict the class label of objects for which the class label is
unknown.
The derived model may be represented as classification rules (i.e., IF-THEN rules),
decision trees, mathematical formulae, or neural networks, naive Bayesian
classification, support vector machines, and k-nearest-neighbor classification.
Classification predicts categorical (discrete, unordered) labels.

INTRODUCTION TO DATA SCIENCE 57 / 79


REGRESSION ANALYSES

Regression models continuous-valued functions.


Regression is used to predict missing or unavailable numerical data values rather
than (discrete) class labels.
Regression analysis is a statistical methodology that is most often used for numeric
prediction.
Regression also encompasses the identification of distribution trends based on the
available data.

INTRODUCTION TO DATA SCIENCE 58 / 79


CLUSTER ANALYSIS

Clustering analyzes data objects without consulting class labels.


Clustering can be used to generate class labels for a group of data. The objects are
clustered or grouped based on the principle of maximizing the intraclass similarity and
minimizing the interclass similarity.
clusters of objects are formed so that objects within a cluster have high similarity in
comparison to one another, but are rather dissimilar to objects in other clusters.
Each cluster so formed can be viewed as a class of objects, from which rules can be
derived.
Clustering can also facilitate taxonomy formation, that is, the organization of
observations into a hierarchy of classes that group similar events together.

INTRODUCTION TO DATA SCIENCE 59 / 79


TABLE OF CONTENTS

1 DATA
2 DATA-SETS
3 DATA QUALITY

4 DATA MODELS
5 ANALYSIS IN DATA SCIENCE
6 DATA PIPELINES AND PATTERNS
7 FURTHER READING

INTRODUCTION TO DATA SCIENCE 60 / 79


DATA PIPELINE STAGES

Data pipelines are sets of processes that move and transform data from various
sources to a destination where new value can be derived.
In their simplest form, pipelines may extract only data from one source such as a
REST API and load to a destination such as a SQL table in a data warehouse.
In practice, data pipelines consist of multiple steps including data extraction, data
preprocessing, data validation, and at times training or running a machine learning
model before delivering data to its final destination.
Data engineers specialize in building and maintaining the data pipelines.

INTRODUCTION TO DATA SCIENCE 61 / 79


WHY BUILD DATA PIPELINES?

For every dashboard and insight that a data analyst generates and for each predictive
model developed by a data scientist, there are data pipelines working behind the
scenes.
A single dashboard, or a single metric may be derived from data originating in
multiple source systems.
Data pipelines extract data from sources and load them into simple database tables
or flat files for analysts to use. Raw data is refined along the way to clean, structure,
normalize, combine, aggregate, and anonymize or secure it.

INTRODUCTION TO DATA SCIENCE 62 / 79


DIVERSITY OF DATA SOURCES

INTRODUCTION TO DATA SCIENCE 63 / 79


DATA INGESTION
The term data ingestion refers to extracting data from one source and loading it into
another.
Ingestion Interface and data structure
) A database behind an application, such as a Postgres or MySQL database or NoSQL
database
) JSON from a REST API
) A stream processing platform such as Apache Kafka

) A shared network file system or cloud storage bucket containing logs, comma-separated
value (CSV) files, and other flat files
) Semi-structured log data
) A data warehouse or data lake
) Data in HDFS or HBase database

Data ingestion is traditionally both the extract and load steps of an ETL or ELT
process.
INTRODUCTION TO DATA SCIENCE 64 / 79
SIMPLE PIPELINE

INTRODUCTION TO DATA SCIENCE 65 / 79


ETL AND ELT

E– extract step
) gathers data from various sources in preparation for loading and transforming.
L – load step
) brings either the raw data (in the case of ELT) or the fully transformed data (in the case of
ETL) into the final destination.
) load data into the data warehouse, data lake, or other destination.

T – transform step
) raw data from each source system is combined and formatted in such a way that it's
useful to analysts and visualization tools (a sketch follows)
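
A minimal ELT-style sketch; the file and table names are hypothetical, and sqlite3
stands in for the warehouse:

import csv
import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS raw_orders (order_id TEXT, amount REAL)")

# E + L: extract from the source and load the raw rows unchanged
with open("orders.csv", newline="") as f:   # hypothetical source extract
    rows = [(r["order_id"], float(r["amount"])) for r in csv.DictReader(f)]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", rows)

# T: transform inside the destination, into a shape that is useful to analysts
conn.execute("""CREATE TABLE IF NOT EXISTS order_summary AS
                SELECT COUNT(*) AS orders, SUM(amount) AS revenue FROM raw_orders""")
conn.commit()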

INTRODUCTION TO DATA SCIENCE 66 / 79


ELT PIPELINE

INTRODUCTION TO DATA SCIENCE 67 / 79


ORCHESTRATING PIPELINES

Orchestration ensures that the steps in a pipeline are run in the correct order and that
dependencies between steps are managed properly.
Pipeline steps (tasks) are always directed, meaning they start with a task or multiple
tasks and end with a specific task or tasks. This is required to guarantee a path of
execution.
Pipeline graphs must also be acyclic, meaning that a task cannot point back to a
previously completed task.
Pipelines are implemented as DAGs (Directed Acyclic Graphs).
Orchestration tool – Apache Airflow
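
A minimal sketch of such a DAG in Apache Airflow (task names are made up;
parameter names vary slightly across Airflow versions, this follows the 2.x style):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("simple_pipeline", start_date=datetime(2024, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    load = BashOperator(task_id="load", bash_command="echo load")
    transform = BashOperator(task_id="transform", bash_command="echo transform")

    extract >> load >> transform  # directed edges; no task points back (acyclic)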

INTRODUCTION TO DATA SCIENCE 68 / 79


ORCHESTRATION DAG

INTRODUCTION TO DATA SCIENCE 69 / 79


VARIOUS DATA SOURCE

INTRODUCTION TO DATA SCIENCE 70 / 79


VARIOUS DATA SOURCE

INTRODUCTION TO DATA SCIENCE 71 / 79


VARIOUS DATA SOURCE

INTRODUCTION TO DATA SCIENCE 72 / 79


TABLE OF CONTENTS

1 DATA
2 DATA-SETS
3 DATA QUALITY

4 DATA MODELS
5 ANALYSIS IN DATA SCIENCE
6 DATA PIPELINES AND PATTERNS
7 FURTHER READING

INTRODUCTION TO DATA SCIENCE 73 / 79


DATABASE DATA

Database management system (DBMS), consists of a collection of interrelated


data, known as a database, and a set of software programs to manage and access
the data.
The software programs provide mechanisms for defining database structures and
data storage; for specifying and managing concurrent, shared, or distributed data
access; and for ensuring consistency and security of the information stored despite
system crashes or attempts at unauthorized access.

INTRODUCTION TO DATA SCIENCE 74 / 79


RDBMS DATA

A relational database (RDBMS) is a collection of tables, each of which is assigned a


unique name.
Each table consists of a set of attributes (columns or fields) and usually stores a large
set of tuples (records or rows).
Each tuple in a relational table represents an object identified by a unique key and
described by a set of attribute values.

INTRODUCTION TO DATA SCIENCE 75 / 79


DATA WAREHOUSE

A data warehouse is a repository of information collected from multiple


sources, stored under a unified schema, and usually residing at a single site.
Data warehouses are constructed via a process of data cleaning, data integration,
data transformation, data loading, and periodic data refreshing.
Data in a data warehouse is structured and optimized for reporting and analysis
queries.

INTRODUCTION TO DATA SCIENCE 76 / 79


TRANSACTIONAL DATA

Each record in a transactional database captures a transaction, such as a customer’s


purchase, a flight booking, or a user’s clicks on a web page.
A transaction includes a unique transaction identity number (trans ID) and a list of the
items making up the transaction.
A transactional database may have additional tables, which contain other information
related to the transactions, such as item description, information about the
salesperson or the branch, and so on.

INTRODUCTION TO DATA SCIENCE 77 / 79


DATA LAKES

A data lake is where data is stored, but without the structure or query
optimization of a data warehouse.
It will contain a high volume of data as well as a variety of data types.
It is not optimized for querying such data in the interest of reporting and analysis.
Eg: a single data lake might contain a collection of blog posts stored as text files, flat
file extracts from a relational database, and JSON objects containing events
generated by sensors in an industrial system.

INTRODUCTION TO DATA SCIENCE 78 / 79


Introduction to Data Mining, by Tan, Steinbach and Vipin Kumar (T1)
The Art of Data Science by Roger D Peng and Elizabeth Matsui (R1)
Data Mining: Concepts and Techniques, Third Edition by Jiawei Han and Micheline
Kamber Morgan Kaufmann Publishers, 2006 (T4)
On Being a Data Skeptic Publisher(s): O’Reilly Media, Inc. ISBN: 9781449374310

THANK YOU

INTRODUCTION TO DATA SCIENCE 79 / 79


INTRODUCTION TO DATA SCIENCE
MODULE # 4 : DATA WRANGLING
Febin.A.Vahab
IDS Course Team
BITS Pilani
The instructor is gratefully acknowledging
the authors who made their course
materials freely available online.

INTRODUCTION TO DATA SCIENCE 2 / 67
TABLE OF CONTENTS
1 STATISTICAL DESCRIPTIONS OF DATA (SELF STUDY)
2 DATA PREPARATION
3 DATA AGGREGATION, SAMPLING
4 DATA SIMILARITY & DISSIMILARITY MEASURE
INTRODUCTION TO DATA SCIENCE 3 / 67
Statistical Descriptions of Data

• Measuring the Central Tendency


• Measuring the Dispersion of Data
• Boxplot Analysis

BITS Pilani, Pilani Campus


MEASURES OF CENTRAL TENDENCY
Gives an idea of the central tendency of the data.
Measures of central tendency include the mean, median, mode, and midrange.
Let x1, x2, ..., xN be the set of N observed values or observations for numeric
attribute X. Assume X is sorted in increasing order.
INTRODUCTION TO DATA SCIENCE 5 / 67
MEAN
A common and effective numeric measure of the "center" of a set of data is the
(arithmetic) mean:

    x̄ = (x1 + x2 + ... + xN) / N = (1/N) Σ_{i=1}^{N} xi

Weighted average
) Sometimes, each value xi in a set may be associated with a weight wi.
) The weights reflect the significance, importance, or occurrence frequency attached to
  their respective values.

    x̄ = (w1 x1 + w2 x2 + ... + wN xN) / (w1 + w2 + ... + wN) = Σ_{i=1}^{N} wi xi / Σ_{i=1}^{N} wi

Issue: The mean is sensitive to extreme (e.g., outlier) values.
Issue: For skewed (asymmetric) data, a better measure of the center of data is the median.
INTRODUCTION TO DATA SCIENCE 6 / 67
MEDIAN

If N is odd, then the median is the middle value of the ordered set.
If N is even, then the median is not unique; it is the two middlemost values and any
value in between.
If X is a numeric attribute, the median is taken as the average of the two middlemost
values.

INTRODUCTION TO DATA SCIENCE 7 / 67
MODE

Mode for a set of data is the value that occurs most frequently in the set.
Mode can be determined for qualitative and quantitative attributes.
Data sets with one, two, or three modes are respectively called unimodal, bimodal,
and trimodal. In general, a data set with two or more modes is multimodal.

INTRODUCTION TO DATA SCIENCE 8 / 67
SYMMETRIC DATA AND SKEWED DATA
In a unimodal frequency curve with perfect symmetric data distribution, the
mean, median, and mode are all at the same center value.

mean − mode ≈ 3(mean − median)


Data in most real applications are not symmetric.
) Positively skewed – the mode occurs at a value that is smaller than the median.
) Negatively skewed – the mode occurs at a value greater than the median.

INTRODUCTION TO DATA SCIENCE 9 / 67
MIDRANGE

Average of minimum and maximum values.


midrange = (min + max) / 2

INTRODUCTION TO DATA SCIENCE 10 / 67
EXAMPLE

X = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
mean = (30 + 36 + 47 + 50 + 52 + 52 + 56 + 60 + 63 + 70 + 70 + 110) / 12 = 58
median = (52 + 56) / 2 = 54
mode = 52, 70
midrange = (30 + 110) / 2 = 70
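
These values can be checked with Python's standard library (statistics.multimode
needs Python 3.8+):

import statistics as st

x = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(st.mean(x))             # 58
print(st.median(x))           # 54.0
print(st.multimode(x))        # [52, 70]
print((min(x) + max(x)) / 2)  # 70.0 (midrange)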

INTRODUCTION TO DATA SCIENCE 11 / 67
DATA DISPERSION MEASURES

Range
Quartiles, and interquartile range
Five-number summary and boxplots
Variance and standard deviation

INTRODUCTION TO DATA SCIENCE 12 / 67
RANGE

The range of the set is the difference between the largest and smallest values.

range = max − min

INTRODUCTION TO DATA SCIENCE 13 / 67
QUANTILES

Quantiles are points taken at regular intervals of a data distribution, dividing it into
essentially equal-size consecutive sets.
The kth q-quantile for a given data distribution is the value x such that at most k/q of
the data values are less than x and at most (q − k)/q of the data values are more
than x, where k is an integer such that 0 < k < q.
There are q − 1 q-quantiles.

INTRODUCTION TO DATA SCIENCE 14 / 67
QUARTILES OR PERCENTILES

Three data points split the data distribution into four equal parts.
Each part represents one-fourth of the data distribution.
Q1 is the 25th percentile and Q3 is the 75th percentile.
Quartiles give an indication of a distribution's center, spread, and shape.

INTRODUCTION TO DATA SCIENCE 15 / 67
INTERQUARTILE RANGE (IQR)

Distance between the first and third quartiles.
Measure of spread that gives the range covered by the middle half of the data.

IQR = Q3 − Q1

Identifying outliers as values falling at least 1.5 × IQR above the third quartile or
below the first quartile.

INTRODUCTION TO DATA SCIENCE 16 / 67
FIVE-NUMBER SUMMARY

The five-number summary of a distribution consists of the median (Q2), the quartiles
Q1 and Q3, and the smallest and largest individual observations.
Written in the order

Five-number Summary = [Minimum, Q1, Median, Q3, Maximum]

INTRODUCTION TO DATA SCIENCE 17 / 67
Exercise

Find the outlier in the following data using the Inter-Quartile Range.

Data = 10, 12, 11, 15, 11, 14, 13, 17, 12, 22, 14, 11

1. Sort: 10, 11, 11, 11, 12, 12, 13, 14, 14, 15, 17, 22
2. Median: (12 + 13)/2 = 12.5 = Q2
3. Q1 = 11 (25th percentile)
4. Q3 = 14.5 (75th percentile)
5. IQR = Q3 − Q1 = 3.5
6. Lower fence = Q1 − 1.5 × IQR = 5.75
7. Upper fence = Q3 + 1.5 × IQR = 19.75

Outlier = 22

BITS Pilani, Pilani Campus


MEASURING THE DISPERSION OF DATA
BOXPLOT ANALYSIS
• Five-number summary of a
distribution
– Minimum, Q1, Median, Q3, Maximum
• Boxplot
– Data is represented with a box
– The ends of the box are at the first and
third quartiles, i.e., the height of the box
is IQR
– The median is marked by a line within
the box
– Whiskers: two lines outside the box
extended to Minimum and Maximum
– Outliers: points beyond a specified
outlier threshold, plotted individually

BITS Pilani, Pilani Campus


VARIANCE

Variance and standard deviation indicate how spread out a data distribution is.

    σ² = (1/N) Σ_{i=1}^{N} (xi − x̄)² = ((1/N) Σ_{i=1}^{N} xi²) − x̄²

INTRODUCTION TO DATA SCIENCE 20 / 67
STANDARD DEVIATION

Standard deviation σ of the observations is the square root of the variance σ2.
A low standard deviation means that the data observations tend to be very close to
the mean.
A high standard deviation indicates that the data are spread out over a large range of
values.
σ measures spread about the mean and should be considered only when the mean is
chosen as the measure of center.
σ = 0 only when there is no spread, that is, when all observations have the same
value. Otherwise, σ > 0.

INTRODUCTION TO DATA SCIENCE 21 / 67
EXAMPLE

X = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
Q1 = 47 (3rd value)
Q2 = 52 (6th value)
Q3 = 63 (9th value)
IQR = 63 − 47 = 16
σ² = (1/12)(30² + 36² + 47² + ... + 110²) − 58² ≈ 379.17
σ = √379.17 ≈ 19.47

INTRODUCTION TO DATA SCIENCE 22 / 67
TABLE OF CONTENTS
1 STATISTICAL DESCRIPTIONS OF DATA (SELF STUDY)
2 DATA PREPARATION
3 DATA AGGREGATION, SAMPLING
4 DATA SIMILARITY & DISSIMILARITY MEASURE
INTRODUCTION TO DATA SCIENCE 23 / 67
DATA PREPARATION

INTRODUCTION TO DATA SCIENCE 24 / 67
DATA CLEANSING

Focuses on removing errors in your data so your data becomes a true and consistent
representation of the processes it originates from.
Two types of errors
) Interpretation errors – sanity checks such as:
    Age < 100
    Height of a person is less than 7 feet.
    Price is positive.
) Inconsistencies between data sources or against your company's standardized values:
    Female and F
    Feet and meter
    Dollars and Pounds

INTRODUCTION TO DATA SCIENCE 25 / 67
DATA CLEANSING

Errors from data entry
) Cause
    Typos
    Errors due to lack of concentration
    Machine or hardware failure
) Detection
    Frequency table
) Correction
    Simple assignment statements
    If-then-else rules
White-spaces and typos
) Remove leading and trailing white-spaces.
) Change the case of the alphabets from upper to lower.
INTRODUCTION TO DATA SCIENCE 27 / 67
DATA CLEANSING

Physically impossible values
) Examples
    Age < 100
    Height of a person is less than 7 feet.
    Price is positive.
) If-then-else rules
Outliers
) Use visualization techniques like box plots or scatter plots.
) Use statistical summary with minimum and maximum values.
) Identify outliers as values falling at least 1.5 × IQR above the third quartile or below
  the first quartile.

INTRODUCTION TO DATA SCIENCE 28 / 67
DATA CLEANSING
Missing values

INTRODUCTION TO DATA SCIENCE 29 / 67
MISSING VALUES

Ignore the tuple.


) Used when the class label is missing in a classification task.
) Not very effective, unless the tuple contains several attributes with missing values.
) Poor technique when the percentage of missing values per attribute varies considerably.
) By ignoring the tuple, we do not make use of the remaining attributes’ values in the
tuple. Such data could have been useful to the task at hand.
Fill in the missing value manually.
) Time consuming.
) May not be feasible given a large data set with many missing values.

INTRODUCTION TO DATA SCIENCE 30 / 67
MISSING VALUES

Use a global constant to fill in the missing value.


) Replace all missing attribute values by the same constant such as a label like
”Unknown” or -1.
) If missing values are replaced by, say, ”Unknown,” then the mining program may
mistakenly think that they form an interesting concept, since they all have a value in
common—that of ”Unknown.” Hence, although this method is simple, it is not foolproof.
Use a measure of central tendency for the attribute.
) Central tendency indicates the ”middle” value of a data distribution. E.g., mean or
median
) For normal (symmetric) data distributions, the mean can be used.
) Skewed data distribution should employ the median.

INTRODUCTION TO DATA SCIENCE 31 / 67
MISSING VALUES

Use the attribute mean or median for all samples belonging to the same class as the
given tuple.
) For example, if classifying customers according to credit risk, we may replace the
missing value with the mean income value for customers in the same credit risk category
as that of the given tuple.
) If the data distribution for a given class is skewed, the median value is a better choice.
Use the most probable value to fill in the missing value.
) This may be determined with regression, inference-based tools using a Bayesian
formalism, or decision tree induction.
) For example, using the other customer attributes in the data set, we may construct a
decision tree to predict the missing values for income.
) Most popular strategy.

INTRODUCTION TO DATA SCIENCE 32 / 67
DATA CLEANSING

Deviations from code-book


) A code book is a description of your data. It contains things such as the number of
variables per observation, the number of observations, and what each encoding within a
variable means.
) Discrepancies between the code-book and the data should be corrected.
Different units of measurement
) Pay attention to the respective units of measurement.
) Simple conversion can rectify.
Different levels of aggregation
) Data set containing data per week versus one containing data per work week.
) Data summarization will fix it.

INTRODUCTION TO DATA SCIENCE 33 / 67
NOISY DATA

Noise is a random error or variance in a measured variable. Outliers may represent
noise.
Noisy data can be removed by using smoothing techniques.
) Binning
    Smoothing by bin means
    Smoothing by bin medians
) Regression
) Outlier Analysis
) Concept hierarchies are a form of data discretization that can also be used for data
  smoothing.
    For example: A concept hierarchy for price may map real price values into three
    categories: inexpensive, moderately priced, and expensive.

INTRODUCTION TO DATA SCIENCE 34 / 67
Introduction to Data Mining, by Tan, Steinbach and Vipin Kumar (T1)
The Art of Data Science by Roger D Peng and Elizabeth Matsui (R1)
Data Mining: Concepts and Techniques, Third Edition by Jiawei Han and Micheline
Kamber, Morgan Kaufmann Publishers, 2006 (T4)
On Being a Data Skeptic, O'Reilly Media, Inc. ISBN: 9781449374310

THANK YOU

INTRODUCTION TO DATA SCIENCE 36 / 67
INTRODUCTION TO DATA SCIENCE
MODULE # 4 : DATA WRANGLING(CONTD…)
IDS Course Team
BITS Pilani
The instructor is gratefully acknowledging
the authors who made their course
materials freely available online.

INTRODUCTION TO DATA SCIENCE


TABLE OF CONTENTS

1 DATA SIMILARITY & DISSIMILARITY MEASURE
2 VISUALIZATION TECHNIQUES FOR DATA EXPLORATORY ANALYSIS
3 HANDLING NUMERIC DATA
4 MANAGING CATEGORICAL ATTRIBUTES
5 DEALING WITH TEXTUAL DATA

INTRODUCTION TO DATA SCIENCE


MEASURES OF PROXIMITY

Similarity and dissimilarity measures are measures of proximity.


A similarity measure for two objects, i and j, will typically return the value 1 if they
are identical and 0 if the objects are unalike.
The higher the similarity value, the greater the similarity between objects.
A dissimilarity measure returns a value of 0 if the objects are the same.
The higher the dissimilarity value, the more dissimilar the two objects are.

T4:Chapter 2.4
MEASURING DATA SIMILARITY AND DISSIMILARITY
Various proximity measures
• Data Matrix versus Dissimilarity Matrix
• Proximity Measures for Nominal Attributes
• Proximity Measures for Binary Attributes
• Symmetric Binary Attributes
• Asymmetric Binary Attributes
• Proximity Measures for Ordinal Attributes
• Proximity Measures for Numeric Data
• Proximity Measures for Mixed Types
• Cosine Similarity

BITS Pilani, Pilani Campus


DATA MATRIX AND DISSIMILARITY MATRIX
Data matrix
– n data points with p dimensions
– Two modes

Dissimilarity matrix
– n data points, but registers only the distance
– A triangular matrix
– Single mode

BITS Pilani, Pilani Campus


PROXIMITY MEASURE FOR NOMINAL ATTRIBUTES
Categorical Attribute
Attribute 'Color' can take 2 or more states, e.g., red, yellow,
blue, green (generalization of a binary attribute)
• Simple matching
  – m: # of matches, p: total # of variables
  – Dissimilarity: d(i, j) = (p − m) / p
  – Similarity can be computed as: sim(i, j) = m / p = 1 − d(i, j)
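
A small Python sketch of simple matching on two made-up objects:

def nominal_dissimilarity(a, b):
    # d(i, j) = (p - m) / p for two equally long tuples of nominal values
    p = len(a)
    m = sum(x == y for x, y in zip(a, b))
    return (p - m) / p

print(nominal_dissimilarity(("red", "A", "small"), ("red", "B", "small")))  # 1/3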

BITS Pilani, Pilani Campus


PROXIMITY MEASURE FOR NOMINAL ATTRIBUTES
(Worked-example figures from the original slides are omitted here.)
BITS Pilani, Pilani Campus


EXERCISE-CANVAS DISCUSSION

Calculate the dissimilarity matrix for the ordinal attributes

BITS Pilani, Pilani Campus


EXERCISE-CANVAS DISCUSSION

Calculate the dissimilarity matrix and similarity matrix for the ordinal
attributes

BITS Pilani, Pilani Campus


PROXIMITY MEASURE FOR BINARY ATTRIBUTES

• A contingency table for binary data

– where q is the number of attributes that equal 1 for both objects i and j,
– r is the number of attributes that equal 1 for object i but equal 0 for object j,
– s is the number of attributes that equal 0 for object i but equal 1 for object j,
– t is the number of attributes that equal 0 for both objects i and j.
– The total number of attributes is p, where p = q+r+s+t .

BITS Pilani, Pilani Campus


PROXIMITY MEASURE FOR BINARY ATTRIBUTES
(Worked-example figures from the original slides are omitted here.)
BITS Pilani, Pilani Campus

PROXIMITY MEASURE FOR BINARY ATTRIBUTES – EXERCISE
(Exercise figures from the original slides are omitted here.)
BITS Pilani, Pilani Campus


PROXIMITY MEASURE FOR BINARY ATTRIBUTES – SUMMARY
• Distance measure for symmetric binary variables,
  dissimilarity: d(i, j) = (r + s) / (q + r + s + t)
• Distance measure for asymmetric binary variables,
  dissimilarity: d(i, j) = (r + s) / (q + r + s)
• Similarity between asymmetric binary values is given by the
  Jaccard coefficient: sim(i, j) = q / (q + r + s) = 1 − d(i, j)
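
A small Python sketch of these three measures, given the contingency counts
q, r, s, t defined earlier (the example counts are made up):

def symmetric_d(q, r, s, t):
    return (r + s) / (q + r + s + t)

def asymmetric_d(q, r, s):
    return (r + s) / (q + r + s)

def jaccard_sim(q, r, s):
    return q / (q + r + s)

# e.g., q=2 (both 1), r=1, s=1, t=3 (both 0)
print(symmetric_d(2, 1, 1, 3), asymmetric_d(2, 1, 1), jaccard_sim(2, 1, 1))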

BITS Pilani, Pilani Campus


EXERCISE
Suppose that a patient record table contains the attributes
name, gender, fever, cough, test-1, test-2, test-3, and test-4,
where name is an object identifier, gender is a symmetric
attribute, and the remaining attributes are asymmetric binary.
Compute the dissimilarity matrix for asymmetric binary
attributes

BITS Pilani, Pilani Campus


EXERCISE
(Solution figure from the original slides is omitted here.)
BITS Pilani, Pilani Campus


PROXIMITY MEASURES FOR NUMERIC ATTRIBUTES
(Formula and worked-example figures from the original slides are omitted here;
the standard measures are the Euclidean, Manhattan, and Minkowski distances.)
BITS Pilani, Pilani Campus
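
A sketch of these standard measures on two toy points, assuming NumPy:

import numpy as np

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))          # L2 norm: 5.0
manhattan = np.sum(np.abs(x - y))                  # L1 norm: 7.0
minkowski = np.sum(np.abs(x - y) ** 3) ** (1 / 3)  # general form with h = 3
print(euclidean, manhattan, minkowski)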


PROXIMITY MEASURES FOR MIXED TYPE ATTRIBUTES
(Formula and worked-example figures from the original slides are omitted here.
The standard approach combines the per-attribute dissimilarities as a weighted
average, d(i, j) = Σf δij(f) dij(f) / Σf δij(f), where δij(f) indicates whether
attribute f is usable for the pair (i, j).)
BITS Pilani, Pilani Campus


COSINE SIMILARITY
Cosine similarity is a measure of similarity that can be used to
compare documents or, say, give a ranking of documents with
respect to a given vector of query words:

    sim(x, y) = (x · y) / (||x|| ||y||)

where ||x|| is the Euclidean norm of vector x = (x1, x2, ..., xp),
defined as sqrt(x1² + x2² + ... + xp²).

BITS Pilani, Pilani Campus


EXERCISE
Suppose that x and y are the first two term-frequency vectors. That is,
x = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0) and y = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1). How similar
are x and y? Compute the cosine similarity between the two vectors.
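
A quick check of the exercise, assuming NumPy (the answer comes out to about 0.94):

import numpy as np

x = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0])
y = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1])
sim = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
print(round(sim, 2))  # 0.94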

BITS Pilani, Pilani Campus


TABLE OF CONTENTS

1 DATA SIMILARITY & DISSIMILARITY MEASURE
2 VISUALIZATION TECHNIQUES FOR DATA EXPLORATORY ANALYSIS
3 HANDLING NUMERIC DATA
4 MANAGING CATEGORICAL ATTRIBUTES
5 DEALING WITH TEXTUAL DATA

INTRODUCTION TO DATA SCIENCE


BOXPLOT
A boxplot incorporates the
five-number summary.
The ends of the box are at the
quartiles.
The box length is the interquartile
range.
The median is marked by a line within
the box.
The whiskers outside the box extend
to the Minimum and Maximum
observations.
Computed in O(n log n) time.
INTRODUCTION TO DATA SCIENCE
HISTOGRAM
Graphical method for summarizing the distribution of an attribute, X .
If X is nominal
) Bar chart
) A vertical bar is drawn for each
known value of X .
) The height of the bar indicates the

frequency of that X value.


If X is numeric
) Histogram
) The range of values for X is partitioned into disjoint consecutive subranges or buckets or
bins.
) The range of a bucket is known as the width.
) The buckets are of equal width.
INTRODUCTION TO DATA SCIENCE 42 / 113
SCATTERPLOT
Determine if there appears to be a relationship, pattern, or trend between two
numeric attributes.
Provide a visualization of bi-variate data to see clusters of points and outliers, or
correlation relationships.
Correlations can be positive, negative, or null (uncorrelated).

INTRODUCTION TO DATA SCIENCE 43 / 113


DEMO CODE

Visualization.ipynb

INTRODUCTION TO DATA SCIENCE 44 / 113


TABLE OF CONTENTS

1 DATA SIMILARITY & DISSIMILARITY MEASURE
2 VISUALIZATION TECHNIQUES FOR DATA EXPLORATORY ANALYSIS
3 HANDLING NUMERIC DATA
4 MANAGING CATEGORICAL ATTRIBUTES
5 DEALING WITH TEXTUAL DATA

INTRODUCTION TO DATA SCIENCE


HANDLING NUMERIC DATA

Techniques are
Discretization – Convert numeric data into discrete categories
Binarization – Convert numeric data into binary categories
Normalization – Scale numeric data to a specific range
Smoothing
• which works to remove noise from the data. Techniques include binning, regression, and
clustering.
• random method, simple moving average, random walk, simple exponential, and
exponential moving average (Will learn in ISM)

T4:Chapter 3.5

46 / 113
DISCRETIZATION

Convert continuous attribute into a discrete attribute.


Discretization involves converting the raw values of a numeric attribute (e.g., age) into
) interval labels (e.g., 0–10, 11–20, etc.)
) conceptual labels (e.g., youth, adult, senior)
Discretization Process
) The raw data are replaced by a smaller number of interval or concept labels.
) This simplifies the original data and makes the mining more efficient.
) Concept hierarchies are also useful for mining at multiple abstraction levels.

INTRODUCTION TO DATA SCIENCE 47 / 113


CONCEPT HIERARCHY
Divide the range of a continuous attribute into intervals.
Interval labels can then be used to replace actual data values.
The labels, in turn, can be recursively organized into higher-level concepts.
This results in a concept hierarchy for the numeric attribute.

INTRODUCTION TO DATA SCIENCE 48 / 113


DISCRETIZATION TECHNIqUES

Unsupervised discretization
) Binning [ Equal-interval, Equal-frequency] (Top-down split)
) Histogram analysis (Top-down split)
) Clustering analysis (Top-down split or Bottom-up merge)

) Correlation analysis (Bottom-up merge)

Supervised discretization
) Entropy-based discretization (Top-down split)

T1: Chapter 2.3.6

50 / 113
UNSUPERVISED DISCRETIZATION

Class labels are ignored.


The best number of bins k is determined experimentally.
User specifies the number of intervals and/or how many data points to be included in
any given interval.
Use Binning methods.

INTRODUCTION TO DATA SCIENCE 51 / 113


DISCRETIZATION BY BINNING METHODS

1 Equal Width (distance) binning


) Each bin has equal width.

width = interval = (max − min) / #bins
) Highly sensitive to outliers.
) If outliers are present, the width of each bin is large, resulting in skewed data.
2 Equal Depth (frequency) binning
) Specify the number of values that have to be stored in each bin.
) Number of entries in each bin are equal.
) Some values can be stored in different bins.

T4: Chapter 3.4.6

52 / 113
BINNING EXAMPLE

Discretize the following data into 3 discrete categories using binning technique.
70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 80, 81, 53, 56, 57, 63, 66, 67, 67, 67, 68, 69, 70, 70.

INTRODUCTION TO DATA SCIENCE 53 / 113


BINNING EXAMPLE

Original Data: 53, 56, 57, 63, 66, 67, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 80, 81

Equal Width: width = (81 − 53)/3 = 28/3 = 9.33
  Bin 1 [53, 62): 53, 56, 57
  Bin 2 [62, 72): 63, 66, 67, 67, 67, 68, 69, 70, 70, 70, 70
  Bin 3 [72, 81]: 72, 73, 75, 75, 76, 76, 78, 79, 80, 81

Equal Depth: depth = 24/3 = 8
  Bin 1: 53, 56, 57, 63, 66, 67, 67, 67
  Bin 2: 68, 69, 70, 70, 70, 70, 72, 73
  Bin 3: 75, 75, 76, 76, 78, 79, 80, 81
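
The same two schemes can be reproduced with pandas (a sketch; the interval
edges differ slightly from the hand-rounded ones above):

import pandas as pd

x = pd.Series([70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 80, 81,
               53, 56, 57, 63, 66, 67, 67, 67, 68, 69, 70, 70])

equal_width = pd.cut(x, bins=3)  # 3 intervals of equal width
equal_depth = pd.qcut(x, q=3)    # 3 bins with (roughly) equal counts
print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())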

INTRODUCTION TO DATA SCIENCE 54 / 113


DEMO CODE

Binning.ipynb

INTRODUCTION TO DATA SCIENCE 55 / 113


DISCRETIZATION BY HISTOGRAM ANALYSIS

Histogram analysis is an unsupervised discretization technique because it does not


use class information.
Histograms use binning to approximate data distributions and are a popular form of
data reduction.
A histogram for an attribute, X, partitions the data distribution of X into disjoint
subsets, referred to as buckets or bins.
If each bucket represents only a single attribute–value/frequency pair, the buckets are
called singleton buckets.
Often, buckets represent continuous ranges for the given attribute.
The histogram analysis algorithm can be applied recursively to each partition in order
to automatically generate a multilevel concept hierarchy.
INTRODUCTION TO DATA SCIENCE 56 / 113
DISCRETIZATION BY HISTOGRAM ANALYSIS
1 Equal Width Histogram
) The values are partitioned into equal size partitions or ranges.
2 Equal Frequency Histogram
) The values are partitioned such that each partition contains the same number of data
objects.

INTRODUCTION TO DATA SCIENCE 57 / 113


VARIABLE TRANSFORMATION

Variable transformation involves changing the values of an attribute.


For each object (tuple), a transformation is applied to the value of the variable for that
object.
1 Simple functional transformations
2 Normalization

T1:Chapter 2.3.7
SIMPLE FUNCTIONAL TRANSFORMATION

INTRODUCTION TO DATA SCIENCE 59 / 113


SIMPLE FUNCTIONAL TRANSFORMATION

Variable transformations should be applied with caution since they change the nature
of the data.
For instance, the transformation 1/x reduces the magnitude of values that are 1 or
larger, but increases the magnitude of values between 0 and 1.
To understand the effect of a transformation, it is important to ask questions such as:
) Does the order need to be maintained?
) Does the transformation apply to all values, especially negative values and 0?
) What is the effect of the transformation on the values between 0 and 1?

INTRODUCTION TO DATA SCIENCE 60 / 113


NORMALIZATION

Normalizing the data attempts to give all attributes an equal weight.


The goal of standardization or normalization is to make an entire set of values have a
particular property.
Normalization is particularly useful for:
) classification algorithms involving neural networks.
• normalizing the input values for each attribute in the training tuples will help speed up
the learning phase.
) distance measurements such as nearest-neighbor classification and clustering.
• normalization helps prevent attributes with initially large ranges (e.g., income)
from outweighing attributes with initially smaller ranges (e.g., binary attributes).

INTRODUCTION TO DATA SCIENCE 61 / 113


WHY FEATURE SCALING?

Features with bigger magnitude dominate over the features with smaller magnitudes.
Good practice to have all variables within a similar scale.
Euclidean distances are sensitive to feature magnitude.
Gradient descent converges faster when all the variables are in the similar scale.
Feature scaling helps decrease the time of finding support vectors.

INTRODUCTION TO DATA SCIENCE 62 / 113


ALGORITHMS SENSITIVE TO FEATURE MAGNITUDE

Linear and Logistic Regression


Neural Networks
Support Vector Machines
KNN
K-Means Clustering
Linear Discriminant Analysis (LDA)
Principal Component Analysis (PCA)

INTRODUCTION TO DATA SCIENCE 64 / 113


NORMALIZATION

Scale the feature magnitude to a standard range like [0, 1] or [−1, +1] or any other.
Techniques
) Min-Max normalization
) z-score normalization
) Normalization by decimal scaling

Impact of outliers in the data ???

T4:Chapter 3.5.2

65 / 113
MIN-MAX SCALING
Min-max scaling squeezes (or stretches) all feature values to be within the range of
[0, 1] (or, more generally, a chosen range [new_min, new_max]):

    v' = ((v − min) / (max − min)) × (new_max − new_min) + new_min

Min-Max normalization preserves the relationships among the original data values.
It will encounter an "out-of-bounds" error if a future input case for normalization falls
outside of the original data range for X.

INTRODUCTION TO DATA SCIENCE


MIN-MAX NORMALIZATION

Suppose that the minimum and maximum values for the attribute income are $12,000 and
$98,000, respectively. The new range is [0.0, 1.0]. Apply min-max normalization to the value
$73,600: v' = (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0.0) + 0.0 ≈ 0.716

INTRODUCTION TO DATA SCIENCE


Z-SCORE NORMALIZATION

In z-score normalization (or zero-mean normalization), the values for an attribute, x,
are normalized based on the mean µ(x) and standard deviation σ(x) of x.
The resulting scaled feature has a mean of 0 and a variance of 1.
Most scaled values fall in the range [−3, +3].

    x̂ = (x − µ(x)) / σ(x)

z-score normalization is useful when the actual minimum and maximum of attribute X
are unknown, or when there are outliers that dominate the min-max normalization.

INTRODUCTION TO DATA SCIENCE 68 / 113


Z-SCORE NORMALIZATION

Suppose that the mean and standard deviation of the values for the attribute income are
$54,000 and $16,000, respectively. Apply z-score normalization to the value $73,600:
x̂ = (73,600 − 54,000) / 16,000 = 1.225
INTRODUCTION TO DATA SCIENCE 69 / 113


DECIMAL NORMALIZATION

Normalizes by moving the decimal point of values of attribute x:

    v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1

The number of decimal points moved depends on the maximum absolute value of x.
New range is [−1, +1].

INTRODUCTION TO DATA SCIENCE


DECIMAL NORMALIZATION

Example 1
CGPA Formula Normalized CGPA
2 2/10 0.2
3 3/10 0.3
Example 2
Bonus Formula Normalized Bonus
450 450/1000 0.45
310 310/1000 0.31
Example 3
Salary Formula Normalized Salary
48000 48000/100000 0.48
67000 67000/100000 0.67
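
A sketch of the three techniques side by side, assuming NumPy (toy income values):

import numpy as np

x = np.array([12000., 54000., 73600., 98000.])

min_max = (x - x.min()) / (x.max() - x.min())           # squeezed into [0, 1]
z_score = (x - x.mean()) / x.std()                      # mean 0, variance 1
decimal = x / 10 ** np.ceil(np.log10(np.abs(x).max()))  # moved into (-1, +1)
print(min_max, z_score, decimal, sep="\n")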

INTRODUCTION TO DATA SCIENCE 71 / 113


DEMO CODE

Normalization.ipynb

INTRODUCTION TO DATA SCIENCE 72 / 113


TABLE OF CONTENTS

1 DATA SIMILARITY & DISSIMILARITY MEASURE
2 VISUALIZATION TECHNIQUES FOR DATA EXPLORATORY ANALYSIS
3 HANDLING NUMERIC DATA
4 MANAGING CATEGORICAL ATTRIBUTES
5 DEALING WITH TEXTUAL DATA

INTRODUCTION TO DATA SCIENCE


CATEGORICAL ENCODING

We need to convert categorical columns to numerical columns so that a
machine learning algorithm understands them.

Categorical encoding is a process of converting categories to numbers.

• Binarization maps a continuous or categorical


attribute into one or more binary attributes.
• Must maintain ordinal relationship.
• Algorithms that find association patterns require that the
data be in the form of binary attributes.
• E.g., Apriori algorithm, Frequent Pattern (FP) Growth
algorithm

74 / 113
CATEGORICAL ENCODING TECHNIQUES

One-hot encoding
Label Encoding

INTRODUCTION TO DATA SCIENCE 75 / 113


ONE-HOT ENCODING
Encode each categorical variable with a set of Boolean variables which take values 0
or 1, indicating if a category is present for each observation.
One binary attribute for each categorical value.
Advantages
) Makes no assumption about the distribution or categories of the categorical variable.
) Keeps all the information of the categorical variable.
) Suitable for linear models.

Disadvantages
) Expands the feature space.
) Does not add extra information while encoding.
) Many dummy variables may be identical, introducing redundant information.
) Number of resulting attributes may become too large.

In multi-class classification, the class label is converted using one-hot encoding.


ONE-HOT ENCODING EXAMPLE

Assume an ordinal attribute for representing service of a restaurant:
(Awful < Poor < OK < Good < Great) requires 5 bits to maintain the ordinal relationship.
Service Quality X1 X2 X3 X4 X5
Awful 0 0 0 0 1
Poor 0 0 0 1 0
OK 0 0 1 0 0
Good 0 1 0 0 0
Great 1 0 0 0 0



ONE-HOT ENCODING EXAMPLE



LABEL ENCODING
Replace the categories by digits from 1 to n (or 0 to n − 1, depending on the implementation), where n is the number of distinct categories of the variable.
The categories are arranged in ascending order and the numbers are assigned.
Advantages
) Straightforward to implement.
) Does not expand the feature space.
) Works well enough with tree-based algorithms.

Disadvantages
) Does not add extra information while encoding.
) Not suitable for linear models.
) Does not handle new categories in test set automatically.

Used for features that have many distinct values in their domain, e.g., colour, protocol types.
LABEL ENCODING EXAMPLE

Assume an ordinal attribute for representing service of a restaurant: (Awful, Poor, OK,
Good, Great)

Service Quality Integer Value


Awful 0
Poor 1
OK 2
Good 3
Great 4



DEMO CODE

Encoding.ipynb
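A minimal sketch with pandas and scikit-learn (illustrative only; not necessarily the contents of Encoding.ipynb):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Service": ["Awful", "Poor", "OK", "Good", "Great"]})

# One-hot encoding: one binary column per category
print(pd.get_dummies(df, columns=["Service"]))

# Label encoding: note that LabelEncoder assigns codes in alphabetical
# order of the category names, not in the ordinal order of the slide,
# so an explicit mapping is needed to preserve Awful < Poor < OK < Good < Great
le = LabelEncoder()
print(le.fit_transform(df["Service"]))
print(le.classes_)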



TABLE OF CONTENTS

1 DATA SIMILARITY & DISSIMILARITY MEASURE
2 VISUALIZATION TECHNIQUES FOR DATA EXPLORATORY ANALYSIS
3 HANDLING NUMERIC DATA
4 MANAGING CATEGORICAL ATTRIBUTES
5 DEALING WITH TEXTUAL DATA



STEPS INVOLVED IN THE TEXTUAL DATA PROCESSING

Removing special characters, changing the case (up-casing and down-casing).


Tokenization – process of discretizing words within a document.
Creating Document Vector or Term Document Matrix.
Filtering Stop Words
Lexical Substitution
Stemming / Lemmatization



TOKENIZATION

Document – In the text mining context, each sentence is considered a distinct


document.
Token – Each word is called a token.
Tokenization – The process of discretizing words within a document is called
tokenization.



DOCUMENT VECTOR OR TERM DOCUMENT MATRIX

Create a matrix where each column consists of a token and the cells show the counts
of the number of times a token appears.
Each token is now an attribute in standard data science parlance and each document
is an example (record).
Unstructured raw data is now transformed into a format that is recognized by machine
learning algorithms for training.
The matrix / table is referred to as Document Vector or Term Document Matrix (TDM)
As more new statements are added that have little in common, we end up with a very
sparse matrix.
We could also choose to use the term frequencies (TF) for each token instead of
simply counting the number of occurrences.
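A minimal sketch of building a term document matrix with scikit-learn (the two sentences are made-up examples):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["This is a book on data science.",
        "Data science is fun."]

# Rows are documents, columns are tokens, cells hold raw counts
vectorizer = CountVectorizer()
tdm = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(tdm.toarray())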
TERM DOCUMENT MATRIX – EXAMPLE



STOP WORDS

There are common words such as ”a,” ”this,” ”and,” and other similar
terms. They do not really convey specific meaning.
Most parts of speech such as articles, conjunctions, prepositions, and pronouns need
to be filtered before additional analysis is performed.
Such terms are called stop words.
Stop word filtering is usually the second step that follows immediately after
tokenization.
The document vector gets reduced significantly after applying standard English stop
word filtering.



STOP WORDS

Domain specific terms might also need to be filtered out.


) For example, if we are analyzing text related to the automotive industry, we may want to
filter out terms common to this industry such as ”car,” ”automobile,” ”vehicle,” and so
on.
This is generally achieved by creating a separate dictionary where these context
specific terms can be defined and then term filtering can be applied to remove them
from the data.



LEXICAL SUBSTITUTION

Lexical substitution is the process of finding an alternative for a word in the context
of a clause.
It is used to align all the terms to the same term based on the field or subject which is
being analyzed.
This is especially important in areas with specific jargon, e.g., in clinical settings.
Example: common salt, NaCl, sodium chloride can be replaced by NaCl.
Lexical substitution is domain specific.

Paper - Lexical Substitution for the Medical Domain


URL - https://aclanthology.org/D14-1066.pdf

STEMMING

Stemming is usually the next process step following term filtering.


Words such as ”recognized,” ”recognizable,” or ”recognition” may be encountered
in different usages, but contextually they may all imply the same meaning.
The root of all these highlighted words is ”recognize.”
The conversion of unstructured text to structured data can be simplified by reducing
terms in a document to their basic stems, because only the occurrence of the root
terms has to be taken into account.
This process is called stemming.



PORTER STEMMING

The most common stemming technique for text mining in English is the Porter
Stemming method.
Porter stemming works on a set of rules where the basic idea is to remove and/or
replace the suffix of words.
) Replace all terms which end in ’ies’ by ’y,’ such as replacing the term ”anomalies” with
”anomaly.”
) Stem all terms ending in ”s” by removing the ”s,” as in ”algorithms” to ”algorithm.”
While the Porter stemmer is extremely efficient, it can make mistakes that could prove
costly.
) ”arms” and ”army” would both be stemmed to ”arm,” which would result in somewhat
different contextual meanings.



LEMMATIZATION

Lemmatization converts a word to its root form in a more grammatically sensitive way.
) While both stemming and lemmatization would reduce ”cars” to ”car,” lemmatization can
also bring back conjugated verbs to their unconjugated forms such as ”are” to ”be.”
Lemmatization uses POS Tagging (Part of Speech Tagging) heavily.
POS Tagging is the process of attributing a grammatical label to every part of a
sentence.
) Eg: ”Game of Thrones is a television series.”
) POS Tagging:
({”game”:”NN”},{”of”:”IN”},{”thrones”:”NNS”},{”is”:”VBZ”},{”a”:”DT”},
{”television”:”NN”},{”series”:”NN”})
where: NN = noun, IN = preposition, NNS = noun in its plural form, VBZ = third-person
singular verb, and DT = determiner.
DEMO CODE

NLP.ipynb
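A minimal NLTK sketch of the pipeline above (illustrative only; not necessarily the contents of NLP.ipynb; assumes the commented NLTK resources have been downloaded):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time resource downloads:
# nltk.download("punkt"); nltk.download("stopwords")
# nltk.download("wordnet"); nltk.download("averaged_perceptron_tagger")

text = "Game of Thrones is a television series."
tokens = nltk.word_tokenize(text.lower())                             # tokenization
tokens = [t for t in tokens if t.isalpha()]                           # drop punctuation
content = [t for t in tokens if t not in stopwords.words("english")]  # stop word filtering

print([PorterStemmer().stem(t) for t in content])           # stemming
print([WordNetLemmatizer().lemmatize(t) for t in content])  # lemmatization
print(nltk.pos_tag(tokens))                                 # POS tagging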



Introduction to Data Mining, by Tan, Steinbach and Vipin Kumar (T1)
Data Mining: Concepts and Techniques, Third Edition, by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, 2006 (T4)
Data Science – Concepts and Practice, by Vijay Kotu and Bala Deshpande (Ch. 9.1)
THANK YOU





EXERCISE-CANVAS DISCUSSION
Calculate the dissimilarity matrix and similarity matrix for the ordinal attribute test-2.
• There are three states for test-2: fair, good, and excellent, that is, n = 3.
• Step 1: replace each value for test-2 by its rank; the four objects are assigned the ranks 3, 1, 2, and 3, respectively.
• Step 2: normalize the ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0.
• Step 3: use the Euclidean distance to calculate the dissimilarity.

Dissimilarity Matrix
     1     2     3     4
1    0
2    1     0
3    0.5   0.5   0
4    0     1     0.5   0

Similarity Matrix, s = 1 − d
     1     2     3     4
1    1
2    0     1
3    0.5   0.5   1
4    1     0     0.5   1



Machine Learning

Classification & Prediction

BITS Pilani Anita Ramachandran


Pilani | Dubai | Goa | Hyderabad
2023

Table of Contents

• Classification and Prediction


• Classification - Decision tree algorithm

Learning Experience

• During classification or prediction, a model is generated


by observing or learning from historical data. This
learning can be of two types.
• Supervised (inductive) learning
• Given: training data, desired outputs (labels)
• Unsupervised learning
• Given: training data (without desired outputs)

Supervised Learning

• Desired output is already known.


• Given (x1, y1), (x2, y2), …, (xn, yn).
• A supervised learning algorithm analyzes the training
data and produces an inferred function, which can be
used for mapping new examples.
• Learn a function f (x) to predict y given x.
• Example: pattern association, classification, regression.

Supervised Learning

Supervised Learning -
Classification

Unsupervised Learning

Train / Validate / Test Sets

• Training set
• Approx 70 to 90% of the actual dataset is used for training the
algorithm.
• Used to learn the parameters of the model.
• Validation set
• Approx 10 to 20% of the training dataset is used for validating
the algorithm.
• Used to tune the parameters of the model.
• Testing set
• Approx 10 to 20% of the actual dataset is used for testing the
algorithm.
• Used to test against new data.

Classification

• Predict categorical (discrete, unordered) class labels.


• Learn a mapping function, y = f(x), that can predict the associated class label y of a given tuple X.
• In general, mapping is represented in the form of
classification rules, decision trees, or mathematical
formulae.

Classification Algorithms

• Decision Tree
• Random Forest [Discuss in ML course]
• Logistic Regression [Discuss in ML course]
• Naive Bayes Classifier [Discuss in ML course]
• Support Vector Machine [Discuss in ML course]
• Neural Network [Discuss in DL course]

Decision Tree Algorithm

Decision Trees

• Decision tree
• A flow-chart-like tree structure
• Internal node denotes a test on an attribute
• Branch represents an outcome of the test
• Leaf nodes represent class labels or class distribution
• Decision tree generation consists of two phases
• Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
• Tree pruning
• Identify and remove branches that reflect noise or outliers
• Use of decision tree: Classifying an unknown sample
• Test the attribute values of the sample against the decision tree

Decision Tree Construction:
Hunt’s Algorithm
• Construct a tree T from a training set D.
• If all the records in D belong to class C or if D is sufficiently pure,
then the node is a leaf node and assigned class label C.
• Purity of a node is defined as the probability of the corresponding class.
• If an attribute A does not partition D in a sufficiently pure manner,
then choose another attribute A’ and partition D according to A’
values.
• Recursively construct tree and sub-trees until
• All leaf nodes satisfy the minimum purity threshold.
• Tree cannot be further split.
• Maximum depth of tree is achieved.

Decision Tree Construction:
Example

Design Decisions

• How should the training records be split?

• How should the splitting procedure stop?

Design Decisions - Splitting
Methods
• Binary Attributes
• Nominal Attributes
• Ordinal Attributes
• Continuous Attributes

Design Decisions - Selecting
the best split
• p(i|t): fraction of records associated with node t belonging to class i
• Best split is selected based on the degree of impurity of the child
nodes
• Class distribution (0,1) has high purity
• Class distribution (0.5,0.5) has the smallest purity (highest impurity)

• Intuition: high purity ➔ small value of impurity measures ➔ better split
Entropy(t) = − Σ_{i=1}^{c} p(i|t) log₂ p(i|t)

Gini(t) = 1 − Σ_{i=1}^{c} p(i|t)²

• Most Informative Attribute

Design Decisions - Selecting
the best split
Entropy(t) = − Σ_{i=1}^{c} p(i|t) log₂ p(i|t)

Gini(t) = 1 − Σ_{i=1}^{c} p(i|t)²

Classification error(t) = 1 − max_i p(i|t)

Gini = ?
Entropy = ?
Error = ?

[Figure: comparison among the impurity measures for binary classification problems.]
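A small sketch (a hypothetical helper, not from the slides) that computes all three impurity measures for a node's class counts:

import numpy as np

def impurity(counts):
    # Entropy, Gini, and classification error for one node
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    nz = p[p > 0]                          # avoid log2(0)
    entropy = -(nz * np.log2(nz)).sum()
    gini = 1.0 - (p ** 2).sum()
    error = 1.0 - p.max()
    return entropy, gini, error

print(impurity([5, 5]))    # (1.0, 0.5, 0.5): class distribution (0.5, 0.5), highest impurity
print(impurity([10, 0]))   # (0.0, 0.0, 0.0): class distribution (0, 1), pure node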

Design Decisions -
Information Gain
• In general the different impurity measures are consistent
• Gain of a test condition: compare the impurity of the
parent node with the impurity of the child nodes
Δ = I(parent) − Σ_{j=1}^{k} [N(v_j) / N] · I(v_j)

• I(·) is the impurity measure of a given node, N is the total number of records at the parent node, k is the number of attribute values, and N(v_j) is the number of records associated with the child node v_j

• Maximizing the gain == minimizing the weighted


average impurity measure of children nodes
• If I() = Entropy(), then Δinfo is called information gain
Design Decisions -
Information Gain

Information gain measures the expected reduction in entropy


caused by partitioning the training set according to an attribute
Design Decisions - Example –
Splitting Binary Attributes
With A:
Gini for N1 = 1 – 16/49 – 9/49 = 0.489
Gini for N2 = 1 – 4/25 – 9/25 = 0.48
Weighted avg Gini = 0.489 x 7/12 + 0.48 x 5/12 = 0.485

With B:
Gini for N1 = 1 – 1/25 – 16/25 = 0.32
Gini for N2 = 1 – 25/49 – 4/49 = 0.408
Weighted avg Gini = 0.32 x 5/12 + 0.408 x 7/12 = 0.37
Attribute B is preferred for the split, since it yields the lower weighted average Gini (0.37 < 0.485).

Design Decisions - Example –
Splitting Nominal Attributes

Design Decisions - Example –
Splitting Continuous Attributes
• Brute force method – high complexity
• Sort training records:
• Based on their annual income - O(N log N) complexity
• Candidate split positions are identified by taking the midpoints
between two adjacent sorted values
• Measure Gini index for each split position, and choose the one that
gives the lowest value
• Further optimization: consider only candidate split positions located
between two adjacent records with different class labels
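A sketch of the sorted-midpoint scan with the optimization in the last bullet (the toy income/class data is assumed for illustration):

import numpy as np

def candidate_splits(values, labels):
    # Midpoints between adjacent sorted values whose class labels differ
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    y = np.asarray(labels)[order]
    return [(v[i] + v[i + 1]) / 2
            for i in range(len(v) - 1) if y[i] != y[i + 1]]

income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
label  = ["N", "N", "N", "Y", "Y", "Y", "N", "N", "N", "N"]
print(candidate_splits(income, label))   # [80.0, 97.5]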

Summary so far - Decision
Tree
• Build a model (based on past data) in the form of a tree structure that predicts the value of the output variable based on the input variables in the feature vector
• Each internal node (decision node) of a decision tree corresponds to a test on one feature
• Root node, Branch node, Leaf node
• Building a Decision Tree - Recursive partitioning
• Splits data into multiple subsets on the basis of feature values
• Root node – entire dataset
• First selects the feature which predicts the target class in the strongest
way
• Splits the dataset into multiple partitions
• Stopping criteria
• All or most of the examples at a particular node have the same class
• All features have been used up in the partitioning
• The tree has grown to a pre-defined threshold limit

Example

• Consider the data set available for a company’s hiring


cycles. A student wants to find out if he may be offered a
job in the company. His parameters are as follows:
• CGPA: High, Communication: Bad, Aptitude: High,
Programming skills: Bad

Example
• Outcome = false for
all the cases where
Aptitude = Low,
irrespective of other
conditions

• So feature Aptitude
can be taken up as
the first node of the
decision tree

Example
• For Aptitude = HIGH, job offer condition is TRUE for all the cases
where Communication = Good.
• For cases where Communication = Bad, job offer condition is TRUE
for all the cases where CGPA = HIGH
• Use the below decision tree to predict outcome for (Aptitude = high,
Communication = Bad and CGPA = High)

Example: Entropy & Information
Gain Calculation - Level 1

Entropy & Information Gain
Calculation - Level 1

Entropy & Information Gain
Calculation - Level 1

• Information gain from CGPA = 0.99 − 0.69 = 0.30
• Information gain from Communication = 0.99 − 0.63 = 0.36
• Information gain from Programming skills = 0.04
• Information gain from Aptitude = 0.47 → Aptitude will be the first node of the decision tree
Entropy & Information Gain
Calculation - Level 1
• For Aptitude = Low, entropy is 0
• The result will be the same always regardless of the values of
the other features
• The branch for Aptitude = Low will not continue any further

Entropy & Information Gain
Calculation - Level 2
• We will have only one branch to navigate: Aptitude =
High

Entropy & Information Gain
Calculation - Level 2

Entropy & Information Gain
Calculation - Level 2

Entropy & Information Gain
Calculation - Level 2
Entropy values at the end of Level 2:
• 0.85 before the split
• 0.33 when CGPA is used for split
• 0.30 when Communication is used for split
• 0.80 when Programming skill is used for split

Information Gains
• After split with CGPA = 0.52
• After split with Communication = 0.55
• After split with Programming skill = 0.05

• Highest information gain is from Communication, hence it should be used for the next-level split
• Entropy = 0 for Communication = Good, so that branch will not continue further

Entropy & Information Gain
Calculation - Level 3
Entropy values at the end of Level 3:
• 0.81 before the split
• 0 when CGPA is used for split
• 0.50 when Programming skill is used for split

Entropy & Information Gain
Calculation - Level 2

Problems with information gain

• Natural bias of information gain: it favors attributes with


many possible values.
• Consider the attribute Date in the PlayTennis example.
– Date would have the highest information gain since it
perfectly separates the training data.
– It would be selected at the root resulting in a very broad
tree
– Though very good on the training data, this tree would perform poorly in predicting unknown instances: overfitting.
• The problem is that the partition is too specific, too many
small classes are generated.
• We need to look at alternative measures …



Characteristics of Decision
Tree Induction
• Non-parametric approach
• Computationally inexpensive, even with large training set
• Easy to interpret
• Accuracy is comparable to other classifiers
• Robust to noise, with methods to prevent overfitting
• Immune to presence of redundant or irrelevant attributes
• Splits using single attribute at a time -> rectilinear decision
boundaries
• Limits decision tree representation for modeling complex relationships
among continuous attributes
• Tree pruning strategies have more effect on the performance of decision trees than the choice of impurity measure

Decision Tree Algorithms
• Iterative Dichotomiser 3 (ID 3)
– Entropy based criteria
– Gives an exhaustive decision tree.
– Categorical inputs are handled.
• C 4.5
– Entropy based criteria
– Handle missing data.
– Categorical and continuous inputs are handled.
– Uses tree pruning to address the over-fitting problem of ID3.
• CART (Classification and Regression Tree)
– Gini Index is used.
– Categorical and continuous inputs are handled.



More…
• Advantages
– Decision trees can easily be converted to classification rules.
– Does not require any domain knowledge or parameter setting.
– Decision trees can handle multidimensional data.
– Simple, fast, good accuracy
• Applications
– Medicine
– Manufacturing and production
– Financial analysis
– Astronomy
– Molecular biology

Backup

Issues in Decision Tree
Learning
• Handling training examples with missing data, attributes
with differing costs
• Model overfitting
• Causes of model overfitting
• Estimating generalization error
• Handling overfitting
• Evaluating classifier performance

Alternative measures for selecting
attributes – E.g.: Gain Ratio
• Impurity measures such as entropy and Gini index tend to favor
attributes that have a large number of distinct values
• Even in a less extreme situation, a test condition that results in a large
number of outcomes may not be desirable because the number of
records associated with each partition is too small to enable us to
make any reliable predictions
• Solution 1: restrict the test conditions to binary splits only (CART)
• Solution 2: modify the splitting criterion to take into account the
number of outcomes produced by the attribute test condition
• In the C4.5 decision tree algorithm, a splitting criterion known as gain
ratio is used to determine the goodness of a split

Introduction to Data Science

Evaluation metrics, FE for text data

BITS Pilani Anita Ramachandran


Pilani | Dubai | Goa | Hyderabad
2023

Evaluation Metrics



Evaluation metric for classification problems

                  PREDICTED CLASS
                  Class=Yes   Class=No
ACTUAL  Class=Yes   a (TP)      b (FN)
CLASS   Class=No    c (FP)      d (TN)

• Accuracy is the fraction of predictions our model got right. Formally, accuracy has the following definition:

  Accuracy = Number of correct predictions / Total number of predictions

• For binary classification, accuracy can also be calculated in terms of positives and negatives as follows:

  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)



Problem with Accuracy

• Consider a 2-class problem


– Number of Class NO examples = 990
– Number of Class YES examples = 10
• If a model predicts everything to be class NO, accuracy is 990/1000 = 99 %
– This is misleading because this trivial model does not detect any class YES example
– Detecting the rare class is usually more interesting (e.g., frauds, intrusions, defects, etc)

                  PREDICTED CLASS
                  Class=Yes   Class=No
ACTUAL  Class=Yes     0          10
CLASS   Class=No      0          990



Which model is better?

A:                PREDICTED
                  Class=Yes   Class=No
ACTUAL  Class=Yes     0          10       Accuracy: 99%
        Class=No      0          990

B:                PREDICTED
                  Class=Yes   Class=No
ACTUAL  Class=Yes     10         0        Accuracy: 50%
        Class=No      500        490



Alternative Measures

                  PREDICTED CLASS
                  Class=Yes   Class=No
ACTUAL  Class=Yes     a           b
CLASS   Class=No      c           d

Precision (p) = a / (a + c) = TP / (TP + FP)
  What proportion of positive identifications was actually correct?
  A model that produces no false positives has a precision of 1.0.

Recall (r) = a / (a + b) = TP / (TP + FN)
  What proportion of actual positives was identified correctly?
  A model that produces no false negatives has a recall of 1.0.

F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c) = 2TP / (2TP + FP + FN)
  The harmonic mean of precision and recall; an F-score of 1.0 indicates perfect precision and recall.
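A quick scikit-learn sketch of these metrics (the labels are made up):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 1 = Class=Yes, 0 = Class=No
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))    # (TP + TN) / total
print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F-measure:", f1_score(y_true, y_pred))          # 2rp / (r + p)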



Measures of Classification Performance
(Summary)

            PREDICTED CLASS
             Yes   No
ACTUAL  Yes   TP   FN
CLASS   No    FP   TN

α is the probability that we reject the null hypothesis when it is true. This is a Type I error or a false positive (FP).

β is the probability that we accept the null hypothesis when it is false. This is a Type II error or a false negative (FN).



Which Classifier is better? No skew

T1:  TP = 50, FN = 50, FP = 1, TN = 99
     Precision (p) = 0.98, TPR = Recall (r) = 0.5, FPR = 0.01, TPR/FPR = 50, F-measure = 0.66

T2:  TP = 99, FN = 1, FP = 10, TN = 90
     Precision (p) = 0.9, TPR = Recall (r) = 0.99, FPR = 0.1, TPR/FPR = 9.9, F-measure = 0.94

T3:  TP = 99, FN = 1, FP = 1, TN = 99
     Precision (p) = 0.99, TPR = Recall (r) = 0.99, FPR = 0.01, TPR/FPR = 99, F-measure = 0.99



Which Classifier is better? Medium Skew

T1:  TP = 50, FN = 50, FP = 10, TN = 990
     Precision (p) = 0.83, TPR = Recall (r) = 0.5, FPR = 0.01, TPR/FPR = 50, F-measure = 0.62

T2:  TP = 99, FN = 1, FP = 100, TN = 900
     Precision (p) = 0.5, TPR = Recall (r) = 0.99, FPR = 0.1, TPR/FPR = 9.9, F-measure = 0.66

T3:  TP = 99, FN = 1, FP = 10, TN = 990
     Precision (p) = 0.9, TPR = Recall (r) = 0.99, FPR = 0.01, TPR/FPR = 99, F-measure = 0.94



Which Classifier is better? High Skew

T1:  TP = 50, FN = 50, FP = 100, TN = 9900
     Precision (p) = 0.3, TPR = Recall (r) = 0.5, FPR = 0.01, TPR/FPR = 50, F-measure = 0.375

T2:  TP = 99, FN = 1, FP = 1000, TN = 9000
     Precision (p) = 0.09, TPR = Recall (r) = 0.99, FPR = 0.1, TPR/FPR = 9.9, F-measure = 0.165

T3:  TP = 99, FN = 1, FP = 100, TN = 9900
     Precision (p) = 0.5, TPR = Recall (r) = 0.99, FPR = 0.01, TPR/FPR = 99, F-measure = 0.66



Classifications and Probability Estimates

• Logistic regression produces a score between 0 and 1 (a probability estimate).
• Use a threshold on the score to produce a classification.
[Figure: logistic (sigmoid) curve mapping inputs from −6 to 6 to probabilities between 0% and 100%.]

• What happens if you vary the threshold?

• ROC Curve



ROC Curve (Receiver Operating Characteristic)

• ROC curve plots TPR against FPR


– Performance of a model represented as a point in an
ROC curve
(TPR,FPR):
• (0,0): declare everything to be negative class
• (1,1): declare everything to be positive class
• (1,0): ideal
• Diagonal line: random guessing
• Below diagonal line: prediction is opposite of the true class



ROC Curve

• ROC curve can be used to select a threshold for a classifier, which maximizes the
true positives and in turn minimizes the false positives.
• ROC Curves help determine the exact trade-off between the true positive rate
and false-positive rate for a model using different measures of probability
thresholds.
• ROC curves are more appropriate to be used when the observations present are
balanced between each class.



Example

[Figure: two overlapping test-result distributions, one for people without the disease and one for people with the disease; a threshold on the test result labels patients to its left "negative" and to its right "positive".]


Example

[Figure: with the threshold applied, test results from patients with the disease that fall to its right are True Positives, while results from patients without the disease that fall to its right are False Positives.]


Example

[Figure: test results from patients with the disease that fall to the left of the threshold are False Negatives, while results from patients without the disease that fall to the left are True Negatives.]


Moving the Threshold: left

[Figure: the two distributions with the threshold moved to the left; everything left of the line is called '−', everything right of it '+'.]
Which line has the higher recall of −?
Which line has the higher precision of −?


ROC curve

[Figure: ROC curve plotting True Positive Rate (Recall), 0–100%, against False Positive Rate (1 − specificity), 0–100%.]


More -

• Evaluation metrics for regression


– MAE, MSE, RMSE, R2 value, Adjusted R2 value
• Cluster analysis
– How to evaluate the goodness of the clusters
– Cohesion (intra-cluster distance), separation (inter-cluster distance),
silhouette coefficient

Feature Engineering for Text Data

N-Grams

• There are families of words in the spoken and written language


that typically go together. Grouping such terms, called n-grams,
and analyzing them statistically can present new insights.
• The final pre-processing step typically involves forming these n-grams and storing them in the document vector.
• Generating n-grams is computationally expensive and the results can become huge, so in practice the choice of n varies based on the size of the documents and the corpus.
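A minimal sketch of extracting n-grams with scikit-learn (made-up sentences):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the quick brown fox", "the lazy brown dog"]

# ngram_range=(1, 2) keeps unigrams and bigrams together
vec = CountVectorizer(ngram_range=(1, 2))
vec.fit(docs)
print(vec.get_feature_names_out())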

Example

Term Frequency-Inverse Document
Frequency

• Consider a web search problem where the user types in some keywords
and the search engine extracts all the documents (essentially, web pages)
that contain these keywords.
• How does the search engine know which web pages to serve up?
• In addition to using network rank or page rank, the search engine also runs
some form of text mining to identify the most relevant web pages.
– Example, the user types in the following keywords: ”RapidMiner books
that describe text mining.”
• In this case, the search engines run on the following basic logic:
– Give a high weightage to those keywords that are relatively rare.
– Give a high weightage to those web pages that contain a large number of instances of the rare keywords.
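One common weighting is tf × log(N/df): a term's frequency in a document, discounted by the number of documents that contain it. A minimal sketch with scikit-learn, which implements a smoothed variant of this formula (the documents are made up):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["RapidMiner books that describe text mining",
        "text mining with RapidMiner",
        "cooking books"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)      # rows: documents, columns: terms
print(vec.get_feature_names_out())
print(X.toarray().round(2))      # relatively rare terms receive higher weights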

Term Frequency-Inverse Document
Frequency

Example

Backup

How to Construct an ROC curve

• Use a classifier that produces a continuous-valued score for each instance; the more likely it is for the instance to be in the + class, the higher the score.
• Sort the instances in decreasing order according to the score.
• Apply a threshold at each unique value of the score.
• Count the number of TP, FP, TN, FN at each threshold:
  – TPR = TP/(TP+FN)
  – FPR = FP/(FP+TN)

Instance   Score   True Class
1          0.95    +
2          0.93    +
3          0.87    -
4          0.85    -
5          0.85    -
6          0.85    +
7          0.76    -
8          0.53    +
9          0.43    -
10         0.25    +



How to construct an ROC curve

Class          +     -     +     -     -     -     +     -     +     +
Threshold >=   0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP             5     4     4     3     3     3     3     2     2     1     0
FP             5     5     4     4     3     2     1     1     0     0     0
TN             0     0     1     1     2     3     4     4     5     5     5
FN             0     1     1     2     2     2     2     3     3     4     5
TPR            1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
FPR            1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

[Figure: the ROC curve traced by the (FPR, TPR) pairs above.]
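The same threshold sweep can be reproduced with scikit-learn (scores and labels taken from the table above):

from sklearn.metrics import roc_curve, roc_auc_score

scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = [1, 1, 0, 0, 0, 1, 0, 1, 0, 1]   # 1 = '+', 0 = '-'

fpr, tpr, thresholds = roc_curve(labels, scores)
for t, f, r in zip(thresholds, fpr, tpr):
    print(t, f, r)
print("AUC =", roc_auc_score(labels, scores))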



Using ROC for Model Comparison

• No model consistently outperforms


the other
– M1 is better for small FPR
– M2 is better for large FPR

• Area Under the ROC curve (AUC)


– Ideal:
• Area = 1
– Random guess:
• Area = 0.5



AUC-ROC

[Figure: four ROC curves, each plotting True Positive Rate against False Positive Rate (both 0–100%), illustrating AUC values of 100%, 50%, 90%, and 65%.]


INTRODUCTION TO DATA SCIENCE
MODULE # 5 : FEATURE ENGINEERING
IDS Course Team
BITS Pilani
The instructor is gratefully acknowledging
the authors who made their course
materials freely available online.

TABLE OF CONTENTS

1 FEATURE ENGINEERING
2 FEATURE SELECTION
3 FILTER METHODS
   Pearson's Correlation Coefficient
   Chi-Squared Statistic
   Information Theory Metrics
   Gini Index
4 WRAPPER METHODS
5 EVALUATION OF FEATURE SELECTION
6 FEATURE ENGINEERING FOR TEXT

FEATURES

A feature is a property of an object under study.
Features are the basic building blocks of datasets.

Building Area   Common Area   Type of Flooring       Distance From Bus Depot   Sale Price per square feet
11345           350           Marble                 16503.22                  6,715
2000            1334          Vitrified Tiles        16321.19                  3,230
2544            924           Wood Vitrified Tiles   15619.92                  6,588

FEATURE ENGINEERING

Feature Engineering is the process of selecting and extracting useful, predictive


features from data.
The goal is to create a set of features that best represent the information contained in
the data, producing a simpler model that generalizes well to future observations.

MOTIVATION FOR FEATURE ENGINEERING

HUGHES PHENOMENON
Given a fixed number of data points, the performance of a regressor or classifier first increases but later decreases as the number of dimensions of the data increases.

Reasons for this phenomenon


Redundant Features
Correlation between features
Irrelevant Features

FEATURE CREATION

Create new attributes that can capture important information in a dataset much more
efficiently than the original attributes.
Two general methodologies:
) Feature Extraction
) Feature Construction
   - Create dummy features
   - Create derived features

FEATURE EXTRACTION

Machine learning algorithms operate on a numeric feature space, expecting input as


a two-dimensional array where rows are instances and columns are features.
To perform machine learning on unstructured data like images and text, we need to
transform the data into vector representations such that we can apply numeric
machine learning.
This process is called feature extraction or vectorization.
Mostly rely on domain knowledge
) Fourier Transform
) Wavelet Transform
) Scale-Invariant Feature Transform (SIFT) for images
) Vector space transformation for text (TF-IDF)

FEATURE EXTRACTION

Bag of Words

FEATURE EXTRACTION

Image Features

FEATURE EXTRACTION
Facial Landmarks

FEATURE EXTRACTION

Human Pose Estimation

FEATURE CONSTRUCTION

Create dummy features


) Often used to convert categorical variable to into numerical variables.
) Use one-hot encoding or label encoding.

State (nominal scale)   State (label encoding)
Maharashtra             3
Tamil Nadu              4
Delhi                   0
Karnataka               2
Gujarat                 1
Uttar Pradesh           5

FEATURE CONSTRUCTION
Customer ID Gender Payment Method
C001 Female Online banking
C002 Male Online banking
C003 Female Credit card
C004 Male Debit Card

Customer ID Gender Online banking Credit card Debit Card


C001 Female 1 0 0
C002 Male 1 0 0
C003 Female 0 1 0
C004 Male 0 0 1
FEATURE CONSTRUCTION
Create derived features
Involves creating a new feature using data from existing features
Mostly rely on domain knowledge
Eg: Calculating price per sqft

Area Price (Rs) Price/Sft (Rs)


1800 81,00,000 4500
2000 78,00,000 3900
1550 65,10,000 4200
2400 1,15,20,000 4800
3500 1,22,50,000 3500
2800 1,45,60,000 5200

FEATURE CONSTRUCTION

Customer ID Gender Session Begin Session End


C001 Female 15-06-2019 10:30 15-06-2019 11:15
C002 Male 13-06-2019 08:00 13-06-2019 08:03
C003 Female 02-06-2019 16:25 02-06-2019 18:35
C004 Male 01-06-2019 11:20 01-06-2019 13:00

Customer ID Gender Session Duration


C001 Female 45
C002 Male 3
C003 Female 125
C004 Male 100
TABLE OF CONTENTS

1 FEATURE ENGINEERING
2 FEATURE SELECTION
3 FILTER METHODS
   Pearson's Correlation Coefficient
   Chi-Squared Statistic
   Information Theory Metrics
   Gini Index
4 WRAPPER METHODS
5 EVALUATION OF FEATURE SELECTION
6 FEATURE ENGINEERING FOR TEXT

FEATURE SELECTION

Feature selection is the process of identifying relevant and important features from
irrelevant or redundant features.
It intends to select a subset of attributes or features that makes the most meaningful
contribution to a machine learning activity.

FACTORS AFFECTING FEATURE SELECTION
Feature Relevance
) In supervised algorithms, it is important for each feature to contribute towards the class
label, otherwise it is irrelevant.
) Need to determine : Strongly relevant, Moderately relevant and Weakly relevant features.
) In case of unsupervised algorithms, there is no labelled data. During the grouping
process, the algorithm identifies the irrelevant features.
Feature Redundancy
) A feature may contribute to information that is similar to the information contributed by
one or more features.
) All features having potential redundancy are candidates for rejection in the final feature
subset.
) If two features X1, X2 are highly correlated, then the two features become redundant
features since they have same information in terms of correlation measure.
FEATURE SUBSET SELECTION

Given: an initial set of D features F = {f1, f2, f3, ..., fD} and target class label T.
Find: the minimum subset F′ = {f′1, f′2, f′3, ..., f′M} that achieves maximum classification performance, where F′ ⊆ F.
There are 2^D possible subsets.
Need a criterion to decide which subset is the best:
) the classifier based on these M features has the lowest probability of error of all such classifiers.
Evaluating 2^D possible subsets is time consuming and expensive.
Use heuristics to reduce the search space.

STEPS IN FEATURE SELECTION

Feature selection is an optimization problem


having the following steps:
Step1: Search the space of all possible
features.
Step2: Pick the optimal subset using
an objective function.

FEATURE SELECTION APPROACHES
• Filter approaches: Features are selected before the data mining algorithm is run,
using some approach that is independent of the data mining task. For example, we
might select sets of attributes whose pairwise correlation is as low as possible.

• Wrapper approaches: These methods use the target data mining algorithm as a
black box to find the best subset of attributes, in a way similar to that of the ideal
algorithm described above, but typically without enumerating all possible subsets.

• Embedded approaches: Feature selection occurs naturally as part of the data


mining algorithm. Specifically, during the operation of the data mining algorithm, the
algorithm itself decides which attributes to use and which to ignore. Eg: Algorithms
for building decision tree classifiers

TABLE OF CONTENTS

1 FEATURE ENGINEERING
2 FEATURE SELECTION
3 FILTER METHODS
   Pearson's Correlation Coefficient
   Chi-Squared Statistic
   Information Theory Metrics
   Gini Index
4 WRAPPER METHODS
5 EVALUATION OF FEATURE SELECTION
6 FEATURE ENGINEERING FOR TEXT

FILTER METHODS
The predictive power of each individual feature is evaluated.
Rank each feature according to some uni-variate metric and select the highest
ranking features.
Compute a score for each feature.
The score should reflect the discriminative power of each feature.
Advantages
) Fast
) Provides generically useful feature set.
Disadvantages
) Cause higher error than wrapper methods.
) A feature that is not useful by itself can be very useful when combined with others. Filter
methods can miss it.
FILTER METHODS

Algorithm
Given input: large feature set F.
1 Identify candidate subset S ⊆ F.
2 While the stopping criterion is not met:
   1 Evaluate utility function J using S.
   2 Adapt S.
3 Return S.

TYPES OF FILTERS

Correlation-based
) Pearson correlation
) Spearman rank correlation
) Kendall concordance

Statistical/probabilistic independence metrics
) Chi-square statistic
) F-statistic
) Welch's statistic

Information-theoretic metrics
) Mutual Information (Information Gain)
) Gain Ratio

Others
) Gini index
) Fisher score
) Cramer's V

WHICH FILTER?

PEARSON'S CORRELATION COEFFICIENT
Used to measure the strength of association between two continuous features.
Both positive and negative correlation are useful.
We use Pearson Correlation to compute the correlation matrix or heat map.
Steps
1 Compute the Pearson's Correlation Coefficient for each feature.
2 Sort according to the score.
3 Retain the highest ranked features, discard the lowest ranked.

Limitation
Pearson assumes all features are independent.
Pearson identifies only linear correlations
) Positive linear relationship – In children, as the height increases, weight also increases.
) Negative linear relationship – If the vehicle increases its speed, the time taken to travel
decreases.
PEARSON'S CORRELATION COEFFICIENT

INTERPRETATION OF THE PEARSON CORRELATION
−1 ≤ rA,B ≤ +1
If rA,B > 0
) A and B are positively correlated.
) The values of A increase as the values of B increase.
) The higher the value, the stronger the correlation (i.e., the more each attribute implies
the other).
If rA,B < 0
) A and B are negatively correlated.
) The values of one attribute increase as the values of the other attribute decrease.
If rA,B = 0
) A and B are independent and there is no correlation between them.
If rA,B = −1 or + 1
) linear fit is perfect: all data points lie on one line.
Use scatter plot for visualizing.
PEARSON'S CORRELATION EXAMPLE

Check whether sale of ice creams and sun glasses are related?

Ice cream sale Sun glasses sale


A B
20 30
10 5
23 29
5 10

DEMO CODE

PearsonExample.py
CorrelationCoeffecient.ipynb
PearsonCorrelation Covid Data.ipynb

χ² STATISTIC

The chi-square test of independence allows us to see whether two categorical variables are related.
[Figure: probability density function of the χ² distribution with r degrees of freedom (df).]

χ² STATISTIC EXAMPLE

Let’s say you want to know if gender has anything to do with political party preference. You
poll 440 voters in a simple random sample to find out which political party they prefer. The
results of the survey are shown in the table below:

Gender Republican Democrat Independent Total


Male 100 70 30 200
Female 140 60 20 220
Total 240 130 50 440

χ² STATISTIC EXAMPLE

To see if gender is linked to political party preference, perform a Chi-Square test of


independence using the steps below.
Step 1: Define the Hypothesis
) H0: There is no link between gender and political party preference. [Null Hypothesis]
) HA: There is a link between gender and political party preference. [Alternate Hypothesis]
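The detailed computation appears on the slide images; reconstructing it from the table above (expected count = row total × column total / grand total):
E(Male, Rep) = 200 × 240 / 440 ≈ 109.1, E(Male, Dem) ≈ 59.1, E(Male, Ind) ≈ 22.7,
E(Female, Rep) = 120, E(Female, Dem) = 65, E(Female, Ind) = 25.
χ² = Σ (O − E)² / E ≈ 0.76 + 2.01 + 2.33 + 3.33 + 0.38 + 1.00 ≈ 9.8, with df = (2 − 1)(3 − 1) = 2.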

χ² STATISTIC EXAMPLE

Step 5: From the table, critical value = 5.991 (df = 2, alpha = 0.05).
Since the calculated value of χ² > the critical value, H0 is rejected; HA is accepted.
Interpretation: there is sufficient evidence to say that there is a link between gender and political party preference.

χ² STATISTIC EXAMPLE
A group of customers were classified in terms of personality (introvert, extrovert or normal)
and in terms of color preference (red, yellow or green) with the purpose of seeing whether
there is an association (relationship) between personality and color preference.
Data was collected from 400 customers and presented in the 3(rows)×3(cols) contingency
table below.

Observed Counts Colors


Personality Red Yellow Green Total
Introvert 11 5 1 17
Extrovert 8 6 8 22
Normal 3 10 12 25
Total 22 21 21 64

χ² STATISTIC EXAMPLE

Step 1:
Set up hypotheses and determine level of significance.
Null hypothesis(H0): Color preference is independent of personality.
Alternative hypothesis(HA): Color preference is dependent on personality .
Level of significance: specifies the probability of error; generally it is set at 5%.

α = 0.05

Assume that H0 is true unless the evidence indicates otherwise, in which case we reject H0 and accept HA.

χ² STATISTIC EXAMPLE
Step 2:
Compute the expected count.

E = (Row total × Column total) / Grand total

Expected Counts Colors


Personality Red Yellow Green Total
Introvert 5.8 5.6 5.6 17
Extrovert 7.6 7.2 7.2 22
Normal 8.6 8.2 8.2 25
Total 22 21 21 64

χ² STATISTIC EXAMPLE

Step 3:
Compute the Chi-Squared Statistic.
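The computation itself is shown as an image on the slide; reconstructing it from the observed and expected tables:
χ² = Σ (O − E)² / E ≈ 4.66 + 0.06 + 3.78 + 0.02 + 0.20 + 0.09 + 3.65 + 0.40 + 1.76 ≈ 14.6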

χ² STATISTIC EXAMPLE
Step 4
Compute degrees of freedom.

df = (r − 1)(c − 1)

r is the number of categories in one variable and c is the number of categories in the
other.
df = (3 − 1) × (3 − 1) = 4 (contingency table)
Step 5
From the table, critical value = 9.488 (df = 4, alpha = 0.05).
Since the calculated value of χ² > the critical value of χ², H0 is rejected; HA is accepted.
Interpretation: there is sufficient evidence to say that color preference depends on personality.
DEMO CODE

ChiSquareGeneral.ipynb
ChiSquareCovidExample.ipynb
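A minimal SciPy sketch for the gender / party-preference example (illustrative only; not necessarily the contents of the listed notebooks):

from scipy.stats import chi2_contingency

observed = [[100, 70, 30],    # Male
            [140, 60, 20]]    # Female

chi2, p_value, df, expected = chi2_contingency(observed)
print(chi2, df, p_value)   # chi2 ≈ 9.8 with df = 2; p < 0.05, so reject H0
print(expected)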

INFORMATION THEORY METRICS

Information-theoretic concepts can only be applied to discrete variables.


For continuous feature values, some data discretization techniques are required
beforehand.

INFORMATION GAIN
Compute the Information Gain for the attribute Travel Cost wrt Transport Mode.
Gender Car Ownership Travel Cost Income Level Transport Mode
Male 0 Cheap Low Bus
Male 1 Cheap Medium Bus
Female 0 Cheap Low Bus
Male 1 Cheap Medium Bus
Female 1 Expensive High Car
Male 2 Expensive Medium Car
Female 2 Expensive High Car
Female 1 Cheap Medium Train
Male 0 Standard Medium Train
Female 1 Standard Medium Train
INFORMATION GAIN

Step 1: Compute the Entropy of Transport Mode.

Transport Mode
Bus Car Train
4 3 3
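Working it out (the slide shows the computation as an image):
H(Transport) = −(0.4 log₂ 0.4 + 0.3 log₂ 0.3 + 0.3 log₂ 0.3) ≈ 1.571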

INFORMATION GAIN
Step 2: Compute the Entropy of target given one feature.
Feature Transport Mode
Bus Train Car
Cheap 4 1 0
Expensive 0 0 3
Standard 0 2 0
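Working it out (the slide shows the computation as an image):
H(Transport | Cost) = (5/10) × H(4/5, 1/5) + (3/10) × 0 + (2/10) × 0 = 0.5 × 0.722 ≈ 0.361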

INFORMATION GAIN

Step 3: Compute the information gain.

IG(Transport | Cost) = H(Transport) − H(Transport | Cost)
                     = 1.571 − 0.361
                     ≈ 1.21

DEMO CODE

InformationGainCovidData.ipynb
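A small pandas sketch of the same computation (illustrative only; not the contents of InformationGainCovidData.ipynb):

import numpy as np
import pandas as pd

def entropy(series):
    p = series.value_counts(normalize=True)
    return -(p * np.log2(p)).sum()

def information_gain(df, feature, target):
    h = entropy(df[target])
    h_cond = sum(len(g) / len(df) * entropy(g[target])
                 for _, g in df.groupby(feature))
    return h - h_cond

df = pd.DataFrame({
    "Travel Cost": ["Cheap", "Cheap", "Cheap", "Cheap", "Expensive",
                    "Expensive", "Expensive", "Cheap", "Standard", "Standard"],
    "Transport Mode": ["Bus", "Bus", "Bus", "Bus", "Car",
                       "Car", "Car", "Train", "Train", "Train"],
})
print(information_gain(df, "Travel Cost", "Transport Mode"))   # about 1.21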

GINI INDEX

Gini index minimizes the probability of misclassification.
Used in CART (Classification and Regression Tree) algorithms.

Gini = 1 − Σ_k p_k²

where p_k denotes the proportion of instances belonging to class k.
Higher Gini index: better prediction of Y given X.

GINI INDEX
Compute the Gini Index for the feature Travel Cost wrt Transport Mode.
Gender Car Ownership Travel Cost Income Level Transport Mode
Male 0 Cheap Low Bus
Male 1 Cheap Medium Bus
Female 0 Cheap Low Bus
Male 1 Cheap Medium Bus
Female 1 Expensive High Car
Male 2 Expensive Medium Car
Female 2 Expensive High Car
Female 1 Cheap Medium Train
Male 0 Standard Medium Train
Female 1 Standard Medium Train
GINI INDEX

GINI INDEX INTERPRETATION

Gini index < 0.2 represents perfect equality.
Gini index between 0.2 and 0.3 represents relative equality.
Gini index between 0.3 and 0.4 represents adequate equality.
Gini index between 0.4 and 0.5 represents a big gap.
Gini index > 0.5 represents a severe gap.
0 represents perfect equality; 1 represents perfect inequality.

TABLE OF CONTENTS

1 FEATURE ENGINEERING
2 FEATURE SELECTION
3 FILTER METHODS
   Pearson's Correlation Coefficient
   Chi-Squared Statistic
   Information Theory Metrics
   Gini Index
4 WRAPPER METHODS
5 EVALUATION OF FEATURE SELECTION
6 FEATURE ENGINEERING FOR TEXT
WRAPPER METHODS

Wrappers require some method to search the space of all possible subsets of
features, assessing their quality by learning and evaluating a classifier with that
feature subset.
The feature selection process is based on a specific machine learning algorithm that
we are trying to fit on a given dataset.
It follows a greedy search approach by evaluating all the possible combinations of
features against the evaluation criterion.
The wrapper methods usually result in better predictive accuracy than filter methods.

WRAPPER METHODS
Greedy Based algorithms.
Performance of the method depends on the machine learning models chosen.
Sequential feature selection algorithm add or remove one feature at a time based on
the classifier performance until a desired criterion is met.
Two methods
) Sequential Forward Selection(SFS)
) Sequential Backward Selection(SBS)
Advantages
) Highest performance
Disadvantages
) Computationally expensive
) Memory intensive
WRAPPER METHODS TYPES

Forward selection
) starts with one predictor and adds more iteratively.
) At each subsequent iteration, the best of the remaining original predictors are added
based on performance criteria.
) SequentialFeatureSelector class from mlxtend
Backward elimination
) starts with all predictors and eliminates one-by-one iteratively.
) One of the most popular algorithms is Recursive Feature Elimination (RFE) which
eliminates less important predictors based on feature importance ranking.
) RFE class from sklearn
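A minimal sketch using the classes named above (the wine data matches the SFS/SBS example slides; the hyperparameters are illustrative assumptions):

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)
est = LogisticRegression(max_iter=5000)

# Sequential forward selection: grow the subset one feature at a time
sfs = SFS(est, k_features=5, forward=True, scoring="accuracy", cv=5).fit(X, y)
print(sfs.k_feature_idx_, sfs.k_score_)

# Backward elimination via Recursive Feature Elimination:
# start from all features and drop the weakest one at a time
rfe = RFE(est, n_features_to_select=5).fit(X, y)
print(rfe.support_)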

SEQUENTIAL FORWARD SELECTION

SFS EXAMPLE – WINE DATA

SEQUENTIAL BACKWARD SELECTION

SBS EXAMPLE – WINE DATA

EMBEDDED METHODS

Embedded methods combine the


qualities of filter and wrapper
methods.
Implemented by algorithms that have
their own built-in feature selection
methods.
The most common embedded techniques are tree algorithms like RandomForest, ExtraTrees, and so on.

Introduction to Data Mining, by Tan, Steinbach and Vipin Kumar (T1)
Data Mining: Concepts and Techniques, Third Edition, by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, 2006 (T4)

THANK YOU

