

DATA SCIENCE

UNIT 1

Data is a representation of facts or ideas in a formalized manner capable of being communicated or manipulated by some process.


DATA SCIENCE

• Data Science is an interdisciplinary field focused on extracting knowledge from data sets and
applying the knowledge and insights from that data to solve problems in a wide range of application
domains.

• Data science also integrates domain knowledge from the underlying application domain (e.g., natural sciences,
information technology, and medicine).

• Data science is multifaceted and can be described as a science, a research paradigm, a research method, a discipline,
a workflow, and a profession.

• Data science is a "concept to unify statistics, data analysis, informatics, and their related methods" to
"understand and analyze actual phenomena" with data. It uses techniques and theories drawn from many
fields within the context of mathematics, statistics, computer science, information science, and domain knowledge.

WHO ARE DATA SCIENTISTS?

[Figure: data scientists sit at the intersection of four skill areas: Quantitative (maths, algorithms, statistics), Technical (programming skills, infrastructures), Collaborative (teamwork, communication), and Skeptical (create hypotheses, but be skeptical about them).]

 Not computer scientists
 But should know about databases, data structures, algorithms, etc.

 Not mathematicians
 But should know about optimization, stochastics, etc.

 Not statisticians
 But should know about regression, statistical tests, etc.

 Not domain experts
 But must work together with them

 A bit of everything
… but actually as much as possible of everything


TECHNIQUES, PEOPLE, & PROCESSES

[Figure: techniques, people, and processes form a triangle, each supporting the others.]

 Techniques
 Languages, tools, and methods
 Must be suited for the given problem
 People
 Require training for the techniques
 Should be guided through a project by a process
 Process
 Supports the people
 Must be accepted by the people
 Should have a measurable positive effect

OVERVIEW OF GENERIC PROCESS OF DATA SCIENCE PROJECTS

[Figure: the phases form a cycle: Discovery → Data Preparation → Model Planning → Model Building → Communicate Results → Operationalize.]


DISCOVERY

 Initial phase of the project

 Learn the domain


 Knowledge for understanding the data and the use cases of the project
 Knowledge for the interpretation of the results

 Learn from the past


 Identify past projects on similar issues
 Differences, reasons for failures, weaknesses of past projects
 Can also be projects of competitors, if reports are available

DISCOVERY

 Frame the problem


 Framing is the process of stating the data analysis problem to be solved
 Why is the problem important?
 Who are the key stakeholders and what are their interests in the project?
 What is the current situation and what are pain points that motivate the project?
 What are the objectives of the project?
 Business needs
 Research goals
 What needs to be done to achieve the objectives?
 What are success criteria for the project?
 What are risks for the project?


DISCOVERY

 Begin learning the data


 Get a high-level understanding of the data
 Maybe even some initial statistics or visualizations of the data
 Determine requirements for data structures and tools for processing the data

 Formulate hypotheses
 Part of the Science in Data Science
 Should define expectations
 Feature X is well suited for the prediction of …
 The following patterns will be found in the data: …
 Deep learning will outperform …
 Decision trees will perform well and allow insights into …
 Should be discussed with stakeholders

DISCOVERY
 Analyze available resources
 Technologies
 Resources for computation and storage
 Licenses for analysis frameworks
 Data
 Is the available data sufficient for the use case?
 Would other data be required and could the additional data be collected within the scope of the project?
 Timeframe
 Scope in calendar time and person months
 Human resources
 Who is available for the project?
 Is the skillset a good match for the tasks of the project?

 Only start project if the resources are sufficient!


DATA PREPARATION
 Create the infrastructure for the project
 Usually different from infrastructure in which data is made available to you
 Warehouse/CSV file/… → distributed storage that enables analysis
 Could also be simpler, for small data sizes

 Extract – Transform – Load (ETL) the data


 Define how to query existing database to extract required data
 Determine required transformations of the raw data
 Quality checking (e.g., filtering of missing data, implausible data)
 Structuring (e.g., for unstructured data, differences in data structures)
 Conversions (e.g., timestamps, character encodings)
 Load the data into your analysis environment
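A minimal ETL sketch in Python with pandas (illustrative only; the file names and columns such as measurements.csv and sensor_value are hypothetical):

    import pandas as pd

    # Extract: read the raw export from the source system
    raw = pd.read_csv("measurements.csv")  # hypothetical source file

    # Transform: quality checking, structuring, and conversions
    raw = raw.dropna(subset=["sensor_value"])   # filter missing data
    raw = raw[raw["sensor_value"] >= 0]         # drop implausible (negative) readings
    raw["timestamp"] = pd.to_datetime(raw["timestamp"], utc=True)  # normalize timestamps

    # Load: write the cleaned data into the analysis environment
    raw.to_parquet("analysis/measurements.parquet")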

DATA PREPARATION

 ELT vs. ETL


 Transformations can be very time-consuming for big data
 Might not be possible without using the analysis infrastructure
 Load raw data, transform afterwards → ELT!

 Also allows more flexibility with transformations


 E.g., testing the effect of different transformations

 Allows access to raw data
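A hedged sketch of the ELT pattern with pandas (paths and columns are hypothetical): the raw data is loaded first, and different transformations can then be tried side by side.

    import numpy as np
    import pandas as pd

    # Load first: raw data sits unchanged in the analysis environment
    raw = pd.read_parquet("analysis/raw_measurements.parquet")  # hypothetical path

    # Transform afterwards: test the effect of different transformations
    variant_a = raw.assign(value=np.log1p(raw["value"]))             # log scaling
    variant_b = raw.assign(value=raw["value"] / raw["value"].max())  # max scaling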


DATA PREPARATION

 Get a deep understanding of the data


 Understand all data sources
 E.g., what does each column in a relational database contain?
 How can a structure be imposed on semi-/quasi-/unstructured data?

 Survey and visualize data


 Descriptive statistics
 Correlation analysis
 Visualizations like histograms, density plots, pair-wise plots, etc.

 Clean and normalize data


 Discard data that is not required
 Normalize to remove scale effects
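A hedged pandas sketch of surveying, cleaning, and normalizing (the path and the column debug_flag are placeholders):

    import pandas as pd

    df = pd.read_parquet("analysis/measurements.parquet")  # hypothetical path

    # Survey and visualize
    print(df.describe())               # descriptive statistics per column
    print(df.corr(numeric_only=True))  # pair-wise correlation analysis
    df.hist(bins=50)                   # histograms (requires matplotlib)

    # Clean and normalize
    df = df.drop(columns=["debug_flag"])  # discard data that is not required
    df = (df - df.mean()) / df.std()      # z-score normalization removes scale effects
                                          # (assumes only numeric columns remain)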

DATA PREPARATION
 Clean data
 Discard data that is not required
 Can make the difference between a complex infrastructure and a single machine for analysis

 Example (verified in the sketch below):
 100 million measurements
 10 floating point features per measurement → 80 bytes per measurement
 3 useful features → 24 bytes per measurement
 7.45 gigabytes with all features, 2.23 gigabytes with only the useful features
 Can use my laptop for cleaned data without problems
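The arithmetic behind the example, assuming 8-byte (64-bit) floating point values:

    measurements = 100_000_000
    bytes_all    = measurements * 10 * 8  # 10 features x 8 bytes = 8.0e9 bytes
    bytes_useful = measurements * 3 * 8   # 3 features x 8 bytes = 2.4e9 bytes

    print(bytes_all / 2**30)     # ~7.45 gigabytes with all features
    print(bytes_useful / 2**30)  # ~2.23 gigabytes with only the useful features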


MODEL PLANNING

 Determine methods for data analysis

 Should be well-suited to meet objectives
 Often determines the type of method
 Classification, regression, clustering, association mining, …
 Other factors can also restrict the available methods
 For example, if insight is important, "blackbox" methods cannot be used

 Should be well-suited for the available data
 Volume, structure, …

A blackbox method is a method where you only get results, but do not really understand why the output is computed that way. A whitebox method also explains why the output is as it is.

MODEL PLANNING
 Methods for data analysis may cover
 Feature modeling, e.g., for text mining
 Feature selection, e.g., based on information gain, correlations, etc.
 Model creation, e.g., different models that may address the use case
 Statistical methods, e.g., for the comparison of results
 Visualizations, e.g., for the presentation of results

 Split data into different data sets


 Training data, validation data, test data
 "Toy" data for local use in case of big data
 Same structure, but very small
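A common way to produce these sets, sketched with scikit-learn (the 60/20/20 ratio and the stand-in data are assumptions for illustration):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    data = pd.DataFrame({"x": range(10_000), "y": [i % 2 for i in range(10_000)]})  # stand-in data

    # First split off the test set, then split the rest into training and validation
    train_val, test = train_test_split(data, test_size=0.2, random_state=42)
    train, val = train_test_split(train_val, test_size=0.25, random_state=42)  # 0.25 of 80% = 20%

    toy = train.sample(n=100, random_state=42)  # tiny "toy" set with the same structure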


MODEL BUILDING

 Perform the analysis using the planned methods


 Often iterative process!

 Separate phase, because this can be VERY time consuming


 Use toy examples for model planning
 Use the real big data set, with potentially lots of hyperparameters to tune, during model building

 Includes the calculation of performance indicators
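A minimal sketch of the tuning loop with scikit-learn (the model choice, parameter grid, and stand-in data are assumptions for illustration):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=1000, random_state=42)  # stand-in training data

    # Hyperparameter search: the step that can get VERY time consuming on real big data
    grid = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
        scoring="f1",  # performance indicator computed during the search
        cv=5,
    )
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)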

COMMUNICATE RESULTS

 Main question: Was the project successful?

 Compare results to hypothesis from the discovery phase

 Identify the key findings

 Try to quantify the value of your results


 Business value, e.g., the expected Return On Investment (ROI)
 Advancement of the state of the art

 Summarize findings for different audiences


OPERATIONALIZE

 Implement results in operation


 Only in case of successful projects

 Should run a pilot first


 Determine if expectations hold during the practical application
 All kinds of reasons for failures
 Rejection by users, shift in data reduces model performance, ...

 Define a process to update and retrain model


 Data gets older, models get outdated
 Data-driven models should be updated regularly
 Process is required

ROLES WITHIN PROJECTS

 A role is a function or part performed especially in a particular operation or process

 Role ≠ Person
 One role can be fulfilled by multiple persons
 One person can fulfill multiple roles

 Roles assign responsibilities within processes


 In practice, roles are often related to job titles
 Software Developer, Database Administrator, Project Manager, …


ROLES FOR DATA SCIENCE PROJECTS


Business User
• Someone who uses the end results
• Can consult and advise the project team on the value of the end results and how these will be operationalized

Project Sponsor
• Responsible for the genesis of the project
• Generally provides the funding
• Gauges the value from the final outputs

Project Manager
• Ensures key milestones and objectives are met on time and at the expected quality
• Plans and manages resources

Business Intelligence Analyst
• Business domain expertise with a deep understanding of the data
• Understands reporting in the domain, e.g., Key Performance Indicators (KPIs)

Data Engineer
• Deep technical skills to assist with data management and ETL/ELT

Database Administrator
• Provisions and configures the database environment to support the analytical needs of the project

Data Scientist
• Expert on analytical techniques and data modeling
• Applies valid analytical techniques to given business problems
• Ensures analytical objectives are met

DELIVERABLES

 A deliverable is a tangible or intangible good or service produced as a result of a project.


 Are often parts of contracts
 Should meet stakeholders' needs and expectations

 Four core deliverables for data science projects


 Sponsor presentation
 Analyst presentation
 Code
 Technical specifications


SPONSOR PRESENTATION

 Big Picture of the project

 Clear takeaway messages


 Highlight KPIs
 Should aid decision making

 Should address a non-technical audience

 Clean and simple visualizations


 For example, bar charts, line charts, …

ANALYST PRESENTATION

 Describe analysis methods and data


 General approach
 Interesting insights, unexpected situations

 Details on how results change current status


 Business process changes
 Advancement of the state of the art

 May use more complex visualizations


 For example, density plots, histograms, boxplots, ROC curves, …
 Should still be clean and not overloaded


CODE AND TECHNICAL SPECIFICATION

 All available code of the project


 Often code is prototypical ("hacky") because results are more important than clean code

 Enables operationalization
 May re-use code as is
 May adopt code or clean up code
 May rewrite same functionality in a different language/for a different environment

 Technical specification should be provided as well


 Description of the environment
 Description of how to invoke code

EXPECTED DELIVERABLES BY ROLE


Business User
• Expects a sponsor presentation: Are the results good for me? What are the benefits for me? What are the implications for me?

Project Sponsor
• Expects a sponsor presentation: What is the impact of operationalizing the results? What are the risks and what is the potential ROI? How can this be evangelized within the organization (and beyond)?

Project Manager
• Responsible for the timely availability of all deliverables
• Responsible for the sponsor presentations

Business Intelligence Analyst
• Expects an analyst presentation: Which data was used? How will reporting change? How will KPIs change?

Data Engineer
• Responsible for data engineering code and technical documentation

Database Administrator
• Responsible for infrastructure code and technical documentation

Data Scientist
• May be the target audience for analyst presentations
• Responsible for data analysis code and technical documentation
• Responsible for the analyst presentation
• Supports the project management with the sponsor presentation


DATA AS DELIVERABLE

 Only applicable if new data was collected/generated

 Sharing the data may be very important


 Especially in research to enable reproducible and replicable research

 Sharing may be internal (industry) or public (research)


 Use stable links for references to prevent link rot
 Ideally Digital Object Identifiers (DOIs)

 Should not only contain the data, but also metadata and tools for collecting the data

WHAT IS BIG DATA?

 Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. [Source: Wikipedia]


WHAT IS BIG DATA?

 Big data is data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value. [McKinsey Global Institute, 2011]


15
12-02-2024

WHAT IS BIG DATA ANALYTICS?

“Reality is one, though wise men speak of it variously”.


Rigveda

 Humans have a tendency to claim absolute truth based on their limited, subjective experience
as they ignore other people's limited, subjective experiences which may be equally true.

 Individual truth may be partially true but it is not the ultimate truth.

 There might be some fact to what somebody says. We might not agree with it at first because we have
our own reasons. But what we think might not be the absolute truth.


WHAT IS BIG DATA ANALYTICS?


 All about finding the needle of value in a haystack of
structured, semi-structured, and unstructured information


16
12-02-2024

CHARACTERISTICS OF BIG DATA

 Data growth challenges and opportunities are defined as being three-dimensional, i.e., increasing volume, velocity, and variety:

 Volume: the quantity of generated and stored data

 Velocity: the speed at which data is generated and processed

 Variety: the type and nature of data




VOLUME (SCALE)

CERN's Large Hadron Collider (LHC) generates 15 PB a year!

VELOCITY (SPEED)

 Data is generated fast and needs to be processed fast

 Late decisions → missing opportunities
 Examples
 E-Promotions: based on your current location, your purchase history, and what you like → send promotions right now for the store next to you
 Healthcare monitoring: sensors monitoring your activities and body → any abnormal measurements require immediate reaction


REAL-TIME/FAST DATA

[Figure: sources of real-time data include mobile devices (tracking all objects all the time), scientific instruments (collecting all sorts of data), social media and networks, and sensor technology and networks (measuring all kinds of data).]

 Progress and innovation are no longer hindered by the ability to collect data
 But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion

VARIETY (COMPLEXITY)

 Relational Data (Tables/Transaction/Legacy Data)


 Text Data (Web)
 Semi-structured Data (XML)
 Graph Data
 Social Network, Semantic Web (RDF), …
 Streaming Data

To extract knowledge, all these types of data need to be linked together.


SOME MAKE IT TEN V’S …


DATA ENGINEERING, DATA ANALYSIS, DATA SCIENCE

 Data engineering: designing and building infrastructure for integrating and managing
data from various resources

 Data analysis: querying and processing data, providing reports, summarizing and
visualizing data

 Data science: applying statistics, machine learning, and analytic approaches to solve critical business problems, and turning data into valuable and actionable insights


ANALYSIS VERSUS ANALYTICS

 “Analysis is the separation of a whole into its component parts, and analytics is the
method of logical analysis.”

 Analysis looks backwards over time, providing marketers with a historical view of what has happened.

 Analytics looks forward, modeling the future or predicting a result.

 Analytics defines the science behind the analysis.



BUSINESS ANALYSIS VERSUS BUSINESS ANALYTICS


21
12-02-2024

ANALYTICS TYPES

 DESCRIPTIVE

 DIAGNOSTIC

 PREDICTIVE

 PRESCRIPTIVE


DESCRIPTIVE ANALYTICS: WHAT IS HAPPENING?

 Comprises analyzing past data to present it in a summarized form which can be easily interpreted.

 A major portion of analytics done today is descriptive analytics, through the use of statistical functions such as counts, maximum, minimum, mean, top-N, and percentages. These statistics help in describing patterns in the data and present the data in a summarized form.

 For example, computing the total number of likes for a particular post, computing the average monthly rainfall, or finding the average number of visitors per month on a website, as in the sketch below.
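The rainfall example as a hedged pandas sketch (the data frame and its values are made up for illustration):

    import pandas as pd

    rain = pd.DataFrame({
        "month": ["Jan", "Jan", "Feb", "Feb"],
        "rainfall_mm": [80.0, 95.5, 60.2, 71.3],
    })  # stand-in data

    # Descriptive statistics: count, mean, min, max per month
    print(rain.groupby("month")["rainfall_mm"].agg(["count", "mean", "min", "max"]))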


DIAGNOSTIC ANALYTICS: WHY IS IT HAPPENING?

 Comprises analysis of past data to diagnose the reasons as to why certain events happened.

 Let us consider an example of a system that collects and analyzes sensor data from machines for monitoring their health and predicting failures.
 Diagnostic analytics can provide more insights into why a certain fault has occurred, based on the patterns in the sensor data for previous faults, as in the sketch below.
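One hedged way to sketch such a diagnosis in pandas: compare sensor readings during faults against normal operation (all names and values are hypothetical):

    import pandas as pd

    sensors = pd.DataFrame({
        "temperature": [60, 62, 95, 97, 61, 96],
        "vibration":   [0.2, 0.3, 0.9, 1.1, 0.2, 1.0],
        "fault":       [0, 0, 1, 1, 0, 1],
    })  # stand-in sensor log

    # Which readings differ between faulty (1) and normal (0) periods?
    print(sensors.groupby("fault").mean())  # high temperature and vibration co-occur with faults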

PREDICTIVE ANALYTICS: WHAT IS LIKELY TO HAPPEN?


 Comprises predicting the occurrence of an event or the likely
outcome of an event or forecasting the future values using
prediction models.
 For example, predictive analytics can be used for predicting when a fault will occur in a machine, predicting whether a tumor is benign or malignant, predicting the occurrence of natural emergencies (events such as forest fires or river floods), or forecasting pollution levels.
 These models learn patterns and trends from the existing data and
predict the occurrence of an event or the likely outcome of an event
(classification models) or forecast numbers (regression models).
 The accuracy of prediction models depends on the quality and volume
of the existing data available for training the models, such that all the
patterns and trends in the existing data can be learned accurately.
 The typical approach adopted while developing prediction models is to divide the existing data into training and test data sets (for example, 75% of the data is used for training and 25% for testing the prediction model), as sketched below.
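The 75/25 split, sketched with scikit-learn (the classifier and stand-in data are assumptions for illustration):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=42)  # stand-in data

    # 75% of the data for training the prediction model, 25% for testing it
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(model.score(X_test, y_test))  # accuracy on unseen test data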


PRESCRIPTIVE ANALYTICS: WHAT DO I NEED TO DO?


 While predictive analytics uses prediction models to predict the likely
outcome of an event, prescriptive analytics uses multiple prediction
models to predict various outcomes and the best course of action for
each outcome.

 Prescriptive Analytics can predict the possible outcomes based


on the current choice of actions. We can consider prescriptive
analytics as a type of analytics that uses different prediction models for
different inputs.

 Prescriptive analytics prescribes actions or the best option to


follow from the available options.

 For example, prescriptive analytics can be used to prescribe the best medicine for the treatment of a patient, based on the outcomes of various medicines for similar patients (see the sketch below).
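A hedged sketch of the idea of using different prediction models for different actions: score each candidate action with its own model and prescribe the best one (the models and actions are hypothetical):

    def prescribe(patient, models):
        """Return the action whose model predicts the best outcome."""
        outcomes = {action: model(patient) for action, model in models.items()}
        return max(outcomes, key=outcomes.get)

    models = {  # stand-in outcome predictors, one per candidate medicine
        "medicine_a": lambda p: 0.7 if p["age"] < 60 else 0.4,
        "medicine_b": lambda p: 0.6,
    }
    print(prescribe({"age": 45}, models))  # prescribes "medicine_a"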

DATA ANALYTICS



INDUSTRY-WIDE APPLICATIONS OF ANALYTICS

Manufacturing
 Sample analytical problems: supply chain analytics; quality and process improvement; revenue and cost management; warranty analytics
 Data sources: procurement, sales, and production data; warranty and after-sales service data; commodity price data; manufacturing data; macroeconomic data

Retail
 Sample analytical problems: assortment planning; promotion planning; demand forecasting; market basket analysis; customer segmentation
 Data sources: price data; demand data at SKU and category level; SKU-level sales data with and without promotions; planogram data; customer demographics data; point-of-sale data; loyalty program data

Healthcare
 Sample analytical problems: clinical care; hospitality
 Data sources: all patient care related data; hospitality related data; patient feedback data

INDUSTRY-WIDE APPLICATIONS OF ANALYTICS

Service
 Sample analytical problems: demand forecasting; NPS optimization; service quality analysis; customer segmentation; promotion
 Data sources: transactional and feedback data; pricing and demand data; promotional data

Banking & Finance
 Sample analytical problems: assortment planning; promotion planning; demand forecasting; market basket analysis; customer segmentation
 Data sources: customer transactional data; loan origination data; credit scoring data

IT and ITES (IT Enabled Services)
 Sample analytical problems: demand for analytics services; software development cycle time
 Data sources: customer interaction and market research data; internal product development data

** Primary and secondary sources of data to be used in solving these analytical problems


REFERENCES

 Several online materials and research papers

