DS Unit 1
DATA SCIENCE
UNIT 1
12-02-2024
DATA SCIENCE
• Data Science is an interdisciplinary field focused on extracting knowledge from data sets and
applying the knowledge and insights from that data to solve problems in a wide range of application
domains.
• Data science also integrates domain knowledge from the underlying application domain (e.g., natural sciences,
information technology, and medicine).
• Data science is multifaceted and can be described as a science, a research paradigm, a research method, a discipline,
a workflow, and a profession.
• Data science is a "concept to unify statistics, data analysis, informatics, and their related methods" to
"understand and analyze actual phenomena" with data. It uses techniques and theories drawn from many
fields within the context of mathematics, statistics, computer science, information science, and domain knowledge.
Not statisticians
But should know about regression, statistical tests, etc.
Not domain experts
But must work together with them
Skeptical
• Create hypotheses, but be skeptical about them
A bit of everything
… but actually as much as possible of everything
Techniques
Languages, tools, and methods
Must be suited for the given problem
People
Require training for the techniques
Should be guided through a project by a process
Processes
Supports the people
Must be accepted by the people
Should have a measurable positive effect
Discovery
Data Preparation
Model Planning
Model Building
Communicate Results
Operationalize
DISCOVERY
Formulate hypothesis
Part of the Science in Data Science
Should define expectations
Feature X is well suited for the prediction of …
The following patterns will be found in the data: …
Deep learning will outperform …
Decision trees will perform well and allow insights into …
Should be discussed with stakeholders
Analyze available resources
Technologies
Resources for computation and storage
Licenses for analysis frameworks
Data
Is the available data sufficient for the use case?
Would other data be required and could the additional data be collected within the scope of the project?
Timeframe
Scope in calendar time and person months
Human resources
Who is available for the project?
Is the skillset a good match for the tasks of the project?
DATA PREPARATION
Create the infrastructure for the project
Usually different from infrastructure in which data is made available to you
From warehouse/csv-file/… to distributed storage that enables analysis
Could also be simpler, for small data sizes
Clean data
Discard data that is not required
Can make the difference between a complex infrastructure and a single machine for analysis
Example:
100 million measurements
10 floating point features per measurement ≈ 80 Bytes per measurement (8 Bytes per feature)
3 useful features ≈ 24 Bytes per measurement
About 7.45 GiB with all features, about 2.24 GiB with only useful features
Can use my laptop for cleaned data without problems
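The arithmetic behind this example can be checked with a short sketch. It assumes that every feature is stored as a 64-bit float (8 bytes) and that the sizes are reported in binary units (1 GiB = 2^30 bytes); neither assumption is stated on the slide, and the names below are illustrative only.

```python
# Back-of-the-envelope size estimate for the example above.
# Assumption: each floating point feature is a 64-bit double (8 bytes).
BYTES_PER_FLOAT = 8
N_MEASUREMENTS = 100_000_000
GIB = 2**30  # bytes in one GiB (binary gigabyte)


def dataset_size_gib(n_rows: int, n_features: int) -> float:
    """Raw size of n_rows x n_features doubles in GiB, ignoring any overhead."""
    return n_rows * n_features * BYTES_PER_FLOAT / GIB


all_features = dataset_size_gib(N_MEASUREMENTS, 10)
useful_features = dataset_size_gib(N_MEASUREMENTS, 3)

print(f"all features:    {all_features:.2f} GiB")
print(f"useful features: {useful_features:.2f} GiB")
```

The estimate covers the raw values only; file formats and in-memory data structures usually add overhead on top of it.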
MODEL PLANNING
Methods for data analysis may cover
Feature modeling, e.g., for text mining
Feature selection, e.g., based on information gain, correlations, etc.
Model creation, e.g., different models that may address the use case
Statistical methods, e.g., for the comparison of results
Visualizations, e.g., for the presentation of results
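As one possible illustration of correlation-based feature selection, the sketch below ranks features by the absolute Pearson correlation with the target. The data, the feature names, and the helper functions `pearson` and `rank_features` are invented for this example; information gain or other criteria would be equally valid choices.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)


def rank_features(features, target):
    """Return (name, |r|) pairs sorted from strongest to weakest correlation."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data, invented for illustration only.
features = {
    "temperature": [1.0, 2.0, 3.0, 4.0, 5.0],
    "noise": [2.0, -1.0, 0.5, -0.5, 1.0],
    "constant": [7.0, 7.0, 7.0, 7.0, 7.0],
}
target = [2.1, 3.9, 6.2, 8.1, 9.8]

ranking = rank_features(features, target)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A ranking like this only captures linear relationships with the target, so it is a starting point for model planning rather than a final selection.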
MODEL BUILDING
COMMUNICATE RESULTS
OPERATIONALIZE
Role ≠ Person
One role can be fulfilled by multiple persons
One person can fulfill multiple roles
Project Manager
• Ensures key milestones and objectives are met on time and at expected quality
• Plans and manages resources
Business Intelligence Analyst
• Business domain expertise with deep understanding of the data
• Understands reporting in the domain, e.g., Key Performance Indicators (KPIs)
Data Engineer
• Deep technical skills to assist with data management and ETL/ELT
Database Administrator
• Provisions and configures database environment to support the analytical needs of the project
DELIVERABLES
SPONSOR PRESENTATION
ANALYST PRESENTATION
Enables operationalization
May re-use code as is
May adapt code or clean up code
May rewrite same functionality in a different language/for a different environment
DATA AS DELIVERABLE
Should not only contain the data, but also metadata and tools for collecting the data
Big data is a term for data sets that are so large or complex
that traditional data processing application software is
inadequate to deal with them. [source: Wikipedia]
Humans have a tendency to claim absolute truth based on their limited, subjective experience
as they ignore other people's limited, subjective experiences which may be equally true.
Individual truth may be partially true but it is not the ultimate truth.
There might be some fact to what somebody says. We might not agree with it at first because we have
our own reasons. But what we think might not be the absolute truth.
VOLUME (SCALE)
CERN’s Large Hadron Collider (LHC) generates 15 PB a year!
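To put this yearly volume in perspective, it can be converted into an average data rate. The sketch below assumes decimal units (1 PB = 10^15 bytes) and a 365-day year, and it treats the data as if it arrived evenly over the year, which is a simplification.

```python
# Assumptions: decimal units (1 PB = 10**15 bytes) and a 365-day year.
PB = 10**15
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

bytes_per_year = 15 * PB
avg_bytes_per_second = bytes_per_year / SECONDS_PER_YEAR
avg_tb_per_day = bytes_per_year / 365 / 10**12

print(f"{avg_bytes_per_second / 10**6:.0f} MB/s on average")
print(f"{avg_tb_per_day:.1f} TB/day on average")
```

Under these assumptions the average works out to roughly 476 MB every second, or about 41 TB every day.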
VELOCITY (SPEED)
REAL-TIME/FAST DATA
Mobile devices (tracking all objects all the time)
Scientific instruments (collecting all sorts of data)
Social media and networks
Sensor technology and networks (measuring all kinds of data)
Progress and innovation are no longer hindered by the ability to collect data,
but by the ability to manage, analyze, summarize, visualize, and discover knowledge
from the collected data in a timely manner and in a scalable fashion
VARIETY (COMPLEXITY)
To extract knowledge, all these types of data need to be linked together
Data engineering: designing and building infrastructure for integrating and managing
data from various resources
Data analysis: querying and processing data, providing reports, summarizing and
visualizing data
Data science: applying statistics, machine learning and analytic approaches to solve
critical business problems, and turning data into valuable and actionable insights
“Analysis is the separation of a whole into its component parts, and analytics is the
method of logical analysis.”
ANALYTICS TYPES
DESCRIPTIVE
DIAGNOSTIC
PREDICTIVE
PRESCRIPTIVE
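The four analytics types can be illustrated on a single toy data set. The sketch below is only one possible reading of the terms: the monthly figures, the candidate budgets, and the helper names are invented for this example, and a simple least squares line stands in for whatever model a real project would use.

```python
# Hypothetical monthly data, invented for illustration only.
ad_spend = [10.0, 12.0, 15.0, 18.0, 20.0]     # thousand currency units
sales = [110.0, 118.0, 131.0, 143.0, 152.0]   # thousand currency units


def mean(values):
    return sum(values) / len(values)


def fit_line(xs, ys):
    """Ordinary least squares fit of ys = slope * xs + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


# Descriptive: what happened?
average_sales = mean(sales)

# Diagnostic: why did it happen? Here: how do sales move with ad spend?
slope, intercept = fit_line(ad_spend, sales)


# Predictive: what is likely to happen for a spend not seen before?
def predict_sales(spend):
    return slope * spend + intercept


# Prescriptive: what should be done? Pick the candidate budget with the
# best predicted margin (sales minus spend). The linear model should only
# be trusted close to the observed range of ad spend.
candidates = [16.0, 19.0, 22.0]
best_budget = max(candidates, key=lambda s: predict_sales(s) - s)

print(f"average sales: {average_sales:.1f}")
print(f"extra sales per unit of ad spend: {slope:.2f}")
print(f"predicted sales at spend 16: {predict_sales(16.0):.1f}")
print(f"recommended budget: {best_budget}")
```

Because the fitted line is straight and its slope is greater than one, the prescriptive step simply picks the largest candidate here; a real recommendation would need a model with diminishing returns and constraints from the business.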
DATA ANALYTICS
IT and ITES (IT-enabled services)
Demand for Analytics Services: Customer interaction and market research data
Software Development Cycle time: Internal product development data
** Primary sources of data and secondary sources to be used in solving these analytical problems