Int To Ds

The document introduces the concept of data science and the data revolution, highlighting the exponential growth of data across various sectors and its implications. It discusses the characteristics and challenges of big data, the demand for data science professionals, and defines data science as an interdisciplinary field focused on extracting insights from structured and unstructured data. Additionally, it outlines the competencies required in data science, including data analytics, engineering, and domain expertise.

Chapter 1

Introduction to Data Science

Data Revolution
• Data is created constantly, and at an ever-increasing rate.
• Massive amounts of data are collected about many aspects of our lives:
  shopping, communicating, reading news, listening to music, searching for
  information, expressing our opinions.
• This happens across finance, the medical industry, pharmaceuticals, bioinformatics,
  government, education, retail, and the list goes on.
• Websites track every user's every click.
• Smartphones are building up a record of our location.
• Smart cars collect driving habits, smart homes collect living habits, and
  smart marketers collect purchasing habits.

Data Revolution
• Cross-referenced encyclopedias; domain-specific databases about movies,
  music, sports results, pinball machines, and more.
• There is a growing influence of data in most sectors and most industries.
• A culturally saturated feedback loop: our behavior changes
  the product and the product changes our behavior.
• Technology makes this possible:
  • infrastructure for large-scale data processing,
  • increased memory and bandwidth, as well as a cultural acceptance of
    technology.

Big Data - a tsunami that is hitting us
• We are witnessing a tsunami of data:
  • Huge volumes
  • Data of different types and formats
  • Impacting the business at new and ever-increasing speeds
• The challenges:
  • Capturing/collecting data
  • Managing
  • Processing - from managing the raw data to programming to
    provide insight into the data
  • Storing - safeguarding and securing
• “Big Data refers to non-conventional strategies and innovative
  technologies used by businesses and organizations to
  capture, manage, process, and make sense of a large volume
  of data”
Data has an intrinsic property… it grows and grows
• 90% of the world’s data was created in the last two years
• 80% of the world’s data today is unstructured
• 20% of available data can be processed by traditional systems
• 1 in 2 business leaders don’t have access to data they need

Growing interconnected & instrumented world
Data Revolution
• eBay captures a terabyte of data per minute
• Every mouse click on a web site is captured in Web log files
• Machines (smart meters, sensors, GPS, etc.)
• Social media sites

Characteristics of the Data Revolution

Characteristics of Big Data

16th Annual Accounting Educators Seminar - University of Missouri - Kansas City, March 3, 2017

Causes of Data Revolution

Causes of Data Revolution (Historical perspective)
Year  Event
1991  World Wide Web is born
1995  Sun releases the Java platform
      Global Positioning System (GPS) becomes omnipresent in cars and airplanes
1999  The term "Internet of Things" is coined
2001  Wikipedia is launched
2003  The amount of data created surpasses the amount of data created in all
      of human history before then
      LinkedIn is launched, 260 million users by 2013
2004  Facebook is launched, 1.15 billion users by 2013
2008  The number of devices connected to the Internet exceeds the
      world’s population
2011  The IPv4 address space has all been assigned (about 4.3 billion unique
      addresses)
2012  The Obama administration announces the Big Data Research
      and Development Initiative

Causes of Data Revolution
• The major drivers of the data revolution can be identified as:
  • Development of the Web
  • Open data initiatives across the globe
  • Internet of Things

Demand for Data Science
• According to U.S. News & World Report in 2023, information
  security analyst, software developer, and data scientist ranked among
  the top jobs in terms of pay and demand.
• Data scientist: average annual salary of $152,279

Definition of Data Science
• Data science (DS) is an interdisciplinary field
that uses scientific methods, processes,
algorithms, and systems to extract knowledge
and insights from structured, semi-structured and
unstructured data.
• In simpler terms, DS is about obtaining,
processing, and analyzing data to gain insights
for many purposes.

Definition of Data Science
• DS combines various technologies, techniques,
and theories from various fields, mostly related to
computer science, statistics, and mathematics, to
obtain actionable knowledge from data.
• In simple terms, it is the umbrella of techniques
used when trying to extract insights and
information from data.

Data Science

Discipline Definition (reading assignment)
Groups that have tried to define the data science profession:
• ACM Data Science Task Force (2019)
• The EDISON Data Science Framework (2018)
• The National Academies of Sciences, Engineering, and Medicine
  Report on Data Science for Undergraduates (2018)
• The Park City Report (2017)
• The Business Higher Education Framework (BHEF) Data Science and
  Analytics (DSA) Competency Map (2016)
• Business Analytics Curriculum for Undergraduate Majors (2015)

Data Analytics life cycle

Identified Data Science Competence Groups
• Traditional/known Data Science competence/skill groups include
  • Data Analytics or Business Analytics or Machine Learning
  • Engineering or Programming
  • Subject/Scientific Domain Knowledge
• EDISON identified 2 additional competence groups demanded by organisations
  • Data Management, Curation, Preservation
  • Scientific or Research Methods and/vs Business Processes/Operations

Identified Data Science Competence Groups
• Other commonly recognized skills, also known as “soft skills” or “social intelligence”
  • Inter-personal skills or team work, cooperativeness
• All groups need to be represented in Data Science curricula and training programs
  • A challenging task for Data Science education and training
• Another aspect is integrating the Data Scientist into the organization's structure
  • General Data Science (or Big Data) literacy for all involved roles and management
  • A commonly agreed and understandable way of communication and
    information/data presentation
  • Role of the Data Scientist: provide such literacy advice and guidance to the
    organisation

Data Science Competence Groups - Research

Data Science Competence includes 5 areas/groups
• Data Analytics
• Data Science Engineering
• Domain Expertise
• Data Management
• Scientific Methods (or Business Process Management)

Scientific Methods
• Design Experiment
• Collect Data
• Analyse Data
• Identify Patterns
• Hypothesise Explanation
• Test Hypothesis

Business Operations
• Operations Strategy
• Plan
• Design & Deploy
• Monitor & Control
• Improve & Re-design

Data Science Competences Groups – Business

Data Science Competence includes 5 areas/groups
• Data Analytics
• Data Science Engineering
• Domain Expertise
• Data Management
• Scientific Methods (or Business Process Management)

Scientific Methods
• Design Experiment
• Collect Data
• Analyse Data
• Identify Patterns
• Hypothesise Explanation
• Test Hypothesis

Business Process Operations/Stages
• Design
• Model/Plan
• Deploy & Execute
• Monitor & Control
• Optimise & Re-design

[Figure: diagram placing Data Science at the overlap of Domain Expertise, Data Analytics (analytic systems and algorithms), Engineering Competences, Data Management, and Scientific Methods / Business Process Management, surrounded by the stages Design, Modelling, Execution, Monitoring, and Optimisation]

DS competency groups

Data Analytics (DA)
1. Use appropriate statistical techniques on available data to deliver insights
2. Use predictive analytics to analyse big data and discover new relations
3. Research and analyze complex data sets, combine different sources and types of data to improve analysis
4. Develop specialized analytics to enable agile decision making
5. Collect and manage different sources of data
6. Visualise complex and variable data

Data Management/Curation (DM)
1. Develop and implement data strategy
2. Develop data models including metadata
3. Integrate different data sources and provide for further analysis
4. Develop and maintain a historical data repository of analysis

DS Engineering (DSE)
1. Use engineering principles to research, design, or develop structures, instruments, machines, experiments, processes, systems, theories, or technologies
2. Develop specialized data analysis tools to support executive decision making
3. Design, build, operate relational and non-relational databases
4. Develop and apply computational solutions to domain related problems using a wide range of data analytics platforms
5. Develop solutions for secure and reliable data access
6. Develop algorithms to analyse multiple sources of data
7. Prototype new data analytics applications

Scientific/Research Methods (DSRM)
1. Create new understandings and capabilities by using the scientific method's hypothesis, test, and evaluation techniques; critical review; or similar engineering research and development methods
2. Direct systematic study toward a fuller knowledge or understanding of the fundamental aspects of phenomena and of observable facts, and discover new approaches to achieve goals
3. Undertake creative work, making systematic use of investigation or experimentation, to discover or revise knowledge of reality, and use this knowledge to devise new applications
4. Apply ingenuity to complex problems, develop innovative ideas
5. Ability to translate strategies into action plans and follow through to completion
6. Influence the development of organizational objectives

DS Domain Knowledge (including Business Apps)
1. Understand business and provide insight, translate unstructured business problems into an abstract mathematical framework
2. Use data to improve existing services or develop new services
3. Participate strategically and tactically in financial decisions that impact management and organizations
4. Recommend business related strategic objectives and alternatives and implement them
5. Provide scientific, technical, and analytic support services to other organisational roles
6. Analyse multiple data sources for marketing purposes
7. Analyse customer data to identify/optimise customer relations actions

DS vs Statistics
• Statistics education deals with:
  • Structured data
  • Data from sampling/census
  • Inferential study (estimation and hypothesis testing)
  • Small p and small n
  • Focus on statistical inference (Fisherian or Bayesian)

Statistical Inference
• The world we live in is complex, random, and uncertain
• It’s one big data-generating machine
• We capture the world or certain traces of the world into
data
• Those captured traces are then converted into something
more comprehensible, something that somehow
captures it all in a much more concise way. That
something could be mathematical models or functions of
the data, known as statistical estimators.
• This overall process of going from the world to the data,
and then from the data back to the world, is the field of
statistical inference.

Statistical Inference
• We usually infer not from the total population but
from a sample.
• In the age of Big Data, where we may have data on the whole
population, the notion of taking a sample may not
work.
• The new kinds of data in Big Data require us to
think more carefully about what sampling means
in these contexts.
• How do you sample from a network and preserve
the complex network structure?
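
As a minimal sketch of inferring from a sample rather than from the whole population, the snippet below simulates a population, draws a simple random sample, and uses the sample mean as an estimator of the population mean. The population (normally distributed values), the sample size, and the seed are all assumptions made for illustration; they do not come from the slides.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 simulated measurements (an assumption).
population = [random.gauss(50, 10) for _ in range(10_000)]

# Simple random sample without replacement.
sample = random.sample(population, 200)

# The sample mean is a statistical estimator of the population mean.
population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

# Estimated standard error of the sample mean: s / sqrt(n).
standard_error = statistics.stdev(sample) / len(sample) ** 0.5

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f} (standard error {standard_error:.2f})")
```

Simple random sampling like this treats observations as independent, which is exactly what breaks down for network data, where the connections between observations are part of what needs to be preserved.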

Understanding data
Types of data
• Categorical data
  • Nominal, e.g. colour, gender, etc.
  • Ordinal, e.g. military rank, academic rank, overall performance, etc.
• Numerical data
  • Discrete, e.g. number of children in a household (HH), number of students in a class.
  • Continuous, e.g. income, age, weight, etc.
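
The type of a variable determines which summaries make sense. The sketch below, using made-up values, picks one plausible summary per type: the mode for nominal data, the median position for ordinal data, and the mean for numerical data. The rank ordering in `RANK_ORDER` is an assumed example, not something defined in the slides.

```python
from collections import Counter
import statistics

# Hypothetical observations, one list per data type.
colours = ["red", "blue", "red", "green"]                 # nominal
ranks = ["Lecturer", "Professor", "Assistant Professor",
         "Lecturer", "Associate Professor"]               # ordinal
children_per_household = [0, 2, 1, 3, 2]                  # discrete
incomes = [1200.5, 980.0, 1500.25]                        # continuous

# Assumed ordering of academic ranks, lowest to highest.
RANK_ORDER = ["Lecturer", "Assistant Professor", "Associate Professor", "Professor"]

# Nominal: categories have no order, so only counts and the mode are meaningful.
most_common_colour = Counter(colours).most_common(1)[0][0]

# Ordinal: categories can be ranked, so a median category is meaningful.
positions = sorted(RANK_ORDER.index(r) for r in ranks)
median_rank = RANK_ORDER[positions[len(positions) // 2]]

# Numerical: arithmetic such as the mean is meaningful.
mean_children = statistics.mean(children_per_household)
mean_income = statistics.fmean(incomes)

print(most_common_colour, median_rank, mean_children, round(mean_income, 2))
```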
Understanding data
• Classification of digital data
• 1. Structured data:
  o Information that has been meticulously organized
    into a predefined format.
  o Its elements are addressable, making it suitable
    for effective analysis.
  o Storage format: resides in relational databases,
    where it is stored in tables with rows and columns.
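
To make "tables with rows and columns" concrete, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The `students` table and its rows are hypothetical; the point is only that every element is addressable by row and column.

```python
import sqlite3

# In-memory relational database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The schema is the predefined format: each column has a name and a type.
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO students (id, name, age) VALUES (?, ?, ?)",
    [(1, "Abebe", 21), (2, "Sara", 23), (3, "Lensa", 22)],  # hypothetical rows
)

# Because elements are addressable, analysis is a short query.
average_age = conn.execute("SELECT AVG(age) FROM students").fetchone()[0]
name_of_student_2 = conn.execute(
    "SELECT name FROM students WHERE id = ?", (2,)
).fetchone()[0]

conn.close()
print(average_age, name_of_student_2)
```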

Understanding data
• Classification of digital data
• 2. Semi-structured data:
  o Doesn’t fit neatly into a relational database structure
    but possesses some organizational properties.
  o It’s more flexible than structured data but less so than
    unstructured data.
  o Storage considerations: with some processing it can be stored in a
    relational database, although this can be challenging
    for certain types of semi-structured data.
  o E.g. XML data
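
A short sketch of what "some organizational properties" means in practice: the XML below (a made-up example) is tagged and nested, but records do not all carry the same fields. Python's standard `xml.etree.ElementTree` module can flatten it into records, with missing fields showing up as `None`.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document: the second customer has no <phone> element.
xml_text = """
<customers>
  <customer id="1"><name>Abebe</name><phone>0911</phone></customer>
  <customer id="2"><name>Sara</name></customer>
</customers>
""".strip()

root = ET.fromstring(xml_text)

# Flatten each <customer> element into a dictionary (one "row" per record).
records = []
for customer in root.findall("customer"):
    records.append({
        "id": customer.get("id"),              # attribute value, as a string
        "name": customer.findtext("name"),
        "phone": customer.findtext("phone"),   # None when the tag is absent
    })

print(records)
```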

Understanding data
• Classification of digital data
• 3. Unstructured data:
  o Unstructured data lacks a predefined format or data
    model.
  o It doesn’t conform to rigid structures like those
    required by mainstream relational databases.
  o Storage platforms: organizations increasingly use
    alternative platforms to store and manage
    unstructured data.
  o E.g. Word documents, PDF files, plain text, media
    logs, video, audio, etc.
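
Because unstructured data has no predefined fields, analysis usually starts by imposing some structure on it. As one simple illustration, the sketch below turns a made-up piece of plain text into word counts; real text processing would typically need more careful tokenisation.

```python
import re
from collections import Counter

# Hypothetical free-text customer comment.
text = "Great beer, great price. The delivery was late, but the beer was great."

# Impose a minimal structure: lowercase word tokens.
words = re.findall(r"[a-z']+", text.lower())

# Word frequencies are now structured (word, count) pairs that can be analysed.
word_counts = Counter(words)

print(word_counts.most_common(3))
```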

Understanding data
Data sources
• 1. Panel Data: Panel data, also known as
  longitudinal data, involves measurements over
  time for the same subjects (individuals, firms,
  countries, etc.).
• Examples:
  • Tracking annual income for the same individuals
    over several years.
  • Observing stock prices for specific companies
    across different quarters.
• Use Case: Panel data is valuable for studying
  changes within subjects over time and
  analyzing individual-level effects.
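
As a small sketch of "changes within subjects over time", the snippet below stores annual income for the same two hypothetical individuals and computes each person's year-over-year change. The names and figures are invented for illustration.

```python
# Hypothetical panel: the same individuals observed in several years.
income = {
    "person_a": {2021: 30000, 2022: 32000, 2023: 35000},
    "person_b": {2021: 45000, 2022: 44000, 2023: 47000},
}


def yearly_changes(series):
    """Return the change from each year to the next, keyed by the later year."""
    years = sorted(series)
    return {
        later: series[later] - series[earlier]
        for earlier, later in zip(years, years[1:])
    }


# Within-subject changes: each person is compared only with their own past.
changes = {person: yearly_changes(series) for person, series in income.items()}

print(changes)
```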
Understanding Data
• Data sources
• 2. Transactional Data: Transactional data captures
  information from specific transactions. It includes the details
  necessary to define each transaction.
• Examples:
  • Online product sales records.
  • Credit card transactions.
  • Checking account deposits and withdrawals.
• Use Case: Transactional data is crucial for business
  analytics, fraud prevention, and process optimization.

Understanding Data
• Data sources
• 3. Biological Data: Biological data refers to
  information derived from living organisms and their
  products.
• Examples:
  • DNA sequences.
  • Protein structures.
  • Genomic data.
  • Amino acid sequences.
• Use Case: Bioinformatics leverages biological data to
  analyze and interpret vast amounts of genomic
  information.

Understanding Data
• Data sources
• 4. Spatial Data: Spatial data directly or
  indirectly references specific geographical areas
  or locations. It includes both location-specific
  data and other relevant information.
• Examples:
  • Geometric data (e.g., floor plans, Google Maps
    directions).
  • Geographic data (e.g., latitude and longitude
    relationships).
• Use Case: Spatial data helps analyze
  relationships between variables in a
  geographical context.
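
One common relationship derived from latitude and longitude is the distance between two points. The sketch below uses the haversine formula, which assumes a spherical Earth (mean radius of about 6371 km), so the result is an approximation. The function name `haversine_km` is chosen here for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; the spherical model is an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of latitude corresponds to roughly 111 km.
print(round(haversine_km(0.0, 0.0, 1.0, 0.0), 1))
```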
Understanding Data
• Data sources
• 5. Social Network Data: Social network data
  pertains to interactions and connections among
  individuals or entities within a network.
• Examples:
  • Friendships on social media platforms.
  • Collaboration networks among researchers.
  • Communication patterns in organizations.
• Use Case: Social network data aids in
  understanding social dynamics, influence, and
  information flow.
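
Social network data is naturally represented as a graph. A minimal sketch, assuming a made-up list of friendships: build an adjacency structure and count each person's connections (their degree), a very simple proxy for influence.

```python
from collections import defaultdict

# Hypothetical undirected friendships (edges of the network).
friendships = [("Ana", "Ben"), ("Ana", "Chen"), ("Ben", "Chen"), ("Chen", "Dawit")]

# Adjacency sets: each person maps to the set of their friends.
neighbours = defaultdict(set)
for a, b in friendships:
    neighbours[a].add(b)
    neighbours[b].add(a)

# Degree = number of direct connections.
degree = {person: len(friends) for person, friends in neighbours.items()}
most_connected = max(degree, key=degree.get)

print(degree, most_connected)
```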
Reading assignment
• Review the open data initiatives all over the world
  • What are those initiatives?
  • What is the benefit of these initiatives?
  • What is the relationship between open data initiatives and data
    science?
• Compare and contrast the different data science definitions
  • Come up with the best definition
• Compare data science with statistics, database systems, and
  machine learning
  • Is data science a new discipline as compared to the above
    fields?
  • Will there be an identity confusion problem in DS?

Data Science Discipline
Knowledge Areas

Identified Data Science Skills/Experience Groups
• A data scientist is a practitioner who has sufficient knowledge in the overlapping
  regimes of business needs, domain knowledge, analytical skills, and software and
  systems engineering to manage the end-to-end data processes in the data life cycle.

Big Data Tools and Programming Languages
• Big Data Analytics platforms
• Math & Stats tools
• Databases
• Data/applications visualization
• Data Management and Curation

Identified Data Science Skills/Experience Groups
• Group 1: Skills/experience related to competences
  • Data Analytics and Machine Learning
  • Data Management/Curation (including both general data management and scientific data
    management)
  • Data Science Engineering (hardware and software) skills
  • Scientific/Research Methods
  • Application/subject domain related (research or business)
  • Mathematics and Statistics
• Group 2: Big Data (Data Science) tools and platforms
  • Big Data Analytics platforms
  • Math & Stats apps & tools
  • Databases (SQL and NoSQL)
  • Data Management and Curation platform
  • Data and applications visualisation
  • Cloud based platforms and tools
• Group 3: Programming and programming languages and IDE
  • General and specialized development platforms for data analysis and statistics
• Group 4: Soft skills or Social Intelligence
  • Personal, inter-personal communication, team work (also called social intelligence or soft
    skills)
The roles and responsibilities of Data
Scientists, Data Engineers, and the
dynamics of Data Science Teams:

Data Scientists
• Role:
  • Data scientists are analytical experts who extract valuable insights
    from data.
  • They bridge the gap between raw data and actionable
    business decisions.
• Responsibilities:
  • Data Collection and Cleaning
  • Exploratory Data Analysis (EDA)
  • Feature Selection and Model Building
  • Communication
• Skills: technical, analytical, and communication skills.
• Impact: They drive data-driven decision-making within
  organizations.

Data Engineers
• Role: Data engineers build and maintain the
  infrastructure that data scientists use for data
  collection, storage, and processing.
• Responsibilities:
  • Data Pipelines
  • Database Management
  • Data Transformation
• Skills: database management, programming,
  and system architecture.
• Impact: They enable efficient data flow and
  accessibility for data scientists.
Data Science Teams
• Custom-Built and Diverse: Data science
  teams vary based on organizational needs.
  They can supplement different business units
  and operate within specific analytical domains.
• Principles for Success:
  • Experiment
  • Democratize Data
  • Measure Impact
• Talent: data scientists and data engineers

Application of data science (Societal Problems Addressed)

HealthCare
• Medical Image Analysis
  • Procedures such as detecting tumors, artery
    stenosis, organ delineation, lung texture
    classification
HealthCare
• Genetics & Genomics
  • Understand the impact of DNA on our health
    and find individual biological connections
    between genetics, diseases, and drug response.
  • Data science techniques allow integration of
    different kinds of data with genomic data in
    disease research, which provides a deeper
    understanding of genetic issues in reactions to
    particular drugs and diseases.
Retail
• The customer is savvy, impatient and busy.
• They want instant gratification and excellent customer service.
• In order to compete and stay one step ahead, retailers need to have a
  360-degree view of the customer.
• Retail analytics helps businesses get deep insights into customer
  behavior.
• It helps them understand their customers’ requirements more
  precisely, while also helping them to bring in more of the right kind of
  customers.

• It gives insights such as:
  • How to increase margins at a product level?
  • Insights into your customer profile that help answer questions like who they are and why
    they make certain purchases
  • Identify items that are likely to be purchased together (market basket analysis)
  • Which marketing strategies work better than others?
  • ROI of marketing spend
  • Optimal pricing
  • What promotions and offers to employ in each store?
  • Store-wise product mix
  • Personalized offers
  • Efficient stock strategy
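
Identifying items that are likely to be purchased together can be sketched with simple co-occurrence counts. The snippet below computes two standard market basket measures, support and confidence, over a handful of made-up baskets. It is a toy illustration; practical systems typically use dedicated algorithms such as Apriori over much larger transaction sets.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"beer", "chips", "salsa"},
    {"beer", "chips"},
    {"bread", "butter"},
    {"beer", "bread", "chips"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    # Sort so that each unordered pair is always stored under the same key.
    pair_counts.update(combinations(sorted(basket), 2))


def support(item_a, item_b):
    """Fraction of all baskets that contain both items."""
    return pair_counts[tuple(sorted((item_a, item_b)))] / len(baskets)


def confidence(antecedent, consequent):
    """Fraction of baskets with `antecedent` that also contain `consequent`."""
    return pair_counts[tuple(sorted((antecedent, consequent)))] / item_counts[antecedent]


print(support("beer", "chips"), confidence("beer", "chips"), confidence("bread", "beer"))
```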
E-commerce
• Businesses can collect a wealth of information about their site, their visitors
  and where they came from, and use it to find new customers and increase
  conversions.
• E-commerce businesses primarily use analytics to understand:
  • Acquisition: how your visitors and customers found and arrived at your
    site.
  • Shopping and purchasing behavior: how users engage with your
    website, which products they view, which ones they add or remove from
    shopping carts; along with initiating, abandoning, and completing
    transactions.
  • Economic performance: how many products the average transaction
    includes, the average order value, refunds you had to issue.
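
The economic performance measures above reduce to simple arithmetic over order records. A minimal sketch, assuming invented orders and an invented visitor count:

```python
# Hypothetical completed orders and site traffic for one period.
orders = [
    {"order_id": 1, "items": 3, "value": 60.0, "refunded": False},
    {"order_id": 2, "items": 1, "value": 20.0, "refunded": True},
    {"order_id": 3, "items": 2, "value": 40.0, "refunded": False},
]
visitors = 150  # assumed number of site visitors in the same period

# Products per average transaction.
average_items_per_order = sum(o["items"] for o in orders) / len(orders)

# Average order value.
average_order_value = sum(o["value"] for o in orders) / len(orders)

# Share of orders that had to be refunded.
refund_rate = sum(o["refunded"] for o in orders) / len(orders)

# Conversion rate: orders per visitor.
conversion_rate = len(orders) / visitors

print(average_items_per_order, average_order_value, refund_rate, conversion_rate)
```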

Finance
• The global financial analytics market is one of the fastest growing sectors of
  the data industry.
• Organizations big and small are investing in financial analytics tools and
  technologies to solve specific business problems, reduce costs, improve
  budgets and get insights into future financial scenarios.
• Typically, financial analytics includes:
  • Risk analysis
  • Working capital management
  • Fraud detection and prevention
  • Shareholder metric analysis

Healthcare, Education, Telecom, etc.
• Analytics can be used for evidence-based medical care, improved patient
  care, predicting outbreaks of diseases and reducing hospital operating costs.
• Analytics is also being used to improve teaching practices. It enables
  teachers to better monitor student progress, personalize learning and
  improve educational institutions' operational efficiency.
• In the telecom industry, analytics is fast gaining ground. Operators are
  using analytics to drive revenue, reduce churn and improve network
  performance.

Marketing
• Understanding customers and how to find more people like them is the key
  to sustainable growth.
• Analytics can not only help companies do this, but it can also add value to other
  marketing functions by gathering data across all marketing channels
  and consolidating it into a common marketing view.
• It helps measure, manage and analyze marketing performance to maximize
  its effectiveness and optimize return on investment (ROI). Typical questions:
  • How are our marketing initiatives performing today?
  • Which of them are viable in the long run?
  • How can we improve those which are not effective?
  • How do our marketing activities compare with our competitors’?
  • What can we learn from our competition?
  • Are our marketing resources properly allocated?
  • Are we using the right channels?

Sales
• Though sales analytics can help identify, model, understand and predict
  sales trends and outcomes, very few companies realize its potential
  to aid sales management.
• However, the potential is huge, and over the next several years sales
  analytics will be one of the most important domains for Data Analytics and
  Big Data.
• What sales analytics can essentially do is:
  • See what goods and services have and have not sold well
  • Determine optimal inventory
  • Measure the effectiveness of the sales force and determine optimal sales force
    size
  • Sales incentive cost analysis
  • Competitor sales analysis

Supply chain management
• Supply chain analytics helps monetize and optimize:
  • Current inventory status
  • Forecasts
  • Demand planning
  • Sourcing
  • Production
  • Improved worker productivity measurement
  • Transportation routing

Human Resource
• HR analytics helps managers by creating a single view of all relevant
  workforce and other HR related data.
• These insights can be used to make business decisions that drive business
  processes and initiatives and improve profitability.
• Key areas where workforce related data driven analytics can be used are:
  • Talent acquisition and retention
  • Attrition
  • Headcount Management and Workforce Optimization
  • Optimization of Compensation and Benefits
  • Build Leadership
  • Performance and Career Management
  • Training and Development

What is Data Analytics?
• Analytics can range from a simple exploration into how
many sales of a particular product were made last year to
a complex neural network model predicting which
customers to target for next year’s marketing campaign.
• Analytics is the extensive use of data, statistical and quantitative
analysis, exploratory and predictive models, and fact-based
management to drive decisions and actions.
• A key to deriving value from data is the use of analytics.

What is Data Analytics?
• Three ways of defining Analytics
  – Converting data into insights and
    intelligence, delivered when and where
    they are needed to help companies make
    better strategic and operational decisions
  – Getting data out
  – The use of “rocket science” algorithms
    (e.g., machine learning, neural networks)
    to analyze data

Three Kinds of Analytics
• A key to deriving value from big data is the use of analytics
• Three kinds of analytics
  • Descriptive analytics: look backward and reveal what has occurred
  • Predictive analytics: suggest what will occur in the future
  • Exploratory or discovery analytics: finding relationships in big data that were not previously
    known
  • Prescriptive analytics: identify optimal solutions
• what happened – what will happen – how can we make it happen

Example of Descriptive analytics
• A sales report for Dashen Beer. This report shows:
  • how many units of beer were sold
  • where they were sold
  • at what price, and a lot of other things.
• All you are doing is slicing and dicing
  the data in different ways, looking at
  it from different angles, along
  different dimensions, etc.
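
"Slicing and dicing" amounts to aggregating the same records along different dimensions. The sketch below does this for a tiny, entirely made-up sales table; the regions, months, quantities and prices are illustrative values, not real sales figures.

```python
from collections import defaultdict

# Hypothetical sales records (illustrative values only).
sales = [
    {"region": "Gondar", "month": "Jan", "units": 120, "unit_price": 30.0},
    {"region": "Gondar", "month": "Feb", "units": 100, "unit_price": 30.0},
    {"region": "Addis Ababa", "month": "Jan", "units": 300, "unit_price": 32.0},
    {"region": "Addis Ababa", "month": "Feb", "units": 280, "unit_price": 32.0},
]


def total_units_by(dimension):
    """Sum units sold, grouped by the given column (e.g. 'region' or 'month')."""
    totals = defaultdict(int)
    for row in sales:
        totals[row[dimension]] += row["units"]
    return dict(totals)


# The same data viewed along two different dimensions.
units_by_region = total_units_by("region")
units_by_month = total_units_by("month")
total_revenue = sum(row["units"] * row["unit_price"] for row in sales)

print(units_by_region, units_by_month, total_revenue)
```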
Example of Predictive analytics
• How many beers will be sold next week?
• Predictive analytics works by identifying patterns in
  historical data and then using statistics to make inferences
  about the future.
• We try to fit the data to a certain pattern, and if we believe
  that the data is following that pattern then we can
  predict what will happen in the future.
• Retailers are very interested in understanding relationships between
  products. They want to know whether a person who buys product A is also
  likely to buy product B or C.
• Let us take another example from the telecom industry:
  operators would love to be able to predict which of their
  customers are likely to leave their service in the future.
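
As a minimal sketch of fitting historical data to a pattern and extrapolating, the snippet below fits a straight line to five weeks of made-up sales using ordinary least squares and forecasts week 6. A linear trend is only one possible pattern and is assumed here for simplicity; real forecasts would also need to account for seasonality and uncertainty.

```python
# Hypothetical weekly sales history (units sold in weeks 1 to 5).
weeks = [1, 2, 3, 4, 5]
units_sold = [100, 112, 119, 131, 138]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(weeks, units_sold)

# Extrapolate the fitted pattern one week into the future.
forecast_week_6 = intercept + slope * 6

print(slope, intercept, forecast_week_6)
```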

Example of Prescriptive analytics
• Prescriptive analytics goes beyond predictive
  analytics by not only predicting what will
  happen but also suggesting the most optimal
  decisions given what could happen and
  showing what will happen under different
  scenarios.
• It includes concepts like optimization and
  simulation.
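
A small sketch of the idea: evaluate each candidate decision under a predictive model, then recommend the best one. The demand curve, unit cost, and candidate prices below are all invented assumptions; in practice the demand model would itself come from predictive analytics.

```python
def expected_demand(price):
    """Hypothetical demand model: units sold fall linearly as price rises."""
    return max(0.0, 1000 - 20 * price)


def profit(price, unit_cost=10.0):
    """Profit for a given price under the assumed demand model and unit cost."""
    return (price - unit_cost) * expected_demand(price)


# Simulate each scenario (candidate price) and record its outcome.
candidate_prices = [15, 20, 25, 30, 35, 40]
scenarios = {price: profit(price) for price in candidate_prices}

# Optimisation step: recommend the decision with the best simulated outcome.
best_price = max(scenarios, key=scenarios.get)

print(scenarios, best_price)
```

The `scenarios` dictionary plays the role of "what will happen under different scenarios", while `best_price` is the recommended decision.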
