CHAPTER TWO
Introduction to Data Science
                                      Objectives
After completing this chapter, the students will be able to:
              Describe what data science is and the role of data scientists.
               Differentiate data and information.
               Describe the data processing life cycle.
               Understand different data types from diverse perspectives.
               Describe the data value chain in the emerging era of big data.
               Understand the basics of Big Data.
               Describe the purpose of the Hadoop ecosystem components.
                                          An Overview of Data Science
Data Science is a multi-disciplinary field that uses
scientific methods, processes, algorithms, and systems
to extract knowledge and insights from structured,
semi-structured and unstructured data.
 Data science is much more than simply analyzing data.
It offers a range of roles and requires a range of skills.
                              Overview of Data Science …
• Example:
   – Consider data involved in buying a box of cereal from the store or supermarket:
   – Your data here is the planned purchase, written down somewhere (say, in a notebook).
   – When you get to the store, you use that piece of data to remind yourself about
     what you need to buy and pick it up and put it in your cart.
   – At checkout, the cashier scans the barcode on your box and the cash register
     logs the price.
   – Back in the warehouse, a computer informs the stock manager that it is time to
     order this item from the distributor, because your purchase took the last box in
     the store.
   – You may have a coupon for your purchase and the cashier scans that too, giving
     you a predetermined discount.
                                 Overview of Data Science …
• Example:
   – At the end of the week, a report of all the scanned manufacturer coupons gets
     uploaded to the cereal company so they can issue a reimbursement to the grocery
     store for all of the coupon discounts they have handed out to customers.
   – Finally, at the end of the month, a store manager looks at a colorful collection of pie
     charts showing all the different kinds of cereal that were sold and, on the basis of
     strong sales of cereals, decides to offer more varieties of these on the store’s limited
     shelf space next month.
   – So, the small piece of data written in your notebook ended up in
     many different places:
        • Notably on the desk of a manager as an aid to decision making.
        • The data went through many transformations.
                                     Overview of Data Science …
• Example …
       • In addition to the computers where the data might have stopped by or stayed on for the long
         term, lots of other pieces of hardware—such as the barcode scanner—were involved in
         collecting, manipulating, transmitting, and storing the data.
       • In addition, many different pieces of software were used to organize, aggregate, visualize, and
         present the data.
       • Finally, many different human systems were involved in working with the data.
       • People decided which systems to buy and install, who should get access to what kinds of data,
         and what would happen to the data after its immediate purpose was fulfilled.
   – Data science has evolved into one of the most promising and in-demand career paths.
   – Professionals use advanced techniques for analyzing large volumes of data.
                   Overview of Data Science …
• Skills important for data science:
   – Statistics
   – Linear algebra
   – Programming knowledge with focus on data
     warehousing, data mining, and data modeling
                                      Data VS Information
• Data: a representation of facts, concepts, or instructions in a
  formalized manner, which should be suitable for communication,
  interpretation, or processing, by human or electronic machines.
• It can be described as unprocessed facts and figures.
• It is represented as groups of non-random symbols in the form of text,
  images, voice, and video, representing quantities, actions, and objects.
• Information is the processed/interpreted data on which decisions
  and actions are based.
• It is interpreted data; created from organized, structured, and
  processed data in a particular context.
Data VS Information…
                                                 Data vs. Information Examples Chart
• Seeing examples of data and information side-by-side in a chart can
  help you better understand the differences between the two terms.
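• For example (a typical illustration): the raw readings 38.0, 38.5, and 39.1 logged
  by a thermometer are data; the statement "the patient's temperature is rising" is
  information, because the readings have been interpreted in context to support a decision.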
                                 Data Processing Cycle
• Data processing is the re-structuring or re-ordering of data by
  people or machines to increase its usefulness and add value
  for a particular purpose.
• Data processing consists of the following basic steps - input,
  processing, and output. These three steps constitute the data
  processing cycle.
                         Data Processing Cycle…
• Input − input data is prepared in some convenient
  form for processing.
• The form will depend on the processing machine. For
  example, when electronic computers are used, the
  input data can be recorded on any one of several
  types of input media, such as magnetic disks,
  tapes, and so on.
                                Data Processing Cycle…
• Processing - input data is changed to produce data in a more
  useful form.
   – For example, pay-checks can be calculated from the time
     cards, or a summary of sales for the month can be calculated
     from the sales orders.
• Output − the result of the preceding processing step is collected.
   – The particular form of the output data depends on the use of
     the data. For example, output data may be pay-checks for
     employees.
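• To make the cycle concrete, here is a minimal sketch in Python of the three steps,
  using the pay-check example above with hypothetical time-card data:

# Input: time cards prepared in a convenient form (here, a list of records).
time_cards = [
    {"employee": "Abebe", "hours": 40, "rate": 12.5},
    {"employee": "Sara",  "hours": 35, "rate": 15.0},
]

# Processing: the input data is changed into a more useful form (gross pay).
pay_checks = [
    {"employee": c["employee"], "pay": c["hours"] * c["rate"]}
    for c in time_cards
]

# Output: the result of the processing step is collected and presented.
for check in pay_checks:
    print(f'{check["employee"]}: {check["pay"]:.2f}')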
                Data types and their representation
• Data types can be described from diverse perspectives.
1. Computer science and programming perspective:
   – A data type is an attribute of data that tells the compiler or interpreter how the
     programmer intends to use the data.
   – Almost all programming languages explicitly include the notion of data type, though
     different languages may use different terminology.
   – Common data types include:
        • Integers: store whole numbers.
        • Booleans: store one of two values: true or false.
        • Characters: store a single character (numeric, alphabetic, symbol, …).
        • Floating-point numbers: store real numbers.
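• As a short illustration, the common data types above can be written in Python as
  follows (Python has no separate character type, so a one-character string stands in):

quantity = 3          # integer: a whole number
in_stock = True       # Boolean: one of the two values true or false
grade = "A"           # character: a single alphabetic symbol
price = 49.99         # floating-point number: a real number

# The declared type tells the interpreter how the value may be used.
print(type(quantity), type(in_stock), type(grade), type(price))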
         Data types and their representation …
• A data type:
   – Constrains the values that an expression (such as a variable or a
     function) might take.
   – Defines the operations that can be performed on the data, the
     meaning of the data, and the way values of that data type can be
     stored/represented.
2. Data types from Data Analytics perspective
• From a data analytics point of view, there are three common types of data
  or data structures:
   – Structured, semi-structured, and unstructured data.
   – The following slides describe these three data types, as well as metadata.
             Data types and their representation …
                            Data types from a data analytics perspective
•   Structured Data: is data that adheres to a pre-defined data model and is therefore
    straightforward to analyze.
•   Structured data conforms to a tabular format with a relationship between the different rows
    and columns.
•   Common examples of structured data are Excel files or SQL databases.
•   Each of these has structured rows and columns that can be sorted.
•   Structured data is considered the most ‘traditional’ form of data storage, since the earliest
    versions of database management systems (DBMS) were able to store, process and access
    structured data.
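• A minimal sketch of structured data in Python: every row follows the same
  pre-defined columns, so the table is straightforward to sort and analyze
  (the sales figures below are hypothetical):

columns = ("product", "units_sold", "price")
rows = [
    ("Corn Flakes", 120, 4.50),
    ("Muesli",       75, 6.20),
    ("Oat Rings",   200, 3.80),
]

# Because the structure is fixed, operations such as sorting by a column are easy.
for product, units, price in sorted(rows, key=lambda r: r[1], reverse=True):
    print(product, units, price)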
                 Data types and their representation …
•   Semi-structured Data: is a form of structured data that does not conform to the formal structure of
    data models associated with relational databases or other forms of data tables.
•   However, it contains tags or other markers to separate semantic elements and enforce hierarchies of
    records and fields within the data.
•   Therefore, it is also known as a self-describing structure.
•   JSON and XML are common forms of semi-structured data (see the short sketch after this list).
•   Unstructured Data: is information that either does not have a predefined data model or is not
    organized in a pre-defined manner.
•   Unstructured information is typically text-heavy but may contain data such as dates, numbers, and
    facts as well.
•   This results in irregularities and ambiguities that make it difficult to understand using traditional
    programs as compared to data stored in structured databases.
•   Common examples of unstructured data include audio files, video files, and NoSQL databases.
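• A short sketch of semi-structured data in Python: the hypothetical JSON document
  below carries its own tags (field names) and nesting, which is why such data is
  called self-describing:

import json

document = """
{
  "order_id": 1001,
  "customer": {"name": "Sara", "city": "Addis Ababa"},
  "items": [
    {"product": "Corn Flakes", "qty": 2},
    {"product": "Muesli", "qty": 1}
  ]
}
"""

order = json.loads(document)      # the tags let a program navigate the hierarchy
print(order["customer"]["name"])  # -> Sara
print(len(order["items"]))        # -> 2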
              Data types and their representation …
• Metadata – Data about Data: The last category of data type is metadata.
• From a technical point of view, this is not a separate data structure, but it is one
  of the most important elements for Big Data analysis and big data solutions.
• Metadata is data about data. It provides additional information about a specific
  set of data.
• Example: In a set of photographs, metadata could describe when and where the
  photos were taken.
• The metadata then provides fields for dates and locations which, by themselves,
  can be considered structured data.
• Because of this reason, metadata is frequently used by Big Data solutions for
  initial analysis.
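• A small sketch of metadata in Python, with hypothetical values: the photo file
  itself is unstructured, but its metadata fields are structured and can be analyzed
  directly, which is why big data solutions often start there:

photo_metadata = {
    "file": "IMG_0042.jpg",
    "taken_on": "2023-05-14",
    "location": "Bahir Dar",
    "camera": "Phone-Camera-12MP",
}

# An initial analysis can filter or group photos by these structured fields.
if photo_metadata["taken_on"].startswith("2023"):
    print("Photo taken in 2023 at", photo_metadata["location"])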
                                                            Data value Chain
•   The Data Value Chain is introduced to describe the information flow within a big
    data system as a series of steps needed to generate value and useful insights from
    data.
•   The Big Data Value Chain identifies the following key high-level activities:
    Data Acquisition, Data Analysis, Data Curation, Data Storage, and Data Usage.
                                              Data value Chain …
•   Data Acquisition: is the process of gathering, filtering, and cleaning data
    before it is put in a data warehouse or any other storage solution on which
    data analysis can be carried out.
•   Data acquisition is one of the major big data challenges in terms of
    infrastructure requirements.
•   The infrastructure required to support the acquisition of big data must:
     – deliver low, predictable latency in both capturing data and in executing
        queries;
     – be able to handle very high transaction volumes, often in a distributed
        environment; and
     – support flexible and dynamic data structures.
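• A toy sketch of the acquisition step in Python: gather raw records, filter out the
  ones that are not needed, and clean the rest before they reach storage (the incoming
  sensor records and field names are hypothetical):

raw_records = [
    {"sensor": "s1", "reading": "23.5"},
    {"sensor": "s2", "reading": ""},        # missing value -> filtered out
    {"sensor": "s1", "reading": " 24.1 "},  # untidy value  -> cleaned
]

def acquire(records):
    cleaned = []
    for rec in records:
        value = rec["reading"].strip()
        if not value:                        # filtering step
            continue
        cleaned.append({"sensor": rec["sensor"], "reading": float(value)})
    return cleaned

storage_ready = acquire(raw_records)         # ready for a warehouse or other store
print(storage_ready)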
                                             Data value Chain …
• Data Analysis: is concerned with making the raw data acquired amenable to
  use in decision-making as well as domain-specific usage.
• Data analysis involves exploring, transforming, and modelling data with the
  goal of highlighting relevant data, synthesizing and extracting useful hidden
  information with high potential from a business point of view.
• Related areas include data mining, business intelligence, and machine
  learning.
• Data Curation: is the active management of data over its life cycle to ensure
  it meets the necessary data quality requirements for its effective usage.
• Data curation processes can be categorized into different activities such as
  content creation, selection, classification, transformation, validation, and
  preservation.
                                               Data value Chain …
•   Data curation is performed by expert curators who are responsible for
    improving the accessibility and quality of data.
•   Data curators (also known as scientific curators, or data annotators) hold the
    responsibility of ensuring that data are trustworthy, discoverable, accessible,
    reusable, and fit their purpose.
•   A key trend for the curation of big data utilizes community and crowd
    sourcing approaches.
•   Data Storage: is the persistence and management of data in a scalable way
    that satisfies the needs of applications that require fast access to the data.
•   Relational Database Management Systems (RDBMS) have been the main, and
    almost unique, solution to the storage paradigm for nearly 40 years.
                                           Data value Chain …
• However, the ACID (Atomicity, Consistency, Isolation, and Durability) properties
  that guarantee database transactions lack flexibility with regard to schema
  changes, and their performance and fault tolerance degrade as data volumes and
  complexity grow, making them unsuitable for many big data scenarios.
• NoSQL technologies have been designed with the scalability goal in mind and
  present a wide range of solutions based on alternative data models.
• Data Usage: covers the data-driven business activities that need access to data,
  its analysis, and the tools needed to integrate the data analysis within the
  business activity.
• Data usage in business decision-making can enhance competitiveness through
  reduction of costs, increased added value, or any other parameter that can be
  measured against existing performance criteria.
                                        Big Data: Definition
• Big data is a blanket term for the non-traditional strategies and
  technologies needed to gather, organize, process, and gain insights
  from large datasets.
• While the problem of working with data that exceeds the computing
  power or storage of a single computer is not new, the pervasiveness,
  scale, and value of this type of computing have greatly expanded in
  recent years.
• What Is Big Data?
• Big data is the term for a collection of data sets so large and complex
  that it becomes difficult to process using on-hand database
  management tools or traditional data processing applications.
                 Big Data Characteristics – The 4Vs
• Big data differs from traditional data in the following ways:
• Volume: large amounts of data (zettabytes; massive datasets),
  orders of magnitude larger than traditional datasets.
• Velocity: Data is live streaming or in motion. The speed that
  data moves through the system. Data is frequently flowing into
  the system from multiple sources and is often processed in
  real-time.
• Variety: data comes in many different forms and levels of quality,
  and from diverse sources (social media, server logs, sensors, …).
• Veracity: can we trust the data? How accurate is it? etc.
• Let’s look at our smartphones: nowadays a smartphone generates
  a lot of data in the form of texts, phone calls, emails, photos,
  videos, searches, and music.
• Approximately 40 exabytes of data are generated every month by
  a single smartphone user; now consider how much data will be
  generated by 5 billion smartphone users.
• That is mind-blowing; in fact, this amount of data is far too much
  for traditional computing systems to handle. This massive amount
  of data is what we call big data.
• Now let’s have a look at the data generated per
  minute on the internet:
• 2.1M snaps are shared on Snapchat,
• 3.8M search queries are made on Google,
• 1M people log in to Facebook,
• 4.5M videos are watched on YouTube, and
• 188M emails are sent.
        Big Data Solutions: Clustered Computing
• Individual computers are often inadequate for handling big data
  at most stages.
• Clustered computing is used to better address the high storage
  and computational needs of big data.
• Clustered computing is a form of computing in which a group of
  computers (often called nodes) are connected through a LAN
  (local area network) so that they behave like a single machine.
• The set of computers is called a cluster.
• The resources of these computers are pooled so that the cluster
  appears as a single computer that is more powerful than any of
  the individual machines.
       Big Data Solutions: Clustered Computing …
• Big data clustering software combines the resources of many smaller
  machines, seeking to provide a number of benefits:
   – Resource Pooling: Combining the available storage space, CPU and memory is
     extremely important.
   – Processing large datasets requires large amounts of all three of these
     resources.
   – High Availability: Clusters provide varying levels of fault tolerance and
     availability guarantees to prevent hardware or software failures from affecting
     access to data and processing.
   – Increasingly important for real-time analytics of big data.
   – Easy Scalability: Clusters make it easy to scale horizontally by adding more
     machines to the group. The system can react to changes in resource
     requirements without expanding the physical resources on a machine.
   Big Data Solutions: Clustered Computing …
• Using clusters requires a solution for managing cluster membership,
  coordinating resource sharing, and scheduling actual work on
  individual nodes.
• Cluster membership and resource allocation can be handled by software
  such as Hadoop’s YARN (which stands for Yet Another Resource Negotiator).
• The assembled computing cluster often acts as a foundation that other
  software interfaces with to process the data.
• The machines involved in the computing cluster are also typically
  involved with the management of a distributed storage system, which we
  will talk about when we discuss data persistence.
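• As a toy stand-in for what cluster software coordinates at far larger scale, the
  Python sketch below splits work across several local worker processes and combines
  the partial results; a real cluster does this across machines, with YARN (or a
  similar manager) handling membership and scheduling:

from multiprocessing import Pool

def count_words(chunk):
    return len(chunk.split())

if __name__ == "__main__":
    chunks = ["big data needs clusters",
              "clusters pool storage cpu and memory",
              "work is scheduled across nodes"]
    with Pool(processes=3) as pool:
        partial_counts = pool.map(count_words, chunks)   # work done per "node"
    print(sum(partial_counts))                           # combined result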
                                 Big Data Solutions: Hadoop
•   Hadoop is an open-source framework intended to make interaction with big data
    easier.
•   It is a framework that allows for the distributed processing of large datasets across
    clusters of computers using simple programming models.
•   The four key characteristics of Hadoop are:
•   Economical: Its systems are highly economical as ordinary computers can be used
    for data processing.
•   Reliable: It is reliable as it stores copies of the data on different machines and is
    resistant to hardware failure.
•   Scalable: It is easily scalable, both horizontally and vertically.
•   Flexible: It is flexible and you can store as much structured and unstructured data as
    you need.
          Big Data Solutions: Hadoop Ecosystem
• Hadoop Ecosystem is a platform or a suite which provides various
  services to solve the big data problems.
• Hadoop has an ecosystem that has evolved from its four core
  components: data management, access, processing, and storage.
• It is continuously growing to meet the needs of Big Data.
• It comprises the following components and many others:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming based Data Processing
• Spark: In-Memory data processing
   Big Data Solutions: Hadoop Ecosystem …
• PIG, HIVE: Query-based processing of data services
• HBase: NoSQL Database
• Mahout, Spark MLlib: Machine learning algorithm
  libraries
• Solr, Lucene: Searching and indexing
• ZooKeeper: Managing the cluster
• Oozie: Job Scheduling
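• To give a feel for the MapReduce component listed above, here is a minimal
  word-count sketch of the idea in plain Python (not the actual Hadoop MapReduce
  API): the map phase emits (word, 1) pairs and the reduce phase sums the counts
  per word:

from collections import defaultdict

lines = ["big data tools", "big clusters", "data data everywhere"]

# Map phase: each input line produces intermediate key-value pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle + reduce phase: group the pairs by key and aggregate the values.
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one

print(dict(counts))   # e.g. {'big': 2, 'data': 3, ...}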
                        Big data life cycle with Hadoop
1. Ingesting data into the system
   – The first stage of Big Data processing is to Ingest data into the system.
   – The data is ingested or transferred to Hadoop from various sources such as
     relational databases, systems, or local files.
   – Sqoop transfers data from RDBMS to HDFS, whereas Flume transfers event
     data.
2. Processing the data in storage.
   – The second stage is Processing.
   – In this stage, the data is stored and processed.
   – The data is stored in the distributed file system, HDFS, and in the NoSQL
     distributed database, HBase.
                        Big data life cycle with Hadoop
3. Computing and analyzing data
   – The third stage is to Analyze Data
   – Here, the data is analyzed by processing frameworks such as Pig,
     Hive, and Impala.
   – Pig converts the data using map and reduce operations and then analyzes it.
   – Hive is also based on the map and reduce programming model and is most
     suitable for structured data.
4. Visualizing the results
   – The fourth stage is access, which is performed by tools such as
     Sqoop, Hive, Hue and Cloudera Search.
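• As a rough illustration of the analysis performed in stage 3, the Python sketch
  below does the kind of group-and-count that a Hive or Pig job would express
  declaratively over data in HDFS (the sales records are hypothetical):

from collections import Counter

sales = [
    {"product": "Corn Flakes"},
    {"product": "Muesli"},
    {"product": "Corn Flakes"},
]

per_product = Counter(rec["product"] for rec in sales)
print(per_product.most_common())   # analysis result, ready for visualization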
Thank You For Watching
          Yonatantesfaye30@gmail.com