By Tatek, Mar. 2025
Introduction
• Data science is a multi-disciplinary field that uses scientific
methods, processes, algorithms, and systems to extract
knowledge and insights from structured, semi-structured
and unstructured data.
After completing this chapter, the students will be able to:
Describe what data science is and the role of data scientists.
Differentiate data and information.
Describe the data processing life cycle.
Understand different data types from diverse perspectives.
Describe the data value chain in the emerging era of big data.
Understand the basics of Big Data.
Describe the purpose of the Hadoop ecosystem components.
1. What are data and information?
• Data is a representation of facts, concepts, or
instructions in a formalized manner, which
should be suitable for communication,
interpretation, or processing by humans or
electronic machines.
• It can be described as unprocessed facts and
figures.
• It is represented with the help of characters such
as:
alphabets (A-Z, a-z),
digits (0-9) or
special characters (+, -, /, *, <, >, =, etc.).
• Information is the processed data on which
decisions and actions are based.
• It is data that has been processed into a form that is
meaningful to the recipient.
• Furthermore, information is interpreted data,
created from organized, structured, and processed
data in a particular context.
2. Data Processing Cycle
• Data processing is the re-structuring or re-ordering
of data by people or machines to increase its
usefulness and add value for a particular purpose.
• Data processing consists of three basic steps:
Input
Processing
Output
• These three steps constitute the data processing
cycle.
Figure: The data processing cycle (Input → Processing → Output)
a) Input − in this step, the input data is prepared in
some convenient form for processing.
• The form will depend on the processing machine.
• For example:
when computers are used, the input data can be
recorded on any of several types of storage
media, such as a hard disk, CD, flash disk, and so
on.
b) Processing − in this step, the input data is changed
to produce data in a more useful form.
• For example:
Interest can be calculated on a deposit to a bank,
A summary of sales for the month can be calculated
from the sales orders.
c) Output − at this stage, the result of the processed
data is collected.
• The particular form of the output data depends on
the use of the data.
• For example:
Output data may be the payroll for employees.
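• To make these three steps concrete, the following is a minimal Python sketch; the deposit amount and interest rate are made-up values used only for illustration:

# Input: raw figures, as they might be read from a storage medium.
deposit = 10_000.00       # hypothetical deposit amount
annual_rate = 0.07        # hypothetical annual interest rate (7%)

# Processing: transform the input into a more useful form.
interest = deposit * annual_rate
new_balance = deposit + interest

# Output: present the processed result in the form the user needs.
print(f"Interest earned: {interest:.2f}")
print(f"New balance: {new_balance:.2f}")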
Data types & their representation
• Data types can be described from diverse
perspectives.
• In computer science and computer programming,
for instance, a data type is simply an attribute of
data that tells the compiler or interpreter how the
programmer intends to use the data.
Data types from the computer programming perspective
• Common data types in computer programming
languages include:
Integers (int) - used to store whole numbers,
mathematically known as integers
Booleans (bool) - used to represent values restricted to
one of two states: true or false
Characters (char) - used to store a single character
Floating-point numbers (float) - used to store real
numbers
Alphanumeric strings (string) - used to store a
combination of characters and numbers
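• As a brief illustration, here is a small Python sketch of these data types (Python is used only as an example language; it has no separate char type, so a one-character string stands in for it):

age: int = 25                    # integer: a whole number
is_enrolled: bool = True         # boolean: true or false
grade: str = "A"                 # character: a single character (a string in Python)
gpa: float = 3.75                # floating-point: a real number
student_id: str = "DS-2025-014"  # alphanumeric string: characters and numbers

for value in (age, is_enrolled, grade, gpa, student_id):
    print(type(value).__name__, value)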
Data types from the data analytics perspective
• From a data analytics point of view, it is important
to understand that there are three common types
of data structures:
• Structured
• Semi-structured
• Unstructured data types
1. Structured Data
• Structured data is data that adheres to a pre-
defined data model and is therefore
straightforward to analyze.
• Structured data conforms to a tabular format with a
relationship between the different rows and
columns.
• Common examples of structured data are Excel files
or SQL databases.
• Each of these has structured rows and columns that
can be sorted.
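• As a short sketch (assuming Python with the pandas library, which is not covered in these slides), structured data can be loaded into a table of rows and columns and sorted directly:

import pandas as pd

# A small, hypothetical table of sales orders (structured data).
sales = pd.DataFrame({
    "order_id": [101, 102, 103],
    "customer": ["A", "B", "C"],
    "amount": [250.00, 120.50, 390.75],
})

# Rows and columns can be sorted and filtered because the structure is pre-defined.
print(sales.sort_values("amount", ascending=False))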
2. Semi-structured Data
• Semi-structured data is a form of structured data
that does not conform to the formal structure of
data models associated with relational databases or
other forms of data tables.
• Nonetheless, it contains tags or other markers to
separate semantic elements and enforce hierarchies
of records and fields within the data.
• Therefore, it is also known as a self-describing
structure.
• Examples of semi-structured data include e-mails,
JSON, HTML, and XML.
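• For illustration, here is a hypothetical JSON record parsed with Python's standard json module; the tags (field names) travel with the data, which is why the structure is called self-describing:

import json

# A hypothetical self-describing record: no relational schema is required.
record = '''
{
    "name": "Student A",
    "courses": ["Data Science", "Statistics"],
    "contact": {"email": "student@example.com"}
}
'''

data = json.loads(record)        # parse the tagged, hierarchical structure
print(data["contact"]["email"])  # fields are addressed by their tags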
3. Unstructured Data
• Unstructured data is information that either does not
have a predefined data model or is not organized in a
pre-defined manner.
• Unstructured information is typically text-heavy but
may contain data such as dates, numbers, and facts as
well.
• This results in irregularities and ambiguities that make
it difficult to understand using traditional programs as
compared to data stored in structured databases.
• Common examples of unstructured data include audio,
video, images, social media posts, e-mail bodies, PDFs,
NoSQL databases, …
4. Metadata (Data about Data)
• Metadata is a fourth category of data; technically it is
not a separate data structure, but it is one of the most
important elements for Big Data analysis and big data
solutions.
• Metadata is data about data. It provides additional
information about a specific set of data.
• In a set of photographs, for example, metadata
could describe when and where the photos were
taken.
• The metadata then provides fields for dates and
locations which, by themselves, can be considered
structured data.
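• As a tiny sketch (with hypothetical field names), the metadata of a photo can itself be handled as structured data:

# Metadata: data about the photo, not the image content itself.
photo_metadata = {
    "filename": "IMG_0421.jpg",
    "taken_on": "2025-03-14",    # when the photo was taken
    "location": "Addis Ababa",   # where the photo was taken
}

# Because the fields are regular, metadata can be queried like structured data.
print(photo_metadata["taken_on"], photo_metadata["location"])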
Data Value Chain
• The Data Value Chain describes the information
flow within a big data system as a series of steps
needed to generate value and useful insights from
data.
• The Big Data Value Chain identifies the following
key high-level activities:
Data Acquisition
Data Analysis
Data Curation
Data Storage
Data Usage
Table: Data Value Chain
1. Data Acquisition
• Data acquisition is the process of gathering,
filtering, and cleaning data before it is put in a data
warehouse or any other storage.
• Data acquisition is one of the major big data
challenges in terms of infrastructure requirements.
• The infrastructure required to support the
acquisition of big data:
Must deliver low, predictable latency in both capturing
data and in executing queries.
Be able to handle very high transaction volumes, often
in a distributed environment.
Support flexible and dynamic data structures.
2. Data Analysis
• Data Analysis is concerned with making the acquired
raw data amenable to use in decision-making as well
as in domain-specific applications.
• Data analysis involves exploring, transforming, and
modeling data with the goal of highlighting relevant
data, synthesizing and extracting useful hidden
information with high potential from a business
point of view.
• Related areas include data mining, business
intelligence, and machine learning.
3. Data Curation
• Data Curation is the active management of data over
its life cycle to ensure it meets the necessary data
quality requirements for its effective usage.
• Data curation processes can be categorized into
different activities such as:
Content creation
Selection
Classification
Transformation
Validation
Preservation
• Data curators (also known as scientific curators or
data annotators) hold the responsibility of ensuring
that data are trustworthy, discoverable, accessible,
reusable and fit their purpose.
4. Data Storage
• Data Storage is the persistence and management of
data in a scalable way that satisfies the needs of
applications that require fast access to the data.
• Relational Database Management Systems (RDBMS)
have been the main, and almost unique, solution to
the storage paradigm for nearly 40 years.
• However, the ACID (Atomicity, Consistency, Isolation,
and Durability) properties that guarantee database
transactions come at the cost of flexibility with regard
to schema changes, and of performance and fault
tolerance when data volumes and complexity grow,
making RDBMSs unsuitable for many big data scenarios.
• NoSQL technologies have been designed with the
scalability goal in mind and present a wide range of
solutions based on alternative data models.
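• As a rough sketch of such an alternative data model (plain Python dictionaries standing in for documents in a document-oriented NoSQL store), records do not have to share one fixed schema:

# Two hypothetical "documents": unlike relational rows, they need not have the same columns.
users = [
    {"id": 1, "name": "Abebe", "email": "abebe@example.com"},
    {"id": 2, "name": "Sara", "phones": ["+251-911-000000"], "vip": True},
]

# Each document is self-contained, so new fields can be added without a schema change.
for user in users:
    print(user["name"], "->", sorted(user.keys()))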
5. Data Usage
• Data Usage covers the data-driven business
activities that need access to data, its analysis, and
the tools needed to integrate the data analysis
within the business activity.
• Data usage in business decision-making can
enhance competitiveness through the reduction of
costs, increased added value, or any other
parameter that can be measured against existing
performance criteria.
Basic concepts of big data
• Big data is a blanket term for the non-traditional
strategies and technologies needed to gather,
organize, process, and gain insights from large
datasets.
• While the problem of working with data that
exceeds the computing power or storage of a single
computer is not new, the pervasiveness, scale, and
value of this type of computing have greatly
expanded in recent years.
What is big data?
• Big data is the term for a collection of data sets so
large and complex that it becomes difficult to
process using on-hand database management tools
or traditional data processing applications.
• In this context, a “large dataset” means a dataset
too large to process or store with traditional tooling
or on a single computer.
• This means that the common scale of big datasets is
constantly shifting and may vary significantly from
organization to organization.
• Big data is characterized by the 4 Vs and more:
Volume: large amounts of data (zettabytes / massive
datasets)
Velocity: data is live, streaming, or in motion
Variety: data comes in many different forms from
diverse sources
Veracity: can we trust the data? How accurate is it?
etc.
Clustered Computing and Hadoop Ecosystem
Clustered Computing
• Because of the quantities of big data,
individual computers are often
inadequate for handling the data at most
stages.
• To better address the high storage and
computational needs of big data,
computer clusters are a better fit.
• Big data clustering software combines the resources
of many smaller machines, seeking to provide a
number of benefits:
1. Resource Pooling
• Combining the available resources such as:
Storage space to hold data
CPU
Memory
• Processing large datasets requires large amounts of
all three of the resources.
2. High Availability
• Clusters can provide varying levels of fault tolerance
and availability guarantees to prevent hardware or
software failures from affecting access to data and
processing.
3. Easy Scalability
• Clusters make it easy to scale horizontally by
adding additional machines to the group.
• This means the system can react to changes
in resource requirements without expanding
the physical resources on a machine.
• Using clusters requires a solution for
managing cluster membership, coordinating
resource sharing, and scheduling actual
work on individual nodes.
• Cluster membership and resource allocation
is handled by software like Hadoop’s YARN
(Yet Another Resource Negotiator).
Hadoop and its Ecosystem
• Hadoop is an open-source framework
intended to make interaction with big
data easier.
• Hadoop is a framework that allows for
the distributed processing of large
datasets across clusters of computers
using simple programming models.
• It is inspired by a technical document
published by Google.
• The four key characteristics of Hadoop are:
• Economical: Its systems are highly
economical as ordinary computers can be
used for data processing.
• Reliable: It is reliable as it stores copies of
the data on different machines and is
resistant to hardware failure.
• Scalable: It is easily scalable, both
horizontally and vertically. A few extra nodes
help in scaling up the framework.
• Flexible: It is flexible, and you can store as
much structured and unstructured data as
you need and decide how to use it later.
• Hadoop has an ecosystem that has
evolved from its four core components:
Data management
Access
Processing
Storage
• Hadoop is continuously growing to meet the needs
of Big Data. It comprises the following components
and many others:
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: Programming-based data processing (see
the sketch after this list)
Spark: In-memory data processing
PIG, HIVE: Query-based processing of data services
HBase: NoSQL database
Mahout, Spark MLlib: Machine learning algorithm
libraries
Solr, Lucene: Searching and indexing
Zookeeper: Managing the cluster
Oozie: Job scheduling
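• To make the MapReduce model concrete, here is a minimal, self-contained Python sketch of the classic word-count job. It only simulates the map, shuffle, and reduce phases on a single machine; a real Hadoop job would run these functions in parallel across the cluster:

from collections import defaultdict

# Toy input: each element stands for one line of a distributed input file.
lines = ["big data needs big clusters", "hadoop processes big data"]

# Map phase: emit a (word, 1) pair for every word in every line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group the intermediate pairs by key (the word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # e.g. {'big': 3, 'data': 2, 'needs': 1, ...}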
Fig. Hadoop Ecosystem
• Big Data Life Cycle with Hadoop
1. Ingesting data into the system
• The first stage of Big Data processing is
Ingest.
• The data is ingested or transferred to
Hadoop from various sources such as
relational database systems or local
files.
• Sqoop transfers data from an RDBMS to
HDFS, whereas Flume transfers event
data.
2. Processing the data in storage
• The second stage is Processing.
• In this stage, the data is stored and
processed.
• The data is stored in the distributed file
system, HDFS, and the NoSQL distributed
database, HBase.
• Spark and MapReduce perform data
processing.
3. Computing and analyzing data
• The third stage is to Analyze.
• Here, the data is analyzed by processing
frameworks such as Pig, Hive, and
Impala.
• Pig converts the data using map and
reduce operations and then analyzes it.
• Hive is also based on map and reduce
programming and is most suitable for
structured data.
4. Visualizing the results
• The fourth stage is Access, which is
performed by tools such as Hue and
Cloudera Search.
• In this stage, the analyzed data can be
accessed by users.
Thanks
Questions?