What is data
Data can be facts, figures, observations, instructions, or measurements that are collected and
stored so they can be used for analysis, decision-making, or problem-solving.
It should be suitable for communication, interpretation, or processing by humans or electronic
machines.
Data is represented using characters such as letters, digits, and special symbols.
What is information
Information is organized, processed data.
To be useful, information must possess the following qualities: timeliness, accuracy, and completeness.
Data science
Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to
extract knowledge and insights from structured, semi-structured, and unstructured data.
In practice, it involves using data to gain insights, make decisions, and solve
problems. It combines elements from:
Statistics and Mathematics (to analyze data)
Computer Science (to process and handle data, often using code)
Domain Knowledge (to understand the context and draw meaningful conclusions)
Data processing is the act of taking raw data and turning it into meaningful information. It’s like
taking ingredients (raw data), following a recipe (processing), and ending up with a meal (useful
information).
The data processing cycle: Collection → Preparation/Cleaning → Input → Processing → Output → Storage
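As a toy illustration of this cycle in Python (the "sensor readings" below are made-up values):

# Collection: raw data arrives as messy strings
raw = ["21.5", " 22.1", "bad", "20.9 ", ""]

# Preparation/cleaning: keep only values that are valid numbers
clean = []
for item in raw:
    try:
        clean.append(float(item.strip()))
    except ValueError:
        pass  # discard unusable input

# Processing: turn the cleaned input into meaningful information
average = sum(clean) / len(clean)

# Output: present the result
print(f"Average reading: {average:.2f}")

# Storage: keep the result for later use
with open("result.txt", "w") as f:
    f.write(str(average))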
Data types and their representation
A data type is an attribute of data that tells the compiler or interpreter how the programmer
intends to use the data.
In other words, a data type defines what kind of value a piece of data holds, like a label that tells the
computer how to interpret and work with that data.
Most programming languages explicitly include the notion of data type.
In computer science and computer programming, common data types include: integers, Booleans,
characters, floating-point numbers, and alphanumeric strings.
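A quick Python sketch of these common types (the variable names and values are made up for illustration):

age = 25               # integer
is_active = True       # Boolean
grade = "A"            # character (Python treats it as a 1-character string)
price = 19.99          # floating-point number
name = "Ali 25"        # alphanumeric string

for value in (age, is_active, grade, price, name):
    print(value, type(value).__name__)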
For the analysis of data, on the other hand, it is important to understand that there are three
common data structures:
Structured data: the most traditional form of data storage; it depends on the
existence of a data model
Organized in tables or databases
Easy to search and analyze
Examples: Excel spreadsheets, SQL databases
Unstructured data: does not have a predefined data model
Not organized in a fixed format
Harder to analyze but very rich in information
Examples: emails, videos, images, social media posts
The ability to analyze unstructured data is especially relevant in the context of big data, since a
large part of an organization's data is unstructured. The ability to extract value from unstructured
data is one of the main drivers behind the rapid growth of big data.
Semi-structured data: does not fit the rigid tables of structured data, but contains tags or markers
that separate and label elements. The two most common formats are JSON and XML.
JSON (JavaScript Object Notation)
A lightweight and easy-to-read format for storing and sharing data.
Looks like Python dictionaries or JavaScript objects.
Widely used in web APIs, apps, and data exchange.
Example:
{
  "name": "Ali",
  "age": 25,
  "skills": ["JavaScript", "React"]
}
✅ Human-readable
✅ Easy to parse in code
✅ Used a lot in modern web and app development
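To show how easy it is to parse, here is a minimal Python sketch using the standard-library json module on the example above:

import json

# Parse the JSON text from the example above into a Python dictionary
text = '{"name": "Ali", "age": 25, "skills": ["JavaScript", "React"]}'
person = json.loads(text)

print(person["name"])       # Ali
print(person["skills"][0])  # JavaScript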
🔸 XML (eXtensible Markup Language)
A markup language that uses tags to define data.
More structured and wordy than JSON.
Used in older systems, documents, and some APIs.
Example:
<person>
  <name>Ali</name>
  <age>25</age>
  <skills>
    <skill>JavaScript</skill>
    <skill>React</skill>
  </skills>
</person>
✅ Very structured
✅ Good for complex data
✅ Still used in enterprise systems and legacy software
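For comparison with the JSON sketch, here is how the XML example above can be parsed with Python's standard-library xml.etree.ElementTree module:

import xml.etree.ElementTree as ET

# Parse the XML text from the example above
text = """
<person>
  <name>Ali</name>
  <age>25</age>
  <skills>
    <skill>JavaScript</skill>
    <skill>React</skill>
  </skills>
</person>
"""
person = ET.fromstring(text)

print(person.find("name").text)                # Ali
print([s.text for s in person.iter("skill")])  # ['JavaScript', 'React']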
Data value chain
📥 Data Acquisition means collecting or obtaining data from various sources so it can be used
for analysis, processing, or storage.
Data Analysis is the process of examining, organizing, and interpreting data to discover useful
information, patterns, trends, or insights.
It helps us answer questions like:
"What happened?"
"Why did it happen?"
"What will happen next?"
Data curation
It is the active management of data over its life cycle to ensure it meets the necessary data quality
requirements for its effective usage.
Data storage
It is the persistence and management of data in a scalable way that satisfies the needs of applications that require access to the data.
Data usage
It covers the data-driven business activities that make use of the data and the insights gained from its analysis.
Big Data refers to extremely large and complex datasets that are difficult to manage, process,
or analyze using traditional tools (like Excel or small databases).
Think of data from millions of users, real-time sensors, or social media platforms: far too
much to handle with just your laptop!
The 5 Vs of Big Data are the core characteristics that distinguish big data from other data
processing:
Volume: the sheer amount of data generated
Velocity: the speed at which data is created and processed
Variety: the many forms data takes (structured, semi-structured, unstructured)
Veracity: the trustworthiness and quality of the data
Value: the useful insights that can be extracted from the data
The Big Data Life Cycle describes the end-to-end journey of data, from the moment it is
generated to the moment it is used for decision-making.
The general categories of activities involved in big data processing are:
1. Data Ingestion
Bringing data into your system or platform
Can be done in real-time (streaming) or in batches
Tools: Apache Kafka, Flume, Sqoop
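As a rough sketch of real-time ingestion, here is what sending one event to Kafka can look like with the third-party kafka-python package (the broker address and topic name here are assumptions, not part of the notes):

from kafka import KafkaProducer  # third-party package: kafka-python

# Connect to a hypothetical local Kafka broker and stream one click event
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("click-events", b'{"user": "ali", "page": "/home"}')
producer.flush()  # ensure the buffered message is actually sent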
2. Data Storage
Data is stored in systems that can handle large volume and variety
Choices depend on data type:
o HDFS (Hadoop Distributed File System)
o NoSQL databases (like MongoDB, Cassandra)
o Data lakes / cloud storage (AWS S3, Azure Blob)
📍 Example: Saving years of customer click data in Amazon S3
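A minimal sketch of the S3 example using the boto3 AWS SDK (the file, bucket, and key names are made up, and AWS credentials are assumed to be configured):

import boto3  # third-party AWS SDK for Python

# Upload a local file of click data to a hypothetical S3 bucket
s3 = boto3.client("s3")
s3.upload_file("clicks_2024.csv", "my-company-datalake", "raw/clicks_2024.csv")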
3. Data Analysis
Discover patterns, trends, and insights using:
o Statistical methods
o Machine learning models
o Visualizations
Tools: Python, R, Power BI, Tableau
📍 Example: Analyzing customer buying trends
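A minimal sketch of the analysis step using pandas (the purchase data is made up):

import pandas as pd

# Toy purchase data standing in for real customer transaction logs
orders = pd.DataFrame({
    "customer": ["a", "b", "a", "c", "b", "a"],
    "amount":   [20, 35, 15, 50, 10, 25],
})

# A simple buying-trends question: total and average spend per customer
trends = orders.groupby("customer")["amount"].agg(["sum", "mean"])
print(trends)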
4. Data Visualization
Turning insights into graphs, charts, dashboards
Helps decision-makers understand what's happening
📍 Example: A dashboard showing sales performance by region
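A minimal sketch of the dashboard example using matplotlib (the sales figures are made up):

import matplotlib.pyplot as plt

# Hypothetical sales-by-region numbers for a simple dashboard-style chart
regions = ["North", "South", "East", "West"]
sales = [120, 95, 140, 80]

plt.bar(regions, sales)
plt.title("Sales performance by region")
plt.ylabel("Sales (units)")
plt.show()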
Clustered computing
Clustered computing is when multiple computers (called nodes) work together as a single,
unified system to perform tasks — especially when one computer alone isn’t powerful or fast
enough.
🧱 HDFS (Hadoop Distributed File System)
HDFS is primarily a storage system designed for Big Data, and while not a traditional database
management system, it is used in conjunction with data processing frameworks like Hadoop,
Spark, and Hive.
✅ Pros of HDFS:
⚖️ Scalability: Easily scales horizontally by adding more servers (nodes); can handle petabytes of data.
Fault Tolerance: Data is replicated (the default is 3 copies), ensuring no data is lost even if a node fails.
🚀 High Throughput: Optimized for reading/writing large amounts of data, especially in batch processing.
💸 Cost-Effective: Can run on commodity hardware, making it relatively inexpensive.
🔗 Works with the Hadoop Ecosystem: Seamlessly integrates with big data processing tools like MapReduce, Hive, Pig, and Spark.
📈 Big Data Storage: Designed to store and manage large volumes of unstructured or semi-structured data.
❌ Cons of HDFS:
🐢 Not Ideal for Small Files: Struggles with a large number of small files, as each file creates overhead on the NameNode.
🔄 No Native Query Support: Does not support SQL or querying natively; external tools (like Hive or Spark SQL) are needed for querying.
🧱 Batch Processing Only: Best suited for batch processing; not ideal for real-time or interactive queries.
⚙️ Complex to Set Up: Requires specialized knowledge to set up and maintain a Hadoop cluster.
🧠 Single Point of Failure (NameNode): If the NameNode fails (without a high-availability setup), the whole system may stop functioning.
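For a feel of how an application writes to HDFS, here is a minimal sketch using the third-party hdfs (hdfscli) Python package; the NameNode URL, user, and paths are assumptions:

from hdfs import InsecureClient  # third-party package: hdfs (hdfscli)

# Connect to a hypothetical NameNode through its WebHDFS endpoint
client = InsecureClient("http://namenode:9870", user="hadoop")

# Write a small file, then list the directory to confirm it is there
client.write("/data/raw/events.json", data='{"user": "ali"}', overwrite=True)
print(client.list("/data/raw"))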
2. Relational Database Management Systems (RDBMS)
These are traditional databases that store data in structured tables with rows and columns.
They support SQL for querying data.
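To make the SQL querying concrete, here is a minimal sketch using Python's built-in sqlite3 module (the table and rows are made up):

import sqlite3

# In-memory relational database with one structured table (rows and columns)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ali", 20.0), ("sara", 35.0), ("ali", 15.0)],
)

# SQL aggregation: total spend per customer
for row in conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
):
    print(row)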
✅ Pros of RDBMS (e.g., MySQL, PostgreSQL, Oracle DB):
📊 Structured Data: Excellent for structured data where relationships are well-defined (e.g., bank transactions, inventory).
🔍 Advanced Querying (SQL): Supports complex queries using SQL (e.g., joins, aggregations, filtering).
Data Integrity (ACID): Guarantees data consistency, reliability, and transaction management (ACID properties).
💻 Mature Ecosystem: A well-established technology with extensive tools, libraries, and community support.
🔒 Security: Built-in security features (e.g., access control, encryption).
❌ Cons of RDBMS:
🧱 Rigid Schema: Requires a predefined schema; inflexible when dealing with unstructured data.
⚙️ Scaling: Vertical scaling (adding more power to a single server) is expensive; horizontal scaling (distributing across multiple servers) is challenging.
🐢 Not Ideal for Big Data: Performance may degrade as data grows into the terabyte or petabyte range.
🔄 Write Performance: Relational databases can be slower for write-heavy workloads (e.g., logs, real-time streaming).
3. NoSQL Databases (Distributed DBMS)
NoSQL databases (like MongoDB, Cassandra, and Couchbase) are designed for high scalability,
flexibility, and can handle unstructured or semi-structured data. They are often used in
distributed systems where horizontal scaling is important.
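A minimal sketch of the flexible-schema idea using the third-party pymongo driver (the connection string, database, and collection names are assumptions):

from pymongo import MongoClient  # third-party package: pymongo

# Connect to a hypothetical local MongoDB instance
client = MongoClient("mongodb://localhost:27017")
customers = client["shop"]["customers"]

# Flexible schema: documents in the same collection need not share fields
customers.insert_one({"name": "Ali", "skills": ["JavaScript", "React"]})
customers.insert_one({"name": "Sara", "age": 30, "city": "Addis Ababa"})

print(customers.find_one({"name": "Ali"}))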
✅ Pros of NoSQL DBMS:
⚖️ Horizontal Scalability: Easily scales out across multiple servers and regions, making them suitable for big data.
🔄 Flexible Schema: Allows for unstructured or semi-structured data and flexible schemas (great for rapidly changing data).
🚀 High Performance: Great for write-heavy workloads (e.g., logs, IoT devices) and handling large volumes of data.
🧠 Eventual Consistency: Optimized for availability and partition tolerance, often adopting an eventual consistency model (e.g., Cassandra).
📈 Specialized Data Models: Supports various data models such as document (MongoDB), key-value (Redis), graph (Neo4j), and column-family (Cassandra).
❌ Cons of NoSQL DBMS:
🧱 Eventual Consistency: May not provide strong consistency (ACID) like traditional RDBMS, leading to potential data anomalies.
Complex Data Modeling: Designing data models can be more challenging, especially for developers used to relational databases.
🔍 Limited Querying: While NoSQL databases support basic queries, they lack full support for SQL-style joins and complex queries (though tools like the MongoDB Aggregation Framework are improving this).
🧑‍💻 Young Technology: Some NoSQL DBMSs are still evolving, and their ecosystems may not be as mature as relational databases.
4. Key Differences Between HDFS and Other DBMSs
Here’s a comparison between HDFS and RDBMS/NoSQL DBMS:
Data Type:
  HDFS: unstructured, semi-structured
  Relational DBMS: structured (tables, rows, columns)
  NoSQL DBMS: structured, semi-structured, unstructured
Scalability:
  HDFS: horizontal (distributed across nodes)
  Relational DBMS: vertical (single server)
  NoSQL DBMS: horizontal (distributed systems)
Querying:
  HDFS: no native querying (external tools like Hive or Spark)
  Relational DBMS: SQL (Structured Query Language)
  NoSQL DBMS: NoSQL query languages (varies)
Consistency:
  HDFS: high fault tolerance (replication)
  Relational DBMS: ACID transactions (strong consistency)
  NoSQL DBMS: eventual consistency or strong consistency (varies)
Data Model:
  HDFS: files and blocks
  Relational DBMS: tables with schema
  NoSQL DBMS: key-value, document, column-family, graph
Best for:
  HDFS: big data storage and batch processing
  Relational DBMS: transactional data
  NoSQL DBMS: high-volume, write-heavy workloads
Use Case:
  HDFS: large-scale data storage and processing
  Relational DBMS: OLTP systems (e.g., financial, ERP)
  NoSQL DBMS: real-time applications, distributed systems
🧠 Summary
HDFS: Ideal for storing massive datasets (often unstructured) in a distributed system for
batch processing. It’s not a DBMS in the traditional sense but works as the storage layer
for big data ecosystems.
Relational DBMS: Best for managing structured data with ACID compliance. Excellent
for transactional systems but struggles with scaling to large datasets (TBs and beyond).
NoSQL DBMS: Best for high scalability, handling unstructured or semi-structured data,
and high write throughput. Offers flexibility, but lacks the full feature set of relational
databases (like complex joins).