What is data
Data can be facts, figures, observations, instructions, or measurements that are collected and
stored so they can be used for analysis, decision-making, or problem-solving.
It should be suitable for communication, interpretation, or processing by humans or electronic
machines.
Data is represented using characters such as letters, digits, and special symbols.
What is information
Information is organized, processed data.
To be useful, information must possess the following qualities: timeliness, accuracy, and completeness.
Data science
Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to
extract knowledge and insights from structured, semi-structured, and unstructured data.
In practice, it involves using data to gain insights, make decisions, and solve
problems. It combines elements from:
Statistics and Mathematics (to analyze data)
Computer Science (to process and handle data, often using code)
Domain Knowledge (to understand the context and draw meaningful conclusions)
Data processing is the act of taking raw data and turning it into meaningful information. It’s like
taking ingredients (raw data), following a recipe (processing), and ending up with a meal (useful
information).
The data processing cycle: Collection → Preparation/Cleaning → Input → Processing → Output → Storage
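As a toy illustration of this cycle in Python (the "sensor readings" below are made-up values):

# Collection: raw data arrives as messy strings
raw = ["21.5", " 22.1", "bad", "20.9 ", ""]

# Preparation/cleaning: keep only values that are valid numbers
clean = []
for item in raw:
    try:
        clean.append(float(item.strip()))
    except ValueError:
        pass  # discard unusable input

# Processing: turn the cleaned input into meaningful information
average = sum(clean) / len(clean)

# Output: present the result
print(f"Average reading: {average:.2f}")

# Storage: keep the result for later use
with open("result.txt", "w") as f:
    f.write(str(average))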
Data types and their representation
A data type is an attribute of data that tells the compiler or interpreter how the programmer
intends to use the data.
In other words, a data type defines what kind of value a piece of data holds, like a label that tells the
computer how to interpret and work with that data.
Most programming languages explicitly include the notion of data type.
In computer science and computer programming, common data types include: integers, Booleans,
characters, floating-point numbers, and alphanumeric strings.
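A quick Python sketch of these common types (the variable names and values are made up for illustration):

age = 25               # integer
is_active = True       # Boolean
grade = "A"            # character (Python treats it as a 1-character string)
price = 19.99          # floating-point number
name = "Ali 25"        # alphanumeric string

for value in (age, is_active, grade, price, name):
    print(value, type(value).__name__)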
For the analysis of data, on the other hand, it is important to understand that there are three
common data structures:
Structured data: the most traditional form of data storage; it depends on the
existence of a data model
Organized in tables or databases
Easy to search and analyze
Examples: Excel spreadsheets, SQL databases
Unstructured data: does not have a predefined data model
Not organized in a fixed format
Harder to analyze but very rich in information
Examples: emails, videos, images, social media posts
The ability to analyze unstructured data is especially relevant in the context of big data, since a
large part of an organization's data is unstructured. The ability to extract value from unstructured
data is one of the main drivers behind the rapid growth of big data.
Semi-structured data: does not fit the rigid tables of structured data, but contains tags or markers
that separate and label elements. The two most common formats are JSON and XML.
JSON (JavaScript Object Notation)
A lightweight and easy-to-read format for storing and sharing data.
Looks like Python dictionaries or JavaScript objects.
Widely used in web APIs, apps, and data exchange.
Example:
{
  "name": "Ali",
  "age": 25,
  "skills": ["JavaScript", "React"]
}
✅ Human-readable
✅ Easy to parse in code
✅ Used a lot in modern web and app development
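To show how easy it is to parse, here is a minimal Python sketch using the standard-library json module on the example above:

import json

# Parse the JSON text from the example above into a Python dictionary
text = '{"name": "Ali", "age": 25, "skills": ["JavaScript", "React"]}'
person = json.loads(text)

print(person["name"])       # Ali
print(person["skills"][0])  # JavaScript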
🔸 XML (eXtensible Markup Language)
A markup language that uses tags to define data.
More structured and wordy than JSON.
Used in older systems, documents, and some APIs.
Example:
<person>
  <name>Ali</name>
  <age>25</age>
  <skills>
    <skill>JavaScript</skill>
    <skill>React</skill>
  </skills>
</person>
✅ Very structured
✅ Good for complex data
✅ Still used in enterprise systems and legacy software
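For comparison with the JSON sketch, here is how the XML example above can be parsed with Python's standard-library xml.etree.ElementTree module:

import xml.etree.ElementTree as ET

# Parse the XML text from the example above
text = """
<person>
  <name>Ali</name>
  <age>25</age>
  <skills>
    <skill>JavaScript</skill>
    <skill>React</skill>
  </skills>
</person>
"""
person = ET.fromstring(text)

print(person.find("name").text)                # Ali
print([s.text for s in person.iter("skill")])  # ['JavaScript', 'React']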
Data value chain
📥 Data Acquisition means collecting or obtaining data from various sources so it can be used
for analysis, processing, or storage.
Data Analysis is the process of examining, organizing, and interpreting data to discover useful
information, patterns, trends, or insights.
It helps us answer questions like:
"What happened?"
"Why did it happen?"
"What will happen next?"
Data curation
It is the active management of data over its life cycle to ensure it meets the necessary data quality
requirements for its effective usage.
Data storage
It is the persistence and management of data in a scalable way that satisfies the needs of applications that require access to the data.
Data usage
It covers the data-driven business activities that make use of the data and the insights gained from its analysis.
Big Data refers to extremely large and complex datasets that are difficult to manage, process,
or analyze using traditional tools (like Excel or small databases).
Think of data from millions of users, real-time sensors, or social media platforms: far too
much to handle with just your laptop!
The 5 Vs of Big Data are the core characteristics that distinguish big data from other data
processing:
Volume: the sheer amount of data generated
Velocity: the speed at which data is created and processed
Variety: the many forms data takes (structured, semi-structured, unstructured)
Veracity: the trustworthiness and quality of the data
Value: the useful insights that can be extracted from the data
The Big Data Life Cycle describes the end-to-end journey of data, from the moment it is
generated to the moment it is used for decision-making.
The general categories of activities involved in big data processing are:
1. Data Ingestion
Bringing data into your system or platform
Can be done in real-time (streaming) or in batches
Tools: Apache Kafka, Flume, Sqoop
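As a rough sketch of real-time ingestion, here is what sending one event to Kafka can look like with the third-party kafka-python package (the broker address and topic name here are assumptions, not part of the notes):

from kafka import KafkaProducer  # third-party package: kafka-python

# Connect to a hypothetical local Kafka broker and stream one click event
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("click-events", b'{"user": "ali", "page": "/home"}')
producer.flush()  # ensure the buffered message is actually sent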
2. Data Storage
Data is stored in systems that can handle large volume and variety
Choices depend on data type:
o HDFS (Hadoop Distributed File System)
o NoSQL databases (like MongoDB, Cassandra)
o Data lakes / cloud storage (AWS S3, Azure Blob)
📍 Example: Saving years of customer click data in Amazon S3
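A minimal sketch of the S3 example using the boto3 AWS SDK (the file, bucket, and key names are made up, and AWS credentials are assumed to be configured):

import boto3  # third-party AWS SDK for Python

# Upload a local file of click data to a hypothetical S3 bucket
s3 = boto3.client("s3")
s3.upload_file("clicks_2024.csv", "my-company-datalake", "raw/clicks_2024.csv")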
3. Data Analysis
Discover patterns, trends, and insights using:
o Statistical methods
o Machine learning models
o Visualizations
Tools: Python, R, Power BI, Tableau
📍 Example: Analyzing customer buying trends
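A minimal sketch of the analysis step using pandas (the purchase data is made up):

import pandas as pd

# Toy purchase data standing in for real customer transaction logs
orders = pd.DataFrame({
    "customer": ["a", "b", "a", "c", "b", "a"],
    "amount":   [20, 35, 15, 50, 10, 25],
})

# A simple buying-trends question: total and average spend per customer
trends = orders.groupby("customer")["amount"].agg(["sum", "mean"])
print(trends)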
4. Data Visualization
Turning insights into graphs, charts, dashboards
Helps decision-makers understand what's happening
📍 Example: A dashboard showing sales performance by region
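A minimal sketch of the dashboard example using matplotlib (the sales figures are made up):

import matplotlib.pyplot as plt

# Hypothetical sales-by-region numbers for a simple dashboard-style chart
regions = ["North", "South", "East", "West"]
sales = [120, 95, 140, 80]

plt.bar(regions, sales)
plt.title("Sales performance by region")
plt.ylabel("Sales (units)")
plt.show()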
Clustered computing
Clustered computing is when multiple computers (called nodes) work together as a single,
unified system to perform tasks — especially when one computer alone isn’t powerful or fast
enough.
🧱 HDFS (Hadoop Distributed File System)
HDFS is primarily a storage system designed for Big Data, and while not a traditional database
management system, it is used in conjunction with data processing frameworks like Hadoop,
Spark, and Hive.
✅ Pros of HDFS:
⚖️ Scalability: Easily scales horizontally by adding more servers (nodes); can handle petabytes of data.
Fault Tolerance: Data is replicated (the default is 3 copies), ensuring no data is lost even if a node fails.
🚀 High Throughput: Optimized for reading/writing large amounts of data, especially in batch processing.
💸 Cost-Effective: Can run on commodity hardware, making it relatively inexpensive.
🔗 Works with the Hadoop Ecosystem: Seamlessly integrates with big data processing tools like MapReduce, Hive, Pig, and Spark.
📈 Big Data Storage: Designed to store and manage large volumes of unstructured or semi-structured data.
❌ Cons of HDFS:
🐢 Not Ideal for Small Files: Struggles with a large number of small files, as each file creates overhead on the NameNode.
🔄 No Native Query Support: Does not support SQL or querying natively; external tools (like Hive or Spark SQL) are needed for querying.
🧱 Batch Processing Only: Best suited for batch processing; not ideal for real-time or interactive queries.
⚙️ Complex to Set Up: Requires specialized knowledge to set up and maintain a Hadoop cluster.
🧠 Single Point of Failure (NameNode): If the NameNode fails (without a high-availability setup), the whole system may stop functioning.
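For a feel of how an application writes to HDFS, here is a minimal sketch using the third-party hdfs (hdfscli) Python package; the NameNode URL, user, and paths are assumptions:

from hdfs import InsecureClient  # third-party package: hdfs (hdfscli)

# Connect to a hypothetical NameNode through its WebHDFS endpoint
client = InsecureClient("http://namenode:9870", user="hadoop")

# Write a small file, then list the directory to confirm it is there
client.write("/data/raw/events.json", data='{"user": "ali"}', overwrite=True)
print(client.list("/data/raw"))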
2. Relational Database Management Systems (RDBMS)
These are traditional databases that store data in structured tables with rows and columns.
They support SQL for querying data.
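To make the SQL querying concrete, here is a minimal sketch using Python's built-in sqlite3 module (the table and rows are made up):

import sqlite3

# In-memory relational database with one structured table (rows and columns)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ali", 20.0), ("sara", 35.0), ("ali", 15.0)],
)

# SQL aggregation: total spend per customer
for row in conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
):
    print(row)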
✅ Pros of RDBMS (e.g., MySQL, PostgreSQL, Oracle DB):
📊 Structured Data: Excellent for structured data where relationships are well-defined (e.g., bank transactions, inventory).
🔍 Advanced Querying (SQL): Supports complex queries using SQL (e.g., joins, aggregations, filtering).
Data Integrity (ACID): Guarantees data consistency, reliability, and transaction management (ACID properties).
💻 Mature Ecosystem: A well-established technology with extensive tools, libraries, and community support.
🔒 Security: Built-in security features (e.g., access control, encryption).
❌ Cons of RDBMS:
🧱 Rigid Schema: Requires a predefined schema; inflexible when dealing with unstructured data.
⚙️ Scaling: Vertical scaling (adding more power to a single server) is expensive; horizontal scaling (distributing across multiple servers) is challenging.
🐢 Not Ideal for Big Data: Performance may degrade as data grows into the terabyte or petabyte range.
🔄 Write Performance: Relational databases can be slower for write-heavy workloads (e.g., logs, real-time streaming).
3. NoSQL Databases (Distributed DBMS)
NoSQL databases (like MongoDB, Cassandra, and Couchbase) are designed for high scalability,
flexibility, and can handle unstructured or semi-structured data. They are often used in
distributed systems where horizontal scaling is important.
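A minimal sketch of the flexible-schema idea using the third-party pymongo driver (the connection string, database, and collection names are assumptions):

from pymongo import MongoClient  # third-party package: pymongo

# Connect to a hypothetical local MongoDB instance
client = MongoClient("mongodb://localhost:27017")
customers = client["shop"]["customers"]

# Flexible schema: documents in the same collection need not share fields
customers.insert_one({"name": "Ali", "skills": ["JavaScript", "React"]})
customers.insert_one({"name": "Sara", "age": 30, "city": "Addis Ababa"})

print(customers.find_one({"name": "Ali"}))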
✅ Pros of NoSQL DBMS:
⚖️ Horizontal Scalability: Easily scales out across multiple servers and regions, making them suitable for big data.
🔄 Flexible Schema: Allows for unstructured or semi-structured data and flexible schemas (great for rapidly changing data).
🚀 High Performance: Great for write-heavy workloads (e.g., logs, IoT devices) and handling large volumes of data.
🧠 Eventual Consistency: Optimized for availability and partition tolerance, often adopting an eventual consistency model (e.g., Cassandra).
📈 Specialized Data Models: Supports various data models such as document (MongoDB), key-value (Redis), graph (Neo4j), and column-family (Cassandra).
❌ Cons of NoSQL DBMS:
🧱 Eventual Consistency: May not provide strong consistency (ACID) like traditional RDBMS, leading to potential data anomalies.
Complex Data Modeling: Designing data models can be more challenging, especially for developers used to relational databases.
🔍 Limited Querying: While NoSQL databases support basic queries, they lack full support for SQL-style joins and complex queries (though tools like the MongoDB Aggregation Framework are improving this).
🧑‍💻 Young Technology: Some NoSQL DBMSs are still evolving, and their ecosystems may not be as mature as relational databases.
4. Key Differences Between HDFS and Other DBMSs
Here’s a comparison between HDFS and RDBMS/NoSQL DBMS:
Data Type:
  HDFS: unstructured, semi-structured
  Relational DBMS: structured (tables, rows, columns)
  NoSQL DBMS: structured, semi-structured, unstructured
Scalability:
  HDFS: horizontal (distributed across nodes)
  Relational DBMS: vertical (single server)
  NoSQL DBMS: horizontal (distributed systems)
Querying:
  HDFS: no native querying (external tools like Hive or Spark)
  Relational DBMS: SQL (Structured Query Language)
  NoSQL DBMS: NoSQL query languages (varies)
Consistency:
  HDFS: high fault tolerance (replication)
  Relational DBMS: ACID transactions (strong consistency)
  NoSQL DBMS: eventual consistency or strong consistency (varies)
Data Model:
  HDFS: files and blocks
  Relational DBMS: tables with schema
  NoSQL DBMS: key-value, document, column-family, graph
Best for:
  HDFS: big data storage and batch processing
  Relational DBMS: transactional data
  NoSQL DBMS: high-volume, write-heavy workloads
Use Case:
  HDFS: large-scale data storage and processing
  Relational DBMS: OLTP systems (e.g., financial, ERP)
  NoSQL DBMS: real-time applications, distributed systems
🧠 Summary
HDFS: Ideal for storing massive datasets (often unstructured) in a distributed system for
batch processing. It’s not a DBMS in the traditional sense but works as the storage layer
for big data ecosystems.
Relational DBMS: Best for managing structured data with ACID compliance. Excellent
for transactional systems but struggles with scaling to large datasets (TBs and beyond).
NoSQL DBMS: Best for high scalability, handling unstructured or semi-structured data,
and high write throughput. Offers flexibility, but lacks the full feature set of relational
databases (like complex joins).