
Format

The document provides an overview of various big data technologies, including Apache Hadoop, Apache Storm, Apache Cassandra, CouchDB, Apache Flink, Cloudera, Apache Hive, MongoDB, MySQL, and Kaggle. Each technology is defined, its structure outlined, and its data storage, retrieval, and application use cases explained, along with its advantages and disadvantages. The document serves as a guide to the functionality and characteristics of these technologies in the context of big data processing and management.

1. Apache Hadoop

 Definition: An open-source framework that allows for the distributed processing of
large datasets across clusters of computers using simple programming models. It's
the foundation for many other big data technologies.

 Structure: Primarily composed of:

o HDFS (Hadoop Distributed File System): A distributed file system that stores
data across multiple machines.

o YARN (Yet Another Resource Negotiator): Manages resources and schedules
jobs on the cluster.

o MapReduce: A programming model for processing large datasets in a
distributed and parallel manner.

 How Data is Stored and Retrieved:

o Storage: Data is broken into blocks (typically 128MB or 256MB) and
distributed across nodes in the HDFS cluster. Each block is replicated (default
3 times) for fault tolerance.

o Retrieval: When a MapReduce job runs, the processing logic (Map and
Reduce tasks) is moved to the nodes where the data resides (data locality) to
minimize network I/O.

 Simple Example: Imagine you have billions of lines of log data from a website, and
you want to count how many times each IP address accessed your site.

o Storage: The log files are split and stored across many servers by HDFS.

o Retrieval/Processing (conceptual MapReduce):

 Map: Each server reads its portion of the log file and outputs (IP
address, 1) for every access.

 Shuffle & Sort: All (IP address, 1) pairs with the same IP address are
grouped and sent to the same reducer.

 Reduce: Each reducer counts the '1's for its assigned IP addresses,
resulting in (IP address, total_count).

 Sample Query (Conceptual): Hadoop doesn't have a direct query language like SQL.
Instead, you write MapReduce jobs (often in Java, Python) or use higher-level tools
like Hive or Pig.

o Implicit in the example above: Count occurrences of each unique IP address.
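The Map, Shuffle & Sort, and Reduce steps described above can be sketched as a single-machine Python simulation. This is only an illustration of the dataflow, not actual Hadoop code, and the log lines are made-up examples:

```python
from collections import defaultdict

# Hypothetical log lines; in HDFS these would be split across many nodes.
log_lines = [
    "203.0.113.5 GET /index.html",
    "198.51.100.7 GET /about.html",
    "203.0.113.5 GET /contact.html",
]

# Map: emit (IP address, 1) for every access.
mapped = [(line.split()[0], 1) for line in log_lines]

# Shuffle & Sort: group all pairs with the same IP address together.
groups = defaultdict(list)
for ip, one in mapped:
    groups[ip].append(one)

# Reduce: sum the 1s for each IP address.
counts = {ip: sum(ones) for ip, ones in groups.items()}
print(counts)  # {'203.0.113.5': 2, '198.51.100.7': 1}
```

In a real cluster, the Map step runs on the nodes holding each data block (data locality), and the framework performs the shuffle across the network.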


 Kind of Data Stored: Can store structured, semi-structured, and unstructured data
(e.g., log files, images, videos, social media data, sensor data).

 Characteristics: Scalable, fault-tolerant, cost-effective (uses commodity hardware),
flexible (schema-on-read).

 Applications Used In: Large-scale data processing, data warehousing, log analysis,
machine learning data preparation, fraud detection, risk management.

 Advantages:

o Handles massive volumes of data (petabytes to exabytes).

o Highly fault-tolerant due to data replication.

o Scales horizontally by adding more commodity hardware.

o Cost-effective compared to traditional data warehousing.

 Disadvantages:

o Batch processing orientation; not suitable for real-time processing.

o MapReduce can be complex to program directly.

o Higher latency for small data queries.

o Security and data governance can be challenging to implement
comprehensively.

2. Apache Storm

 Definition: An open-source distributed real-time computation system for processing
unbounded streams of data. It's often referred to as the "Hadoop of real-time."

 Structure:

o Nimbus: The master node that distributes code, assigns tasks, and monitors
the cluster.

o Supervisors: Worker nodes that execute assigned tasks.

o Topologies: The logic of a real-time application, composed of:

 Spouts: Data sources (e.g., Kafka, Kinesis).

 Bolts: Processing units that perform operations (filtering, aggregation,
joining).

 How Data is Stored and Retrieved: Storm is primarily for processing data in motion,
not for persistent storage. Data is ingested from sources (spouts) and flows through
the topology (bolts) in real-time. Results are typically output to another system (e.g.,
a database, messaging queue).

 Simple Example: Analyzing real-time Twitter feeds for trending hashtags.

o Spout: Connects to Twitter API and streams tweets.

o Bolt 1 (Parse Tweet): Extracts hashtags from each tweet.

o Bolt 2 (Count Hashtag): Increments a counter for each hashtag.

o Bolt 3 (Output Trending): Periodically outputs the top N trending hashtags.

 Sample Query (Conceptual): Storm doesn't use queries; you define a data flow
topology.

o The example above outlines the processing logic.
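The spout/bolt pipeline above can be simulated in plain Python to show what each stage does. This is a single-machine sketch with made-up tweets, not the Storm API (a real topology would run each bolt as distributed tasks fed by a live stream):

```python
import re
from collections import Counter

# Hypothetical tweets standing in for a live spout.
tweets = [
    "Loving the new release! #bigdata #storm",
    "Real-time rocks #storm",
    "Batch is fine too #hadoop",
]

hashtag_counts = Counter()

for tweet in tweets:
    # Bolt 1 (Parse Tweet): extract hashtags from each tweet.
    for tag in re.findall(r"#\w+", tweet):
        # Bolt 2 (Count Hashtag): increment a counter per hashtag.
        hashtag_counts[tag] += 1

# Bolt 3 (Output Trending): emit the top N hashtags.
top2 = hashtag_counts.most_common(2)
print(top2)  # [('#storm', 2), ('#bigdata', 1)]
```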

 Kind of Data Stored: Processes continuous streams of data, typically semi-structured
or unstructured (e.g., sensor readings, clickstreams, financial transactions, social
media updates).

 Characteristics: Real-time, fault-tolerant, scalable, low-latency.

 Applications Used In: Real-time analytics, continuous computation, distributed RPC,
ETL.

 Advantages:

o Processes data in real-time with very low latency.

o Guaranteed data processing (at least once or exactly once).

o Highly scalable and fault-tolerant.

o Can integrate with various data sources and sinks.

 Disadvantages:

o Can be complex to set up and manage.

o Not designed for batch processing.

o Debugging distributed real-time systems can be challenging.

3. Apache Cassandra

 Definition: A free and open-source distributed NoSQL database management system
designed to handle large amounts of data across many commodity servers, providing
high availability with no single point of failure. It's a wide-column store.

 Structure: Peer-to-peer distributed architecture where all nodes are identical. Data is
distributed across nodes using consistent hashing (ring structure).

 How Data is Stored and Retrieved:

o Storage: Data is partitioned across nodes using a partition key. Replication
ensures data redundancy across multiple nodes. Writes are highly available
and fast, as data is written to multiple replicas concurrently.

o Retrieval: Queries are directed to any node in the cluster (coordinator node),
which then forwards the request to the nodes holding the relevant data. Data
is retrieved from one or more replicas based on the configured consistency
level.

 Simple Example: Storing user profile data for a large social media application.

o Table Schema (simplified): CREATE TABLE users (user_id UUID PRIMARY KEY,
username text, email text, age int, city text);

o Storage: When a new user signs up, their user_id acts as the partition key,
determining which node(s) store their data. The data is replicated based on
the replication factor.

o Retrieval: To get a user's profile: SELECT * FROM users WHERE user_id =
<user_uuid>; Cassandra quickly finds the node(s) holding that user_id and
retrieves the data.

 Sample Query:

CQL

INSERT INTO users (user_id, username, email, age, city) VALUES (uuid(), 'johndoe',
'john@example.com', 30, 'New York');

SELECT username, email FROM users WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;
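The way Cassandra routes a partition key to nodes on its ring can be sketched with a toy consistent-hashing model. This is a simplification for illustration only: real Cassandra uses Murmur3 tokens, virtual nodes, and replication strategies, whereas this sketch uses MD5 and a fixed four-node ring:

```python
import hashlib

# A toy token ring: four nodes, each owning a slice of the hash space.
NODES = ["node-a", "node-b", "node-c", "node-d"]

def partition(partition_key: str) -> str:
    """Map a partition key (e.g. a user_id) to the node that owns its token."""
    token = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    return NODES[token % len(NODES)]

def replicas(partition_key: str, replication_factor: int = 3) -> list:
    """Replicas are the owning node plus the next nodes around the ring."""
    start = NODES.index(partition(partition_key))
    return [NODES[(start + i) % len(NODES)] for i in range(replication_factor)]

# The same key always hashes to the same node, so any coordinator
# can locate the replicas without a central lookup.
print(replicas("123e4567-e89b-12d3-a456-426614174000"))
```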

 Kind of Data Stored: Semi-structured data, often denormalized. Excellent for
time-series data, sensor data, and applications requiring high write throughput and
continuous availability.

 Characteristics: Highly scalable, high availability, eventually consistent (tunable
consistency), high write throughput, no single point of failure.

 Applications Used In: Real-time recommendations, IoT data, social media
applications, messaging systems, fraud detection, customer 360 views.

 Advantages:

o Linear scalability for both reads and writes.

o Always-on architecture with no single point of failure.

o Flexible schema design.

o Excellent for geographically distributed data.

 Disadvantages:

o Eventual consistency can be a challenge for applications requiring strong
consistency.

o Joins and complex queries are not directly supported.

o Requires careful data modeling for efficient queries.

4. CouchDB (Apache CouchDB)

 Definition: An open-source NoSQL database that focuses on ease of use and a multi-
master replication model. It stores data in JSON documents and provides a RESTful
HTTP API for interaction.

 Structure: Document-oriented database. Data is stored as self-contained JSON
documents. Replication is a core feature, allowing for master-master or master-slave
setups.

 How Data is Stored and Retrieved:

o Storage: Each document has a unique ID and a revision ID. When a document
is updated, a new revision is created. This allows for optimistic concurrency
control (MVCC).

o Retrieval: Data is retrieved via HTTP requests to the document's URL or by
using "views" (MapReduce functions written in JavaScript) to query and
transform data.

 Simple Example: Storing blog posts.

o Storage: A blog post is a JSON document:

JSON

{

"_id": "post_123",

"title": "My First Blog Post",

"author": "Alice",

"content": "This is the content of my post.",

"tags": ["blogging", "tutorial"]

}
o Retrieval:

 Get a specific post: GET /mydb/post_123

 Find all posts by "Alice" (using a view): You'd define a map function
that emits [doc.author, doc.title] and then query that view.
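The idea behind a CouchDB view can be shown with a plain-Python stand-in: a map function is run over every document, and querying the view filters its emitted rows by key. Real CouchDB views are JavaScript functions whose results are indexed on disk; this in-memory sketch with made-up documents only mirrors the logic:

```python
# In-memory stand-in for documents in a CouchDB database.
docs = [
    {"_id": "post_123", "title": "My First Blog Post", "author": "Alice"},
    {"_id": "post_124", "title": "Second Post", "author": "Bob"},
    {"_id": "post_125", "title": "CouchDB Views", "author": "Alice"},
]

def map_fn(doc):
    # Equivalent of the view's JavaScript: emit([doc.author, doc.title], null)
    yield (doc["author"], doc["title"])

# "Querying the view" with key author == "Alice":
rows = [row for doc in docs for row in map_fn(doc)]
alice_posts = [title for author, title in rows if author == "Alice"]
print(alice_posts)  # ['My First Blog Post', 'CouchDB Views']
```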

 Sample Query (using curl for HTTP API):

Bash

curl -X PUT http://localhost:5984/mydb/post_123 -d '{ "title": "My First Blog Post", "author":
"Alice", "content": "This is the content." }' -H "Content-Type: application/json"

curl http://localhost:5984/mydb/post_123

 Kind of Data Stored: Semi-structured data in JSON format, including nested
structures and binary attachments.

 Characteristics: Document-oriented, eventually consistent, master-master
replication, offline-first capabilities, RESTful API.

 Applications Used In: Mobile applications (offline sync), web applications, content
management systems, CRM.

 Advantages:

o Easy to set up and use with a simple RESTful API.

o Excellent for distributed and offline-first applications due to robust
replication.

o High availability through multi-master replication.

o Flexible schema.

 Disadvantages:

o Limited query capabilities compared to SQL databases.

o View indexes can be slow to build and update for complex aggregations, since they must be pre-computed.

o Not ideal for highly relational data.

5. Apache Flink

 Definition: An open-source stream processing framework that can handle both
bounded (batch) and unbounded (streaming) data sets with high throughput and low
latency. It provides stateful computations.

 Structure: A Flink application consists of a dataflow graph, composed of sources,
transformations, and sinks. It runs on a cluster with JobManagers (master) and
TaskManagers (workers).

 How Data is Stored and Retrieved: Flink primarily processes data in motion. While it
maintains state for computations (e.g., counts, sums over windows), this state is
typically stored in memory or on local disk (RocksDB) and periodically checkpointed
to a persistent storage (like HDFS or S3) for fault tolerance. It doesn't act as a primary
data store.

 Simple Example: Detecting fraudulent credit card transactions in real-time.

o Source: Ingests credit card transactions as they occur.

o Transformation 1 (Windowing): Groups transactions for a user within a
specific time window (e.g., 5 minutes).

o Transformation 2 (Fraud Logic): Checks if the sum of transactions in the
window exceeds a threshold or if suspicious patterns are observed.

o Sink: Outputs suspicious transactions to an alert system.

 Sample Query (Conceptual - using Flink's Table API/SQL):

SQL

-- Assuming 'transactions' is a streaming table

SELECT userId, SUM(amount)

FROM transactions

GROUP BY TUMBLE(proctime, INTERVAL '5' MINUTE), userId

HAVING SUM(amount) > 1000;
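The tumbling-window aggregation in the SQL above can be simulated in plain Python to make the windowing explicit. This is a single-machine sketch with invented transactions, not PyFlink code; a real Flink job would maintain this state fault-tolerantly across TaskManagers:

```python
from collections import defaultdict

# Hypothetical (user_id, timestamp_in_seconds, amount) transactions.
transactions = [
    ("u1", 10, 400.0),
    ("u1", 120, 700.0),   # same 5-minute window as the first -> total 1100
    ("u2", 30, 200.0),
    ("u1", 400, 50.0),    # falls into the next window
]

WINDOW = 300        # 5-minute tumbling window, as in TUMBLE(..., INTERVAL '5' MINUTE)
THRESHOLD = 1000.0  # as in HAVING SUM(amount) > 1000

# Sum amounts per (user, window-start) pair.
sums = defaultdict(float)
for user, ts, amount in transactions:
    window_start = (ts // WINDOW) * WINDOW
    sums[(user, window_start)] += amount

# Emit an alert for every window whose total exceeds the threshold.
alerts = [key for key, total in sums.items() if total > THRESHOLD]
print(alerts)  # [('u1', 0)]
```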


 Kind of Data Stored (processed): Primarily unbounded data streams (e.g., IoT sensor
data, financial market data, web clickstreams, log data) and bounded batch data.

 Characteristics: Stateful stream processing, exactly-once processing guarantees, low
latency, high throughput, fault-tolerant, supports various time semantics (event time,
processing time).

 Applications Used In: Real-time analytics, event-driven applications, fraud detection,
monitoring, ETL, machine learning.

 Advantages:

o True stream processing capabilities with stateful operations.

o Guaranteed exactly-once processing, even in case of failures.

o Handles both batch and stream processing with a unified API.

o High performance and low latency.

 Disadvantages:

o Can have a steep learning curve due to its advanced concepts (state
management, time).

o Resource-intensive for very large state.

o Operational complexity in managing clusters.

6. Cloudera

 Definition: A company that provides an enterprise data platform built on
open-source technologies like Hadoop, Spark, Hive, Impala, etc. It simplifies the
deployment, management, and use of these complex big data ecosystems.

 Structure: Cloudera's platform (Cloudera Data Platform - CDP) integrates various
open-source components, providing a unified platform for data engineering, data
warehousing, machine learning, and operational databases. It offers management
tools (Cloudera Manager) and security features (Cloudera SDX).

 How Data is Stored and Retrieved: Cloudera itself doesn't store data directly; it
orchestrates and manages data stored in underlying systems like HDFS, S3, or other
compatible storage. Retrieval depends on the specific component being used (e.g.,
Hive for SQL queries on HDFS, Impala for interactive SQL).

 Simple Example: An organization wants to set up a data lake and perform various
analytics.
o Cloudera's Role: Provides the software and tools to easily deploy HDFS for
storage, Hive for data warehousing, Spark for data processing, and Hue for a
web-based interface, all with integrated security and governance.

 Sample Query (depends on underlying tool, e.g., HiveQL via Cloudera Hue):

SQL

SELECT customer_id, SUM(order_total)

FROM sales_data

WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31'

GROUP BY customer_id

HAVING SUM(order_total) > 1000;

 Kind of Data Stored: Supports all kinds of data (structured, semi-structured,
unstructured) as it leverages underlying technologies like HDFS.

 Characteristics: Enterprise-grade, unified platform, hybrid cloud support, strong
security and governance, focuses on data lifecycle.

 Applications Used In: Building data lakes, enterprise data warehousing, advanced
analytics, machine learning platforms, real-time dashboards.

 Advantages:

o Simplifies deployment and management of complex big data ecosystems.

o Provides enterprise-grade security, governance, and data lineage.

o Offers a comprehensive suite of tools for various data workloads.

o Supports hybrid and multi-cloud environments.

 Disadvantages:

o Can be expensive due to licensing and support costs.

o Requires significant hardware resources.

o Complexity can still be high for new users despite simplification.

7. Apache Hive

 Definition: A data warehouse software project built on top of Apache Hadoop for
querying and managing large datasets residing in distributed storage. It provides a
SQL-like language called HiveQL.

 Structure:

o Hive Metastore: Stores metadata (schema, location) of tables and partitions.

o Driver: Manages the lifecycle of a HiveQL query.

o Compiler: Parses HiveQL queries, performs semantic analysis, and generates
a logical plan.

o Optimizer: Transforms the logical plan into a series of MapReduce or
Tez/Spark jobs.

o Execution Engine: Executes the jobs on the Hadoop cluster.

 How Data is Stored and Retrieved:

o Storage: Data is stored in HDFS (or other compatible file systems like S3) in
various formats (e.g., TextFile, ORC, Parquet). Hive itself does not store the
data; it provides a schema and SQL interface over the data in HDFS.

o Retrieval: HiveQL queries are translated into MapReduce, Tez, or Spark jobs,
which then read the data from HDFS, process it, and return the results.

 Simple Example: Analyzing website clickstream data stored in HDFS.

o Storage: Raw clickstream logs (e.g., CSV files) are put into HDFS.

o Table Creation: CREATE EXTERNAL TABLE clickstream (ts STRING,
user_id INT, page_url STRING) ROW FORMAT DELIMITED FIELDS TERMINATED
BY ',' STORED AS TEXTFILE LOCATION '/user/hadoop/clickstream/';
(ts is used instead of timestamp, which is a reserved word in HiveQL.)

o Retrieval: SELECT page_url, COUNT(*) FROM clickstream GROUP BY page_url
ORDER BY COUNT(*) DESC LIMIT 10; (Find top 10 most visited pages).
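Hive's schema-on-read idea, where raw files get a schema only when a query reads them, can be illustrated in plain Python. The CSV content and column names here are invented for the sketch; Hive would do this at scale by compiling the query into MapReduce/Tez/Spark jobs over HDFS:

```python
import csv
import io
from collections import Counter

# Raw CSV bytes as they might sit untouched in HDFS.
raw = io.StringIO(
    "2024-06-01T10:00:00,1,/home\n"
    "2024-06-01T10:01:00,2,/pricing\n"
    "2024-06-01T10:02:00,1,/home\n"
)

# The schema is imposed at read time, not when the data was written.
schema = ("ts", "user_id", "page_url")
rows = [dict(zip(schema, record)) for record in csv.reader(raw)]

# Equivalent of: SELECT page_url, COUNT(*) FROM clickstream
#                GROUP BY page_url ORDER BY COUNT(*) DESC
visits = Counter(row["page_url"] for row in rows)
print(visits.most_common())  # [('/home', 2), ('/pricing', 1)]
```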

 Sample Query:

SQL

SELECT customer_state, COUNT(order_id)

FROM orders

WHERE order_date >= '2024-06-01'

GROUP BY customer_state;

 Kind of Data Stored: Primarily structured and semi-structured data, often in large
batches. It can work with unstructured data if a schema is imposed on it at read time.
 Characteristics: SQL-like interface, batch processing, schema-on-read, integrates with
Hadoop, fault-tolerant.

 Applications Used In: Data warehousing, batch ETL, large-scale data analysis,
reporting, business intelligence.

 Advantages:

o Enables SQL users to query big data in Hadoop without writing complex code.

o Scalable and fault-tolerant by leveraging Hadoop.

o Supports a wide range of data formats.

o Good for long-running batch queries.

 Disadvantages:

o High latency for interactive queries (though improved with Tez/LLAP).

o Not suitable for transactional workloads or real-time processing.

o Schema-on-read can lead to performance issues if not carefully designed.

8. MongoDB

 Definition: A popular open-source NoSQL document database. It stores data in
flexible, JSON-like documents, which means fields can vary from document to
document, and the data structure can be changed over time.

 Structure: Document-oriented. Data is organized into collections (similar to tables),
which contain BSON (Binary JSON) documents. Supports sharding for horizontal
scalability and replication for high availability.

 How Data is Stored and Retrieved:

o Storage: Documents are stored in collections. Each document has a unique
_id field. MongoDB allocates data files and journals for durability. Sharding
distributes data across multiple servers (shards) based on a shard key.

o Retrieval: Queries are executed against collections using a rich query
language that supports various criteria, aggregation pipelines, and indexing.
Data can be retrieved based on specific field values, ranges, or using regular
expressions.

 Simple Example: Storing product catalog information for an e-commerce website.

o Storage: A product document:

JSON
{

"_id": ObjectId("65e4e7e7e7e7e7e7e7e7e7e7"),

"name": "Laptop Pro",

"category": "Electronics",

"price": 1200.00,

"features": ["16GB RAM", "512GB SSD", "Intel i7"],

"reviews": [

{"user": "Alice", "rating": 5, "comment": "Great laptop!"},

{"user": "Bob", "rating": 4, "comment": "Good performance."}

]

}

o Retrieval: db.products.find({"category": "Electronics", "price": {"$gt": 1000}})

 Sample Query:

JavaScript

db.users.insertOne({

"name": "Jane Doe",

"email": "jane@example.com",

"interests": ["reading", "hiking"]

});

db.users.find({"interests": "reading"}, {"name": 1, "email": 1});
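What a filter like {"category": "Electronics", "price": {"$gt": 1000}} means can be shown with a tiny in-memory matcher. This is a pure-Python illustration of the query semantics over invented product documents, covering only equality and $gt, not a MongoDB client:

```python
# In-memory stand-in for a MongoDB collection.
products = [
    {"_id": 1, "name": "Laptop Pro", "category": "Electronics", "price": 1200.00},
    {"_id": 2, "name": "Desk Lamp", "category": "Home", "price": 35.00},
    {"_id": 3, "name": "Phone X", "category": "Electronics", "price": 900.00},
]

def matches(doc, query):
    """Tiny matcher covering plain equality and the $gt operator only."""
    for field, cond in query.items():
        if isinstance(cond, dict):  # operator form, e.g. {"$gt": 1000}
            if "$gt" in cond and not doc[field] > cond["$gt"]:
                return False
        elif doc[field] != cond:    # plain equality
            return False
    return True

# Equivalent of: db.products.find({"category": "Electronics", "price": {"$gt": 1000}})
result = [d["name"] for d in products
          if matches(d, {"category": "Electronics", "price": {"$gt": 1000}})]
print(result)  # ['Laptop Pro']
```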

 Kind of Data Stored: Semi-structured data in JSON/BSON format. Ideal for
hierarchical data and data with evolving schemas.

 Characteristics: Document-oriented, schema-less, highly scalable (sharding), high
performance, rich query language, high availability (replication).

 Applications Used In: Content management systems, e-commerce, mobile
applications, real-time analytics, social networking.

 Advantages:

o Flexible schema allows for rapid development and iteration.

o Scales horizontally with sharding.

o High performance for many read/write operations.

o Rich query language and aggregation framework.

o Easy to get started and use.

 Disadvantages:

o Joins are not natively supported (requires client-side joins or complex
aggregation pipelines).

o Can consume significant memory.

o Lacks ACID transactions for multi-document operations in older versions
(though improved in newer versions).

o Data redundancy can occur due to denormalization.

9. MySQL

 Definition: A widely used open-source relational database management system
(RDBMS). It stores data in structured tables with predefined schemas and enforces
ACID properties.

 Structure: Relational model, where data is organized into tables (relations) with rows
(records) and columns (attributes). Relationships between tables are defined using
primary and foreign keys. Uses storage engines like InnoDB (transactional) and
MyISAM.

 How Data is Stored and Retrieved:

o Storage: Data is stored in tables that conform to a predefined schema. Each
row represents a single record, and columns define the attributes and their
data types. Data files are managed by the storage engine.

o Retrieval: SQL (Structured Query Language) is used to interact with the
database. Queries specify which tables to access, what conditions to apply,
and how to order or aggregate the results.

 Simple Example: Managing customer orders.

o Table Creation:

SQL

CREATE TABLE Customers (

customer_id INT PRIMARY KEY,


name VARCHAR(255),

email VARCHAR(255)

);

CREATE TABLE Orders (

order_id INT PRIMARY KEY,

customer_id INT,

order_date DATE,

total_amount DECIMAL(10, 2),

FOREIGN KEY (customer_id) REFERENCES Customers(customer_id)

);

o Storage: INSERT INTO Customers (customer_id, name, email) VALUES (1,
'Alice', 'alice@example.com');

o Retrieval: SELECT C.name, O.order_id, O.total_amount FROM Customers C
JOIN Orders O ON C.customer_id = O.customer_id WHERE C.name = 'Alice';

 Sample Query:

SQL

INSERT INTO Products (product_id, name, price) VALUES (101, 'Smartphone', 799.99);

UPDATE Products SET price = 749.99 WHERE product_id = 101;

SELECT name, price FROM Products WHERE price < 500 ORDER BY name ASC;
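The relational join from the customer/orders example can be run end to end with Python's built-in sqlite3 module. SQLite is used here only as a lightweight stand-in for MySQL (same relational model, slightly different dialect, e.g. no enforced VARCHAR lengths), and the inserted rows are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    CREATE TABLE Orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES Customers(customer_id),
        order_date TEXT,
        total_amount REAL
    );
    INSERT INTO Customers VALUES (1, 'Alice', 'alice@example.com');
    INSERT INTO Orders VALUES (10, 1, '2024-06-15', 250.00);
""")

# Join Customers to Orders through the foreign key.
rows = conn.execute("""
    SELECT C.name, O.order_id, O.total_amount
    FROM Customers C JOIN Orders O ON C.customer_id = O.customer_id
    WHERE C.name = 'Alice'
""").fetchall()
print(rows)  # [('Alice', 10, 250.0)]
```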

 Kind of Data Stored: Primarily structured data with a fixed schema. Best for
applications requiring strong consistency and transactional integrity.

 Characteristics: Relational, ACID compliant, mature, widely supported, good for
complex joins and aggregations.

 Applications Used In: Web applications (LAMP stack), e-commerce, CRM, ERP
systems, data warehousing (for smaller scale).

 Advantages:

o Strong data integrity with ACID properties.

o Well-established and widely supported with a large community.

o Excellent for complex queries and joins.

o Relatively easy to learn and use.

o High performance for many use cases.

 Disadvantages:

o Scalability challenges for extremely large datasets compared to NoSQL
databases.

o Less flexible schema compared to NoSQL.

o Can become a bottleneck for very high write throughput.

o Vertical scaling often means more expensive hardware.

10. Kaggle

 Definition: An online community and platform for data scientists and machine
learning enthusiasts. It's not a data storage or processing tool in itself, but a platform
that hosts data science competitions, provides datasets, and offers a collaborative
environment for machine learning development.

 Structure: A web-based platform where users can:

o Find Datasets: Access a vast repository of public datasets.

o Participate in Competitions: Solve real-world data science problems with
prizes.

o Share Code (Notebooks): Run Python/R code directly in the browser and
share with the community.

o Discuss: Engage in forums and discussions.

 How Data is Stored and Retrieved:

o Storage: Kaggle hosts datasets (CSV, JSON, images, etc.) on its platform. Users
upload their datasets or use existing ones.

o Retrieval: Users download datasets to their local machines or access them
directly within Kaggle Kernels/Notebooks (cloud-based computational
environments) where the data is readily available for analysis.

 Simple Example: Predicting house prices.


o Kaggle's Role: Provides a dataset of house features and prices. Users can
then:

 Download the dataset.

 Create a Kaggle Notebook.

 Write Python/R code to build a machine learning model (e.g., linear
regression, random forest) to predict prices.

 Submit their predictions to the competition leaderboard.

 Sample Query (Conceptual - within a Python/R notebook): Kaggle doesn't have a
direct query language. Data manipulation is done using programming libraries like
Pandas in Python.

Python

import pandas as pd

df = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')

print(df.head())

print(df['SalePrice'].describe())

 Kind of Data Stored: Diverse datasets for data science and machine learning tasks,
often tabular data (CSV), images, text files, time-series data.

 Characteristics: Community-driven, collaborative, competition-focused, learning
platform, access to diverse datasets, cloud-based coding environment.

 Applications Used In: Machine learning model development, data exploration, skill
development, benchmarking ML algorithms, crowdsourcing solutions to data
problems.

 Advantages:

o Excellent for learning and practicing data science and machine learning.

o Access to a vast array of real-world datasets.

o Opportunities to collaborate and learn from a global community.

o Competitions provide motivation and a chance to win prizes.

o Cloud-based notebooks simplify environment setup.

 Disadvantages:

o Not a production-grade data management system.

o Focuses on individual model building rather than end-to-end data pipelines.

o Can be competitive, leading to a focus on leaderboard performance over
practical insights.
