
BDA Exp (1 To 7)

The document outlines a series of experiments focused on Hadoop and big data technologies, including HDFS operations, Hive for data analysis, MapReduce for word counting and matrix multiplication, and algorithms like DGIM, Bloom Filter, and Flajolet-Martin for stream processing and approximate counting. Each experiment includes software requirements, learning objectives, theoretical background, code procedures, and expected learning outcomes. The overall aim is to provide hands-on experience with distributed data processing and analytics in a big data environment.


Experiment 1: Hadoop HDFS Basics

Aim:
To understand and perform basic file system operations in Hadoop Distributed File System
(HDFS), such as creating directories, uploading files, listing contents, and retrieving or deleting
files from HDFS.

Software Requirements

 Operating System: Ubuntu/Linux (Preferred)
 Java JDK: Version 8 or higher
 Apache Hadoop: Version 2.x or 3.x (single-node or pseudo-distributed mode)
 Terminal/Command Line Interface
 Optional: SSH, SCP, and basic Unix/Linux utilities

Learning Objectives

By the end of this experiment, students will be able to:

1. Understand the role of HDFS in the Hadoop ecosystem.
2. Learn basic HDFS shell commands for file operations.
3. Apply command-line tools to interact with HDFS.
4. Differentiate between local file system and HDFS usage.

Theory:
HDFS (Hadoop Distributed File System) is the primary storage system used by Hadoop
applications. It provides reliable, scalable, and distributed storage designed to run on
commodity hardware. HDFS is optimized for high-throughput access to large datasets and
follows a write-once, read-many model.

Architecture:

 NameNode (Master): Maintains metadata like the namespace, file locations, block
information, etc.
 DataNode (Slave): Stores the actual data blocks on physical disks and serves them to
clients.

Key Characteristics:

 Fault-tolerant via data replication (default: 3 copies)
 Optimized for large file storage and sequential access
 Supports parallel processing by storing data close to computation

Common Use-Cases:

 Storage layer for big data analytics
 Distributed file management
 Data ingestion from various sources

Code / Procedure

Below are essential HDFS commands with syntax and examples:

1. Create a Directory in HDFS

 Syntax:

hdfs dfs -mkdir <hdfs-directory-path>

 Example:

Creates an input folder under the Hadoop user directory in HDFS.
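
For example, a command matching this description (assuming the Hadoop user's home directory is /user/hadoop):

hdfs dfs -mkdir /user/hadoop/input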

2. Upload (Put) a File to HDFS

 Syntax:

hadoop fs -put <local-source-path> <hdfs-destination-path>

 Example:

Uploads sample.txt from the local file system to /user/hadoop/input in HDFS.
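
For example:

hadoop fs -put sample.txt /user/hadoop/input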

3. List Files/Directories in HDFS

 Syntax:

hadoop fs -ls <hdfs-directory-path>

 Example:

Lists contents of the input directory in HDFS.
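
For example, listing the directory created above:

hadoop fs -ls /user/hadoop/input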

4. Read a File from HDFS

 Syntax:

hadoop fs -cat <hdfs-file-path>

 Example:

Displays the content of the file.
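
For example, assuming the sample.txt file uploaded earlier:

hadoop fs -cat /user/hadoop/input/sample.txt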

5. Download (Get) a File from HDFS to Local System

 Syntax:

hadoop fs -get <hdfs-file-path> <local-destination-path>

 Example:

Copies file from HDFS to the local directory /home/user/.
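
For example:

hadoop fs -get /user/hadoop/input/sample.txt /home/user/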


6. Delete a File from HDFS

 Syntax:

hadoop fs -rm <hdfs-file-path>

 Example:

Removes sample.txt from HDFS permanently.
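
For example:

hadoop fs -rm /user/hadoop/input/sample.txt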

Conclusion

In this experiment, the fundamental file operations on Hadoop Distributed File System were
performed successfully. Students were able to navigate HDFS using command-line utilities,
enhancing their understanding of how distributed file systems work in a big data environment.

Learning Outcomes

After completing this lab, students will be able to:

 Execute basic HDFS commands to manage files and directories.


 Understand the architecture and working principles of HDFS.
 Differentiate between local storage and distributed HDFS storage.
 Apply this knowledge in further experiments involving Hive, MapReduce, and Spark.

Experiment 2: Hive and Descriptive Analytics
Aim:
To create a Hive database and table, load structured data, and perform basic descriptive
statistical analysis using Hive Query Language (HiveQL).

Software Requirements

 Apache Hadoop (pre-installed and configured)
 Apache Hive
 Java JDK 8+
 Ubuntu/Linux OS (preferred)
 Sample CSV dataset (e.g., patients.csv)
 Terminal or Hive CLI

Learning Objectives

By the end of this experiment, students will be able to:

1. Understand the fundamentals of Hive and its role in the Hadoop ecosystem.
2. Create and query Hive tables using HiveQL.
3. Perform basic statistical analysis such as count, average, min, and max using SQL-like queries.
4. Import structured data from a CSV file into Hive tables.

Theory:
Apache Hive is a data warehouse system built on top of Hadoop that enables users to query
and manage large datasets using a SQL-like language called HiveQL. Hive translates queries
into MapReduce jobs under the hood, abstracting away the complexity of distributed
computation.

Key Components of Hive:

 Metastore: Stores metadata about databases, tables, partitions.
 Driver: Manages the lifecycle of a HiveQL statement.
 Compiler: Converts HiveQL into MapReduce jobs.
 Execution Engine: Executes the query plan.

Why Hive?

 Allows SQL developers to write queries on large-scale Hadoop data.
 Supports ETL operations.
 Ideal for batch processing and analysis.

Descriptive Analytics in Hive

Descriptive analytics summarizes raw data into meaningful statistics such as:

 COUNT() – Total number of records.
 AVG() – Average of numeric columns.
 MIN() / MAX() – Smallest and largest values in a column.

Code / Procedure

Step 1: Start Hive CLI
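From a terminal on the Hadoop node (assuming Hive is installed and on the PATH):

hive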

Step 2: Create a Hive Database
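For example (the database name hospital is an assumption):

CREATE DATABASE IF NOT EXISTS hospital;
USE hospital;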

Step 3: Create a Table for Patients
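A possible table definition; the column names and types are assumptions chosen to match the queries in Step 5 and the comma-separated patients.csv file:

CREATE TABLE patients (
    id INT,
    name STRING,
    age INT,
    weight FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;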

Step 4: Load Data into Hive Table
Make sure your patients.csv file is present in the local system.
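Then load it into the table (the local path is an assumption):

LOAD DATA LOCAL INPATH '/home/hadoop/patients.csv' INTO TABLE patients;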

Step 5: Perform Descriptive Analytics (sample HiveQL for these queries is sketched after the list below)

 Count total records:

 Average age of patients:

 Minimum and Maximum weight:
SELECT MIN(weight), MAX(weight) FROM patients;

 Display all records:
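
The remaining queries might be written as follows (the MIN/MAX query is already shown inline above):

SELECT COUNT(*) FROM patients;
SELECT AVG(age) FROM patients;
SELECT * FROM patients;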

Conclusion

Hive provides an easy-to-use interface for processing structured data stored in HDFS. In this
experiment, a database and table were created, and descriptive statistical functions were
applied to analyze data using HiveQL. This illustrates how traditional SQL concepts integrate
with big data platforms.

Learning Outcomes

After completing this lab, students will be able to:

 Set up and use Hive for querying large datasets.
 Create Hive tables and load data into them.
 Execute SQL-like commands in Hive to perform statistical analysis.
 Apply Hive as a scalable alternative to traditional RDBMS for analytics.

Experiment 3: Word Count using MapReduce
Aim
To implement and execute a Hadoop MapReduce program that counts the frequency of each
word in a given input text file.

Software Requirements

 Apache Hadoop (configured in pseudo-distributed mode)
 Java Development Kit (JDK) 8 or higher
 Eclipse/VSCode or terminal-based text editor (for writing Java code)
 Ubuntu/Linux OS (preferred)
 Sample text file (e.g., input.txt)

Learning Objectives

By the end of this experiment, students will be able to:

1. Understand the concept of the MapReduce programming model.
2. Implement Mapper and Reducer logic using Java.
3. Execute a basic MapReduce job on Hadoop.
4. Interpret job outputs and logs from the HDFS output directory.

Theory

MapReduce is a programming model for processing large-scale data in parallel on a Hadoop cluster. It breaks the job into two key phases:

 Map phase: Processes input and produces intermediate key-value pairs.
 Reduce phase: Aggregates key-value pairs from the mapper and produces the final
result.

Word Count Problem

A classic example used to demonstrate the MapReduce model is counting the number of
occurrences of each word in a large document.

Code / Procedure

Step 1: Create Input File and Upload to HDFS

Create a file input.txt with sample content:

Big data analytics is powerful. Big data helps businesses.

Upload it to HDFS:

hadoop fs -mkdir /user/hadoop/wordcount
hadoop fs -put input.txt /user/hadoop/wordcount/

Code:
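
The listing below is a minimal sketch in the spirit of the standard Hadoop WordCount example; the single-file layout and the class names WordCount, TokenizerMapper, and IntSumReducer are assumptions:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split each input line into tokens and emit (word, 1)
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum all counts received for the same word
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}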

Step 5: Compile and Run
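One possible way to compile, package, and submit the job (the class directory, jar name, and output path are assumptions):

mkdir -p wordcount_classes
javac -classpath $(hadoop classpath) -d wordcount_classes WordCount.java
jar -cvf wordcount.jar -C wordcount_classes/ .
hadoop jar wordcount.jar WordCount /user/hadoop/wordcount /user/hadoop/wordcount_output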

Step 6: Check Output
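For example, assuming the output path used above:

hadoop fs -cat /user/hadoop/wordcount_output/part-r-00000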

Conclusion

In this experiment, a MapReduce application was successfully implemented to count the frequency of each word in a text file. This demonstrated the capability of the Hadoop
framework to handle distributed processing of large datasets using the Map and Reduce
paradigms.

Learning Outcomes

After completing this lab, students will be able to:

 Understand and explain the MapReduce programming model.
 Implement custom Mapper and Reducer classes in Java.
 Execute a MapReduce job on Hadoop.
 Interpret results generated from distributed data processing.

Experiment 4: Matrix Multiplication using MapReduce

Aim
To implement a Hadoop MapReduce program that performs multiplication of two matrices in
a distributed and parallel environment.

Software Requirements

 Apache Hadoop (version 2.x or 3.x)
 Java JDK 8+
 Linux/Ubuntu OS
 Eclipse/VSCode or terminal-based text editor
 Sample input matrix files (Matrix A and Matrix B)

Learning Objectives

By the end of this experiment, students will be able to:

1. Understand how matrix operations are handled using the MapReduce paradigm.
2. Represent and transform matrix data as key-value pairs.
3. Implement Mapper and Reducer logic to perform multiplication of matrices.
4. Execute and test the matrix multiplication job on a Hadoop system.

Theory
Matrix multiplication is a fundamental linear algebra operation that involves computing the
dot product of rows from one matrix with columns from another. When multiplying two
matrices A (of size m × n) and B (of size n × p), the resulting matrix C has the size m × p and
each element C[i][j] is calculated as:
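C[i][j] = Σ (k = 1 to n) A[i][k] × B[k][j]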

In a distributed environment like Hadoop, matrices are treated as key-value records, and
MapReduce allows parallel computation of each element of the result matrix.

Code / Procedure

Input:

Matrix A (2x2) – file: matrixA.txt
Matrix B (2x2) – file: matrixB.txt
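
A possible on-disk representation, where each line carries matrixName,row,column,value (the actual numbers in the original files are assumptions used for illustration):

matrixA.txt:
A,0,0,1
A,0,1,2
A,1,0,3
A,1,1,4

matrixB.txt:
B,0,0,5
B,0,1,6
B,1,0,7
B,1,1,8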

Step 1: Upload Input Files to HDFS

hadoop fs -mkdir /user/hadoop/matrix
hadoop fs -put matrixA.txt /user/hadoop/matrix/
hadoop fs -put matrixB.txt /user/hadoop/matrix/

Step 2: Mapper Class (MatrixMapper.java)
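A sketch of the mapper, assuming the matrixName,row,column,value input format shown above and the matrix dimensions m (rows of A) and p (columns of B) passed through the job Configuration:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MatrixMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        int m = Integer.parseInt(conf.get("m")); // rows of A (and of C)
        int p = Integer.parseInt(conf.get("p")); // columns of B (and of C)

        // Each input line is assumed to look like: matrixName,row,col,value
        String[] parts = value.toString().trim().split(",");
        String matrix = parts[0];
        int row = Integer.parseInt(parts[1]);
        int col = Integer.parseInt(parts[2]);
        String val = parts[3];

        if (matrix.equals("A")) {
            // A[row][col] contributes to every output cell C[row][k]
            for (int k = 0; k < p; k++) {
                context.write(new Text(row + "," + k), new Text("A," + col + "," + val));
            }
        } else {
            // B[row][col] contributes to every output cell C[i][col]
            for (int i = 0; i < m; i++) {
                context.write(new Text(i + "," + col), new Text("B," + row + "," + val));
            }
        }
    }
}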

Step 3: Reducer Class (MatrixReducer.java)
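A matching reducer sketch: for each result cell C[i][k] it pairs the A and B values that share the summation index j and accumulates the dot product:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MatrixReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // For output cell C[i][k], collect A[i][j] and B[j][k] keyed by the shared index j
        Map<Integer, Double> aRow = new HashMap<>();
        Map<Integer, Double> bCol = new HashMap<>();

        for (Text value : values) {
            String[] parts = value.toString().split(",");
            int j = Integer.parseInt(parts[1]);
            double v = Double.parseDouble(parts[2]);
            if (parts[0].equals("A")) {
                aRow.put(j, v);
            } else {
                bCol.put(j, v);
            }
        }

        // Dot product over the shared index j
        double sum = 0.0;
        for (Map.Entry<Integer, Double> entry : aRow.entrySet()) {
            Double b = bCol.get(entry.getKey());
            if (b != null) {
                sum += entry.getValue() * b;
            }
        }
        context.write(key, new Text(String.valueOf(sum)));
    }
}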

Step 4: Driver Class (MatrixMultiply.java)
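A possible driver; the hard-coded 2x2 dimensions and the use of command-line arguments for the input and output paths are assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MatrixMultiply {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Dimensions of the 2x2 example; adjust for larger matrices
        conf.set("m", "2"); // rows of A
        conf.set("p", "2"); // columns of B

        Job job = Job.getInstance(conf, "matrix multiplication");
        job.setJarByClass(MatrixMultiply.class);
        job.setMapperClass(MatrixMapper.class);
        job.setReducerClass(MatrixReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}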

Step 5: Compile, Package and Run the Job
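For example (the class directory, jar name, and output path are assumptions):

mkdir -p matrix_classes
javac -classpath $(hadoop classpath) -d matrix_classes MatrixMapper.java MatrixReducer.java MatrixMultiply.java
jar -cvf matrixmultiply.jar -C matrix_classes/ .
hadoop jar matrixmultiply.jar MatrixMultiply /user/hadoop/matrix /user/hadoop/matrix_output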

Step 6: Check Output
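For example:

hadoop fs -cat /user/hadoop/matrix_output/part-r-00000

With the sketch above, each output line contains a cell index i,k followed by the computed value of C[i][k].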

Conclusion

In this experiment, students implemented a parallelized matrix multiplication algorithm using Hadoop’s MapReduce model. By decomposing the problem into key-value computations, the
matrix product was successfully computed in a distributed fashion.

Learning Outcomes

After completing this lab, students will be able to:

 Translate matrix operations into the MapReduce paradigm.
 Understand how Hadoop handles complex computations like multiplication.
 Implement and test distributed processing algorithms.
 Apply MapReduce to solve other linear algebra and data transformation problems.

Experiment 5: DGIM Algorithm Implementation
Aim
To implement the DGIM (Datar-Gionis-Indyk-Motwani) algorithm for approximating the
number of 1s in the last N bits of a binary stream using limited memory.

Software Requirements

 Python 3.x or Java (Python preferred for simplicity)
 Text editor (VSCode, Jupyter Notebook, etc.)
 Command line / terminal
 Optional: Matplotlib (for visualization)

Learning Objectives

By the end of this experiment, students will be able to:

1. Understand the challenges of real-time stream processing.
2. Implement the DGIM algorithm to count the approximate number of 1s in the last N
bits of a binary stream.
3. Analyze how space complexity can be reduced using approximation techniques.
4. Observe the trade-off between accuracy and memory usage.

Theory
In big data environments, it's often not feasible to store entire data streams in memory. The
DGIM algorithm provides a space-efficient way to estimate the count of recent events—in
this case, the number of 1s in the last N bits of a stream.

How DGIM Works:

 Uses a bucket-based approach.
 Bucket sizes are powers of 2 (1, 2, 4, 8, …).
 Only keeps a logarithmic number of buckets.
 Combines older buckets to preserve space.

Each bucket represents a group of 1s. It stores:

 The timestamp (position in stream),
 The size of the bucket (number of 1s it represents).

When a new 1 arrives:

 A new bucket of size 1 is created.
 If more than two buckets of the same size exist, the oldest two are merged.

Estimation:
To estimate the number of 1s in the last N bits, sum the sizes of the buckets that fall within
the window. The oldest bucket’s size is halved to maintain an upper bound.

Code / Procedure (Python):
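
A minimal sketch of the DGIM scheme described above; the window size, the randomly generated test stream, and the class and method names are assumptions:

class DGIM:
    """Approximate count of 1s in the last N bits of a binary stream."""

    def __init__(self, window_size):
        self.N = window_size
        self.t = 0            # current position in the stream
        self.buckets = []     # list of (timestamp, size) pairs

    def add(self, bit):
        self.t += 1
        # Drop buckets whose most recent 1 has left the window of the last N bits
        self.buckets = [(ts, sz) for (ts, sz) in self.buckets if ts > self.t - self.N]
        if bit != 1:
            return
        # Every arriving 1 starts a new bucket of size 1
        self.buckets.append((self.t, 1))
        # Enforce the DGIM invariant: at most two buckets of any size
        size = 1
        while True:
            same = sorted(ts for (ts, sz) in self.buckets if sz == size)
            if len(same) <= 2:
                break
            # Merge the two oldest buckets of this size into one of double size,
            # keeping the newer of their two timestamps
            old1, old2 = same[0], same[1]
            self.buckets = [(ts, sz) for (ts, sz) in self.buckets
                            if not (sz == size and ts in (old1, old2))]
            self.buckets.append((old2, size * 2))
            size *= 2

    def estimate(self):
        if not self.buckets:
            return 0
        total = sum(sz for (_, sz) in self.buckets)
        oldest_size = min(self.buckets, key=lambda b: b[0])[1]
        # Only about half of the oldest bucket is assumed to lie inside the window
        return int(total - oldest_size / 2)


if __name__ == "__main__":
    import random
    random.seed(7)
    N = 100
    stream = [random.randint(0, 1) for _ in range(1000)]
    dgim = DGIM(N)
    for bit in stream:
        dgim.add(bit)
    print("Actual 1s in last", N, "bits:", sum(stream[-N:]))
    print("DGIM estimate:", dgim.estimate())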

Output:

Conclusion

The DGIM algorithm was successfully implemented to process a binary stream in real time
and approximate the number of 1s in the latest window of data. It demonstrated how space
efficiency can be achieved without sacrificing too much accuracy in stream analytics.

Learning Outcomes
After completing this lab, students will be able to:

 Implement real-time algorithms for streaming data.
 Approximate counts using logarithmic memory.
 Understand the importance of probabilistic algorithms in big data.
 Analyze trade-offs between precision and performance.

Experiment 6: Bloom Filter Implementation
Aim
To implement a Bloom Filter for efficient membership testing and understand its use in big
data systems where approximate set membership queries are acceptable.

Software Requirements

 Python 3.x or Java (Python recommended for simplicity)
 Text Editor or IDE (VSCode, Jupyter, etc.)
 Terminal/Command Line
 hashlib module (built-in in Python)

Learning Objectives

By the end of this experiment, students will be able to:

1. Understand the concept of probabilistic data structures.
2. Implement a Bloom Filter with multiple hash functions.
3. Demonstrate the time-space trade-off in large-scale data systems.
4. Analyze false positives and understand their implications in real-world applications.

Theory
A Bloom Filter is a space-efficient, probabilistic data structure used to test whether an
element is a member of a set. It may return false positives (element may be reported present
when it is not), but never false negatives (if reported absent, it is definitely absent).

Working Principle:

 Uses a bit array of size m, initialized to 0.
 Uses k different independent hash functions.
 To add an element: Each hash function is applied, and the resulting index positions in
the bit array are set to 1.
 To check membership: Apply the hash functions again. If all corresponding bits are 1,
it might be in the set; if any bit is 0, it’s definitely not.

Trade-offs:

 Reduces space complexity drastically.
 Fast insertion and query operations (O(k)).
 Comes with the risk of false positives, controlled by m and k.

Code / Procedure (Python)
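
A minimal sketch using a plain bit list and k salted MD5 hashes from hashlib; the bit-array size, the number of hash functions, and the sample words are assumptions:

import hashlib


class BloomFilter:
    def __init__(self, size=500, num_hashes=3):
        self.size = size                # m: length of the bit array
        self.num_hashes = num_hashes    # k: number of hash functions
        self.bits = [0] * size

    def _indexes(self, item):
        # Derive k bit positions by salting MD5 with the hash-function index
        for i in range(self.num_hashes):
            digest = hashlib.md5((str(i) + item).encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx] = 1

    def might_contain(self, item):
        # True  -> possibly in the set (false positives are possible)
        # False -> definitely not in the set
        return all(self.bits[idx] for idx in self._indexes(item))


if __name__ == "__main__":
    bf = BloomFilter(size=500, num_hashes=3)
    for word in ["hadoop", "hive", "spark", "kafka"]:
        bf.add(word)
    for query in ["hadoop", "flink", "spark", "storm"]:
        result = "possibly in set" if bf.might_contain(query) else "definitely not in set"
        print(query, "->", result)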

Conclusion
The Bloom Filter was successfully implemented and tested for set membership operations. It
showed how we can significantly reduce memory usage while tolerating a small probability
of error (false positives). This makes it a suitable choice for large-scale systems such as spam
filters, web caching, and network intrusion detection.

Learning Outcomes

After completing this lab, students will be able to:

 Design and implement probabilistic data structures.
 Evaluate trade-offs between accuracy and performance.
 Apply Bloom Filters to practical big data applications.
 Analyze the effects of false positives in large-scale systems.

Experiment 7: Flajolet-Martin Algorithm
Aim
To implement the Flajolet-Martin (FM) algorithm to estimate the number of distinct elements
(cardinality) in a large data stream using probabilistic techniques.

Software Requirements

 Python 3.x or Java (Python preferred)
 Terminal/Command Line
 Text Editor or IDE (e.g., Jupyter, VSCode)
 Built-in hashlib module in Python

Learning Objectives

By the end of this experiment, students will be able to:

1. Understand the concept of cardinality estimation in data streams.
2. Implement the Flajolet-Martin algorithm for distinct count approximation.
3. Analyze the trade-off between accuracy and memory usage in big data.
4. Apply hash-based techniques for real-time analytics.

Theory
The Flajolet-Martin algorithm is a streaming algorithm that estimates the number of
distinct elements in a data stream using a small, fixed amount of memory.

Concept Overview:

 Hash each incoming element to a binary string.
 Count the maximum number of trailing zeros in any hashed value seen so far.
 If the maximum number of trailing zeros is R, the estimated number of distinct elements is roughly 2^R.

Why It Works:

 Uniformly hashed values produce trailing zeros with a probability that decreases
exponentially.
 More unique items → greater chance of encountering a hash with many trailing zeros.

Improvement:

To improve accuracy, multiple hash functions (or multiple estimators) are used, and their individual estimates are combined by taking the average or the median.

Code / Procedure (Python)
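
A minimal sketch that runs several salted MD5-based estimators and averages their 2^R values, as suggested in the Improvement note above; the number of hash functions and the synthetic test stream are assumptions:

import hashlib


def trailing_zeros(x):
    # Number of trailing zero bits in x (convention: 0 for x == 0)
    if x == 0:
        return 0
    count = 0
    while x & 1 == 0:
        x >>= 1
        count += 1
    return count


def flajolet_martin(stream, num_hashes=10):
    # Run several salted estimators and average their 2^R estimates
    estimates = []
    for i in range(num_hashes):
        max_zeros = 0
        for item in stream:
            h = int(hashlib.md5((str(i) + str(item)).encode()).hexdigest(), 16)
            max_zeros = max(max_zeros, trailing_zeros(h))
        estimates.append(2 ** max_zeros)
    return sum(estimates) / len(estimates)


if __name__ == "__main__":
    # Synthetic stream: 120 distinct user IDs repeated many times
    stream = ["user{}".format(i % 120) for i in range(1000)]
    print("Actual distinct elements:", len(set(stream)))
    print("FM estimate:", flajolet_martin(stream))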

Conclusion

The Flajolet-Martin algorithm was successfully implemented to estimate the number of distinct elements in a data stream. The result demonstrates the usefulness of probabilistic
algorithms in scenarios where full storage of data is impractical or impossible.

Learning Outcomes

After completing this lab, students will be able to:

 Apply streaming algorithms for approximate analytics.
 Implement and validate the Flajolet-Martin algorithm.
 Understand the role of hash functions in big data.
 Use space-efficient methods to derive insights from massive data streams.