
BDA Exp (1 To 7)

The document outlines a series of experiments focused on Hadoop and big data technologies, including HDFS operations, Hive for data analysis, MapReduce for word counting and matrix multiplication, and algorithms like DGIM, Bloom Filter, and Flajolet-Martin for stream processing and approximate counting. Each experiment includes software requirements, learning objectives, theoretical background, code procedures, and expected learning outcomes. The overall aim is to provide hands-on experience with distributed data processing and analytics in a big data environment.


Experiment 1: Hadoop HDFS Basics

Aim:
To understand and perform basic file system operations in Hadoop Distributed File System
(HDFS), such as creating directories, uploading files, listing contents, and retrieving or deleting
files from HDFS.

Software Requirements

 Operating System: Ubuntu/Linux (Preferred)
 Java JDK: Version 8 or higher
 Apache Hadoop: Version 2.x or 3.x (single-node or pseudo-distributed mode)
 Terminal/Command Line Interface
 Optional: SSH, SCP, and basic Unix/Linux utilities

Learning Objectives

By the end of this experiment, students will be able to:

1. Understand the role of HDFS in the Hadoop ecosystem.
2. Learn basic HDFS shell commands for file operations.
3. Apply command-line tools to interact with HDFS.
4. Differentiate between local file system and HDFS usage.

Theory:
HDFS (Hadoop Distributed File System) is the primary storage system used by Hadoop
applications. It provides reliable, scalable, and distributed storage designed to run on
commodity hardware. HDFS is optimized for high-throughput access to large datasets and
follows a write-once, read-many model.

Architecture:

 NameNode (Master): Maintains metadata like the namespace, file locations, block
information, etc.
 DataNode (Slave): Stores the actual data blocks on physical disks and serves them to
clients.

Key Characteristics:

 Fault-tolerant via data replication (default: 3 copies)
 Optimized for large file storage and sequential access
 Supports parallel processing by storing data close to computation

Common Use-Cases:

 Storage layer for big data analytics
 Distributed file management
 Data ingestion from various sources

Code / Procedure

Below are essential HDFS commands with syntax and examples:

1. Create a Directory in HDFS

 Syntax:

hdfs dfs -mkdir <hdfs-directory-path>

 Example:

Creates an input folder under the Hadoop user directory in HDFS.
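
For example, a command matching this description (assuming the Hadoop user's home directory is /user/hadoop):

hdfs dfs -mkdir /user/hadoop/input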

2. Upload (Put) a File to HDFS

 Syntax:

hadoop fs -put <local-source-path> <hdfs-destination-path>

 Example:

Uploads sample.txt from the local file system to /user/hadoop/input in HDFS.
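
For example:

hadoop fs -put sample.txt /user/hadoop/input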

3. List Files/Directories in HDFS

 Syntax:

hadoop fs -ls <hdfs-directory-path>

 Example:

Lists contents of the input directory in HDFS.
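
For example, listing the directory created above:

hadoop fs -ls /user/hadoop/input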

4. Read a File from HDFS

 Syntax:

hadoop fs -cat <hdfs-file-path>

 Example:

Displays the content of the file.
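
For example, assuming the sample.txt file uploaded earlier:

hadoop fs -cat /user/hadoop/input/sample.txt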

5. Download (Get) a File from HDFS to Local System

 Syntax:

hadoop fs -get <hdfs-file-path> <local-destination-path>

 Example:

Copies file from HDFS to the local directory /home/user/.
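
For example:

hadoop fs -get /user/hadoop/input/sample.txt /home/user/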


6. Delete a File from HDFS

 Syntax:

hadoop fs -rm <hdfs-file-path>

 Example:

Removes sample.txt from HDFS permanently.
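
For example:

hadoop fs -rm /user/hadoop/input/sample.txt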

Conclusion

In this experiment, the fundamental file operations on Hadoop Distributed File System were
performed successfully. Students were able to navigate HDFS using command-line utilities,
enhancing their understanding of how distributed file systems work in a big data environment.

Learning Outcomes

After completing this lab, students will be able to:

 Execute basic HDFS commands to manage files and directories.


 Understand the architecture and working principles of HDFS.
 Differentiate between local storage and distributed HDFS storage.
 Apply this knowledge in further experiments involving Hive, MapReduce, and Spark.

Experiment 2: Hive and Descriptive Analytics
Aim:
To create a Hive database and table, load structured data, and perform basic descriptive
statistical analysis using Hive Query Language (HiveQL).

Software Requirements

 Apache Hadoop (pre-installed and configured)
 Apache Hive
 Java JDK 8+
 Ubuntu/Linux OS (preferred)
 Sample CSV dataset (e.g., patients.csv)
 Terminal or Hive CLI

Learning Objectives

By the end of this experiment, students will be able to:

1. Understand the fundamentals of Hive and its role in the Hadoop ecosystem.
2. Create and query Hive tables using HiveQL.
3. Perform basic statistical analysis such as count, average, min, and max using SQL-like queries.
4. Import structured data from a CSV file into Hive tables.

Theory:
Apache Hive is a data warehouse system built on top of Hadoop that enables users to query
and manage large datasets using a SQL-like language called HiveQL. Hive translates queries
into MapReduce jobs under the hood, abstracting away the complexity of distributed
computation.

Key Components of Hive:

 Metastore: Stores metadata about databases, tables, partitions.
 Driver: Manages the lifecycle of a HiveQL statement.
 Compiler: Converts HiveQL into MapReduce jobs.
 Execution Engine: Executes the query plan.

Why Hive?

 Allows SQL developers to write queries on large-scale Hadoop data.
 Supports ETL operations.
 Ideal for batch processing and analysis.

Descriptive Analytics in Hive

Descriptive analytics summarizes raw data into meaningful statistics such as:

 COUNT() – Total number of records.
 AVG() – Average of numeric columns.
 MIN() / MAX() – Smallest and largest values in a column.

Code / Procedure

Step 1: Start Hive CLI
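From a terminal on the Hadoop node (assuming Hive is installed and on the PATH):

hive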

Step 2: Create a Hive Database
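For example (the database name hospital is an assumption):

CREATE DATABASE IF NOT EXISTS hospital;
USE hospital;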

Step 3: Create a Table for Patients
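A possible table definition; the column names and types are assumptions chosen to match the queries in Step 5 and the comma-separated patients.csv file:

CREATE TABLE patients (
    id INT,
    name STRING,
    age INT,
    weight FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;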

Step 4: Load Data into Hive Table
Make sure your patients.csv file is present in the local system.
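Then load it into the table (the local path is an assumption):

LOAD DATA LOCAL INPATH '/home/hadoop/patients.csv' INTO TABLE patients;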

Step 5: Perform Descriptive Analytics (sample HiveQL for these queries is sketched after the list below)

 Count total records:

 Average age of patients:

 Minimum and Maximum weight:
SELECT MIN(weight), MAX(weight) FROM patients;

 Display all records:
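
The remaining queries might be written as follows (the MIN/MAX query is already shown inline above):

SELECT COUNT(*) FROM patients;
SELECT AVG(age) FROM patients;
SELECT * FROM patients;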

Conclusion

Hive provides an easy-to-use interface for processing structured data stored in HDFS. In this
experiment, a database and table were created, and descriptive statistical functions were
applied to analyze data using HiveQL. This illustrates how traditional SQL concepts integrate
with big data platforms.

Learning Outcomes

After completing this lab, students will be able to:

 Set up and use Hive for querying large datasets.
 Create Hive tables and load data into them.
 Execute SQL-like commands in Hive to perform statistical analysis.
 Apply Hive as a scalable alternative to traditional RDBMS for analytics.

Experiment 3: Word Count using MapReduce
Aim
To implement and execute a Hadoop MapReduce program that counts the frequency of each
word in a given input text file.

Software Requirements

 Apache Hadoop (configured in pseudo-distributed mode)
 Java Development Kit (JDK) 8 or higher
 Eclipse/VSCode or terminal-based text editor (for writing Java code)
 Ubuntu/Linux OS (preferred)
 Sample text file (e.g., input.txt)

Learning Objectives

By the end of this experiment, students will be able to:

1. Understand the concept of the MapReduce programming model.
2. Implement Mapper and Reducer logic using Java.
3. Execute a basic MapReduce job on Hadoop.
4. Interpret job outputs and logs from the HDFS output directory.

Theory

MapReduce is a programming model for processing large-scale data in parallel on a Hadoop cluster. It breaks the job into two key phases:

 Map phase: Processes input and produces intermediate key-value pairs.
 Reduce phase: Aggregates key-value pairs from the mapper and produces the final
result.

Word Count Problem

A classic example used to demonstrate the MapReduce model is counting the number of
occurrences of each word in a large document.

Code / Procedure

Step 1: Create Input File and Upload to HDFS

Create a file input.txt with sample content:

Big data analytics is powerful. Big data helps businesses.

Upload it to HDFS:

hadoop fs -mkdir /user/hadoop/wordcount
hadoop fs -put input.txt /user/hadoop/wordcount/

Code:
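
The listing below is a minimal sketch in the spirit of the standard Hadoop WordCount example; the single-file layout and the class names WordCount, TokenizerMapper, and IntSumReducer are assumptions:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split each input line into tokens and emit (word, 1)
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum all counts received for the same word
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}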

Step 5: Compile and Run
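One possible way to compile, package, and submit the job (the class directory, jar name, and output path are assumptions):

mkdir -p wordcount_classes
javac -classpath $(hadoop classpath) -d wordcount_classes WordCount.java
jar -cvf wordcount.jar -C wordcount_classes/ .
hadoop jar wordcount.jar WordCount /user/hadoop/wordcount /user/hadoop/wordcount_output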

Step 6: Check Output
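For example, assuming the output path used above:

hadoop fs -cat /user/hadoop/wordcount_output/part-r-00000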

Conclusion

In this experiment, a MapReduce application was successfully implemented to count the frequency of each word in a text file. This demonstrated the capability of the Hadoop
framework to handle distributed processing of large datasets using the Map and Reduce
paradigms.

Learning Outcomes

After completing this lab, students will be able to:

 Understand and explain the MapReduce programming model.
 Implement custom Mapper and Reducer classes in Java.
 Execute a MapReduce job on Hadoop.
 Interpret results generated from distributed data processing.

Experiment 4: Matrix Multiplication using MapReduce

Aim
To implement a Hadoop MapReduce program that performs multiplication of two matrices in
a distributed and parallel environment.

Software Requirements

 Apache Hadoop (version 2.x or 3.x)
 Java JDK 8+
 Linux/Ubuntu OS
 Eclipse/VSCode or terminal-based text editor
 Sample input matrix files (Matrix A and Matrix B)

Learning Objectives

By the end of this experiment, students will be able to:

1. Understand how matrix operations are handled using the MapReduce paradigm.
2. Represent and transform matrix data as key-value pairs.
3. Implement Mapper and Reducer logic to perform multiplication of matrices.
4. Execute and test the matrix multiplication job on a Hadoop system.

Theory
Matrix multiplication is a fundamental linear algebra operation that involves computing the
dot product of rows from one matrix with columns from another. When multiplying two
matrices A (of size m × n) and B (of size n × p), the resulting matrix C has the size m × p and
each element C[i][j] is calculated as:
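C[i][j] = Σ (k = 1 to n) A[i][k] × B[k][j]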

In a distributed environment like Hadoop, matrices are treated as key-value records, and
MapReduce allows parallel computation of each element of the result matrix.

Code / Procedure

Input:

Matrix A (2x2) – file: matrixA.txt
Matrix B (2x2) – file: matrixB.txt
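
A possible on-disk representation, where each line carries matrixName,row,column,value (the actual numbers in the original files are assumptions used for illustration):

matrixA.txt:
A,0,0,1
A,0,1,2
A,1,0,3
A,1,1,4

matrixB.txt:
B,0,0,5
B,0,1,6
B,1,0,7
B,1,1,8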

Step 1: Upload Input Files to HDFS

hadoop fs -mkdir /user/hadoop/matrix
hadoop fs -put matrixA.txt /user/hadoop/matrix/
hadoop fs -put matrixB.txt /user/hadoop/matrix/

Step 2: Mapper Class (MatrixMapper.java)
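A sketch of the mapper, assuming the matrixName,row,column,value input format shown above and the matrix dimensions m (rows of A) and p (columns of B) passed through the job Configuration:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MatrixMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        int m = Integer.parseInt(conf.get("m")); // rows of A (and of C)
        int p = Integer.parseInt(conf.get("p")); // columns of B (and of C)

        // Each input line is assumed to look like: matrixName,row,col,value
        String[] parts = value.toString().trim().split(",");
        String matrix = parts[0];
        int row = Integer.parseInt(parts[1]);
        int col = Integer.parseInt(parts[2]);
        String val = parts[3];

        if (matrix.equals("A")) {
            // A[row][col] contributes to every output cell C[row][k]
            for (int k = 0; k < p; k++) {
                context.write(new Text(row + "," + k), new Text("A," + col + "," + val));
            }
        } else {
            // B[row][col] contributes to every output cell C[i][col]
            for (int i = 0; i < m; i++) {
                context.write(new Text(i + "," + col), new Text("B," + row + "," + val));
            }
        }
    }
}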

Step 3: Reducer Class (MatrixReducer.java)
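A matching reducer sketch: for each result cell C[i][k] it pairs the A and B values that share the summation index j and accumulates the dot product:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MatrixReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // For output cell C[i][k], collect A[i][j] and B[j][k] keyed by the shared index j
        Map<Integer, Double> aRow = new HashMap<>();
        Map<Integer, Double> bCol = new HashMap<>();

        for (Text value : values) {
            String[] parts = value.toString().split(",");
            int j = Integer.parseInt(parts[1]);
            double v = Double.parseDouble(parts[2]);
            if (parts[0].equals("A")) {
                aRow.put(j, v);
            } else {
                bCol.put(j, v);
            }
        }

        // Dot product over the shared index j
        double sum = 0.0;
        for (Map.Entry<Integer, Double> entry : aRow.entrySet()) {
            Double b = bCol.get(entry.getKey());
            if (b != null) {
                sum += entry.getValue() * b;
            }
        }
        context.write(key, new Text(String.valueOf(sum)));
    }
}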

Step 4: Driver Class (MatrixMultiply.java)
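A possible driver; the hard-coded 2x2 dimensions and the use of command-line arguments for the input and output paths are assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MatrixMultiply {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Dimensions of the 2x2 example; adjust for larger matrices
        conf.set("m", "2"); // rows of A
        conf.set("p", "2"); // columns of B

        Job job = Job.getInstance(conf, "matrix multiplication");
        job.setJarByClass(MatrixMultiply.class);
        job.setMapperClass(MatrixMapper.class);
        job.setReducerClass(MatrixReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}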

Step 5: Compile, Package and Run the Job
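For example (the class directory, jar name, and output path are assumptions):

mkdir -p matrix_classes
javac -classpath $(hadoop classpath) -d matrix_classes MatrixMapper.java MatrixReducer.java MatrixMultiply.java
jar -cvf matrixmultiply.jar -C matrix_classes/ .
hadoop jar matrixmultiply.jar MatrixMultiply /user/hadoop/matrix /user/hadoop/matrix_output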

Step 6: Check Output
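For example:

hadoop fs -cat /user/hadoop/matrix_output/part-r-00000

With the sketch above, each output line contains a cell index i,k followed by the computed value of C[i][k].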

Conclusion

In this experiment, students implemented a parallelized matrix multiplication algorithm using Hadoop’s MapReduce model. By decomposing the problem into key-value computations, the
matrix product was successfully computed in a distributed fashion.

Learning Outcomes

After completing this lab, students will be able to:

 Translate matrix operations into the MapReduce paradigm.
 Understand how Hadoop handles complex computations like multiplication.
 Implement and test distributed processing algorithms.
 Apply MapReduce to solve other linear algebra and data transformation problems.

Experiment 5: DGIM Algorithm Implementation
Aim
To implement the DGIM (Datar-Gionis-Indyk-Motwani) algorithm for approximating the
number of 1s in the last N bits of a binary stream using limited memory.

Software Requirements

 Python 3.x or Java (Python preferred for simplicity)
 Text editor (VSCode, Jupyter Notebook, etc.)
 Command line / terminal
 Optional: Matplotlib (for visualization)

Learning Objectives

By the end of this experiment, students will be able to:

1. Understand the challenges of real-time stream processing.
2. Implement the DGIM algorithm to count the approximate number of 1s in the last N
bits of a binary stream.
3. Analyze how space complexity can be reduced using approximation techniques.
4. Observe the trade-off between accuracy and memory usage.

Theory
In big data environments, it's often not feasible to store entire data streams in memory. The
DGIM algorithm provides a space-efficient way to estimate the count of recent events—in
this case, the number of 1s in the last N bits of a stream.

How DGIM Works:

 Uses a bucket-based approach.
 Bucket sizes are powers of 2 (1, 2, 4, 8, …).
 Only keeps a logarithmic number of buckets.
 Combines older buckets to preserve space.

Each bucket represents a group of 1s. It stores:

 The timestamp (position in stream),
 The size of the bucket (number of 1s it represents).

When a new 1 arrives:

 A new bucket of size 1 is created.
 If more than two buckets of the same size exist, the oldest two are merged.

Estimation:
To estimate the number of 1s in the last N bits, sum the sizes of the buckets that fall within
the window. The oldest bucket’s size is halved to maintain an upper bound.

Code / Procedure (Python):
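
A minimal sketch of the DGIM scheme described above; the window size, the randomly generated test stream, and the class and method names are assumptions:

class DGIM:
    """Approximate count of 1s in the last N bits of a binary stream."""

    def __init__(self, window_size):
        self.N = window_size
        self.t = 0            # current position in the stream
        self.buckets = []     # list of (timestamp, size) pairs

    def add(self, bit):
        self.t += 1
        # Drop buckets whose most recent 1 has left the window of the last N bits
        self.buckets = [(ts, sz) for (ts, sz) in self.buckets if ts > self.t - self.N]
        if bit != 1:
            return
        # Every arriving 1 starts a new bucket of size 1
        self.buckets.append((self.t, 1))
        # Enforce the DGIM invariant: at most two buckets of any size
        size = 1
        while True:
            same = sorted(ts for (ts, sz) in self.buckets if sz == size)
            if len(same) <= 2:
                break
            # Merge the two oldest buckets of this size into one of double size,
            # keeping the newer of their two timestamps
            old1, old2 = same[0], same[1]
            self.buckets = [(ts, sz) for (ts, sz) in self.buckets
                            if not (sz == size and ts in (old1, old2))]
            self.buckets.append((old2, size * 2))
            size *= 2

    def estimate(self):
        if not self.buckets:
            return 0
        total = sum(sz for (_, sz) in self.buckets)
        oldest_size = min(self.buckets, key=lambda b: b[0])[1]
        # Only about half of the oldest bucket is assumed to lie inside the window
        return int(total - oldest_size / 2)


if __name__ == "__main__":
    import random
    random.seed(7)
    N = 100
    stream = [random.randint(0, 1) for _ in range(1000)]
    dgim = DGIM(N)
    for bit in stream:
        dgim.add(bit)
    print("Actual 1s in last", N, "bits:", sum(stream[-N:]))
    print("DGIM estimate:", dgim.estimate())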

Output:

Conclusion

The DGIM algorithm was successfully implemented to process a binary stream in real time
and approximate the number of 1s in the latest window of data. It demonstrated how space
efficiency can be achieved without sacrificing too much accuracy in stream analytics.

Learning Outcomes
After completing this lab, students will be able to:

 Implement real-time algorithms for streaming data.
 Approximate counts using logarithmic memory.
 Understand the importance of probabilistic algorithms in big data.
 Analyze trade-offs between precision and performance.

Experiment 6: Bloom Filter Implementation
Aim
To implement a Bloom Filter for efficient membership testing and understand its use in big
data systems where approximate set membership queries are acceptable.

Software Requirements

 Python 3.x or Java (Python recommended for simplicity)
 Text Editor or IDE (VSCode, Jupyter, etc.)
 Terminal/Command Line
 hashlib module (built-in in Python)

Learning Objectives

By the end of this experiment, students will be able to:

1. Understand the concept of probabilistic data structures.
2. Implement a Bloom Filter with multiple hash functions.
3. Demonstrate the time-space trade-off in large-scale data systems.
4. Analyze false positives and understand their implications in real-world applications.

Theory
A Bloom Filter is a space-efficient, probabilistic data structure used to test whether an
element is a member of a set. It may return false positives (element may be reported present
when it is not), but never false negatives (if reported absent, it is definitely absent).

Working Principle:

 Uses a bit array of size m, initialized to 0.
 Uses k different independent hash functions.
 To add an element: Each hash function is applied, and the resulting index positions in
the bit array are set to 1.
 To check membership: Apply the hash functions again. If all corresponding bits are 1,
it might be in the set; if any bit is 0, it’s definitely not.

Trade-offs:

 Reduces space complexity drastically.
 Fast insertion and query operations (O(k)).
 Comes with the risk of false positives, controlled by m and k.

Code / Procedure (Python)
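
A minimal sketch using a plain bit list and k salted MD5 hashes from hashlib; the bit-array size, the number of hash functions, and the sample words are assumptions:

import hashlib


class BloomFilter:
    def __init__(self, size=500, num_hashes=3):
        self.size = size                # m: length of the bit array
        self.num_hashes = num_hashes    # k: number of hash functions
        self.bits = [0] * size

    def _indexes(self, item):
        # Derive k bit positions by salting MD5 with the hash-function index
        for i in range(self.num_hashes):
            digest = hashlib.md5((str(i) + item).encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx] = 1

    def might_contain(self, item):
        # True  -> possibly in the set (false positives are possible)
        # False -> definitely not in the set
        return all(self.bits[idx] for idx in self._indexes(item))


if __name__ == "__main__":
    bf = BloomFilter(size=500, num_hashes=3)
    for word in ["hadoop", "hive", "spark", "kafka"]:
        bf.add(word)
    for query in ["hadoop", "flink", "spark", "storm"]:
        result = "possibly in set" if bf.might_contain(query) else "definitely not in set"
        print(query, "->", result)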

Conclusion
The Bloom Filter was successfully implemented and tested for set membership operations. It
showed how we can significantly reduce memory usage while tolerating a small probability
of error (false positives). This makes it a suitable choice for large-scale systems such as spam
filters, web caching, and network intrusion detection.

Learning Outcomes

After completing this lab, students will be able to:

 Design and implement probabilistic data structures.
 Evaluate trade-offs between accuracy and performance.
 Apply Bloom Filters to practical big data applications.
 Analyze the effects of false positives in large-scale systems.

Experiment 7: Flajolet-Martin Algorithm
Aim
To implement the Flajolet-Martin (FM) algorithm to estimate the number of distinct elements
(cardinality) in a large data stream using probabilistic techniques.

Software Requirements

 Python 3.x or Java (Python preferred)
 Terminal/Command Line
 Text Editor or IDE (e.g., Jupyter, VSCode)
 Built-in hashlib module in Python

Learning Objectives

By the end of this experiment, students will be able to:

1. Understand the concept of cardinality estimation in data streams.
2. Implement the Flajolet-Martin algorithm for distinct count approximation.
3. Analyze the trade-off between accuracy and memory usage in big data.
4. Apply hash-based techniques for real-time analytics.

Theory
The Flajolet-Martin algorithm is a streaming algorithm that estimates the number of
distinct elements in a data stream using a small, fixed amount of memory.

Concept Overview:

 Hash each incoming element to a binary string.
 Count the maximum number of trailing zeros in any hashed value seen so far.
 If the maximum number of trailing zeros is R, the estimated number of distinct elements is roughly 2^R.

Why It Works:

 Uniformly hashed values produce trailing zeros with a probability that decreases
exponentially.
 More unique items → greater chance of encountering a hash with many trailing zeros.

Improvement:

To improve accuracy, multiple hash functions (or multiple estimators) are used, and their individual estimates are combined by taking the average or the median.

Code / Procedure (Python)
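
A minimal sketch that runs several salted MD5-based estimators and averages their 2^R values, as suggested in the Improvement note above; the number of hash functions and the synthetic test stream are assumptions:

import hashlib


def trailing_zeros(x):
    # Number of trailing zero bits in x (convention: 0 for x == 0)
    if x == 0:
        return 0
    count = 0
    while x & 1 == 0:
        x >>= 1
        count += 1
    return count


def flajolet_martin(stream, num_hashes=10):
    # Run several salted estimators and average their 2^R estimates
    estimates = []
    for i in range(num_hashes):
        max_zeros = 0
        for item in stream:
            h = int(hashlib.md5((str(i) + str(item)).encode()).hexdigest(), 16)
            max_zeros = max(max_zeros, trailing_zeros(h))
        estimates.append(2 ** max_zeros)
    return sum(estimates) / len(estimates)


if __name__ == "__main__":
    # Synthetic stream: 120 distinct user IDs repeated many times
    stream = ["user{}".format(i % 120) for i in range(1000)]
    print("Actual distinct elements:", len(set(stream)))
    print("FM estimate:", flajolet_martin(stream))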

Conclusion

The Flajolet-Martin algorithm was successfully implemented to estimate the number of distinct elements in a data stream. The result demonstrates the usefulness of probabilistic
algorithms in scenarios where full storage of data is impractical or impossible.

Learning Outcomes

After completing this lab, students will be able to:

 Apply streaming algorithms for approximate analytics.
 Implement and validate the Flajolet-Martin algorithm.
 Understand the role of hash functions in big data.
 Use space-efficient methods to derive insights from massive data streams.