Name: Hirday Rochani Experiment No: 1 Roll No: 2213205

Aim: Installation of Hadoop and Experiment on HDFS commands

Theory:

What Is Hadoop?
Hadoop is an open-source framework that allows for the distributed processing of large datasets
across clusters of computers using simple programming models. It is designed to scale up from a
single server to thousands of machines, each offering local computation and storage.
Key Components of Hadoop:
1. Hadoop Distributed File System (HDFS):
 HDFS is the storage system used by Hadoop. It is designed to store very large files
across multiple machines. It provides high throughput access to application data and
is designed to be fault-tolerant by replicating data across multiple nodes.
2. MapReduce:
 MapReduce is the programming model used by Hadoop to process large datasets. It
breaks down a task into smaller sub-tasks (Map), processes them in parallel, and then
aggregates the results (Reduce).
3. YARN (Yet Another Resource Negotiator):
 YARN is the resource management layer of Hadoop. It handles the allocation of
resources in the cluster, ensuring that different tasks have the necessary
computational power to execute.
4. Hadoop Common:
 These are the common utilities and libraries that support the other Hadoop modules.
It provides essential services and functions needed by the other Hadoop modules

Advantages of Hadoop:

1. Cost-Effective: Hadoop is open-source and uses inexpensive commodity hardware, making it a cost-efficient solution for managing and processing Big Data compared to traditional relational databases.
2. Scalability: Hadoop is highly scalable, allowing for easy expansion by adding more nodes
to handle larger datasets, unlike traditional RDBMS systems.
3. Flexibility: Hadoop can process various types of data, whether structured, semi-structured,
or unstructured, making it versatile for different data sources and applications.
4. Speed: Hadoop’s distributed file system (HDFS) and parallel processing enable it to handle
large datasets quickly, offering high performance for big data operations.
5. Fault Tolerance: Hadoop automatically replicates data across multiple nodes, ensuring data
availability even if some nodes fail, which enhances its reliability.
6. High Throughput: Hadoop’s ability to process data in parallel across multiple nodes
results in high throughput, efficiently handling large workloads.
7. Minimum Network Traffic: By processing tasks locally on each data node, Hadoop
reduces network traffic within the cluster, optimizing performance.

Disadvantages of Hadoop:
1. Problem with Small Files: Hadoop struggles with large numbers of small files, as it is
optimized for handling large files split into sizable blocks.
2. Vulnerability: Being written in Java, Hadoop is more susceptible to security vulnerabilities,
potentially exposing it to cyber threats.
3. Low Performance with Small Data: Hadoop is designed for large datasets, and its
efficiency drops when processing small amounts of data.
4. Lack of Security: Hadoop’s security features, like Kerberos, are complex to manage and
lack robust encryption, making data security a concern.
5. High Processing Overhead: Hadoop’s read/write operations are disk-based, leading to
processing overhead and inefficiency in handling in-memory calculations.
6. Supports Only Batch Processing: Hadoop is designed for batch processing, with limited
support for real-time or low-latency processing tasks.

Steps For Installation:

Step1:

Use the following links to download the required software (Cloudera QuickStart VM and VirtualBox):

1. Link for Cloudera:


https://downloads.cloudera.com/demo_vm/virtualbox/clouderaquickstart-vm-5.12.0-0-virtualbox.zip

2. Link for VirtualBox: https://www.virtualbox.org/wiki/Download_Old_Builds_6_0
3. In case of any error, check the following link to enable Virtualization on your device (please look for the company whose machine (laptop/PC) you are using): https://2nwiki.2n.cz/pages/viewpage.action?pageId=75202968

Step2:
After downloading Cloudera, unzip it using a zip extractor and extract the files. Upon completion, open the VirtualBox software and select the Import option.

After selecting import, include the path of the previously downloaded Cloudera software.

Step3:
In the appliance settings, change the CPU section value from ‘1’ to ‘4’.

Step4:
Proceed further if your VirtualBox homepage looks like this.

Now click on the cloudera-quickstart-vm file, which initially shows as powered off. Once you click on it, change the display settings and keep the video memory value between 0 and 40 MB.

Step5:
Now click on the Start button and wait for a few minutes. Initially, your window will look like
this.

Once loading is completed, this window will appear.

HDFS COMMANDS:

Open a terminal window to the current working directory.

1. Print the Hadoop version


hadoop version

2. List the contents of the root directory in HDFS


hadoop fs -ls /

3. Report the amount of space used and available on the currently mounted filesystem.
hadoop fs -df hdfs:/

4. Count the number of directories, files and bytes under the paths that match the specified file pattern
hadoop fs -count hdfs:/

5. Run a DFS filesystem checking utility


hadoop fsck /

6. Run a cluster balancing utility


hadoop balancer

7. Create a new directory named "C32" below the /user/training directory in HDFS. Since you're
currently logged in with the "training" user ID, /user/training is your home directory in HDFS.
hadoop fs -mkdir /user/training/C32

8. Add a sample text file from the local directory named "data" to the new directory you created in
HDFS during the previous step.
hadoop fs -put data/sample.txt /user/training/C32

9. List the contents of this new directory in HDFS.


hadoop fs -ls /user/training/C32

10. Add the entire local directory called "retail" to the /user/training/C32 directory in HDFS.
hadoop fs -put data/retail /user/training/C32

11. Since /user/training is your home directory in HDFS, any command that does not have an
absolute path is interpreted as relative to that directory. The next command will therefore list your
home directory, and should show the items you've just added there.

hadoop fs -ls /user/training/C32

12. See how much space this directory occupies in HDFS.

hadoop fs -du -s -h hadoop



13. Delete a file 'customers' from the "retail" directory.


hadoop fs -rmr hadoop/retail/customers

14. Ensure this file is no longer in HDFS.


hadoop fs -ls hadoop/retail/customers

15. Delete all files from the "retail" directory using a wildcard.
hadoop fs -rmr hadoop/retail/*

16. To empty the trash


hadoop fs -expunge

17. Finally, remove the entire retail directory and all of its contents in HDFS.
hadoop fs -rmr hadoop/retail

18. List the C32 directory again


hadoop fs -ls C32

19. Add the purchases.txt file from the local directory named "/home/training/" to the hadoop
directory you created in HDFS
hadoop fs -copyFromLocal /home/training/purchases.txt hadoop/

20. To view the contents of your text file purchases.txt which is present in your hadoop directory.

hadoop fs -cat hadoop/purchases.txt

21. Add the purchases.txt file from "hadoop" directory which is present in HDFS directory
to the directory "data" which is present in your local directory
hadoop fs -copyToLocal hadoop/purchases.txt /home/training/data

22. cp is used to copy files between directories present in C32.

hadoop fs -cp /user/training/*.txt /user/training/C32

23. The '-get' command can be used as an alternative to the '-copyToLocal' command


hadoop fs -get hadoop/sample.txt /home/training/

24. Display last kilobyte of the file "purchases.txt" to stdout.


hadoop fs -tail hadoop/purchases.txt

25. The default replication factor for a file is 3. Use the '-setrep' command to change the replication factor of a file.
hadoop fs -setrep -w 2 apache_hadoop/sample.txt

26. Last but not least, always ask for help!

hadoop fs -help
Name: Hirday Rochani Experiment No: 2 Roll No: 2213205

Aim: Use Of Sqoop Tool To Transfer Data Between Hadoop And Relational Database Servers.

Theory:

What is the Sqoop tool?


Sqoop is a tool used to transfer bulk data between Hadoop and external datastores, such as relational
databases (MS SQL Server, MySQL).
To process data using Hadoop, the data first needs to be loaded into Hadoop clusters from several
sources. However, it turned out that the process of loading data from several heterogeneous sources
was extremely challenging. The problems administrators encountered included:
1. Maintaining data consistency
2. Ensuring efficient utilization of resources
3. Loading bulk data to Hadoop was not possible
4. Loading data using scripts was slow
The solution was Sqoop. Using Sqoop in Hadoop helped to overcome all the challenges of the
traditional approach and it could load bulk data from RDBMS to Hadoop with ease.

Sqoop Features:
Sqoop has several features, which makes it helpful in the Big Data world:
1. Parallel Import/Export
Sqoop uses the YARN framework to import and export data. This provides fault tolerance
on top of parallelism.

2. Import Results of an SQL Query


Sqoop enables us to import the results returned from an SQL query into HDFS.
3. Connectors For All Major RDBMS Databases
Sqoop provides connectors for multiple RDBMSs, such as the MySQL and Microsoft SQL
servers.
4. Kerberos Security Integration
Sqoop supports the Kerberos computer network authentication protocol, which enables nodes communicating over an insecure network to authenticate users securely.
5. Provides Full and Incremental Load
Sqoop can load the entire table or parts of the table with a single command.

Sqoop Architecture:
1. The client submits the import/export command to import or export data.
2. Sqoop fetches data from different databases. Here, we have an enterprise data warehouse,
document-based systems, and a relational database. We have a connector for each of these;
connectors help to work with a range of accessible databases.

3. Multiple mappers perform map tasks to load the data on to HDFS.



4. Similarly, numerous map tasks will export the data from HDFS on to RDBMS using the Sqoop
export command.

Sqoop Import:
The diagram below represents the Sqoop import mechanism.

1. In this example, a company’s data is present in the RDBMS. All this metadata is sent to the Sqoop import tool. Sqoop then performs an introspection of the database to gather metadata (primary key information).
2. It then submits a map-only job. Sqoop divides the input dataset into splits and uses
individual map tasks to push the splits to HDFS.

Sqoop Export

1. The first step is to gather the metadata through introspection.


2. Sqoop then divides the input dataset into splits and uses individual map tasks to push the
splits to RDBMS.

Sqoop Processing:
Processing takes place step by step, as shown below:
1. Sqoop runs in the Hadoop cluster.
2. It imports data from the RDBMS or NoSQL database to HDFS.
3. It uses mappers to slice the incoming data into multiple formats and loads the data in HDFS.
4. Exports data back into the RDBMS while ensuring that the schema of the data in the
database is maintained.

Advantages of Sqoop:

1. Efficient Data Transfer:


• Sqoop is highly optimized for bulk data transfer, making it efficient for moving large
volumes of data between Hadoop and relational databases.
2. Ease of Use:
• Sqoop simplifies the process of transferring data. It uses simple command-line
interfaces to specify data sources, destination directories, and formats.
3. Integration with Hadoop Ecosystem:
• Sqoop integrates well with other Hadoop components like HDFS, Hive, HBase, and
Pig. This makes it easy to use Sqoop for loading data into Hive tables, HBase, or for
analysis using Pig.

4. Support for Incremental Loads:


• Sqoop allows for incremental data imports, enabling users to import only new or
updated data from relational databases without having to re-import the entire dataset.
5. Supports Multiple Databases:
• Sqoop supports a wide range of databases including MySQL, PostgreSQL, Oracle,
SQL Server, and others, making it a flexible tool for diverse environments.
6. Compression Support:
• Sqoop can compress data during import, reducing storage space and speeding up the
data transfer process.
7. Parallel Data Transfer:
• Sqoop supports parallel data transfer, which allows the data to be transferred faster by
splitting the job across multiple tasks.

Disadvantages of Sqoop:

1. Limited Data Transformation Capabilities:


• Sqoop is primarily a data transfer tool, not a data transformation tool. If you need to
perform complex transformations, you will have to rely on other tools like Pig or Hive.
2. Dependency on JDBC Drivers:
• Sqoop uses JDBC drivers to connect to databases, and sometimes there might be
limitations or performance issues depending on the database’s JDBC driver.
3. Requires SQL Expertise:
• Users may need a solid understanding of SQL to fully utilize Sqoop’s capabilities,
especially for custom queries during data import.
4. Lack of Real-Time Data Integration:
• Sqoop is not ideal for real-time or near-real-time data synchronization. It is more suited
for batch processing, and if real-time data transfer is required, tools like Apache Kafka
may be a better fit.
5. Manual Tuning for Large Data Sets:
• For very large datasets, manual tuning is often required to optimize the import/export
process. Factors like partitioning, parallelism, and memory settings need to be
considered.
6. Limited Support for NoSQL Databases:
• While Sqoop is excellent for relational databases, it has limited support for NoSQL
databases. It is mainly focused on SQL-based relational systems.

Commands:

Step1:

Open the Cloudera Terminal and execute the following command to connect to the MySQL server. (Note: The default password for the root user is cloudera.)

mysql -u root -p

Step2:
Creating a database
create database bank1;

Step3:
Creating a Table
(Note: The database must be in use before you create a table.)

Step4:
Insert values

CLOUDERA
After you exit MySQL, create a folder in the Cloudera file system into which the MySQL table created above will be imported. (In the following steps, the ’myfirstdata’ folder is created in /home/cloudera.)

Step5:
Importing the table using Sqoop
sqoop import --connect jdbc:mysql://youripaddress:3306/<database_name> --username root --password cloudera --table <table_name> --target-dir=<target_directory> -m 1
Here,
-m specifies the number of mappers
3306 is the default port for MySQL
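Note: with -m 1 a single map task is used, so the imported data lands in a single HDFS file (part-m-00000, which is read in the next step); with more mappers, Sqoop splits the work (by default on the table's primary key) and produces one part-m-NNNNN file per mapper.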

Step6:
Displaying the contents in HDFS
hadoop fs -ls /home/cloudera/myfirstdata
hadoop fs -cat /home/cloudera/myfirstdata/part-m-00000

Export Data from HDFS to MySQL


In order to export data from HDFS to MySQL, an appropriate table has to be created in MySQL as
we export data into a particular table. In our case, we will be exporting the contents in the
‘myfirstdata’ folder by creating a table ‘registercopy’ in the ‘bank1’

database. The table which we will be creating needs to have the same structure as the
‘register’ table which we created earlier.

Step7:
Creating the table

Step8:
Exporting data from HDFS to MySQL

Syntax:
sqoop export --connect jdbc:mysql://localhost/db --username root --table <table_name>
--export-dir <directory>

Step9:
Verifying in MySQL
We can see that the data is exported successfully into ‘registercopy’.
Name:Hirday Rochani Experiment No: 3 Roll No: 2213205

Aim: Programming Exercise In HBASE.

Theory:

HBase (Hadoop Database) is an open-source, distributed, non-relational, column-oriented database


built on top of the Hadoop Distributed File System (HDFS). It is designed to handle large amounts
of sparse data (data that is not densely populated) and is modeled after Google's Bigtable. HBase is
part of the Apache Hadoop ecosystem and provides real-time read/write access to large datasets.

Key Features of HBase:

1. Scalability: HBase scales horizontally by distributing data across many servers (also called
RegionServers). This enables it to store and manage petabytes of data.
2. Column-Oriented Storage: HBase stores data in a column-family format, allowing
efficient reads and writes on column-based datasets. Column-oriented databases are
particularly useful when you have sparse data or need to perform aggregate queries on
specific columns.
3. Real-Time Data Access: Unlike Hadoop's MapReduce jobs, which provide batch
processing, HBase allows random, real-time read and write access to data, making it ideal
for applications requiring low-latency access.
4. NoSQL Design: HBase is schema-less, meaning you don't have to predefine table schemas
(apart from column families). It supports flexible data models with dynamic columns.

5. Automatic Sharding: HBase automatically divides tables into smaller regions and
distributes them across different nodes in a cluster, improving performance and load
balancing.
6. Fault Tolerance: Built on HDFS, HBase inherits its fault tolerance, where data is replicated
across multiple nodes to prevent data loss in case of hardware failures.
7. Strong Consistency: HBase ensures that all operations are atomic and consistent. Data
writes are always immediately visible to subsequent reads.
8. Integration with Hadoop: HBase integrates seamlessly with Hadoop's ecosystem,
including tools like Apache Hive, Apache Pig, and Apache Spark for data analysis and
querying.

Use Cases:

1. Time-series Data: Applications that store and process time-series data, such as monitoring
and IoT data.
2. Data Warehousing: HBase can handle vast amounts of semi-structured or unstructured
data, making it ideal for data warehousing and data lakes.
3. Real-time Analytics: HBase supports real-time querying and data processing, making it
useful for systems that require quick response times.

Architecture of HBASE:

Components of HBase:

1. RegionServers: Each RegionServer hosts multiple regions (partitions of tables) and handles
read and write requests for the data within those regions.
2. HMaster: The master server in the HBase architecture manages RegionServers and is
responsible for load balancing and failover.
3. Zookeeper: Used for distributed coordination, it helps manage the distributed environment
and keeps track of RegionServers.

Difference Between HDFS and HBASE:

1. HDFS is a distributed file system suitable for storing large files, whereas HBase is a database built on top of HDFS.
2. HDFS does not support fast individual record lookups, whereas HBase provides fast lookups for large tables.
3. HDFS provides high-latency batch processing, whereas HBase provides low-latency access to single rows from billions of records (random access).
4. HDFS provides only sequential access to data, whereas HBase internally uses hash tables and provides random access, storing data in indexed HDFS files for faster lookups.

Commands:

Step1:

Open the Cloudera Terminal and execute the following command to connect to the MySQL server. (Note: The default password for the root user is cloudera.)
mysql -u root -p

Step2:
Creating a database
create database bank1;

Step3:
Creating a Table
(Note: The database must be in use before you create a table.)

Step4:
Insert values

CLOUDERA
After you exit MySQL, create a folder in the Cloudera file system into which the MySQL table created above will be imported. (In the following steps, the ’myfirstdata’ folder is created in /home/cloudera.)

Step5:
Importing the table using Sqoop
sqoop import --connect jdbc:mysql://youripaddress:3306/<database_name> --username root --password cloudera --table <table_name> --target-dir=<target_directory> -m 1
Here,
-m specifies the number of mappers
3306 is the default port for MySQL

Step6:
Displaying the contents in HDFS
hadoop fs -ls /home/cloudera/myfirstdata
hadoop fs -cat /home/cloudera/myfirstdata/part-m-00000

Export Data from HDFS to MySQL


In order to export data from HDFS to MySQL, an appropriate table has to be created in MySQL as
we export data into a particular table. In our case, we will be exporting the contents in the
‘myfirstdata’ folder by creating a table ‘registercopy’ in the ‘bank1’
database. The table which we will be creating needs to have the same structure as the
‘register’ table which we created earlier.

Step7:
Creating the table

Step8:
Exporting data from HDFS to MySQL

Syntax:
sqoop export --connect jdbc:mysql://localhost/db --username root --table <table_name>
--export-dir <directory>

Step9:
Verifying in MySQL
We can see that the data is exported successfully into ‘registercopy’
Name: Hirday Rochani Experiment No: 4 Roll No: 2213205

Aim: Experiment For Word Counting Using Hadoop Map-Reduce

Theory:
 MapReduce can be used to write applications that process large amounts of data, in parallel, on large clusters of commodity hardware (commodity hardware is simply hardware that is easily available in the local market) in a reliable manner.
 MapReduce is a processing technique as well as a programming model for distributed computing, based on the Java programming language or Java framework.
 The components of MapReduce are:
 Mapper:
The Map tasks accept one or more chunks from a DFS and turn them into a sequence
of key-value pairs. How the input data is converted into key-value pairs is determined
by the code written by the user for the Map function.

 Shuffling:
The process of exchanging the intermediate outputs from the map tasks to where they are
required by the reducers is known as “Shuffling".

 Reduce:
The Reduce tasks combine all of the values associated with a particular key. The
code written by the user for the Reduce function determines how the combination is
done. All of the values with the same key are presented to a single reducer together.

 Role of combiner:
 A combiner is a type of mediator between the mapper phase and the reducer phase.
The use of combiners is totally optional. As a combiner sits between the mapper
and the reducer, it accepts the output of map phase as an input and passes the key-
value pairs to the reduce operation.
 Combiners are also known as semi-reducers as they reside before the reducer. A combiner is used when the Reduce function is commutative and associative. This means that the values can be combined in any order without affecting the final result. For example, addition is an operation which is both commutative and associative.
 In MapReduce, the mapper generates a large amount of intermediate data. Transferring this data to the reducer can cause network congestion due to the high bandwidth requirement. A combiner function helps by locally aggregating data on the mapper node, reducing the amount of data sent over the network, which alleviates congestion and improves performance (a minimal sketch of this flow is shown after the list of advantages below).

Advantage of combiners
 Reduces the time taken for transferring the data from Mapper to Reducer.

 Reduces the size of the intermediate output generated by the Mapper.


 Improves performance by minimizing Network congestion.
 Reduces the workload on the Reducer
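
To make the map, combine, shuffle and reduce flow concrete, here is a minimal, framework-free Python sketch of word counting (an illustration only; it does not use the Hadoop API, and the sample lines are made up):

from collections import Counter
from itertools import chain

def mapper(line):
    # Emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def combiner(pairs):
    # Local aggregation on the "mapper node": sum the counts per word
    # before anything is sent across the network.
    return list(Counter(word for word, _ in pairs).items())

def reducer(shuffled_pairs):
    # After shuffling, all partial counts for a key reach the same reducer;
    # sum them to obtain the final count per word.
    totals = Counter()
    for word, count in shuffled_pairs:
        totals[word] += count
    return dict(totals)

lines = ["big data big cluster", "big data"]            # two input splits
map_out = [mapper(line) for line in lines]              # map phase, per split
combined = [combiner(pairs) for pairs in map_out]       # combiner, per split
print(reducer(chain.from_iterable(combined)))           # {'big': 3, 'data': 2, 'cluster': 1}

The combiner step shrinks the intermediate data (for the first split, four (word, 1) pairs become three partial counts), which is exactly the network saving described above.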

STEPS FOR WORD COUNTING PROGRAM:


1. Go to eclipse and create new -> java project -> Wordcount(Any suitable name you want)
2. Right click on project Wordcount -> new -> class

(Note: Wordcount and wordCountJob are just folder names. It represents the same folder.)

Name the classes: WordCount, WordMapper, WordReducer

Files:

1) WordCount.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver class: configures the word-count job and submits it to the cluster.
public class WordCount {

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);

        // Wire in the mapper and reducer defined in the next two files.
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(WordReducer.class);

        // The output is (word, count) pairs.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The input file and output directory are passed on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

2) WordMapper.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper: emits (word, 1) for every word found in each input line.
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        for (String word : line.split("\\W+")) {
            if (word.length() > 0)
                context.write(new Text(word), new IntWritable(1));
        }
    }
}

3) WordReducer.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reducer: sums the counts for each word and emits (word, total).
public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int wordCount = 0;
        for (IntWritable value : values) {
            wordCount += value.get();
        }
        // Write the total once per key, after all values have been summed.
        context.write(key, new IntWritable(wordCount));
    }
}

3. Right click on Wordcount/WordCountJob-> build path -> add external archives.



Go to file system-> usr-> lib->hadoop-> click on the selected file shown in below fig

Go to Mozilla Firefox, search for "Hadoop core 1.2.1 jar download", and download the first result shown below.

Folder structure

Right click on Wordcount and export as JAR file. Export the jar file in the same location where all
other three java files are present.

After all the above procedure go to the terminal to run the following commands.

1) Open terminal
Change the path to cd /home/training/workspace/Wordcount/src
(Check your path of the Wordcount folder on your system and set the path accordingly)

2) Run the command


hadoop jar abcd.jar WordCount sample.txt sampleoutdir
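
(Note: the output directory, sampleoutdir here, must not already exist in HDFS; the job creates it. Once the job finishes, the word counts are typically written to a file such as sampleoutdir/part-r-00000 and can be viewed with hadoop fs -cat.)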

Commands:
Name: Hirday Rochani Experiment No: 5 Roll No: 2213205

Aim: Experiment On Pig.

Theory:

What is Pig in Hadoop?


Pig is a scripting platform that runs on Hadoop clusters and is designed to process and analyze large datasets. Pig is extensible, self-optimizing, and easily programmed.
Programmers can use Pig to write data transformations without knowing Java. Pig uses both structured and unstructured data as input to perform analytics and uses HDFS to store the results.

Pig is a high-level platform or tool used to process large datasets. It provides a high level of abstraction for processing over MapReduce, along with a high-level scripting language, known as Pig Latin, which is used to develop data analysis code. First, to process the data stored in HDFS, programmers write scripts using the Pig Latin language. Internally, the Pig Engine (a component of Apache Pig) converts all these scripts into specific map and reduce tasks, but this is not visible to the programmers, in order to provide a high level of abstraction. Pig Latin and the Pig Engine are the two main components of the Apache Pig tool. The result of Pig is always stored in HDFS.

Pig Architecture

Pig Architecture:
1. Pig Latin Scripts
This is where the process begins. Users write scripts using the Pig Latin language to define the data
analysis tasks and transformations they want to perform. Pig Latin is designed to be a simpler
alternative to writing complex MapReduce code.

2. Apache Pig Components



Grunt Shell: This is the interactive command-line interface where users can write and execute Pig Latin scripts directly. It’s useful for quick data exploration and script testing.

Pig Server: The Pig server acts as the backend engine that takes Pig Latin scripts, processes them,
and translates them into MapReduce jobs. It handles all the coordination between the various
components in the system.

Parser: The first step after a script is submitted. The parser checks the Pig Latin script for any
syntax errors and ensures that it follows the correct structure. It then generates a Logical Plan,
which outlines the various operations required.

Optimizer: The optimizer analyzes the logical plan and looks for opportunities to improve
efficiency. It removes unnecessary operations, rearranges steps for better performance, and
generally refines the execution plan. This results in an optimized Physical Plan that is more efficient
to execute.

Compiler: Once the physical plan is optimized, it needs to be translated into MapReduce jobs. The
compiler is responsible for turning the physical plan into a series of MapReduce jobs that Hadoop
can execute. Each step in the Pig Latin script corresponds to one or more MapReduce jobs.

Execution Engine: The execution engine is responsible for running the MapReduce jobs generated
by the compiler. It manages job execution on Hadoop, ensuring that the data is processed, results
are collected, and any errors are handled appropriately.

3. MapReduce
Description: Once the Pig components have prepared the MapReduce jobs, they are submitted to
Hadoop’s MapReduce framework for distributed processing. MapReduce is responsible for splitting
the data across nodes in the cluster, running computations in parallel, and aggregating the results.

4. HDFS (Hadoop Distributed File System)


Description: The output of the MapReduce jobs is stored in HDFS, which is Hadoop’s distributed
file system. HDFS is designed for high-throughput data storage and retrieval, and it stores the
results of the Pig jobs for later use or further analysis.

Applications of Apache Pig:


 Pig scripting is used for exploring large datasets.
 It provides support for ad-hoc queries across large datasets.
 It is used in the prototyping of algorithms that process large datasets.
 It is used where time-sensitive data loads need to be processed.
 It is used for collecting large amounts of data in the form of search logs and web crawls.
 It is used where analytical insights are needed via sampling.

Advantages of Pig:
1. Ease of Use: Pig Latin scripts are much simpler to write compared to native Hadoop
MapReduce code. It abstracts away the complex Java-based MapReduce programming and is
more like SQL, making it easier for developers and data analysts.
2. Less Development Time: Pig significantly reduces the time it takes to write, understand,
and maintain the code due to its high-level abstractions over Hadoop.
3. Flexible: Pig can handle both structured and semi-structured data (like JSON, XML, etc.),
making it highly adaptable to different data types.
4. Improved Productivity: Since the code is more concise and requires fewer lines, developers
can quickly prototype, test, and debug scripts. This results in increased productivity for data
processing tasks.
5. Extensibility: Pig allows users to create their own user-defined functions (UDFs) in Java,
Python, or other supported languages, making it customizable to specific needs.
6. Optimized for Performance: Pig optimizes execution plans for Pig Latin scripts, making it
efficient for processing large datasets.
7. Dataflow Approach: It follows a dataflow approach where users specify a sequence of
transformations, and Pig handles how to execute them efficiently.

Disadvantages of Pig:
1. Learning Curve: Despite being easier than MapReduce, Pig still requires learning the Pig
Latin scripting language, which may pose a challenge for users unfamiliar with it.
2. Limited Debugging Tools: While Pig is easier to use than MapReduce, debugging scripts
can still be complex due to limited debugging tools, especially with very large datasets.
3. Less Suitable for Complex Analytics: Pig is better suited for ETL (Extract, Transform,
Load) processes or simple data analytics. For complex machine learning and iterative
algorithms, it’s less powerful compared to frameworks like Apache Spark.
4. Latency: Pig runs on Hadoop, so it is bound by the limitations of Hadoop’s batch processing
model. This can lead to higher latency for processing compared to real-time solutions.
5. Requires Hadoop Setup: Pig requires an underlying Hadoop cluster, meaning it can't be
used without a Hadoop environment. This dependency makes it unsuitable for smaller data
processing tasks.

Apache Pig vs MapReduce:
1. Apache Pig is a scripting language, whereas MapReduce is a compiled programming language.
2. In Pig, abstraction is at a higher level; in MapReduce, abstraction is at a lower level.
3. Pig needs fewer lines of code compared to MapReduce, where the number of lines of code is larger.
4. Less development effort is needed for Apache Pig; more development effort is required for MapReduce.
5. Code efficiency of Pig is less compared to MapReduce; MapReduce code is comparatively more efficient.
6. Pig provides built-in functions for ordering, sorting and union, operations that are hard to perform in MapReduce.
7. Pig allows nested data types like map, tuple and bag; MapReduce does not allow nested data types.

Commands:

1. Run PIG:
[training@localhost ~]$ pig

2. grunt> fs:

3. copyFromLocal:
grunt> copyFromLocal /home/training/sample.txt /user/training/
grunt> cat sample.txt

grunt> copyFromLocal /home/training/students.txt /user/training/
grunt> cat students.txt

4. Dump:

5. Projection:

6. Joins:
grunt> dept = load '/user/training/department.txt' USING PigStorage(',') as (id:int,name:chararray,deptname:chararray,sal:int);
grunt> dump dept;

7. Relational Operators
a. Cross: The cross operator is used to calculate the cross product of two or more relations.
grunt> dump student;

grunt>dump dept;

grunt> x = cross student,dept;


grunt>dump x;

b) ForEach: This operator is used to generate data transformation based on column data.
grunt>X = Foreach dept GENERATE id,name;
grunt> dump X;
Name: Hirday Rochani Experiment No: 6 Roll No: 2213205

Aim: Create HIVE Database And Descriptive Analytics (Basic Statistics).

Theory:
What is Apache Hive?
Apache Hive is a data warehouse and an ETL tool which provides an SQL-like interface between the user and the Hadoop Distributed File System (HDFS). It is built on top of Hadoop. It is a software project that provides data query and analysis. It facilitates reading, writing and managing large datasets that are stored in distributed storage and queried using SQL-like syntax. It is not built for Online Transactional Processing (OLTP) workloads. It is frequently used for data warehousing tasks like data encapsulation, ad-hoc queries, and analysis of huge datasets. It is designed to enhance scalability, extensibility, performance, fault tolerance and loose coupling with its input formats.

Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a
massive scale. A data warehouse provides a central store of information that can easily be analyzed
to make informed, data driven decisions. Hive allows users to read, write, and manage petabytes of
data using SQL.

Hive is built on top of Apache Hadoop, which is an open-source framework used to efficiently
store and process large datasets. As a result, Hive is closely integrated with Hadoop, and is
designed to work quickly on petabytes of data.

Modes of Hive:
Local Mode –
It is used when Hadoop is installed in pseudo-distributed mode with only one data node, when the data is small enough to be restricted to a single local machine, and when processing of such smaller datasets will be faster on the local machine.

Map Reduce Mode –

It is used when Hadoop is built with multiple data nodes and the data is divided across various nodes. It works on huge datasets, executes queries in parallel, and achieves enhanced performance when processing large datasets.
What is HQL?
Hive defines a simple SQL-like query language for querying and managing large datasets, called Hive-QL (HQL). It’s easy to use if you’re familiar with SQL. Hive also allows programmers who are familiar with MapReduce to plug in their own custom mappers and reducers to perform more sophisticated analysis.

The major components of Hive and its interaction with the Hadoop is demonstrated in the figure
below and all the components are described further:
 User Interface (UI) –
As the name suggests, the user interface provides an interface between the user and Hive. It enables the user to submit queries and other operations to the system. Hive Web UI, the Hive command line, and Hive HD Insight (on Windows Server) are supported.

 Hive Server – It is also referred to as the Apache Thrift Server. It accepts requests from different clients and forwards them to the Hive Driver.
 Driver –
The driver receives the user's queries submitted through the interface. It implements the concept of session handles and provides execute and fetch APIs modelled on JDBC/ODBC interfaces.

 Compiler –
The compiler parses the query and performs semantic analysis on the different query blocks and query expressions. It eventually generates an execution plan with the help of the table and partition metadata obtained from the metastore.

 Metastore –
The metastore stores all the structural information of the different tables and partitions in the warehouse, including attribute and attribute-level information, the serializers and deserializers necessary to read and write data, and the corresponding HDFS files where the data is stored. Hive uses a relational database server to store this schema or metadata of databases, tables, attributes in a table, data types, and HDFS mappings.

 Execution Engine –
The execution engine executes the plan created by the compiler. The plan is a DAG of stages. The execution engine manages the dependencies between the various stages of the plan and executes these stages on the suitable system components.

Advantages of Hive Architecture:


1) Scalability: Hive is a distributed system that can easily scale to handle large volumes of data by
adding more nodes to the cluster.
2) Data Accessibility: Hive allows users to access data stored in Hadoop without the need for
complex programming skills. SQL-like language is used for queries and HiveQL is based on
SQL syntax.
3) Data Integration: Hive integrates easily with other tools and systems in the Hadoop ecosystem
such as Pig, HBase, and MapReduce.
4) Flexibility: Hive can handle both structured and unstructured data, and supports various data
formats including CSV, JSON, and Parquet.
5) Security: Hive provides security features such as authentication, authorization, and encryption
to ensure data privacy.

Disadvantages of Hive Architecture:


1) High Latency: Hive’s performance is slower compared to traditional databases because of the
overhead of running queries in a distributed system.
2) Limited Real-time Processing: Hive is not ideal for real-time data processing as it is designed
for batch processing.
3) Complexity: Hive is complex to set up and requires a high level of expertise in Hadoop, SQL,
and data warehousing concepts.
4) Lack of Full SQL Support: HiveQL does not support all SQL operations, such as transactions
and indexes, which may limit the usefulness of the tool for certain applications.
5) Debugging Difficulties: Debugging Hive queries can be difficult as the queries are executed
across a distributed system, and errors may occur in different nodes.

Features Of Hive

Limitations Of Hive

Hive vs Pig:
1) Hive is commonly used by data analysts, whereas Pig is commonly used by programmers.
2) Hive follows SQL-like queries, whereas Pig follows a data-flow language.
3) Hive can handle structured data, whereas Pig can handle semi-structured data.
4) Hive works on the server side of an HDFS cluster, whereas Pig works on the client side.
5) Hive is slower than Pig; Pig is comparatively faster than Hive.

Commands:

1. To enter hive terminal


Command: hive

2. To check the databases


Command: show databases;

3. To check the tables


Command: show tables;

4. To use a particular database


Command: use dbname;

5. To create database
Command: create database retail;

6. To create table emp in retail database


Command: create table <tablename>;

7. Schema information of table


Command: describe <tablename>;

8. Create a file in the training folder and save it as demo.txt:
10, Bhavisha, 10000.0
11, Nipoon, 20000.0
12, Hirday, 15000.0
[training@localhost ~]$ cat /home/training/demo.txt

9.To view contents of table


Command: select * from emp;

10. To rename table name


Command: ALTER TABLE old_table_name RENAME TO new_table_name;

11.Selecting data
hive> select * from emp_sal where id=1;

12. To count number of records in table


hive> select count(*) from emp_sal;

13. Try using aggregate commands in HQL (try creating tables with GROUP BY fields and executing the aggregate commands):
hive > select AVG(sal) as avg_salary from emp_sal;

hive > select MAX(sal) as max_salary from emp_sal;

14. To drop table


hive> drop table emp_sal;

15. To exit from Hive terminal


hive> exit;
Name: Hirday Rochani Experiment No: 7 Roll No: 2213205

Aim: Implement Bloom Filter Using Python/R Programming.

Theory:

What is Bloom Filter?


A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. For example, checking the availability of a username is a set-membership problem, where the set is the list of all registered usernames. The price we pay for this efficiency is that the structure is probabilistic in nature, which means there might be some false positive results. A false positive means the filter might report that a given username is already taken when actually it is not.
Interesting Properties of Bloom Filters
• Unlike a standard hash table, a Bloom filter of a fixed size can represent a set with an
arbitrarily large number of elements.
• Adding an element never fails. However, the false positive rate increases steadily as elements
are added until all bits in the filter are set to 1, at which point all queries yield a positive result.
• Bloom filters never generate a false negative result, i.e., they never tell you that a username doesn’t exist when it actually exists.
• Deleting elements from the filter is not possible, because clearing the bits at the indices generated by the k hash functions for one element might also cause the deletion of a few other elements.
How It Works:
• A Bloom filter uses multiple hash functions to map each element to several positions in a
fixed-size bit array (all bits are initially set to 0).
• When an element is added to the set, each hash function produces a bit position, and those
positions are set to 1.
• To check if an element is in the set, the Bloom filter applies the same hash functions to the
element and checks whether all the resulting positions are 1. If any of them are 0, the element
is definitely not in the set. If all positions are 1, the element might be in the set (with some
probability of a false positive).
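
The trade-off between the size of the bit array, the number of hash functions and the false-positive rate can be quantified. As a reference (this is the standard approximation for Bloom filters, not derived in this write-up), for a filter with an m-bit array, k hash functions and n inserted elements, the false-positive probability is approximately

p \approx \left(1 - e^{-kn/m}\right)^{k}

and it is minimised by choosing roughly k = (m/n) \ln 2 hash functions.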

Advantages:
1. Space-Efficient: It uses much less memory than storing the actual elements.
2. Time-Efficient: Checking if an element is in the set is very fast (constant time).
3. No False Negatives: It guarantees that if the Bloom filter says an element is not in the set, it
definitely isn't.

Disadvantages:
1. False Positives: There is a small probability that the filter will say an element is in the set
when it's not.
2. Not Removable: Once an element is added, you cannot remove it from the Bloom filter
without reconstructing it from scratch.
3. Fixed Size: You have to choose the size of the bit array and the number of hash functions at
the beginning, which makes it less flexible if the dataset grows.
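
For comparison with the simple two-hash program below, a more general sketch of a Bloom filter with k hash functions might look as follows (a hypothetical helper class written for illustration only; it derives the k bit positions by salting a SHA-256 digest and is not the code submitted for this experiment):

import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m = m              # number of bits in the filter
        self.k = k              # number of hash functions
        self.bits = [0] * m

    def _positions(self, item):
        # Derive k positions by hashing the item together with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(item))

For example, after bf = BloomFilter(m=64, k=3) and bf.add("hirday"), the call bf.might_contain("hirday") returns True, while a value that was never added returns False unless a false positive occurs.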

Code:

def main():
    # Read the length of the bit array (the "stream data" array)
    m = int(input("Enter the length of the stream data (m): "))
    data_stream = [0] * m

    # Read the number of inputs used to train (populate) the array
    n = int(input("Enter the number of inputs to train the array (n): "))

    # Training: set the bits selected by the two hash functions
    # h1(x) = (x mod 5) mod m and h2(x) = ((2x + 3) mod 5) mod m
    for _ in range(n):
        input_value = int(input("Enter a number to train the array: "))
        hash1 = (input_value % 5) % m
        hash2 = ((2 * input_value + 3) % 5) % m

        data_stream[hash1] = 1
        data_stream[hash2] = 1

    # Print the trained bit array
    print("Training Array:", ' '.join(map(str, data_stream)))

    # Read the number of membership tests to perform
    x = int(input("Enter the number of times to test (x): "))

    # Testing: apply the same hash functions to each queried value
    for _ in range(x):
        test_value = int(input("Enter a number to test: "))
        hash1 = (test_value % 5) % m
        hash2 = ((2 * test_value + 3) % 5) % m

        if data_stream[hash1] == 1 and data_stream[hash2] == 1:
            # Both bits are set: the value is probably in the stream
            # (a false positive is possible).
            print("Number is probably in the stream.")
        else:
            # At least one bit is 0: the value is definitely not in the stream.
            print("Number is not in the stream.")

if __name__ == "__main__":
    main()

Output:
Name: Hirday Rochani Experiment No: 8 Roll No: 2213205

Aim: Implement FM Algorithm Using Python/R.

Theory:

The Flajolet-Martin algorithm is a probabilistic algorithm that is mainly used to count the number of unique elements in a stream or database. The algorithm was invented by Philippe Flajolet and G. Nigel Martin in 1983, and since then it has been used in various applications such as data mining and database management.
The basic idea the Flajolet-Martin algorithm is based on is to use a hash function to map the elements of the given dataset to binary strings, and to use the length of the run of zeros at the end of each binary string (its trailing zeros) as an estimator of the number of unique elements.
The steps for the Flajolet-Martin algorithm are:
 First step is to choose a hash function that can be used to map the elements in the database to
fixed-length binary strings. The length of the binary string can be chosen based on the
accuracy desired.
 Next step is to apply the hash function to each data item in the dataset to get its binary string
representation.
 The next step is to determine the number of trailing zeros in each binary string.
 Next, we compute the maximum number of trailing zeros, R, over all binary strings.
 Now we estimate the number of distinct elements in the dataset as 2 to the power of R, the maximum computed in the previous step.
The accuracy of the Flajolet-Martin algorithm is determined by the length of the binary strings and the number of hash functions it uses. Generally, increasing the length of the binary strings or using more hash functions increases the algorithm’s accuracy.
The Flajolet-Martin algorithm is especially useful for big datasets that cannot be kept in memory or analyzed with regular methods. By using sound probabilistic techniques, the algorithm can provide a good estimate of the number of unique elements in the dataset with very little computation.
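
As a small illustrative example (with made-up hash outputs): if the hashed binary strings observed in a stream are 0100, 0011, 0110 and 1000, their trailing-zero counts are 2, 0, 1 and 3. The maximum is R = 3, so the estimated number of distinct elements is 2^R = 8.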

Advantages of the Flajolet-Martin Algorithm:

1. Space Efficiency:
 The algorithm requires very little memory compared to keeping track of all distinct
elements. It uses hash functions and bit patterns, allowing it to estimate the number of
distinct elements with logarithmic space complexity, i.e., O(logn), where n is the number of
distinct elements.

2. Streaming-Friendly:
 FM is designed for streaming data and can process each element in constant time O(1). It
doesn't require storing the data stream itself, making it suitable for scenarios where elements
are arriving at high velocity.
3. Scalability:
 The FM algorithm scales well with large data volumes because of its low memory and time
complexity. It is ideal for use in big data applications like distributed systems.
4. Simplicity:
 It is relatively simple to implement using hash functions and bit manipulation. This makes it
practical for use in systems with resource constraints.
5. Randomized but Accurate:
 Even though the algorithm is probabilistic, it provides a good approximation of the
cardinality with high accuracy, especially when multiple independent estimations (averaging
or merging techniques) are combined.

Disadvantages of the Flajolet-Martin Algorithm:

1. Approximation, Not Exact:


 The algorithm provides only an approximate count of distinct elements. While it's efficient,
the estimate may have a significant error margin, especially when used in small data sets.
However, the accuracy improves when more estimators are used.
2. Dependent on Good Hash Functions:
 The performance of the FM algorithm heavily relies on the quality of the hash function used.
If the hash function is not well-distributed, the estimation can be highly inaccurate.
3. Sensitive to Collisions:
 Collisions in the hash function can negatively affect the accuracy of the result. In situations
where collisions occur frequently, multiple independent hash functions may need to be used
to improve accuracy, increasing the algorithm's complexity.
4. Variance in Estimates:
 The basic FM algorithm has high variance in its estimates. To reduce this, the algorithm
requires combining multiple independent instances of the estimator (e.g., averaging them or
using median estimations), which may still require extra computation.
5. Not Suitable for Dynamic Range Queries:
 FM is not designed for answering range queries (queries about a subset of the stream),
making it less versatile in some scenarios compared to more advanced algorithms like
HyperLogLog.

Code:

def hash(x):
    # Simple hash function used for this experiment: h(x) = (6x + 1) mod 5
    return (6 * x + 1) % 5

def to_three_bit_binary(num):
    binary = bin(num)[2:]   # Convert to binary and remove the '0b' prefix
    return binary.zfill(3)  # Pad with leading zeros to ensure 3 bits

def count_trailing_zeros(arr):
    # Count the trailing zeros of each binary string in the list.
    result = []
    for binary in arr:
        count = 0
        encountered_one = False
        for j in range(len(binary) - 1, -1, -1):
            if binary[j] == '0' and not encountered_one:
                count += 1
            elif binary[j] == '1':
                encountered_one = True
        if not encountered_one:
            count = 0  # all-zero string: treat as zero trailing zeros
        result.append(count)
    return result

def main():
    size = int(input("Enter the size of the array: "))
    input_array = []

    print("Enter elements of the array:")
    for i in range(size):
        element = int(input(f"Element {i + 1}: "))
        input_array.append(element)

    # Hash every element, convert to a 3-bit binary string, count trailing zeros.
    hashed_array = [hash(x) for x in input_array]
    binary_array = [to_three_bit_binary(x) for x in hashed_array]
    trailing_zeros_array = count_trailing_zeros(binary_array)

    max_trailing_zeros = max(trailing_zeros_array)
    print(f"Maximum trailing zeros: {max_trailing_zeros}")

    # FM estimate: 2 to the power of the maximum number of trailing zeros.
    power_of_two = 2 ** max_trailing_zeros
    print(f"Estimated number of distinct elements: {power_of_two}")

if __name__ == "__main__":
    main()

Output:
Name: Hirday Rochani Experiment No: 9 Roll No: 2213205

Aim: Data Visualization Using R.

Theory:

What Is Data Visualization?


Data visualization is the representation of data through use of common graphics, such as charts,
plots, infographics and even animations. These visual displays of information communicate complex
data relationships and data-driven insights in a way that is easy to understand.

Types Of Data Visualizations:


 Tables: This consists of rows and columns used to compare variables. Tables can show a
great deal of information in a structured way, but they can also overwhelm users that are
simply looking for high-level trends.
 Pie charts and stacked bar charts: These graphs are divided into sections that represent
parts of a whole. They provide a simple way to organize data and compare the size of each
component to one other.
 Line charts and area charts: These visuals show change in one or more quantities by
plotting a series of data points over time and are frequently used within predictive analytics.
Line graphs utilize lines to demonstrate these changes while area charts connect data points
with line segments, stacking variables on top of one another and using color to distinguish
between variables.
 Histograms: This graph plots a distribution of numbers using a bar chart (with no spaces
between the bars), representing the quantity of data that falls within a particular range. This
visual makes it easy for an end user to identify outliers within a given dataset.

 Scatter plots: These visuals are beneficial in revealing the relationship between two variables, and they are commonly used within regression data analysis. However, these can sometimes be confused with bubble charts, which are used to visualize three variables via the x-axis, the y-axis, and the size of the bubble.
 Heat maps: These graphical representations are helpful in visualizing behavioral data by location. This can be a location on a map, or even a webpage.
 Tree maps: These display hierarchical data as a set of nested shapes, typically rectangles. Tree maps are great for comparing the proportions between categories via their area size.

Benefits Of Data Visualization:


Data visualization can be used in many contexts in nearly every field, like public policy, finance,
marketing, retail, education, sports, history, and more. Here are the benefits of data visualization:
 Adapt to Emerging Trends: Data visualization allows for the identification of trends as they
develop, enabling timely decision-making based on up-to-date information.
 Save Valuable Time: Presenting data visually makes it easier and faster to understand
complex information, reducing the time spent analyzing raw data.
 Detecting and Limiting Errors: Visualization helps identify inconsistencies or errors in data
more effectively, allowing for quicker resolutions.

 Enhanced Understanding of Operations: Graphical representation of data simplifies
operational insights, making it easier to grasp the overall performance and underlying
processes.
 Find Hidden Patterns: Data visualization tools help uncover patterns and correlations within
data sets that might not be evident in raw formats.

Data Visualization And Big Data:


Companies collect “big data” and synthesize it into information. Data visualization helps portray
significant insights, such as a heat map that illustrates regions where individuals search for mental
health assistance. To synthesize all that data, visualization software can be used in conjunction with
data collection software.

Advantages:
1. Enhanced Understanding: Data visualization helps in simplifying complex data sets,
making it easier to understand patterns, trends, and insights that might be difficult to grasp
from raw data alone.
2. Quick Insights: Visual representations such as charts, graphs, and maps allow users to
quickly grasp key information and make faster decisions based on visual summaries.

3. Effective Communication: Visualizations can convey information more effectively than
text or tables, especially when presenting to non-technical audiences. They can illustrate
relationships and trends clearly.
4. Better Data Exploration: Interactive visualizations allow users to explore data
dynamically, filter information, and drill down into specific aspects, which can lead to
deeper insights and better decision-making.
5. Highlighting Outliers and Patterns: Visualizations can easily highlight anomalies,
outliers, and patterns that might not be obvious in raw data, helping in identifying important
trends or issues.
6. Storytelling: Well-designed visualizations can tell a story and guide the viewer through a
narrative, making data more engaging and memorable.

Disadvantages:
1. Misleading Representations: Poorly designed visualizations can mislead or confuse
viewers. For example, improper scaling or misleading axis labels can distort the data’s true
meaning.
2. Information Overload: Overly complex or cluttered visualizations can overwhelm viewers
with too much information, making it hard to discern key points or insights.
3. Dependency on Design Skills: Effective data visualization requires a good understanding
of design principles. Without proper design, visualizations may fail to communicate the
intended message or may be aesthetically unappealing.
4. Loss of Detail: Simplifying data for visualization might lead to the loss of nuance and
detailed information, which could be important for comprehensive analysis.
5. Accessibility Issues: Not all visualizations are accessible to individuals with visual
impairments or other disabilities. Ensuring that visualizations are inclusive can be a
challenge.
6. Technical Requirements: Creating high-quality visualizations often requires specialized
tools and software, and a certain level of technical expertise, which might not be available to
all users.

Code:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Sample wine dataset
data = {
    'Wine': ['Wine A', 'Wine B', 'Wine C', 'Wine D', 'Wine E', 'Wine F',
             'Wine G', 'Wine H', 'Wine I', 'Wine J', 'Wine K', 'Wine L'],
    'Alcohol': [13.5, 12.7, 13.2, 14.1, 12.5, 13.8, 14.2, 12.9, 13.3, 13.0,
                14.0, 12.6],
    'Malic_Acid': [1.8, 2.1, 2.5, 1.9, 2.0, 2.3, 2.1, 2.2, 1.7, 2.4, 2.3, 2.0],
    'Hue': [1.05, 1.15, 1.10, 1.02, 1.20, 1.12, 1.08, 1.16, 1.05, 1.10, 1.07,
            1.13],
    'Class': ['Red', 'White', 'Red', 'White', 'Red', 'White', 'Red', 'White',
              'Red', 'White', 'Red', 'White']
}

# Create DataFrame
wine_data = pd.DataFrame(data)

# Save to CSV
wine_data.to_csv('wine_data.csv', index=False)

# Load the dataset
wine_data = pd.read_csv('wine_data.csv')

# Boxplot
plt.figure(figsize=(10, 6))
sns.boxplot(x='Class', y='Alcohol', data=wine_data)
plt.title('Boxplot of Alcohol Content by Wine Class')
plt.xlabel('Wine Class')
plt.ylabel('Alcohol Content')
plt.savefig('boxplot.png')
plt.show()

# Histogram
plt.figure(figsize=(10, 6))
sns.histplot(wine_data['Alcohol'], bins=10, color='blue', edgecolor='black')
plt.title('Histogram of Alcohol Content')
plt.xlabel('Alcohol Content')
plt.ylabel('Frequency')
plt.savefig('histogram.png')
plt.show()

# Scatter Plot with Line
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Alcohol', y='Malic_Acid', hue='Class', data=wine_data)
sns.lineplot(x='Alcohol', y='Malic_Acid', hue='Class', data=wine_data,
             estimator=None, markers=False)
plt.title('Scatter Plot with Trend Line for Alcohol vs Malic Acid')

plt.xlabel('Alcohol Content')
plt.ylabel('Malic Acid')
plt.savefig('scatter_line_plot.png')
plt.show()

# Line Plot
# Using a combination of 'Alcohol' and 'Malic_Acid' for the line plot
plt.figure(figsize=(10, 6))
sns.lineplot(x='Alcohol', y='Malic_Acid', hue='Class', data=wine_data,
             marker='o')
plt.title('Line Plot of Malic Acid vs Alcohol Content by Wine Class')
plt.xlabel('Alcohol Content')
plt.ylabel('Malic Acid')
plt.savefig('line_plot.png')
plt.show()

df = pd.DataFrame({
    'time': range(1, 11),
    'value': [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
})
plt.figure(figsize=(10, 6))
plt.plot(df['time'], df['value'], marker='o', color='purple')
plt.title('Line Chart Example')
plt.xlabel('Time')
plt.ylabel('Value')
plt.savefig('line_chart.png')
plt.show()

Outputs:
group-46-bda

October 17, 2024

[ ]: !pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.3.tar.gz (317.3 MB)
  Preparing metadata (setup.py) ... done
Requirement already satisfied: py4j==0.10.9.7 in /usr/local/lib/python3.10/dist-packages (from pyspark) (0.10.9.7)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.3-py2.py3-none-any.whl size=317840625 sha256=2c7af37129d8a15a6d7ceadb2f983ddd365df68fd2264788beb038d700727db9
  Stored in directory: /root/.cache/pip/wheels/1b/3a/92/28b93e2fbfdbb07509ca4d6f50c5e407f48dce4ddbda69a4ab
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.3

[ ]: # Import necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import RegressionEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.classification import NaiveBayes

[ ]: # Start Spark Session
spark = SparkSession.builder.appName("BDA_Project").getOrCreate()

# Cell 2: Load the dataset
file_path = "/content/mumbai-monthly-rains.csv"  # Adjust the file path as needed
data = spark.read.option("header", True).csv(file_path)

# Show the structure of the dataset
data.printSchema()
data.show(5)

root
|-- Year: string (nullable = true)
|-- Jan: string (nullable = true)
|-- Feb: string (nullable = true)
|-- Mar: string (nullable = true)
|-- April: string (nullable = true)
|-- May: string (nullable = true)
|-- June: string (nullable = true)
|-- July: string (nullable = true)
|-- Aug: string (nullable = true)
|-- Sept: string (nullable = true)
|-- Oct: string (nullable = true)
|-- Nov: string (nullable = true)
|-- Dec: string (nullable = true)
|-- Total: string (nullable = true)

(The data.show(5) output is a wide 14-column console table listing the first five years, 1901-1905, with their monthly rainfall values and yearly totals of approximately 2182.48, 1960.97, 2519.61, 1441.32, and 1080.02 respectively; only the top 5 rows are shown, and the wrapped console table is omitted here for readability.)

[ ]: # Cell 3: Data Preprocessing

# Convert columns to numeric type
for col_name in data.columns[1:]:
    data = data.withColumn(col_name, col(col_name).cast("double"))

# Fill missing values (if any) with the column average
from pyspark.sql.functions import mean

for col_name in data.columns[1:]:
    mean_value = data.select(mean(col_name)).collect()[0][0]
    data = data.na.fill(mean_value, subset=[col_name])

[ ]: # Cell 4: Feature Engineering

# Create a features vector for MLlib
feature_cols = data.columns[1:-1]  # Exclude 'Year' and 'Total'
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
data = assembler.transform(data)

[ ]: # Cell 5: Train-Test Split

train_data, test_data = data.randomSplit([0.8, 0.2], seed=1234)

[ ]: # Cell 6: Linear Regression Model (Total rainfall prediction)

lr = LinearRegression(featuresCol="features", labelCol="Total")
lr_model = lr.fit(train_data)
predictions = lr_model.transform(test_data)

[ ]: # Evaluate the model

evaluator = RegressionEvaluator(labelCol="Total", predictionCol="prediction",
                                metricName="rmse")
rmse = evaluator.evaluate(predictions)
print(f"RMSE for Linear Regression: {rmse}")

RMSE for Linear Regression: 3.0926395796319347e-07
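
(Note: a near-zero RMSE is expected here, because the Total column is, by the dataset description, the sum of the twelve monthly columns used as features, so a linear model can reproduce it almost exactly.)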

[ ]: # Cell 7: Random Forest Classifier (Categorical prediction, e.g. Yearly rain category)

# Create a new label for classification (example: High or Low rainfall)
threshold = data.select(mean('Total')).collect()[0][0]
data = data.withColumn("RainfallCategory",
                       when(col("Total") > threshold, 1).otherwise(0))

# Split the data
train_data, test_data = data.randomSplit([0.8, 0.2], seed=1234)

# Random Forest
rf = RandomForestClassifier(featuresCol="features", labelCol="RainfallCategory")
rf_model = rf.fit(train_data)
rf_predictions = rf_model.transform(test_data)

# Evaluate the classifier
evaluator = MulticlassClassificationEvaluator(labelCol="RainfallCategory",
                                              predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(rf_predictions)
print(f"Accuracy for Random Forest: {accuracy}")

Accuracy for Random Forest: 0.8387096774193549

[ ]: # Cell 8: Cross Validation and Model Tuning for Random Forest

paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [5, 10, 20]) \
    .build()

crossval = CrossValidator(estimator=rf, estimatorParamMaps=paramGrid,
                          evaluator=evaluator, numFolds=5)
cv_model = crossval.fit(train_data)

# Evaluate cross-validated model
cv_predictions = cv_model.transform(test_data)
cv_accuracy = evaluator.evaluate(cv_predictions)
print(f"Cross-validated accuracy: {cv_accuracy}")

Cross-validated accuracy: 0.8387096774193549

[ ]: nb = NaiveBayes(featuresCol="features", labelCol="RainfallCategory",
                modelType="multinomial")
nb_model = nb.fit(train_data)
nb_predictions = nb_model.transform(test_data)

# Evaluate Naive Bayes
nb_evaluator = MulticlassClassificationEvaluator(labelCol="RainfallCategory",
                                                 predictionCol="prediction",
                                                 metricName="accuracy")
nb_accuracy = nb_evaluator.evaluate(nb_predictions)
print(f"Accuracy for Naive Bayes: {nb_accuracy}")

# Optional: Cross Validation for Naive Bayes
paramGrid_nb = ParamGridBuilder() \
    .addGrid(nb.smoothing, [0.5, 1.0, 1.5]) \
    .build()

crossval_nb = CrossValidator(estimator=nb, estimatorParamMaps=paramGrid_nb,
                             evaluator=nb_evaluator, numFolds=5)
cv_nb_model = crossval_nb.fit(train_data)

# Evaluate cross-validated model
cv_nb_predictions = cv_nb_model.transform(test_data)
cv_nb_accuracy = nb_evaluator.evaluate(cv_nb_predictions)
print(f"Cross-validated accuracy for Naive Bayes: {cv_nb_accuracy}")

Accuracy for Naive Bayes: 0.5483870967741935


Cross-validated accuracy for Naive Bayes: 0.5483870967741935

[ ]: import matplotlib.pyplot as plt

# Metrics for each algorithm
algorithms = ['Linear Regression', 'Random Forest', 'Naive Bayes']
rmse_values = [rmse, None, None]  # Only Linear Regression has RMSE
accuracy_values = [None, accuracy, nb_accuracy]
cv_accuracy_values = [None, cv_accuracy, cv_nb_accuracy]

# Plotting the RMSE for Linear Regression (ignoring None values)
rmse_algorithms = [algorithms[i] for i in range(len(rmse_values))
                   if rmse_values[i] is not None]
rmse_values_filtered = [value for value in rmse_values if value is not None]

plt.figure(figsize=(10, 6))
plt.bar(rmse_algorithms, rmse_values_filtered, color='blue', label='RMSE')
plt.xlabel('Algorithms')
plt.ylabel('RMSE')
plt.title('RMSE Comparison for Linear Regression')
plt.legend()
plt.show()

# Plotting accuracy for Random Forest and Naive Bayes (ignoring None values)
accuracy_algorithms = [algorithms[i] for i in range(len(accuracy_values))
                       if accuracy_values[i] is not None]
accuracy_values_filtered = [value for value in accuracy_values if value is not None]

plt.figure(figsize=(10, 6))
plt.bar(accuracy_algorithms, accuracy_values_filtered, color='green', label='Accuracy')
plt.xlabel('Algorithms')
plt.ylabel('Accuracy')
plt.title('Accuracy Comparison for Random Forest and Naive Bayes')
plt.legend()
plt.show()

# Plotting cross-validated accuracy for Random Forest and Naive Bayes (ignoring None values)
cv_accuracy_algorithms = [algorithms[i] for i in range(len(cv_accuracy_values))
                          if cv_accuracy_values[i] is not None]
cv_accuracy_values_filtered = [value for value in cv_accuracy_values if value is not None]

plt.figure(figsize=(10, 6))
plt.bar(cv_accuracy_algorithms, cv_accuracy_values_filtered, color='red',
        label='Cross-Validated Accuracy')
plt.xlabel('Algorithms')
plt.ylabel('Cross-Validated Accuracy')
plt.title('Cross-Validated Accuracy Comparison for Random Forest and Naive Bayes')
plt.legend()
plt.show()

Thadomal Shahani Engineering College, Bandra, Mumbai
Department of Computer Engineering
Big Data Analytics ( Mini Project) SEM VII

Subject: Big Data Analytics (CSC702)

AY: 2024-25

Experiment 10

(Mini Project)

Aim: Design the infrastructure of a Big Data Application.

Tasks to be completed by the students:


Task 1: Choose a problem definition which requires handling Big Data.
Task 2: Design the data pipeline for your application.
Task 3: Deploy your project on suitable platform.
Task 4: Test your application with different volume, variety and velocity of data.


Report on Mini Project

Subject: Big Data Analytics (CSC702)

AY: 2024-25

Rainfall Analytics and Prediction using PySpark

Dhruv Aswani 2213194

Bhavisha Hemwani 2213203

Nipoon Dembani 2213204

Hirday Rochani 2213205

Guided By

Dr. Anagha Durugkar


CHAPTER 1: INTRODUCTION

In today’s data-driven world, analyzing and predicting rainfall patterns is essential for
effective resource management and planning. To achieve this, our project, "Rainfall Analytics
and Prediction using PySpark," leverages the power of Apache Spark, a robust and scalable
data processing platform.
Spark plays a central role in our system by enabling efficient handling of large datasets. It
allows us to process extensive historical rainfall data with ease, performing complex
transformations and analyses in parallel. With Spark's distributed computing capabilities, we
can handle these data operations at scale, making predictions faster and more accurate.
To enhance the analytical power of our system, we utilize Spark's machine learning libraries
to develop predictive models. These models can analyze past trends in rainfall and generate
accurate predictions for future patterns. By integrating these predictive models with real-time
data, we ensure that our rainfall forecasting remains both relevant and reliable.
This approach not only improves the accuracy of our rainfall predictions but also enhances
decision-making processes for industries dependent on weather patterns, leading to better
preparedness and resource allocation.


CHAPTER 2: DATA DESCRIPTION AND ANALYSIS

2.1 Data Description:


The Rainfall Analytics and Prediction system using PySpark relies on historical rainfall data to analyze
and predict patterns. The dataset consists of monthly rainfall measurements recorded over a period of
years. The columns in this file include:

1. Year: The year of the recorded data.

2. Monthly Data (Jan, Feb, Mar, etc.): Rainfall data for each month.

3. Total: The sum of rainfall for the entire year.

This dataset forms the basis for analyzing trends and seasonal variations in rainfall. By leveraging
PySpark's distributed computing capabilities, the system efficiently processes and analyzes this data to
derive insights, helping to make accurate predictions about future rainfall patterns.

2.2 Data Collection:


The rainfall data is collected from various sources, including weather APIs, government
meteorological services, and publicly available datasets. The system processes and stores the
collected data in real-time or near real-time to facilitate timely analysis and predictions. The
key components of data collection in this system are:

• Data Sources:

o Weather APIs: Data is fetched from APIs that provide real-time weather updates and
historical rainfall data.

o Meteorological Services: Government and regional meteorological departments provide
reliable data on rainfall patterns and forecasts.

o Public Datasets: Open-source datasets that contain historical rainfall data can be used for
analysis and model training.

• Data Processing:

o The collected data is processed using Apache Spark, which allows for distributed data
processing. This enables the system to handle large volumes of data efficiently.

o Data cleaning and transformation steps are applied to ensure that the dataset is ready for
analysis. This includes handling missing values, normalizing data formats, and
aggregating data as necessary.

2.3 Data Processing:


The data processing pipeline utilizes Apache Spark to handle rainfall data from CSV files.
Key steps include:

1. Data Ingestion: Load rainfall data into Spark DataFrames for efficient processing.

2. Data Cleaning: Handle missing values and remove outliers to ensure data quality.

3. Data Transformation: Perform feature engineering, extracting relevant features, and
normalizing data for model training.

4. Exploratory Data Analysis (EDA): Visualize relationships and trends in the data to
guide model selection.

5. Model Training: Split data into training and testing sets, applying machine learning
algorithms to build and optimize rainfall prediction models.

6. Real-Time Predictions: (If applicable) Update models with new data periodically for
ongoing accuracy.

Using Spark’s distributed computing power, the system processes large datasets efficiently,
enabling timely rainfall predictions.

2.4 Data Analysis:


The Rainfall Analytics and Prediction system analyzes historical rainfall data to derive
insights and predictions. Key metrics include:

• Annual Rainfall Trends: Examining total rainfall from 1901 to 2021 to identify long-
term fluctuations.

• Seasonal Variability: Analyzing monthly data to determine peak rainfall months (e.g.,
June and July) and dry periods for better planning.

• Monthly Averages: Calculating average monthly rainfall to establish typical patterns
and highlight significant rainfall months.

• Outliers: Identifying years with extreme rainfall to inform predictive modeling.

• Correlation Analysis: Exploring relationships between months to improve prediction
accuracy.

This analysis enhances rainfall predictions, aiding in resource management and preparedness.
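
As an illustrative sketch (not part of the original notebook), the monthly averages and outlier years described above could be computed directly on the Spark DataFrame loaded earlier; the column names below follow the dataset schema.

from pyspark.sql.functions import avg, col

# Average rainfall for each month across all years (typical monthly pattern)
monthly_cols = ["Jan", "Feb", "Mar", "April", "May", "June",
                "July", "Aug", "Sept", "Oct", "Nov", "Dec"]
data.select([avg(c).alias(c) for c in monthly_cols]).show()

# Years with the highest total rainfall (candidate outliers)
data.orderBy(col("Total").desc()).select("Year", "Total").show(5)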

2.5 Data Validation:


Before processing, the input rainfall data undergoes validation to ensure its integrity. The
validation process includes:

• Missing or Corrupt Data: Checking for entries with incomplete or invalid values.

• Duplicates: Identifying and removing duplicate records to maintain data accuracy.

• Consistency: Ensuring uniform formatting across the dataset (e.g., date formats).

By validating the data, we guarantee that our analyses and predictions are based on accurate
and reliable information. This foundational step is crucial for generating meaningful insights
and enhancing the overall effectiveness of the Rainfall Analytics and Prediction system.
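
A minimal PySpark sketch of these validation checks, assuming the same data DataFrame and column names used in the notebook above:

from pyspark.sql.functions import col, count, when

# Missing or corrupt data: count null entries per column
data.select([count(when(col(c).isNull(), c)).alias(c) for c in data.columns]).show()

# Duplicates: drop exact duplicate records
data = data.dropDuplicates()

# Consistency: keep only rows whose Year falls in the expected range
data = data.filter((col("Year").cast("int") >= 1901) & (col("Year").cast("int") <= 2021))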


CHAPTER 3: DESIGN OF DATA PIPELINE

Below is the breakdown of the pipeline shown in the diagram:


1. Load Dataset: The pipeline starts by loading the Mumbai monthly rainfall data from a
CSV file.
2. Data Preprocessing: This stage includes converting columns to numeric type and filling
missing values with column averages.
3. Feature Engineering: Here, a feature vector is created for use in MLlib, using
VectorAssembler.
4. Train-Test Split: The data is split into training (80%) and testing (20%) sets.
5. Model Training and Evaluation: The pipeline then branches into three parallel
processes:
o Linear Regression: For predicting total rainfall
o Random Forest Classification: For classifying yearly rainfall into categories
o Naive Bayes Classification: Another method for rainfall category classification
6. Cross-Validation: Both Random Forest and Naive Bayes models undergo cross-
validation for hyperparameter tuning.
7. Visualization: The final stage where results from all models are visualized for
comparison.
This pipeline design allows for parallel processing of different models and techniques,
providing a comprehensive analysis of the rainfall data. It also incorporates both regression (for
prediction) and classification tasks, along with model tuning through cross-validation.
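
For reference, a minimal sketch of how the preprocessing and modeling stages described above could be chained with the pyspark.ml Pipeline API; the column names follow the dataset schema, and the stage objects shown are illustrative rather than the exact ones used in the notebook.

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

# Assumes the monthly columns have already been cast to double, as in Cell 3,
# and that no "features" column has been added yet (the assembler is a pipeline stage here).
assembler = VectorAssembler(
    inputCols=["Jan", "Feb", "Mar", "April", "May", "June",
               "July", "Aug", "Sept", "Oct", "Nov", "Dec"],
    outputCol="features")
rf = RandomForestClassifier(featuresCol="features", labelCol="RainfallCategory")

# Feature assembly followed by classification, mirroring steps 3 and 5 above
pipeline = Pipeline(stages=[assembler, rf])
model = pipeline.fit(train_data)        # train_data from the 80/20 split in step 4
predictions = model.transform(test_data)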


CHAPTER 4: RESULT ANALYSIS

(The result figures for this chapter are the bar charts produced in the notebook above, comparing RMSE for Linear Regression and the accuracy and cross-validated accuracy of Random Forest and Naive Bayes.)

CHAPTER 5: CONCLUSION AND FUTURE SCOPE

Conclusion:
The Rainfall Analytics and Prediction system developed using PySpark effectively showcases
how data processing can enhance understanding and forecasting of rainfall patterns. By
leveraging PySpark for distributed data processing, we have established a scalable solution
capable of handling large datasets.

The system is built on principles of data analysis and predictive modeling, allowing for
accurate forecasts based on historical rainfall data. This approach ensures that stakeholders
receive timely insights, which can lead to informed decision-making and improved planning.

Key takeaways from the project:

• Accurate Forecasting: The system utilizes historical data to predict future rainfall,
providing reliable insights for planning and resource allocation.

• Scalability: PySpark’s distributed computing capabilities enable the analysis of large
datasets, ensuring that the system can grow with increasing data volumes.

• Efficient Data Processing: The streamlined workflow from data ingestion to analysis
minimizes processing time, which is crucial for timely predictions.

In summary, the integration of PySpark has provided an effective solution to the challenges of
rainfall analytics and prediction. This solution is adaptable and can be employed across
various domains to enhance the understanding and management of rainfall-related phenomena.


Future Scope:
While the current implementation successfully provides rainfall analytics and predictions,
several areas for enhancement and expansion could further improve system performance and
user experience:

1. Integration of Advanced Machine Learning Models:

o Future versions of the system could incorporate advanced machine learning
models, such as deep learning-based time series forecasting or ensemble
methods, to enhance prediction accuracy.

o These models could leverage additional data points, such as satellite imagery,
meteorological data, and historical climate patterns, to provide more robust
forecasts.

2. Hybrid Forecasting Approaches:

o Implementing a hybrid approach that combines statistical methods with machine
learning techniques could yield better results. This method could integrate
traditional forecasting models with machine learning to account for both linear
and nonlinear trends in rainfall data.

3. Enhanced Data Visualization and Reporting:

o Integrating visualization tools such as Tableau or Matplotlib for data
presentation could provide stakeholders with clearer insights into rainfall
patterns and trends.

o Dynamic dashboards could facilitate real-time monitoring and allow users to
explore data interactively.

4. User Segmentation and Targeted Alerts:

o Future iterations could include user segmentation based on location, historical
rainfall data, and agricultural practices, allowing for tailored alerts and
recommendations for irrigation or crop management based on predicted rainfall.

5. Support for Multilingual and Regional Adaptations:

o As the system is utilized across different regions, supporting multiple languages
and adapting to regional climatic conditions will enhance its usability and
relevance.

6. Integration with Other Big Data Technologies:

o Exploring integration with big data technologies like Apache Flink or Google
BigQuery could provide more flexible data processing options, especially for
real-time analytics and batch processing.

7. A/B Testing for Continuous Improvement:

o Incorporating A/B testing frameworks could help optimize prediction models by
continuously evaluating the effectiveness of different forecasting methods based
on accuracy metrics.

8. Real-Time Feedback Mechanism:

o Establishing a real-time feedback loop that incorporates user feedback and actual
rainfall observations would enhance the model's adaptability, ensuring more
accurate predictions over time.

In conclusion, this project serves as a solid foundation for building scalable rainfall analytics
and prediction systems. As technology evolves and data becomes more complex, the
enhancements outlined above will ensure that the system remains efficient, responsive, and
competitive in the field of climate data analytics. With these potential developments, the
system can transition from a basic predictive tool to an advanced platform capable of
delivering high-quality insights and fostering proactive decision-making for stakeholders.
