Exp1 Hirday Merged
Theory:
What Is Hadoop?
Hadoop is an open-source framework that allows for the distributed processing of large datasets
across clusters of computers using simple programming models. It is designed to scale up from a
single server to thousands of machines, each offering local computation and storage.
Key Components of Hadoop:
   1. Hadoop Distributed File System (HDFS):
      HDFS is the storage system used by Hadoop. It is designed to store very large files across multiple machines. It provides high-throughput access to application data and is designed to be fault-tolerant by replicating data across multiple nodes.
   2. MapReduce:
      MapReduce is the programming model used by Hadoop to process large datasets. It breaks down a task into smaller sub-tasks (Map), processes them in parallel, and then aggregates the results (Reduce).
   3. YARN (Yet Another Resource Negotiator):
      YARN is the resource management layer of Hadoop. It handles the allocation of resources in the cluster, ensuring that different tasks have the necessary computational power to execute.
   4. Hadoop Common:
      These are the common utilities and libraries that support the other Hadoop modules, providing the essential services and functions needed by the other modules.
Advantages of Hadoop:
Disadvantages of Hadoop:
   1. Problem with Small Files: Hadoop struggles with large numbers of small files, as it is
      optimized for handling large files split into sizable blocks.
   2. Vulnerability: Being written in Java, Hadoop is more susceptible to security vulnerabilities,
      potentially exposing it to cyber threats.
   3. Low Performance with Small Data: Hadoop is designed for large datasets, and its
      efficiency drops when processing small amounts of data.
   4. Lack of Security: Hadoop’s security features, like Kerberos, are complex to manage and
      lack robust encryption, making data security a concern.
   5. High Processing Overhead: Hadoop’s read/write operations are disk-based, leading to
      processing overhead and inefficiency in handling in-memory calculations.
   6. Supports Only Batch Processing: Hadoop is designed for batch processing, with limited
      support for real-time or low-latency processing tasks.
Step1:
Download the Cloudera QuickStart VM for VirtualBox from the Cloudera website.
Step2:
After downloading Cloudera, unzip it using a zip extractor and extract the files. Upon completion, open the VirtualBox software and select the Import option.
After selecting Import, provide the path of the previously extracted Cloudera appliance file.
Step3:
In the appliance settings, change the CPU section value from ‘1’ to ‘4’.
Step4:
Proceed further if your VirtualBox homepage looks like this
Now click on the cloudera-quickstart-vm entry, which initially shows as Powered Off. Once you click on it, change the display settings and set the video memory to a value between 0 and 40 MB.
Step5:
 Now click on the Start button and wait for a few minutes. Initially, your window will look like
 this.
HDFS COMMANDS:
3. Report the amount of space used and available on the currently mounted filesystem.
hadoop fs -df hdfs:/
4. Count the number of directories, files, and bytes under the paths that match the specified file pattern.
hadoop fs -count hdfs:/
7. Create a new directory named "C32" below the /user/training directory in HDFS. Since you're
currently logged in with the "training" user ID, /user/training is your home directory in HDFS.
hadoop fs -mkdir /user/training/C32
8. Add a sample text file from the local directory named "data" to the new directory you created in
HDFS during the previous step.
hadoop fs -put data/sample.txt /user/training/C32
10. Add the entire local directory called "retail" to the /user/training/C32 directory in HDFS.
hadoop fs -put data/retail /user/training/C32
11. Since /user/training is your home directory in HDFS, any command that does not have an
absolute path is interpreted as relative to that directory. The next command will therefore list your
home directory, and should show the items you've just added there.
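Presumably the listing command used here is:
hadoop fs -ls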
15. Delete all files from the "retail" directory using a wildcard.
hadoop fs -rmr hadoop/retail/*
17. Finally, remove the entire retail directory and all of its contents in HDFS.
hadoop fs -rmr hadoop/retail
19. Add the purchases.txt file from the local directory named "/home/training/" to the hadoop
directory you created in HDFS
hadoop fs -copyFromLocal /home/training/purchases.txt hadoop/
20. To view the contents of your text file purchases.txt which is present in your hadoop directory.
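The command itself is not reproduced in the report; presumably it is:
hadoop fs -cat hadoop/purchases.txt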
21. Copy the purchases.txt file from the "hadoop" directory in HDFS to the "data" directory in your local filesystem.
hadoop fs -copyToLocal hadoop/purchases.txt /home/training/data
25. The default replication factor for a file is 3. Use the '-setrep' command to change the replication factor of a file.
hadoop fs -setrep -w 2 apache_hadoop/sample.txt
hadoop fs -help
 Name: Hirday Rochani                    Experiment No: 2                  Roll No: 2213205
Aim: Use Of Sqoop Tool To Transfer Data Between Hadoop And Relational Database Servers.
Theory:
 Sqoop Features:
 Sqoop has several features, which makes it helpful in the Big Data world:
   1. Parallel Import/Export
      Sqoop uses the YARN framework to import and export data. This provides fault tolerance
      on top of parallelism.
Sqoop Architecture:
1. The client submits the import/ export command to import or export data.
2. Sqoop fetches data from different databases. Here, we have an enterprise data warehouse,
document-based systems, and a relational database. We have a connector for each of these;
connectors help to work with a range of accessible databases.
4. Similarly, numerous map tasks will export the data from HDFS onto the RDBMS using the Sqoop export command.
Sqoop Import:
The diagram below represents the Sqoop import mechanism.
  1. In this example, a company’s data is present in the RDBMS. This data is imported using Sqoop, which first performs an introspection of the database to gather metadata (such as primary key information).
  2. It then submits a map-only job. Sqoop divides the input dataset into splits and uses
     individual map tasks to push the splits to HDFS.
Sqoop Export
Sqoop Processing:
Processing takes place step by step, as shown below:
  1. Sqoop runs in the Hadoop cluster.
  2. It imports data from the RDBMS or NoSQL database to HDFS.
  3. It uses mappers to slice the incoming data into multiple formats and loads the data in HDFS.
  4. It exports data back into the RDBMS while ensuring that the schema of the data in the database is maintained.
Advantages of Sqoop:
Disadvantages of Sqoop:
Commands:
Step1:
Open the Cloudera terminal and execute the following command to connect to the MySQL server. (Note: The default password for the root user is cloudera.)
mysql -u root -p
Step2:
Creating a database
create database bank1;
Step3:
Creating a Table
(Note: The database must be in use before you create a table; run "use bank1;" first.)
Step4:
Insert values
CLOUDERA
After you exit MySQL, create a folder in the Cloudera file system into which the MySQL table created above will be imported. (In the following steps, the 'myfirstdata' folder is created in /home/cloudera.)
Step5:
Importing the table using Sqoop
sqoop import --connect jdbc:mysql://youripaddress:3306/<database_name> --username root --
password cloudera --table <table_name> --target-dir=<target_directory> -m 1
Here,
-m specifies the number of mappers
3306 is the default port for MySQL
Step6:
Displaying the contents in HDFS
hadoop fs -ls /home/cloudera/myfirstdata
hadoop fs -cat /home/cloudera/myfirstdata/part-m-00000
Before exporting the data from HDFS back to MySQL, create a new table in the MySQL database. The table which we will be creating needs to have the same structure as the 'register' table which we created earlier.
Step7:
Creating the table
Step8:
Exporting data from HDFS to MySQL
Syntax:
sqoop export --connect jdbc:mysql://localhost/db --username root --table <table_name>
--export-dir <directory>
Step9:
Verifying in MySQL
We can see that the data has been exported successfully into 'registercopy'.
 Name:Hirday Rochani                     Experiment No: 3                  Roll No: 2213205
Theory:
   1. Scalability: HBase scales horizontally by distributing data across many servers (also called
      RegionServers). This enables it to store and manage petabytes of data.
   2. Column-Oriented Storage: HBase stores data in a column-family format, allowing
      efficient reads and writes on column-based datasets. Column-oriented databases are
      particularly useful when you have sparse data or need to perform aggregate queries on
      specific columns.
   3. Real-Time Data Access: Unlike Hadoop's MapReduce jobs, which provide batch
      processing, HBase allows random, real-time read and write access to data, making it ideal
      for applications requiring low-latency access.
   4. NoSQL Design: HBase is schema-less, meaning you don't have to predefine table schemas
      (apart from column families). It supports flexible data models with dynamic columns.
   5. Automatic Sharding: HBase automatically divides tables into smaller regions and
      distributes them across different nodes in a cluster, improving performance and load
      balancing.
   6. Fault Tolerance: Built on HDFS, HBase inherits its fault tolerance, where data is replicated
      across multiple nodes to prevent data loss in case of hardware failures.
   7. Strong Consistency: HBase ensures that all operations are atomic and consistent. Data
      writes are always immediately visible to subsequent reads.
   8. Integration with Hadoop: HBase integrates seamlessly with Hadoop's ecosystem,
      including tools like Apache Hive, Apache Pig, and Apache Spark for data analysis and
      querying.
Use Cases:
   1. Time-series Data: Applications that store and process time-series data, such as monitoring
      and IoT data.
   2. Data Warehousing: HBase can handle vast amounts of semi-structured or unstructured
      data, making it ideal for data warehousing and data lakes.
   3. Real-time Analytics: HBase supports real-time querying and data processing, making it
      useful for systems that require quick response times.
Architecture of HBASE:
Components of HBase:
    1. RegionServers: Each RegionServer hosts multiple regions (partitions of tables) and handles
         read and write requests for the data within those regions.
    2. HMaster: The master server in the HBase architecture manages RegionServers and is
         responsible for load balancing and failover.
    3. Zookeeper: Used for distributed coordination, it helps manage the distributed environment
         and keeps track of RegionServers.
Commands:
Step1:
Open the Cloudera terminal and execute the following command to connect to the MySQL server. (Note: The default password for the root user is cloudera.)
mysql -u root -p
Step2:
Creating a database
create database bank1;
Step3:
Creating a Table
(Note: The database must be in use before you create a table; run "use bank1;" first.)
Step4:
Insert values
CLOUDERA
After you exit MySQL, create a folder in the Cloudera file system into which the MySQL table created above will be imported. (In the following steps, the 'myfirstdata' folder is created in /home/cloudera.)
Step5:
Importing the table using Sqoop
sqoop import --connect jdbc:mysql://youripaddress:3306/<database_name> --username root --
password cloudera --table <table_name> --target-dir=<target_directory> -m 1
Here,
-m specifies the number of mappers
3306 is the default port for MySQL
Step6:
Displaying the contents in HDFS
hadoop fs -ls /home/cloudera/myfirstdata
hadoop fs -cat /home/cloudera/myfirstdata/part-m-00000
Step7:
Creating the table
Step8:
Exporting data from HDFS to MySQL
Syntax:
sqoop export --connect jdbc:mysql://localhost/db --username root --table <table_name>
--export-dir <directory>
Step9:
Verifying in MySQL
We can see that the data is exported successfully into ‘registercopy’
 Name: Hirday Rochani                   Experiment No: 4                 Roll No: 2213205
Theory:
• MapReduce can be used to write applications to process large amounts of data, in parallel, on large clusters of commodity hardware (commodity hardware is simply hardware that is easily available in the local market) in a reliable manner.
• MapReduce is both a processing technique and a programming model for distributed computing, based on the Java programming language/framework.
• The components of MapReduce are:
  Mapper:
  The Map tasks accept one or more chunks from a DFS and turn them into a sequence of key-value pairs. How the input data is converted into key-value pairs is determined by the code written by the user for the Map function.
  Shuffling:
  The process of exchanging the intermediate outputs from the map tasks to where they are required by the reducers is known as "Shuffling".
  Reduce:
  The Reduce tasks combine all of the values associated with a particular key. The code written by the user for the Reduce function determines how the combination is done. All of the values with the same key are presented to a single reducer together.
Role of combiner:
• A combiner is a type of mediator between the mapper phase and the reducer phase. The use of combiners is totally optional. As a combiner sits between the mapper and the reducer, it accepts the output of the map phase as input and passes the key-value pairs on to the reduce operation.
• Combiners are also known as semi-reducers as they reside before the reducer. A combiner is used when the Reduce function is commutative and associative. This means that the values can be combined in any order without affecting the final result.
Advantage of combiners:
• Reduces the time taken for transferring the data from the Mapper to the Reducer. (A small Python illustration of the overall Map-Combine-Shuffle-Reduce flow is given below.)
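To make the Map, Combine, Shuffle, and Reduce roles described above concrete, the following is a small, self-contained Python sketch of a word count. It only illustrates the data flow between the phases; it is not the Hadoop Java program built later in this experiment, and the function names and sample input are illustrative.

from collections import defaultdict

def mapper(line):
    # Map: turn a chunk of input into (key, value) pairs
    return [(word.lower(), 1) for word in line.split()]

def combiner(pairs):
    # Combine (optional): pre-aggregate a mapper's local output to reduce shuffle traffic
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())

def shuffle(all_pairs):
    # Shuffle: group every intermediate value by its key
    grouped = defaultdict(list)
    for word, count in all_pairs:
        grouped[word].append(count)
    return grouped

def reducer(word, counts):
    # Reduce: combine all values for a key (summing is commutative and associative)
    return word, sum(counts)

lines = ["big data is big", "hadoop processes big data"]
mapped = [pair for line in lines for pair in combiner(mapper(line))]
result = dict(reducer(word, counts) for word, counts in shuffle(mapped).items())
print(result)   # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}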
(Note: Wordcount and wordCountJob are just folder names; they refer to the same folder.)
Files:
1) WordCount.java
     import   java.io.IOException;
     import   org.apache.hadoop.io.IntWritable;
      import   org.apache.hadoop.io.LongWritable;
     import   org.apache.hadoop.io.Text;
     import   org.apache.hadoop.mapreduce.Mapper;
     import   org.apache.hadoop.mapreduce.Mapper.Context;
2) WordMapper.java
    import   java.io.IOException;
    import   org.apache.hadoop.io.IntWritable;
     import   org.apache.hadoop.io.LongWritable;
    import   org.apache.hadoop.io.Text;
    import   org.apache.hadoop.mapreduce.Mapper;
    import   org.apache.hadoop.mapreduce.Mapper.Context;
3) WordReducer.java
    import   java.io.IOException;
    import   org.apache.hadoop.io.IntWritable;
    import   org.apache.hadoop.io.Text;
    import   org.apache.hadoop.mapreduce.Reducer;
Go to File System -> usr -> lib -> hadoop -> click on the selected file shown in the figure below.
Go to Mozilla Firefox, search for "Hadoop core 1.2.1 jar download", and download the first result.
Folder structure
Right-click on Wordcount and export it as a JAR file. Export the JAR file to the same location where the other three Java files are present.
After completing all of the above procedure, go to the terminal to run the following commands.
1) Open the terminal.
Change the path: cd /home/training/workspace/Wordcount/src
(Check the path of the Wordcount folder on your system and set the path accordingly.)
Commands:
 Name: Hirday Rochani                    Experiment No: 5                     Roll No: 2213205
Theory:
Pig is a high-level platform or tool which is used to process large datasets. It provides a high level of abstraction over MapReduce and a high-level scripting language, known as Pig Latin, which is used to develop data analysis code. First, to process the data stored in HDFS, the programmers write scripts using the Pig Latin language. Internally, the Pig Engine (a component of Apache Pig) converts all these scripts into specific map and reduce tasks, but this is not visible to the programmers, which is how the high level of abstraction is provided. Pig Latin and the Pig Engine are the two main components of the Apache Pig tool. The result of Pig is always stored in HDFS.
Pig Architecture:
1. Pig Latin Scripts
This is where the process begins. Users write scripts using the Pig Latin language to define the data
analysis tasks and transformations they want to perform. Pig Latin is designed to be a simpler
alternative to writing complex MapReduce code.
2. Apache Pig Components
Grunt Shell: This is the interactive command-line interface where users can write and execute Pig
Latin scripts directly. It’s useful for quick data exploration and script testing.
Pig Server: The Pig server acts as the backend engine that takes Pig Latin scripts, processes them,
and translates them into MapReduce jobs. It handles all the coordination between the various
components in the system.
Parser: The first step after a script is submitted. The parser checks the Pig Latin script for any
syntax errors and ensures that it follows the correct structure. It then generates a Logical Plan,
which outlines the various operations required.
Optimizer: The optimizer analyzes the logical plan and looks for opportunities to improve
efficiency. It removes unnecessary operations, rearranges steps for better performance, and
generally refines the execution plan. This results in an optimized Physical Plan that is more efficient
to execute.
Compiler: Once the physical plan is optimized, it needs to be translated into MapReduce jobs. The
compiler is responsible for turning the physical plan into a series of MapReduce jobs that Hadoop
can execute. Each step in the Pig Latin script corresponds to one or more MapReduce jobs.
Execution Engine: The execution engine is responsible for running the MapReduce jobs generated
by the compiler. It manages job execution on Hadoop, ensuring that the data is processed, results
are collected, and any errors are handled appropriately.
3. MapReduce
Description: Once the Pig components have prepared the MapReduce jobs, they are submitted to
Hadoop’s MapReduce framework for distributed processing. MapReduce is responsible for splitting
the data across nodes in the cluster, running computations in parallel, and aggregating the results.
Advantages of Pig:
  1. Ease of Use: Pig Latin scripts are much simpler to write compared to native Hadoop
     MapReduce code. It abstracts away the complex Java-based MapReduce programming and is
     more like SQL, making it easier for developers and data analysts.
  2. Less Development Time: Pig significantly reduces the time it takes to write, understand,
     and maintain the code due to its high-level abstractions over Hadoop.
  3. Flexible: Pig can handle both structured and semi-structured data (like JSON, XML, etc.),
     making it highly adaptable to different data types.
  4. Improved Productivity: Since the code is more concise and requires fewer lines, developers
     can quickly prototype, test, and debug scripts. This results in increased productivity for data
     processing tasks.
  5. Extensibility: Pig allows users to create their own user-defined functions (UDFs) in Java,
     Python, or other supported languages, making it customizable to specific needs.
  6. Optimized for Performance: Pig optimizes execution plans for Pig Latin scripts, making it
     efficient for processing large datasets.
  7. Dataflow Approach: It follows a dataflow approach where users specify a sequence of
     transformations, and Pig handles how to execute them efficiently.
Disadvantages of Pig:
   1. Learning Curve: Despite being easier than MapReduce, Pig still requires learning the Pig
      Latin scripting language, which may pose a challenge for users unfamiliar with it.
   2. Limited Debugging Tools: While Pig is easier to use than MapReduce, debugging scripts
      can still be complex due to limited debugging tools, especially with very large datasets.
   3. Less Suitable for Complex Analytics: Pig is better suited for ETL (Extract, Transform,
      Load) processes or simple data analytics. For complex machine learning and iterative
      algorithms, it’s less powerful compared to frameworks like Apache Spark.
   4. Latency: Pig runs on Hadoop, so it is bound by the limitations of Hadoop’s batch processing
      model. This can lead to higher latency for processing compared to real-time solutions.
   5. Requires Hadoop Setup: Pig requires an underlying Hadoop cluster, meaning it can't be
      used without a Hadoop environment. This dependency makes it unsuitable for smaller data
      processing tasks.
Pig Latin vs. SQL: Pig Latin allows nested data types like map, tuple, and bag, whereas SQL does not allow nested data types.
Commands:
1. Run PIG:
   [training@localhost ~]$ pig
2. grunt> fs:
3. copyFromLocal:
grunt> copyFromLocal /home/training/sample.txt /user/training/
grunt> cat sample.txt
4. Dump:
 5. Projection:
6. Joins:
grunt> dept = load '/user/training/department.txt' USING PigStorage(',') as (id:int, name:chararray, deptname:chararray, sal:int);
grunt> dump dept;
7. Relational Operators
     a. Cross: The cross operator is used to calculate the cross product of two or more relations.
     grunt> dump student;
   grunt>dump dept;
b) ForEach: This operator is used to generate data transformation based on column data.
   grunt>X = Foreach dept GENERATE id,name;
   grunt> dump X;
 Name: Hirday Rochani                     Experiment No: 6                  Roll No: 2213205
Theory:
What is Apache Hive?
Apache Hive is a data warehouse and ETL tool which provides an SQL-like interface between the user and the Hadoop Distributed File System (HDFS). It is built on top of Hadoop. It is a software project that provides data query and analysis. It facilitates reading, writing, and handling large datasets that are stored in distributed storage and queried using Structured Query Language (SQL) syntax. It is not built for Online Transaction Processing (OLTP) workloads. It is frequently used for data warehousing tasks like data encapsulation, ad-hoc queries, and analysis of huge datasets. It is designed to enhance scalability, extensibility, performance, fault tolerance, and loose coupling with its input formats.
Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a
massive scale. A data warehouse provides a central store of information that can easily be analyzed
to make informed, data driven decisions. Hive allows users to read, write, and manage petabytes of
data using SQL.
Hive is built on top of Apache Hadoop, which is an open-source framework used to efficiently
store and process large datasets. As a result, Hive is closely integrated with Hadoop, and is
designed to work quickly on petabytes of data.
Modes of Hive:
Local Mode –
It is used when Hadoop is installed in pseudo-distributed mode with only one data node, when the data is small enough to be restricted to a single local machine, and when processing is expected to be faster on such smaller datasets residing on the local machine.
The major components of Hive and its interaction with the Hadoop is demonstrated in the figure
below and all the components are described further:
• User Interface (UI) –
  As the name suggests, the user interface provides an interface between the user and Hive. It enables users to submit queries and other operations to the system. The Hive Web UI, the Hive command line, and Hive HDInsight (on Windows Server) are supported by the user interface.
• Hive Server – It is also referred to as the Apache Thrift Server. It accepts requests from different clients and forwards them to the Hive Driver.
• Driver –
  The driver receives the user's queries submitted through the interface. It implements the concept of session handles and provides execute and fetch APIs modelled on JDBC/ODBC interfaces.
• Compiler –
  The compiler parses the query, performs semantic analysis on the different query blocks and query expressions, and eventually generates an execution plan with the help of the table and partition metadata fetched from the metastore.
• Metastore –
  The metastore stores all the structural information of the different tables and partitions in the warehouse, including attributes and attribute-level information, the serializers and deserializers (SerDes) necessary to read and write data, and the corresponding HDFS files where the data is stored. Hive selects corresponding database servers to store the schema or metadata of databases, tables, attributes in a table, data types of databases, and the HDFS mapping.
• Execution Engine –
  The execution engine executes the execution plan created by the compiler. The plan is a DAG of stages. The execution engine manages the dependencies between the various stages of the plan and executes these stages on the appropriate system components.
                                           Features Of Hive
Limitations Of Hive
                             Hive                                             Pig
 1)   Hive is commonly used by data analysts.     1)   Pig is commonly used by programmers.
 2)   It uses SQL-like queries.                   2)   It uses a data-flow language.
 3)   It can handle structured data.              3)   It can handle semi-structured data.
 4)   It works on the server side of an HDFS      4)   It works on the client side of an HDFS
      cluster.                                          cluster.
 5)   Hive is slower than Pig.                    5)   Pig is comparatively faster than Hive.
Commands:
5. To create database
Command: create database retail;
11. Selecting data
hive> select * from emp_sal where id=1;
13. Try using aggregate commands in HQL (try creating tables with GROUP BY fields and execute the aggregate commands).
hive> select AVG(sal) as avg_salary from emp_sal;
 Name: Hirday Rochani                     Experiment No: 7                  Roll No: 2213205
Theory:
A Bloom filter is a space-efficient, probabilistic data structure that is used to test whether an element is a member of a set. It uses a bit array together with multiple hash functions: an element is added by setting the bits at its hashed positions, and a lookup reports either "possibly present" or "definitely not present".
Advantages:
     1. Space-Efficient: It uses much less memory than storing the actual elements.
     2. Time-Efficient: Checking if an element is in the set is very fast (constant time).
     3. No False Negatives: It guarantees that if the Bloom filter says an element is not in the set, it
           definitely isn't.
Disadvantages:
   1. False Positives: There is a small probability that the filter will say an element is in the set when it is not (a standard estimate of this probability is given just below this list).
   2. Not Removable: Once an element is added, you cannot remove it from the Bloom filter
        without reconstructing it from scratch.
   3. Fixed Size: You have to choose the size of the bit array and the number of hash functions at
        the beginning, which makes it less flexible if the dataset grows.
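For the false-positive rate mentioned above, a standard estimate (not stated in the original write-up) is that with k independent hash functions, n inserted elements, and a bit array of size m, the probability of a false positive is approximately (1 - e^(-kn/m))^k; the code below uses k = 2 hash functions.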
Code:
def main():
    # Length of the Bloom filter's bit array
    m = int(input("Enter the length of the stream data (m): "))
    data_stream = [0] * m
    n = int(input("Enter the number of elements to insert (n): "))
    for _ in range(n):
        input_value = int(input("Enter an element to insert: "))
        # Two simple hash functions set one bit each
        hash1 = (input_value % 5) % m
        hash2 = ((2 * input_value + 3) % 5) % m
        data_stream[hash1] = 1
        data_stream[hash2] = 1
    print("Bit array:", data_stream)
    # Query: an element is possibly present only if both of its hashed bits are set
    query = int(input("Enter an element to check: "))
    present = data_stream[(query % 5) % m] and data_stream[((2 * query + 3) % 5) % m]
    print("Possibly present (may be a false positive)" if present else "Definitely not present")

if __name__ == "__main__":
    main()
Output:
 Name: Hirday Rochani                     Experiment No: 8                    Roll No: 2213205
Theory:
The Flajolet-Martin algorithm is a probabilistic algorithm that is mainly used to count the number of unique elements in a stream or database. It was invented by Philippe Flajolet and G. Nigel Martin in 1983 and has since been used in various applications such as data mining and database management.
The basic idea behind the Flajolet-Martin algorithm is to use a hash function to map each element of the dataset to a binary string, and to use the length of the longest run of trailing zeros observed among these binary strings as an estimator of the number of unique elements.
The steps of the Flajolet-Martin algorithm are:
• First, choose a hash function that maps the elements in the dataset to fixed-length binary strings. The length of the binary string can be chosen based on the desired accuracy.
• Apply the hash function to each data item in the dataset to get its binary string representation.
• Determine the number of trailing zeros (the run of zeros at the right end, up to the first 1) in each binary string.
• Compute the maximum number of trailing zeros, R, over all binary strings.
• Estimate the number of distinct elements in the dataset as 2 raised to the power R (a small worked example follows below).
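For example, if the hashed binary strings are 100, 110, and 011, the trailing-zero counts are 2, 1, and 0 respectively; the maximum is R = 2, so the estimated number of distinct elements is 2^2 = 4.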
The accuracy of the Flajolet-Martin algorithm is determined by the length of the binary strings and the number of hash functions it uses. Generally, increasing the length of the binary strings or using more hash functions can increase the algorithm's accuracy.
The Flajolet-Martin algorithm is especially useful for big datasets that cannot be kept in memory or analyzed with conventional methods. By using sound probabilistic techniques, it can provide a good estimate of the number of unique elements in the dataset with very little computation.
   1. Space Efficiency:
      The algorithm requires very little memory compared to keeping track of all distinct elements. It uses hash functions and bit patterns, allowing it to estimate the number of distinct elements with logarithmic space complexity, i.e., O(log n), where n is the number of distinct elements.
   2. Streaming-Friendly:
      FM is designed for streaming data and can process each element in constant time O(1). It doesn't require storing the data stream itself, making it suitable for scenarios where elements arrive at high velocity.
   3. Scalability:
      The FM algorithm scales well with large data volumes because of its low memory and time complexity. It is ideal for use in big data applications like distributed systems.
   4. Simplicity:
      It is relatively simple to implement using hash functions and bit manipulation. This makes it practical for use in systems with resource constraints.
   5. Randomized but Accurate:
      Even though the algorithm is probabilistic, it provides a good approximation of the cardinality with high accuracy, especially when multiple independent estimations (averaging or merging techniques) are combined.
Code:
def hash(x):
    return (6 * x + 1) % 5
def to_three_bit_binary(num):
    binary = bin(num)[2:] # Convert to binary and remove '0b' prefix
    return binary.zfill(3) # Pad with leading zeros to ensure 3 bits
def count_trailing_zeros(arr):
    result = []
    for binary in arr:
        count = 0
        encountered_one = False
        for j in range(len(binary) - 1, -1, -1):
            if binary[j] == '0' and not encountered_one:
                count += 1
            elif binary[j] == '1':
                encountered_one = True
        if not encountered_one:
            count = 0
        result.append(count)
    return result
def main():
    size = int(input("Enter the size of the array: "))
    input_array = []
    for _ in range(size):
        input_array.append(int(input("Enter an element: ")))
    # Hash each element and convert the hash value to a 3-bit binary string
    binaries = [to_three_bit_binary(hash(x)) for x in input_array]
    trailing_zeros_array = count_trailing_zeros(binaries)
    max_trailing_zeros = max(trailing_zeros_array)
    print("Estimated number of distinct elements:", 2 ** max_trailing_zeros)
if __name__ == "__main__":
    main()
Output:
 Name: Hirday Rochani                     Experiment No: 9                   Roll No: 2213205
Theory:
• Scatter plots: These visuals are beneficial in revealing the relationship between two variables, and they are commonly used within regression data analysis. However, they can sometimes be confused with bubble charts, which are used to visualize three variables via the x-axis, the y-axis, and the size of the bubble.
• Heat maps: These graphical representations are helpful in visualizing behavioral data by location. This can be a location on a map, or even a webpage.
• Tree maps: These display hierarchical data as a set of nested shapes, typically rectangles. Treemaps are great for comparing the proportions between categories via their area size.
Advantages:
   1. Enhanced Understanding: Data visualization helps in simplifying complex data sets,
       making it easier to understand patterns, trends, and insights that might be difficult to grasp
       from raw data alone.
   2. Quick Insights: Visual representations such as charts, graphs, and maps allow users to
       quickly grasp key information and make faster decisions based on visual summaries.
Disadvantages:
  1. Misleading Representations: Poorly designed visualizations can mislead or confuse
     viewers. For example, improper scaling or misleading axis labels can distort the data’s true
     meaning.
  2. Information Overload: Overly complex or cluttered visualizations can overwhelm viewers
     with too much information, making it hard to discern key points or insights.
  3. Dependency on Design Skills: Effective data visualization requires a good understanding
     of design principles. Without proper design, visualizations may fail to communicate the
     intended message or may be aesthetically unappealing.
  4. Loss of Detail: Simplifying data for visualization might lead to the loss of nuance and
     detailed information, which could be important for comprehensive analysis.
  5. Accessibility Issues: Not all visualizations are accessible to individuals with visual
     impairments or other disabilities. Ensuring that visualizations are inclusive can be a
     challenge.
   6. Technical Requirements: Creating high-quality visualizations often requires specialized tools and software, and a certain level of technical expertise, which might not be available to all users.
Code:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
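# NOTE: the cell that defined `data` is not shown in the original report.
# The dictionary below is a small placeholder with the same columns
# (Class, Alcohol, Malic_Acid) so that the script can run end to end;
# the values are illustrative, not the actual wine dataset.
data = {
    'Class': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'Alcohol': [13.2, 13.9, 14.1, 12.3, 12.8, 12.5, 13.6, 13.1, 13.4],
    'Malic_Acid': [1.8, 2.1, 1.6, 1.9, 2.4, 2.2, 3.1, 2.9, 3.3]
}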
# Create DataFrame
wine_data = pd.DataFrame(data)
# Save to CSV
wine_data.to_csv('wine_data.csv', index=False)
# Boxplot
plt.figure(figsize=(10, 6))
sns.boxplot(x='Class', y='Alcohol', data=wine_data)
plt.title('Boxplot of Alcohol Content by Wine Class')
plt.xlabel('Wine Class')
plt.ylabel('Alcohol Content')
plt.savefig('boxplot.png')
plt.show()
# Histogram
plt.figure(figsize=(10, 6))
sns.histplot(wine_data['Alcohol'], bins=10, color='blue', edgecolor='black')
plt.title('Histogram of Alcohol Content')
plt.xlabel('Alcohol Content')
plt.ylabel('Frequency')
plt.savefig('histogram.png')
plt.show()
# Scatter Plot (reconstructed: the original scatter-plot call was not visible in the report)
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Alcohol', y='Malic_Acid', hue='Class', data=wine_data)
plt.title('Scatter Plot of Malic Acid vs Alcohol Content by Wine Class')
plt.xlabel('Alcohol Content')
plt.ylabel('Malic Acid')
plt.savefig('scatter_line_plot.png')
plt.show()
# Line Plot
# Using a combination of 'Alcohol' and 'Malic_Acid' for the line plot
plt.figure(figsize=(10, 6))
sns.lineplot(x='Alcohol', y='Malic_Acid', hue='Class', data=wine_data,
marker='o')
plt.title('Line Plot of Malic Acid vs Alcohol Content by Wine Class')
plt.xlabel('Alcohol Content')
plt.ylabel('Malic Acid')
plt.savefig('line_plot.png')
plt.show()
df = pd.DataFrame({
    'time': range(1, 11),
    'value': [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
})
plt.figure(figsize=(10, 6))
plt.plot(df['time'], df['value'], marker='o', color='purple')
plt.title('Line Chart Example')
plt.xlabel('Time')
plt.ylabel('Value')
plt.savefig('line_chart.png')
plt.show()
Outputs:
                                             group-46-bda
Collecting pyspark
  Downloading pyspark-3.5.3.tar.gz (317.3 MB) 317.3/317.3 MB 1.5 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Requirement already satisfied: py4j==0.10.9.7 in /usr/local/lib/python3.10/dist-packages (from pyspark) (0.10.9.7)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.3-py2.py3-none-any.whl size=317840625 sha256=2c7af37129d8a15a6d7ceadb2f983ddd365df68fd2264788beb038d700727db9
  Stored in directory: /root/.cache/pip/wheels/1b/3a/92/28b93e2fbfdbb07509ca4d6f50c5e407f48dce4ddbda69a4ab
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.3
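The notebook cell that created the Spark session and defined file_path is not reproduced in this report; a minimal sketch of what it presumably contained (the application name and CSV path are assumptions):

from pyspark.sql import SparkSession

# Assumed setup cell: create the Spark session and point to the rainfall dataset
spark = SparkSession.builder.appName("RainfallAnalytics").getOrCreate()
file_path = "rainfall_data.csv"   # placeholder; the actual path is not shown in the report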
data = spark.read.option("header", True).csv(file_path)
root
 |-- Year: string (nullable = true)
 |-- Jan: string (nullable = true)
 |-- Feb: string (nullable = true)
 |-- Mar: string (nullable = true)
 |-- April: string (nullable = true)
 |-- May: string (nullable = true)
 |-- June: string (nullable = true)
 |-- July: string (nullable = true)
 |-- Aug: string (nullable = true)
 |-- Sept: string (nullable = true)
 |-- Oct: string (nullable = true)
 |-- Nov: string (nullable = true)
 |-- Dec: string (nullable = true)
 |-- Total: string (nullable = true)
+----+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
|Year|        Jan|        Feb|        Mar|      April|        May|       June|       July|        Aug|       Sept|        Oct|        Nov|        Dec|      Total|
+----+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
|1901|13.11660194|          0|          0|3.949669123|17.13979103|640.7140364|888.3696921|545.0457959|64.27151334|9.871696144|          0|          0|2182.478796|
|1902|          0|          0|          0|          0|0.355000585|247.9987823|408.4337298|566.5958631|688.9134546|28.65409204|0.488864213|19.52654728|1960.966334|
|1903|          0|          0|0.844034374|          0|220.5687404|370.8490478|902.4478963|602.4208281| 264.589816|157.8928768|          0|          0| 2519.61324|
|1904|          0|          0|11.38176918|          0|          0| 723.081969|390.8867992|191.5819273|85.70475449|38.67994848|          0|          0|1441.317168|
|1905|0.662560582|1.713451862|          0|          0|          0|123.8708922|581.8279747|167.3821495|172.2977226|7.365923628|24.90357515|          0| 1080.02425|
+----+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
only showing top 5 rows
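The cells that cast the string columns to numbers, derive the RainfallCategory label, assemble the feature vector, and fit the linear regression model are not visible in the report. A plausible reconstruction, with illustrative category thresholds and split ratio, is:

from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Assumed preprocessing: cast columns to doubles and derive a coarse rainfall category
months = ["Jan", "Feb", "Mar", "April", "May", "June", "July", "Aug", "Sept", "Oct", "Nov", "Dec"]
for c in months + ["Total"]:
    data = data.withColumn(c, F.col(c).cast("double"))
data = data.withColumn("RainfallCategory",
                       F.when(F.col("Total") < 1500, 0.0)
                        .when(F.col("Total") < 2500, 1.0)
                        .otherwise(2.0))  # thresholds are illustrative assumptions

# Assemble the monthly columns into a feature vector and split the data
assembler = VectorAssembler(inputCols=months, outputCol="features", handleInvalid="skip")
dataset = assembler.transform(data)
train_data, test_data = dataset.randomSplit([0.8, 0.2], seed=42)

# Fit a linear regression on the annual Total and build the RMSE evaluator used below
lr = LinearRegression(featuresCol="features", labelCol="Total")
lr_model = lr.fit(train_data)
predictions = lr_model.transform(test_data)
evaluator = RegressionEvaluator(labelCol="Total", predictionCol="prediction", metricName="rmse")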
     rmse = evaluator.evaluate(predictions)
     print(f"RMSE for Linear Regression: {rmse}")
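The accuracy figures below rely on classifier imports and a classification evaluator whose defining cell is not shown; presumably something along these lines:

from pyspark.ml.classification import RandomForestClassifier, NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Assumed evaluator for the classification models below
evaluator = MulticlassClassificationEvaluator(labelCol="RainfallCategory",
                                              predictionCol="prediction",
                                              metricName="accuracy")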
     # Random Forest
     rf = RandomForestClassifier(featuresCol="features", labelCol="RainfallCategory")
     rf_model = rf.fit(train_data)
     rf_predictions = rf_model.transform(test_data)
     accuracy = evaluator.evaluate(rf_predictions)
     print(f"Accuracy for Random Forest: {accuracy}")
cv_model = crossval.fit(train_data)
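The nb_evaluator used in the next cell is likewise not defined in the shown cells; presumably another accuracy evaluator:

nb_evaluator = MulticlassClassificationEvaluator(labelCol="RainfallCategory",
                                                 predictionCol="prediction",
                                                 metricName="accuracy")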
nb = NaiveBayes(featuresCol="features", labelCol="RainfallCategory", modelType="multinomial")
nb_model = nb.fit(train_data)
nb_predictions = nb_model.transform(test_data)
nb_accuracy = nb_evaluator.evaluate(nb_predictions)
print(f"Accuracy for Naive Bayes: {nb_accuracy}")
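The paramGrid_nb referenced below is not shown; a minimal assumed grid over the Naive Bayes smoothing parameter:

# Assumed grid for Naive Bayes (smoothing values are illustrative)
paramGrid_nb = ParamGridBuilder().addGrid(nb.smoothing, [0.5, 1.0, 2.0]).build()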
crossval_nb = CrossValidator(estimator=nb, estimatorParamMaps=paramGrid_nb, evaluator=nb_evaluator, numFolds=5)
cv_nb_model = crossval_nb.fit(train_data)
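The plotting code below filters None entries out of lists that are not defined in the shown cells; a plausible reconstruction of that bookkeeping (names chosen to match the plotting code) is:

import matplotlib.pyplot as plt

# Assumed results-collection cell: gather the metrics computed above so the
# plotting code can filter out the entries that do not apply to an algorithm.
algorithms = ["Linear Regression", "Random Forest", "Naive Bayes"]
rmse_values = [rmse, None, None]
accuracy_values = [None, accuracy, nb_accuracy]
cv_accuracy_values = [None,
                      evaluator.evaluate(cv_model.transform(test_data)),
                      nb_evaluator.evaluate(cv_nb_model.transform(test_data))]

rmse_algorithms = [algorithms[i] for i in range(len(rmse_values)) if rmse_values[i] is not None]
rmse_values_filtered = [v for v in rmse_values if v is not None]
accuracy_values_filtered = [v for v in accuracy_values if v is not None]
cv_accuracy_algorithms = [algorithms[i] for i in range(len(cv_accuracy_values)) if cv_accuracy_values[i] is not None]
cv_accuracy_values_filtered = [v for v in cv_accuracy_values if v is not None]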
     plt.figure(figsize=(10, 6))
     plt.bar(rmse_algorithms, rmse_values_filtered, color='blue', label='RMSE')
     plt.xlabel('Algorithms')
     plt.ylabel('RMSE')
     plt.title('RMSE Comparison for Linear Regression')
     plt.legend()
     plt.show()
# Plotting accuracy for Random Forest and Naive Bayes (ignoring None values)
accuracy_algorithms = [algorithms[i] for i in range(len(accuracy_values)) if accuracy_values[i] is not None]
plt.figure(figsize=(10, 6))
plt.bar(accuracy_algorithms, accuracy_values_filtered, color='green', label='Accuracy')
     plt.xlabel('Algorithms')
     plt.ylabel('Accuracy')
     plt.title('Accuracy Comparison for Random Forest and Naive Bayes')
     plt.legend()
     plt.show()
# Plotting cross-validated accuracy for Random Forest and Naive Bayes (ignoring None values)
plt.figure(figsize=(10, 6))
plt.bar(cv_accuracy_algorithms, cv_accuracy_values_filtered, color='red', label='Cross-Validated Accuracy')
plt.xlabel('Algorithms')
plt.ylabel('Cross-Validated Accuracy')
plt.title('Cross-Validated Accuracy Comparison for Random Forest and Naive Bayes')
plt.legend()
plt.show()
                                                                  Thadomal Shahani Engineering College, Bandra, Mumbai
                                                                                   Department of Computer Engineering
                                                                             Big Data Analytics ( Mini Project) SEM VII
AY: 2024-25
Experiment 10
(Mini Project)
CHAPTER 1: INTRODUCTION
In today’s data-driven world, analyzing and predicting rainfall patterns is essential for
effective resource management and planning. To achieve this, our project, "Rainfall Analytics
and Prediction using PySpark," leverages the power of Apache Spark, a robust and scalable
data processing platform.
Spark plays a central role in our system by enabling efficient handling of large datasets. It
allows us to process extensive historical rainfall data with ease, performing complex
transformations and analyses in parallel. With Spark's distributed computing capabilities, we
can handle these data operations at scale, making predictions faster and more accurate.
To enhance the analytical power of our system, we utilize Spark's machine learning libraries
to develop predictive models. These models can analyze past trends in rainfall and generate
accurate predictions for future patterns. By integrating these predictive models with real-time
data, we ensure that our rainfall forecasting remains both relevant and reliable.
This approach not only improves the accuracy of our rainfall predictions but also enhances
decision-making processes for industries dependent on weather patterns, leading to better
preparedness and resource allocation.
1. Year: The year to which the rainfall data corresponds.
2. Monthly Data (Jan, Feb, Mar, etc.): Rainfall data for each month.
This dataset forms the basis for analyzing trends and seasonal variations in rainfall. By leveraging
PySpark's distributed computing capabilities, the system efficiently processes and analyzes this data to
derive insights, helping to make accurate predictions about future rainfall patterns.
• Data Sources:
   o Weather     APIs: Data is fetched from APIs that provide real-time weather updates and
     historical rainfall data.
   o Public   Datasets: Open-source datasets that contain historical rainfall data can be used for
     analysis and model training.
• Data Processing:
   o The   collected data is processed using Apache Spark, which allows for distributed data
      processing. This enables the system to handle large volumes of data efficiently.
  o Data   cleaning and transformation steps are applied to ensure that the dataset is ready for
      analysis. This includes handling missing values, normalizing data formats, and
      aggregating data as necessary.
1. Data Ingestion: Load rainfall data into Spark DataFrames for efficient processing.
2. Data Cleaning: Handle missing values and remove outliers to ensure data quality.
  4. Exploratory Data Analysis (EDA): Visualize relationships and trends in the data to
       guide model selection.
  5. Model Training: Split data into training and testing sets, applying machine learning
       algorithms to build and optimize rainfall prediction models.
  6. Real-Time Predictions: (If applicable) Update models with new data periodically for
       ongoing accuracy.
Using Spark’s distributed computing power, the system processes large datasets efficiently,
enabling timely rainfall predictions.
  •    Annual Rainfall Trends: Examining total rainfall from 1901 to 2021 to identify long-
       term fluctuations.
  •    Seasonal Variability: Analyzing monthly data to determine peak rainfall months (e.g.,
       June and July) and dry periods for better planning.
  •   Monthly Averages: Calculating average monthly rainfall to establish typical patterns
      and highlight significant rainfall months.
This analysis enhances rainfall predictions, aiding in resource management and preparedness.
• Missing or Corrupt Data: Checking for entries with incomplete or invalid values.
• Consistency: Ensuring uniform formatting across the dataset (e.g., date formats).
By validating the data, we guarantee that our analyses and predictions are based on accurate
and reliable information. This foundational step is crucial for generating meaningful insights
and enhancing the overall effectiveness of the Rainfall Analytics and Prediction system.
Conclusion:
The Rainfall Analytics and Prediction system developed using PySpark effectively showcases
how data processing can enhance understanding and forecasting of rainfall patterns. By
leveraging PySpark for distributed data processing, we have established a scalable solution
capable of handling large datasets.
The system is built on principles of data analysis and predictive modeling, allowing for
accurate forecasts based on historical rainfall data. This approach ensures that stakeholders
receive timely insights, which can lead to informed decision-making and improved planning.
  •   Accurate Forecasting: The system utilizes historical data to predict future rainfall,
      providing reliable insights for planning and resource allocation.
  •   Efficient Data Processing: The streamlined workflow from data ingestion to analysis
      minimizes processing time, which is crucial for timely predictions.
In summary, the integration of PySpark has provided an effective solution to the challenges of
rainfall analytics and prediction. This solution is adaptable and can be employed across
various domains to enhance the understanding and management of rainfall-related phenomena.
Future Scope:
While the current implementation successfully provides rainfall analytics and predictions,
several areas for enhancement and expansion could further improve system performance and
user experience:
          o   These models could leverage additional data points, such as satellite imagery,
              meteorological data, and historical climate patterns, to provide more robust
              forecasts.
  5. Support for Multilingual and Regional Adaptations:
          o   Exploring integration with big data technologies like Apache Flink or Google
              BigQuery could provide more flexible data processing options, especially for
              real-time analytics and batch processing.
          o   Establishing a real-time feedback loop that incorporates user feedback and actual
              rainfall observations would enhance the model's adaptability, ensuring more
              accurate predictions over time.
In conclusion, this project serves as a solid foundation for building scalable rainfall analytics
and prediction systems. As technology evolves and data becomes more complex, the
enhancements outlined above will ensure that the system remains efficient, responsive, and
competitive in the field of climate data analytics. With these potential developments, the
system can transition from a basic predictive tool to an advanced platform capable of
delivering high-quality insights and fostering proactive decision-making for stakeholders.