Big Data Analytics (BDA) - Unit 2

The document provides an overview of the MapReduce framework, detailing its components, execution stages, and optimization techniques. It explains how MapReduce processes data through mapping and reducing tasks, emphasizing its scalability and fault tolerance. Additionally, it discusses HBase as a complementary technology for real-time data access in the Hadoop ecosystem.
Unit 2 syllabus. Understanding MapReduce Fundamentals and HBase: The MapReduce Framework, Techniques to Optimize MapReduce Jobs, Role of HBase in Big Data Processing, Exploring the Big Data Stack, Virtualization and Big Data, Virtualization Approaches. Storing Data in Databases and Data Warehouses: RDBMS and Big Data, Non-Relational Database, Integrating Big Data with Traditional Data Warehouses, Big Data Analysis and Data Warehouse, Changing Deployment Models in the Big Data Era. Processing Your Data with MapReduce: Developing a Simple MapReduce Application, Points to Consider while Designing MapReduce. Customizing MapReduce Execution: Controlling MapReduce Execution with InputFormat, Reading Data with a Custom RecordReader, Organizing Output Data with OutputFormats, Customizing Data with RecordWriter, Optimizing MapReduce Execution with Combiner, Implementing a MapReduce Program for Sorting Text Data.

2.1 UNDERSTANDING MAPREDUCE FUNDAMENTALS AND HBASE

2.1.1 The MapReduce Framework

Q1. What is MapReduce? Explain its components and features.

Ans:
MapReduce is a processing technique and a programming model for distributed computing, based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. The map task takes a set of data and converts it into another set of data in which individual elements are broken down into tuples (key/value pairs). The reduce task takes the output from a map as its input and combines those data tuples into a smaller set of tuples. As the name MapReduce implies, the reduce task is always performed after the map task.

The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial, but once an application is written in the MapReduce form, scaling it to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to the MapReduce model.

The Algorithm
» The MapReduce paradigm is generally based on sending the computation to where the data resides.
» A MapReduce program executes in three stages: the map stage, the shuffle stage, and the reduce stage.
» Map stage: The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop Distributed File System (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
» Reduce stage: This stage is the combination of the shuffle stage and the reduce stage. The reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
» During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster. The framework manages all the details of data passing, such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes.
» Most of the computing takes place on nodes with the data on local disks, which reduces network traffic.
» After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result and sends it back to the Hadoop server.
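To make the three stages concrete, the following self-contained Java sketch simulates the map, shuffle and reduce steps in memory for a word count. It is only an illustration of the data flow described above, not Hadoop code; the real Hadoop classes for the same job are developed later in this unit (see Q15), and the sample sentences here are invented.

import java.util.*;

// In-memory illustration of the map -> shuffle -> reduce data flow.
public class MapReduceFlowDemo {
    public static void main(String[] args) {
        List<String> input = Arrays.asList("an elephant is an animal", "an apple is a fruit");

        // Map stage: emit (word, 1) for every word of every input line.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : input) {
            for (String word : line.split("\\s+")) {
                mapped.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }

        // Shuffle stage: group the intermediate values by key (sorted, as Hadoop does).
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapped) {
            shuffled.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Reduce stage: combine the grouped values into one result per key.
        for (Map.Entry<String, List<Integer>> entry : shuffled.entrySet()) {
            int sum = entry.getValue().stream().mapToInt(Integer::intValue).sum();
            System.out.println(entry.getKey() + "\t" + sum);
        }
    }
}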
1. Scheduling
» As of today, Hadoop offers various schedulers for placing map and reduce tasks.
» Dealing with stragglers: job execution time depends on the slowest map and reduce tasks. Speculative execution can help with slow machines, but data locality may be at stake.
» Dealing with skew in the distribution of values (for example, temperature readings from sensors): in this case scheduling cannot help; customized partitioning and sampling can be used to solve such issues.

2. Data/code co-location
» The effectiveness of the data processing mechanism depends on the location of the code and of the data required for the code to execute.
» The best result is obtained when both code and data reside on the same machine.

3. Synchronization
» In MapReduce, the execution of several processes requires synchronization.
» Synchronization is achieved by the "shuffle and sort" mechanism.
» The framework tracks all the tasks along with their timings and starts the reduce phase only after the completion of mapping.
» The shuffle and sort mechanism collects the map output and prepares it for reduction.

4. Handling errors and faults
» MapReduce generally provides a high level of fault tolerance and robustness in handling errors.
» Using quite simple mechanisms, the MapReduce framework deals with:
  » hardware failures: individual machines (disks, RAM), networking equipment, power/cooling;
  » software failures: exceptions, bugs;
  » corrupt and/or invalid input data.
» Moreover, the MapReduce engine is designed to find out which tasks are incomplete and eventually assign them to different nodes.

Input splits and where task output goes
» It is always beneficial to have multiple splits, because the time taken to process a split is small compared with the time taken to process the whole input. When the splits are smaller, the processing is better load-balanced, since the splits are processed in parallel.
» However, it is also not desirable to have splits that are too small. When splits are too small, the overhead of managing the splits and of map task creation begins to dominate the total job execution time.
» For most jobs, it is better to make the split size equal to the size of an HDFS block (which is 64 MB by default).
» Execution of a map task writes its output to a local disk on the respective node, not to HDFS.
» The reason for choosing the local disk over HDFS is to avoid the replication that takes place on an HDFS store operation.
» Map output is intermediate output; it is processed by reduce tasks to produce the final output.
» Once the job is complete, the map output can be thrown away, so storing it in HDFS with replication would be overkill.
» In the event of a node failure before the map output is consumed by the reduce task, Hadoop reruns the map task on another node and re-creates the map output.
» Reduce tasks do not work on the concept of data locality. The output of every map task is fed to the reduce task, so map output is transferred to the machine where the reduce task is running. On that machine the output is merged and then passed to the user-defined reduce function.
» Unlike the map output, the reduce output is stored in HDFS (the first replica is stored on the local node and the other replicas are stored on off-rack nodes), so writing the reduce output does consume network bandwidth.

Q2. How does MapReduce work? Explain how MapReduce organizes work.
Ans:
Working of MapReduce
Hadoop divides the job into tasks. There are two types of tasks:
1. Map tasks (splits and mapping)
2. Reduce tasks (shuffling and reducing)
» One map task is created for each split, and it then executes the map function for each record in the split, as mentioned above.

The complete execution process (execution of both Map and Reduce tasks) is controlled by two types of entities:
1. JobTracker: acts like a master (responsible for the complete execution of the submitted job).
2. Multiple TaskTrackers: act like slaves, each of them performing part of the job.
For every job submitted for execution in the system, there is one JobTracker that resides on the NameNode, and there are multiple TaskTrackers which reside on the DataNodes.

Fig.: One JobTracker on the NameNode controlling TaskTrackers on the DataNodes

» In addition, each TaskTracker periodically sends a "heartbeat" signal to the JobTracker so as to notify it of the current state of the system.
» Thus the JobTracker keeps track of the overall progress of each job. In the event of a task failure, the JobTracker can reschedule it on a different TaskTracker.

Q3. Explore the Map and Reduce functions.

Ans:
Exploring the Map and Reduce functions
The map function has been a part of many functional programming languages for years. To further your understanding of why the map function (and the reduce function as well) is a good choice for big data, it is important to understand a little bit about functional programming. Operators in functional languages do not modify the structure of the data; they create new data structures as their output. More importantly, the original data itself remains unmodified.

The following is a way to construct a solution to a problem in this style. One way is to identify the input data and create a list:
mylist = ("all counties in the US that participated in the most recent general election")
Create the function howManyPeople using the map function. This selects only the counties with more than 50,000 people:
map howManyPeople (mylist) = [ howManyPeople "county 1"; howManyPeople "county 2"; howManyPeople "county 3"; howManyPeople "county 4"; ... ]
Now produce a new output list of all the counties with populations greater than 50,000:
(no, county 1; yes, county 2; no, county 3; yes, county 4; ?, county nnn)

Adding the reduce function
Like the map function, reduce has been a feature of functional programming languages for many years. The reduce function takes the output of a map function and "reduces" the list in whatever fashion the programmer desires. The first step that the reduce function requires is to place a value in something called an accumulator, which holds an initial value.

Revisit the map function example now to see what the reduce function is capable of doing. Suppose that you need to identify the counties where the majority of the votes were for the Democratic candidate. Remember that the howManyPeople map function looked at each element of the input list and created an output list of the counties with more than 50,000 people (yes) and the counties with fewer than 50,000 people (no). After invoking the howManyPeople map function, you are left with the following output list:
(no, county 1; yes, county 2; no, county 3; yes, county 4; ?, county nnn)
This is now the input for your reduce function.
Here is what it looks like:
countylist = (no, county 1; yes, county 2; no, county 3; yes, county 4; ?, county nnn)
reduce isDemocrat (countylist)

2.1.2 Techniques to Optimize MapReduce Jobs

Q4. Explain various techniques to optimize MapReduce jobs.

Ans:
An analysis of MapReduce program execution shows that it involves a series of steps, and every step has its own set of resource requirements. In the case of short jobs, the ones for which the user needs a quick response to queries, the response time becomes more important; hence the MapReduce job needs to be optimized. Encountering a bottleneck even for a single resource will slow down the execution process. We can organize the MapReduce optimization techniques into the following categories:
» Hardware or network topology
» Synchronization
» File system

Hardware/network topology
A distinct advantage of MapReduce is the capability to run on inexpensive clusters of commodity hardware and standard networks. If you don't pay attention to where your servers are physically organized, you won't get the best performance and the high degree of fault tolerance necessary to support big data tasks. Commodity hardware is often stored in racks in the data center, and the proximity of the hardware within a rack offers a performance advantage.

File system
Distributed file systems that support MapReduce typically use a master-slave style of distribution, where the master node stores all the metadata: access rights, the mapping and location of files and blocks, and so on. The slaves are the nodes where the actual data is stored. All requests go to the master and are then handled by the appropriate slave node. As you contemplate the design of the file system that supports your MapReduce implementation, you should consider the following:
» Keep it warm: As you might expect, the master node could get overworked because everything begins there. Additionally, if the master node fails, the entire file system is inaccessible until the master is restored. A very important optimization is to create a "warm standby" master node that can jump into service if a problem occurs with the online master.
» The bigger the better: File size is also an important consideration. Lots of small files (less than 100 MB) should be avoided. Distributed file systems supporting MapReduce engines work best when they are populated with a modest number of large files.

2.1.3 Role of HBase in Big Data Processing

Q5. What is HBase? Explain its features and where it is used.

Ans:
HBase is a distributed database built on top of the Hadoop file system. It is an open-source project and is horizontally scalable. HBase is a data model, similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop Distributed File System (HDFS). It is a part of the Hadoop ecosystem that provides random, real-time read/write access to data in the Hadoop file system. One can store data in HDFS either directly or through HBase; data consumers read and access data in HDFS randomly using HBase. HBase sits on top of the Hadoop file system and provides read and write access.

Features of HBase
» HBase is linearly scalable.
» It has automatic failure support.
» It provides consistent reads and writes.
» It integrates with Hadoop, both as a source and as a destination.
» It has an easy Java API for the client.
» It provides data replication across clusters.

Where to use HBase
» Apache HBase is used to have random, real-time read/write access to big data.
» It hosts very large tables on top of clusters of commodity hardware.
» Apache HBase is a non-relational database modeled after Google's Bigtable. Just as Bigtable works on top of the Google File System, Apache HBase works on top of Hadoop and HDFS.

Applications of HBase
» It is used for write-heavy applications.
» HBase is used whenever we need to provide fast random access to the available data.
» Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.

Q6. How is HBase installed in its various modes? Discuss.

Ans:
HBase - Installation
HBase can be installed in three modes. The features of these modes are mentioned below.
1. Standalone mode installation (no dependency on a Hadoop system)
» This is the default mode of HBase.
» It runs against the local file system.
» It does not use Hadoop HDFS.
» Only the HMaster daemon runs.
» It is not recommended for production environments.
» It runs in a single JVM.
2. Pseudo-distributed mode installation (single-node Hadoop environment plus HBase installation)
» All daemons run on a single node.
3. Fully distributed mode installation (multi-node Hadoop environment plus HBase installation)
» It runs on Hadoop HDFS.
» All daemons run across all nodes present in the cluster.
» It is highly recommended for production environments.

HBase - Standalone mode installation
The installation below is performed on Ubuntu with Hadoop already installed.
Step 1) Place the HBase release archive (hbase-1.1.1-bin.tar.gz) in /home/hduser.
Step 2) Untar it; this creates the directory hbase-1.1.1 in /home/hduser.
Step 3) Change into hbase-1.1.1/conf, open hbase-env.sh and set the JAVA_HOME path, for example:
    # The java implementation to use. Java 1.7+ required.
    export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/jre
Step 4) Open ~/.bashrc and mention the HBASE_HOME path, as shown below:
    export HBASE_HOME=/home/hduser/hbase-1.1.1
    export PATH=$PATH:$HBASE_HOME/bin
Step 5) Open hbase-site.xml and place the following properties inside the file:
    <property>
      <name>hbase.rootdir</name>
      <value>file:///home/hduser/HBASE/hbase</value>
    </property>
    <property>
      <name>hbase.zookeeper.property.dataDir</name>
      <value>/home/hduser/HBASE/zookeeper</value>
    </property>
Step 6) Open the /etc/hosts file and check the entries for localhost and the machine's hostname (the original shows this file being edited in nano).
Step 7) Now run start-hbase.sh from the hbase-1.1.1/bin location. We can then check with the jps command whether the HMaster daemon is running or not.
Step 8) The HBase shell can be started by using the "hbase shell" command, which enters an interactive shell mode.
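Besides the interactive shell, applications normally talk to HBase through the Java client API mentioned in the feature list above. The following is a hedged sketch: the table name "student", the column family "info" and the column "name" are invented for the example, the table is assumed to already exist, and the calls assume the HBase 1.x client API.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws IOException {
        // Reads hbase-site.xml from the classpath to locate ZooKeeper and HBase.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("student"))) {

            // Write one cell: row key "row1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ravi"));
            table.put(put);

            // Random read of the same row: the real-time access HBase is recommended for.
            Get get = new Get(Bytes.toBytes("row1"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(value));
        }
    }
}

Such random single-row reads and writes are exactly the "random, real-time read/write access" use case described above.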
Once the shell is entered, we can perform all types of commands. The standalone mode does not require the Hadoop daemons to be started; HBase can run independently.

HBase - Pseudo-distributed mode installation
This is another method of HBase installation, known as pseudo-distributed mode. Below are the steps to install HBase through this method.
Step 1) Place hbase-1.1.2-bin.tar.gz in /home/hduser.
Step 2) Unzip it by executing the command tar -xvf hbase-1.1.2-bin.tar.gz. It will unzip the contents and create hbase-1.1.2 in the location /home/hduser.
Step 3) Open hbase-env.sh and mention the JAVA_HOME path and the region servers' path, and export the settings, for example:
    export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/jre
    export HBASE_REGIONSERVERS=${HBASE_HOME}/conf/regionservers
    export HBASE_MANAGES_ZK=true
Step 4) Open ~/.bashrc, mention the HBASE_HOME path and add it to PATH:
    export HBASE_HOME=/home/hduser/hbase-1.1.2
    export PATH=$PATH:$HBASE_HOME/bin
Step 5) Open hbase-site.xml and mention the properties below:
    <property>
      <name>hbase.rootdir</name>
      <value>hdfs://localhost:9000/hbase</value>
    </property>
    <property>
      <name>hbase.cluster.distributed</name>
      <value>true</value>
    </property>
    <property>
      <name>hbase.zookeeper.quorum</name>
      <value>localhost</value>
    </property>
    <property>
      <name>hbase.zookeeper.property.clientPort</name>
      <value>2181</value>
    </property>
    <property>
      <name>hbase.zookeeper.property.dataDir</name>
      <value>/home/hduser/hbase/zookeeper</value>
    </property>
Notes on these properties:
1. hbase.rootdir sets up the HBase root directory, here on HDFS.
2. hbase.cluster.distributed must be set to true for a distributed (or pseudo-distributed) set-up.
3. hbase.zookeeper.quorum is where the ZooKeeper quorum is specified.
4. Replication is 1 by default. In fully distributed mode, multiple data nodes are present, so replication can be increased by setting a higher value.
5. hbase.zookeeper.property.clientPort specifies the ZooKeeper client port.
6. hbase.zookeeper.property.dataDir specifies the ZooKeeper data directory.
Step 6) Start the Hadoop daemons first with start-all.sh, and after that start the HBase daemons with start-hbase.sh.

HBase - Fully distributed mode installation
» The installation is the same as in pseudo-distributed mode, but spread across multiple nodes.
» The configuration entries mentioned in hbase-site.xml and hbase-env.sh are the same as in pseudo mode.

2.1.4 Exploring the Big Data Stack

Q7. Explain the big data architecture with a neat diagram.
Ans:
Big data analysis also needs the creation of a model. The architecture of a big data environment must fulfil all the foundational requirements and must be able to perform the following functions. Data is first properly validated and cleaned; after that, the ingestion layer takes over.
Additionally, if real-time insights requis the real-time engines powered by complex event processing (CEP) engines and event-die’ architectures (EDAs) can be utilized. . E= ea Visualization Took : i Teadtional 81 Toots | Baten ots | Wewenauses, ae ota Scoop Gata Lakes ~ _ — i = | pees | Strectured Oata ts Visualisation Conceptual Architecture 2.1.5. Virtualization and Big Data Q8. What is the need of virtualization in big data ? Aue: The essential idea with virtualization is that heterogeneous or distributed systems are’ represented a complex systems through specific interfaces that replace physical hardware or data storage designations with virtual components. These are called virtual machines. Each virtual machine contains a separate copy of the operating system with its own hardware resources. Although virtualization is not very important in big data analytics, the MapReduce works very efficiently in a virtualised environment. Virtual Machine Monitor Peri ‘” Publications UNIT - Il The following is the basic features of virtualization : » Partitioning - multiple applications and operating systems are supported by a single physical system by partitioning the avai psa @ the available » Isolation - each virtual machine runs in an isolated manner from its host physical system and other virtual machines. The benefit ofthis is, if one machine crashes, the other virtual machines and host systems are not effected. ization for optimizing the computing environment. 2.1.6 Virtualisation Approaches Q9. Explain about various virtualization approaches in big data. Aas: In big data environment you can virtualize almost every element like servers, storage, applications, data, network etc. the following are the some form of virtualizations. BIG DATA ANALYTICS > ' Big data server virtualization In server virtualization, one physical server is partitioned into multiple virtual servers. The hardware and resources of a machine — including the random access memory (RAM). CPU, hard drive, and network controller — can be virtualized into a series of virtual machines that each runs its own applications and operating system. A virtual machine (VM) is a software representation of a physical machine that can execute or perform the same fungtions as the : a hysic A.thin. lays » Big data application virtualization Application infrastructure virtualization provides an efficient way to manage applications in context with customer demand. The application is encapsulated in a way that removes its dependencies from the underlying physical computer system. This helps to improve the overall manageability and portability of the application. In addition, the application infrastructure virtualization software typically allows for Rahul Publications MSc 1i Year codifying business and technical usage policies to make sure that each of your applications levera al and physical resources in a predictable way. Efficiencies are gained because you can more easily distribute IT resources according to the relative business.value of your applications. es Application infrastructure virtualization used in Combination with server virtualization can help to ensure that business service-level agreements are met. Server virtualization monitors CPU and memory usage, but does not account for variations in business priority when allocating resources. Big data network virtualization Network virtualization provides an efficient way to use networking as a pool of connection tesources. 
Instead of relying on the physical network for managing traffic, you can create multiple virtual networks all utilizing the same physical implementation. This can be useful if you need to define a network for data gathering with a certain set of performance characteristics and capacity and another network for applications with different performance and capacity. Virtualizing the network helps reduce these bottlenecks and improve the capability to manage the large distributed data required for big data analysis. Big data processor and memory virtualization Processor virtualization helps to optimize the processor and maximize performance, Memory virtualization decouples memory from the servers, In big data analysis, you may have repeated queries of large data sets and the creation of advanced analytic algorithms, all designed to look for pattems and trends that are not yet understood. These advanced analytics can tequire lots of processing power (CPU) and memory (RAM). For some of these computations, it can take a long time without sufficient CPU and memory resources, I Som, Sty » Big data and storage virtualization Data virtualization can be used to cre, platform for dynamic linked data service, x allows data to be easily searched ang line through a unified reference source. As a y, data virtualization provides an abstract sen, that delivers data ina consistent form regarq® of the underlying physical database. In addig, data virtualization exposes cached data to a applications to improve performance, Storage virtualization combines physical resources so that they are more effectively shared, This reduces the cost of storage and makes it easie to manage data stores required for big data analysis 2 Storinc Data in DATABASES AND Data WareHouses 2.2.1 RDBMS and Q10. Explain the relationship between RDBMS and Big Data. Aue : Big data is becoming an important element in the way organizations are leveraging high-volume data at the right speed to solve specific data problems. Relational Database Management Systems are important for this high volume, Big dats does not live in isolation, To be effective, companies often need to be able to combine the results of big data analysis with the data that exists within the business. » BIG DATA BASICS : RDBMS AND PERSISTENT DATA One of the most important services provided by operational databases (also called data stores) is persistence, Persistence guarantees thet the data stored in a database won't be changed without permissions and that it will available as long as it is important to the business. Given this most important requirement, you must then think about what kind of data you want to persist, how can you access and update it, and how can you use it to'make busines decisions, At this most fundamental level, tH choice of your database engines is critical Data unit = your overall success with your big data implementation. Even though the underlying technology has been around for quite some time, many of these systems are in operation today because the businesses they support are highly dependent on the data. To replace them would be akin to changing the engines of an airplane on a transoceanic flight. BIG DATA BASICS: RDBMS AND TABLES Relational databases are built on one or more In companies both small and large, most of their important operational information is probably stored in RDBMSs. Many companies have different RDBMSs for different areas of their business. Transactional data-might be BIG DATA ANALYTICS open source relational database. 
Several factors contribute to the popularity of PostgreSQL. As an RDBMS with support for the SQL standard, it does all the things expected in a database product, plus its longevity and wide usage have made it “battle tested.” It is also available on just about every variety of operating system, from PCs to mainframes. Providing the basics and doing so reliably are only part of the story. PostgreSQL also supports many features only found in expensive proprietary RDBMSs, including the followin; » Indexing methods » Procedural languages This high level of customization makes PostgreSQL desirable when rigid, proprietary products won't get the job done. It is infinitely extensible. stored in one vendor's database, while customer information could be stored in another. > POSTGRESQL, AN OPEN SOURCE Finally, the PostgreSQL license permits RELATIONAL DATABASE modification and distribution in any form, open or closed source. Any modifications can be kept private @ a ee During your big data implementation, you'll | ©. shared with the community as you wish. likely come across PostgreSQL, a widely used, Rahul Publications MSc WYeor 2.2.2 Non-Relational Database Q11.What are called Non-Relational databases ? Anas: Non-telational databases do not rely on the table/key model endemic to RDBMSs. In short, specialty data in the big data world requires specialty persistence and data manipulation techniques. One emerging, popular class of non-relational database is called not only SQL (NoSQL). Originally the originators envisioned databases that did not tequire the relational model and SQL. As these products were introduced into the market, the definition softened a bit and now they are thought of as “not only SQL,” again bowing to the ubiquity of SQL. The other class is databases that do not support the relational model, but rely on SQL as a primary means of manipulating the data within, Even though relational and non-relational databases have similar fundamentals, how the fundamentals are accomplished creates the differentiation. Non-telational database technologies have the following characteristics in common : » Scalability: In this instance,this refers to the capability to write data across multiple data stores simultaneously without regard to physical limitations of the underlying infrastructure. Another important dimension is seamlessness. The databases must be able to expand and contract in response to data flows and do so invisibly to the end users. » Data and Query model: Instead of the row, column, key structure, nonrelational databases use specialty frameworks to store data with a requisite set of specialty query APIs to intelligently access the data. » Persistence design: Persistence is still a critical element in nonrelational databases. Due to the high velocity, variety, and volume of big data, these databases use difference mechanisms for persisting the data. The highest performance option is “in memory,” where the entire database is kept in the very fast memory system of your servers. » Interface diversity: Although most of these technologies support RESTful APIs as their “go oY MW Se Te to” interface, they also offer a Wide vay connection mechanisms for programm’) 2 database managers, including analysis vo hy, reporting/visualization. » Eventual Consistency: While RDB; ACID (Atomicity, Consistency, Isciy% Durability) for ensuring the consistency of e non-relational DBMS use BASE, BASE for Basically Available, Sott state, and Even” Consistency. 
Eventual consistency is ny important because it is responsible for conf, resolution when data is in motion be nodes ina distributed implementation. Thed, state is maintained by the software and 4 access model relies on basic availabilty 2.2.3 Integrating Big Data with Tradition, Data Warehouses Q12. What are the challenges facing by b, data in usage of traditional da warehouses ? Aus: While the worlds of big data and the traditiog data warehouse will intersect, they are unlikely, merge anytime soon. Think of a data warehouses a system of record for business intelligence, mus like a customer relationship management (CR or accounting system. These systems are high structured and optimized for specific purposes § addition, these systems of record tend to be high centralized. The diagram shows a typical approach to di flows with warehouses and marts : Transact onal Systens} Rahul Publications (4) a unit - Il Organizations will inevitably continue to use data warehouses to manage the type of structured and tional data that characterizes systems of record, These data warehouses will stil provide business analysts with the ability to analyze key data, trends, snd so on. However, the advent of big data is both challenging the role of the data warehouse and providing a complementary approach. The key challenges are outlined here and will be discussed with each architecture option, Data Loading » With no definitive format or metadata or Data Availability > Data availability has been a challenge for any system that relates to processing and transforming data for use by end users, and big data is no exception. The benefit of Hadoop or NoSQL is to mitigate this riskand make data available for analysis immediately upon acquisition. The challenge is to load the data quickly as there is no pre-transformation required. > Data availability depends on the specificity of metadata to the SerDe or Avro layers. If data BIG DATA ANALYTICS can be adequately cataloged on acquisition, it can be available for analysis and discovery immediately. Since there is no update of data in the big data layers, reprocessing new data containing updates will create duplicate data, and this needs to be handled to minimize the impact on availability. Data volumes > Big data volumes can easily get out of control due to the intrinsic nature of the data. Care id to, the growth Je : lethe administrators with tools and tips to zone the infrastructure to mark the data in its own area, minimizing both risk and performance impact. Data exploration and mining is a very common activity that is a driver for big data acquisition across organizations, and also produces large data sets as the output of processing. These data sets need to be maintained in the big data system by periodically sweeping and deleting intermediate data sets. This is an area that normally is ignored by organizations and can be a performance drain over a period of time. Rahul Publications MSc Il Year Storage performance » Disk performance is an important consideration when building big data systems, and the appliance model can provide a better focus on the storage class and tiering architecture, This will provide the starting kit for longer-term planning and growth management of the storage infrastructure. » If a combination of in-memory, SSD and traditional storage architecture is planned for big data processing, the persistence and exchange of data across the different layers can be consuming both processing time and cycles. 
Care needs to be extended in this area, and the appliance architecture provides a reference for such complex storage requirements. Operational costs Calculating the operational cost for a data warehouse and its big data platform is a complex task that includes initial acquisition costs for infrastructure, plus labor costs for implementing the architecture, plus infrastructure and labor costs for ongoing maintenance, including external help commissioned from consultants and experts. 2.2.4 Big Data Analysis and Data Warehouse Q13. Discuss about big data. and data warehouse. Aus: Big data is analysed to know the present behaviour or trends and, make future predictions. Various big data solutions are used for extracting useful information from big data. Big data canbe useful : » Enables the storage of very large amounts of heterogeneous data > Holds data in low cost storage devices » Keeps data in a raw or un structured format. You will find value in bringing the capabilities of the data warehouse and the big data environment together. You need to create a hybrid environment where big data can work hand in hand with the data warehouse. _ MI Semas, " The data warehouse for what it has be, designed to do — provide a well-vetted version ® the trth about a topic thatthe business wan analyze. The warehouse might include informa, about a particular company’s product line, customers, its supplies, andthe deals of a year worth of transactions. The information managed in the dat, warehouse or a departmental data mart has been carefully constructed so that metadata is accurate With the growth of new web-based information, is practical and often necessary to analyze this massive amount of data in context with historicg data. This is where the hybrid model comes in. Certain aspects of marrying the data warehouse with big data can be relatively easy. For example, many of the big data sources come from sources that include their own well-designed metadata, Complex e-commerce sites include well-defined data elements. Therefore, when conducting analysis between the warehouse and the big data source, the information management organization is working with two data sets with carefully designed metadata models that have to be rationalized. Of course, in some situations, the information sources lack explicit metadata. Before an analyst can combine the historical transactional data with the less structured big data, work has to be done. ‘Typically, initial analysis of petabytes of data will reveal interesting patterns that can help predict subtle changes in business or potential solutions to a patient’s diagnosis. The initial analysis can be completed leveraging tools like MapReduce with the Hadoop distributed file system framework. At this point, you can begin to understand whether it is able to help evaluate the problem being addressed. In the process of analysis, it is just as important to eliminate unnecessary data as it is to identify data relevant to the business context. When this phase’s complete, the remaining data needs to be transformed so that metadata definitions are precise. In this way, when the big data is combined with traditional, historical data from the warehouse, the results will be accurate and meaningful.s Rahul Publications 150) — unit ll ‘The Big Data Integration Lynchpin This process requires a well-defi i -defin integration strategy. While data eee critical element of managing big data, it is e ie important when creating a hybrid analysis itt te data warehouse. 
In fact, the process of extractir e data and transforming it in a hybrid ers very similar to how this process is executed withi traditional data warehouse. ae In the data warehouse, data is extracted from traditional source systems such as CRM etree oO systems. Itis critical that elements from these cee th iche » sources cam: neaningful I provide the business with ‘on the need to analyze a requires monitoring, sciheond typical data warehouse will a snapshot of data based particular business issue that such as inventory or sales. The distributed structure of big data will often data into a series of lead organizations to first load nodes and then perform the extraction and transformation. When creating 2 hybrid of the traditional data warehouse and the big data environment, the distributed nature of the big data environment can dramatically change the capability of organizations to analyze huge volumes of data in context with the business. BIG DATA ANALYTICS 2.2.5 Changing Deployment Models in Big Data Era Q14. What are the various deployment models in big data era? Discuss. Aus : With the advent of big data, the deployment models for managing data are changing. The traditional data warehouse is typically implemented ona single, large system within the data center. The costs of this model have led organizations to optimize these warehouses and limit the scope and size of lytical'eng! to simplify the process of analyzing data from multiple sources. The appliance is therefore asingle- purpose system that typically includes interfaces to make it easier to connect to an existing data warehouse. * THE BIG DATA CLOUD MODEL The cloud is becoming a compelling platform to manage big data and can be used’in a hybrid environment with ori-premises environments. Some Gf the new innovations in loading and transferring data are already changing the potential viability of the cloud as a big data warehousing platform. » ‘Rahul Publ For example, Aspera, a company that specializes in fast data transferring between networks, is partnering with Amazon.com to offer cloud data management services. Other vendors such as FileCatalyst and Data Expedition are also focused on this market. In essence, this technology category leverages the network and optimizes it for the purpose-of moving files with reduced latency, As this problem of latency in data transfer continues to evolve, it will be the norm to store big data systems in the cloud that can interact with a data warehouse that is also cloud based or a warehouse that sits in the data center. Processine Your Dara Wit MapReDuce 2.3.1 Developing Simple Mapreduce Application QI5. Expalin Word Count - Hadoop Map Reduce Example - How it works ? Aus: Hadoop WordCount operation occurs in 3 stages ~ i. Mapper Phase ii, Shuffle Phase iii, Reducer Phase Hadoop WordCount Example- Mapper Phase Execution The text from the input text file’ is'tokenized into words to form a key value pair with all the words present in the input text file. The key is the word from the input file and value is ‘1’. For instance if you consider the sentence “An elephant is an animal”. The mapper phase in the WordCount example will split the string into individual tokens i.e. words. 
In this case, the entire sentence will be split into 5 tokens with a value 1 as shown below - Key-Value pairs from Hadoop Map Phase Execution- (an,1) (elephant,1) (is,1) (an,1) (animal,1) MSc Il Year I Semey, ‘ ~ ey Hadoop WordCount Example- Shute py, Execution m After the map phase execution is comp] cuccesil, shufle phases executed automaye wherein the key-value pairs generated in the. phase are taken as input and then sortes™ alphabetical order. After the shutle pha” executed from the WordCount example code, output will look like this - (an,1) (an,1) (animal,1) (elephant, 1) (is,1) Hadoop WordCount Example- Reducer Phase Execution In the reduce phase, all the keys are groupes together and the values for similar keys are addej up to find the occurrences for a particular word, | is like an aggregation phase for the keys generated by the map phase, The reducer phase takes the ‘output of shuffle phase as input and then reduces the key-value pairs to unique keys with values added up. In our example “An elephant is an animal.” is the only word that appears twice in the sentence After the execution of the reduce phase of MapReduce WordCount example progran, appears as a key only once but with a count of 2 as shown below - (an,2) (animal,1) (elephant, 1) (is,1) This is how the MapReduce word count program executes and outputs the number of occurrences of a word in any given input file. An important point to note during the execution of the WordCount example is that the mapper class in the WordCount program will execute completely on the entire input file and not just a single sentence. Suppose if the input file has 15 lines then the mapper class will split the words of all the 15 lines and form initial key value pairs for the entire dataset The reducer execution will begin only after the mapper phase is executed successfully. Rahul Publications (@)— — unit =I BIG DATA ANALYTICS Running the WordCount Example in Hadoop MapReduce using Java Project with Eclipse step 1 Let’s create the java project with the name “Sample WordCount” as shown below - File > New > Project > Java Project > Next, “Sample WordCount” as our project name and click “Finish”: Edit Source Refactor Navigate Search Project Run Window ee ww Package Class Interface Enum ‘Annotation Source Folder Java Working Set Folder Create a Java Project Create a Java project in the workspace or in an extemal & Use default location JRE : @ Use an execution enyironment JRE: © Use a project specific JRE: © Use defgult JRE (currently ‘jdk1.6.0_32') Configure IRES... Project layout. . (=) Rahul Publications a. MSc Il Year H Semey. Step 2 : The next step is to get references to hadoop libraries by clicking on Add JARS as follows 5 Openinten Wen open petteahy _ ee cari on soieennewe > eee ose coy on RE Sten ray CFL Ones Nene (esi and cusy ice ouee utero > Same stitetes > tsa sence > bout etn 5 ise jet Close ured ject Assi Woteg Se. ess > etn as > Yolate en > emo wen > “Rest em ts ety. Seoieweccre Comput > io Binne i IER seca nn maa Gp oc ws cs. 
Step 3) Create a new Java package in the project, named com.code.dezyre.
Step 4) Create a class named WordCount in the package com.code.dezyre, with a public static void main(String[] args) method stub. (The class needs imports for java.io.IOException, java.util.*, org.apache.hadoop.fs.Path, org.apache.hadoop.io.*, and org.apache.hadoop.mapred.*.)
Step 5) Create a Mapper class within the WordCount class which extends MapReduceBase and implements the Mapper interface. The mapper class contains the map() method, i.e. the mapper-stage business logic.

Mapper class code for the WordCount example in Hadoop MapReduce:

public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        // Break the line into individual word tokens.
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

In the mapper class code we have used the StringTokenizer class, which takes the entire line and breaks it into small tokens (strings/words).

Step 6) Create a Reducer class within the WordCount class extending MapReduceBase and implementing the Reducer interface. The reducer class contains the reduce() method, i.e. the reducer-stage business logic.

Reducer class code for the WordCount example in Hadoop MapReduce:

public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

Step 7) In the main() method, configure and submit the job (driver code):

public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    // conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
}

Step 8) Create the JAR file for the WordCount class by exporting the project as a JAR from Eclipse.
5 JAR te: (momeiciouderaibesk tone Options: ‘% Compress the contents of the JAR fle & Age directory enties Qvervnte existing files without warning Bi cai cance, is How to execute the Hadoop MapReduce WordCount program ? >>hadoop jar (jar file name) (className_along_with_packageName) (input file) (output folderpath) ~——hadoop jar dezyre_wordcount.jar com.code.dezyre. WordCount Juser/cloudera/Input/war_and_peace /user/cloudera/Output eee it View Search Jemunal Help count jar com udera@localhost Desktop] hadoop jar dezyre wordcot fatoane Juser/Cloudera/Tnput/uar and peace /user/cloudera/Output ll code dezyre. {#}— Rahul Publications M.Sc Il Year Mis, M. . ee, Output of Executing Hadoop WordCount Example - HOFS/userktoudera/Outputpar.. | & | 7 cates lcaidomain on Abie ear Tete gine Prost vistee™ \lewera loudera Manager hue £°MOFSNamelode ‘ HadooplobMacket | )MBase Mag 2.3.2 Points to Consider While Designing Mapreduce Q16. What are the points need to be considered while designing the mapreduce ? Aus : You need to consider the following guidelines while designing the mapreduce applications : When dividing information.among the map tasks , make sure that you do not create too man; mappers. » — Having the right number of mapper has the following benefits : Excessive mappers might create scheduling and infrastructure overhead. In some cases it a: even terminate the job tracker. 7 Having fewer mappers may lead to under utilization of some parts of the storage and excessie burden on the other parts of the storage. «Avast number of small mappers perform a large number of seek operations, Another important factor is the number of reducers configured for the application. Too many « too few reducers. may have negative-impact on productivity. = Then number of reducers designed for an application is an essential variable for the applicator = Besides planning framework overhead, having too may reducers generates a correspondi number of outputs , which has the negative effect on name node. = Having few reducers has the same negative impact of having few mappers. Rahul Publications uNit-tl BIG DATA ANALYTICS Use job counters properly as described below: + Computers are suitable for keeping track of global data of bits of information. They are not meant to calculate aggregate statistics, + Counters are costly because the job tracker must keep track of all counters of each mapiteduce task. > You must compress the output of an application with appropriate compressor. » Use a suitable file format for storing the output of mapreduce jobs. » You must use the large output block size for individual input/output documents. 4 Customizine Maprepuct EXECUTION inputFonmat Inputspiit RecordReader InputKey InputValue RecordReador Partioner Parttioner Parttioner Fig: Functionality of input format {a1} : Rahul Publications "Yeo ee MN Sem, s - “ery, » We have 2 methods to get the data to. mapper in MapReduce: getsplits() and createRecordReny ac shown below a) vicabstraciclassinputFormat abstractList getSplits(JobContext context) \s IOCNception, InterruptedException; ‘iRecordReader TecgrdReader(InputSplit split, “ontext context) throws IOException, ‘Exception; “nputSplit in Hadoop MapReduce is the logical representation of data. It describes a Unt g ‘hat contains a single map task in a MapReduce program. »p InputSplit represents the data which is processed by an individual Mapper. The split is Pd into records. Hence, the mapper process each record (which is a key-value pair). 
“'apPeduce InputSplitength is measured in bytes and every InputSplit has storage locations (hostname ). MapReduce system use storage locations to place map tasks as close to split’s data as Possible asks are processed in the order of the size of the splits so that the largest one gets processed fry eedy approximation algorithm) and this is done to minimize the job runtime (Leam MapReduce jo, opti n techniques)The important thing to notice is that Inputsplit does not contain the input data; i reference to the data. y ““s @ user, we don’t need to deal with InputSplit directly, because they are created by 2» InputFormat (InputFormat creates the Inputsplit and divide into records). FileInputFormat, by defaui, sveaks a file into 128MB chunks (same as blocks in HDFS) and by setting mapred.min. split size paramete: pred-site.xml_we can control this value or by overriding the parameter in the Job object used ty mint particular MapReduce job, We can also contol how the fil is broken up into splits, by urtg @ custom InputFormat. InputSpit in Hadoop is user defined. User can control split size according to the size of data is Reduce program. Thus the number of map tasks is equal to the number of InputSplits, The client (running the job) can calculate the splits fora job by calling ‘getSplit(), and then sent io ‘2 application master, which uses their storage locations to schedule map tasks that will process them on the cluster. Then, map task passes the split to the creatéRecordReader() method on InputFormat to ga RecordReader for the split and RecordReader generate record (key-value pait), which it passes tothe map function. ; RecordReader MapReduce has a simple model of data processing. Inputs and Outputs for the map and reduce ‘unctions are key-value pairs. The map and reduce functions in Hadoop MapReduce have the following gen2ra. form ; » map: (K1, V1) — list(K2, V2) + reduce: (K2, list(V2)) > list(K3, V3) * Now before processing, it needs to know on which data to Process, this is achieved with the InputFormat class. InputFormat is the class which selects the file fromHDFS that should be input to the map function. An InputFormat is also responsible for creating (alval Publications | BIG DATA ANALYTICS Del 3a ving them into records, The data is divded into the numberof eae ne in . This i i it whi i i ty a single map. sis called as inputsplit which is the input that is processed InputFormat class calls the getSplits() function and computes splits for each file and then sends them to the JobTracker, which uses their storage Tocations to schedule map tasks to process them on the TaskTrackers. Map task then passes the split to the create RecordReader() method on InputFormat in task tracker to obtain a RecordReader for that split, The RecordReader load’s data from its source and converts into key-value pairs suitable for reading by the mapper, Hadoop RecordReader uses the data within the boundaries that are being created by the inpuspt and creates Key-value pairs for the mapper. The “start” is the byte position in the file where the RecordReader should start generating key/value pairs and the “end” i 2.4.2 Reading Data with Custom Recordreader Q18, Explain how to read data with MapReduce record reader. A RecordReader is more than iterator over records, and map task uses one record to generate key- value pair which is passed to the map function. We can see this by using mapper's run function : Publicvoid run(Context context) throws IOException, InterruptedException{ ‘Spre>setup(context); . 
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
}

After running setup(), the nextKeyValue() call repeats on the context to populate the key and value objects for the mapper. The key and value are retrieved from the RecordReader by way of the context and passed to the map() method to do its work. The input to the map function, which is a key-value pair (K, V), gets processed as per the logic mentioned in the map code. When the reader gets to the end of the records, the nextKeyValue() method returns false.

A RecordReader usually stays within the boundaries created by the InputSplit to generate key-value pairs, but this is not mandatory. A custom implementation can even read data outside of its InputSplit, although this is not encouraged.

The RecordReader instance is defined by the InputFormat. By default, it uses TextInputFormat for converting data into key-value pairs; TextInputFormat provides the LineRecordReader for this purpose.

Data set example

Take the below example of a (dummy) data set, where all the records are separated by the same specific string, a line of ten dashes (the separator lines are omitted in the listing):

pleff lorem monaq morel plaff lerom baple merol plff ipsum ponaq mipsu ploff pimsu caple supim pluff sumip qonaq issum daple ussum ronaq ossom fap25 abcde tonag fghij merol pliff ipsum ponaq mipsu ploff pimsu caple supim pluff sumip qonaq issum daple ussum ronaq ossom faple abc75 tonaq fghij gaple kimno vonag parst haple uvwxy nonaq qonag issum daple ussum ronaq ossom fal25 abcde lerom baple merol pif ipsum ponaq mipsu ploff pimsu caple supim pluff sumih Qonaq

Implementation

In that case (records are always separated by the same "10-dash" string), the implementation is somewhat out of the box. Indeed, the default LineReader can take as an argument a recordDelimiterBytes byte array that can be retrieved / set directly from the Hadoop configuration. This parameter will be used as a string delimiter to separate distinct records. Just make sure to set it up in your MapReduce driver code:

Configuration conf = new Configuration(true);
conf.set("textinputformat.record.delimiter", "----------");

...and to specify the default TextInputFormat for your MapReduce job's InputFormat:

Job job = new Job(conf);
job.setInputFormatClass(TextInputFormat.class);

Instead of processing one given line at a time, you should now be able to process a full multi-line record. Your mapper instances will be supplied with the following keys/values:

» Key is the offset (the location of your record's first line)
» Value is the record itself

Note that the default delimiter is the CRLF (additionally CR) character. Using the Hadoop default configuration, the LineReader can therefore be seen as a record-by-record reader that uses a CRLF delimiter, thus behaving as a line-by-line reader.
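For illustration, a mapper consuming these multi-line records could look like the following sketch (the class name RecordMapper and the word-counting logic are hypothetical; the key/value types follow TextInputFormat):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The key is the byte offset of the record's first line; the value is the
// whole multi-line record delivered thanks to the custom record delimiter.
public class RecordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        // Each record spans several physical lines; emit (word, 1) for each token.
        for (String line : record.toString().split("\n")) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    context.write(new Text(word), ONE);
                }
            }
        }
    }
}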
2.4.3 Organizing Output Data with OutputFormats

Q19. What is OutputFormat? Explain the various types of OutputFormat in MapReduce.

Ans :

» The MapReduce job checks that the output directory does not already exist.

» OutputFormat provides the RecordWriter implementation to be used to write the output files of the job. Output files are stored in a FileSystem.

» The FileOutputFormat.setOutputPath() method is used to set the output directory. Every Reducer writes a separate file in a common output directory.

Types of OutputFormat in MapReduce

There are various types of Hadoop OutputFormat. Let us see some of them below:

Fig.: Types of OutputFormat in MapReduce

TextOutputFormat

The MapReduce default OutputFormat is TextOutputFormat, which writes (key, value) pairs on individual lines of text files; its keys and values can be of any type, since TextOutputFormat turns them to strings by calling toString() on them. Each key-value pair is separated by a tab character, which can be changed using the mapreduce.output.textoutputformat.separator property. KeyValueTextInputFormat is used for reading these output text files, since it breaks lines into key-value pairs based on a configurable separator.

SequenceFileOutputFormat

It is an OutputFormat which writes sequence files for its output. It is an intermediate format used between MapReduce jobs, which rapidly serializes arbitrary data types to the file; the corresponding SequenceFileInputFormat will deserialize the file into the same types and present the data to the next mapper in the same manner as it was emitted by the previous reducer, since these files are compact and readily compressible. Compression is controlled by the static methods on SequenceFileOutputFormat.

SequenceFileAsBinaryOutputFormat

It is another form of SequenceFileOutputFormat, which writes keys and values to a sequence file in binary format.

MapFileOutputFormat

It is another form of FileOutputFormat in Hadoop, which is used to write output as map files. The keys in a MapFile must be added in order, so we need to ensure that the reducer emits keys in sorted order.

MultipleOutputs

It allows writing data to files whose names are derived from the output keys and values, or in fact from an arbitrary string.

LazyOutputFormat

Sometimes FileOutputFormat will create output files even if they are empty. LazyOutputFormat is a wrapper OutputFormat which ensures that the output file will be created only when the first record is emitted for a given partition.

DBOutputFormat

DBOutputFormat in Hadoop is an OutputFormat for writing to relational databases and to HBase. It sends the reduce output to a SQL table. It accepts key-value pairs, where the key has a type extending DBWritable. The returned RecordWriter writes only the key to the database with a batch SQL query.
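As a rough illustration of how these OutputFormats are chosen, the sketch below shows the driver-side wiring (the helper class name OutputFormatWiring and its method are hypothetical; the job instance is assumed to be configured elsewhere):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class OutputFormatWiring {
    // 'job' is assumed to be an already created Job instance.
    static void configureOutput(Job job) {
        // Default choice: tab-separated text output, one key-value pair per line.
        job.setOutputFormatClass(TextOutputFormat.class);

        // Alternative: compressed sequence-file output, handy between chained jobs.
        // job.setOutputFormatClass(SequenceFileOutputFormat.class);
        // SequenceFileOutputFormat.setCompressOutput(job, true);
        // SequenceFileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        // Alternative: create output files lazily, only when a record is written.
        // LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
    }
}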
2.4.4 Customizing Data with RecordWriter

Q20. Explain about the use of Hadoop RecordWriter.

Ans :

Hadoop RecordWriter

As we know, the Reducer takes as input a set of intermediate key-value pairs produced by the mapper and runs a reducer function on them to generate output that is again zero or more key-value pairs. RecordWriter writes these output key-value pairs from the Reducer phase to output files.

In the OutputFormat class, getRecordWriter() provides a RecordWriter, checkOutputSpecs() checks the validity of the output specification for the job, and getOutputCommitter() returns an OutputCommitter.

As described above, the RecordWriter writes key/value pairs to an output file:

public abstract class RecordWriter<K, V> {
    /**
     * Writes a key/value pair.
     * @param key the key to write.
     * @param value the value to write.
     */
    public abstract void write(K key, V value)
        throws IOException, InterruptedException;

    /**
     * Close this RecordWriter to future operations.
     * @param context the context of the task.
     */
    public abstract void close(TaskAttemptContext context)
        throws IOException, InterruptedException;
}

A good example of the RecordWriter and the OutputFormat is illustrated in the class TextOutputFormat, where the LineRecordWriter is defined as follows:

// Excerpt from Hadoop's TextOutputFormat (imports and getRecordWriter() omitted)
public class TextOutputFormat<K, V> extends FileOutputFormat<K, V> {
    public static String SEPERATOR = "mapreduce.output.textoutputformat.separator";

    protected static class LineRecordWriter<K, V> extends RecordWriter<K, V> {
        private static final String utf8 = "UTF-8";
        private static final byte[] newline;
        static {
            try {
                // use "\n" as the newline delimiter
                newline = "\n".getBytes(utf8);
            } catch (UnsupportedEncodingException uee) {
                throw new IllegalArgumentException("can't find " + utf8 + " encoding");
            }
        }

        protected DataOutputStream out;
        private final byte[] keyValueSeparator;

        public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
            this.out = out;
            try {
                // set the key-value separator
                this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
            } catch (UnsupportedEncodingException uee) {
                throw new IllegalArgumentException("can't find " + utf8 + " encoding");
            }
        }

        // if no key-value separator is defined, use tab as the separator
        public LineRecordWriter(DataOutputStream out) {
            this(out, "\t");
        }

        /**
         * Write the object to the byte stream, handling Text as a special case.
         * @param o the object to print
         * @throws IOException if the write throws, we pass it on
         */
        private void writeObject(Object o) throws IOException {
            if (o instanceof Text) {
                Text to = (Text) o;
                out.write(to.getBytes(), 0, to.getLength());
            } else {
                out.write(o.toString().getBytes(utf8));
            }
        }

        public synchronized void write(K key, V value) throws IOException {
            boolean nullKey = key == null || key instanceof NullWritable;
            boolean nullValue = value == null || value instanceof NullWritable;
            // return if both key and value are null
            if (nullKey && nullValue) {
                return;
            }
            // write the key first
            if (!nullKey) {
                writeObject(key);
            }
            // write the separator between key and value
            if (!(nullKey || nullValue)) {
                out.write(keyValueSeparator);
            }
            // write the value and terminate the line
            if (!nullValue) {
                writeObject(value);
            }
            out.write(newline);
        }

        public synchronized void close(TaskAttemptContext context) throws IOException {
            out.close();
        }
    }
}

Q21. Explain about MapReduce Combiner.

Ans :

When we run a MapReduce job on a large dataset, large chunks of intermediate data are generated by the Mapper, and this intermediate data is passed on to the Reducer for further processing, which leads to enormous network congestion. The MapReduce framework provides a function known as the Combiner that plays a key role in reducing network congestion.

The Combiner in MapReduce is also known as a 'mini-reducer'. The primary job of the Combiner is to process the output data from the Mapper before passing it to the Reducer. It runs after the Mapper and before the Reducer, and its use is optional.

Fig.: MapReduce program without Combiner (all 9 intermediate keys from the mappers are sent to the Reducer)

In the above diagram, no combiner is used. The input is split across two mappers, and 9 keys are generated by the mappers. Now we have intermediate data of 9 key/value pairs; the mappers send this data directly to the reducer, and while sending data to the reducer it consumes some network bandwidth (bandwidth here means the time taken to transfer data between two machines). It will take more time to transfer the data to the reducer if the size of the data is big.

Now, if we use a combiner in between the mapper and the reducer, the combiner aggregates the intermediate data (9 key/value pairs) before sending it to the reducer and generates only 4 key/value pairs as output.

Fig.: MapReduce program with Combiner (the Combiner runs in between the Mapper and the Reducer)
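To make this concrete, a combiner for the WordCount example can reuse reducer-style logic, because summing counts is associative and commutative. The sketch below (the class name WordCountCombiner is hypothetical) shows such a combiner; it would be registered in the driver with job.setCombinerClass(WordCountCombiner.class).

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Runs on the map side, locally summing the (word, 1) pairs emitted by one map
// task so that fewer key/value pairs travel over the network to the reducers.
public class WordCountCombiner
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}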
