
UNIT-5 SAAS

Security in Clouds
 Cloud Security, also known as cloud computing security, consists of a set of policies,
controls, procedures and technologies that work together to protect cloud-based systems,
data, and infrastructure.
 These security measures are configured to protect cloud data, support regulatory
compliance and protect customers' privacy as well as setting authentication rules for
individual users and devices.
 From authenticating access to filtering traffic, cloud security can be configured to the exact
needs of the business. And because these rules can be configured and managed in one place,
administration overheads are reduced and IT teams empowered to focus on other areas of
the business.
 The way cloud security is delivered will depend on the individual cloud provider or the
cloud security solutions in place. However, implementation of cloud security processes
should be a joint responsibility between the business owner and solution provider.
 For businesses making the transition to the cloud, robust cloud security is imperative.
Security threats are constantly evolving and becoming more sophisticated, and cloud
computing is no less at risk than an on-premise environment. For this reason, it is essential
to work with a cloud provider that offers best-in-class security that has been customized for
your infrastructure.
Benefits of Cloud Security
1. Centralized security: Just as cloud computing centralizes applications and data,
cloud security centralizes protection. Cloud-based business networks consist of numerous
devices and endpoints that can be difficult to manage when dealing with shadow IT or
BYOD. Managing these entities centrally enhances traffic analysis and web filtering,
streamlines the monitoring of network events and results in fewer software and policy
updates. Disaster recovery plans can also be implemented and actioned easily when they
are managed in one place.

2. Reduced costs: One of the benefits of utilizing cloud storage and security is that it
eliminates the need to invest in dedicated hardware. Not only does this reduce capital
expenditure, but it also reduces administrative overheads. Where once IT teams were
firefighting security issues reactively, cloud security delivers proactive security features
that offer protection 24/7 with little or no human intervention.
3. Reduced Administration: When you choose a reputable cloud services provider or
cloud security platform, you can kiss goodbye to manual security configurations and almost
constant security updates. These tasks can be a massive drain on resources, but when you
move them to the cloud, all security administration happens in one place and is fully
managed on your behalf.
4. Reliability: Cloud computing services offer the ultimate in dependability. With the
right cloud security measures in place, users can safely access data and applications within
the cloud no matter where they are or what device they are using.

SaaS security refers to the practices and policies implemented by the providers of software-as-a-
service (SaaS). These security policies make SaaS apps safe and trustworthy. Let us dive into how
SaaS security as a service can help make your software more secure.
What is SaaS Security?

SaaS (Software as a Service) security refers to the measures and processes implemented to protect
the data and applications hosted by a SaaS provider. This typically includes measures such as
encryption, authentication, access controls, network security, and data backup and recovery.

Why is SaaS Security important?

SaaS (Software as a Service) has become increasingly popular in recent years due to its flexibility,
cost-effectiveness, and scalability. However, this popularity also means that SaaS providers and
their customers face significant security challenges.

SaaS Security is important because:

 Sensitive data is well-protected and not compromised by hackers, malicious insiders,
or other cyber threats.
 SaaS security helps avoid severe consequences such as legal liabilities, damage to reputation,
and loss of customers.
 It helps increase customers' trust in the SaaS provider.
 It aids in compliance with security standards and regulations.
 It ensures that hosted applications and data are protected from cyber threats,
minimizing the chances of data breaches and other security incidents.

Challenges in SaaS security

Some of the most significant challenges in SaaS security include:

1. Lack of Control

SaaS providers typically host applications and data in the cloud, meaning that customers have less
direct control over their security. This can make it challenging for customers to monitor and manage
security effectively.
2. Access Management

SaaS applications typically require users to log in and authenticate their identity. However,
managing user access can be challenging, particularly if the provider is hosting applications for
multiple customers with different access requirements.

3. Data Privacy

SaaS providers may be subject to data privacy regulations, which can vary by jurisdiction. This can
make it challenging to ensure compliance with all relevant laws and regulations, particularly if the
provider hosts data for customers in multiple countries.

4. Third-party integration

SaaS providers may integrate with third-party applications, such as payment processors or
marketing platforms. However, this can increase the risk of security incidents, as vulnerabilities in
third-party software can potentially affect the entire system.

5. Continuous monitoring

SaaS providers must continuously monitor their systems for security threats and vulnerabilities.
This requires a high level of expertise and resources to detect and respond to security incidents
effectively.

SaaS Security: SaaS security is the managing, monitoring, and safeguarding of sensitive
data from cyber-attacks. As cloud-based IT infrastructures grow in efficiency and scalability,
organizations also become more vulnerable.

SaaS maintenance measures such as SaaS security posture management ensure privacy and safety
of user data. From customer payment information to inter-departmental exchange of information,
strengthening the security of SaaS applications is vital to your success.

To help this cause, regulatory bodies worldwide have issued security guidelines such as the GDPR
(the EU's General Data Protection Regulation) and the EU-US and Swiss-US Privacy Shield
Frameworks.

Every SaaS business must adopt these guidelines to offer safe and secure services. Whether you
are starting anew or adding an aspect to your IT arsenal, SaaS security is essential for successful
ventures.
Who needs SaaS Security?

Do you cater to a sizeable market?

Do you deal with hundreds of concurrent sessions?

Are these sessions run by thousands of users every day?

If your answer to the above questions is yes, SaaS security is a must for you. Moreover, if you
relate to the following statements, you need to have a SaaS Security system in place on the double!

 I wish to eliminate the legacy IT infrastructure. It gets outdated faster than we can adapt to it.
However, I am worried about data privacy.
 I am sure that SaaS and cloud-based technologies are the future, but how does one ensure that
there are no data breaches?
 It is high time that we employ cloud-based products and services. The competition is killing us
in the market. But how will we secure user data without physical servers?

Whether you’re an established business or an upcoming start-up, safeguarding user data proves
to be very helpful in attracting, engaging, and retaining customers. Hyper-competitive markets of
today leave no space for error. A single data breach can be the cause of your SaaS business being
blacklisted in the minds of consumers forever.

The Anatomy of SaaS Security

Every organization offering a cloud-based service can leverage preventive measures such as SaaS
security posture management to continuously monitor and protect sensitive information.

Let us understand the anatomy of SaaS security in cloud computing environments. If we look at
an ideal SaaS product technology stack from a bird’s eye view, it forms a three-layer cake where
each part represents different environments.

Three layers of SaaS security:

 Infrastructure (server-side)
 Network (the internet)
 Application and Software (client-side)
Infrastructure

The server-side of your technology stack refers to the internal exchange of information. For
instance, if your SaaS business is using AWS, you must secure every point of information
exchange between the cloud storage provider and your software platform.

Every I/O operation initiated from the client side starts at this level. Moreover, depending upon the kind of
storage you purchase (shared, dedicated, or individual server), you must adjust your SaaS
security measures accordingly.

Network

The exchange of information between the server-side and client-side is done over the internet.
This is by far the most vulnerable layer of every SaaS business. Hackers are well versed in finding
back-doors through weak encryptions of data packets exchanged over the internet.

The effectiveness of SaaS security is directly proportional to the integrity of the data encryption
methods and the ability to monitor information exchange over the internet in real time. With
the advent of digital payments and online KYC checks, businesses are constantly sending and receiving
sensitive information. Hence, it becomes even more important to install network security
measures.
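
As an illustration of the point above, the following is a minimal sketch (not any particular SaaS product's API) of application-layer encryption of a sensitive payload before it crosses the network, using the Python `cryptography` package's Fernet scheme. The payload fields and key handling are purely illustrative; in practice this complements, rather than replaces, TLS for data in transit.

from cryptography.fernet import Fernet

key = Fernet.generate_key()        # in a real system the key would come from a key management service
cipher = Fernet(key)

payload = b'{"card": "4111...", "kyc_id": "ABC123"}'   # hypothetical sensitive fields
token = cipher.encrypt(payload)    # this ciphertext is what actually travels over the wire

print(cipher.decrypt(token))       # only holders of the key can recover the data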
Common Cloud Security Standard

Cloud Security encompasses the technologies, controls, processes, and policies which
combine to protect your cloud-based systems, data, and infrastructure. It is a sub-domain
of computer security and more broadly, information security.

The most well-known standard in information security and compliance is ISO 27001, developed
by the International Organization for Standardization.

The ISO 27001 standard was created to assist enterprises in protecting sensitive data by following
best practices.

Cloud compliance is the principle that cloud-delivered systems must be compliant
with the standards their customers require. Cloud compliance ensures that cloud
computing services meet compliance requirements.

The OCC Mission


The Open Commons Consortium (OCC, formerly the Open Cloud Consortium) is
a 501(c)(3) non-profit venture which provides cloud computing and data commons resources to
support scientific, environmental, medical and health care research. The OCC manages and operates
resources including the Open Science Data Cloud (OSDC), which is a multi-petabyte
scientific data sharing resource. The consortium is based in Chicago, Illinois, and is
managed by the 501(c)(3) Center for Computational Science Research.
● The purpose of the Open Cloud Consortium is to support the development of standards for
cloud computing and to develop a framework for interoperability among various clouds.
● The OCC supports the development of benchmarks for cloud computing.
● Manages cloud computing testbeds, such as the Open Cloud Testbed, to improve cloud
computing software and services.
● Develops reference implementations, benchmarks and standards, such as the MalStone
Benchmark, to improve the state of the art of cloud computing.
● Sponsors workshops and other events related to cloud computing to educate the community
The Open Cloud Consortium (OCC):
 Is a not-for-profit organization.
 Manages and operates cloud computing infrastructure to support scientific, medical, health
care and environmental research.
 Has members spanning the globe, including over 10 universities, over 15 companies, and
over 5 government agencies and national laboratories.
 Is organized into several different working groups.

The Distributed Management Task Force (DMTF)

● DMTF enables more effective management of millions of IT systems worldwide by
bringing the IT industry together to collaborate on the development, validation and
promotion of systems management standards.
● The group spans the industry with 160 member companies and organizations, and more
than 4,000 active participants crossing 43 countries.
● The DMTF board of directors is led by 16 innovative, industry-leading technology
companies.
● DMTF management standards are critical to enabling management interoperability among
multi-vendor systems, tools and solutions within the enterprise.
The DMTF started the Virtualization Management Initiative (VMAN).
The Open Virtualization Format (OVF) is a fairly new standard that has emerged within
the VMAN Initiative.
Benefits of VMAN are
* Lowering the IT learning curve, and
* Lowering complexity for vendors implementing their solutions

 DMTF is a 501(c)(6) non-profit industry standards organization that creates open
manageability standards spanning diverse emerging and traditional IT infrastructures
including cloud, virtualization, network, servers and storage.
 Member companies and alliance partners collaborate on standards to improve interoperable
management of information technologies.
 Based in Portland, Oregon, the DMTF is led by a board of directors representing
technology companies including: Broadcom Inc., Cisco, Dell Technologies, Hewlett
Packard Enterprise, Intel Corporation, Lenovo, NetApp, Positivo Tecnologia S.A., and
Verizon.
 Founded in 1992 as the Desktop Management Task Force, the organization’s first standard
was the now-legacy Desktop Management Interface (DMI). As the organization evolved
to address distributed management through additional standards, such as the Common
Information Model (CIM), it changed its name to the Distributed Management Task Force
in 1999; it is now known simply as DMTF.
 The DMTF continues to address converged, hybrid IT and the Software Defined Data
Center (SDDC) with its latest specifications, such as CADF (Cloud Auditing Data
Federation), CIMI (Cloud Infrastructure Management Interface), CIM (Common
Information Model), DASH (Desktop and Mobile Architecture for System Hardware),
MCTP (Management Component Transport Protocol), NC-SI (Network Controller
Sideband Interface), OVF (Open Virtualization Format), PLDM (Platform Level Data
Model), Redfish Device Enablement (RDE), Redfish (including protocols, schema, host
interface, and profiles), SMASH (Systems Management Architecture for Server Hardware)
and SMBIOS (System Management BIOS).
UNIT-5 HADOOP
What is Hadoop
Hadoop is an open-source framework from Apache that is used to store, process, and analyse
data which are very huge in volume. Hadoop is written in Java and is not OLAP (online analytical
processing); it is used for batch/offline processing. It is being used by Facebook, Yahoo, Google,
Twitter, LinkedIn, and many more. Moreover, it can be scaled up just by adding nodes to the
cluster.

Modules/Ecosystem of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its paper GFS and on the basis of that
HDFS was developed. It states that the files will be broken into blocks and stored in nodes over
the distributed architecture. A distributed file system for storing application data on
commodity hardware. It provides high-throughput access to data and high fault
tolerance. The HDFS architecture features a NameNode to manage the file system
namespace and file access and multiple DataNodes to manage data storage
2. Yarn: Yet Another Resource Negotiator is used for job scheduling and managing the cluster. It
supports more workloads, such as interactive SQL, advanced modeling and real-time streaming
3. MapReduce: This is a framework which helps Java programs to do parallel computation on
data using key-value pairs. The Map task takes input data and converts it into a data set which can
be computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and then the
output of the reducer gives the desired result.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other Hadoop
modules. The common utilities and libraries that support the other Hadoop modules.
Also known as Hadoop Core
5. Hadoop Ozone: A scalable, redundant and distributed object store designed for big data
applications
Fig.1 Hadoop modules/ecosystem

Hadoop Architecture
The Hadoop architecture is a package of the file system, MapReduce engine and the HDFS
(Hadoop Distributed File System). The MapReduce engine can be MapReduce/MR1 or
YARN/MR2.

A Hadoop cluster consists of a single master and multiple slave nodes. The master node includes
Job Tracker, Task Tracker, NameNode, and DataNode whereas the slave node includes DataNode
and TaskTracker.
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It follows a
master/slave architecture, consisting of a single NameNode that performs the role of
master and multiple DataNodes that perform the role of slaves.

Both the NameNode and DataNode are capable enough to run on commodity machines. HDFS is
developed in Java, so any machine that supports Java can easily run
the NameNode and DataNode software.

NameNode
o It is a single master server that exists in the HDFS cluster.
o As it is a single node, it may become a single point of failure.
o It manages the file system namespace by executing an operation like the opening, renaming and
closing the files.
o It simplifies the architecture of the system.
DataNode
o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of the DataNode to serve read and write requests from the file system's clients.
o It performs block creation, deletion, and replication upon instruction from the NameNode.
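
To make the NameNode/DataNode split concrete, here is a minimal sketch of talking to HDFS from Python. It assumes WebHDFS is enabled on the NameNode and the third-party `hdfs` package is installed (pip install hdfs); the NameNode address, user, and paths are placeholders.

from hdfs import InsecureClient

# the NameNode resolves the namespace; DataNodes serve the actual block I/O
client = InsecureClient("http://namenode-host:9870", user="hadoop")

client.makedirs("/user/hadoop/demo")
with client.write("/user/hadoop/demo/hello.txt", encoding="utf-8", overwrite=True) as writer:
    writer.write("hello hdfs\n")

print(client.list("/user/hadoop/demo"))          # ['hello.txt']
with client.read("/user/hadoop/demo/hello.txt", encoding="utf-8") as reader:
    print(reader.read())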

Job Tracker
o The role of the Job Tracker is to accept MapReduce jobs from clients and process the data by using
the NameNode.
o In response, the NameNode provides metadata to the Job Tracker.

Task Tracker
o It works as a slave node for Job Tracker.
o It receives tasks and code from the Job Tracker and applies that code to the file. This process can also
be called a Mapper.

MapReduce Layer
The MapReduce comes into existence when the client application submits the MapReduce job
to Job Tracker. In response, the Job Tracker sends the request to the appropriate Task Trackers.
Sometimes, the TaskTracker fails or times out. In such a case, that part of the job is rescheduled.

Advantages of Hadoop
o Fast: In HDFS, the data is distributed over the cluster and mapped, which helps in faster retrieval.
Even the tools to process the data are often on the same servers, thus reducing the processing
time. Hadoop is able to process terabytes of data in minutes and petabytes in hours.
o Scalable: A Hadoop cluster can be extended by just adding nodes to the cluster.
o Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is really
cost-effective as compared to a traditional relational database management system.
o Resilient to failure: HDFS has the property of replicating data over the network, so
if one node is down or some other network failure happens, Hadoop takes another copy
of the data and uses it. Normally, data is replicated three times, but the replication factor is configurable.
o Compatible: Another big advantage of Hadoop is that, apart from being open
source, it is compatible with all platforms since it is Java-based.
Disadvantages:

 Not very effective for small data.
 Hard cluster management.
 Has stability issues.
 Security concerns.
 Complexity: Hadoop can be complex to set up and maintain, especially
for organizations without a dedicated team of experts.
 Latency: Hadoop is not well-suited for low-latency workloads and may
not be the best choice for real-time data processing.
 Limited Support for Real-time Processing: Hadoop’s batch-oriented
nature makes it less suited for real-time streaming or interactive data
processing use cases.
 Limited Support for Structured Data: Hadoop is designed to work with
unstructured and semi-structured data; it is not well-suited for structured
data processing.
 Data Security: Hadoop does not provide built-in security features such
as data encryption or user authentication, which can make it difficult to
secure sensitive data.
 Limited Support for Ad-hoc Queries: Hadoop’s MapReduce
programming model is not well-suited for ad-hoc queries, making it
difficult to perform exploratory data analysis.
 Limited Support for Graph and Machine Learning: Hadoop’s core
components, HDFS and MapReduce, are not well-suited for graph and
machine learning workloads; specialized components like Apache Giraph
and Mahout are available but have some limitations.
 Cost: Hadoop can be expensive to set up and maintain, especially for
organizations with large amounts of data.
 Data Loss: In the event of a hardware failure, the data stored in a single
node may be lost permanently.
 Data Governance: Data governance is a critical aspect of data
management, but Hadoop does not provide built-in features to manage data
lineage, data quality, data cataloguing, and data audit.
History of Hadoop
Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google
File System paper, published by Google.

Let's focus on the history of Hadoop in the following steps: -

o In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache Nutch. It is an
open source web crawler software project.

o While working on Apache Nutch, they were dealing with big data. Storing and processing that
data was very costly, and this problem became one of the important reasons for the emergence
of Hadoop.
o In 2003, Google introduced a file system known as GFS (Google file system). It is a proprietary
distributed file system developed to provide efficient access to data.
o In 2004, Google released a white paper on Map Reduce. This technique simplifies the data
processing on large clusters.
o In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch
Distributed File System). This file system also included MapReduce.
o In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, Doug
Cutting introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed
File System). Hadoop's first version, 0.1.0, was released in this year.
Year Event

2003 Google released the paper, Google File System (GFS).

2004 Google released a white paper on Map Reduce.

2006
o Hadoop introduced.
o Hadoop 0.1.0 released.
o Yahoo deploys 300 machines and within this year reaches 600 machines.

2007
o Yahoo runs 2 clusters of 1000 machines.
o Hadoop includes HBase.

2008
o YARN JIRA opened
o Hadoop becomes the fastest system to sort 1 terabyte of data on a 900 node cluster
within 209 seconds.
o Yahoo clusters loaded with 10 terabytes per day.
o Cloudera was founded as a Hadoop distributor.

2009
o Yahoo runs 17 clusters of 24,000 machines.
o Hadoop becomes capable enough to sort a petabyte.
o MapReduce and HDFS become separate subproject.

2010
o Hadoop added the support for Kerberos.
o Hadoop operates 4,000 nodes with 40 petabytes.
o Apache Hive and Pig released.

2011
o Apache Zookeeper released.
o Yahoo has 42,000 Hadoop nodes and hundreds of petabytes of storage.

2012 Apache Hadoop 1.0 version released.


2013 Apache Hadoop 2.2 version released.

2014 Apache Hadoop 2.6 version released.

2015 Apache Hadoop 2.7 version released.

2017 Apache Hadoop 3.0 version released.

2018 Apache Hadoop 3.1 version released.

o Doug Cutting named his project Hadoop after his son's toy elephant.

Hadoop - MapReduce

MapReduce is a framework using which we can write applications to process huge amounts of
data, in parallel, on large clusters of commodity hardware in a reliable manner.
What is MapReduce?

MapReduce is a processing technique and a program model for distributed computing based on
Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map
takes a set of data and converts it into another set of data, where individual elements are broken
down into tuples (key/value pairs). Secondly, the Reduce task takes the output from a Map as
input and combines those data tuples into a smaller set of tuples. As the name
MapReduce implies, the Reduce task is always performed after the Map job.

The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called mappers
and reducers. Decomposing a data processing application into mappers and reducers is sometimes
nontrivial. But, once we write an application in the MapReduce form, scaling the application to
run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a
configuration change. This simple scalability is what has attracted many programmers to use the
MapReduce model.
MapReduce and HDFS are the two major components of Hadoop which make it so powerful
and efficient to use. MapReduce is a programming model used for efficient processing in
parallel over large data-sets in a distributed manner. The data is first split and then combined to
produce the final result. Libraries for MapReduce have been written in many programming
languages, with various optimizations. The purpose of MapReduce in Hadoop
is to map each job and then reduce it into equivalent tasks, providing less
overhead over the cluster network and reducing the processing power required. The MapReduce task is
mainly divided into two phases: the Map phase and the Reduce phase.
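
Since MapReduce libraries exist in many languages, a Python word-count sketch using the Hadoop Streaming interface (which passes key-value pairs as tab-separated lines on standard input and output) can illustrate the two phases. The file names mapper.py and reducer.py are illustrative, and the streaming jar path on a real cluster varies by Hadoop distribution.

# --- mapper.py : the Map phase, emitting <word, 1> pairs ---
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

# --- reducer.py : the Reduce phase; input arrives already sorted by key ---
import sys

current_word, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        total += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(total))
        current_word, total = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(total))

# Local test (no cluster needed):  cat input.txt | python mapper.py | sort | python reducer.py
# On a cluster the same two scripts run under the Hadoop Streaming jar.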

MapReduce Architecture:

Components of MapReduce Architecture:

1. Client: The MapReduce client is the one who brings the Job to the MapReduce for
processing. There can be multiple clients available that continuously send jobs for processing
to the Hadoop MapReduce Manager.
2. Job: The MapReduce Job is the actual work that the client wants to do, which is composed
of many smaller tasks that the client wants to process or execute.
3. Hadoop MapReduce Master: It divides the particular job into subsequent job-parts.
4. Job-Parts: The tasks or sub-jobs that are obtained after dividing the main job. The results of
all the job-parts are combined to produce the final output.
5. Input Data: The data set that is fed to the MapReduce for processing.
6. Output Data: The final result is obtained after the processing.
The MapReduce task is mainly divided into 2 phases i.e. Map phase and Reduce phase.
1. Map: As the name suggests, its main use is to map the input data into key-value pairs. The
input to the map may be a key-value pair where the key can be the id of some kind of address
and the value is the actual data that it holds. The Map() function will be executed in its memory
repository on each of these input key-value pairs and generate the intermediate key-value
pairs, which work as input for the Reducer or Reduce() function.

2. Reduce: The intermediate key-value pairs that work as input for the Reducer are shuffled,
sorted, and sent to the Reduce() function. The Reducer aggregates or groups the data based on its key-
value pairs as per the reducer algorithm written by the developer.

Steps in Map Reduce


o The map takes data in the form of pairs and returns a list of <key, value> pairs. The keys
will not be unique in this case.
o Using the output of Map, sort and shuffle are applied by the Hadoop architecture. This sort
and shuffle acts on the list of <key, value> pairs and sends out each unique key together with
the list of values associated with it: <key, list(values)>.
o The output of sort and shuffle is sent to the reducer phase. The reducer performs a defined
function on the list of values for each unique key, and the final output <key, value> is
stored/displayed.
Sort and Shuffle

The sort and shuffle occur on the output of the Mapper and before the reducer. When the Mapper task
is complete, the results are sorted by key, partitioned if there are multiple reducers, and then
written to disk. Using the input from each Mapper <k2, v2>, we collect all the values for each
unique key k2. This output from the shuffle phase, in the form of <k2, list(v2)>, is sent as input to
the reducer phase.
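
The following self-contained illustration (not Hadoop itself, just ordinary Python with sample data) shows what sort and shuffle do: mapper output <key, value> pairs are sorted by key and grouped into <key, list(values)> before being handed to the reducer.

from itertools import groupby
from operator import itemgetter

map_output = [("deer", 1), ("bear", 1), ("river", 1), ("deer", 1), ("bear", 1)]

# sort and shuffle: sort by key, then group the values for each unique key
shuffled = [(key, [v for _, v in group])
            for key, group in groupby(sorted(map_output, key=itemgetter(0)),
                                      key=itemgetter(0))]
print(shuffled)            # [('bear', [1, 1]), ('deer', [1, 1]), ('river', [1])]

# reduce: apply a function (here, sum) to each value list
print([(key, sum(values)) for key, values in shuffled])   # final <key, value> output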
Usage of MapReduce
o It can be used in various applications like document clustering, distributed sorting, and web
link-graph reversal.
o It can be used for distributed pattern-based searching.
o We can also use MapReduce in machine learning.
o It was used by Google to regenerate Google's index of the World Wide Web.
o It can be used in multiple computing environments such as multi-cluster, multi-core, and
mobile environments.

The Algorithm
 Generally, the MapReduce paradigm is based on sending the computation to where the data
resides!
 MapReduce program executes in three stages, namely map stage, shuffle stage, and reduce
stage.
o Map stage − The map or mapper’s job is to process the input data. Generally the
input data is in the form of file or directory and is stored in the Hadoop file system
(HDFS). The input file is passed to the mapper function line by line. The mapper
processes the data and creates several small chunks of data.
o Reduce stage − This stage is the combination of the Shuffle stage and
the Reduce stage. The Reducer’s job is to process the data that comes from the
mapper. After processing, it produces a new set of output, which will be stored in
the HDFS.
 During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate
servers in the cluster.
 The framework manages all the details of data-passing such as issuing tasks, verifying task
completion, and copying data around the cluster between the nodes.
 Most of the computing takes place on nodes with data on local disks that reduces the
network traffic.
 After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.
Inputs and Outputs (Java Perspective)
The MapReduce framework operates on <key, value> pairs, that is, the framework views the input
to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of
the job, conceivably of different types.
The key and value classes should be serializable by the framework and hence need
to implement the Writable interface. Additionally, the key classes have to implement the
WritableComparable interface to facilitate sorting by the framework. The input and output types of
a MapReduce job are: (Input) <k1, v1> → map → <k2, v2> → reduce → <k3, v3> (Output).

Phase      Input                Output
Map        <k1, v1>             list(<k2, v2>)
Reduce     <k2, list(v2)>       list(<k3, v3>)
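
A concrete instance of the table above, sketched with Python type hints for word count (the function names are illustrative, not Hadoop API): k1 is a line offset, v1 a line of text, k2/k3 a word, v2 the count 1, and v3 the total count.

from typing import Iterable, List, Tuple

def map_fn(offset: int, line: str) -> List[Tuple[str, int]]:
    # <k1, v1> -> list(<k2, v2>)
    return [(word, 1) for word in line.split()]

def reduce_fn(word: str, counts: Iterable[int]) -> List[Tuple[str, int]]:
    # <k2, list(v2)> -> list(<k3, v3>)
    return [(word, sum(counts))]

print(map_fn(0, "hello hadoop hello"))   # [('hello', 1), ('hadoop', 1), ('hello', 1)]
print(reduce_fn("hello", [1, 1]))        # [('hello', 2)]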

Terminology
 PayLoad − Applications implement the Map and the Reduce functions, and form the core
of the job.
 Mapper − Mapper maps the input key/value pairs to a set of intermediate key/value pair.
 NameNode − Node that manages the Hadoop Distributed File System (HDFS).
 DataNode − Node where data is presented in advance before any processing takes place.
 MasterNode − Node where JobTracker runs and which accepts job requests from clients.
 SlaveNode − Node where Map and Reduce program runs.
 JobTracker − Schedules jobs and tracks the assigned jobs with the Task Tracker.
 Task Tracker − Tracks the task and reports status to JobTracker.
 Job − A program is an execution of a Mapper and Reducer across a dataset.
 Task − An execution of a Mapper or a Reducer on a slice of data.
 Task Attempt − A particular instance of an attempt to execute a task on a SlaveNode.

Advantages of MapReduce
1. Scalability
2. Flexibility
3. Security and authentication
4. Faster processing of data
5. Very simple programming model
6. Availability and resilient nature
Simple tips on how to improve MapReduce performance

1. Enable uber mode
2. Use the native library
3. Increase the block size
4. Monitor the time taken by map tasks
5. Identify whether data compression is splittable or not
6. Set the number of reduce tasks
7. Analyze the partitioning of data
8. Improve shuffle phase performance
9. Optimize the MapReduce code
UNIT-5 Google App Engine
A scalable runtime environment, Google App Engine is mostly used to run Web applications.
These applications scale dynamically as demand changes over time because of Google’s vast computing
infrastructure. Because it offers a secure execution environment in addition to a number of
services, App Engine makes it easier to develop scalable and high-performance Web apps.
Google’s applications will scale up and down in response to shifting demand. Cron tasks,
communications, scalable data stores, work queues, and in-memory caching are some of these
services.
The App Engine SDK facilitates the testing and professionalization of applications by emulating
the production runtime environment and allowing developers to design and test applications on
their own PCs. When an application is finished being produced, developers can quickly migrate
it to App Engine, put in place quotas to control the cost that is generated, and make the
program available to everyone. Python, Java, and Go are among the languages that are
currently supported.
The development and hosting platform Google App Engine, which powers anything from web
programming for huge enterprises to mobile apps, uses the same infrastructure as Google’s
large-scale internet services. It is a fully managed PaaS (platform as a service) cloud computing
platform that uses in-built services to run your apps. You can start creating almost immediately
after receiving the software development kit (SDK). You may immediately access the Google
app developer’s manual once you’ve chosen the language you wish to use to build your app.
After creating a Cloud account, you can start building your app using:
 The Go template/HTML package
 Python-based webapp2 with Jinja2
 PHP and Cloud SQL
 Java with Maven
The app engine runs the programs on various servers while “sandboxing” them. The app
engine allows a program to use more resources in order to handle increased demands. The app
engine powers programs like Snapchat, Rovio, and Khan Academy.

What is Google App Engine?


 Google App Engine (GAE) is a platform-as-a-service product that provides web app
developers and enterprises with access to Google's scalable hosting and tier 1 internet
service.

 GAE requires that applications be written in Java or Python, store data in
Google Bigtable and use the Google query language. Noncompliant applications
require modification to use GAE.
 GAE provides more infrastructure than other scalable hosting services, such as Amazon
Elastic Compute Cloud (EC2). GAE also eliminates some system administration and
development tasks to make writing scalable applications easier.

 Google provides GAE free up to a certain amount of use for the following resources:

 processor (CPU)
 storage
 application programming interface (API) calls
 concurrent requests

 Users exceeding the per-day or per-minute rates can pay for more of these resources.
How is GAE used?
GAE is a fully managed, serverless platform that is used to host, build and deploy web
applications. Users can create a GAE account, set up a software development kit and write
application source code. They can then use GAE to test and deploy the code in the cloud.

One way to use GAE is building scalable mobile application back ends that adapt to workloads
as needed. Application testing is another way to use GAE. Users can route traffic to different
application versions to A/B test them and see which version performs better under various
workloads.

(Figure: the App Engine architecture in cloud computing.)


Features of App Engine

Runtimes and Languages

To create an application for an app engine, you can use Go, Java, PHP, or Python. You can
develop and test an app locally using the SDK’s deployment toolkit. Each language’s SDK and
run time are unique. Your program is run in a:
 Java Run Time Environment version 7
 Python Run Time environment version 2.7
 PHP runtime’s PHP 5.4 environment
 Go runtime 1.2 environment
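
As a minimal sketch of the classic Python 2.7 runtime listed above, the following "hello world" handler uses the webapp2 framework bundled with the App Engine SDK. The route '/' and the handler name are illustrative only.

import webapp2

class MainPage(webapp2.RequestHandler):
    def get(self):
        # respond to HTTP GET requests on the mapped route
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.write('Hello, App Engine!')

app = webapp2.WSGIApplication([('/', MainPage)], debug=True)
# app.yaml maps incoming URLs to this WSGI application before deployment with the SDK tooling.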

Generally Usable Features

These are protected by the service-level agreement and deprecation policy of the app engine. The
implementation of such a feature is often stable, and any changes made to it are backward-
compatible. These include communications, process management, computing, data storage,
retrieval, and search, as well as app configuration and management. Features like the HRD
migration tool, Google Cloud SQL, logs, datastore, dedicated Memcached, blob store,
Memcached, and search are included in the categories of data storage, retrieval, and search.

Features in Preview

In a later iteration of the app engine, these functions will undoubtedly be made broadly accessible.
However, because they are in the preview, their implementation may change in ways that are
backward-incompatible. Sockets, MapReduce, and the Google Cloud Storage Client Library are
a few of them.

Experimental Features

These might or might not be made broadly accessible in later app engine updates, and they might
be changed in ways that are incompatible with the past. The “trusted tester” features, however,
are only accessible to a limited user base and require registration in order to use them. The
experimental features include Prospective Search, Page Speed, OpenID,
Restore/Backup/Datastore Admin, Task Queue Tagging, MapReduce, Task Queue REST
API, OAuth, and app metrics analytics.

Third-Party Services

As Google provides documentation and helper libraries to expand the capabilities of the app
engine platform, your app can perform tasks that are not built into the core App Engine product.
To do this, Google collaborates with other organizations. Along with the
helper libraries, the partners frequently provide exclusive deals to app engine users.
Advantages of Google App Engine
The Google App Engine has a lot of benefits that can help you advance your app ideas. This
comprises:
1. Infrastructure for Security: The Internet infrastructure that Google uses is arguably the
safest in the entire world. Since the application data and code are hosted on extremely secure
servers, there has rarely been any kind of illegal access to date.
2. Faster Time to Market: For every organization, getting a product or service to market quickly
is crucial. When it comes to quickly releasing the product, encouraging the development and
maintenance of an app is essential. A firm can grow swiftly with Google Cloud App Engine’s
assistance.
3. Quick to Start: You don’t need to spend a lot of time prototyping or deploying the app to
users because there is no hardware or product to buy and maintain.
4. Easy to Use: The tools that you need to create, test, launch, and update the applications are
included in Google App Engine (GAE).
5. Rich set of APIs & Services: A number of built-in APIs and services in Google App Engine
enable developers to create strong, feature-rich apps.
6. Scalability: This is one of the deciding variables for the success of any software. When using
the Google app engine to construct apps, you may access technologies like GFS, Big Table,
and others that Google uses to build its own apps.
7. Performance and Reliability: Among international brands, Google ranks among the top
ones. Therefore, you must bear that in mind while talking about performance and reliability.
8. Cost Savings: To administer your servers, you don’t need to employ engineers or even do it
yourself. The money you save might be put toward developing other areas of your company.
9. Platform Independence: Since the app engine platform only has a few dependencies, you
can easily relocate all of your data to another environment.

10. Ease of setup and use. GAE is fully managed, so users can write code without
considering IT operations and back-end infrastructure. The built-in APIs enable users to
build different types of applications. Access to application logs also
facilitates debugging and monitoring in production.

11. Pay-per-use pricing. GAE's billing scheme only charges users daily for the resources they
use. Users can monitor their resource usage and bills on a dashboard.

12. Scalability. Google App Engine automatically scales as workloads fluctuate, adding and
removing application instances or application resources as needed.

13. Security. GAE supports the ability to specify a range of acceptable Internet Protocol (IP)
addresses. Users can allowlist specific networks and services and blocklist specific IP
addresses.
GAE challenges
 Lack of control. Although a managed infrastructure has advantages, if a problem occurs
in the back-end infrastructure, the user is dependent on Google to fix it.

 Performance limits. CPU-intensive operations are slow and expensive to perform using
GAE. This is because one physical server may be serving several separate, unrelated app
engine users at once who need to share the CPU.

 Limited access. Developers have limited, read-only access to the GAE filesystem.

 Java limits. Java apps cannot create new threads and can only use a subset of the Java
runtime environment standard edition classes.
Examples of Google App Engine
One example of an application created in GAE is an Android messaging app that stores user
log data. The app can store user messages and write event logs to the Firebase Realtime
Database and use it to automatically synchronize data across devices.

Java servers in the GAE flexible environment connect to Firebase and receive notifications
from it. Together, these components create a back-end streaming service to collect messaging
log data.

GAE can be used in many different application contexts. Additional sample application code
in GitHub includes the following:

 a Python application that uses Blobstore;

 a program that uses MySQL connections from GAE to Google Cloud Platform SQL; and

 code that shows how to set up unit tests in GAE.


Programming Support of Google App Engine
GAE provides a programming model for two supported languages: Java and Python. A client
environment that includes an Eclipse plug-in for Java allows you to debug your GAE application on your local
machine. Google Web Toolkit is available for Java web application developers. Python is used
with frameworks such as Django and CherryPy, but Google also provides its own webapp Python
environment.

There are several powerful constructs for storing and accessing data. The data store is a
NOSQL data management system for entities. Java offers Java Data Object (JDO) and Java
Persistence API (JPA) interfaces implemented by the Data Nucleus Access platform, while
Python has a SQL-like query language called GQL. The performance of the data store can be
enhanced by in-memory caching using the memcache, which can also be used independently of
the data store.
Recently, Google added the blobstore which is suitable for large files as its size limit is 2
GB. There are several mechanisms for incorporating external resources. The Google SDC Secure
Data Connection can tunnel through the Internet and link your intranet to an external GAE
application. The URL Fetch operation provides the ability for applications to fetch resources and
communicate with other hosts over the Internet using HTTP and HTTPS requests.
An application can use Google Accounts for user authentication. Google Accounts
handles user account creation and sign-in, and a user that already has a Google account (such as
a Gmail account) can use that account with your app. GAE provides the ability to manipulate
image data using a dedicated Images service which can resize, rotate, flip, crop, and enhance
images. A GAE application is configured to consume resources up to certain limits or quotas.
With quotas, GAE ensures that your application won’t exceed your budget, and that other
applications running on GAE won’t impact the performance of your app. In particular, GAE use
is free up to certain quotas.
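
A minimal sketch (Python runtime) of the data store and memcache services described above, using the ndb client library that ships with the App Engine SDK rather than raw GQL. The Greeting model, cache key, and 60-second expiry are illustrative, not part of GAE itself.

from google.appengine.api import memcache
from google.appengine.ext import ndb

class Greeting(ndb.Model):
    content = ndb.StringProperty()
    created = ndb.DateTimeProperty(auto_now_add=True)

def get_latest_greeting():
    # try the in-memory cache first, then fall back to a datastore query
    cached = memcache.get('latest_greeting')
    if cached is not None:
        return cached
    latest = Greeting.query().order(-Greeting.created).get()
    if latest is not None:
        memcache.set('latest_greeting', latest.content, time=60)
        return latest.content
    return None

# storing an entity in the datastore
Greeting(content='Hello from the datastore').put()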
Google File System (GFS)
GFS is a fundamental storage service for Google’s search engine. GFS was designed for
Google applications, and Google applications were built for GFS. There are several concerns in
GFS. One concerns the component failure rate: as servers are composed of inexpensive commodity
components, it is the norm rather than the exception that concurrent failures will occur all the time.
Another concerns the file size in GFS. GFS typically will hold a large number of huge files, each 100 MB or larger,
with files that are multiple GB in size quite common. Thus, Google has chosen its file data block
size to be 64 MB instead of the 4 KB used in typical traditional file systems. The I/O pattern in the
Google application is also special. Files are typically written once, and the write operations are
often appends of data blocks to the end of files. Multiple appending operations might be
concurrent. The customized API can simplify the problem and focus on Google applications.
Figure shows the GFS architecture. It is quite obvious that there is a single master in the
whole cluster. Other nodes act as the chunk servers for storing data, while the single master
stores the metadata. The file system namespace and locking facilities are managed by the master.
The master periodically communicates with the chunk servers to collect management
information as well as give instructions to the chunk servers to do work such as load balancing or
fail recovery.
The master has enough information to keep the whole cluster in a healthy state. Google
uses a shadow master to replicate all the data on the master, and the design guarantees that all the
data operations are performed directly between the client and the chunk server. The control
messages are transferred between the master and the clients and they can be cached for future
use. With the current quality of commodity servers, the single master can handle a cluster of more
than 1,000 nodes.
Figure shows the data mutation (write, append operations) in GFS. Data blocks must be
created for all replicas.

The goal is to minimize involvement of the master. The mutation takes the following
steps:
1. The client asks the master which chunk server holds the current lease for the chunk and the
locations of the other replicas. If no one has a lease, the master grants one to a replica it chooses
(not shown).
2. The master replies with the identity of the primary and the locations of the other (secondary)
replicas. The client caches this data for future mutations. It needs to contact the master again
only when the primary becomes unreachable or replies that it no longer holds a lease.
3. The client pushes the data to all the replicas. Each chunk server will store the data in an
internal LRU buffer cache until the data is used or aged out. By decoupling the data flow from
the control flow, we can improve performance by scheduling the expensive data flow based on
the network topology regardless of which chunk server is the primary.
4. Once all the replicas have acknowledged receiving the data, the client sends a write request to
the primary. The request identifies the data pushed earlier to all the replicas. The primary assigns
consecutive serial numbers to all the mutations it receives, possibly from multiple clients, which
provides the necessary serialization. It applies the mutation to its own local state in serial order.
5. The primary forwards the write request to all secondary replicas. Each secondary replica
applies mutations in the same serial number order assigned by the primary.
6. The secondaries all reply to the primary indicating that they have completed the operation.
7. The primary replies to the client. Any errors encountered at any replicas are reported to the
client. In case of errors, the write may have succeeded at the primary and an arbitrary subset of the secondary
replicas. The client request is considered to have failed, and the modified region is left in an
inconsistent state. The client code handles such errors by retrying the failed mutation. It will
make a few attempts at steps 3 through 7 before falling back to a retry from the beginning of the
write.
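
The following in-memory toy model (not real GFS code) mirrors the write path in steps 1 through 7 above: the client pushes data to every replica, the primary orders the mutation, and the mutation is forwarded to the secondaries in the same order. All class and method names are illustrative.

class ChunkServer:
    def __init__(self, name):
        self.name = name
        self.buffer = {}        # data pushed by clients (step 3), keyed by data id
        self.chunk = []         # applied mutations as (serial, data)
        self.next_serial = 0    # used only when this server is the primary

    def push_data(self, data_id, data):           # step 3: buffer the data
        self.buffer[data_id] = data

    def apply(self, serial, data_id):             # steps 4-5: apply in serial order
        self.chunk.append((serial, self.buffer.pop(data_id)))

class Master:
    def __init__(self, replicas):
        self.replicas = replicas                  # chunk id -> [primary, secondaries...]

    def lookup(self, chunk_id):                   # steps 1-2: who holds the lease
        primary, *secondaries = self.replicas[chunk_id]
        return primary, secondaries

def client_write(master, chunk_id, data_id, data):
    primary, secondaries = master.lookup(chunk_id)
    for server in [primary] + secondaries:        # step 3: push data to all replicas
        server.push_data(data_id, data)
    primary.next_serial += 1                      # step 4: primary assigns a serial number
    serial = primary.next_serial
    primary.apply(serial, data_id)
    for server in secondaries:                    # step 5: forward in the same serial order
        server.apply(serial, data_id)
    return "ok"                                   # steps 6-7: acks flow back to the client

servers = [ChunkServer("cs%d" % i) for i in range(3)]
master = Master({"chunk-0": servers})
print(client_write(master, "chunk-0", "d1", b"appended record"))
print(servers[2].chunk)      # every replica applied the mutation in the same order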
GFS was designed for high fault tolerance and adopted some methods to achieve this
goal. Master and chunk servers can be restarted in a few seconds, and with such a fast recovery
capability, the window of time in which the data is unavailable can be greatly reduced. As we
mentioned before, each chunk is replicated in at least three places and can tolerate at least two
data crashes for a single chunk of data. The shadow master handles the failure of the GFS master.
Big Table
BigTable was designed to provide a service for storing and retrieving structured and
semistructured data. BigTable applications include storage of web pages, per-user data, and
geographic locations. The database needs to support very high read/write rates and the scale
might be millions of operations per second. Also, the database needs to support efficient scans
over all or interesting subsets of data, as well as efficient joins of large one-to-one and one-to-
many data sets. The application may need to examine data changes over time. The BigTable
system is scalable, which means the system has thousands of servers, terabytes of in-memory
data, petabytes of disk-based data, millions of reads/writes per second, and efficient scans.
BigTable is used in many projects, including Google Search, Orkut, and Google Maps/Google
Earth, among others.
The BigTable system is built on top of an existing Google cloud infrastructure. BigTable
uses the following building blocks:
1. GFS: stores persistent state
2. Scheduler: schedules jobs involved in BigTable serving
3. Lock service: master election, location bootstrapping
4. MapReduce: often used to read/write BigTable data
BigTable provides a simplified data model called Web Table, compared to traditional
database systems. Figure (a) shows the data model of a sample table. Web Table stores the data
about a web page. Each web page can be accessed by the URL. The URL is considered the row
index. The columns provide different data related to the corresponding URL.

The map is indexed by row key, column key, and timestamp—that is, (row:string, column:
string, time:int64) maps to string (cell contents). Rows are ordered in lexicographic order by row
key. The row range for a table is dynamically partitioned and each row range is called “Tablet”.
Syntax for columns is shown as a (family:qualifier) pair. Cells can store multiple versions of data
with timestamps.
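
A tiny dictionary-based sketch of this data model, (row key, column family:qualifier, timestamp) → cell contents, may help; the sample rows and values are illustrative only, not the BigTable API.

webtable = {}

def put(row, column, timestamp, value):
    webtable[(row, column, timestamp)] = value

def get_versions(row, column):
    # return all stored versions of a cell, newest first
    cells = [(ts, v) for (r, c, ts), v in webtable.items() if r == row and c == column]
    return sorted(cells, reverse=True)

put("com.cnn.www", "contents:", 1, "<html>v1</html>")
put("com.cnn.www", "contents:", 2, "<html>v2</html>")
put("com.cnn.www", "anchor:cnnsi.com", 1, "CNN")

print(get_versions("com.cnn.www", "contents:"))
# rows are kept in lexicographic order by row key and split into tablets
print(sorted({r for (r, _, _) in webtable}))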
Figure (b) shows the BigTable system structure. A BigTable master manages and stores
the metadata of the BigTable system. BigTable clients use the BigTable client programming
library to communicate with the BigTable master and tablet servers. BigTable relies on a highly
available and persistent distributed lock service called Chubby.

Programming on Amazon AWS

The AWS platform has many features and offers many services.


Features:
Relational Database Service (RDS) with a messaging interface
Elastic MapReduce capability
NOSQL support in SimpleDB
Capabilities
Auto-scaling enables you to automatically scale your Amazon EC2 capacity up or down
according to conditions
Elastic load balancing automatically distributes incoming application traffic across multiple
Amazon EC2 instances
CloudWatch is a web service that provides monitoring for AWS cloud resources,
operational performance, and overall demand patterns—including metrics such as CPU
utilization, disk reads and writes, and network traffic.
Amazon provides several types of preinstalled VMs. Instances are often called Amazon
Machine Images (AMIs), which are preconfigured with operating systems based on Linux or
Windows, and additional software. Figure 6.24 shows an execution environment. AMIs are the
templates for instances, which are running VMs. The AMIs are formed from the virtualized
compute, storage, and server resources.
Private AMI: Images created by you, which are private by default. You can grant access to other
users to launch your private images.
Public AMI: Images created by users and released to the AWS community, so anyone can
launch instances based on them
Paid AMI: You can create images providing specific functions that can be launched by anyone
willing to pay you for each hour of usage.
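
As a minimal sketch of launching an instance from an AMI, the following uses the boto3 SDK. The AMI ID, instance type, and region are placeholders; real values depend on your account and region, and AWS credentials must already be configured.

import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",   # a private, public, or paid AMI
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1,
)
print(instances[0].id)                 # the running VM created from the AMI template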

Amazon Simple Storage Service (S3)


Amazon S3 provides a simple web services interface that can be used to store and
retrieve any amount of data, at any time, from anywhere on the web. S3 provides the object-
oriented storage service for users. Users can access their objects through Simple Object Access
Protocol (SOAP) with either browsers or other client programs which support SOAP. SQS is
responsible for ensuring a reliable message service between two processes.
Figure shows the S3 execution environment.

The fundamental operation unit of S3 is called an object. Each object is stored in a
bucket and retrieved via a unique, developer-assigned key. In other words, the bucket is the
container of the object. Besides unique key attributes, the object has other attributes such as
values, metadata, and access control information. Through the key-value programming interface,
users can write, read, and delete objects containing from 1 byte to 5 gigabytes of data each.
There are two types of web service interface for the user to access the data stored in Amazon
clouds. One is a REST (web 2.0) interface, and the other is a SOAP interface. Here are some key
features of S3:
Redundant through geographic dispersion.
Designed to provide 99.99% durability and 99.99% availability of objects over a given
year, with cheaper reduced redundancy storage (RRS).
Authentication mechanisms to ensure that data is kept secure from unauthorized access.
Objects can be made private or public, and rights can be granted to specific users.
Per-object URLs and ACLs (access control lists).
Default download protocol of HTTP
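
A minimal sketch of the bucket/object (key-value) interface described above, using the boto3 SDK over the REST API. The bucket and key names are placeholders, and AWS credentials must already be configured.

import boto3

s3 = boto3.client("s3")

s3.put_object(Bucket="my-example-bucket",          # the bucket is the container of the object
              Key="reports/2024/summary.txt",      # unique, developer-assigned key
              Body=b"object contents up to 5 GB")

obj = s3.get_object(Bucket="my-example-bucket", Key="reports/2024/summary.txt")
print(obj["Body"].read())

s3.delete_object(Bucket="my-example-bucket", Key="reports/2024/summary.txt")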
Amazon Elastic Block Store (EBS) and SimpleDB
The Elastic Block Store (EBS) provides the volume block interface for saving and
restoring the virtual images of EC2 instances. The status of EC2 can now be saved in the EBS
system after the machine is shut down. Users can use EBS to save persistent data and mount it to
the running instances of EC2. S3 is “Storage as a Service” with a messaging interface.
Multiple volumes can be mounted to the same instance. These storage volumes behave like raw,
unformatted block devices, with user-supplied device names and a block device interface.
Amazon SimpleDB Service
SimpleDB provides a simplified data model based on the relational database data model.
Structured data from users must be organized into domains. Each domain can be considered a
table. The items are the rows in the table. A cell in the table is recognized as the value for a
specific attribute (column name) of the corresponding row. It is possible to assign multiple
values to a single cell in the table. This is not permitted in a traditional relational database.
SimpleDB, like Azure Table, could be called “LittleTable” as they are aimed at managing
small amounts of information stored in a distributed table.
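
The following plain-Python sketch mirrors the SimpleDB data model described above: a domain is a table, items are rows, and a single attribute (cell) may hold multiple values. The domain, item, and attribute names are illustrative, not the SimpleDB API.

domain = {}   # item name -> {attribute name -> set of values}

def put_attributes(item, attributes):
    row = domain.setdefault(item, {})
    for name, value in attributes:
        row.setdefault(name, set()).add(value)

put_attributes("book-001", [("title", "Cloud Computing"),
                            ("tag", "cloud"), ("tag", "distributed")])  # multi-valued cell

print(domain["book-001"]["tag"])   # {'cloud', 'distributed'} -- not permitted in a relational table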
