MANNAR THIRUMALAI NAICKER COLLEGE
(AUTONOMOUS)
PASUMALAI, MADURAI-625 004.
DEPARTMENT OF COMPUTER SCIENCE
BIG DATA ANALYTICS
21UCSE64
STUDY MATERIAL
CLASS III B.Sc (CS) A & B
YEAR 2023-2024
SEMESTER VI
SYLLABUS:
UNIT I
INTRODUCTION TO BIG DATA
INTRODUCTION TO BIG DATA
The "Internet of Things" and its widely ultra-connected nature are leading to a burgeoning rise in
big data. There is no dearth of data for today's enterprise. On the contrary, they are mired in data
and quite deep at that. That brings us to the following questions:
• Why is it that we cannot forego big data?
• How has it come to assume such enormous importance in running business?
• How does it compare with the traditional Business Intelligence (BI) environment?
• Is it here to replace the traditional, relational database management system and data
warehouse environment, or is it likely to complement their existence?
Data is widely available. What is scarce is the ability to draw valuable insight.
Some examples of Big Data:
• There are some examples of Big Data Analytics in different areas such as retail,
IT infrastructure, and social media.
• Retail: As mentioned earlier, Big Data presents many opportunities to improve sales
and marketing analytics.
• An example of this is the U.S. retailer Target. After analyzing consumer purchasing
behavior, Target's statisticians determined that the retailer made a great deal of money from three
main life-event situations:
• Marriage, when people tend to buy many new products
• Divorce, when people buy new products and change their spending habits
• Pregnancy, when people have many new things to buy and have an urgency to buy them.
This analysis helped Target manage its inventory, knowing that there would be demand for specific
products and that the demand would likely vary over the coming nine- to ten-month cycles.
• IT infrastructure: The MapReduce paradigm is an ideal technical framework for many Big Data
projects, which rely on large data sets with unconventional data structures.
• One of the main benefits of Hadoop is that it employs a distributed file system, meaning it
can use a distributed cluster of servers and commodity hardware to process large amounts of data.
Some of the most common examples of Hadoop implementations are in the social media space,
where Hadoop can manage transactions, give textual updates, and develop social graphs among
millions of users.
Twitter and Facebook generate massive amounts of unstructured data and use Hadoop and its
ecosystem of tools to manage this high volume.
Social media: It represents a tremendous opportunity to leverage social and
professional interactions to derive new insights.
LinkedIn represents a company in which data itself is the product. Early on, LinkedIn founder Reid
Hoffman saw the opportunity to create a social network for working professionals.
As of 2014, LinkedIn has more than 250 million user accounts and has added many
additional features and data-related products, such as recruiting, job seeker tools, advertising, and
InMaps, which show a social graph of a user's professional network.
CHARACTERISTICS OF DATA
• Composition: The composition of data deals with the structure of data, that is, the sources
of data, the granularity, the types, and the nature of data as to whether it is static or real-time
streaming.
• Condition: The condition of data deals with the state of data, that is, "Can one use this data
as is for analysis?" or "Does it require cleansing for further enhancement and enrichment?"
• Context: The context of data deals with "Where has this data been generated?" "Why was
this data generated?" "How sensitive is this data?"
"What are the events associated with this data?" and so on.
Small data (data as it existed prior to the big data revolution) is about certainty. It is about known
data sources; it is about no major changes to the composition or context of data.
Most often we have answers to queries like why this data was generated, where and when it was
generated, exactly how we would like to use it, what questions will this data be able to answer, and
so on. Big data is about complexity. Complexity in terms of multiple and unknown datasets, in
terms of exploding volume, in terms of speed at which the data is being generated and the speed
at which it needs to be processed and in terms of the variety of data (internal or external,
behavioural or social) that is being generated.
EVOLUTION OF BIG DATA
1970s and before was the era of mainframes. The data was essentially primitive and structured.
Relational databases evolved in the 1980s and 1990s. That era was one of data-intensive applications. The
World Wide Web (WWW) and the Internet of Things (IoT) have led to an onslaught of structured,
unstructured, and multimedia data. Refer Table 1.1.
Table 1.1 The evolution of big data (Big Data and Analytics)
DEFINITION OF BIG DATA
• Big data is high-velocity and high-variety information assets that demand cost effective,
innovative forms of information processing for enhanced insight and decision making.
• Big data refers to datasets whose size is typically beyond the storage capacity of, and too
complex for, traditional database software tools.
• Big data is anything beyond the human & technical infrastructure needed to support storage,
processing and analysis.
• It is data that is big in volume, velocity and variety. Refer to figure 1.3
Variety: Data can be structured, semi-structured or unstructured. Data stored in a database is an
example of structured data. HTML data, XML data, email data and CSV files are examples of
semi-structured data. PowerPoint presentations, images, videos, research papers, white papers, the
body of an email, etc. are examples of unstructured data.
Velocity: Velocity essentially refers to the speed at which data is being created in real time. We
have moved from simple desktop applications like payroll applications to real-time processing
applications.
Volume: Volume can be in terabytes, petabytes or zettabytes.
Gartner Glossary Big data is high-volume, high-velocity and/or high- variety information assets
that demand cost-effective, innovative forms of information processing that enable enhanced insight
and decision making.
Big in volume, variety, and Velocity (Big Data and Analytics)
CHALLENGES WITH BIG DATA
Refer to the figure. Following are a few challenges with big data:
Challenges with big data (Big Data and Analytics)
Data volume: Data today is growing at an exponential rate. This high tide of data will continue
to rise. The key questions are:
"Will all this data be useful for analysis?"
"Do we work with all of this data or a subset of it?"
"How will we separate the knowledge from the noise?" etc.
Storage: Cloud computing is the answer to managing infrastructure for big data as far as cost-
efficiency, elasticity and easy upgrading/downgrading are concerned. However, it also complicates
the decision of whether to host big data solutions outside the enterprise.
Data retention: How long should one retain this data? Some data may be required for long-term
decisions, while some data may quickly become irrelevant and obsolete.
Skilled professionals: In order to develop, manage and run those applications that generate insights,
organizations need professionals who possess a high level of proficiency in data sciences.
Other challenges: Other challenges of big data are with respect to the capture, storage, search, analysis,
transfer and security of big data.
Visualization: Big data refers to datasets whose size is typically beyond the storage capacity of
traditional database software tools. There is no explicit definition of how big the data set should be
for it to be considered big data. Data visualization (computer graphics) is becoming popular as a
separate discipline. There are very few data visualization experts.
DATA WAREHOUSE ENVIRONMENT
Operational or transactional or day-to-day business data is gathered from Enterprise Resource
Planning (ERP) systems, Customer Relationship Management (CRM), Legacy systems, and several
third-party applications.
The data from these sources may differ in format.
This data is then integrated, cleaned up, transformed, and standardized through the process of
Extraction, Transformation, and Loading (ETL).
The transformed data is then loaded into the enterprise data warehouse (available at the enterprise
level) or data marts (available at the business unit/ functional unit or business process level).
Business intelligence and analytics tools are then used to enable decision making from the use of
ad-hoc queries, SQL, enterprise dashboards, data mining, Online Analytical Processing etc. Refer
Figure
Data Warehouse Environment (Big Data and Analytics)
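To make the ETL flow above concrete, here is a minimal, illustrative Python sketch (not from the original text): records are extracted from two hypothetical sources in different formats, transformed into one standard format, and loaded into an in-memory "warehouse" list. The source names and fields are assumptions for illustration only.
Python
# Illustrative ETL sketch: hypothetical CRM and ERP records in different
# formats are extracted, standardized (transformed) and loaded into a list
# that stands in for a data warehouse table.
crm_rows = [{"cust": "C001", "name": "asha", "city": "MADURAI"}]
erp_rows = [("C002", "Ravi", "Chennai")]

def transform(record):
    """Standardize field names and casing."""
    return {
        "customer_id": record["cust"],
        "name": record["name"].title(),
        "city": record["city"].title(),
    }

warehouse = []

# Extract + transform the CRM rows (already dictionaries).
for row in crm_rows:
    warehouse.append(transform(row))

# Extract + transform the ERP rows (tuples, so map positions to field names first).
for cust, name, city in erp_rows:
    warehouse.append(transform({"cust": cust, "name": name, "city": city}))

print(warehouse)   # the cleaned, standardized records ready for BI queries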
TRADITIONAL BUSINESS INTELLIGENCE (BI) VERSUS BIG DATA
Following are the differences that one encounters when dealing with traditional BI and big data.
In a traditional BI environment, all the enterprise's data is housed in a central server, whereas in a big
data environment data resides in a distributed file system. The distributed file system scales by
scaling in (decrease) or out (increase) horizontally, as compared to a typical database server that scales
vertically.
In traditional BI, data is generally analysed in an offline mode, whereas big data is analysed in
both real-time streaming and offline modes.
Traditional BI is about structured data, and it is here that data is taken to the processing functions (move
data to code), whereas big data is about variety: structured, semi-structured and unstructured data,
and here the processing functions are taken to the data (move code to data).
NEW THINGS
The big data trends in 2024 are:
Trend #1- The rise of machine learning
Machine learning has been around for a while, but it is only now that we are starting to see its true
potential. It’s not just about artificial intelligence anymore, but rather about how computers can
learn from their own experience and make predictions on their own. It is the most important part of
big data analytics because it can process and analyze huge amounts of data in a short amount of time.
Trend #2- The need for better security
Data breaches have become more common than ever before and there’s no sign that they’ll stop
happening anytime soon. Organizations need to invest heavily in security if they want to stay ahead
of the curve. During the third quarter of 2022, internet users worldwide saw approximately 15
million data breaches, up by 167 percent compared to the previous quarter, according to Statista.
Trend #3- Extended adoption of predictive analytics
Predictive analytics is on the rise and is considered among the top benefits of big data, even though
this topic isn't new. Viewing data as their most valuable asset, organizations will widely use
predictive analytics in order to understand how customers reacted and will react to a specific event,
product or service, and to predict future trends.
Trend #4- More cloud adoption
Organizations can greatly benefit from moving to the cloud since it enables them to cut costs,
increase efficiency, and rely on outside services to address security concerns. One of the most
important big data trends is to keep pushing for further cloud migration and decreased reliance on
on-premises data centers. The only thing we need to keep an eye on is if businesses that handle
extremely sensitive data will put greater faith in the cloud. This question may cause a significant
change in the cloud.
Trend #5- More advanced big data tools
In order to handle big data properly and get the most out of it, organizations need to adopt advanced
tools that build on cognitive technologies such as Artificial Intelligence and Machine Learning, in
order to facilitate data management and help them get more insights.
Business intelligence software companies are making significant investments in their technology in
order to offer more potent tools that will fundamentally alter the way that big data is handled. As a
result, the global market will be better able to adopt and use big data projects.
Trend #6- Data lakes
Data lakes are a new type of architecture that is changing the way companies store and analyze data.
Historically, organizations would store their data in a relational database. The problem with this
type of storage is that it is too structured to store all types of data such as images, audio files, video
files and more. Data lakes allow organizations to keep all types of data in one place.
Trend #7- More data sources (IoT, smart devices, generative AI)
There are many different ways that we can now collect data, including sensors, generative AI, social
media platforms, and even smart devices.
This will continue to be one of the hottest big data trends for years to come.
Trend #8- Data Fabric
In hybrid multi-cloud systems, a data fabric is a framework and collection of data services that
standardize big data best practices and offer consistent functionality.
Trend #9- Data Quality
As more businesses rely on data to make intelligent business decisions, they must ensure that the
data they use is of a high quality. Poor data quality will force your company to make poor business
decisions, provide poor insights, and hinder its capacity to comprehend its clients.
Trend #10- Flexible and customizable dashboard
One of the major developments in big data is the end of pre-defined dashboards that meet the needs
of all employees because data might be viewed differently by employees in different departments.
In order to make the most of their data, employees will soon be able to build their own dashboards
and interact with the data any way they see fit.
Trend #11- More restricted data governance
There are many reasons for the future to hold a more restrictive data governance. The first one is the
need for data protection and privacy regulations.
Secondly, there is a growing demand for data-driven decision making which requires more
transparency around data.
TYPES OF DIGITAL DATA
CLASSIFICATION OF DIGITAL DATA
Structured Data
Structured data can be crudely defined as the data that resides in a fixed field within a
record.
It is the type of data most familiar from our everyday lives, for example: a birthday or an address.
A certain schema binds it, so all the data has the same set of properties. Structured data is
also called relational data. It is split into multiple tables to enhance the integrity of the
data by creating a single record to depict an entity. Relationships are enforced by the
application of table constraints.
The business value of structured data lies within how well an organization can utilize its
existing systems and processes for analysis purposes.
Sources of structured data
A Structured Query Language (SQL) is needed to bring the data together. Structured data is easy
to enter, query, and analyze: all of the data follows the same format. However, forcing a
consistent structure also means that any alteration of the data is difficult, as each record has to be
updated to adhere to the new structure. Examples of structured data include numbers, dates,
strings, etc. The business data of an e-commerce website can be considered to be structured data;
a small code sketch after the sample table below illustrates these points.
Name Class Section Roll No Grade
Geek1 11 A 1 A
Geek2 11 A 2 B
Geek3 11 A 3 A
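As a small illustration (not part of the original text), the sketch below stores the sample student records above in a relational table with a fixed schema, using Python's built-in sqlite3 module. The table and column names are made up for illustration.
Python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
cur = conn.cursor()

# A fixed schema: every record has the same set of typed fields.
cur.execute("""
    CREATE TABLE student (
        roll_no INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        class   INTEGER,
        section TEXT,
        grade   TEXT
    )
""")

cur.executemany(
    "INSERT INTO student VALUES (?, ?, ?, ?, ?)",
    [(1, "Geek1", 11, "A", "A"), (2, "Geek2", 11, "A", "B"), (3, "Geek3", 11, "A", "A")],
)

# Because every row follows the schema, querying is straightforward.
cur.execute("SELECT name FROM student WHERE grade = 'A'")
print(cur.fetchall())   # [('Geek1',), ('Geek3',)]
conn.close()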
Cons of Structured Data
1. Structured data can only be leveraged in cases of predefined functionalities. This means
that structured data has limited flexibility and is suitable for certain specific use cases
only.
2. Structured data is stored in a data warehouse with rigid constraints and a definite schema.
Any change in requirements would mean updating all of that structured data to meet the
new needs. This is a massive drawback in terms of resource and time management.
Semi-Structured Data
Semi-structured data is not bound by any rigid schema for data storage and handling. The
data is not in the relational format and is not neatly organized into rows and columns like
that in a spreadsheet. However, there are some features like key-value pairs that help in
discerning the different entities from each other.
Since semi-structured data doesn’t need a structured query language, it is commonly
called NoSQL data.
A data serialization language is used to exchange semi-structured data across systems that
may even have varied underlying infrastructure.
Semi-structured content is often used to store metadata about a business process but it can
also include files containing machine instructions for computer programs.
This type of information typically comes from external sources such as social media
platforms or other web-based data feeds.
Semi-Structured Data
Data is created in plain text so that different text-editing tools can be used to draw valuable
insights. Due to a simple format, data serialization readers can be implemented on hardware with
limited processing resources and bandwidth.
Data Serialization Languages
Software developers use serialization languages to write memory-based data to files, and to transmit,
store, and parse it. The sender and the receiver don't need to know about the other system. As long
as the same serialization language is used, the data can be understood by both systems
comfortably. There are three predominantly used serialization languages.
1. XML– XML stands for eXtensible Markup Language. It is a text-based markup language
designed to store and transport data. XML parsers can be found in almost all popular
development platforms. It is human and machine-readable. XML has definite standards for
schema, transformation, and display. It is self-descriptive. Below is an example of a
programmer’s details in XML.
XML
<ProgrammerDetails>
  <FirstName>Jane</FirstName>
  <LastName>Doe</LastName>
  <CodingPlatforms>
    <CodingPlatform Type="Fav">GeeksforGeeks</CodingPlatform>
    <CodingPlatform Type="2ndFav">Code4Eva!</CodingPlatform>
    <CodingPlatform Type="3rdFav">CodeisLife</CodingPlatform>
  </CodingPlatforms>
</ProgrammerDetails>
<!-- The 2ndFav and 3rdFav Coding Platforms are imaginative because GeeksforGeeks is the best! -->
XML expresses the data using tags (text within angular brackets) to shape the data (for ex:
FirstName) and attributes (For ex: Type) to feature the data. However, being a verbose and
voluminous language, other formats have gained more popularity.
2. JSON– JSON (JavaScript Object Notation) is a lightweight open-standard file format for data
interchange. JSON is easy to use and uses human/machine-readable text to store and transmit
data objects.
Javascript
{
  "firstName": "Jane",
  "lastName": "Doe",
  "codingPlatforms": [
    { "type": "Fav", "value": "Geeksforgeeks" },
    { "type": "2ndFav", "value": "Code4Eva!" },
    { "type": "3rdFav", "value": "CodeisLife" }
  ]
}
This format isn’t as formal as XML. It’s more like a key/value pair model than a formal data
depiction. Javascript has inbuilt support for JSON. Although JSON is very popular amongst web
developers, non-technical personnel find it tedious to work with JSON due to its heavy
dependence on JavaScript and structural characters (braces, commas, etc.)
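As a small illustration of this machine readability (not part of the original text), the snippet below parses a JSON document like the one above using Python's standard json module.
Python
import json

doc = '''{
  "firstName": "Jane",
  "lastName": "Doe",
  "codingPlatforms": [
    { "type": "Fav", "value": "Geeksforgeeks" }
  ]
}'''

record = json.loads(doc)                      # parse the JSON text into Python objects
print(record["firstName"])                    # Jane
print(record["codingPlatforms"][0]["value"])  # Geeksforgeeks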
3. YAML– YAML is a user-friendly data serialization language. It stands for YAML Ain't Markup
Language (a recursive acronym). It is adopted by technical and non-technical handlers all
across the globe owing to its simplicity. The data structure is defined by line separation and
indentation, which reduces the dependency on structural characters. YAML is extremely
comprehensive and its popularity is a result of its human and machine readability.
YAML example
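The YAML example figure is not reproduced here; as an illustrative stand-in, the programmer details from the XML and JSON examples above can be written in YAML as follows (indentation, not braces or tags, defines the structure).
YAML
firstName: Jane
lastName: Doe
codingPlatforms:
  - type: Fav
    value: GeeksforGeeks
  - type: 2ndFav
    value: Code4Eva!
  - type: 3rdFav
    value: CodeisLife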
A product catalog organized by tags is an example of semi-structured data.
Unstructured Data
Unstructured data is the kind of data that doesn’t adhere to any definite schema or set of
rules. Its arrangement is unplanned and haphazard.
Photos, videos, text documents, and log files can be generally considered unstructured
data. Even though the metadata accompanying an image or a video may be semi-
structured, the actual data being dealt with is unstructured.
Additionally, Unstructured data is also known as “dark data” because it cannot be
analyzed without the proper software tools.
Un-structured Data
REVIEW QUESTIONS
1. Define big data.
2. Explain the characteristics of big data.
3. What is big data? Why is big data required? How does the traditional BI environment differ from
the big data environment?
4. What are the challenges in big data?
5. Define big data. Why is big data required? Write a note on the data warehouse environment.
6. What are the three characteristics of big data? Explain the differences between BI and Data
Science.
Science.
7. Explain the challenges in big data.
8. Describe the classification of digital data.
9. Differentiate between structured data and unstructured data?
10. What are new things in big data?
*****
Unit 2
BIG DATA ANALYTICS
Big Data analytics - Classification of Analytics – Greatest challenges that
prevent business from capitalizing on Big Data – Top Challenges facing Big Data
– Importance of Big Data Environment – Base – Analytics tool.
Big Data Analytics:
CLASSIFICATION OF ANALYTICS
There are basically two classifications:
• Basic, operationalized, advanced and Monetized.
• Analytics 1.0, analytics 2.0, and analytics 3.0.
I. First School of Thought
Basic analytics: Reporting on historical data, basic visualization, etc.
Operationalized analytics: It is operationalized analytics if it gets woven into the
enterprise's business processes.
Advanced analytics: This is largely about forecasting for the future by way of
predictive and prescriptive modelling.
Monetized analytics: This is analytics in use to derive direct business revenue.
II. Second School of Thought:
Era:
Analytics 1.0 – mid 1990s to 2009. Analytics 2.0 – 2005 to 2012. Analytics 3.0 – 2012 to present.

Type of statistics:
Analytics 1.0 – Descriptive statistics (report on events, occurrences, etc. of the past).
Analytics 2.0 – Descriptive statistics + predictive statistics (use data from the past to make predictions for the future).
Analytics 3.0 – Descriptive + predictive + prescriptive statistics (use data from the past to make prophecies for the future and at the same time make recommendations to leverage the situation to one's advantage).

Key questions asked:
Analytics 1.0 – What happened? Why did it happen?
Analytics 2.0 – What will happen? Why will it happen?
Analytics 3.0 – What will happen? When will it happen? Why will it happen? What should be the action taken to take advantage of what will happen?

Data sources:
Analytics 1.0 – Data from legacy systems, ERP, CRM, and 3rd party applications.
Analytics 2.0 – Big data.
Analytics 3.0 – A blend of big data and data from legacy systems, ERP, CRM, and 3rd party applications.

Nature of data and its handling:
Analytics 1.0 – Small and structured data sources. Data stored in enterprise data warehouses or data marts.
Analytics 2.0 – Big data is being taken up seriously. Data is mainly unstructured, arriving at a much higher pace. This fast flow of data entailed that the influx of big volume data had to be stored and processed rapidly, often on massively parallel servers running Hadoop.
Analytics 3.0 – A blend of big data and traditional analytics to yield insights and offerings with speed and impact.

Data sourcing:
Analytics 1.0 – Data was internally sourced.
Analytics 2.0 – Data was often externally sourced.
Analytics 3.0 – Data is both internally and externally sourced.

Technology:
Analytics 1.0 – Relational databases.
Analytics 2.0 – Database appliances, Hadoop clusters, SQL to Hadoop environments, etc.
Analytics 3.0 – In-memory analytics, in-database processing, agile analytical methods, machine learning techniques, etc.

Table: Analytics 1.0, 2.0 and 3.0
Greatest challenges that prevent businesses from capitalizing on big data:
1. Obtaining executive sponsorship for investments in big data and its related activities (such as
training, etc.).
2. Getting the business units to share information across organizational silos.
3. Finding the right skills (business analysts and data scientists) that can manage large amounts of
structured and unstructured data and create insights from it.
4. Determining the approach to scale rapidly and elastically; in other words, the need to address the
storage and processing of the large volume, velocity and variety of big data.
5. Deciding whether to use structured or unstructured, internal or external data to make business
decisions.
6. Choosing the optimal way to report the findings and analysis of big data (visual presentation and
analytics) so that the presentation makes the most sense.
7. Determining what to do with the insights created from big data.
7. Determining what to do with the insights created from big data.
TOP CHALLENGES OF BIG DATA
There are mainly seven challenges of big data: scale, security, schema, continuous
availability, consistency, partition tolerance and data quality.
Scale: Storage (RDBMS (Relational Database Management System) or NoSQL (Not
only SQL)) is one major concern that needs to be addressed to handle the need for
scaling rapidly and elastically.
Security: Most of the NoSQL big data platforms have poor security mechanisms
(lack of proper authentication and authorization mechanisms) when it comes to
safeguarding big data.
Schema: Rigid schemas have no place. We want the technology to be able to fit
our big data and not the other way around.
Continuous availability: The big question here is how to provide 24/7 support
because almost all RDBMS and NoSQL big data platforms have a certain amount of
downtime built in.
Consistency: Should one opt for consistency or eventual consistency?
Partition tolerance: How does one build partition-tolerant systems that can take care of both
hardware and software failures?
Data quality: How does one maintain data quality, that is, data accuracy, completeness, timeliness,
etc.? Do we have appropriate metadata in place?
IMPORTANCE OF BIG DATA
Reactive-Business Intelligence: What does Business Intelligence (BI)
help us with? It allows the businesses to make faster and better decisions by
providing the right information to the right person at the right time in the right
format. It is about analysis of the past or historical data and then displaying the
findings of the analysis or reports in the form of enterprise dashboards, alerts,
notifications, etc.
Reactive - Big Data Analytics: Here the analysis is done on huge datasets, but the
approach is still reactive as it is still based on static data.
Proactive - Analytics: This is to support futuristic decision making by the use of data
mining, predictive modelling, text mining, and statistical analysis. This analysis is
not on big data, as it still uses the traditional database management practices
and therefore has severe limitations on the storage capacity and the processing
capability.
Proactive - Big Data Analytics: This is filtering through terabytes, petabytes,
exabytes of information to filter out the relevant data to analyze. This also includes
high performance analytics to gain rapid insights from big data and the ability to
solve complex problems using more data.
DATA SCIENCE
Data science is the science of extracting knowledge from data. In other words, it is a
science of drawing out hidden patterns amongst data using statistical and
mathematical techniques.
It employs techniques and theories drawn from many fields from the broad areas of
mathematics, statistics, information technology including machine learning, data
engineering, probability models, statistical learning, pattern recognition and learning,
etc.
Data Scientist
Business Acumen (Expertise) Skills:
A data scientist should have the following abilities to play the role of a data
scientist:
• Understanding of domain
• Business strategy
• Problem solving
• Communication
• Presentation
• Keenness
Technology Expertise:
• Good database knowledge such as RDBMS.
• Good NoSQL database knowledge such as MongoDB, Cassandra,HBase,
etc.
• Programming languages such as Java, Python, C++, etc.
• Open-source tools such as Hadoop.
• Data warehousing.
• Data mining
• Visualization such as Tableau, Flare, Google visualization APIs, etc.
Mathematics Expertise:
• Mathematics.
• Statistics.
• Artificial Intelligence (AI).
• Algorithms.
• Machine learning.
• Pattern recognition.
• Natural Language Processing.
RESPONSIBILITIES
1. Data Management
2. Analytical Techniques
3. Business Analysis
Data Management: A data scientist employs several approaches to develop the
relevant datasets for analysis. Raw data is just "raw", unsuitable for analysis. The
data scientist works on it to prepare it so that it reflects the relationships and contexts. This data
then becomes useful for processing and further analysis.
Analytical Techniques: Depending on the business questions which we are trying to
find answers to and the type of data available at hand, the data scientist employs a
blend of analytical techniques to develop models and algorithms to understand the
data, interpret relationships, spot trends, and reveal patterns.
Business Analysis: A data scientist is a business analyst who distinguishes cool facts
from insights and is able to apply his business expertise and domain knowledge to
see the results in the business context.
TERMINOLOGIES USED IN BIG DATA ENVIRONMENTS
In order to get a good handle on the big data environment, let us get
familiar with a few key terminologies in this arena.
In-Memory Analytics:
Data access from non-volatile storage such as hard disk is a slow process.
The more the data is required to be fetched from hard disk or secondary storage,
the slower the process gets. One way to combat this challenge is to pre-process and
store data (cubes, aggregate tables, query sets, etc.) so that the CPU has to fetch a
small subset of records. But this requires thinking in advance as to what data will
be required for analysis. If there is a need for different or more data, it is back to
the initial process of pre-computing and storing data or fetching it from secondary
storage.
This problem has been addressed using in-memory analytics. Here all the relevant
data is stored in Random Access Memory (RAM) or primary storage thus
eliminating the need to access the data from hard disk. The advantage is faster
access, rapid deployment, better insights, and minimal IT involvement.
In-Database Processing:
In-database processing is also called in-database analytics. It
works by fusing data warehouses with analytical systems. Typically, the
data from various enterprise On-Line Transaction Processing (OLTP)
systems, after cleaning up (de-duplication, scrubbing, etc.) through the
process of ETL, is stored in the Enterprise Data Warehouse (EDW) or
data marts. The huge datasets are then exported to analytical programs
for complex and extensive computations. With in-database processing,
the database program itself can run the computations, eliminating the
need for export and thereby saving time. Leading database vendors are
offering this feature to large businesses.
Symmetric Multiprocessor System (SMP)
In SMP, there is a single common main memory that is shared by two or
more identical processors. The processors have full access to all I/O devices and are controlled by
a single operating system instance. SMP systems are tightly coupled multiprocessor
systems. Each processor also has its own high-speed cache memory, and the processors
are connected using a system bus.
Massively Parallel Processing
Massively Parallel Processing (MPP) refers to the coordinated processing of
programs by a number of processors working in parallel. The processors each have
their own operating system and dedicated memory. They work on different parts
of the same program. The MPP processors communicate using some sort of
messaging interface. MPP systems are more difficult to program, as the
application must be divided in such a way that all the executing segments can
communicate with each other. MPP differs from Symmetric
Multiprocessing (SMP) in that SMP works with the processors sharing the same
operating system and same memory. SMP is also referred to as tightly-coupled
multiprocessing.
Difference Between Parallel and Distributed Systems
Parallel Systems:
A parallel database system is a tightly coupled system. The processors
cooperate for query processing. The user is unaware of the parallelism since he/she
has no access to a specific processor of the system.
Distributed database systems:
Distributed database systems are known to be loosely coupled and are
composed of individual machines. Each of the machines can run its own individual
application and serve its own respective users. The data is usually distributed
across several machines, thereby necessitating quite a number of machines to be
accessed to answer a user query.
Shared Nothing Architecture
1. Shared Memory (SM).
2. Shared Disk (SD).
3. Shared Nothing (SN).
Shared Memory - In shared memory architecture, a common central
memory is shared by multiple processors.
Shared Disk - In shared disk architecture, multiple processors share a
common collection of disks while having their own private memory.
Shared Nothing - In shared nothing architecture, neither memory nor
disk is shared among multiple processors.
Advantages of a "Shared Nothing Architecture"
1. Fault Isolation: A "Shared Nothing Architecture" provides the benefit of
isolating fault. A fault in a single node is contained and confined to that node
exclusively and exposed only through messages.
2. Scalability: Assume that the disk is a shared resource. It implies that the
controller and the disk bandwidth are also shared. Synchronization will have to be
implemented to maintain a consistent shared state, which means that different
nodes will have to take turns to access the critical data. In a shared nothing
architecture, no such sharing or synchronization is required, so nodes can be added
independently and the system scales out more easily.
CAP Theorem Explained
The CAP theorem is also called the Brewer's Theorem.
1. Consistency
2. Availability
3. Partition tolerance
CAP Theorem
1. Consistency implies that every read fetches the last write.
2. Availability implies that reads and writes always succeed. In other words,
each non-failing node will return a response in a reasonable amount of time.
3. Partition tolerance implies that the system will continue to function when
network partition occurs.
In summary, one can at most decide to go with two of the three.
1. Consistency: The instructors or the training coordinators, once they have
updated information with you, will always get the most updated information when
they call subsequently.
2. Availability: The instructors or the training coordinators will always get the
schedule if any one or both of the office administrators have reported to work.
3. Partition Tolerance: Work will go on as usual even if there is
communication loss between the office administrators owing to a spat or a tiff!
When to choose consistency over availability and vice-versa...
1. Choose availability over consistency when your business requirements allow
some flexibility around when the data in the system synchronizes.
2. Choose consistency over availability when your business requirements
demand atomic reads and writes.
Examples of databases:
1. Availability and Partition Tolerance (AP)
2. Consistency and Partition Tolerance (CP)
3. Consistency and Availability (CA)
BASICALLY AVAILABLE SOFT STATE EVENTUAL CONSISTENCY (BASE):
A few Basic Questions to Start with,
1) Where is it used?
In Distributed Computing.
2) Why is it used?
To achieve high-availability.
3) How is it achieved?
Assume a given data item. If no new updates are made to this data item
for a stipulated period of time, eventually all accesses to this data item
will return the updated value. In other words, if no new updates are made to a
given data item for a stipulated period of time, all updates that were made in the
past and not yet applied to this data item and its several replicas will
percolate to it, so that it stays as current/recent as possible.
4) What is Replica Convergence?
A System that has achieved eventual consistency is said to have
converged or achieved replica convergence.
5) Conflict Resolution: How is the conflict resolved?
a) Read Repair: If the read leads to discrepancy or inconsistency, a
correction is initiated. It slows down the read operation.
b) Write Repair: If the write leads to discrepancy or inconsistency, a
correction is initiated. This will cause the write operation to slow
down.
c) Asynchronous Repair: Here, the correction is not part of a read or
write operation.
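A toy Python sketch (an illustration, not from the original text) of eventual consistency: one replica accepts a write immediately, the update propagates to the other replicas later, and once propagation finishes all replicas converge (replica convergence). Last-write-wins conflict resolution is assumed purely for illustration.
Python
import time

replicas = {"r1": None, "r2": None, "r3": None}
pending = []                                   # (replica, value, timestamp) not yet applied

def write(value):
    """Apply the write to one replica immediately; queue it for the others."""
    now = time.time()
    replicas["r1"] = (value, now)
    pending.extend([("r2", value, now), ("r3", value, now)])

def propagate():
    """Deliver queued updates; each replica keeps the newest timestamp it has seen."""
    while pending:
        replica, value, ts = pending.pop()
        current = replicas[replica]
        if current is None or ts > current[1]:   # last-write-wins (assumed)
            replicas[replica] = (value, ts)

write("schedule v2")
print("before propagation:", replicas)          # r2 and r3 are still stale
propagate()
print("after propagation: ", replicas)          # all replicas have converged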
A few popular analytical tools:
1) NoSQL
2) SAS
3) IBM SPSS Modeler
4) Statistics
Open-Source Analytical Tools
1) R Analytics
2) Weka
**********************************************************************
Unit: III
The Big Data Technology Landscape:
NoSQL – Types of NoSQL Databases – Need of NoSQL – Advantages of NoSQL – Use
of NoSQL in Industry – SQL vs NoSQL – Comparison of SQL, NoSQL and NewSQL. Hadoop:
Features of Hadoop – Advantages of Hadoop – Overview of Hadoop – Hadoop Distributions –
Hadoop vs SQL – Integrated Hadoop Systems – Cloud-Based Hadoop Solutions.
NoSQL(Not Only SQL)
The term NoSQL was first coined by Carlo Strozzi in 1998, and Johan Oskarsson
reintroduced it in 2009 to discuss open-source distributed databases.
Features of NoSQL database:
They are opensource.
They are non-relational.
They are distributed.
They are schema-less.
They are cluster friendly.
They are born out of 21st century web applications.
Where it is used?
NoSQL databases are widely used in big data and other real-time web applications.
Where to use NoSQL?
Log analysis.
Social networking feeds.
Time-based data (not easily analyzed in an RDBMS).
Types of NoSQL database
NoSQL databases use a variety of data models for managing and accessing the data.
1. Key-value pair
2. Document-oriented
3. Column-oriented
4. Graph-based
5. Time series
1. Key-value pair (Big hash table):
A key-value data store is a type of database that stores data as a collection of key-value
pairs. In this type of data store, each data item is identified by a unique key, and the
value associated with that key can be anything, such as a string, a number, an object or even
another data structure.
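A minimal sketch (not from the original text) of the key-value idea, using an ordinary Python dictionary as a stand-in for a key-value store such as Redis or Riak; the keys and values shown are made up for illustration.
Python
store = {}

store["user:1001:name"] = "Jane Doe"                  # value is a string
store["user:1001:cart"] = ["book", "pen"]             # value is a list
store["session:abc123"] = {"user": 1001, "ttl": 300}  # value is a nested object

# Lookups are always by key; the store places no schema on the values.
print(store["user:1001:cart"])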
2. Document-Oriented:
It maintains data in collections constituted of documents.
For example: MongoDB, Apache CouchDB, Couchbase, MarkLogic, etc.
Sample document in document database:
{
  "Book name": "Fundamentals of Business Analytics",
  "Publisher": "Wiley India",
  "Year of publication": "2011"
}
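A hedged sketch of storing and querying such a document with PyMongo; it assumes the pymongo package is installed and a MongoDB server is running at localhost:27017, and the database/collection names are made up for illustration.
Python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local MongoDB server
books = client["library"]["books"]                 # hypothetical database and collection

# Documents in the same collection need not share a rigid schema.
books.insert_one({
    "Book name": "Fundamentals of Business Analytics",
    "Publisher": "Wiley India",
    "Year of publication": "2011",
})

print(books.find_one({"Publisher": "Wiley India"}))
client.close()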
3. Column-Oriented:
Each storage block has data from only one column. For example: Cassandra, HBase, etc.
4. Graph-Based:
They are also called network databases. A graph database stores data in nodes.
For example: Neo4j, HyperGraphDB, etc.
Need of NoSQL
It has a scale-out architecture instead of the monolithic architecture of relational
databases.
It can house large volumes of structured, semi-structured and unstructured data.
Dynamic schema: A NoSQL database allows insertion of data without a pre-defined
schema. In other words, it facilitates application changes in real time, which thus
supports faster development, easy code integration, and requires less database
administration.
Auto-sharding: It automatically spreads data across an arbitrary number of servers.
The application in question is, more often than not, not even aware of the composition
of the server pool. It balances the load of data and queries on the available servers;
and if and when a server goes down, it is quickly replaced without any major activity
disruptions.
Replication: It offers good support for replication which in turn guarantees high
availability, fault tolerance and disaster recovery.
Advantages of NoSQL
1. Can easily scale up and down:
NoSQL databases support scaling rapidly and elastically, and even allow scaling to the
cloud.
Key-value pair data stores: Riak, Redis, Membase.
Column-oriented data stores: Cassandra, HBase, Hypertable.
Document-oriented data stores: MongoDB, CouchDB, RavenDB.
Graph data stores: Infinite Graph, Neo4j, AllegroGraph.
a) Cluster scale:
It allows distribution of database across 100+ nodes often in multiple data centers.
b) Performance scale:
It sustains over 100,000+ database reads and writes per second.
c) Data scale:
It supports housing of 1 billion+ documents in the database.
2. Does not require a pre-defined schema:
NoSQL does not require any adherence to a pre-defined schema.
For example: MongoDB.
3. Cheap, easy to implement:
Deploying NoSQL properly allows for all of the benefits of scale, high availability,
fault tolerance, etc.
4. Relaxes the data consistency requirement:
NoSQL databases adhere to the CAP theorem. Most of the NoSQL databases
compromise on consistency in favor of availability and partition tolerance.
5. Data can be replicated to multiple nodes and can be partitioned:
Sharding: splitting the data across multiple nodes (a minimal sketch of hash-based sharding follows).
Replication: keeping copies of the data on multiple nodes for availability and fault tolerance.
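A minimal Python sketch of hash-based sharding (an illustration, not from the original text): each key is deterministically mapped to one of a fixed set of shards. The node names are made up, and production systems typically use consistent hashing rather than a plain modulo so that adding a node moves less data.
Python
import hashlib

SHARDS = ["node-1", "node-2", "node-3"]   # hypothetical servers in the pool

def shard_for(key: str) -> str:
    """Pick a shard deterministically from the hash of the key."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

for key in ["user:1001", "user:1002", "order:77"]:
    print(key, "->", shard_for(key))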
Use of NoSQL in Industry
NoSQL is being put to use in varied industries. It is used to support
analysis for applications such as web user data analysis, log analysis, sensor feed
analysis, making recommendations for upsell and cross-sell, etc.
SQL VS NoSQL
SQL | NoSQL
Relational database management system (RDBMS). | Non-relational or distributed database system.
These databases have a fixed or static or predefined schema. | They have a dynamic schema.
These databases are not suited for hierarchical data storage. | These databases are best suited for hierarchical data storage.
These databases are best suited for complex queries. | These databases are not so good for complex queries.
Vertically scalable. | Horizontally scalable.
Follows ACID properties. | Follows CAP (consistency, availability, partition tolerance).
Examples: MySQL, PostgreSQL, Oracle, MS-SQL Server, etc. | Examples: MongoDB, HBase, Neo4j, Cassandra, etc.
Comparison of SQL, NoSQL, NewSQL
FEATURES | SQL | NoSQL | NewSQL
Schema | It is schema-fixed. | It is schema-free. | It is both schema-fixed and schema-free.
Base Properties/Theorem | It strictly follows ACID properties. | It follows the CAP theorem. | It takes care of ACID properties.
Security | It is secure. | It is less secure. | It is moderately secure.
Databases | No distributed database. | Distributed database. | Distributed database.
Query Language | It supports SQL as a query language. | It does not support old SQL but supports UQL. | It supports SQL with improved functions and features.
Scalability | It is vertically scalable. | It is horizontally scalable. | It is both vertically and horizontally scalable.
Types of database | Relational database. | Non-relational database. | Relational database, but not purely.
Online processing | Online transaction processing with full functionality. | Online transaction processing, but not with full functionality. | Online analytical processing.
Query Handling | Simple queries can be handled. | Complex queries can be directed better than in SQL. | Highly efficient for complex queries.
Example | MySQL | MongoDB | CockroachDB
HADOOP
Hadoop is an open-source project of the Apache Software Foundation. It is a framework written in
Java, originally developed by Doug Cutting in 2005, who named it after his son's toy elephant.
It was created to support distribution for "Nutch", the text search engine. Hadoop is now a
part of the computing infrastructure of companies such as Yahoo, Facebook, LinkedIn,
Twitter, etc.
Features of Hadoop
1. It is optimized to handle massive quantities of structured, semi-structured and unstructured
data, using commodity hardware, that is, relatively inexpensive computers.
2. Hadoop has a shared-nothing architecture.
3. It replicates its data across multiple computers so that if one goes down, the data can still be
processed from another machine that stores its replica.
4. Hadoop is designed for high throughput rather than low latency. It is a batch operation handling
massive quantities of data; therefore the response time is not immediate.
5. It complements On-Line Transaction Processing (OLTP) and On-Line Analytical Processing (OLAP);
however, it is not a replacement for a relational database management system.
6. It is not good when work cannot be parallelized or when there are dependencies within the
data.
7. It is not good for processing small files. It works best with huge data files and datasets.
Advantages of Hadoop
Stores data in its native format:
Hadoop’s data storage framework (HDFS-Hadoop distributed file system) can store
data in its native format. There is no structure that is imposed while keying in data or storing
data. HDFS is pretty much schema-less. It is only later when the data needs to be processed
that structure is imposed on the raw data.
Scalable:
Hadoop can store and distribute very large datasets across hundreds of inexpensive
servers that operate in parallel.
Cost-effective:
Owing to its scale-out architecture, Hadoop has a much reduced cost per terabyte of
storage and processing.
Resilient to failure:
Hadoop is fault tolerant. It practices replication of data diligently, which means that
whenever data is sent to any node, the same data also gets replicated to other nodes in the
cluster, thereby ensuring that in the event of a node failure, there will always be another
copy of the data available for use.
Flexibility:
One of the key advantages of Hadoop is its ability to work with all kinds of data:
structured, semi-structured and unstructured data. It can help derive meaningful business
insights from email conversations, social media data, click-stream data, etc.
Fast:
Processing is extremely fast in Hadoop as compared to other conventional systems,
owing to the "move code to data" paradigm. Hadoop has a shared-nothing architecture.
Overview of Hadoop
There are components available in the hadoop ecosystem for data ingestion,
processing and analysis.
Components that help with data ingestion are
1. Sqoop
2. Flume
Components that help with data processing are
1. MapReduce
2. Spark
Components that help with data analysis are
1. Pig
2. Hive
3. Impala
HDFS
HDFS is the Hadoop Distributed File System. It is the distributed storage unit of
Hadoop. It provides streaming access to file system data as well as file permissions and
authentication. It is based on GFS (Google File System). HDFS is highly fault-tolerant.
HBASE
It stores data in HDFS. It is the first non-batch component of the hadoop ecosystem.
It is a database on top of HDFS. It is based on Google BigTable. It is a NoSQL database, is
non-relational and is a column-oriented database. This is widely used by Facebook, Twitter,
Yahoo, etc.
Difference between HBase and Hadoop
S.No. | Hadoop | HBase
1 | Hadoop is a collection of software tools. | HBase is a part of the Hadoop ecosystem.
2 | Stores data sets in a distributed environment. | Stores data in a column-oriented manner.
3 | Hadoop is a framework. | HBase is a NoSQL database.
4 | Data are stored in the form of chunks. | Data are stored in the form of key/value pairs.
5 | Hadoop does not allow run-time changes. | HBase allows run-time changes.
6 | Files can be written only once and read many times. | Files can be read and written multiple times.
7 | Hadoop has high latency operations. | HBase has low latency operations.
8 | HDFS can be accessed through MapReduce. | HBase can be accessed through shell commands, Java API, REST.
Hadoop ecosystem components for data ingestion
1. Sqoop:
Sqoop stands for SQL-to-Hadoop. Its main functions are
a) Importing data from an RDBMS such as MySQL, Oracle, DB2, etc. to the Hadoop file
system (HDFS, HBase, Hive).
b) Exporting data from the Hadoop file system to an RDBMS.
Uses of Sqoop
a) It has a connector-based architecture to allow plug-ins to connect to external
systems such as MySQL, Oracle, DB2, etc.
b) It integrates with Oozie, allowing you to schedule and automate import and export
tasks.
2. Flume:
Flume is an important log aggregator component in the Hadoop ecosystem. Flume
was developed by Cloudera. It is designed for high-volume ingestion of event-based
data into Hadoop. The default destination in Flume is HDFS; however, it can also write to
HBase or Solr.
Hadoop Ecosystem components for data processing
1. MapReduce:
It is a programming paradigm that allows distributed and parallel
processing of huge datasets. It is based on Google MapReduce. Google released a paper
on the MapReduce programming paradigm in 2004, and that became the genesis of the Hadoop
processing model. There are two main phases: the Map phase and the Reduce phase.
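A Hadoop-free Python sketch of the two phases for a word count (an illustration, not Hadoop code): in a real cluster the map and reduce tasks run in parallel on many nodes, while here they run sequentially just to show the data flow.
Python
from collections import defaultdict

documents = ["big data is big", "data needs processing"]

# Map phase: emit (word, 1) pairs from every input record.
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle: group the intermediate pairs by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate the values for each key.
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)   # {'big': 2, 'data': 2, 'is': 1, 'needs': 1, 'processing': 1}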
2. Spark:
It is both a programming model and a computing model. It is an
open-source big data processing framework. It was originally developed in 2009 at UC
Berkeley's AMPLab and became an open-source project in 2010. Its main components are:
1. Spark SQL
2. Spark Streaming
3. MLlib
4. GraphX
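A minimal PySpark sketch of the same word-count idea (it assumes the pyspark package and a local Spark runtime are available; this is an illustration, not part of the original text).
Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-demo").master("local[*]").getOrCreate()

# Distribute a small collection and run a parallel word count on it.
lines = spark.sparkContext.parallelize(["big data is big", "data needs processing"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.collect())

spark.stop()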
Hadoop ecosystem components for data analysis
1. Pig:
It is a high-level scripting language used with Hadoop. It serves as an
alternative to writing MapReduce code directly. It has two parts:
a) Pig Latin – a SQL-like scripting language.
b) Pig runtime – the runtime environment.
2. Hive:
Hive is a data warehouse software project built on top of Hadoop. The three
main tasks performed by Hive are summarization, querying and analysis.
Difference between Hive and RDBMS
RDBMS | Hive
It is used to maintain a database. | It is used to maintain a data warehouse.
It uses SQL (Structured Query Language). | It uses HQL (Hive Query Language).
Schema is fixed in RDBMS. | Schema varies in it.
Normalized data is stored. | Both normalized and de-normalized data are stored.
Tables in RDBMS are sparse. | Tables in Hive are dense.
It doesn't support partitioning. | It supports automatic partitioning.
No partition method is used. | Sharding method is used for partition.
Difference between Hive and HBase
1. Hive is a MapReduce-based SQL engine that runs on top of Hadoop. HBase is a
key-value NoSQL database that runs on top of HDFS.
2. Hive is for batch processing of big data. HBase is for real-time data streaming.
Impala
It is a high-performance SQL engine that runs on a Hadoop cluster. It is ideal for
interactive analysis. It has very low latency, measured in milliseconds. It supports a dialect of
SQL called Impala SQL.
Zookeeper
It is a coordination service for distributed applications.
Oozie
It is a workflow scheduler system to manage Apache Hadoop jobs.
Mahout
It is a scalable machine learning and data mining library.
Chukwa
It is a data collection system for managing large distributed systems.
Ambari
It is a web-based tool for provisioning, managing, and monitoring Apache
Hadoop clusters.
Hadoop Distribution
Hadoop is an open-source Apache project. The core aspects of Hadoop include
the following:
1. Hadoop Common
2. Hadoop Distributed File System (HDFS)
3. Hadoop YARN (Yet Another Resource Negotiator)
4. Hadoop MapReduce
There are a few companies, such as IBM, Amazon Web Services, Microsoft, Teradata,
Hortonworks, Cloudera, etc., that have packaged Hadoop into more easily consumable
distributions or services. Although each of these companies has a slightly different
strategy, the key essence remains the ability to distribute data and workloads across
potentially thousands of servers, thus making big data manageable.
Hadoop Distributions
Intel Distribution for Apache Hadoop software
Hortonworks
Cloudera's Distribution Including Apache Hadoop
EMC Greenplum HD
IBM InfoSphere BigInsights
MapR M5 Edition
Microsoft Big Data Solution
Hadoop vs SQL
Hadoop | SQL
Scale out | Scale up
Key-value pairs | Relational tables
Functional programming | Declarative queries
Integrated Hadoop Systems offered by leading market vendors
Integrated Hadoop Systems:
1. EMC Greenplum
2. Oracle Big Data Appliance
3. Microsoft Big Data Solution
4. IBM Infosphere
5. HP Big data solutions.
Cloud-Based Hadoop Solutions
Amazon Web Services holds out a comprehensive, end-to-end portfolio of cloud
computing services to help manage big data. The aim is to achieve this and more,
along with retaining the emphasis on reducing costs, scaling to meet demand, and
accelerating the speed of innovation.
The Google Cloud Storage connector for Hadoop empowers one to perform
MapReduce jobs directly on data in Google Cloud Storage, without the need to copy it to
local disk and run it in the Hadoop Distributed File System.
Cloud based solution:
1. Amazon web services
2. Google BigQuery.
*********************************************
Unit: IV
Introduction to Hadoop
Introduction to Hadoop: Introduction to Hadoop – Need of Hadoop – Need of
RDBMS – RDBMS vs Hadoop – Distributed computing challenges – History of Hadoop
– Hadoop overview – Use case of Hadoop – Hadoop distribution –HDFS – Processing
data with Hadoop – Managing resources and Application with Hadoop YARN –
Interacting with Hadoop Ecosystem
INTRODUCTION TO HADOOP
Today, big data seems to be the buzzword! Enterprises, the world over, are
beginning to realize that there is a huge volume of untapped information before
them in the form of structured, semi-structured, and unstructured data.
This wide variety of data is spread across the networks.
Let us look at a few statistics to get an idea of the amount of data which gets
generated every day, every minute and every second.
1. EVERY DAY:
(a) NYSE (New York Stock Exchange) generates 1.5 billion shares and
trade data.
(b) Facebook stores 2.7 billion comments and likes.
(c) Google processes about 24 petabytes of data.
2. EVERY MINUTE:
(a) Facebook users share nearly 2.5 million pieces of content.
(b) Twitter users tweet nearly 300,000 times.
(c) Instagram users post nearly 220,000 new photos.
(d) YouTube users upload 72 hours of new video content.
(e) Apple users download nearly 50,000 apps.
(f) Email users send over 200 million messages.
(g) Amazon generates $80,000 in online sales.
(h) Google receives over 4 million search queries.
3. EVERY SECOND:
(a) Banking applications process more than 10,000 credit card
transactions.
NEED OF HADOOP
Ever wondered why Hadoop has been, and still is, one of the most wanted technologies?
The reason is its capability to handle massive amounts of data, of different categories,
fairly quickly.
The other considerations are
1. Cost: Hadoop is an open-source framework and uses commodity
hardware (commodity hardware is relatively inexpensive and easy-to-obtain
hardware) to store enormous quantities of data.
2. Computing power: Hadoop is based on a distributed computing model
which processes very large volumes of data fairly quickly. The more the number
of computing nodes, the more the processing power at hand.
3. Scalability: This boils down to simply adding nodes as the system grows,
and it requires little administration.
4. Storage flexibility: Unlike traditional relational databases, in Hadoop
data need not be pre-processed before storing it. Hadoop provides the
convenience of storing as much data as one needs and also the added
flexibility of deciding later how to use the stored data. In Hadoop, one
can store unstructured data like images, videos, and free-form text.
5. Inherent data protection: Hadoop protects data and executing applications
against hardware failure. If a node fails, it automatically redirects the jobs
that had been assigned to this node to the other functional and available
nodes and ensures that the distributed computation does not fail. It goes a step
further to store multiple copies of the data on various nodes across the
cluster.
In this design, groups of machines are gathered together; such a group is known as a
cluster.
NEED OF RDBMS
RDBMS is not suitable for storing and processing large files, images and
videos. RDBMS is not a good choice when it comes to advanced analytics
involving machine learning.
The following diagram describes the RDBMS system with respect to cost and
storage.
It calls for huge investment as the volume of data shows an upward trend.
RDBMS (VS) HADOOP
S.No. | RDBMS | Hadoop
1 | Traditional row-column based databases, basically used for data storage, manipulation and retrieval. | An open-source software used for storing data and running applications or processes concurrently.
2 | In this, structured data is mostly processed. | In this, both structured and unstructured data are processed.
3 | It is best suited for an OLTP environment. | It is best suited for big data.
4 | It is less scalable than Hadoop. | It is highly scalable.
5 | Data normalization is required in RDBMS. | Data normalization is not required in Hadoop.
6 | It stores transformed and aggregated data. | It stores huge volumes of data.
7 | It has no latency in response. | It has some latency in response.
8 | The data schema of RDBMS is static. | The data schema of Hadoop is dynamic.
9 | High data integrity is available. | Lower data integrity than RDBMS.
10 | Cost is applicable for licensed software. | Free of cost, as it is open-source software.
DISTRIBUTED COMPUTING CHALLENGES
Hardware failure:
In a distributed system, several servers are networked together. This
implies that, more often than not, there may be a possibility of hardware failure.
And when such a failure does happen, how does one retrieve the data that was
stored in the system?
Just to explain further: a regular hard disk may fail once in 3 years, and when you
have 1,000 such hard disks, there is a possibility of at least a few being down
every day.
Replication:
The way to handle this is replication, that is, keeping multiple copies of the data
across the network so that it can be recovered after a failure. The replication factor
specifies how many copies are kept; a replication factor of 2, for example, implies
that we have two replicas of the data.
How to process this gigantic store of data?
In a distributed system, the data is spread across the network on
several machines. A key challenge here is to integrate the data
available on several machines prior to processing it.
Hadoop solves this problem by using MapReduce programming. It is
a programming model to process the data (MapReduce programming
will be discussed a little later).
HISTORY OF HADOOP
Hadoop was created by Doug Cutting, the creator of Apache Lucene
(a commonly used text search library). Hadoop grew out of Apache
Nutch (an open-source web search engine, later developed at Yahoo),
which is itself a part of the Lucene project. Refer to the diagram for more details.
The Name "Hadoop"
The name Hadoop is not an acronym; it is a made-up name. The project's
creator, Doug Cutting, explains how the name came about:
"The name my kid gave a stuffed yellow elephant. Short, relatively
easy to spell and pronounce, meaningless, and not used elsewhere: those
are my naming criteria. Kids are good at generating such. Googol is a kid's
term."
HADOOP OVERVIEW
Open-source software framework to store and process massive
amounts of data in a distributed fashion on large clusters of commodity
hardware. Basically, Hadoop accomplishes two tasks:
1. Massive data storage.
2. Faster data processing.
Hadoop components:
The diagram depicts the Hadoop components.
Hadoop Core Components:
1. HDFS:
(a) Storage component.
(b) Distributes data across several nodes.
(c) Natively redundant.
2. MapReduce:
(a) Computational framework.
(b) Splits a task across multiple nodes.
(c) Processes data in parallel.
Hadoop Ecosystem:
The Hadoop Ecosystem consists of support projects that enhance the
functionality of the Hadoop Core Components. The ecosystem projects are as follows:
1. HIVE
2. PIG
3. SQOOP
4. HBASE
5. FLUME
6. OOZIE
7. MAHOUT
Hadoop Conceptual Layer:
It is conceptually divided into Data Storage Layer which stores huge
volumes of data and Data Processing Layer which processes data in parallel
to extract richer and meaningful insights from data (Figure 5.9).
High-Level Architecture of Hadoop:
Hadoop has a distributed Master-Slave Architecture. The master node is
known as the NameNode and the slave nodes are known as DataNodes. Figure 5.10
depicts the Master-Slave Architecture of the Hadoop Framework.
Let us look at the key components of the Master Node.
1. Master HDFS: Its main responsibility is partitioning the data storage
across the slave nodes. It also keeps track of locations of data on DataNodes.
2. Master MapReduce: It decides and schedules computation tasks on the slave
nodes.
USE CASE OF HADOOP
Click stream data:
ClickStream data (mouse clicks) helps you to understand the purchasing
behavior of customers. ClickStream analysis helps online marketers to optimize
their product web pages, promotional content, etc. to improve their business.
The ClickStream analysis (Figure 5.11) using Hadoop provides three key
benefits:
1. Hadoop helps to join ClickStream data with other data sources such as
Customer Relationship Management Data (Customer Demographics Data,
Sales Data, and Information on Advertising Campaigns). This additional
data often provides the much-needed information to understand customer
behavior.
2. Hadoop's scalability property helps you to store years of data without
significant incremental cost. This helps you to perform temporal or year-over-
year analysis on ClickStream data which your competitors may miss.
3. Business analysts can use Apache Pig or Apache Hive for website analysis.
With these tools, you can organize ClickStream data by user session, refine
it, and feed it to visualization or analytics tools.
HADOOP DISTRIBUTORS
The companies shown in Figure provide products that include Apache
Hadoop, commercial support, and/or tools and utilities related to Hadoop.
HDFS (HADOOP DISTRIBUTED FILE SYSTEM):
Some key Points of Hadoop Distributed File System are as follows:
1. Storage component of Hadoop.
2. Distributed File System.
3. Modeled after Google File System.
4. Optimized for high throughput (HDFS leverages large block size
and moves computation where data is stored).
5. A file is replicated a configured number of times, which makes HDFS
tolerant of both software and hardware failure.
6. Data blocks are automatically re-replicated on other nodes when a node fails.
7. You can realize the power of HDFS when you perform read or
write on large files (gigabytes and larger).
HDFS DAEMONS
Name Node:
HDFS breaks a large file into smaller pieces called blocks.
NameNode uses a rack ID to identify DataNodes in the rack. A
rack is a collection of DataNodes within the cluster.
NameNode keeps track of the blocks of a file as they are placed on
various DataNodes. NameNode manages file-related operations
such as read, write, create, and delete.
Its main job is managing the File System Namespace. A file
system namespace is the collection of files in the cluster.
NameNode stores the HDFS namespace. The file system namespace
includes the mapping of blocks to files and file properties, and it is stored
in a file called FsImage.
Data Node:
There are multiple DataNodes per cluster.
During pipeline reads and writes, DataNodes communicate with each other.
A DataNode also continuously sends a "heartbeat" message to the NameNode to
ensure connectivity between the NameNode and the DataNode.
Secondary NameNode:
The Secondary NameNode takes a snapshot of the HDFS metadata at intervals
specified in the Hadoop configuration. Since the memory requirements of the
Secondary NameNode are the same as those of the NameNode, it is better to run
the NameNode and the Secondary NameNode on different machines. In case of
failure of the NameNode, the saved snapshot can be used to rebuild the file
system metadata.
Anatomy of File Read:
Figure 5.18 describes the anatomy of a File Read; a short Java sketch of the read path follows the steps below.
The steps involved in the File Read are as follows:
1. The client opens the file that it wishes to read from by calling open() on
the Distributed File System.
2. Distributed File System communicates with the NameNode to get the
location of data blocks. NameNode returns with the addresses of the
DataNodes that the data blocks are stored on.
3. The client then calls read() on the stream DFSInputStream, which has the
addresses of the DataNodes for the first few blocks of the file, and connects to
the closest DataNode for the first block in the file.
4. Client calls read() repeatedly to stream the data from the DataNode.
5. When the end of a block is reached, DFSInputStream closes the connection
with the DataNode.
6. When the client completes the reading of the file, it calls close() on the
FSDataInputStream to close the connection.
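The read path above can be exercised with a short Java client. The following is a minimal sketch, assuming the Hadoop client libraries are on the classpath and HDFS is reachable; the URI and file path (hdfs://localhost:9000/sample/test.txt) are illustrative assumptions, not values from the text.

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Minimal HDFS read sketch: open() contacts the NameNode for block locations,
// and read() streams the blocks from the nearest DataNodes.
public class HdfsFileRead {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://localhost:9000/sample/test.txt"; // assumed path
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));                      // steps 1-3: open and locate blocks
            IOUtils.copyBytes(in, System.out, 4096, false);   // step 4: stream the data
        } finally {
            IOUtils.closeStream(in);                          // steps 5-6: close the stream
        }
    }
}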
Anatomy of File Write
The steps involved in the anatomy of a File Write are as follows (a short Java sketch of the write path follows the steps):
1. The client calls create() on the Distributed File System to create a file.
2. An RPC call to the NameNode happens through the Distributed File System to
create a new file. The NameNode performs various checks before creating the file
(for example, it checks whether such a file already exists). Initially, the
NameNode creates the file without associating any data blocks with it. The
Distributed File System returns an FSDataOutputStream to the client to perform the write.
3. As the client writes data, the data is split into packets by DFSOutputStream,
which are then written to an internal queue called the data queue. The
DataStreamer consumes the data queue.
4. DataStreamer streams the packets to the first DataNode in the pipeline.
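The write path can be sketched in the same way. This is a minimal example, assuming the Hadoop client libraries and a reachable HDFS; the target path is a made-up example.

import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal HDFS write sketch: create() asks the NameNode to create the file entry,
// and the bytes written to the FSDataOutputStream are packetized and streamed
// down the DataNode pipeline.
public class HdfsFileWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();         // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/sample/write-demo.txt");   // assumed target path
        FSDataOutputStream out = fs.create(path, true);   // steps 1-2: create the file entry
        BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(out, "UTF-8"));
        writer.write("Hello HDFS");                       // steps 3-4: data flows down the pipeline
        writer.close();                                   // flushes packets and closes the stream
    }
}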
Replica Placement Strategy
Hadoop Default Replica Placement Strategy:
As per the Hadoop Replica Placement Strategy, the first replica is placed
on the same node as the client. The second replica is placed on a node
on a different rack.
The third replica is placed on the same rack as the second, but on a
different node in that rack. Once the replica locations have been chosen, a
pipeline is built.
This strategy provides good reliability. Figure 5.20 describes the
typical replica pipeline.
Working with HDFS Commands:
To list the directories and files at the root of HDFS:
hadoop fs -ls /
To list all directories and files of HDFS recursively:
hadoop fs -ls -R /
To create a directory in HDFS:
hadoop fs -mkdir /sample
To copy a file from the local file system to HDFS:
hadoop fs -put /root/sample/test.txt /sample/test.txt
To display the contents of an HDFS file on the console:
hadoop fs -cat /sample/test.txt
To remove a directory from HDFS:
hadoop fs -rm -r /sample1
Special features of HDFS:
1. Data Replication: There is absolutely no need for a client application to
track all blocks. HDFS directs the client to the nearest replica to ensure high
performance.
2. Data Pipeline: A client application writes a block to the first DataNode in
the pipeline. Then this DataNode takes over and forwards the data to the next
node in the pipeline. This process continues for all the data blocks, and
subsequently all the replicas are written to the disk.
PROCESSING DATA WITH HADOOP
In MapReduce Programming, the input dataset is split into
independent chunks. Map tasks process these independent chunks
completely in a parallel manner.
The MapReduce Framework sorts the output based on keys. This sorted
output becomes the input to the reduce tasks. A reduce task provides
reduced output by combining the output of the various mappers. Job
inputs and outputs are stored in a file system.
The MapReduce framework also takes care of other tasks such as
scheduling, monitoring, and re-executing failed tasks. There is a single master
JobTracker per cluster and one slave TaskTracker per cluster node. The
JobTracker is responsible for scheduling tasks on the TaskTrackers,
monitoring the tasks, and re-executing a task in case the TaskTracker fails.
MapReduce Daemons:
1. JobTracker:
It provides connectivity between Hadoop and your application. When
you submit code to the cluster, the JobTracker creates the execution plan by
deciding which task to assign to which node.
It also monitors all the running tasks. When a task fails, it
automatically re-schedules the task to a different node after a
predefined number of retries.
JobTracker is a master daemon responsible for executing overall
MapReduce job. There is a single JobTracker per Hadoop cluster.
2. TaskTracker:
This daemon is responsible for executing the individual tasks assigned by
the JobTracker.
There is a single TaskTracker per slave node, and it spawns multiple Java
Virtual Machines (JVMs) to handle multiple map or reduce tasks in
parallel.
Task Tracker continuously sends heartbeat message to JobTracker.
How does MapReduce work?
The following steps describe how MapReduce performs its task:
1. First, the input dataset is split into multiple pieces of data (several small
subsets).
2. Next, the framework creates a master process and several worker processes and
executes the worker processes remotely.
MapReduce Example:
The famous example for MapReduce Programming is Word Count. For
example, consider that you need to count the occurrences of each word across 50
files. You can achieve this using MapReduce Programming. Refer to Figure 5.25.
Word Count MapReduce Programming using Java
The MapReduce Programming requires three things (a minimal sketch of the three classes follows this list):
1. Driver Class: This class specifies Job Configuration details.
2. Mapper Class: This class overrides the Map Function based on the problem
statement.
3. Reducer Class: This class overrides the Reduce Function based on the problem
statement.
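The following is a minimal sketch of the three classes, closely following the standard Hadoop WordCount example and assuming the Hadoop MapReduce Java API (org.apache.hadoop.mapreduce) is available on the classpath.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper Class: emits (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer Class: sums the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver Class: job configuration details.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The program can be packaged into a JAR and submitted with the hadoop jar command, with the HDFS input and output directories passed as arguments.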
MANAGING RESOURCES AND APPLICATIONS WITH
HADOOP YARN
You can run multiple applications in Hadoop 2.x, in which all
applications share a common resource management layer. Hadoop can now be used
for various types of processing such as Batch, Interactive, Online, Streaming,
Graph, and others.
Limitations of Hadoop 1.0 Architecture
In Hadoop 1.0, HDFS and MapReduce are the Core Components, while other
components are built around the core.
1. Single NameNode is responsible for managing entire namespace for Hadoop
Cluster
2. It has a restricted processing model which is suitable for batch-oriented
MapReduce jobs.
3. Hadoop MapReduce is not suitable for interactive analysis.
4. Hadoop 1.0 is not suitable for machine learning algorithms, graphs, and other
memory-intensive algorithms.
5. MapReduce is responsible for cluster resource management and data
processing.
HDFS Limitation
NameNode saves all its file metadata in main memory. Although the
main memory today is not as small and as expensive as it used to be two
decades ago, still there is a limit on the number of objects that one can have
in the memory of a single NameNode. The NameNode can quickly become
overwhelmed as the load on the system increases.
Hadoop 2: HDFS
HDFS 2 consists of two major components:
(a) namespace.
(b) blocks storage service.
The namespace service takes care of file-related operations such as creating
files, modifying files, and creating directories. The block storage service
handles DataNode cluster management and replication.
HDFS Features:
1. Horizontal scalability.
2. High availability.
Hadoop 2 YARN: Taking Hadoop beyond Batch
Fundamental Idea
The fundamental idea behind this architecture is to split the JobTracker's
responsibilities of resource management and job scheduling/monitoring into
separate daemons. The daemons that are part of the YARN Architecture are described
below.
1. A Global Resource Manager:
Its main responsibility is to distribute resources among various applications
in the system. It has two main components:
(a) Scheduler: The pluggable scheduler of Resource Manager decides
allocation of resources to various running applications.
(b) Application Manager: Application Manager does the following:
Accepting job submissions.
Negotiating resources (container) for executing the
application specific Application Master.
Restarting the Application Master in case of failure.
2. NodeManager:
This is a per-machine slave daemon. The NodeManager's responsibility is to
launch the application containers for application execution. The NodeManager
monitors resource usage such as memory, CPU, disk, network, etc. It then
reports the usage of resources to the global ResourceManager.
3. Per-application ApplicationMaster:
This is an application-specific entity. Its responsibility is to negotiate the
resources required for execution from the ResourceManager. It works along with
the NodeManager to execute and monitor the component tasks.
Basic Concepts
Application:
1. Application is a job submitted to the framework.
2. Example-MapReduce Job.
Container:
1. Basic unit of allocation.
2. Fine-grained resource allocation across multiple resource types
(Memory, CPU, disk, network, etc.)
(a) container_0: 2 GB, 1 CPU
(b) container_1: 1 GB, 6 CPU
3. Replaces the fixed map/reduce slots.
YARN Architecture:
The steps involved in YARN architecture are as follows:
1. A client program submits the application which includes
the necessary specifications to launch the application-specific
ApplicationMaster itself.
2. The Resource Manager launches the Application Master
by assigning some container.
3. The ApplicationMaster, on boot-up, registers with the
Resource Manager. This helps the client program to query the
Resource Manager directly for the details.
4. During the normal course, ApplicationMaster negotiates
appropriate resource containers via the resource-request protocol.
INTERACTION WITH HADOOP ECOSYSTEM
Pig:
Pig is a data flow system for Hadoop. It uses Pig Latin to specify the data
flow.
Pig is an alternative to MapReduce Programming. It abstracts some
details and allows you to focus on data processing.
It consists of two components
1. Pig Latin: The data processing language
2. Compiler: To translate Pig Latin to MapReduce Programming.
Hive:
Hive is a Data Warehousing Layer on top of Hadoop. Analysis and
queries can be done using an SQL-like language.
Hive can be used to do ad-hoc queries, summarization, and data analysis.
Figure 5.31 depicts Hive in the Hadoop ecosystem.
Sqoop:
Sqoop is a tool which helps to transfer data between Hadoop and
Relational Databases. With the help of Sqoop, you can import data from
RDBMS to HDFS and vice-versa.
HBase:
HBase is a NoSQL database for Hadoop. It is a column-oriented
NoSQL database.
HBase is used to store billions of rows and millions of columns. It
provides random read/write operations.
It also supports record-level updates, which is not possible using HDFS alone.
HBase sits on top of HDFS.
******************************************************
UNIT: V
INTRODUCTION TO MONGODB & MACHINE LEARNING
Introduction to MongoDB: What is MongoDB - Why MongoDB - Terms
used in RDBMS and MongoDB - Data types in MongoDB - MongoDB
query language.
Introduction to Machine Learning: Introduction –Machine Learning
Definition –Machine Learning Algorithms –Regression Model –Linear
Regression –Clustering –Collaborative Filtering –Association Rule
Mining –Decision Tree.
INTRODUCTION TO MONGODB
MongoDB is a preferred big data option thanks to its ability to
easily handle a wide variety of data formats, support for real-time
analysis, high-speed data ingestion, low-latency performance,
flexible data model, easy horizontal scale-out, and powerful query
language.
MongoDB, the most popular NoSQL database, is an open-source,
document-oriented database.
The term ‘NoSQL’ means ‘non-relational’. It means that
MongoDB isn’t based on the table-like relational database
structure but provides an altogether different mechanism for
storage and retrieval of data.
What is MongoDB?
1. Cross-platform.
2. Open source.
3. Non-relational.
4. Distributed.
5. NoSQL.
6. Document-oriented data store.
Why MongoDB?
A few of the major challenges with a traditional RDBMS are dealing with large
volumes of data, a rich variety of data (particularly unstructured data), and
meeting the scale needs of enterprise data. The need is for a database that can
scale out or scale horizontally to meet the scale requirements, has flexibility
with respect to schema, is fault tolerant, is consistent and partition tolerant,
and can be easily distributed over a multitude of nodes in a cluster.
Terms used in RDBMS and MongoDB
RDBMS | MongoDB
Database | Database
Table | Collection
Record | Document
Columns | Fields (key-value pairs)
Index | Index
Joins | Embedded documents
Primary Key | Primary key (_id is the default identifier)
Equivalent server and client programs:
 | MySQL | Oracle | MongoDB
Database Server | mysqld | Oracle | mongod
Database Client | mysql | SQL*Plus | mongo
Data types in MongoDB
The following are the various data types in MongoDB.
String | Must be valid UTF-8. The most commonly used data type.
Integer | Can be 32-bit or 64-bit.
Boolean | To store a true/false value.
Double | To store floating-point (real) values.
Min/Max Keys | To compare a value against the lowest or highest BSON elements.
Arrays | To store arrays, lists, or multiple values in one key.
Timestamp | To record when a document has been modified or added.
Null | To store a NULL value. A NULL is a missing or unknown value.
Date | To store the current date or time in Unix time format. One can create an object of Date and pass day, month, and year to it.
Object ID | To store the document's ID.
Binary data | To store binary data (images, binaries, etc.).
Code | To store JavaScript code in the document.
Regular expression | To store regular expressions.
MongoDB query language
CRUD (Create, Read, Update, and Delete) operations in MongoDB are listed below; a short Java driver sketch follows the list.
Create -> Creation of data is done using the insert(), update(), or save() method.
Read -> Reading the data is performed using the find() method.
Update -> Updating data is accomplished using the update() method with UPSERT
set to false.
Delete -> A document is deleted using the remove() method.
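For a programmatic view, here is a minimal CRUD sketch using the MongoDB Java driver. It assumes the mongodb-driver-sync library on the classpath, a mongod running on the default port, and a hypothetical "test" database with a "Person" collection; the shell statements in the following sections are the interactive equivalents.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.Document;

// Minimal CRUD sketch against an assumed local MongoDB instance.
public class MongoCrudDemo {
    public static void main(String[] args) {
        MongoClient client = MongoClients.create("mongodb://localhost:27017"); // assumed URI
        MongoDatabase db = client.getDatabase("test");
        MongoCollection<Document> person = db.getCollection("Person");

        // Create
        person.insertOne(new Document("name", "Asha").append("city", "Madurai"));

        // Read
        Document found = person.find(Filters.eq("name", "Asha")).first();
        System.out.println(found);

        // Update
        person.updateOne(Filters.eq("name", "Asha"), Updates.set("city", "Chennai"));

        // Delete
        person.deleteOne(Filters.eq("name", "Asha"));

        client.close();
    }
}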
We will present the various methods available in the MongoDB shell to deal with data in the next
few sections. The sections have been designed as follows:
Objective: What is it that we are trying to achieve here?
Input: What is the input that has been given to us to act upon?
Act: The actual statement/command to accomplish the task at hand.
Outcome: The result/output as a consequence of executing the statement.
At a few places we have also provided the equivalent in an RDBMS such as Oracle.
Objective: To create a collection by the name "Person". Let us take a look at the collection list
prior to the creation of the new collection "Person":
show collections
Students
food
system.indexes
system.js
Act: The statement to create the collection is
db.createCollection("Person")
>db.createCollection("Person");
{ "ok" : 1 }
>
Outcome: Below is the collection list after the creation of the new collection "Person":
>show collections;
Person
Students
food
system.indexes
system.js
>
Objective: To drop a collection by the name "food".
Take a look at the current collection list:
>show collections
Person
Students
food
system.indexes
system.js
>
Act: The statement to drop the collection is
db.food.drop();
> db.food.drop();
true
>
Outcome: The collection list after the execution of the statement is as follows:
>show collections
Person
Students
system.indexes
system.js
>
INTRODUCTION TO MACHINE LEARNING
Machine learning is programming computers to optimize a
performance criterion using example data or past experience. We
have a model defined up to some parameters, and learning is the
execution of a computer program to optimize the parameters of the
model using the training data or past experience. The model may be
predictive to make predictions in the future, or descriptive to gain
knowledge from data.
Machine Learning Definition
Arthur Samuel (1959), Machine Learning: It is a field of study that gives computers
the ability to learn without being explicitly programmed.
Tom Mitchell (1998), Well-posed Learning Problem: A computer program is said to
learn from experience E with respect to some task T and some performance measure
P, if its performance on T, as measured by P, improves with experience E.
Example
Tom's definition is the latest Machine Learning Definition. According to Tom's definition,
we will try to identify E, P, T in Email Spam Classification Problem. In this problem,
Task - Classifying email as spam or not.
Experience - Labeling the email as spam or not.
Performance - Number of emails correctly classified as spam or not
Machine Learning Algorithms
Machine learning algorithms can be classified into two categories.
1. Supervised Learning: In supervised learning, the classifier undergoes
a process of training based on known classifications and, through
supervision, attempts to learn the information contained in the training
dataset. Example: predicting the selling price of a house based on the
pricing of other houses for sale in the neighborhood.
2. Unsupervised Learning: In unsupervised learning, the classifier tries
to find some structure in the given dataset without any known
classification. That structure is known as a cluster. Example: Google News
collects tens of thousands of news stories and automatically clusters
them together, so that news stories with the same content are
displayed together.
Regression Model –Linear Regression
A Regression Model is used to predict numbers. With the help of regression,
you can predict profit, sales, house values, etc. The Regression Model serves as a good
example of supervised learning. Here, we will discuss Linear Regression.
Linear Regression is used to predict the relationship between two variables.
The variable that needs to be predicted is known as the dependent variable, and the
variables that are used to predict the value of the dependent variable are known as
independent variables.
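As a concrete illustration, the following is a minimal Java sketch of simple (one-variable) linear regression, fitting y = a + b·x with the usual least-squares formulas; the house-size and price numbers are made up for illustration.

// Simple least-squares linear regression: fits y = a + b*x.
public class SimpleLinearRegression {
    public static void main(String[] args) {
        // Hypothetical data: house size (independent) vs. price (dependent).
        double[] x = {1000, 1200, 1500, 1800, 2000};
        double[] y = {60, 70, 88, 105, 118};

        int n = x.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i]; sumY += y[i];
            sumXY += x[i] * y[i]; sumXX += x[i] * x[i];
        }
        double meanX = sumX / n, meanY = sumY / n;

        // Slope b = covariance(x, y) / variance(x); intercept a = meanY - b*meanX.
        double b = (sumXY - n * meanX * meanY) / (sumXX - n * meanX * meanX);
        double a = meanY - b * meanX;

        System.out.printf("price = %.4f + %.4f * size%n", a, b);
        System.out.printf("predicted price for size 1600: %.2f%n", a + b * 1600);
    }
}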
Clustering
Clustering is the process of grouping similar objects together. Clustering is an
example of unsupervised learning. One can use clustering algorithms to segment
data as in classification algorithms. However, classification models are used to
segment data based on previously defined classes that are mentioned in the target,
whereas clustering models do not use any target.
Clustering can be used to group items in a supermarket. For example, butter,
cheese and milk can be placed in the "dairy products" group.
You can use clustering when you want to explore data. Clustering algorithms are
mainly used for natural groupings. There are different categories of clustering.
1. Hierarchical: Hierarchical cluster identifies the cluster within the cluster. A
news article group can further have other groups such as business, politics, and
sports in which each group can still have subgroups. For example, inside sports
news there could be news on baseball sport, news on basketball sport, and so on.
2. Partitional: Partitional clustering creates a fixed number of clusters. The
K-means clustering algorithm belongs to this category. Let us study the K-means
clustering algorithm in detail (a small sketch of K-means follows).
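The following is a minimal Java sketch of K-means on one-dimensional data, assuming k = 2, naive initialization, and a fixed number of iterations; real implementations add convergence checks and better seeding.

import java.util.Arrays;

// Minimal 1-D K-means sketch: assign each point to the nearest centroid,
// then recompute each centroid as the mean of its assigned points.
public class KMeans1D {
    public static void main(String[] args) {
        double[] points = {1.0, 1.5, 2.0, 8.0, 8.5, 9.0};
        double[] centroids = {points[0], points[points.length - 1]}; // naive initialization
        int[] assignment = new int[points.length];

        for (int iter = 0; iter < 10; iter++) {
            // Assignment step: the nearest centroid wins.
            for (int i = 0; i < points.length; i++) {
                assignment[i] = Math.abs(points[i] - centroids[0])
                              <= Math.abs(points[i] - centroids[1]) ? 0 : 1;
            }
            // Update step: recompute centroids as cluster means.
            for (int c = 0; c < 2; c++) {
                double sum = 0; int count = 0;
                for (int i = 0; i < points.length; i++) {
                    if (assignment[i] == c) { sum += points[i]; count++; }
                }
                if (count > 0) centroids[c] = sum / count;
            }
        }
        System.out.println("centroids: " + Arrays.toString(centroids));
        System.out.println("assignment: " + Arrays.toString(assignment));
    }
}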
Collaborative Filtering
Collaborative filtering is a technique used for recommendation. Let us assume there
are two people, A and B. A likes apples and B also likes apples. We can assume that B
has similar likings to A, so we can go ahead and recommend options liked by A to B as well.
History of Collaborative Filtering
The history of collaborative filtering started with Information Retrieval and Information
Filtering.
Information Retrieval
The information retrieval era ran from the 1960s to the 1980s. It is about retrieving
information based on queries/questions, and the contents are mostly static in nature.
Indexes, if built, help with the retrieval of information. For example, a collection of
books can be indexed by the title, author, and summary specified for each book. However,
information requirements do not stay the same and change from time to time. This kind of
information is known as dynamic content. An example of dynamic content is the Google
Search Engine, which provides us with dynamic content based on our search criteria.
Information Filtering
The exponential growth of the Web has led to an explosion of information. So we require
some technique to reduce the information overload and sieve out the information that is
relevant to the users. This is then utilized to build a long-term profile of the users' needs.
Email filtering (filtering spam messages) is an example of information filtering.
Collaborative Filtering
Collaborative filtering is a category of information filtering. It is nothing but predicting
user preferences based on the preferences of a group of users. You can use
collaborative filtering when information needs are more complex than keywords or
topics. Collaborative filtering is concerned with quality and taste. Collaborative filtering
can be defined as Social Navigation. We say that human beings are social animals
owing to their tendency to follow other people's advice or judgment when looking for
information or buying products. This is in a way similar to an ant looking for food
trudging behind other ants.
Algorithms of Collaborative Filtering
Let us discuss a few of the collaborative filtering algorithms. In collaborative
filtering, the important concern is "How to find someone who is similar?" It
involves two steps.
1. Collecting Preferences: For example, the ratings given by three users to the books "Snow Crash" and "Dragon Tattoo":
User | Snow Crash | Dragon Tattoo
John | 5 | 5
Jack | 2 | 5
Jim | 1 | 4
2. Finding Similar Users: Any of the following algorithms can be used to find
similar users.
Euclidean Distance Score
Manhattan Distance (or Cab Driver/taxicab distance)
Pearson Correlation Coefficient
Start by plotting the data as shown below. Here, X represents Dragon Tattoo and Y
represents Snow Crash. (A small sketch of the Euclidean distance score follows.)
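Here is a minimal Java sketch of the Euclidean Distance Score for the ratings in the table above. Following a common convention, the raw distance is converted into a similarity score of the form 1 / (1 + distance), so that identical ratings give 1 and larger distances give values closer to 0; this conversion is an assumption of the sketch, not something stated in the text.

// Euclidean distance score between two users who rated the same two items.
public class EuclideanScore {

    // Similarity = 1 / (1 + Euclidean distance); 1 means identical ratings.
    static double similarity(double[] ratingsA, double[] ratingsB) {
        double sumSq = 0;
        for (int i = 0; i < ratingsA.length; i++) {
            double diff = ratingsA[i] - ratingsB[i];
            sumSq += diff * diff;
        }
        return 1.0 / (1.0 + Math.sqrt(sumSq));
    }

    public static void main(String[] args) {
        // Ratings for (Snow Crash, Dragon Tattoo) from the table above.
        double[] john = {5, 5};
        double[] jack = {2, 5};
        double[] jim  = {1, 4};

        System.out.printf("John vs Jack: %.3f%n", similarity(john, jack));
        System.out.printf("John vs Jim : %.3f%n", similarity(john, jim));
        System.out.printf("Jack vs Jim : %.3f%n", similarity(jack, jim));
    }
}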
Association Rule Mining
Association rule mining is also referred to as market basket analysis. Some also
prefer to call it affinity analysis. It is a data analysis and data mining technique. It is
used to determine co-occurrence relationships among activities performed by
individuals and groups.
Examples
1. It is widely used in retail, wherein the retailer seeks to understand the buying behavior of
customers. This insight is then used to cross-sell or up-sell to the customers.
2. If you have ever bought a book from Amazon, this should sound familiar to you. The
moment you are done selecting and placing the desired book in the shopping cart, pop
comes the recommendation stating that customers who bought book "A" also bought
book "B".
3. Who can forget the urban legend, the very famous beer and diapers example. The
legend goes: there was a retail firm wherein it was observed that when diapers were
purchased, beer was purchased as well by the customer. The retailer cashed in on this
opportunity by stocking beer coolers close to the shelves that housed the diapers. This
was just to make it convenient for the customers to easily pick up both products.
An association rule has two parts: (a) an antecedent (if) and (b) a consequent (then). An
antecedent is an item found in the data. A consequent is an item that is found in
combination with the antecedent.
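To make the antecedent/consequent idea concrete, the following minimal sketch counts the standard support and confidence measures for the rule "if diapers then beer" over a small, made-up set of market baskets.

import java.util.Arrays;
import java.util.List;

// Counts support and confidence for the rule "diapers -> beer"
// over a tiny, made-up set of market baskets.
public class AssociationRuleDemo {
    public static void main(String[] args) {
        List<List<String>> baskets = Arrays.asList(
            Arrays.asList("diapers", "beer", "milk"),
            Arrays.asList("diapers", "beer"),
            Arrays.asList("diapers", "bread"),
            Arrays.asList("milk", "bread"),
            Arrays.asList("beer", "chips")
        );

        int antecedentCount = 0;   // baskets containing the antecedent (diapers)
        int bothCount = 0;         // baskets containing both diapers and beer

        for (List<String> basket : baskets) {
            if (basket.contains("diapers")) {
                antecedentCount++;
                if (basket.contains("beer")) bothCount++;
            }
        }

        double support = (double) bothCount / baskets.size();      // fraction of baskets with both items
        double confidence = (double) bothCount / antecedentCount;  // of the diaper baskets, how many also had beer
        System.out.printf("support = %.2f, confidence = %.2f%n", support, confidence);
    }
}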
Decision Tree
You have narrowed down your choice to either an ice-cream stall or a burger
stall from the gamut of choices available. How about using a decision tree to decide
on the same? Let us look at how we can go about creating a decision tree.
Decision to be made: Either an ice-cream stall or a burger stall.
Payoff: 3500 INR in profit if you put up a burger stall and 4000 INR in profit if
you put up an ice-cream stall.
What are the uncertainties?
There is a 50% chance of you succeeding to make profit with a burger stall
and a 50% chance of you failing at it.
As per the weather forecast, the sky will be overcast and it may drizzle or
pour slightly throughout the week. Keeping this in consideration, there is a 40%
chance of success and a 60% chance of failure with an ice-cream stall.
Let us look at the cost of the raw materials:
For Burger: 700 INR for the burger buns, the fillings, and a
microwave oven to keep it warm.
For Ice-creams: 1000 INR for the cone, the ice-cream, and a freezer
to keep it cold.
Let us compute the expected value as per the formula below:
Expected value for burger = 0.5 × 3500 INR − 0.5 × 700 INR = 1400 INR
Expected value for ice-cream = 0.4 × 4000 INR − 0.6 × 1000 INR = 1000 INR
The choice is obvious. Going by the expected value, you will gain by
putting up a burger stall.
The expected value does not imply that you will make a profit of 1400 INR.
Nevertheless, this amount is useful for decision-making, as it will maximize
your expected returns in the long run if you continue to use this approach.
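As a quick check of the arithmetic above, a minimal sketch:

// Expected value = P(success) * payoff - P(failure) * cost, using the figures above.
public class StallExpectedValue {
    static double expectedValue(double pSuccess, double payoff, double cost) {
        return pSuccess * payoff - (1 - pSuccess) * cost;
    }
    public static void main(String[] args) {
        System.out.println("Burger stall   : " + expectedValue(0.5, 3500, 700) + " INR");
        System.out.println("Ice-cream stall: " + expectedValue(0.4, 4000, 1000) + " INR");
    }
}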
DIAGRAMS
Payouts and probabilities
TV Network payout:
Flat Rate: 500,000 INR
XYZ Movie Company Payout:
Small Box Office : 250,000 INR
Medium Box Office: 600,000 INR
Large Box Office: 800,000 INR
Probabilities:
p(Small Box Office): 0.3
p(Medium Box Office): 0.5
p(Large Box Office): 0.2
For greater understanding, let us create a payoff table.
Decisions | Small Box Office | Medium Box Office | Large Box Office
Sign up with TV Network | 500,000 INR | 500,000 INR | 500,000 INR
Sign up with XYZ Movie Company | 250,000 INR | 600,000 INR | 800,000 INR
Probabilities | 0.3 | 0.5 | 0.2
Let us compute the expected value as per the formula below:
Expected value for TV Network = 0.3 × 500,000 + 0.5 × 500,000 + 0.2 × 500,000
= 500,000 INR.
Expected value for XYZ Movie Company = 0.3 × 250,000 + 0.5 × 600,000
+ 0.2 × 800,000 = 535,000 INR.
Diagrams
What is a Decision Tree?
A decision tree is a decision support tool. It uses a tree-like graph to depict decisions and
their consequences.
The following are the three constituents of a decision tree:
1. Decision nodes: Commonly represented by squares.
2. Chance nodes: Represented by circles.
3. End nodes: Represented by triangles
Where is it used?
Decision trees are commonly used in operations research, specifically in decision analysis. They are
used to zero in on the strategy that is most likely to reach its goals. They can also be used to
compute conditional probabilities.
Advantages of using a Decision Tree:
1. Easy to interpret.
2. Easy to plot even when there is little hard data. If one is aware of even a little data such as
alternatives, probabilities, and costs, the tree can be plotted and can lead to useful insights.
3. Can be easily coupled with other decision techniques.
4. Helps in determining the best, worst, and expected value for a given scenario or
scenarios.
Disadvantages of Decision Trees
1. Requires experience: Business owners and managers should have a certain level of
experience to complete the decision tree. It also calls for an understanding of
quantitative and statistical analytical techniques.
2. Incomplete Information: It is difficult to plot a decision tree without having complete
information of the business and its operating environment.
3. Too much information: Too much information can be overwhelming and lead to
what is called "analysis paralysis" (the paralysis of analysis).
****************************************************