Big data analytics unit 1 book
SYLLABUS
UNIT I: UNDERSTANDING BIG DATA
Introduction to big data - convergence of key trends - unstructured data - industry examples of big data - web analytics - big data applications - big data technologies - introduction to Hadoop - open source technologies - cloud and big data - mobile business intelligence - crowd sourcing analytics - inter and trans firewall analytics.
UNIT II: NOSQL DATA MANAGEMENT
Introduction to NoSQL - aggregate data models - key-value and document data models - relationships - graph databases - schemaless databases - materialized views - distribution models - master-slave replication - consistency - Cassandra - Cassandra data model - Cassandra examples - Cassandra clients.
UNIT III: MAP REDUCE APPLICATIONS
MapReduce workflows - unit tests with MRUnit - test data and local tests - anatomy of MapReduce job run - classic Map-reduce - YARN - failures in classic Map-reduce and YARN - job scheduling - shuffle and sort - task execution - MapReduce types - input formats - output formats.
UNIT IV: BASICS OF HADOOP
Data format - analyzing data with Hadoop - scaling out - Hadoop streaming - Hadoop pipes - design of Hadoop distributed file system (HDFS) - HDFS concepts - Java interface - data flow - Hadoop I/O - data integrity - compression - serialization - Avro - file-based data structures - Cassandra - Hadoop integration.
UNIT V: HADOOP RELATED TOOLS
Hbase - data model and implementations - Hbase clients - Hbase examples - praxis. Pig - Grunt - pig data model - Pig Latin - developing and testing Pig Latin scripts. Hive - data types and file formats - HiveQL data definition - HiveQL data manipulation - HiveQL queries.
CONTENTS
UNIT I UNDERSTANDING BIG DATA
1.1 INTRODUCTION TO BIG DATA
1.1.1 Types of Big Data
1.1.2 Characteristics of Big Data
1.1.3 Advantages of Big Data
1.2 CONVERGENCE OF KEY TRENDS
1.3 UNSTRUCTURED DATA
1.3.1 Structured Vs. Unstructured Data
1.4 INDUSTRY EXAMPLES OF BIG DATA
1.5 WEB ANALYTICS
1.5.1 Types of Web Analytics
1.5.2 Process of Web Analytics
1.5.3 Benefits of Web Analytics
1.6 BIG DATA APPLICATIONS
1.7 BIG DATA TECHNOLOGIES
1.8 INTRODUCTION TO HADOOP
1.8.1 Hadoop Core Components
1.8.2 Key Features and Benefits of Hadoop
1.9 OPEN-SOURCE TECHNOLOGIES
1.10 CLOUD AND BIG DATA
1.10.1 Cloud Computing and Big Data
1.11 MOBILE BUSINESS INTELLIGENCE
1.11.1 Need For Mobile BI
1.11.2 Advantages Of Mobile BI
1.12 CROWD SOURCING ANALYTICS
UNIT I
UNDERSTANDING BIG DATA
Introduction to big data - convergence of key trends - unstructured data - industry examples of big data - web analytics - big data applications - big data technologies - introduction to Hadoop - open source technologies - cloud and big data - mobile business intelligence - crowd sourcing analytics - inter and trans firewall analytics.
1.1 INTRODUCTION TO BIG DATA
Big data is a collection of data that is huge in volume and grows exponentially with time. It is data of such size and complexity that none of the traditional data management tools can store it or process it efficiently. Big data is also data, but of enormous size.
Big data analytics is a process used to extract meaningful insights, such as hidden patterns, unknown correlations, market trends, and customer preferences.
Big data analytics provides various advantages: it can be used for better decision-making and for preventing fraudulent activities, among other things.
1.1.1 Types of Big Data
There are three main types of big data:
* Structured,
* Semi-structured, and
* Unstructured data.
+ Structured data: Structured data is highly organized and typically stored in a database. It can be easily analyzed using traditional data analysis tools and techniques, as it is formatted in a specific way. Examples of structured data include transactional data, customer data, financial data, and inventory data.
+ Semi-structured data: Semi-structured data is a mixture of structured and unstructured data. It has a defined data model, but the data itself may not be fully organized. Examples of semi-structured data include XML and JSON data, log files, and sensor data.
+ Unstructured data: Unstructured data is not organized in any particular way and does not have a defined data model. It can be difficult to analyze using traditional tools. Examples of unstructured data include emails, social media data, images, videos, and text data.
1.1.2 Characteristics of Big Data
Big data can be described by the following characteristics:
+ Volume: The size of big data is enormous. Whether a particular data set is considered big data or not depends on the volume of data.
+ Variety: The next aspect of big data is its variety. Data arrives in many forms, such as PDFs and audio, and this variety of formats complicates storing and analyzing data.
competitors.
1.2 CONVERGENCE OF KEY TRENDS
Several key trends have converged to drive the growth of big data:
+ As more devices become connected to the internet, the amount of data generated is expected to continue to increase.
+ Cloud computing: The widespread adoption of cloud computing has made it easier and more cost-effective for organizations to store and process large amounts of data. Cloud-based big data platforms have become more accessible, allowing businesses to process and analyze large amounts of data without investing in expensive on-premise infrastructure.
+ Machine learning and AI: The growth of machine learning and artificial intelligence (AI) has made it possible to extract insights from large and complex data sets. These technologies can help automate data analysis, identify patterns and trends, and make predictions based on data.
+ Data privacy and security: The importance of data privacy and security has increased significantly in recent years, as data breaches and cyber-attacks have become more common. Big data solutions must ensure that sensitive data is properly protected, and that security protocols are in place to prevent unauthorized access.
+ Data governance: The growing importance of data governance has made it essential for organizations to have policies and procedures in place for managing and using data. This includes ensuring data quality, maintaining data accuracy and consistency, and complying with data privacy regulations.
These key trends have converged to make big data a critical component of modern business operations. As the amount of data continues to grow, organizations will need to adopt new technologies and processes to ensure that they can effectively manage and extract value from their data.
1.3 UNSTRUCTURED DATA
Unstructured data is data that does not have a well-defined data model or structure. It is typically text-based data, but it can also include multimedia data such as images, videos, and audio.
Unstructured data is generated from a variety of sources such as social media, email, mobile devices, sensors, and web logs.
Some examples of unstructured data include:
+ Social media data: Content from platforms such as Facebook, Twitter, and LinkedIn, including videos and other types of media.
+ Emails: Email data, including messages sent and received by individuals.
+ Audio and video files: Audio and video recordings such as phone calls, interviews, and surveillance footage.
One popular application is customer analytics. Retailers, manufacturers and other companies analyze unstructured data to improve customer experience and enable targeted marketing. Sentiment analysis can be done to better understand customers and identify attitudes about products, customer service and corporate brands.
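As a rough illustration of sentiment analysis on unstructured text, the sketch below scores short customer comments by counting words from two small word lists. The word lists, the function names `sentiment_score` and `label`, and the sample comments are all assumptions made for this example; production systems typically rely on much larger lexicons or trained language models.

```python
import re

# Hypothetical, deliberately tiny word lists used only for illustration.
POSITIVE = {"good", "great", "love", "excellent", "fast", "helpful"}
NEGATIVE = {"bad", "poor", "hate", "slow", "broken", "rude"}


def sentiment_score(text: str) -> int:
    """Return (number of positive words) minus (number of negative words)."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives


def label(text: str) -> str:
    """Map the numeric score to a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


comments = [
    "Great product, fast delivery, love it",
    "Support was rude and the app is slow",
    "It arrived on Tuesday",
]
for comment in comments:
    print(label(comment), "-", comment)
```

A word-count approach like this ignores negation and context ("not good" still counts as positive), which is one reason unstructured data usually needs more specialized tooling than structured data.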
1.3.1 Structured Vs. Unstructured Data
The main differences between structured and unstructured data include the type of analysis it can be used for, the schema used, the type of format and the ways it is stored.
Traditional structured data, such as the transaction data in financial systems and other business applications, conforms to a rigid format to ensure consistency in processing and analyzing it.
Sets of unstructured data, on the other hand, can be maintained in formats that aren't uniform.
Structured data is stored in a relational database (RDBMS) that provides access to data points that are related to one another via columns and tables. For example, customer information kept in a spreadsheet and categorized by phone numbers, addresses or other criteria is considered structured data.
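The contrast can be sketched in a few lines of Python. The example below is only an illustration: the `customers` table, its columns, the sample rows, and the free-text note are invented for this sketch. Structured data is queried by column with SQL, while the same kind of fact has to be dug out of unstructured text with pattern matching.

```python
import re
import sqlite3

# Structured: rows with a fixed schema in a relational table (in-memory SQLite).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT, phone TEXT)"
)
conn.executemany(
    "INSERT INTO customers (name, city, phone) VALUES (?, ?, ?)",
    [
        ("Asha", "Chennai", "555-0101"),
        ("Ravi", "Madurai", "555-0102"),
        ("Meena", "Chennai", "555-0103"),
    ],
)
# Because the schema is known, a precise question is a one-line query.
chennai_customers = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM customers WHERE city = ? ORDER BY name", ("Chennai",)
    )
]
print(chennai_customers)

# Unstructured: free text with no schema; facts must be extracted heuristically.
note = "Spoke with Asha from Chennai today; she asked us to call back on 555-0101 next week."
phones_in_note = re.findall(r"\b\d{3}-\d{4}\b", note)
print(phones_in_note)
```

The regular expression only works because the phone format is assumed in advance; real unstructured data rarely offers that guarantee.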
1.4 INDUSTRY EXAMPLES OF BIG DATA
There are many industries that are using big data to drive innovation and improve business operations.
Retail Industry:
+ Retailers analyze big data to identify patterns, which can be used to drive growth.
+ One example of big data in the retail industry is Amazon. Amazon uses big data to personalize the customer experience, by recommending products based on customer behavior. This helps to increase customer engagement and loyalty, which in turn drives sales and revenue.
+ Another example of big data in the retail industry is Walmart. Walmart uses big data to optimize its supply chain operations, by analyzing data from suppliers, distributors, and its own stores.
+ This allows Walmart to better forecast demand, optimize inventory levels, and reduce waste, resulting in cost savings and improved operational efficiency.
Healthcare Industry:
+ The healthcare industry is one of the fastest-growing industries for big data, as it generates and manages vast amounts of data from various sources such as electronic health records (EHRs), medical imaging, and genomics.
+ By analyzing this data, healthcare organizations can gain insights into patient health, disease diagnosis and treatment, and operational efficiency. This helps organizations improve patient outcomes and reduce costs.
+ Watson Health analyzes vast amounts of patient data, including medical records, lab results, and imaging data, to provide clinicians with personalized treatment recommendations and insights into disease trends.
+ Another example of big data in healthcare is Pfizer. Pfizer uses big data to accelerate drug discovery and development, by analyzing vast amounts of data, including operational data.
+ This allows Pfizer to identify new drug targets and develop more effective treatments, while also improving operational efficiency and reducing costs.
Financial Services Industry:
+ The financial services industry is another industry that generates and manages vast amounts of data, including financial transactions, market data, and customer data.
+ By analyzing this data, financial institutions can gain insights into market trends and risk management, which can be used to inform business decisions and drive revenue growth.
+ One example of big data in financial services is Mastercard. Mastercard uses big data to identify fraud in real time, by analyzing vast amounts of transaction data from its global network.
+ This allows Mastercard to detect fraudulent activity and alert cardholders and merchants before any fraudulent transactions are processed, reducing financial losses and improving the customer experience.
1.5 WEB ANALYTICS
Web analytics data can be used to improve website design, user experience, and marketing efforts.
Data is typically collected using a web analytics tool such as Google Analytics, which tracks visitor activity on the website. The data collected can provide metrics such as:
+ Pageviews: The number of times a specific page on the website is viewed.
+ Unique visitors: The number of unique individuals who visit the website in a given period of time.
+ Bounce rate: The percentage of visitors who leave the website after viewing only one page.
+ Session duration: The length of time that visitors spend on the website.
+ Conversion rate: The percentage of visitors who complete a specific goal, such as filling out a contact form or making a purchase.
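The metrics above can be computed directly from session records. The following is a minimal sketch, assuming each session is a dictionary with a visitor ID, the list of pages viewed, the time spent in seconds, and a flag for whether the goal was completed; this record layout and the function name `web_metrics` are assumptions for illustration, not the export format of any particular analytics tool.

```python
from collections import Counter

# Hypothetical session records; real tools export richer, tool-specific data.
sessions = [
    {"visitor": "v1", "pages": ["/home", "/pricing", "/signup"], "seconds": 180, "converted": True},
    {"visitor": "v2", "pages": ["/home"], "seconds": 15, "converted": False},
    {"visitor": "v1", "pages": ["/blog"], "seconds": 45, "converted": False},
    {"visitor": "v3", "pages": ["/home", "/blog"], "seconds": 120, "converted": False},
]


def web_metrics(sessions):
    """Compute the basic web analytics metrics from a list of session records."""
    visitors = {s["visitor"] for s in sessions}
    converted_visitors = {s["visitor"] for s in sessions if s["converted"]}
    return {
        # Views per page, counted across all sessions.
        "pageviews": Counter(page for s in sessions for page in s["pages"]),
        "unique_visitors": len(visitors),
        # Share of sessions that viewed exactly one page.
        "bounce_rate": sum(len(s["pages"]) == 1 for s in sessions) / len(sessions),
        # Mean time on site per session, in seconds.
        "avg_session_seconds": sum(s["seconds"] for s in sessions) / len(sessions),
        # Share of unique visitors who completed the goal at least once.
        "conversion_rate": len(converted_visitors) / len(visitors),
    }


metrics = web_metrics(sessions)
for name, value in metrics.items():
    print(name, value)
```

Bounce rate is computed here per session and conversion rate per unique visitor; analytics tools differ on these denominators, so the choice should be stated whenever figures are compared.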
1.5.1 Types of Web Analytics
There are two main types of web analytics: on-site analytics and off-site analytics.
On-site analytics:
+ On-site analytics tracks user behavior on a specific website. It collects data on website traffic, pageviews, bounce rates, conversion rates, and other metrics.
+ On-site analytics tools include Google Analytics, Adobe Analytics, and Piwik.
+ On-site analytics data can be used to identify user behavior patterns, popular pages, and areas for improvement on the website. It can also be used to optimize website design, user experience, and marketing efforts.
Off-site analytics:
+ Off-site analytics tracks traffic from external sources, such as search engines, social media, and referral sites. It collects data on the number of visitors, referral sources, and user behavior on the website.
+ Off-site analytics tools include SEMrush, Ahrefs, and SimilarWeb. Off-site analytics data can be used to track the effectiveness of marketing efforts and identify popular referral sources.
1.5.2 Process of Web Analytics
The process of web analytics involves:
+ Setting business goals: Defining the key metrics that will determine the success of your business and website.
+ Collecting data: Gathering information, statistics, and data on website visitors using analytics tools.
+ Processing data: Converting the raw data you've gathered into meaningful ratios, KPIs, and other information that tell a story.
+ Reporting data: Displaying the processed data in an easy-to-read format.
+ Developing an online strategy: Creating a plan to optimize the website experience to meet business goals.
+ Experimenting: Doing A/B tests to determine the best way to optimize website performance.
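The experimenting step can be made concrete with a small calculation. One common way to judge an A/B test on conversion rates is a two-proportion z-test; the sketch below assumes that approach, and the function name `ab_test` and the visitor and conversion counts are made up for illustration.

```python
import math


def ab_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-proportion z-test: return (z, two-sided p-value) for B versus A."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pooled conversion rate under the hypothesis that A and B perform the same.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: page A converts 120 of 2400 visitors, page B 156 of 2400.
z, p_value = ab_test(120, 2400, 156, 2400)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

A small p-value (conventionally below 0.05) suggests the difference between the two page versions is unlikely to be chance alone; the threshold itself is a convention, not part of the test.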
1.5.3 Benefits of Web Analytics
+ Understanding visitor behavior: Web analytics provides insights into how visitors are interacting with the website, which can help identify areas for improvement.
+ Improving website design: By analyzing visitor behavior, website owners can optimize the website design to improve the user experience.
+ Measuring marketing performance: Web analytics data can be used to track the effectiveness of online marketing campaigns and make adjustments to improve their performance.
+ Increasing website traffic: By identifying popular pages and optimizing website content, website owners can increase website traffic and engagement.
1.6 BIG DATA APPLICATIONS
Big data has numerous applications across various industries, including:
+ Healthcare: Big data is used in healthcare to improve patient outcomes, reduce costs, and optimize treatment plans. Healthcare providers use big data to analyze patient data, predict disease outbreaks, and improve diagnostic accuracy.
+ Transportation: Big data is used to track vehicle performance and optimize routing.
+ Marketing: Marketers use big data to analyze customer behavior, improve the customer experience, and optimize marketing campaigns. Big data helps marketers identify target audiences, track customer behavior, and optimize marketing strategies.
designed for
large amounts of
or user acti
1.7 BIG DATA TECHNOLOGIES
There are several big data technologies and tools that are commonly used to store, process, and analyze large and complex data sets. Some of the popular big data technologies include:
+ Hadoop: Hadoop is an open-source framework used for distributed storage and processing of large data sets across clusters of commodity hardware. It is designed to handle large and complex data sets and can scale up or down as needed.
+ Apache Beam: Apache Beam is an open-source unified programming model for both batch and streaming data processing, and can run on various big data processing engines, including Apache Spark, Apache Flink, and Google Cloud Dataflow.
+ Elastic Stack: Elastic Stack (formerly known as ELK stack) is an open-source software stack used for search, analytics, and visualization of large data sets. It includes Elasticsearch, Logstash, and Kibana.
+ Apache NiFi: Apache NiFi is an open-source data integration tool used for ingesting, processing, and distributing data across various systems. It is often used in data lakes and data hubs.
1.8 INTRODUCTION TO HADOOP
Hadoop is an open-source framework, written in the Java language, for the storage and distributed processing of huge volumes of data.
Hadoop was derived from white papers such as Google MapReduce and Google File System (GFS).
It is designed to scale up from single servers to thousands of computers. The Hadoop platform also takes care of job scheduling and resource management.
Hadoop can be viewed as a cluster of computers that consists of one master node and many worker nodes.
The master node schedules the tasks and the workers are responsible for performing the execution of the map and reduce tasks.
Hadoop can be deployed in three modes:
+ Standalone mode: Used for debugging in a single node environment. Hadoop can be installed on a single node as a single standalone instance.
+ Pseudo-distributed mode: This is a single-node Java system that runs the entire Hadoop cluster on one machine.
+ Fully Distributed mode: A Hadoop cluster can be formed by connecting multiple nodes of commodity hardware.
The two versions of Hadoop: Hadoop 1.x and Hadoop 2.x.
1. Hadoop 1.x:
+ It supports the MapReduce model only.
+ Non-MapReduce tools are not supported.
+ It is less scalable than the Hadoop 2.x version since it supports fewer nodes per cluster.
+ In Hadoop 1.x, MapReduce is responsible for both data processing and cluster resource management.
2. Hadoop 2.x:
+ It supports the MapReduce model as well as other tools such as Spark.
+ It can scale up to more nodes per cluster than Hadoop 1.x.
+ Cluster resource management is handled by YARN, separately from data processing.
+ It allows running other frameworks on top of HDFS (Hadoop Distributed File System) using the YARN API.
+ MRv2 is a next-generation MapReduce framework that runs within YARN.
Figure 1.1 Hadoop 1.x version: MapReduce (resource management and data processing) on top of HDFS (file storage)
Figure 1.2 Hadoop 2.x version: data processing frameworks on top of YARN (resource management) on top of HDFS (file storage)
1.8.1 Hadoop Core Components
Component | Task performed
Hadoop Common | Common utilities, i.e. the Java libraries and files used by other components such as HDFS, YARN, and MapReduce for running the Hadoop cluster.
HDFS (storage layer) | It allows the storage of a huge volume of data across multiple nodes. Data is stored in the form of blocks and is distributed across the cluster.
Hadoop YARN (resource management layer) | It is responsible for job scheduling and resource management.
MapReduce (data processing layer) | Parallel processing of huge datasets.
1. Hadoop Distributed File System (HDFS)
+ HDFS is the file system of Hadoop.
+ It is an open-source implementation of the distributed Google File System.
+ It works with huge datasets or files. HDFS splits the data into block-sized chunks.
+ The default block size of HDFS is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x.
+ Users can configure the block size as per the requirement.
+ Storing small files in HDFS leads to a wastage of memory.
+ Data is stored in HDFS in two forms, actual data and metadata.
+ The actual data is stored in DataNodes and metadata is stored in the NameNode.
+ Metadata includes the timestamp, file size, and location of blocks.
+ HDFS ensures data availability by replicating blocks across nodes.
Features of HDFS
+ Provides distributed storage
+ Can be implemented on commodity hardware
+ Provides data security
+ Highly fault-tolerant: if one machine goes down, its data remains available from another machine
2. Hadoop YARN
+ Hadoop YARN stands for Yet Another Resource Negotiator. It is the resource management unit of Hadoop and is available as a component of Hadoop version 2.
+ Hadoop YARN acts like an OS to Hadoop. It is a resource management layer that sits on top of HDFS.
+ It is responsible for managing cluster resources to make sure you don't overload one machine.
+ It performs job scheduling to make sure that the jobs are scheduled in the right place.
+ In the second version of Hadoop, with YARN, the two major functions of the Job Tracker have been split into (1) a global Resource Manager and (2) a per-application Application Master.
+ Cluster resource management and job scheduling are separated into two components.
+ The main components of YARN architecture are the resource manager, node manager, application master, and containers.
3. MapReduce
+ MapReduce is the data processing layer of Hadoop.
+ It is a software framework for writing applications that process the vast amount of structured and unstructured data stored in the Hadoop Distributed File System (HDFS).
+ It processes huge amounts of data in parallel by dividing the submitted job into a set of independent tasks.
+ In Hadoop, MapReduce works by breaking the processing into phases: Map and Reduce.
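The Map and Reduce phases can be illustrated without a cluster. The sketch below imitates the MapReduce flow for a word count in plain Python, run in a single process: it is not Hadoop's Java API, and the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are chosen for this example only.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle and sort: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    mapped = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big", "data lake"])
print(counts)
```

In a real cluster the map calls run in parallel on the nodes that hold the input blocks, and the shuffle moves each key's values across the network to the reducer responsible for that key; the data flow is the same as in this single-process version.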
1.8.2 Key Features and Benefits of Hadoop
+ Scalability: Hadoop is designed to be highly scalable and can easily handle growing data sets by adding additional nodes to the cluster.
+ Flexibility: Hadoop can work with a variety of data types. It can also integrate with a wide range of tools and technologies, such as Apache Storm, ETL tools, and BI platforms.
+ Community support: Hadoop is an open-source platform that is supported by a large and active community of developers and users. This provides access to a wealth of resources, such as documentation, tutorials, and support forums.
1.9 OPEN-SOURCE TECHNOLOGIES
Open-source technologies refer to software or computer programs that have their source code available to the public, allowing anyone to access, modify, and distribute the code.
The open-source movement took shape in the late 1990s and has since become a significant force in the software industry.
Open-source technologies offer flexibility and customization. They also provide opportunities for collaboration, since developers from around the world can contribute to their improvement.
In the context of big data, open-source technologies refer to software tools and platforms that are freely available and enable processing, analysis, and storage of large volumes of data.
These technologies have become an essential part of the big data ecosystem because they provide scalable and cost-effective solutions for managing large and complex data sets.
Some popular open source big data technologies include:
+ Hadoop: A distributed computing platform that allows for the storage and processing of large data sets across clusters of computers.
+ Spark: A fast and general-purpose data processing engine that can handle both batch and real-time processing.
+ Cassandra: A NoSQL database that is designed to handle large amounts of data across multiple servers.
+ Elasticsearch: A distributed search and analytics engine that can quickly and easily search large amounts of data.
+ Kafka: A distributed streaming platform that can handle real-time data streams.
These technologies benefit from ongoing development and community contributions, which add new features.
1.10 CLOUD AND BIG DATA
Infrastructure as a service (IaaS)
Applications or software as a service (SaaS)
ity to our database
Examples of PaaS are Windows Azure and Google App Engine (GAE).
Examples of SaaS are Salesforce.com, Dropbox, Google Drive, etc.
Cloud for Big Data
Below are some examples of how cloud applications are used for big data:
IaaS in a public cloud:
Using a cloud provider's infrastructure for big data
the need to analyze the customer's voice,
| media data.
of businesses
and planning.
corporation that employs hundreds.
Providers in the Big Data Cloud Market
Large software vendors have launched, or are in the process of launching, cloud offerings, as have many startups. Here we have a list of major vendors of cloud computing.
A few of the cloud providers are as follows:
+ Amazon is the leading cloud provider amongst all.
+ Microsoft's cloud platform is called Azure.
+ IBM's offerings include Smart Business Storage Cloud and Computing on Demand (CoD).
+ AT&T provides Synaptic Storage and Synaptic Compute as a service.
Platform as a Service cloud computing companies
+ Google's App Engine is a development platform that is built upon Python and Java.
+ Salesforce.com provides a development platform that is based upon Apex.
+ Microsoft Azure provides a development platform based upon .Net.
Software as a Service companies
+ In SaaS, Google provides space that includes Google Docs, Gmail, Google Calendar and Picasa.
+ IBM provides LotusLive, whose features include calendaring capabilities.
Issues in Using Cloud Services
Some important cloud services issues are as listed:
Data Security
‘© Organizations must ensure that their agreement wi
to take advantage of a cli
company’s information
ied wherever possible.
+ Exceptions must be clearly noted. Service-level agreements should clearly define the terms and conditions between a service user and a provider to ensure proper performance.
+ Cloud services must be compatible with the compliance needs of the business. Some companies are also concerned about regulatory issues.
+ Market observers say that around 50 percent of people worry that they will be tied to one provider of cloud storage.
Legal Issues
+ An organization must ensure that the location of the physical resources of the cloud does not bring any legal issue.
+ The cloud presents a number of legal challenges.
+ Organizations should be aware of all
cloud, and use the jed manner as eloud offers pay
~asper usage method of the cost incurred by the company,Lu
1.11 MOBILE BUSINESS INTELLIGENCE
mobile BI is able to brn,
Hons © Myset when done Properly ©
snagement PO:
Joser to HE
_ in the airport departure lounge o,
.d almost anywhere ang
with mobile BI.
Mobile BI, driven by the success of mobile devices, was considered a big wave in BI and analytics a few years ago. Nowadays, there is a level of disillusionment in the market and users attach much less importance to this trend.
BI delivers relevant and trustworthy information to the right person at the right time. Mobile business intelligence is the transfer of business intelligence from the desktop to mobile devices such as the BlackBerry, iPad, and iPhone.
The ability to access analytics and data on mobile devices or tablets rather than desktop computers is referred to as mobile business intelligence. The business metric dashboard and key performance indicators (KPIs) are more clearly displayed.
As the use of mobile devices has risen, so has the technology that we use in our daily lives to make our lives easier, including in business. Many businesses have benefited from mobile business intelligence.
This section is a guide for business owners and others on the benefits and pitfalls of mobile BI.
1.11.1 Need For Mobile BI
+ Mobile phones' data storage capacity has grown in tandem with their use.
+ You are expected to make decisions and act quickly in this fast-paced environment.
+ The number of businesses receiving assistance in such a situation is growing by the day.
+ To expand your business or boost your business productivity, mobile BI can help, and it works with both small and large businesses.
+ Mobile BI can help you whether you are a salesperson or a CEO.
+ There is a high demand for mobile BI in order to reduce information time and use that time for quick decision-making.
+ As a result, timely decision-making can boost customer satisfaction and improve an enterprise's reputation among its customers.
+ It also aids in making quick decisions in the face of emerging risks.
1.11.2 Advantages Of Mobile BI
Simple access
+ Mobile BI is not restricted to a single mobile device or a certain place. You can view your data at any time and from any location.
+ Having real-time visibility into a firm improves production and the daily efficiency of the business.
+ Obtaining a company's perspective with a single click simplifies the process.
Competitive advantage
+ Many firms are seeking better and more responsive methods to do business in order to stay ahead of the competition.
+ Easy access to real-time data improves company opportunities and raises sales and capital.
+ This also aids in making the necessary decisions as market conditions change.
Simple decision-making
+ As previously stated, mobile BI provides access to real-time data at any time and from any location.
+ Mobile BI provides information on demand.
+ This assists consumers in obtaining what they require at the time.
+ As a result, decisions are made quickly.
Increase Productivity
+ By extending BI to mobile, the organization's teams can access critical company data when they need it.
1.12 CROWD SOURCING ANALYTICS
Crowdsourcing involves obtaining information or resources from a wide swath of people.
+ Crowdsourcing lets businesses farm out work to people anywhere in the country or around the world.
+ The advantages of crowdsourcing include the ability to work with people who have the needed skills.
+ While crowdsourcing seeks information or workers' output, crowdfunding solicits money or resources to help support individuals.
Advantages
+ A way of solving time-intensive problems
+ Deeper engagement by communities, who resonate and build loyalty to the product or solution
+ Increased productivity
Disadvantages
+ Results can be easily skewed based on the crowd being sourced
+ Lack of confidentiality or ownership of an idea
+ Risk of falling short of the goal or purpose
1.12.2 Types of Crowdsourcing
Wisdom of the crowd:
+ It is a collective opinion of different individuals gathered in a group.
+ This type is used for decision-making since it allows one to find the best solution for problems.
+ Many brands pay attention to the collective opinion of their customers.
Crowd voting:
+ In this type of crowdsourcing, consumers can choose between options created by experts or products created by consumers.
+ Consumers vote to identify the best one.
Crowdfunding:
+ It is when people collect money or ask for investments for charities and projects.
+ People do it voluntarily.
+ Often, companies gather money to help individuals and families suffering from natural disasters, poverty, social problems, etc.
1.13 INTER AND TRANS FIREWALL ANALYTICS
1.13.1 Inter Firewall Analytics
+ Inter Firewall Analytics is a type of security analytics that focuses on monitoring and analyzing traffic flowing between different zones or segments of a network that are separated by firewalls.
+ The goal is to identify and prevent potential threats that may be hidden in the traffic.
+ Firewalls are a common security measure used to control traffic flow between different segments of a network, such as between an internal network and the internet or between different departments within a company.
+ However, firewalls alone cannot provide complete protection against all threats, especially those that may be hiding within the allowed traffic.
+ Inter Firewall Analytics involves deploying specialized tools and techniques to monitor the traffic passing through the firewalls and analyzing it for signs of potential threats.
+ This can include detecting anomalies in the traffic patterns, identifying unusual or unauthorized access attempts, and flagging suspicious activity.
Techniques used in Inter Firewall Analytics include:
+ Packet capture and analysis: This involves capturing and analyzing network traffic to identify threats, such as malware or suspicious behavior.
+ Behavioral analysis: This involves analyzing the behavior of network traffic to identify anomalies or patterns that may indicate a potential threat.
+ Machine learning: This involves using machine learning algorithms to analyze traffic and identify patterns that may be indicative of a threat, so that threats can be stopped before they can cause harm.
By monitoring and analyzing traffic at the network level, Inter Firewall Analytics helps organizations identify and respond to potential threats more quickly and effectively.
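Detecting anomalies in traffic patterns can be as simple as comparing new measurements against a statistical baseline. The sketch below assumes one basic approach, a z-score threshold on connections per minute between two network segments; the function name `find_anomalies`, the threshold of 3 standard deviations, and all traffic counts are illustrative assumptions rather than the method of any specific security product.

```python
from statistics import mean, stdev


def find_anomalies(baseline, observed, threshold=3.0):
    """Return the (label, count) observations that deviate from the baseline
    mean by more than `threshold` sample standard deviations."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline gives no scale to measure deviation against.
        return []
    return [(label, count) for label, count in observed
            if abs(count - mu) / sigma > threshold]


# Hypothetical connections per minute seen during normal operation.
baseline = [100, 104, 98, 101, 97, 103, 99, 102]
# New measurements to check, labelled by minute.
observed = [("10:00", 102), ("10:01", 250), ("10:02", 96)]

anomalies = find_anomalies(baseline, observed)
print(anomalies)
```

Behavioral and machine learning techniques generalize this idea: they learn a richer model of normal traffic and flag departures from it.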
1.13.2 Trans Firewall Analytics
Trans Firewall Analytics is a type of network security analytics that focuses on monitoring and analyzing network traffic passing through an organization's perimeter firewall(s).
The main purpose of trans firewall analytics is to identify and prevent network threats and attacks such as malware, viruses, hacking attempts, and other types of cyber threats that try to penetrate an organization's network.
Trans Firewall Analytics involves monitoring and analyzing network traffic logs generated by the firewall.
These logs contain information about the source and destination IP addresses, the protocols and ports used, the size of the packets, and other network traffic metadata.
By analyzing these logs, security analysts can detect patterns of network traffic that indicate a potential threat or attack.
+ Traffic analysis: These tools use advanced algorithms and machine learning to analyze patterns of network traffic that may indicate a security breach.
+ Alerting and reporting: These tools generate alerts and reports when a threat or attack is detected, giving analysts the information they need to take action.
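A small log-analysis sketch shows how the fields listed above (source and destination addresses, protocol, ports, packet size) can be turned into an alert. The log line format, the `LOG_PATTERN` expression, the function names, and the rule "three or more denied connections from one source" are all assumptions for illustration; real firewalls each have their own log formats, and the IP addresses used are reserved documentation addresses.

```python
import re
from collections import Counter

# Hypothetical log format: time, action, protocol, source -> destination, size.
LOG_PATTERN = re.compile(
    r"(?P<time>\S+) (?P<action>ALLOW|DENY) (?P<proto>\w+) "
    r"(?P<src>[\d.]+):(?P<sport>\d+) -> (?P<dst>[\d.]+):(?P<dport>\d+) "
    r"bytes=(?P<size>\d+)"
)


def parse(line):
    """Parse one log line into a dict of fields, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def suspicious_sources(lines, min_denied=3):
    """Return source IPs with at least `min_denied` denied connections."""
    denied = Counter()
    for line in lines:
        record = parse(line)
        if record and record["action"] == "DENY":
            denied[record["src"]] += 1
    return sorted(ip for ip, count in denied.items() if count >= min_denied)


log_lines = [
    "2024-01-01T10:00:01 DENY TCP 203.0.113.7:51514 -> 10.0.0.5:22 bytes=60",
    "2024-01-01T10:00:02 DENY TCP 203.0.113.7:51515 -> 10.0.0.5:23 bytes=60",
    "2024-01-01T10:00:03 ALLOW TCP 198.51.100.4:40000 -> 10.0.0.8:443 bytes=1500",
    "2024-01-01T10:00:04 DENY TCP 203.0.113.7:51516 -> 10.0.0.5:3389 bytes=60",
    "2024-01-01T10:00:05 DENY UDP 192.0.2.9:5353 -> 10.0.0.9:53 bytes=80",
]

flagged = suspicious_sources(log_lines)
print(flagged)
```

Here one source is denied on three different ports in quick succession, a pattern consistent with port scanning; a reporting tool would raise an alert with the flagged address and the matching log lines.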
Metrics collected: IP addresses, ports, protocols
Analysis techniques: traffic analysis, anomaly detection, signature-based detection
Benefits: enhanced network security, early detection of potential threats, improved incident response; better understanding of web traffic, improved detection and prevention of web-based attacks
Challenges: complexity of traffic analysis, difficulty in identifying attacks that span multiple flows; overwhelming volume of traffic, limited visibility
Tools and technologies: firewall logs, network traffic analysis tools, SIEM systems
/ESTION AND ANSWERS:
‘TWO MARKS QU!
What you mean by bigdata? :
ta is a collection of data that is huge in volume, yet gro
L
«Big Dat
exponentially with time.
«eisdata with so large a size and complexity that none ofthe traditional dq,
management tools can store it or process it efficiently.
2. Name the types of Bigdata.
‘There are three main types of big data:
Structured,
Semi-structured, and
Unstructured data.
3. List out the characteristics of big data.
Big data can be described by the following characteristics:
i. Volume
ii. Variety
iii. Velocity
iv. Variability
4. What are the advantages of big data?
Big data has several advantages for businesses and organizations, including:
• Improved decision-making
• Enhanced customer insights
• Improved operational efficiency
• New revenue opportunities
• Competitive advantage
5. What do you mean by unstructured data?
• Unstructured data is data that does not have a well-defined data model or
structure.
• It is typically text-based data, but it can also include multimedia data such
as images, videos, and audio.
• Unstructured data is generated from a variety of sources such as social media,
email, mobile devices, sensors, and web logs.
Some examples of unstructured data include:
• Social media data
• Emails
• Web content
• Audio and video files
6. Difference between structured and unstructured data.
Structured Data | Unstructured Data
Data that has a clearly defined schema and is easily searchable and organized | Data that has no clear structure or schema and is often difficult to search and organize
Relational databases, spreadsheets | Text documents, social media posts, images, videos
Typically stored in a tabular format | Can be stored in a variety of formats such as text, JSON, XML, binary, etc.
Can be processed using traditional data processing tools like SQL | Often requires specialized tools such as natural language processing, machine learning, and computer vision
Typically smaller in size and easier to manage | Can be extremely large and difficult to manage
Changes to structured data are often slow and predictable | Changes to unstructured data can be rapid and unpredictable
Structured data is usually uniform in format and easier to analyze | Unstructured data can be very diverse in format and harder to analyze
Structured data is valuable for traditional data analysis and reporting | Unstructured data is valuable for gaining insights into customer sentiment, social media trends, and other areas where traditional data may not provide enough context
7. Define Web Analytics.
• Web analytics is the process of analyzing and measuring website traffic and
user behavior to improve the overall effectiveness of a website.
• It involves the measurement and analysis of website data to understand user
behavior.
• Web analytics helps website owners make data-driven decisions to optimize
the website and improve user experience.
• It involves various techniques such as data collection, analysis, and web
metrics to gain insights into website traffic and website performance.
8. What data metrics are collected in web analytics?
The data collected can include metrics such as:
• Pageviews
• Session duration
• Conversion rate
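The three metrics listed above can be computed from raw session records, as in the minimal Python sketch below. The record layout (`pages`, `duration_s`, `converted`), the sample values, and the function name are assumptions made for illustration, not the output of any particular analytics tool.

```python
# Hypothetical session records (assumed shape): pages viewed in the session,
# session length in seconds, and whether the visitor completed a goal.
SESSIONS = [
    {"pages": 3, "duration_s": 120, "converted": True},
    {"pages": 1, "duration_s": 30, "converted": False},
    {"pages": 5, "duration_s": 300, "converted": False},
    {"pages": 2, "duration_s": 150, "converted": True},
]


def summarize(sessions):
    """Compute pageviews, average session duration, and conversion rate."""
    total = len(sessions)
    if total == 0:
        raise ValueError("at least one session is required")
    pageviews = sum(s["pages"] for s in sessions)
    avg_duration = sum(s["duration_s"] for s in sessions) / total
    conversion_rate = sum(1 for s in sessions if s["converted"]) / total
    return {
        "pageviews": pageviews,
        "avg_session_duration_s": avg_duration,
        "conversion_rate": conversion_rate,
    }


print(summarize(SESSIONS))
```

Here the conversion rate is the share of sessions that ended in a conversion. Some tools divide by unique visitors instead, so the denominator should be stated whenever the metric is reported.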
9. List out the types of web analytics.
There are two main types of web analytics.

• Improving website design
• Measuring marketing performance
• Increasing website traffic
11. List out some applications of big data.
Here are some applications of big data:
• Business Analytics
• Healthcare
12. Name some big data technologies.
Here are some popular Big Data technologies:
• Hadoop
• Spark
• Hive
• HBase
• MongoDB
• Zookeeper
13. What is Hadoop?
• Hadoop is an open-source, distributed processing framework that enables
the storage and processing of large volumes of data on a cluster of commodity
hardware.
• It provides a scalable and fault-tolerant platform for processing big data.
• Hadoop consists of two main components: Hadoop Distributed File System
(HDFS) and Yet Another Resource Negotiator (YARN).
14. Explain the core components of Hadoop.
Hadoop is an open-source framework intended to store and process big data in
a distributed manner.
Hadoop's Essential Components
1. HDFS (Hadoop Distributed File System): Hadoop's key storage system is
HDFS. The extensive data is stored on HDFS. It is mainly devised for storing
massive datasets on commodity hardware.
2. Hadoop MapReduce: The layer of Hadoop responsible for data processing.
In simple terms, there are two stages of processing: Map and Reduce. In the Map
stage, data is distributed to the executors (compute nodes/containers) for
processing. Reduce is the stage where all processed data is collected and collated.
3. YARN: The framework used for processing in Hadoop is YARN. It handles
resource management and supports multiple data processing engines, i.e.
real-time streaming, data science, and batch processing are done through YARN.
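The Map and Reduce stages described above can be illustrated with a word count, the usual introductory example. The Python sketch below runs on a single machine and only imitates the data flow. It does not use the Hadoop API, and the function names are chosen for illustration. The intermediate `shuffle` step, which groups values by key between the two stages, is performed by the framework in real Hadoop.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collate the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    mapped = []
    for document in documents:  # in Hadoop, each split is mapped in parallel
        mapped.extend(map_phase(document))
    return reduce_phase(shuffle(mapped))


counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

Because each document is mapped independently, the Map stage can be spread over many nodes. Only the Reduce stage needs to see all values that share a key.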
15. Explain the features of Hadoop.
Hadoop assists not only in storing data but also in processing big data. It is the most
reliable way to handle significant data hurdles. Some salient features of Hadoop are:
1. Distributed Processing: Hadoop helps in distributed processing of data,
i.e. quicker processing. In Hadoop HDFS, the data is stored in a
distributed manner and processed in parallel, and MapReduce is
responsible for this processing.
2. Open Source: Hadoop is free of cost as it is an open-source
framework. Changes are allowed in the source code as per the user's
requirements.
3. Fault Tolerance: Hadoop is highly fault-tolerant. By default, for every
block, it creates three replicas at distinct nodes. This number of replicas can
be modified according to the requirement. So, we can retrieve the data from
a different node if one of the nodes fails. The discovery of node failure and
restoration of data is done automatically.
4. Scalability: It is compatible with different hardware, and new devices can
be added to the cluster promptly.
5. Reliability: The data in Hadoop is stored on the cluster in a safe manner
that is independent of any single machine. So, the data stored in the Hadoop
ecosystem does not get affected by any machine breakdowns.
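The fault-tolerance behaviour described above (three replicas per block, read from another node when one fails) can be sketched as follows. This is a deliberately simplified model: the round-robin placement, the node names, and both function names are assumptions for illustration. Real HDFS uses a rack-aware placement policy managed by the NameNode.

```python
def place_replicas(block_index, nodes, replication=3):
    """Pick `replication` distinct nodes for one block (round-robin, simplified)."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication factor")
    return [nodes[(block_index + i) % len(nodes)] for i in range(replication)]


def read_block(replica_nodes, failed_nodes):
    """Return the first healthy node holding the block, or None if all failed."""
    for node in replica_nodes:
        if node not in failed_nodes:
            return node
    return None


nodes = ["node1", "node2", "node3", "node4"]
replicas = place_replicas(0, nodes)
print(replicas)

# node1 has failed, so the read is served by the next replica.
print(read_block(replicas, failed_nodes={"node1"}))
```

With three replicas a block stays readable as long as at least one of its three nodes is healthy. Raising the replication factor buys more tolerance at the cost of more storage.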
16. What do you mean by HDFS?
• Hadoop Distributed File System (HDFS) is a distributed file system that is
designed to run on commodity hardware.
• It is a core component of the Hadoop framework and provides a distributed
storage system for large data sets.
• HDFS is designed to handle very large files with streaming data access
patterns, and to provide high-throughput access to data.
17. What do you mean by YARN?
• YARN (Yet Another Resource Negotiator) is one of the core components of
Hadoop, responsible for managing resources and scheduling tasks across a
Hadoop cluster.
• It was introduced in Hadoop 2.x as an improvement over the earlier version
of MapReduce, which suffered from scalability and flexibility issues.
• YARN splits the resource management and job scheduling/monitoring functions
of classic MapReduce into separate daemons: a global ResourceManager (RM) and
a per-application ApplicationMaster (AM), with a NodeManager (NM) on each
node that launches and monitors containers.
18. List out the benefits of Hadoop.
Hadoop offers several benefits in the world of big data processing,
including:
• Scalability
• Fault tolerance
• Cost-effectiveness
• Processing speed
• Flexibility
• Data storage
• Integration
• Open-source
19. Define Open-source technology.
• Open-source technologies refer to software or computer programs that have
their source code available to the public, allowing anyone to access, modify,
and distribute it.
• This means that users can see and edit the code, which can result in greater
collaboration and innovation in software development.
• The open-source movement started in the late 1990s and has since become a
significant force in the tech industry.
• Many popular software tools and platforms, including Linux,
MySQL, and WordPress, are open source.
20. How does cloud technology impact big data?
Cloud technology has a significant impact on big data in the following ways:
21. What do you mean by cloud computing?
• Cloud computing is the delivery of computing services including servers,
storage, databases, networking, software, analytics, and intelligence over
the Internet ("the cloud") to offer faster innovation, flexible resources, and
economies of scale.
• The services provided by cloud computing can be categorized into three
main models: Infrastructure as a Service (IaaS), Platform as a Service (PaaS),
and Software as a Service (SaaS).
22. List out the features of Cloud Computing.
• Scalability
• Elasticity
• Resource Pooling
• Self service
• Low Costs
• Fault Tolerance
23. What are the issues in using cloud services?
Some important cloud services issues are as listed:
• Data Security
• Performance
24. What is mobile business intelligence (Mobile BI)?
• Mobile BI is a technology that enables the access and analysis of business
data on mobile devices such as smartphones and tablets.
• Mobile BI leverages the power of cloud computing and mobile technology
to make data-driven decision-making faster, more accurate, and more
efficient.
25. Justify the need for Business Intelligence.
• Mobile phones' data storage capacity has grown in tandem with their use.
• You are expected to make decisions and act quickly in this fast-paced
environment.
• The number of businesses receiving assistance in such a situation is growing
by the day.
• To expand your business or boost your business productivity, mobile BI can
help, and it works with both small and large businesses.
26. What are the advantages of Business Intelligence?
• Simple access
• Competitive advantage
• Increased productivity
27. Define crowdsourcing.
• Crowdsourcing is the collection of information, opinions, or work from a
group of people, usually sourced via the Internet.
• Crowdsourcing work allows companies to save time and money while tapping
into people with different skills or thoughts from all over the world.
28. What are the types of crowdsourcing?
1. Wisdom of the crowd
2. Crowd creation
3. Crowd sourcing
4. Crowdfunding
29. What is inter firewall analytics?
• Inter firewall analytics is a type of security analytics that focuses on
monitoring and analyzing traffic flowing between different segments
of a network that are separated by firewalls.
• The goal is to identify and prevent potential threats that may be hiding in
the traffic.
30. What do you mean by trans firewall analytics?
• Trans Firewall Analytics is a type of security analytics that focuses
on monitoring and analyzing network traffic passing through an
organization's perimeter firewall(s).
• The main purpose of trans firewall analytics is to identify and prevent network
threats and attacks such as malware, viruses, phishing attempts, and other
types of cyber threats that try to penetrate an organization's network.
• Trans Firewall Analytics involves monitoring and analyzing network traffic
logs generated by the firewall.
31. Difference between inter and trans firewall analytics.
Inter-Firewall | Trans-Firewall
Firewall analytics performed between two or more firewalls | Firewall analytics performed within a single firewall
Spans multiple firewalls or security domains | Limited to a single firewall or security domain
Captures and analyzes traffic between firewalls | Captures and analyzes traffic within a single firewall
... technology? Dis...
... cloud and big data?
... mobile BI and its types in detail.
What is crowdsourcing? Discuss about ...
Compare and contrast inter and trans firewall analytics in detail.
Can you think of any big data application that impacts you ...