UNIT I

1 Understanding Big Data

Syllabus
Introduction to big data - convergence of key trends - unstructured data - industry examples of big
data - web analytics - big data applications - big data technologies - introduction to Hadoop - open
source technologies - cloud and big data - mobile business intelligence - crowd sourcing analytics
- inter and trans firewall analytics.
Contents
1.1 Introduction to Big Data
1.2 Convergence of Key Trends
1.3 Unstructured Data
1.4 Industry Examples of Big Data
1.5 Web Analytics
1.6 Big Data Applications
1.7 Big Data Technologies
1.8 Introduction to Hadoop
1.9 Open Source Technologies
1.10 Cloud and Big Data
1.11 Mobile Business Intelligence
1.12 Crowd Sourcing Analytics
1.13 Inter and Trans Firewall Analytics
1.14 Two Marks Questions with Answers
Big Data Analytics 1-2 Understanding Big Data
1.1 Introduction to Big Data
• Big data can be defined as very large volumes of data available at various sources,
in varying degrees of complexity, generated at different speeds, i.e., velocities, and
with varying degrees of ambiguity, which cannot be processed using traditional
technologies, processing methods, algorithms or any commercial off-the-shelf
solutions.
• 'Big data' is a term used to describe a collection of data that is huge in size and
yet growing exponentially with time. In short, such data is so large and complex
that none of the traditional data management tools are able to store it or process it
efficiently.
• The processing of big data begins with raw data that isn't aggregated or
organized and is most often impossible to store in the memory of a single
computer.
• Big data processing is a set of techniques or programming models to access
large-scale data to extract useful information for supporting and providing
decisions. Hadoop is the open-source implementation of MapReduce and is widely
used for big data processing.
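The MapReduce programming model mentioned above can be illustrated with a small, self-contained sketch (plain Python, no Hadoop required): a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. The example below uses the classic "maximum temperature per station" task; the station names and readings are invented for illustration.

```python
from collections import defaultdict

# Input splits: raw readings as (station, temperature) records (hypothetical data).
readings = [("pune", 31), ("mumbai", 33), ("pune", 35), ("mumbai", 30)]

def map_phase(record):
    station, temp = record
    yield (station, temp)          # emit key-value pairs

def shuffle(pairs):
    groups = defaultdict(list)     # the framework groups all values by key
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(key, values):
    return (key, max(values))      # aggregate each group: maximum reading

pairs = [p for r in readings for p in map_phase(r)]
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(result)  # {'pune': 35, 'mumbai': 33}
```

In a real Hadoop job, the map and reduce functions run in parallel on many machines and the shuffle is performed by the framework; the sketch only demonstrates the division of work.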
1.1.1 Difference between Data Science and Big Data

Sr. No. | Data science | Big data
1. | It is a field of scientific analysis of data in order to solve analytically complex problems, including the significant and necessary activity of cleansing and preparing of data. | Big data is storing and processing huge volumes of structured and unstructured data that is not possible with traditional applications.
2. | It is used in biotech, energy, gaming and insurance. | It is used in retail, education, healthcare and social media.
3. | Goals : Data classification, anomaly detection, prediction, scoring and ranking. | Goals : To provide better customer service, identify new revenue opportunities, effective marketing etc.

1.1.2 Benefits of Big Data Processing

Benefits of big data processing :
1. Improved customer service.
2. Business can utilize outside intelligence while taking decisions.
3. Reduced maintenance costs.
TECHNICAL PUBLICATIONS® - an up-thrust for knowledge
4. Redevelop your products : Big data can also help you understand how others
perceive your products so that you can adapt them, or your marketing, if need
be.
5. Early identification of risk to the product / services, if any.
6. Better operational efficiency.
1.1.3 Big Data Challenges
+ Collecting, storing and processing big data comes with its own set of challenges :
1. Big data is growing exponentially and existing data management solutions have
to be constantly updated to cope with the three Vs.
2. Organizations do not have enough skilled data professionals who can
understand and work with big data and big data tools.
1.2 Convergence of Key Trends
• The essence of computer applications is to store things from the real world in
computer systems in the form of data, i.e., it is a process of producing data. Some
data are records related to culture and society, and others are descriptions
of phenomena of the universe and life. Data on a large scale is rapidly generated
and stored in computer systems, which is called data explosion.
• Data is generated automatically by mobile devices and computers - think Facebook
posts, search queries, directions and GPS locations and image capture.
• Sensors also generate volumes of data, including medical data and commerce
location-based sensors. Experts expect 55 billion IP-enabled sensors by 2021. Even
storage of all this data is expensive, and analysis gets more important and more
expensive every year.
• Fig. 1.2.1 shows the big data explosion caused by the current data boom and how critical
it is for us to be able to extract meaning from all of this data.

Fig. 1.2.1 : Data explosion
• The phenomenon of exponential multiplication of data that gets stored is termed
"data explosion". Continuous inflow of real-time data from various processes,
machinery and manual inputs keeps flooding the storage servers every second.
Sending emails, making phone calls, collecting information for campaigns - each
day we create a massive amount of data just by going about our normal business,
and this data explosion does not seem to be slowing down. In fact, 90 % of the
data that currently exists was created in just the last two years.
• The reason for this data explosion is innovation.
1. Business model transformation : Innovation changed the way in which we do
business and provide services. The data world is governed by three fundamental
trends : business model transformation, globalization and personalization of
services.
  o Organizations have traditionally treated data as a legal or compliance
requirement, supporting limited management reporting requirements.
Consequently, organizations have treated data as a cost to be minimized.
  o Businesses are now required to produce more data related to products and
provide services to cater to each sector and channel of customer.
2. Globalization : Globalization is an emerging trend in business where
organizations start operating on an international scale. From manufacturing to
customer service, globalization has changed the commerce of the world. Variety
and different formats of data are generated due to globalization.
3. Personalization of services : To enhance customer service, one-to-one marketing
in the form of personalization of service is opted for by the customer. Customers
expecting communication through various channels increases the speed of data
generation.
4. New sources of data : The shift to online advertising supported by the likes of
Google, Yahoo and others is a key driver in the data boom. Social media,
mobile devices, sensor networks and new media are at the fingertips of
customers or users. The data generated through these is used by corporations for
decision support systems like business intelligence and analytics. The growth of
technology helped new business models emerge over the last decade or
more. Integration of all the data across the enterprise is used to create a business
decision support platform.
1.2.1 V's of Big Data

• We differentiate big data characteristics from traditional data by one or more of
the five V's : Volume, velocity, variety, veracity and value.
1. Volume : Volumes of data are larger than that which conventional relational database
infrastructure can cope with, consisting of terabytes or petabytes of data.
Fig. 1.2.2 shows big data volume.
Fig. 1.2.2 : Big data volume (sources include machine data, geographical information systems and geo-spatial data)
2. Velocity : The term 'velocity' refers to the speed of generation of data. How fast
the data is generated and processed to meet the demands determines the real
potential in the data. It is being created in or near real-time.
3. Variety : It refers to heterogeneous sources and the nature of data, both
structured and unstructured.
  o Fig. 1.2.3 (a) and Fig. 1.2.3 (b) show big data velocity and data variety.
Fig. 1.2.3 (a) : Data velocity (sensor data, mobile networks, social media and web-based companies such as Amazon, Facebook and Google)
4. Value : It represents the business value to be derived from big data.
  o The ultimate objective of any big data project should be to generate some
sort of value for the company doing all the analysis. Otherwise, you're just
performing some technological task for technology's sake.
TECHNICAL: PUBLICATIONS® - an up-thnist for knowledgeBig Data Analytics 4-6 Understanding Big Dat,
Fig. 1.2.3 (b) : Data variety (structured and unstructured data)
  o For real-time spatial big data, decisions can be enhanced through
visualization of dynamic change in such spatial phenomena as climate,
traffic, social-media-based attitudes and massive inventory locations.
  o Exploration of data trends can include spatial proximities and
relationships. Once spatial big data are structured, formal spatial analytics
can be applied, such as spatial autocorrelation, overlays, buffering, spatial
cluster techniques and location quotients.
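A buffering query of the kind mentioned above - finding all points within a given radius of a location - can be sketched with the haversine great-circle distance formula. The sensor coordinates below are hypothetical, chosen only to illustrate the idea.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points in kilometres.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # 6371 km = mean Earth radius

def within_buffer(points, centre, radius_km):
    # Keep only the points that fall inside the buffer around `centre`.
    return [p for p in points if haversine_km(p[0], p[1], *centre) <= radius_km]

# Hypothetical sensor locations as (lat, lon) pairs.
sensors = [(19.07, 72.87), (18.52, 73.85), (28.61, 77.20)]
nearby = within_buffer(sensors, centre=(19.0, 73.0), radius_km=150)
```

Real spatial analytics engines use spatial indexes rather than scanning every point, but the buffer predicate itself is exactly this distance test.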
5. Veracity : Big data must be fed with relevant and true data. We will not be
able to perform useful analytics if much of the incoming data comes from false
sources or has errors. Veracity refers to the level of trustworthiness or messiness of
the data; the higher the trustworthiness of the data, the lower the messiness, and vice
versa. It relates to the assurance of the data's quality, integrity, credibility and
accuracy. We must evaluate the data for accuracy before using it for business
insights because it is obtained from multiple sources.
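The veracity concern above - screening incoming records for errors before analysis - can be sketched as a simple validation pass. The field names and the plausible-range check below are assumptions made for illustration, not a standard.

```python
def is_trustworthy(record):
    # Reject records with missing required fields or out-of-range values.
    required = ("sensor_id", "temperature")
    if any(record.get(f) is None for f in required):
        return False
    return -50.0 <= record["temperature"] <= 60.0  # assumed plausible range

# Hypothetical incoming readings from multiple sources.
incoming = [
    {"sensor_id": "s1", "temperature": 21.5},
    {"sensor_id": "s2", "temperature": 999.0},   # sensor error
    {"sensor_id": None, "temperature": 19.0},    # missing source
]
clean = [r for r in incoming if is_trustworthy(r)]
```

Only the clean records would then flow into downstream analytics; the erroneous ones can be logged for investigation.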
1.2.2 Compare Cloud Computing and Big Data

Sr. No. | Cloud computing | Big data
1. | It provides resources on demand. | It provides a way to handle huge volumes of data and generate insights.
2. | It refers to internet services, from SaaS and PaaS to IaaS. | It refers to data, which can be structured, semi-structured or unstructured.
3. | Cloud is used to store data and information on remote servers. | It is used to describe a huge volume of data and information.
4. | Cloud computing is economical, as it is a centralized platform with low maintenance costs, no upfront cost and disaster recovery. | Big data is a highly scalable, robust and cost-effective ecosystem.
5. | The main focus of cloud computing is to provide computer resources and services with the help of a network connection. | The main focus of big data is solving problems when a huge amount of data is generated and processed.
6. | Vendors and solution providers of cloud computing are Google, Amazon Web Services, Dell, Microsoft, Apple and IBM. | Vendors and solution providers of big data are Cloudera, Hortonworks, Apache and MapR.
1.3 Unstructured Data

• Unstructured data is data that does not follow a specified format. Rows and
columns are not used for unstructured data; therefore it is difficult to retrieve
required information. Unstructured data has no identifiable structure.
• Examples of unstructured data are e-mails, clickstreams, textual data, images,
log data and videos.
• In the case of unstructured data, size is not the only problem; deriving value
or getting results out of unstructured data is much more complex and challenging
compared to structured data.
• Unstructured data can be in the form of text (documents, email messages,
customer feedback), audio, video or images. Email is an example of unstructured
data.
• Even today, in most organizations more than 80 % of the data is in
unstructured form. This carries lots of information, but extracting information from
these various sources is a very big challenge.
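Extracting information from unstructured sources, as described above, often means imposing structure after the fact. A common sketch is parsing free-form log lines with a regular expression; the log format used here is invented for illustration.

```python
import re

# Hypothetical web-server log line: "<ip> - [<date>] <method> <path> <status>"
LINE = re.compile(
    r'(?P<ip>\S+) - \[(?P<date>[^\]]+)\] (?P<method>\w+) (?P<path>\S+) (?P<status>\d+)'
)

def parse(line):
    # Turn one unstructured line into a structured record, or None on failure.
    m = LINE.match(line)
    return m.groupdict() if m else None

logs = [
    '10.0.0.1 - [12/Mar/2023] GET /index.html 200',
    'corrupted line with no recognizable structure',
]
records = [r for r in (parse(l) for l in logs) if r]
```

Once lines are parsed into records with named fields, ordinary structured tooling (databases, aggregation, reporting) can take over.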
Characteristics of unstructured data :
1. There is no structural restriction or binding for the data.
2. Data can be of any type.
3. Unstructured data does not follow any structural rules.
4. There are no predefined formats, restrictions or sequences for unstructured data.
5. Since there is no structural binding for unstructured data, it is unpredictable in
nature.
Examples of machine-generated unstructured data :
1. Satellite images : This includes weather data or the data that the government
captures in its satellite surveillance imagery.
2. Scientific data : This includes atmospheric data and high-energy physics data.
3. Photographs and video : This includes security, surveillance and traffic video.
Structured Data :
• Structured data is arranged in row and column format. It helps applications to
retrieve and process data easily. A database management system is used for storing
structured data.
• Any data that can be stored in the form of a particular fixed format is known as
structured data. For example, data stored in the columns and rows of tables in
relational database management systems is a form of structured data.
Difference between Structured and Unstructured Data

Property | Structured data | Unstructured data
Definition | Structured data is stored in row and column format. | Unstructured data is data that does not follow a specified format; it is in discrete form.
Based on | Syntax | Semantics
Storage | Database management system | Unmanaged file structure
Standards | SQL, ADO.NET, ODBC | Open XML, SMTP, SMS
Data ingestion | ETL | Batch processing or manual data entry
Document characteristics | With a structured document, certain information always appears in the same location on the page. | In an unstructured document, information can appear in unexpected places on the document.
Used by organizations for | Low volume operations | High volume operations
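The contrast in the table can be made concrete: structured data answers a precise query through a schema, while the same fact buried in unstructured text must be found by searching. A minimal sketch using Python's built-in sqlite3 module, with invented example data:

```python
import sqlite3

# Structured: a fixed schema lets us query by column.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?, ?)",
               [(1, "Asha", 250.0), (2, "Ravi", 90.0)])
big_orders = db.execute("SELECT customer FROM orders WHERE amount > 100").fetchall()

# Unstructured: the same fact expressed in free text needs string matching,
# with no guarantee about where (or how) the information appears.
email = "Hi, Asha confirmed her order of Rs. 250 yesterday."
mentions_asha = "Asha" in email
```

The SQL query is exact and repeatable; the text search is approximate and depends on how the sentence happens to be phrased, which is precisely why unstructured data is harder to exploit.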
1.4 Industry Examples of Big Data
• Big data plays an important role in digital marketing. Each day, information shared
digitally increases significantly. With the help of big data, marketers can analyze
every action of the consumer. It provides better marketing insights and helps
marketers to make more accurate and advanced marketing strategies.
• Reasons why big data is important for digital marketers :
a) Real-time customer insights
b) Personalized targeting
c) Increasing sales
d) Improves the efficiency of a marketing campaign
e) Budget optimization
f) Measuring campaign results more accurately.
© Data constantly informs marketing teams of customer behaviors and industry
trends and is used to optimize future efforts, create innovative campaigns and
build lasting relationships with customers.
+ Big data regarding customers provides marketers details about user demographics,
locations and interests, which can be used to personalize the product experience
and increase customer loyalty over time.
• Big data solutions can help organize data and pinpoint which marketing
campaigns, strategies or social channels are getting the most traction. This lets
marketers allocate marketing resources and reduce costs for projects that are not
yielding as much revenue or meeting desired audience goals.
* Personalized targeting : Nowadays, personalization is the key strategy for every
marketer. Engaging the customers at the right moment with the right message is
the biggest issue for marketers. Big data helps marketers to create targeted and
personalized campaigns.
* Personalized marketing is creating and delivering messages to the individuals or
the group of the audience through data analysis with the help of consumer's data
such as geolocation, browsing history, clickstream behavior and purchasing
history. It is also known as one - to - one marketing.
• Consumer insights : In this day and age, marketing has become the ability of a
company to interpret data and change its strategies accordingly. Big data
allows for real-time consumer insights, which are crucial to understanding the habits
of your customers. By interacting with your consumers through social media you
will know exactly what they want and expect from your product or service, which
will be key to distinguishing your campaign from your competitors'.
• Help increase sales : Big data will help with demand predictions for a product or
service. Information gathered on user behaviour will allow marketers to answer
what types of product their users are buying, how often they conduct purchases
or search for a product or service and, lastly, what payment methods they prefer
using.
• Analyse campaign results : Big data allows marketers to measure their campaign
performance. This is the most important part of digital marketing. Marketers will
use reports to measure any negative changes to marketing KPIs. If they have not
achieved the desired results, it will be a signal that the strategy needs to be
changed in order to maximize revenue and make your marketing more scalable in
future.
1.5 Web Analytics
• Web analytics is the collection, reporting and analysis of website data. The focus
is on identifying measures based on your organizational and user goals and using
the website data to determine the success or failure of those goals, to drive
strategy and to improve the user's experience.
• The WWW is an evolving system for publishing and accessing resources and
services across the Internet. The web is an open system; its operations are based
on freely published communication standards and document standards.
• Web analytics is important to help us to :
1. Refine your marketing campaigns
2. Understand your website visitors
3. Analyze website conversions
4. Improve the website user experience
5. Boost your search engine ranking
6. Understand and optimize referral sources
7. Boost online sales.
• Businesses use web analytics platforms to measure and benchmark site
performance and to look at key performance indicators that drive their business,
such as purchase conversion rate.
• Website analytics provide insights and data that can be used to create a better
user experience for website visitors. Understanding customer behavior is also key
to optimizing a website for key conversion metrics.
• For example, web analytics will show us the most popular pages on your website
and the most popular paths to purchase. With website analytics, we can also
accurately track the effectiveness of your online marketing campaigns to help
inform future efforts.
• Web analytics can help a digital marketer understand their customers better by
providing :
1. Insight into who the customers are and their interests
2. Conversion challenges
3. Enhanced appreciation of what consumers like or do not like
4. Understanding of how to improve the user experience for the consumer.
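A key performance indicator named above, the purchase conversion rate, is straightforward to compute from raw event data. A sketch over a hypothetical clickstream log (visitor IDs and event types are invented):

```python
# Hypothetical clickstream events: (visitor_id, event_type)
events = [
    ("v1", "visit"), ("v1", "purchase"),
    ("v2", "visit"),
    ("v3", "visit"), ("v3", "purchase"),
    ("v4", "visit"),
]

# Distinct visitors and distinct buyers, then the share of visitors who bought.
visitors = {v for v, e in events if e == "visit"}
buyers = {v for v, e in events if e == "purchase"}
conversion_rate = len(buyers & visitors) / len(visitors)
print(f"Conversion rate: {conversion_rate:.0%}")  # Conversion rate: 50%
```

Production analytics platforms compute the same ratio, only over sessionized event streams at much larger scale and segmented by campaign, channel or page.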
1.6 Big Data Applications
• Big data applications can help companies to make better business decisions by
analyzing large volumes of data and discovering hidden patterns. These data sets
might be from social media, data captured by sensors, website logs, customer
feedback, etc. Organizations are spending huge amounts on big data applications
to discover hidden patterns, unknown associations, market trends, consumer
preferences and other valuable business information.
• Domains where big data can be applied include healthcare, media and entertainment,
IoT, manufacturing and government.
• Relation between IIoT and big data : Big data production in the Industrial
Internet of Things (IIoT) is evident due to the massive deployment of sensors and
Internet of Things (IoT) devices. However, big data processing is challenging due
to limited computational, networking and storage resources at the IoT device end. Big
Data Analytics (BDA) is expected to provide operational and customer-level
intelligence in IIoT systems.
• The extensive installation of sensors on machines causes a massive increase in the
volume of data collected within industrial processes. The data consist of operating
data, error lists, history of maintenance activities and the like.
• In combination with the related business data, the overall plethora of data
provides the raw material for process optimizations and other applications. To set
this potential for optimization free, the raw data needs to be processed
systematically, passing through various algorithms.
• The results are prepared information with specific application objectives. Pattern
detection is especially worth mentioning in this context, since this method identifies and
quantifies cause-and-effect correlations and allows predictions of state changes.
The significance of the information given out by the analysis depends on the
amount of data processed.
1. Healthcare :
• Big data analytics for healthcare uses health-related information of an individual
or community to understand a patient, organization or community. In the past,
managing and analyzing healthcare data was tedious and expensive. More
recently, technology has helped the healthcare sector make leaps and bounds to
keep up with the flow of big data in healthcare.
• Diagnostic devices, medical machinery, instrumentation and online services - sources
such as these are transferring data throughout a healthcare network. This is done
with the help of big data tools such as Hadoop and Spark.
• One of the most current and relevant big data examples in healthcare is how it
has impacted the global coronavirus crisis. Big data analytics for healthcare
supported the rapid development of COVID-19 vaccines. Researchers can share
data with each other to develop advanced medications very quickly. Big data in
healthcare also predicted the spread of disease by allowing healthcare information
to be processed much more rapidly than in the past during other pandemics.
• Smoother hospital administration : Healthcare administration becomes much
smoother with the help of big data. It helps to reduce the cost of care
measurement, provide the best clinical support and manage the population of
at-risk patients. It also helps medical experts analyze data from diverse sources. It
helps healthcare providers conclude the deviations among patients and the effects
treatments have on their health.
• Fraud prevention and detection : Big data helps to prevent a wide range of errors
on the side of health administrators, in the form of wrong dosages, wrong
medicines and other human errors. It will also be particularly useful to insurance
companies. They can prevent a wide range of fraudulent insurance claims.
• Challenges of big data in healthcare : As a relatively new field, big data in
healthcare is still evolving to keep up with the fast pace and changing nature of
technology. With such vast amounts of data available to work with, organizations
and leaders can struggle with knowing where and how to start with data analytics
in healthcare to find the information that is meaningful.
• Many healthcare organizations lack adequate systems and databases and the
skilled professionals to handle them. As such, the demand for healthcare analysts
with advanced education and training is very high in the world.
2. Manufacturing :
• Improving efficiency across the business helps a manufacturing company control
costs, increase productivity and boost margins. Automated production lines are
already standard practice for many, but manufacturing big data can exponentially
improve line speed and quality.
• Manufacturing big data also increases transparency into the entire supply chain - for
example, by using sensor and RFID data to track the location of tools, parts and
inventory in real time, reducing interruptions and delays.
• Companies can also increase supply chain transparency by analyzing individual
processes and their interdependencies for opportunities to optimize everything.
• Speeding up assembly : Part of the key to manufacturing more products is to
simply make the whole process quicker. With big data, manufacturers have been
able to segment their production to identify which parts of the process go the
fastest. Knowing which products are faster and easier to produce can help
companies know where to focus their efforts, perhaps even concentrating solely on
those products for maximum production. It helps companies to know where
they are most efficient, with the added possibility of working on those areas that
need the most improvement.
• AI-driven analysis of manufacturing big data enables companies to aggregate and
analyze both their own and competitors' pricing and cost data to produce
continually optimized price variants. For manufacturers that focus on
build-to-order products, ML can also ensure the accuracy of their customized
configurations and streamline the Configure-Price-Quote (CPQ) workflow.
1.7 Big Data Technologies
• Big data technology is defined as technology and software utilities that are
designed for the analysis, processing and extraction of information from large sets
of extremely complex structures and large data sets which are very difficult for
traditional systems to deal with. Big data technology is used to handle both
real-time and batch-related data.
• Big data technologies include Apache Hadoop, Apache Spark, MongoDB,
Apache Cassandra, Plotly, Pig, Tableau, etc.
• Cassandra : Cassandra is one of the leading big data technologies among the list
of top NoSQL databases. It is open-source, distributed and has extensive column
storage options. It is freely available and provides high availability without fail.
• Apache Pig is a high-level scripting language used to execute queries for larger
datasets that are used within Hadoop.
• Apache Spark is a fast, in-memory data processing engine suitable for use in a
wide range of circumstances. Spark can be deployed in several ways; it features
Java, Python, Scala and R programming languages and supports SQL, streaming
data, machine learning and graph processing, which can be used together in an
application.
• MongoDB : MongoDB is another important component of big data technologies in
terms of storage. No relational or RDBMS properties apply to MongoDB because
it is a NoSQL database. This is not the same as traditional RDBMS databases that
use structured query languages. Instead, MongoDB stores flexible, schema-free
documents.
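The document model just described can be illustrated without a running MongoDB instance: each record is a self-describing JSON document, and two documents in the same collection need not share fields. The collection and field names below are invented, and the `find` helper only mimics the spirit of MongoDB's query-by-example interface.

```python
import json

# Two documents in the same hypothetical "users" collection; unlike rows in a
# relational table, they are free to carry different fields.
users = [
    {"_id": 1, "name": "Asha", "email": "asha@example.com"},
    {"_id": 2, "name": "Ravi", "phone": "98xxxxxx", "tags": ["premium"]},
]

def find(collection, criteria):
    # Query by example: keep documents whose fields match all criteria.
    return [d for d in collection if all(d.get(k) == v for k, v in criteria.items())]

premium = find(users, {"tags": ["premium"]})
print(json.dumps(premium[0]["name"]))
```

Because documents are schema-free, adding a new field to future records requires no migration; queries simply ignore documents that lack the field.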
1.8 Introduction to Hadoop
• Apache Hadoop is an open source framework that is used to efficiently store and
process large datasets ranging in size from gigabytes to petabytes of data. Hadoop
is designed to scale up from a single computer to thousands of clustered
computers, with each machine offering local computation and storage.
• Hadoop is sometimes referred to as an acronym for High Availability
Distributed Object Oriented Platform.
• The Hadoop framework consists of a storage layer known as the Hadoop
Distributed File System (HDFS) and a processing framework called the
MapReduce programming model. Hadoop splits large amounts of data into
chunks, distributes them within the network cluster and processes them in its
MapReduce framework.
Hadoop can also be installed on cloud servers to better manage the compute and
storage resources required for big data. Leading cloud vendors such as Amazon
Web Services (AWS) and Microsoft Azure offer solutions. Cloudera supports
Hadoop workloads both on-premises and in the cloud, including options for one
or more public cloud environments from multiple vendors.
• Hadoop provides a distributed file system and a framework for the analysis and
transformation of very large data sets using the MapReduce paradigm. An
important characteristic of Hadoop is the partitioning of data and computation
across many (thousands of) hosts and the execution of application computations in
parallel close to their data. A Hadoop cluster scales computation capacity, storage
capacity and I/O bandwidth by simply adding commodity servers.
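Hadoop's MapReduce paradigm is often used from Python via Hadoop Streaming, where the mapper and reducer are ordinary programs reading lines on stdin and writing tab-separated key-value lines on stdout. The sketch below captures the logic of the two scripts for a word count, driven by in-memory lines so it runs standalone; in a real Streaming job, Hadoop itself performs the sort between the two phases.

```python
def mapper(lines):
    # Streaming mapper: emit "word\t1" for each word seen.
    for line in lines:
        for word in line.strip().split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    # Streaming reducer: input arrives sorted by key, so counts can be
    # accumulated over each run of identical keys.
    current, count = None, 0
    for line in sorted_lines:
        word, n = line.split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{count}"
            current, count = word, 0
        count += int(n)
    if current is not None:
        yield f"{current}\t{count}"

# Between the two phases, Hadoop sorts mapper output by key (simulated here).
mapped = sorted(mapper(["big data", "big cluster"]))
result = dict(l.split("\t") for l in reducer(mapped))
print(result)  # {'big': '2', 'cluster': '1', 'data': '1'}
```

The same two functions, wrapped to read `sys.stdin`, could be submitted as the `-mapper` and `-reducer` scripts of a Hadoop Streaming job.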
• Key features of Hadoop :
1. Cost-effective system
2. Large cluster of nodes
3. Parallel processing
4. Distributed data
5. Automatic failover management
6. Data locality optimization
7. Heterogeneous cluster
8. Scalability.
• Hadoop allows for the distribution of datasets across a cluster of commodity
hardware. Processing is performed in parallel on multiple servers simultaneously.
• Software clients input data into Hadoop. HDFS handles metadata and the
distributed file system, MapReduce then processes and converts the data and,
finally, YARN divides the jobs across the computing cluster.
« All Hadoop modules are designed with a fundamental assumption that hardware
failures of individual machines or racks of machines are common and should be
automatically handled in software by the framework.
• Challenges of Hadoop :
MapReduce complexity : As a file-intensive system, MapReduce can be a
difficult tool to utilize for complex jobs, such as interactive analytical tasks.
• There are four main libraries in Hadoop.
1. Hadoop Common : This provides utilities used by all other modules in
Hadoop.
2. Hadoop MapReduce : This works as a parallel framework for scheduling and
processing the data.
3. Hadoop YARN : This is an acronym for Yet Another Resource Negotiator. It is
the resource management layer introduced in Hadoop 2 (sometimes called
MapReduce 2.0) and manages processes running over Hadoop.
4. Hadoop Distributed File System (HDFS) : This stores data and maintains
records over various machines or clusters. It also allows the data to be stored in
an accessible format.
1.8.1 Hadoop Ecosystem
• The Hadoop ecosystem is neither a programming language nor a service; it is a
platform or framework which solves big data problems.
• The Hadoop ecosystem refers to the various components of the Apache Hadoop
software library, as well as to the accessories and tools provided by the Apache
Software Foundation for these types of software projects and to the ways that they
work together.
• Hadoop is a Java-based framework that is extremely popular for handling and
analysing large sets of data. The idea of a Hadoop ecosystem involves the use of
different parts of the core Hadoop set such as MapReduce, a framework for
handling vast amounts of data, and the Hadoop Distributed File System (HDFS), a
sophisticated file-handling system. There is also YARN, a Hadoop resource
manager.
• In addition to these core elements of Hadoop, Apache has also delivered other
kinds of accessories or complementary tools for developers.
• Some of the most well-known tools of the Hadoop ecosystem include HDFS,
Hive, Pig, YARN, MapReduce, Spark, HBase, Oozie, Sqoop, ZooKeeper, etc.
• Fig. 1.8.1 shows the Apache Hadoop ecosystem.

Fig. 1.8.1 : Apache Hadoop ecosystem (management and monitoring : Ambari; coordination : ZooKeeper; workflow and scheduling : Oozie; machine learning : Mahout; NoSQL : HBase; data integration; distributed processing : MapReduce; distributed storage : HDFS)
• The Hadoop Distributed File System (HDFS) is one of the largest Apache projects
and the primary storage system of Hadoop. It employs a NameNode and DataNode
architecture. It is a distributed file system able to store large files running over a
cluster of commodity hardware.
• YARN stands for Yet Another Resource Negotiator. It is one of the core
components in open source Apache Hadoop suitable for resource management. It
is responsible for managing workloads, monitoring and security controls
implementation.
* Hive is an ETL and data warehousing tool used to query or analyze large datasets
stored within the Hadoop ecosystem. Hive has three main functions : data
summarization, query and analysis of unstructured and semi-structured data in
Hadoop.
* Map-Reduce : It is the core component of processing in the Hadoop ecosystem as
it provides the logic of processing. In other words, MapReduce is a software
framework which helps in writing applications that process large data sets using
distributed and parallel algorithms inside the Hadoop environment.
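The map-shuffle-reduce flow can be sketched in plain Python. This is only an illustration of the programming model, not the Hadoop API itself; a real job would be written against Hadoop's Java MapReduce classes or run via Hadoop Streaming.

```python
from collections import defaultdict

def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in the input split."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(mapped_pairs):
    """Shuffle/sort: group all emitted values by key, as Hadoop does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

# Two "input splits", as if they came from different HDFS blocks.
documents = ["big data needs big storage", "hadoop stores big data"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
result = reduce_phase(shuffle(mapped))
print(result["big"])   # "big" appears three times across the splits
```

In a real cluster, each mapper runs on the node holding its data block and the shuffle moves intermediate pairs over the network to the reducers.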
* Apache Pig is a high - level scripting language used to execute queries for larger
datasets that are used within Hadoop.
* Apache Spark is a fast, in-memory data processing engine suitable for use in a
wide range of circumstances. Spark can be deployed in several ways; it features
the Java, Python, Scala and R programming languages and supports SQL, streaming
data, machine learning and graph processing, which can be used together in an
application.
* Apache HBase is a Hadoop ecosystem component : a distributed database
designed to store structured data in tables that could have billions of
rows and millions of columns. HBase is a scalable, distributed NoSQL database
built on top of HDFS. HBase provides real-time access to read or write
data in HDFS.
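The row-key / column-family layout described above can be illustrated with a toy in-memory sketch. This is not the HBase client API; real HBase also versions every cell with a timestamp and distributes rows across region servers.

```python
# Toy model of HBase's storage layout: each row key maps to cells
# addressed by "columnfamily:qualifier". Timestamped cell versions
# and region distribution are omitted for simplicity.
table = {}

def put(row_key, column, value):
    """Write a single cell value for the given row and column."""
    table.setdefault(row_key, {})[column] = value

def get(row_key, column):
    """Read one cell; returns None if the row or column is absent."""
    return table.get(row_key, {}).get(column)

put("user#1001", "info:name", "Asha")
put("user#1001", "info:city", "Pune")
put("user#1002", "info:name", "Ravi")

print(get("user#1001", "info:city"))  # Pune
```

Note how rows need not share the same columns: this sparseness is what lets HBase tables grow to millions of columns without wasting storage.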
Hadoop Advantages
1. Scalable : A Hadoop cluster can be extended by just adding nodes to the cluster.
2. Cost effective : Hadoop is open source and uses commodity hardware to store data,
so it is really cost effective as compared to traditional relational database
management systems.
3. Resilient to failure : HDFS has the property with which it can replicate data over
the network.
4. Hadoop can handle unstructured as well as semi-structured data.
5. The unique storage method of Hadoop is based on a distributed file system that
effectively maps data wherever the cluster is located.
Open Source Technologies
Open source software is like any other software (closed/proprietary software).
This software is differentiated by its use and licenses. Open source software
guarantees the right to access and modify the source code and to use, reuse and
redistribute the software, all with no royalty or other costs.
Standard software is sold and supported commercially. However, open source
software can be sold and/or supported commercially, too. Open source is a
disruptive technology.
Open source is an approach to the design, development and distribution of
software, offering practical accessibility to software's source code.
Open source licenses must permit non-exclusive commercial exploitation of the
licensed work, must make available the work's source code and must permit the
creation of derivative works from the work itself.
Netscape, for example, released its browser source code under the Netscape Public
License and subsequently under the Mozilla Public License.
Proprietary software is computer software which is the legal property of one
party. The terms of use for other parties are defined by contracts or licensing
agreements. These terms may include various privileges to share, alter, disassemble
and use the software and its code.
Closed source is a term for software whose license does not allow for the release
or distribution of the software's source code. Generally, it means only the binaries
of a computer program are distributed and the license provides no access to the
program's source code. The source code of such programs is usually regarded as a
trade secret of the company. Access to source code by third parties commonly
requires the party to sign a non-disclosure agreement.
Need of open source
The demands of consumers as well as enterprises are ever increasing with the
increase in information technology usage. Information technology solutions are
required to satisfy their different needs. It is a fact that a single solution provider
cannot produce all the needed solutions. Open source, freeware and free software
are now available for anyone and for any use.
In the 1970s and early 1980s, software organizations started using technical
measures to prevent computer users from being able to study and modify
software. The copyright law was extended to computer programs in 1980. The free
software movement was conceived in 1983 by Richard Stallman to satisfy the need
for and to give the benefit of "software freedom" to computer users.
Richard Stallman declared the idea of the GNU operating system in September
1983. The GNU Manifesto was written by Richard Stallman and published in
March 1985.
The Free Software Foundation (FSF) is a non-profit corporation started by Richard
Stallman on 4 October 1985 to support the free software movement, a copyleft
based movement which aims to promote the universal freedom to distribute and
modify computer software without restriction. In February 1986, the first formal
definition of free software was published.
The term "free software" is associated with the FSF's definition, and the term
"open source software" is associated with the OSI's definition. The FSF's and OSI's
definitions are worded quite differently but the set of software that they cover is
almost identical.
One of the primary goals of this foundation was the development of a free and
open computer operating system and application software that can be used and
shared among different users with complete freedom.
Open source differs from the operation of traditional copyright licensing by
permitting both open distribution and open modification.
Before the term open source became widely adopted, developers and producers
used a variety of phrases to describe the concept. The term open source gained
popularity with the rise of the Internet, which provided access to diverse
production models, communication paths and, last but not least, interactive
communities.
Netscape licensed and released its code as open source under the Netscape Public
License.
Successes of open source
* Successful open source projects make up many of today's most widely used
technologies :
Operating systems : Linux, Symbian, GNU Project, NetBSD.
Servers : Apache, Tomcat, MediaWiki, Drupal, WordPress, Eclipse, Moodle, Joomla.
Programming languages : Java, JavaScript, PHP, Python, Ruby.
Client software : Mozilla Firefox, Mozilla Thunderbird, OpenOffice, Songbird,
Audacity, 7-Zip.
Digital content : Wikipedia, Wiktionary, Project Gutenberg.
Examples of open source and proprietary software :

Classification of software | Open source software | Proprietary software
Operating systems | Linux | MS Windows XP, Vista; SUN Solaris
Word processing and office applications | OpenOffice | MS Office, Adobe FrameMaker
Software development | Eclipse, JDK | MS Visual Studio, .NET
Multimedia content creation | GIMP | Adobe Photoshop
Web page design | Typo3 | MS FrontPage, Adobe Flash
Difference between Open Source and Open Standards
Open source software is a type of software where the user has access to the
software's source code and can freely use, modify and distribute the software.
Thus open source concerns the code the software is made of.
Open standards denote that the code responsible for communication with other
systems is open and has technical specifications which are accessible free of
charge. Thus open standards concern the communication between software.
Advantages of Open Source
1. The right to use the software in any way.
2. There is usually no license cost; it is free of cost.
3. The source code is open and can be modified freely.
4. It is possible to reuse the software in another context or with another public
authority.
5. Open standards.
6. It provides higher flexibility.

Disadvantages of Open Source
1. There is no guarantee that development will happen.
2. It is sometimes difficult to know that a project exists, and its current status.
3. No secured follow-up development strategy.
Application of Open Source Software
Following is the list of applications where open source software is used :
1. Social networking
2. Multimedia
3. Animation
4. Accounting
5. Instant messaging
6. ERP
7. Desktop publishing
8. Website development
9. Resource management
10. Video editing

Comparison of Open Source with Close Source

Open source software | Close source / proprietary software
Source code freely available. | Source code is kept secret.
Modifications are allowed. | Modifications are not allowed.
Licensees may do their own development. | All upgrades, support, maintenance and development are done by the licensor.
Example : Wikipedia | Example : Microsoft Windows
Sublicensing is allowed. | Sublicensing is not allowed.
No guarantee of further development. | Guarantee of further development.
Fees, if any, are for integration, packaging, support and consulting. | Fees are for license, maintenance and upgradation.
Android OS is open source software provided by Google. | iOS is proprietary software provided by Apple.
Cloud and Big Data
The NIST defines cloud computing as : "Cloud computing is a model for enabling
ubiquitous, convenient, on-demand network access to a shared pool of
configurable computing resources that can be rapidly provisioned and released
with minimal management effort or service provider interaction. This cloud model
is composed of five essential characteristics, three service models and four
deployment models.”
Cloud provider is responsible for the physical infrastructure and the cloud
consumer is responsible for application configuration, personalization and data.
Broad network access refers to resources hosted in a cloud network that are
available for access from a wide range of devices. Rapid elasticity is used to
describe the capability to provide scalable cloud computing services.
Measured service : NIST describes measured service as a setup where cloud
systems may control a user's or tenant's use of resources by a metering capability
somewhere in the system.
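The metering idea can be sketched as follows. This is illustrative only; real cloud platforms meter many resource types (compute, storage, network) and feed the measurements into billing and quota enforcement.

```python
from collections import defaultdict

class UsageMeter:
    """Toy metering service: record each tenant's resource consumption
    so that usage can be monitored, controlled and reported."""
    def __init__(self):
        self.usage = defaultdict(float)   # tenant -> consumed CPU-hours

    def record(self, tenant, cpu_hours):
        """Accumulate a usage sample for one tenant."""
        self.usage[tenant] += cpu_hours

    def report(self, tenant):
        """Return the tenant's total metered usage so far."""
        return self.usage[tenant]

meter = UsageMeter()
meter.record("tenant-a", 2.5)
meter.record("tenant-a", 1.0)
print(meter.report("tenant-a"))  # 3.5
```

The provider would sample such a meter periodically to provide the transparency NIST's definition calls for, for both the provider and the consumer.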
On-demand self-service refers to the service provided by cloud computing vendors
that enables the provision of cloud resources on demand whenever they are
required.
The Cloud Cube Model has four dimensions to differentiate cloud formations :
a) External / Internal b) Proprietary / Open
c) De-perimeterized / Perimeterized d) Outsourced / Insourced.
External / Internal : The physical location of data is defined by the
external/internal dimension. It defines the organization's boundary.
Example : Information inside a datacenter using a private cloud deployment
would be considered internal, and data that resided on Amazon EC2 would be
considered external.
Proprietary / Open : Ownership is proprietary or open; this is a measure not only
of ownership of the technology but also of its interoperability, use of data, ease of
data transfer and degree of vendor lock-in.
Proprietary means that the organization providing the service is keeping the
means of provision under their ownership. Clouds that are open are using
technology that is not proprietary, meaning that there are likely to be more
suppliers.
De-perimeterized / Perimeterized (security ranges) : This dimension measures
whether the operations are inside or outside the security boundary, firewall, etc.
Encryption and key management will be the technology means for providing data
confidentiality and integrity in a de-perimeterized model.
Outsourced / Insourced : Out-sourcing/in-sourcing defines whether the
customer or the service provider provides the service.
Outsourced means the service is provided by a third party. It refers to letting
contractors or service providers handle all requests; most cloud business
models fall into this category.
Insourced is the service provided by your own staff under organization control.
Insourced means in-house development of clouds.
* Cloud computing is often described as a stack, as a response to the broad range of
services built on top of one another under the "cloud". A cloud computing stack is
a cloud architecture built in layers of one or more cloud-managed services (SaaS,
PaaS, IaaS, etc.).
* Cloud computing stacks are used for all sorts of applications and systems. They
are especially good for microservices and scalable applications, as each tier is
dynamically scaling and replaceable.
* The cloud computing stack makes up a threefold system that comprises its
lower-level elements. These components function as formalized cloud computing
delivery models :
a) Software as a Service (SaaS)
b) Platform as a Service (PaaS)
c) Infrastructure as a Service (IaaS)
* SaaS applications are designed for end-users and delivered over the web.
* PaaS is the set of tools and services designed to make coding and deploying those
applications quick and efficient.
* IaaS is the hardware and software that powers it all, including servers, storage,
networks and operating systems.
Difference between Cloud Computing and Big Data

Sr. No. | Cloud computing | Big data
1. | It provides resources on demand. | It provides a way to handle huge volumes of data and generate insights.
2. | It refers to internet services, from SaaS and PaaS to IaaS. | It refers to data, which can be structured, semi-structured or unstructured.
3. | Cloud is used to store data and information on remote servers. | It is used to describe huge volumes of data and information.
4. | Cloud computing is economical as it has low maintenance costs, a centralized platform, no upfront cost and disaster-safe implementation. | Big data is a highly scalable, robust and cost-effective ecosystem.
5. | Vendors and solution providers of cloud computing are Google, Amazon Web Services, Dell, Microsoft, Apple and IBM. | Vendors and solution providers of big data are Cloudera, Hortonworks, Apache and MapR.
6. | The main focus of cloud computing is to provide computer resources and services with the help of a network connection. | The main focus of big data is solving problems when a huge amount of data is being generated and processed.

Difference between Cloud Computing and Internet

Sr. No. | Cloud computing | Internet
1. | Cloud computing is a new technology that delivers many types of resources over the Internet. | The Internet is a network of networks, which provides software/hardware infrastructure to establish and maintain connectivity of the computers around the world.
2. | Cloud computing allows individuals and businesses to access on-demand computing resources and applications. | The Internet is interconnected with unique identifiers and can exchange data over a network with little or no human interaction.
3. | Cloud computing cannot operate globally without the Internet. | The Internet operates without cloud computing.
4. | Cloud computing is owned by a person, company, institution or government. | No single person, company, institution or government agency controls or owns the Internet.
5. | Cloud computing is an application-based software infrastructure that stores data on remote servers, which can be accessed through the internet. | The Internet provides software/hardware infrastructure to establish and maintain connectivity of the computers.
6. | Cloud computing is the promise of the utilization of that infrastructure to provide continuous services. | The Internet is the enabling infrastructure.

Mobile Business Intelligence
* Mobile Business Intelligence (BI) or mobile analytics is the rising software
technology that allows users to access information and analytics on their phones
and tablets instead of desktop-based BI systems. Mobile analytics involves
measuring and analyzing data generated by mobile platforms and properties, such
as mobile sites and mobile applications.
Analytics is the practice of measuring and analyzing data of users in order to
create an understanding of user behavior as well as a website's or application's
performance. If this practice is done on mobile apps and app users, it is called
"mobile analytics".
* Mobile analytics is the practice of collecting user behavior data, determining intent
from those metrics and taking action to drive retention, engagement and
conversion.
* Mobile analytics is similar to web analytics in that it identifies the unique
customer and records their usage.
* With mobile analytics data, you can improve your cross-channel marketing
initiatives, optimize the mobile experience for your customers and grow mobile
user engagement and retention.
* Analytics usually comes in the form of software that integrates into companies'
existing websites and apps to capture, store and analyze the data.
It is always very important for businesses to measure their critical KPIs (Key
Performance Indicators), as the old rule is always valid : “If you can't measure it,
you can't improve it”.
* To be more specific, if a business finds out that 75 % of their users exit in the
shipment screen of their sales funnel, probably there is something wrong with that
screen in terms of its design, user interface (UI) or user experience (UX), or there
is a technical problem preventing users from completing the process.
Working of Mobile Analytics :
* Most of the analytics tools need a library (an SDK) to be embedded into the
mobile app's project code and at minimum an initialization code in order to track
the users and screens.
SDKs differ by platform, so a different SDK is required for each platform such as
iOS, Android, Windows Phone, etc. On top of that, additional code is required for
custom event tracking.
With the help of this code, analytics tools track and count each user, app launch,
tap, event, app crash or any additional information that the user has, such as
device, operating system, version and IP address (and probable location).
* Unlike web analytics, mobile analytics tools don't depend on cookies to identify
unique users since mobile analytics SDKs can generate a persistent and unique
identifier for each device.
* The tracking technology varies between websites, which use either JavaScript or
cookies, and apps, which use a Software Development Kit (SDK).
* Each time a website or app visitor takes an action, the application fires off data
which is recorded in the mobile analytics platform.
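The SDK behaviour described above can be sketched as follows. This is a hypothetical tracker, not any vendor's actual SDK; the event names and fields are assumptions chosen for illustration.

```python
import time
import uuid

class AnalyticsTracker:
    """Toy analytics SDK: assign the device a persistent identifier and
    record events (app launches, screen views, taps) against it."""
    def __init__(self):
        # Mobile SDKs generate a device identifier once and reuse it,
        # instead of relying on cookies as web analytics does.
        self.device_id = str(uuid.uuid4())
        self.events = []

    def track(self, name, **properties):
        """Record one event with its timestamp and custom properties."""
        self.events.append({
            "device_id": self.device_id,
            "event": name,
            "timestamp": time.time(),
            **properties,
        })

tracker = AnalyticsTracker()
tracker.track("app_launch")
tracker.track("screen_view", screen="checkout")
print(len(tracker.events))  # 2
```

A real SDK would batch these events and upload them to the analytics platform over the network rather than keep them in memory.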
Difference between Mobile Analytics and Web Analytics

Sr. No. | Mobile analytics | Web analytics
1. | When using the site on mobile, the user is called a USER. | When using the web site, the user is called a VISITOR.
2. | Interaction with the site is called a SESSION. | Interaction with the site is called a VISIT.
3. | On mobile, users have less screen real estate (4 to 7 inches) and interact by touching, swiping and holding. | On a desktop, users have larger screens (10 to 17 inches) and interact by clicking, double-clicking and using key commands.
4. | Session timeout may be as short as 30 seconds. | A session will end after 30 minutes of inactivity for websites.
5. | Unique users are identified via user IDs. | Cookies are used to identify users.
Crowd Sourcing Analytics
* Crowdsourcing is the process of exploring customer's ideas, opinions and thoughts
available on the internet from large groups of people aimed at incorporating
innovation, implementing new ideas and eliminating product issues.
* Crowdsourcing means the outsourcing of human-intelligence tasks to a large
group of unspecified people.via the Internet.
* Crowdsourcing is all about collecting data from users through some services,
ideas, or content; it then needs to be stored in a server so that the necessary
data can be provided to users whenever necessary.
* Most users nowadays use Truecaller to find unknown numbers and Google Maps
to find out places and the traffic in a region. All the services are based on
crowdsourcing.
* Crowdsourced data is a form of secondary data. Secondary data refers to data that
is collected by any party other than the researcher. Secondary data provides
important context for any investigation into a policy intervention.
* When crowdsourcing data, researchers collect plentiful, valuable and dispersed
data at a cost typically lower than that of traditional data collection methods.
* Consider the trade-offs between sample size and sampling issues before deciding
to crowdsource data. Ensuring data quality means making sure the platform on
which you are collecting crowdsourced data is well-tested.
* Crowdsourcing experiments are normally set up by asking a set of users to
perform a task for a very small remuneration on each unit of the task. Amazon
Mechanical Turk (AMT) is a popular platform that has a large set of registered
remote workers who are hired to perform tasks such as data labeling.
* In data labeling tasks, the crowd workers are randomly assigned a single item in
the dataset. A data object may receive multiple labels from different workers and
these have to be aggregated to get the overall true label.
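The aggregation step can be illustrated with a simple majority vote, one common aggregation rule; real platforms may also weight each worker's vote by an estimate of their reliability.

```python
from collections import Counter

def aggregate_labels(worker_labels):
    """Majority vote: the label chosen by the most workers becomes
    the overall label for the data object."""
    return Counter(worker_labels).most_common(1)[0][0]

# Three crowd workers labeled the same image; two of them say "cat".
print(aggregate_labels(["cat", "cat", "dog"]))  # cat
```

With more workers per item, the majority label becomes increasingly likely to match the true label, provided workers are better than random.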
* Crowdsourcing allows for many contributors to be recruited in a short period of
time, thereby eliminating traditional barriers to data collection. Furthermore,
crowdsourcing platforms usually employ their own tools to optimize the
annotation process, making it easier to conduct time-intensive labeling tasks.
Crowdsourced data is especially effective in generating complex and free-form
labels such as in the case of audio transcription, sentiment analysis, image
annotation or translation.
* With crowdsourcing, companies can collect information from customers and use it
to their advantage. Brands gather opinions, ask for help, receive feedback to
improve their product or service, and drive sales. For instance, Lego conducted a
campaign where customers had the chance to develop their own designs of toys
and submit them.
* To become the winner, the creator had to receive the largest number of people's
votes. The best design was moved to the production process. Moreover, the
winner got a privilege that amounted to a 1 % royalty on the net revenue.
* Types of Crowdsourcing : There are four main types of crowdsourcing.
1. Wisdom of the crowd : It is a collective opinion of different individuals
gathered in a group. This type is used for decision-making since it allows one
to find the best solution for problems.
2. Crowd creation : This type involves a company asking its customers to help
with new products. This way, companies get brand new ideas and thoughts
that help a business stand out.
3. Crowd voting : It is a type of crowdsourcing where customers are allowed to
choose a winner. They can vote to decide which of the options is the best for
them. This type can be applied to different situations. Consumers can choose
one of the options provided by experts or products created by consumers.
4. Crowdfunding : It is when people collect money and ask for investments for
charities, projects and startups without planning to return the money to the
owners. People do it voluntarily. Often, companies gather money to help
individuals and families suffering from natural disasters, poverty, social
problems, etc.
Inter and Trans Firewall Analytics
* A firewall is a device designed to control the flow of traffic into and out of a
network. In general, firewalls are installed to prevent attacks. A firewall can be a
software program or a hardware device.
* Fig. 1.13.1 shows a firewall : the firewall sits between the router connecting to
the Internet and the computers and server on the internal network.

Fig. 1.13.1 Firewall
* Firewalls are software programs or hardware devices that filter the traffic that
flows into a user's PC or network through an internet connection. They sift
through the data flow and block that which they deem harmful to the user's
network or computer system.
. Firewalls filter based on IP, UDP and TCP information. Firewall is placed on the
link between a network router and Internet or between a user and router. For
large organizations with many small networks, the firewall is placed on every
connection attached to the Internet.
* Large organizations may use multiple levels of firewalls or distributed firewalls,
instead of locating a single firewall at a single access point to the network.
¢ Firewalls test all traffic against consistent rules and pass traffic that meets those
rules. Many routers support basic firewall functionality. Firewall can also be
used to control data traffic.
* Firewall-based security depends on the firewall being the only connectivity to the
site from outside; there should be no way to bypass the firewall via other
gateways or wireless connections.
* A firewall filters out all incoming messages addressed to a particular IP address or a
particular TCP port number. It divides a network into a more trusted zone
internal to the firewall and a less trusted zone external to the firewall.
© Firewalls may also impose restrictions on outgoing traffic, 0 prevent certain
attacks and to limit losses if an attacker succeeds in getting access inside the
firewall.
* Functions of firewall :
1. Access control : The firewall filters incoming as well as outgoing packets.
2. Address/Port translation : Using network address translation, internal
machines, though not visible on the Internet, can establish a connection with
external machines on the Internet. NATing is often done by the firewall.
3. Logging : Security architecture ensures that each incoming or outgoing packet
encounters at least one firewall. The firewall can log all anomalous packets.
* Firewalls can protect the computer and user personal information from :
1. Hackers who break into your system.
2. Malware and other Internet attacks, which the firewall prevents from reaching
your computer in the first place.
3. Outgoing traffic from your computer created by a virus infection.
* Firewalls cannot provide protection :
1. Against phishing scams and other fraudulent activity.
2. Against viruses spread through e-mail.
3. From physical access to your computer or network.
4. For an unprotected wireless network.
Firewall Characteristics
1. All traffic from inside to outside, and vice versa, must pass through the firewall.
2. The firewall itself is resistant to penetration.
3. Only authorized traffic, as defined by the local security policy, will be allowed to
pass.

Firewall Rules
* The rules and regulations are set by the organization. Policy determines the types
of internal and external information resources employees can access, the kinds of
programs they may install on their own computers, as well as their authority for
reserving network resources.
Policy is typically general and set at a high level within the organization. Policies
that contain details generally become too much of a “living document”.
Users can create or disable firewall filter rules based on the following conditions :
1. IP addresses : The system admin can block a certain range of IP addresses.
2, Domain names : Admin can only allow certain specific domain names to access
your systems or allow access to only some specific types of domain names or
domain name extension.
3. Protocol : A firewall can decide which of the systems can allow or have access
to common protocols like IP, SMTP, FTP, UDP, ICMP, Telnet or SNMP.
4. Ports : Blocking or disabling ports of servers that are connected to the internet
will help maintain the kind of data flow you want to see it used for and also
close down possible entry points for hackers or malignant software.
5. Keywords : Firewalls also can sift through the data flow for a match of the
keywords or phrases to block out offensive or unwanted data from flowing in.
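The rule conditions above can be illustrated with a toy rule matcher. This is a simplified sketch: real firewalls match on many more fields, evaluate rules in a configured order and often combine them with stateful inspection.

```python
# Each rule matches on a source-IP prefix, protocol and destination port;
# the first matching rule decides, and the default policy is to deny.
RULES = [
    {"src": "10.0.0.", "proto": "tcp", "port": 22, "action": "allow"},
    {"src": "",        "proto": "tcp", "port": 80, "action": "allow"},
]

def filter_packet(src_ip, proto, port):
    """Return the action for a packet: the first matching rule wins."""
    for rule in RULES:
        if (src_ip.startswith(rule["src"])
                and proto == rule["proto"]
                and port == rule["port"]):
            return rule["action"]
    return "deny"   # default-deny: anything unmatched is dropped

print(filter_packet("10.0.0.5", "tcp", 22))     # allow (SSH from LAN)
print(filter_packet("203.0.113.9", "tcp", 23))  # deny  (Telnet blocked)
```

An empty `src` prefix matches any source, so the second rule allows web traffic from anywhere while SSH stays restricted to the internal 10.0.0.x range.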
¢ When your computer makes a connection with another computer on the network,
several things are exchanged including the source and destination ports. In a
standard firewall configuration, most inbound ports are blocked. This would
normally cause a problem with return traffic since the source port is randomly
assigned. A state is a dynamic rule created by the firewall containing the
source-destination port combination, allowing the desired return traffic to pass the
firewall.
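The state mechanism described above can be sketched as follows. This is a toy model; real stateful firewalls also track protocol flags, translate addresses and expire idle state entries.

```python
# Toy stateful firewall: an outbound connection creates a state entry,
# and inbound packets are allowed only if they match a known state.
states = set()

def outbound(src_ip, src_port, dst_ip, dst_port):
    """An internal host opens a connection; remember the expected reply."""
    states.add((dst_ip, dst_port, src_ip, src_port))

def inbound_allowed(src_ip, src_port, dst_ip, dst_port):
    """Permit an inbound packet only if it matches a recorded state."""
    return (src_ip, src_port, dst_ip, dst_port) in states

outbound("192.168.1.10", 51000, "93.184.216.34", 443)
print(inbound_allowed("93.184.216.34", 443, "192.168.1.10", 51000))  # True
print(inbound_allowed("93.184.216.34", 443, "192.168.1.10", 52000))  # False
```

This is why the randomly assigned source port causes no problem: the state entry records it when the connection is opened, so only the genuine return traffic passes.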
Types of Firewall
1. Packet filter 2. Application level firewall 3. Circuit level gateway.
Fig. 1.13.2 shows relation between OSI layer and Firewall.
* Packet filter firewall controls access to packets on the basis of packet source and
destination address or specific transport protocol type. It is done at the OSI data
link, network and transport layers. Packet filter firewall works on the network
layer of the OSI model.
* Packet filters do not see inside a packet; they block or accept packets solely on the
basis of the IP addresses and ports. All incoming SMTP and FTP packets are
parsed to check whether they should be dropped or forwarded. But outgoing SMTP
and FTP packets have already been screened by the gateway and do not have to be
checked by the packet filtering router. A packet filter firewall only checks the header
information.