
SEMINAR REPORT

On

INTRODUCTION TO BIG DATA

Bachelor of Technology

Department of Electronics & Communication

Institute of Engineering and Technology, Ayodhya

SUBMITTED BY:
Name - Alisha Khan
Year/Sem - 3rd/5th
Roll no - 19204

SUBMITTED TO:
Er. Shambhavi Shukla


Abstract:

Big data is a broad term for data sets so large or complex that traditional data processing
applications are inadequate. Challenges include analysis, capture, data curation, search, sharing,
storage, transfer, visualization, and information privacy.

The term often refers simply to the use of predictive analytics or certain other advanced methods
to extract value from data, and seldom to a particular size of data set. Accuracy in big data may
lead to more confident decision making, and better decisions can mean greater operational
efficiency, cost reductions and reduced risk.

Analysis of data sets can find new correlations to "spot business trends, prevent diseases,
combat crime and so on." Scientists, practitioners of media and advertising, and governments
alike regularly meet difficulties with large data sets in areas including Internet search, finance
and business informatics. Scientists encounter limitations in e-Science work, including
meteorology, genomics, connectomics, complex physics simulations, and biological and
environmental research.

Data sets grow in size in part because they are increasingly being gathered by cheap and
numerous information-sensing mobile devices, aerial (remote sensing) devices, software logs, cameras,
microphones, radio-frequency identification (RFID) readers, and wireless sensor networks. The
world's technological per-capita capacity to store information has roughly doubled every 40
months since the 1980s; as of 2012, about 2.5 exabytes (2.5×10^18 bytes) of data were created every
day. One challenge for large enterprises is determining who should own big data initiatives that
straddle the entire organization.
INDEX

 Introduction to big data
 Why big data
 Characteristics of big data
 Technology career paths / How various sectors can benefit from big data
 Process of data analysis
 ETL and Data Warehousing
 Apache HADOOP
 HDFS
 Map Reduce
 YARN
 Commodity cluster and fault tolerance
 Programming models for big data
 Cloud Computing
 Conclusion
 References
 Annexure- I
INTRODUCTION TO BIG DATA:

Big data is a term that describes the large volume of data – both structured and unstructured –
that inundates a business on a day-to-day basis. But it’s not the amount of data that’s important.
It’s what organizations do with the data that matters. Big data can be analyzed for insights that
lead to better decisions and strategic business moves.

‘Big data’ refers to datasets whose size is beyond the ability of typical database software tools to
capture, store, manage, and analyze.

The term Big Data refers to the massive volume of structured and unstructured data which is very
hard to process using conventional techniques. Using big data, companies can know radically
more about their businesses, and directly translate that knowledge into improved decision
making and performance.

 Big Data philosophy encompasses unstructured, semi-structured and structured data;
however, the main focus is on unstructured data.
 Big Data represents information assets characterized by such high Volume, Velocity and
Variety as to require specific technology and analytical methods for their transformation into
Value.

WHY BIG DATA:


 To address the need to handle a complex variety of data we need a mechanism
or engineering approach, and big data helps in simplifying complex data structures.
 It is needed to derive insights from complex and huge volumes of data. Data can
be enormous, but to analyze it we need a system, and that is where a big data
system helps.
 It helps in cost reduction, as big data systems can be installed at
affordable prices as well.
 It helps in better decision making, as the analytics and algorithms involved
provide accurate and appropriate analysis in most cases.
 It is also scalable and can be used from a single machine to many servers.

CHARACTERISTICS OF BIG DATA:

Big Data has four key characteristics which help us understand the advantages, as well as the
challenges, faced in big data initiatives. These characteristics are also known as the 4 V’s of Big
Data.

1. Volume:

Volume is the V most frequently associated with big data – data quantity can be so big that it
reaches incomprehensible proportions. For example, Facebook stores more than 250
billion images uploaded by people, in addition to all the individual posts (over 2.5
trillion posts). Overall, close to 2.5 exabytes (1 exabyte = 10^9 gigabytes) of data are
being produced every day, and the total data in the world is expected to reach 44 zettabytes
(1 zettabyte = 10^12 gigabytes) by 2020.

2. Velocity:

Velocity is the measure of how fast data is being generated and collected. For
example, on Facebook, more than 350 million photos are uploaded every day.
This data needs to be collected, stored, filed, and made available for retrieval whenever
required. Data velocity highlights the need to process data quickly and, most
importantly, to use it at a faster rate than ever before. Many types of data have a limited
shelf-life and their value can diminish very quickly. For example, to improve sales in a
retail business, out of stock products should be identified within minutes rather than days
or weeks.

3. Variety:

Data can come in all forms – photos, videos, sensor data, tweets, encrypted packets and
so on. Data is not always accumulated in the form of rows and columns in a database – it
can be either structured or unstructured. With an increase in data sources, there are more
varieties of data in different formats – from traditional documents and databases to semi-
structured and unstructured data, including click streams, GPS location data, and social
media apps. Different data formats mean it is tougher to derive value from the data
because it must all be extracted and processed in different ways.

4. Veracity:

Data veracity is the degree to which data is accurate, precise, and trusted. It refers to the
biases, noise, and abnormality in the data. To avoid ‘dirty data’ accumulating in our
systems, we need a strategy to keep the data usable. Having diverse and messy
data requires a lot of cleanup, and obtaining and cleaning datasets still takes more time for a
data scientist than putting their investigational skills (statistics, machine learning, and
algorithms) to use.

TECHNOLOGY CAREER PATHS / HOW VARIOUS SECTORS CAN
BENEFIT FROM BIG DATA:

Big data is contributing to many industries: the public sector, healthcare, the insurance sector and
many more.

1. HEALTHCARE:

2. INSURANCE:
Major car insurance companies, such as AXA and AIG, are using big data to gather driver
behavioral analytics to create specialized and customized insurance packages/policies.

 AIG offers drivers an internet-enabled application to score and track driving
performance. AIG then analyzes this data to monitor patterns, create tailor-made
insurance packages, and ultimately, offer better services to their users.
 AXA offers a “DriveSafe” application to drivers who are under the age of 24. The
application records drivers’ journeys and measures their performance. AXA then uses the
data to set insurance discounts for drivers.
 Predictive analytics uses the big data collected by insurance companies to precisely and
accurately calculate claims, pricing, emerging trends and risk selection.

3. BANKING:-

 Creating customer profiles
 Fraud detection
 Lending decisions
 Cyber security
4. TELECOM:
The telecom industry worldwide is finding itself in a highly complex environment of decreasing
margins and congested networks; an environment that is as cutthroat as ever. A new IBM study
on how telcos are using Big Data shows that 85% of the respondents indicate that the use of
information and analytics is creating a competitive advantage for them. Big data initiatives
promise to improve growth and increase efficiency and profitability across the entire telecom
value chain. Yes, Big Data to the rescue again!
The potential of Big Data, however, poses a challenge: how can a company utilize data to
increase revenues and profits across the value chain, spanning network operations, product
development, marketing, sales, and customer service?

Big Data analytics, for instance, enables companies to predict peak network usage so that they
can take measures to relieve congestion. It can also help identify customers who are most likely
to have problems paying bills.

Telecommunication companies collect massive amounts of data from call detail records, mobile
phone usage, network equipment, server logs, billing, and social networks, providing lots of
information about their customers and network.

The telecommunication industry has been boosted by the active use of machine learning and data
science, and this shift has been for the better: a great many aspects and issues have become much
easier to resolve, control or even prevent from happening.

BIG DATA AS A CAREER PATH:


 Business analyst/Data analyst
 Data Scientists
 Machine Learning Professionals
 Data engineers/ETL developers
 Data Management Professionals
PROCESS OF DATA ANALYSIS:

Steps in the Data Science Process:


1. Acquire includes anything that involves retrieving data: finding,
accessing, acquiring, and moving it.
 It includes identification of, and authenticated access to, all related data,
 and transportation of data from its sources to distributed file systems.
 It includes ways to subset and match the data to regions or times of interest,
 sometimes referred to as a geo-spatial query.
2. The next activity is to prepare the data. We divide this preparation activity
into two steps based on the nature of the work,
namely, explore data and pre-process data.
i) The first step in data preparation involves literally looking at the data to understand
its nature, what it means, its quality and its format.
It often takes a preliminary analysis of the data, or of samples of the data, to understand it.
This is why this step is called explore.
Once we know more about the data through exploratory analysis, the next step is pre-
processing of the data for analysis.
ii) Pre-processing includes cleaning data, sub-setting or filtering data, and creating data
that programs can read and understand, such as modelling raw data into a more
defined data model, or packaging it using a specific data format.
If there are multiple data sets involved, this step also includes integration of multiple data
sources or streams.
3. The prepared data is then passed on to the analysis step,
which involves selecting the analytical techniques to use, building a model of the data, and
analyzing the results.
This step can take a couple of iterations on its own, or it might require data scientists to go
back to steps one and two to get more data or to package the data in a different way.
4. Step four, communicating results, includes evaluation of the analytical results,
presenting them in a visual way, and creating reports that include an assessment of the results
with respect to the success criteria.
Activities in this step are often referred to with terms like interpret, summarize,
visualize, or post-process.
5. Reporting insights from the analysis and determining actions from those insights, based
on the purpose initially defined, is what we refer to as the act step.
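
As a loose illustration of these five steps, the following minimal Python sketch strings the
acquire, prepare, analyze, report and act stages together. The file name, column names and the
"below average" criterion are hypothetical placeholders, not details taken from this report.

# A minimal, hypothetical sketch of the five-step data science process above.
# The file name, columns and threshold are illustrative only.
import pandas as pd

# 1. Acquire: retrieve raw data from a source (here, a local CSV file).
raw = pd.read_csv("sales_records.csv")           # hypothetical source file

# 2. Prepare: explore, then pre-process (clean and filter).
print(raw.describe())                             # explore: a quick look at the data
clean = raw.dropna(subset=["region", "amount"])   # pre-process: drop incomplete rows
clean = clean[clean["amount"] > 0]                # filter out invalid records

# 3. Analyze: build a simple model of the data (total sales per region).
totals = clean.groupby("region")["amount"].sum()

# 4. Report: evaluate and present the results.
print(totals.sort_values(ascending=False).head(10))

# 5. Act: use the insight, e.g. flag below-average regions for follow-up.
print("Regions below average:", list(totals[totals < totals.mean()].index))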

ETL & DATA WAREHOUSING:

 ETL (Extract, Transform, Load) is a process of data integration that encompasses three
steps: extraction, transformation, and loading.
 In a nutshell, ETL systems take large volumes of raw data from multiple sources, convert
it for analysis, and load that data into a data warehouse.

• Data warehousing emphasizes the capture of data from different sources for access and
analysis.

• Three main components of a Data Warehouse:

– Data sources from operational systems, such as Excel, ERP, CRM or financial
applications

– A data staging area where data is cleaned and ordered

– A presentation area where data is warehoused.

• Data may be:

– Structured

– Semi-structured

– Unstructured

• Types of Data Warehouse

– Enterprise Data Warehouse:

– Operational Data Store


– Data Mart

Typical Data Warehouse

• Step 1:

Operational, transactional, or day-to-day business data is gathered.

• Step 2:

This data is then integrated, cleaned up, transformed, and standardized through the
process of Extraction, Transformation, and Loading (ETL).

• Step 3:

The transformed data is then loaded into an enterprise data warehouse or data marts.

• Step 4:

A host of market-leading business intelligence and analytics tools are then used to enable
decision making through ad-hoc queries, SQL, enterprise dashboards, data mining, etc.
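
To make the four steps concrete, here is a small, hypothetical ETL sketch in Python. The source
file, column names and target table name are illustrative assumptions, and SQLite merely stands
in for the presentation area of a real warehouse.

# Hypothetical ETL sketch: extract from an operational export, transform in a
# staging step, and load into a (stand-in) warehouse table.
import sqlite3
import pandas as pd

# Extract: pull raw data out of an operational source (a CSV export here).
orders = pd.read_csv("orders_export.csv")                     # hypothetical extract

# Transform: clean and standardize the data in the staging area.
orders["order_date"] = pd.to_datetime(orders["order_date"], errors="coerce")
orders = orders.dropna(subset=["order_date", "customer_id"])  # drop unusable rows
orders["amount"] = orders["amount"].round(2)                  # standardize precision

# Load: write the cleaned data into the presentation area.
with sqlite3.connect("warehouse.db") as conn:
    orders.to_sql("fact_orders", conn, if_exists="append", index=False)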

Apache HADOOP:
Today, there are over 100 open-source projects for big data, and this number continues to grow.
Many of them rely on Hadoop, but some are independent.

The frameworks and applications in the Hadoop ecosystem have several overarching themes and
goals. First, they provide scalability to store large volumes of data on commodity hardware. As
the number of systems increases, so does the chance of crashes and hardware failures; a second
goal, supported by most frameworks in the Hadoop ecosystem, is the ability to recover gracefully
from these problems.

In addition, big data comes in a variety of flavors, such as text files, graphs of social
networks, streaming sensor data and raster images. A third goal of the Hadoop ecosystem, then, is
the ability to handle these different data types; for any given type of data, there are several
projects in the ecosystem that support it.

A fourth goal of the Hadoop ecosystem is the ability to facilitate a shared environment. Since
even modest-sized clusters can have many cores, it is important to allow multiple jobs to execute
simultaneously. Why buy servers only to let them sit idle?

Another goal of the Hadoop ecosystem is providing value for an enterprise. The ecosystem includes
a wide range of open-source projects backed by a large, active community of users. These projects
are free to use and easy to find support for.

The three main parts of Hadoop are:

1. The Hadoop Distributed File System, or HDFS.
2. YARN, the scheduler and resource manager.
3. MapReduce, a programming model for processing big data.

HDFS- HADOOP DISTRIBUTED FILE SYSTEM:

 It is a storage system for enormous amounts of data.
 It serves as the basis for most of the tools in the Hadoop ecosystem.
 HDFS provides two main capabilities for managing big data:
1. Scalability to large data sets
2. Reliability in the face of hardware failures
 Massive storage
 It achieves scalability by splitting or partitioning large files across multiple nodes for
parallel access.
 Typical file sizes range from gigabytes to terabytes.
 The default block (chunk) size is 64 MB.
 By default, HDFS creates 3 copies (replicas) of each block.
 HDFS can handle a variety of data types.
 It has two main components: the NameNode and the DataNode.

NameNode:

 It is employed for metadata.
 It issues commands to DataNodes across the cluster.
 There is one NameNode per cluster.
 It decides which nodes store which data files.
 The NameNode keeps track of file names, locations in the directory tree, etc.
 It can be seen as the administrator/coordinator of the HDFS cluster.

DataNode:

 It is employed for block storage.
 There is one DataNode per machine.
 It runs on each node within the cluster.
 It stores file blocks.
 Block creation, deletion and replication happen here.
 Replication provides fault tolerance and data locality.
 Replication also means the identical block will be stored on different nodes of the system,
which may be in different geographical locations.
 A location may be a particular rack or a data center in a different town.
 The location is very important, since we wish to move computation to the data and not the
other way around. A small sketch of block splitting and replica placement follows this list.
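
To make the ideas of block splitting and replication concrete, here is a purely illustrative Python
sketch. It is not HDFS code: the block size and replication factor come from the bullets above,
while the node names and the round-robin placement rule are simplifying assumptions (real HDFS
placement also takes racks and cluster load into account).

# Illustrative only: mimics splitting a file into fixed-size blocks and
# assigning each block to 3 different DataNodes. Not actual HDFS logic.
import itertools
import os

BLOCK_SIZE = 64 * 1024 * 1024        # 64 MB default block size mentioned above
REPLICATION = 3                      # default replication factor
NODES = ["node1", "node2", "node3", "node4", "node5"]   # hypothetical DataNodes

def split_and_place(path):
    """Return a mapping of block index -> nodes holding a replica of that block."""
    size = os.path.getsize(path)
    num_blocks = max(1, -(-size // BLOCK_SIZE))          # ceiling division
    rotation = itertools.cycle(NODES)
    placement = {}
    for block in range(num_blocks):
        # Take the next REPLICATION distinct nodes in round-robin order.
        placement[block] = [next(rotation) for _ in range(REPLICATION)]
    return placement

if __name__ == "__main__":
    print(split_and_place("big_input_file.dat"))         # hypothetical large file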

MAP-REDUCE:
 It is a simplified programming model for Hadoop.

MAP -> apply() {applies an operation to all elements}

REDUCE -> summarize() {summarizes the results over the elements}

 Google uses MapReduce for indexing websites.
 It hides the complexity of parallel programming and simplifies building parallel applications.
 It executes processing in parallel.
 Its output depends solely on its input.
 It relies on YARN to schedule and execute parallel processing over the distributed file
blocks in HDFS.
 Map creates a key-value pair for each word on a line, with the word as the key.

The MapReduce framework operates exclusively on <key, value> pairs, i.e. the framework views the
input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the
output of the job, conceivably of different types:

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
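
A classic illustration of this model is counting words. The following self-contained Python sketch
imitates the map, shuffle/combine and reduce phases in a single process; in a real Hadoop job these
phases run in parallel across the cluster, and the input lines below are invented for the example.

# Single-process imitation of MapReduce word counting (illustrative only).
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word on the line (the word is the key)."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Summarize the values for one key: here, sum the counts."""
    return key, sum(values)

if __name__ == "__main__":
    lines = ["big data needs big tools", "hadoop processes big data"]   # invented input
    mapped = [pair for line in lines for pair in map_phase(line)]
    counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(mapped).items())
    print(counts)   # {'big': 3, 'data': 2, 'needs': 1, 'tools': 1, 'hadoop': 1, 'processes': 1}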
YARN (YET ANOTHER RESOURCE NEGOTIATOR):

YARN is one of the main components of the Hadoop ecosystem. YARN is used to run
various kinds of distributed applications beyond MapReduce, as it allows data stored in HDFS to be
handled by various kinds of processing, such as batch processing, stream processing, interactive
processing and graph processing.

It also allows flexible scheduling. It works as the resource manager for the HDFS storage. It
splits the responsibilities of the earlier job tracker between an application manager and a
resource manager. As this data goes through different kinds of processing, each kind is managed
in YARN by different components, which increases the efficiency of the system.

Many different kinds of resources are also progressively allocated for optimum utilization.
YARN helps a lot in the proper usage of the available resources, which is very necessary for the
processing of a high volume of data.

IMPORTANT FEATURES OF YARN:

 YARN provides the feature of multi-tenancy to an organization.
 YARN also provides the feature of cluster utilization, so Hadoop clusters can be used
dynamically.
 It facilitates great compatibility.
 It facilitates scalability.
COMPONENTS OF YARN:

 CONTAINER:
A container holds physical resources on a single node, such as disk, CPU
cores and RAM. A Container Launch Context (CLC) is used to invoke containers: data about
dependencies, security tokens and environment variables is maintained as a
record known as the Container Launch Context (CLC).

 APPLICATION MANAGER:
In this framework, when a single job is submitted, it is called an application. Monitoring
application progress, tracking application status, and negotiating resources with the
resource manager are the responsibility of the application manager. Everything an application
needs in order to run is requested by sending the Container Launch Context (CLC).

 NODE MANAGER:
The node manager takes care of individual nodes in the Hadoop cluster and also
manages the containers on each specific node. It registers with the Resource
Manager and sends each node's health status to the Resource Manager, stating whether the node
has finished working with its resources. Its primary goal is to manage the containers on its
node that are assigned by the resource manager.

 RESOURCE MANAGER:
The Resource Manager is the master daemon of YARN and is responsible for resource
management and resource assignment for all applications. Requests received by the resource
manager are forwarded to the corresponding node manager, and resources are allocated by the
resource manager according to each application's requirements.

 SCHEDULER:
The scheduler schedules tasks based on resource availability and application allocation.
It performs no other task, such as restarting a job after failure or tracking and monitoring
tasks. The YARN scheduler supports different scheduler plugins, such as the Fair
Scheduler and the Capacity Scheduler, for partitioning cluster resources. A toy illustration of
container scheduling follows this list.
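
To make the container and scheduler concepts concrete, here is a toy Python sketch of YARN-style
resource scheduling. The node names, capacities and the first-fit placement rule are simplifying
assumptions for illustration; this is not YARN's actual algorithm.

# Toy illustration: containers (CPU cores, RAM) requested by applications are
# placed on nodes with spare capacity. Not YARN's real scheduling logic.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_cores: int
    free_ram_gb: int

def schedule(cores, ram_gb, nodes):
    """First-fit placement: return the node that can host the container, or None."""
    for node in nodes:
        if node.free_cores >= cores and node.free_ram_gb >= ram_gb:
            node.free_cores -= cores
            node.free_ram_gb -= ram_gb
            return node.name
    return None   # no node currently has enough free capacity

if __name__ == "__main__":
    cluster = [Node("node1", 8, 32), Node("node2", 4, 16)]   # hypothetical nodes
    for cores, ram in [(2, 8), (4, 16), (4, 16)]:            # container requests
        print(f"container ({cores} cores, {ram} GB) ->", schedule(cores, ram, cluster))
    # container (2 cores, 8 GB) -> node1
    # container (4 cores, 16 GB) -> node1
    # container (4 cores, 16 GB) -> node2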

COMMODITY CLUSTERS AND FAULT TOLERANCE:

Commodity clusters are an important class of modern-day supercomputers. A commodity
cluster is an ensemble of fully independent computing systems integrated by a commodity off-
the-shelf interconnection communication network. Commodity clusters exploit the economy of
scale of their mass-produced subsystems and components to deliver the best performance relative
to cost in high performance computers for many user workloads. Clusters represent more than
80% of all the systems on the Top 500 list and a larger part of commercial scalable systems.
While they do not drive the very peak performance in the field, they are the class of system most
likely to be encountered in a typical machine room, even when such a data centre may include a
more tightly coupled and expensive massively parallel processor among its other computing
resources. Clusters have now been in use for more than 2 decades and almost all applications,
software environments, and various tools found within the domain of supercomputing run on
them. All programming interfaces, software environments and tools, and methods described
here are directly relevant to, and can be used with, commodity clusters. In this section the
generic commodity cluster computer is described as a vehicle for a first-pass examination of a
total supercomputer, both hardware and software, so that in-depth discussions of later topics
can be conceptually placed within their respective context.

When the size of the data itself becomes part of the problem, the big data era is approaching.
Big data technologies describe a new generation of technologies and architectures,
designed to economically extract value from very large volumes of a wide variety of data, by
enabling high-velocity capture, discovery, and/or analysis. Fault tolerance is of great importance
for big data systems, which may suffer software and hardware faults after their development.
This report introduces some popular applications and case studies of big data mining. The
architecture of big data's individual components has parallel and distributed features, including
distributed data processing, distributed storage and distributed memory; this report briefly
introduces the Hadoop architecture of big data systems and then presents some recent fault
tolerance work in big data systems.

PROGRAMMING MODELS FOR BIG DATA:


CLOUD COMPUTING:
 Cloud computing is the on-demand availability of computer system resources, especially
data storage and computational power, without direct active management by the user.
 The term is generally used to describe data centres available to many users over the
Internet. Large clouds, predominant today, often have functions spread over multiple
locations from central servers.
 Cloud servers are hosted in data centers all around the globe.
 The cloud can be accessed through browsers or dedicated mobile apps.
 Cloud benefits are similar to those of a car rental company:
- Pay for what you use, with low capital investment
- Quick implementation
- Faster results
- Happy customers
- Fewer resource management issues
 Some well-known cloud providers are
AWS, GCP (Google Cloud Platform), Luna Cloud, Microsoft Azure, CloudSigma,
Dimension Data, etc.
 The cloud service models are: 1. IaaS (Infrastructure as a Service)
2. PaaS (Platform as a Service)
3. SaaS (Software as a Service)
PLAGIARISM CHECK REPORT:
Plagiarism detected: 15%

CONCLUSION:

Datasets are growing rapidly because they are being captured by cheap and numerous devices
such as mobile phones, remote sensors, software logs, cameras, microphones, and wireless
sensor networks. The work to analyze big data may require massively parallel software running
on hundreds or even thousands of servers.

Harnessing the power of big data will enable analysts, researchers, and business owners to make
better and faster decisions using data that was previously inaccessible or unusable.

REFERENCES:

In the completion of this report, help has been taken from the following references:

 https://www.coursera.org/learn/big-data-introduction
 https://www.futurelearn.com
 https://chartio.com/learn
 https://training.digitalvidya.com/courses/big-data-foundation-course
 https://en.wikipedia.org
