
SEMINAR REPORT

On

INTRODUCTION TO BIG DATA

Bachelor of Technology

Department of Electronics & Communication

Institute of Engineering and Technology, Ayodhya

SUBMITTED BY:
Name - Alisha Khan
Year/Sem - 3rd/5th
Roll no - 19204

SUBMITTED TO:
Er. Shambhavi Shukla


Abstract:

Big data is a broad term for data sets so large or complex that traditional data processing
applications are inadequate. Challenges include analysis, capture, data curation, search, sharing,
storage, transfer, visualization, and information privacy.

The term often refers simply to the use of predictive analytics or certain other advanced methods
to extract value from data, and seldom to a particular size of data set. Accuracy in big data may
lead to more confident decision making, and better decisions can mean greater operational
efficiency, cost reductions and reduced risk.

Analysis of data sets can find new correlations to "spot business trends, prevent diseases,
combat crime and so on." Scientists, practitioners of media and advertising, and governments
alike regularly meet difficulties with large data sets in areas including Internet search, finance
and business informatics. Scientists encounter limitations in e-Science work, including
meteorology, genomics, connectomics, complex physics simulations, and biological and
environmental research.

Data sets grow in size in part because they are increasingly being gathered by cheap and
numerous information-sensing mobile devices, aerial (remote sensing) devices, software logs, cameras,
microphones, radio-frequency identification (RFID) readers, and wireless sensor networks. The
world's technological per-capita capacity to store information has roughly doubled every 40
months since the 1980s; as of 2012, about 2.5 exabytes (2.5×10^18 bytes) of data were created every
day. One challenge for large enterprises is determining who should own big data initiatives that
straddle the entire organization.
INDEX

 Introduction to big data
 Why big data
 Characteristics of big data
 Technology career paths / How various sectors can benefit from big data
 Process of data analysis
 ETL and Data Warehousing
 Apache HADOOP
 HDFS
 Map Reduce
 YARN
 Commodity cluster and fault tolerance
 Programming models for big data
 Cloud Computing
 Conclusion
 References
 Annexure- I
INTRODUCTION TO BIG DATA:

Big data is a term that describes the large volume of data – both structured and unstructured –
that inundates a business on a day-to-day basis. But it’s not the amount of data that’s important.
It’s what organizations do with the data that matters. Big data can be analyzed for insights that
lead to better decisions and strategic business moves.

‘Big data’ refers to datasets whose size is beyond the ability of typical database software tools to
capture, store, manage, and analyze.

The term Big Data refers to the massive volume of structured and unstructured data which is very
hard to process using conventional techniques. Using big data, companies can know radically
more about their businesses, and directly translate that knowledge into improved decision
making and performance.

 Big Data philosophy encompasses unstructured, semi-structured and structured data;
however, the main focus is on unstructured data.
 Big Data represents information assets characterized by such high Volume, Velocity and
Variety as to require specific technology and analytical methods for their transformation into
Value.

WHY BIG DATA:


 To address the need to handle a complex variety of data we need a mechanism
or engineering approach, and big data helps in simplifying complex data structures.
 It is needed to derive insights from complex and huge volumes of data. Data can
be enormous, but to analyze it we need a system, and that is where a big data
system helps.
 It helps in cost reduction, as big data systems can be installed at
affordable prices as well.
 It helps in better decision making, as the analytics and algorithms involved
provide accurate and appropriate analysis in most cases.
 It is also scalable and can be used from a single machine to many servers.

CHARACTERISTICS OF BIG DATA:

Big Data has four key characteristics which help us understand the advantages, as well as the
challenges, faced in big data initiatives. These characteristics are also known as the 4 V’s of Big
Data.

1. Volume:

Volume is the V most frequently associated with big data – data quantity can be so big that it
reaches incomprehensible proportions. For example, Facebook stores more than 250
billion images uploaded by people, in addition to all the individual posts (over 2.5
trillion posts). Overall, close to 2.5 exabytes (1 exabyte = 10^9 gigabytes) of data are
being produced every day, and the total data in the world is expected to reach 44 zettabytes
(1 zettabyte = 10^12 gigabytes) by 2020.

2. Velocity:

Velocity is the measure of how fast data is being generated and collected. For
example, on Facebook, more than 350 million photos are uploaded every day.
This data needs to be collected, stored, filed, and made available for retrieval whenever
required. Data velocity highlights the need to process data quickly and, most
importantly, to use it at a faster rate than ever before. Many types of data have a limited
shelf-life and their value can diminish very quickly. For example, to improve sales in a
retail business, out of stock products should be identified within minutes rather than days
or weeks.

3. Variety:

Data can come in all forms – photos, videos, sensor data, tweets, encrypted packets and
so on. Data is not always accumulated in the form of rows and columns in a database – it
can be either structured or unstructured. With an increase in data sources, there are more
varieties of data in different formats – from traditional documents and databases to semi-
structured and unstructured data, including click streams, GPS location data, and social
media apps. Different data formats mean it is tougher to derive value from the data
because it must all be extracted and processed in different ways.

4. Veracity:

Data veracity is the degree to which data is accurate, precise, and trusted. It refers to the
biases, noise, and abnormality in the data. To avoid ‘dirty data’ accumulating in our
systems, we need a strategy to keep the data usable. Having diverse and messy
data requires a lot of cleanup, and obtaining and cleaning datasets still takes more time for a
data scientist than putting their investigational skills (statistics, machine learning, and
algorithms) to use.

TECHNOLOGY CAREER PATHS / HOW VARIOUS SECTORS CAN
BENEFIT FROM BIG DATA:

Big data is contributing to many industries: the public sector, healthcare, the insurance sector and
many more.

1. HEALTHCARE:

2. INSURANCE:
Major car insurance companies, such as AXA and AIG, are using big data to gather driver
behavioral analytics to create specialized and customized insurance packages/policies.

 AIG offers drivers an internet-enabled application to score and track driving
performance. AIG then analyzes this data to monitor patterns, create tailor-made
insurance packages, and ultimately, offer better services to their users.
 AXA offers a “DriveSafe” application to drivers who are under the age of 24. The
application records drivers’ journeys and measures their performance. AXA then uses the
data to set insurance discounts for drivers.
 Predictive analytics uses the big data collected by insurance companies to precisely and
accurately calculate claims, pricing, emerging trends and risk selection.

3. BANKING:-

 Creating customer profiles
 Fraud detection
 Lending decisions
 Cyber security
4. TELECOM:
The telecom industry worldwide is finding itself in a highly complex environment of decreasing
margins and congested networks; an environment that is as cutthroat as ever. A new IBM study
on how telcos are using Big Data shows that 85% of the respondents indicate that the use of
information and analytics is creating a competitive advantage for them. Big data initiatives
promise to improve growth and increase efficiency and profitability across the entire telecom
value chain. Yes, Big Data to the rescue again!
The potential of Big Data, however, poses a challenge: how can a company utilize data to
increase revenues and profits across the value chain, spanning network operations, product
development, marketing, sales, and customer service?

Big Data analytics, for instance, enables companies to predict peak network usage so that they
can take measures to relieve congestion. It can also help identify customers who are most likely
to have problems paying bills.

Telecommunication companies collect massive amounts of data from call detail records, mobile
phone usage, network equipment, server logs, billing, and social networks, providing lots of
information about their customers and network.

The telecommunication industry has been boosted by the active use of machine learning and data
science, and this shift has been for the better: a great many aspects and issues have become much
easier to resolve, control or even prevent from happening.

BIG DATA AS A CAREER PATH:


 Business analyst/Data analyst
 Data Scientists
 Machine Learning Professionals
 Data engineers/ETL developers
 Data Management Professionals
PROCESS OF DATA ANALYSIS:

Steps in the Data Science Process:


1. Acquire includes anything that involves retrieving data: finding,
accessing, acquiring, and moving it.
 It includes identification of, and authenticated access to, all related data,
 and transportation of data from its sources to distributed file systems.
 It includes ways to subset and match the data to regions or times of interest,
 sometimes referred to as a geo-spatial query.
2. The next activity is to prepare the data. We divide this preparation activity
into two steps based on the nature of the work,
namely, explore data and pre-process data.
i) The first step in data preparation involves literally looking at the data to understand
its nature, what it means, its quality and its format.
It often takes a preliminary analysis of the data, or of samples of the data, to understand it.
This is why this step is called explore.
Once we know more about the data through exploratory analysis, the next step is pre-
processing of the data for analysis.
ii) Pre-processing includes cleaning data, sub-setting or filtering data, and creating data
that programs can read and understand, such as modelling raw data into a more
defined data model, or packaging it using a specific data format.
If there are multiple data sets involved, this step also includes integration of multiple data
sources or streams.
3. The prepared data is then passed on to the analysis step,
which involves selecting the analytical techniques to use, building a model of the data, and
analyzing the results.
This step can take a couple of iterations on its own, or it might require data scientists to go
back to steps one and two to get more data or to package the data in a different way.
4. Step four, communicating results, includes evaluation of the analytical results,
presenting them in a visual way, and creating reports that include an assessment of the results
with respect to the success criteria.
Activities in this step are often referred to with terms like interpret, summarize,
visualize, or post-process.
5. Reporting insights from the analysis and determining actions from those insights, based
on the purpose initially defined, is what we refer to as the act step.
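
As a loose illustration of these five steps, the following minimal Python sketch strings the
acquire, prepare, analyze, report and act stages together. The file name, column names and the
"below average" criterion are hypothetical placeholders, not details taken from this report.

# A minimal, hypothetical sketch of the five-step data science process above.
# The file name, columns and threshold are illustrative only.
import pandas as pd

# 1. Acquire: retrieve raw data from a source (here, a local CSV file).
raw = pd.read_csv("sales_records.csv")           # hypothetical source file

# 2. Prepare: explore, then pre-process (clean and filter).
print(raw.describe())                             # explore: a quick look at the data
clean = raw.dropna(subset=["region", "amount"])   # pre-process: drop incomplete rows
clean = clean[clean["amount"] > 0]                # filter out invalid records

# 3. Analyze: build a simple model of the data (total sales per region).
totals = clean.groupby("region")["amount"].sum()

# 4. Report: evaluate and present the results.
print(totals.sort_values(ascending=False).head(10))

# 5. Act: use the insight, e.g. flag below-average regions for follow-up.
print("Regions below average:", list(totals[totals < totals.mean()].index))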

ETL & DATA WAREHOUSING:

 ETL (Extract, Transform, Load) is a process of data integration that encompasses three
steps: extraction, transformation, and loading.
 In a nutshell, ETL systems take large volumes of raw data from multiple sources, convert
it for analysis, and load that data into a data warehouse.

• Data warehousing emphasizes the capture of data from different sources for access and
analysis.

• Three main components of a Data Warehouse:

– Data sources from operational systems, such as Excel, ERP, CRM or financial
applications

– A data staging area where data is cleaned and ordered

– A presentation area where data is warehoused.

• Data may be:

– Structured

– Semi-structured

– Unstructured

• Types of Data Warehouse

– Enterprise Data Warehouse:

– Operational Data Store


– Data Mart

Typical Data Warehouse

• Step 1:

Operational, transactional, or day-to-day business data is gathered.

• Step 2:

This data is then integrated, cleaned up, transformed, and standardized through the
process of Extraction, Transformation, and Loading (ETL).

• Step 3:

The transformed data is then loaded into an enterprise data warehouse or data marts.

• Step 4:

A host of market-leading business intelligence and analytics tools are then used to enable
decision making through ad-hoc queries, SQL, enterprise dashboards, data mining, etc.
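
To make the four steps concrete, here is a small, hypothetical ETL sketch in Python. The source
file, column names and target table name are illustrative assumptions, and SQLite merely stands
in for the presentation area of a real warehouse.

# Hypothetical ETL sketch: extract from an operational export, transform in a
# staging step, and load into a (stand-in) warehouse table.
import sqlite3
import pandas as pd

# Extract: pull raw data out of an operational source (a CSV export here).
orders = pd.read_csv("orders_export.csv")                     # hypothetical extract

# Transform: clean and standardize the data in the staging area.
orders["order_date"] = pd.to_datetime(orders["order_date"], errors="coerce")
orders = orders.dropna(subset=["order_date", "customer_id"])  # drop unusable rows
orders["amount"] = orders["amount"].round(2)                  # standardize precision

# Load: write the cleaned data into the presentation area.
with sqlite3.connect("warehouse.db") as conn:
    orders.to_sql("fact_orders", conn, if_exists="append", index=False)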

Apache HADOOP:
Today, there are over 100 open-source projects for big data, and this number continues to grow.
Many of them rely on Hadoop, but some are independent.

The frameworks and applications in the Hadoop ecosystem have several overarching themes and
goals. First, they provide scalability to store large volumes of data on commodity hardware. As
the number of systems increases, so does the chance of crashes and hardware failures; a second
goal, supported by most frameworks in the Hadoop ecosystem, is the ability to recover gracefully
from these problems.

In addition, big data comes in a variety of flavors, such as text files, graphs of social
networks, streaming sensor data and raster images. A third goal of the Hadoop ecosystem, then, is
the ability to handle these different data types; for any given type of data, there are several
projects in the ecosystem that support it.

A fourth goal of the Hadoop ecosystem is the ability to facilitate a shared environment. Since
even modest-sized clusters can have many cores, it is important to allow multiple jobs to execute
simultaneously. Why buy servers only to let them sit idle?

Another goal of the Hadoop ecosystem is providing value for an enterprise. The ecosystem includes
a wide range of open-source projects backed by a large, active community of users. These projects
are free to use and easy to find support for.

The three main parts of Hadoop are:

1. The Hadoop Distributed File System, or HDFS.
2. YARN, the scheduler and resource manager.
3. MapReduce, a programming model for processing big data.

HDFS- HADOOP DISTRIBUTED FILE SYSTEM:

 It is a storage system for enormous amounts of data.
 It serves as the basis for most of the tools in the Hadoop ecosystem.
 HDFS provides two main capabilities for managing big data:
1. Scalability to large data sets
2. Reliability in the face of hardware failures
 Massive storage
 It achieves scalability by splitting or partitioning large files across multiple nodes for
parallel access.
 Typical file sizes range from gigabytes to terabytes.
 The default block (chunk) size is 64 MB.
 By default, HDFS creates 3 copies (replicas) of each block.
 HDFS can handle a variety of data types.
 It has two main components: the NameNode and the DataNode.

NameNode:

 It is employed for metadata.
 It issues commands to DataNodes across the cluster.
 There is one NameNode per cluster.
 It decides which nodes store which data files.
 The NameNode keeps track of file names, locations in the directory tree, etc.
 It can be seen as the administrator/coordinator of the HDFS cluster.

DataNode:

 It is employed for block storage.
 There is one DataNode per machine.
 It runs on each node within the cluster.
 It stores file blocks.
 Block creation, deletion and replication happen here.
 Replication provides fault tolerance and data locality.
 Replication also means the identical block will be stored on different nodes of the system,
which may be in different geographical locations.
 A location may be a particular rack or a data center in a different town.
 The location is very important, since we wish to move computation to the data and not the
other way around. A small sketch of block splitting and replica placement follows this list.
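
To make the ideas of block splitting and replication concrete, here is a purely illustrative Python
sketch. It is not HDFS code: the block size and replication factor come from the bullets above,
while the node names and the round-robin placement rule are simplifying assumptions (real HDFS
placement also takes racks and cluster load into account).

# Illustrative only: mimics splitting a file into fixed-size blocks and
# assigning each block to 3 different DataNodes. Not actual HDFS logic.
import itertools
import os

BLOCK_SIZE = 64 * 1024 * 1024        # 64 MB default block size mentioned above
REPLICATION = 3                      # default replication factor
NODES = ["node1", "node2", "node3", "node4", "node5"]   # hypothetical DataNodes

def split_and_place(path):
    """Return a mapping of block index -> nodes holding a replica of that block."""
    size = os.path.getsize(path)
    num_blocks = max(1, -(-size // BLOCK_SIZE))          # ceiling division
    rotation = itertools.cycle(NODES)
    placement = {}
    for block in range(num_blocks):
        # Take the next REPLICATION distinct nodes in round-robin order.
        placement[block] = [next(rotation) for _ in range(REPLICATION)]
    return placement

if __name__ == "__main__":
    print(split_and_place("big_input_file.dat"))         # hypothetical large file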

MAP-REDUCE:
 It is a simplified programming model for Hadoop.

MAP -> apply() {applies an operation to all elements}

REDUCE -> summarize() {summarizes the results over the elements}

 Google uses MapReduce for indexing websites.
 It hides the complexity of parallel programming and simplifies building parallel applications.
 It executes processing in parallel.
 Its output depends solely on its input.
 It relies on YARN to schedule and execute parallel processing over the distributed file
blocks in HDFS.
 Map creates a key-value pair for each word on a line, with the word as the key.

The MapReduce framework operates exclusively on <key, value> pairs, i.e. the framework views the
input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the
output of the job, conceivably of different types:

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
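
A classic illustration of this model is counting words. The following self-contained Python sketch
imitates the map, shuffle/combine and reduce phases in a single process; in a real Hadoop job these
phases run in parallel across the cluster, and the input lines below are invented for the example.

# Single-process imitation of MapReduce word counting (illustrative only).
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word on the line (the word is the key)."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Summarize the values for one key: here, sum the counts."""
    return key, sum(values)

if __name__ == "__main__":
    lines = ["big data needs big tools", "hadoop processes big data"]   # invented input
    mapped = [pair for line in lines for pair in map_phase(line)]
    counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(mapped).items())
    print(counts)   # {'big': 3, 'data': 2, 'needs': 1, 'tools': 1, 'hadoop': 1, 'processes': 1}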
YARN (YET ANOTHER RESOURCE NEGOTIATOR):

YARN is one of the main components of the Hadoop ecosystem. YARN is used to run
various kinds of distributed applications beyond MapReduce, as it allows data stored in HDFS to be
handled by various kinds of processing, such as batch processing, stream processing, interactive
processing and graph processing.

It also allows flexible scheduling. It works as the resource manager for the HDFS storage. It
splits the responsibilities of the earlier job tracker between an application manager and a
resource manager. As this data goes through different kinds of processing, each kind is managed
in YARN by different components, which increases the efficiency of the system.

Many different kinds of resources are also progressively allocated for optimum utilization.
YARN helps a lot in the proper usage of the available resources, which is very necessary for the
processing of a high volume of data.

IMPORTANT FEATURES OF YARN:

 YARN provides the feature of multi-tenancy to an organization.
 YARN also provides the feature of cluster utilization, so Hadoop clusters can be used
dynamically.
 It facilitates great compatibility.
 It facilitates scalability.
COMPONENTS OF YARN:

 CONTAINER:
A container holds physical resources on a single node, such as disk, CPU
cores and RAM. A Container Launch Context (CLC) is used to invoke containers: data about
dependencies, security tokens and environment variables is maintained as a
record known as the Container Launch Context (CLC).

 APPLICATION MANAGER:
In this framework, when a single job is submitted, it is called an application. Monitoring
application progress, tracking application status, and negotiating resources with the
resource manager are the responsibility of the application manager. Everything an application
needs in order to run is requested by sending the Container Launch Context (CLC).

 NODE MANAGER:
The node manager takes care of individual nodes in the Hadoop cluster and also
manages the containers on each specific node. It registers with the Resource
Manager and sends each node's health status to the Resource Manager, stating whether the node
has finished working with its resources. Its primary goal is to manage the containers on its
node that are assigned by the resource manager.

 RESOURCE MANAGER:
The Resource Manager is the master daemon of YARN and is responsible for resource
management and resource assignment for all applications. Requests received by the resource
manager are forwarded to the corresponding node manager, and resources are allocated by the
resource manager according to each application's requirements.

 SCHEDULER:
The scheduler schedules tasks based on resource availability and application allocation.
It performs no other task, such as restarting a job after failure or tracking and monitoring
tasks. The YARN scheduler supports different scheduler plugins, such as the Fair
Scheduler and the Capacity Scheduler, for partitioning cluster resources. A toy illustration of
container scheduling follows this list.
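
To make the container and scheduler concepts concrete, here is a toy Python sketch of YARN-style
resource scheduling. The node names, capacities and the first-fit placement rule are simplifying
assumptions for illustration; this is not YARN's actual algorithm.

# Toy illustration: containers (CPU cores, RAM) requested by applications are
# placed on nodes with spare capacity. Not YARN's real scheduling logic.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_cores: int
    free_ram_gb: int

def schedule(cores, ram_gb, nodes):
    """First-fit placement: return the node that can host the container, or None."""
    for node in nodes:
        if node.free_cores >= cores and node.free_ram_gb >= ram_gb:
            node.free_cores -= cores
            node.free_ram_gb -= ram_gb
            return node.name
    return None   # no node currently has enough free capacity

if __name__ == "__main__":
    cluster = [Node("node1", 8, 32), Node("node2", 4, 16)]   # hypothetical nodes
    for cores, ram in [(2, 8), (4, 16), (4, 16)]:            # container requests
        print(f"container ({cores} cores, {ram} GB) ->", schedule(cores, ram, cluster))
    # container (2 cores, 8 GB) -> node1
    # container (4 cores, 16 GB) -> node1
    # container (4 cores, 16 GB) -> node2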

COMMODITY CLUSTERS AND FAULT TOLERANCE:

Commodity clusters are an important class of modern-day supercomputers. A commodity
cluster is an ensemble of fully independent computing systems integrated by a commodity off-
the-shelf interconnection communication network. Commodity clusters exploit the economy of
scale of their mass-produced subsystems and components to deliver the best performance relative
to cost in high performance computers for many user workloads. Clusters represent more than
80% of all the systems on the Top 500 list and a larger part of commercial scalable systems.
While they do not drive the very peak performance in the field, they are the class of system most
likely to be encountered in a typical machine room, even when such a data centre may include a
more tightly coupled and expensive massively parallel processor among its other computing
resources. Clusters have now been in use for more than 2 decades and almost all applications,
software environments, and various tools found within the domain of supercomputing run on
them. All programming interfaces, software environments and tools, and methods described
here are directly relevant to, and can be used with, commodity clusters. In this section the
generic commodity cluster computer is described as a vehicle for a first-pass examination of a
total supercomputer, both hardware and software, so that in-depth discussions of later topics
can be conceptually placed within their respective context.

When the size of the data itself becomes part of the problem, the big data era is approaching.
Big data technologies describe a new generation of technologies and architectures,
designed to economically extract value from very large volumes of a wide variety of data, by
enabling high-velocity capture, discovery, and/or analysis. Fault tolerance is of great importance
for big data systems, which may suffer software and hardware faults after their development.
This report introduces some popular applications and case studies of big data mining. The
architecture of big data's individual components has parallel and distributed features, including
distributed data processing, distributed storage and distributed memory; this report briefly
introduces the Hadoop architecture of big data systems and then presents some recent fault
tolerance work in big data systems.

PROGRAMMING MODELS FOR BIG DATA:


CLOUD COMPUTING:
 Cloud computing is the on-demand availability of computer system resources, especially
data storage and computational power, without direct active management by the user.
 The term is generally used to describe data centres available to many users over the
Internet. Large clouds, predominant today, often have functions spread over multiple
locations from central servers.
 Cloud servers are hosted in data centers all around the globe.
 The cloud can be accessed through browsers or dedicated mobile apps.
 Cloud benefits are similar to those of a car rental company:
- Pay for what you use, with low capital investment
- Quick implementation
- Faster results
- Happy customers
- Fewer resource management issues
 Some well-known cloud providers are
AWS, GCP (Google Cloud Platform), Luna Cloud, Microsoft Azure, CloudSigma,
Dimension Data, etc.
 The cloud service models are: 1. IaaS (Infrastructure as a Service)
2. PaaS (Platform as a Service)
3. SaaS (Software as a Service)
PLAGIARISM CHECK REPORT:
Plagiarism detected: 15%

CONCLUSION:

Datasets are growing rapidly because they are being captured by cheap and numerous devices
such as mobile phones, remote sensors, software logs, cameras, microphones, and wireless
sensor networks. The work to analyze big data may require massively parallel software running
on hundreds or even thousands of servers.

Harnessing the power of big data will enable analysts, researchers, and business owners to make
better and faster decisions using data that was previously inaccessible or unusable.

REFERENCES:

In the completion of this report, help has been taken from the following references:

 https://www.coursera.org/learn/big-data-introduction
 https://www.futurelearn.com
 https://chartio.com/learn
 https://training.digitalvidya.com/courses/big-data-foundation-course
 https://en.wikipedia.org
