BIG DATA
CS-585
Unit-1: Lecture 3
Contents
      •   Structure of Big Data
      •   Big Data Processes
      •   Big Data Framework
      •   Big Data Platform and Applications Framework
      •   Examples of Big Data Platform in Practice
      •   Big Data Platform Manifesto
      •   Big Data Technologies
      •   Big Data Tools
      •   Big Data Analytics and Techniques
      •   Big Data Use Case
    Big Data Processes / Life Cycle
● Problem: The sale of chewing gum is going down
● Acquisition
     –   Sales by customer, region and time
     –   Surveys of users
     –   Social networks
● Extraction
     –   Data loading from receipts
     –   Automatic reading of questionnaires
     –   Data extraction from Twitter
● Integration
     –   Based on user types
● Analysis
     –   Chewing gum bought by people older than 25
     –   Chewing gum preferred by people younger than a given age
● Interpretation
     –   Moms believe: chewing gum = bad teeth
     –   Boys and girls believe that chewing gum is for babies
● Decisions
     –   We make chewing gum without sugar
     –   We ask dentists to advertise our chewing gum as a breath refresher
     –   We make commercials targeted at boys and girls.
[Figure: the life cycle runs as a loop: Acquisition → Extraction → Integration → Analysis → Interpretation → Decision. A minimal pipeline sketch of these stages follows below.]
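To make the stages above concrete, here is a minimal, self-contained Python sketch of the same life cycle on invented toy data. Every record, field name and threshold in it is hypothetical and exists only to show how each stage feeds the next; a real pipeline would pull from point-of-sale systems, survey tools and social media APIs instead.

# Minimal, illustrative sketch of the six life-cycle stages on toy data (Python).
# Every record, field name and threshold here is hypothetical, not from the lecture.

# Acquisition: gather raw records from several sources.
receipts = [
    {"customer_age": 31, "region": "north", "units": 2},
    {"customer_age": 17, "region": "south", "units": 0},
    {"customer_age": 12, "region": "south", "units": 1},
]
survey_rows = ["age=40;opinion=bad for teeth", "age=13;opinion=for babies"]

# Extraction: parse each source into a common record shape.
def parse_survey(row):
    fields = dict(part.split("=") for part in row.split(";"))
    return {"customer_age": int(fields["age"]), "opinion": fields["opinion"]}

surveys = [parse_survey(r) for r in survey_rows]

# Integration: merge the sources, keyed on a shared attribute (age group).
def age_group(age):
    return "under_25" if age < 25 else "25_plus"

integrated = [{"group": age_group(r["customer_age"]), **r} for r in receipts + surveys]

# Analysis: aggregate purchases per age group.
units_by_group = {}
for rec in integrated:
    units_by_group[rec["group"]] = units_by_group.get(rec["group"], 0) + rec.get("units", 0)

# Interpretation: attach meaning to the numbers (done by the analyst).
print("units sold by age group:", units_by_group)

# Decision: act on the interpretation, e.g. launch a sugar-free product.
if units_by_group.get("under_25", 0) < units_by_group.get("25_plus", 0):
    print("decision: target younger buyers with a sugar-free, fresh-breath campaign")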
Big Data Framework
Big Data Platform Manifesto
Big Data Technologies
               •   Hadoop/HDFS (2007)
                      •   A framework based on the principles given by MapReduce and Bigtable.
                          It follows the principle of distributed computing, where the data is
                          distributed, managed and stored on different systems known as nodes
                          (HDFS: Hadoop Distributed File System). It was first used by Yahoo!
                          to support the storage of structured, unstructured and
                          semi-structured data.
                      •   Designed to parallelize data processing across computing nodes to
                          speed up computation and hide latency. (Created by Doug Cutting.)
               •   MapReduce (2007)
                      •   Designed to process large amounts of data in batch mode. It follows
                          the distributed computing model, where each task is mapped to many
                          systems for processing in a way that handles recovery from failure
                          and load balancing. The model was developed by Google. (A word-count
                          sketch of the map and reduce phases is shown after this list.)
                      •   The reduce operation aggregates the results. MapReduce was mainly
                          designed to work with HDFS, but it now supports other data stores as
                          well, such as Cassandra and HBase.
               •   Bigtable
                      •   Data storage was solved with the help of Bigtable, a distributed
                          storage system for managing vast quantities of highly scalable
                          structured data.
                      •   It is like a multidimensional sorted map; the captured data is stored
                          on different nodes across the system, unlike traditional databases
                          where data is organized in rows and columns.
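To illustrate the map and reduce phases described above, the following is a minimal, single-process Python sketch of the classic word count. The shuffle step is simulated in memory with a dictionary, so this mirrors only the programming model, not a real Hadoop deployment; the toy input lines and function names are made up for the example.

from collections import defaultdict

# Map phase: each mapper turns one input line into (word, 1) pairs.
def mapper(line):
    for word in line.lower().split():
        yield word, 1

# Shuffle phase (simulated here in memory): group all values by key across mapper outputs.
def shuffle(mapped_pairs):
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped

# Reduce phase: each reducer aggregates the values for one key.
def reducer(word, counts):
    return word, sum(counts)

# Toy "HDFS blocks": in a real cluster these lines would live on different nodes.
lines = ["big data needs big tools", "map reduce processes big data"]

mapped = [pair for line in lines for pair in mapper(line)]
results = dict(reducer(word, counts) for word, counts in shuffle(mapped).items())
print(results)  # {'big': 3, 'data': 2, 'needs': 1, ...}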
                         Big Data Technologies
•Cassandra - 2008 - A key-value pair NoSQL database, with column family data representation and asynchronous
masterless replication.
•HBase - 2008 - A key-value pair NoSQL database, with column family data representation, with master-slave
replication. It uses HDFS as underlying storage.
•Zookeeper - 2008 - A distributed coordination service for distributed applications. It is based on a Paxos algorithm
variant called Zab.
•Pig - 2009 - Pig is a scripting interface over MapReduce for developers who prefer a scripting interface to native
Java MapReduce programming.
•Hive - 2009 - Hive is a SQL interface over MapReduce for developers and analysts who prefer a SQL interface to
native Java MapReduce programming.
•Mahout - 2009 - A library of machine learning algorithms, implemented on top of MapReduce, for finding
meaningful patterns in HDFS datasets.
•Sqoop - 2010 - A tool to import data from an RDBMS/data warehouse into HDFS/HBase and export it back.
•YARN - 2011 - A system to schedule applications and services on an HDFS cluster and manage the cluster resources
like memory and CPU.
•Flume - 2011 - A tool to collect, aggregate, reliably move and ingest large amounts of data into HDFS.
•Storm - 2011 - A system to process high-velocity streaming data with 'at least once' message semantics.
•Spark - 2012 - An in-memory data processing engine that can run a DAG of operations. It provides libraries for
machine learning, a SQL interface and near real-time stream processing.
•Kafka - 2012 - A distributed messaging system with partitioned topics for very high scalability.
•SolrCloud - 2012 - A distributed search engine with a REST-like interface for full-text search. It uses the Lucene
library for data indexing.
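To contrast Spark's in-memory DAG style with classic batch MapReduce, here is a short PySpark sketch of the same word count. It assumes a local PySpark installation (for example via pip install pyspark); the input is a small in-memory RDD rather than an HDFS file, purely for illustration.

from pyspark.sql import SparkSession

# Start a local Spark session; on a cluster this would run under YARN or another resource manager.
spark = SparkSession.builder.master("local[*]").appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

# Small in-memory RDD instead of an HDFS file, to keep the sketch self-contained.
lines = sc.parallelize(["big data needs big tools", "spark keeps intermediate data in memory"])

# The chained transformations form a DAG; nothing executes until an action (collect) is called.
counts = (lines.flatMap(lambda line: line.split())   # split lines into words
               .map(lambda word: (word, 1))          # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))     # sum counts per word

print(dict(counts.collect()))  # e.g. {'big': 2, 'data': 2, ...}
spark.stop()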
Big Data Tools
           •   Databases
                 •   MongoDB, CouchDB, Cassandra, Redis, BigTable, HBase,
                     Hypertable, Voldemort, Riak, ZooKeeper
           •   MapReduce
                 •   Hadoop, Hive, Pig, Cascading, Cascalog, mrjob,
                     Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban,
                     Oozie, Greenplum
           •   Storage
                 •   S3, HDFS
           •   Servers
                 •   EC2, Google App Engine, Elastic Beanstalk, Heroku
           •   Processing
                 •   R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene,
                     Elasticsearch, Datameer, BigSheets, TinkerPop
Big Data Tools and Technologies (Graphical View)
Apache Hadoop Ecosystem (updated view)
How Uber solved its data problem
Big Data Success Stories
  Healthcare      • Epidemic early warning
                  • ICU and remote monitoring
                  • USD 150,000 reduction in the cost of unnecessary neonatal surgery
  Transportation  • Fleet risk advisors helping truck operators by building stronger and
                    faster risk prediction models: 80% reduction in serious accidents, 20%
                    reduction in minor accidents, 30% improvement in driver retention rates.
  Life Insurance  • In Japan, claim processing has been made faster: 22% fewer mistakenly
                    unpaid claims, 90% accuracy in coding medical terms, 20% reduction in
                    the assessment workforce.
  IT Company      • IBM's Big Data business grew over 150% in 2014. IBM joined Apple and
                    Twitter in strategic partnerships.
                        References
•   IBM ICE course Material
•   http://blog.newtechways.com/2017/10/apache-hadoop-
    ecosystem.html
•   https://www.edureka.co/blog/what-is-big-data/
•   https://bigdataanalyticsnews.com/
•   https://data-flair.training/blogs/hadoop-tutorial/
•   https://intellipaat.com/blog/tutorial/hadoop-tutorial/
•   https://nptel.ac.in/courses/106/104/106104189/
Thank You
Wish you a prosperous career with Big Data