
BIG DATA STRATEGY AND TECHNOLOGY INNOVATION (BDSTI)
Session 1
ONLY FOR ACADEMIC PURPOSE (PREPARED BY DR. PREETI KHANNA) 1
SESSION PLAN

• Big Data Landscape and cross section
• Big Data as a Service (BDaaS) and Architecture
• The Big Data Organizational Transformation and Value Proposition and Framework for Big Data
• Rise of Collaborative Economy and Transportation with Big Data
• Big Data and Stream Analytics
• Big Data with Geospatial Technologies enabling Innovations
• Use Cases of Big Data in Department of Defense and Intelligence Community
• Use Cases of Big Data in Government and Security
• Use Cases of Big Data in Healthcare and Life Sciences
• Challenges and Privacy Conundrum of handling Big Data
• Machine learning and healthcare big data
• Deep Reinforcement Learning and Big Data
• Creating a culture of Innovation and Discovery in the Organization and Empowering the workforce
• Industry Student Interactions (Guest Session by Industry Expert)
• Hands-on practical sessions on Big Data Handling (Cloudera, AWS, R)
• Cases / Articles / Blog discussion

Schedule: 12th December, 2019 to 13th February, 2020 (20 Sessions)
GUEST SESSIONS AND INDUSTRY STUDENT INTERACTIONS
1. 9th January (Thursday), 10:10 to 11:30 am, 80 minutes: Mr. Sukanta Padhy, "Big data applications for Manufacturing Industry"
2. 18th January (Saturday, Extra Slot), 11:50 am to 1:10 pm, 80 minutes: Mr. Kunal Gupta, "Big Data Competitive Landscape" (Demo of Cloudera and Amazon AWS by illustrating a business case)
3., 4. 1st February (Saturday, Extra Slot), 11:50 to 13:10 and 14:00 to 15:20, 80+80 minutes: Dr. Kirti Wankhede, "Big Data Analytics and Reporting using R" (Demo by illustrating a business case)
*All of you are requested to bring your laptops



EVALUATION CRITERIA
Headers under Evaluation | Schedule for Evaluation | Total marks
Group Projects (10 Groups), Final Theme Based: plagiarism-free Report and Presentation | 6th Feb, 11th Feb and 13th Feb, 2020 (detailed schedule will be sent in advance) | 30 Marks
Class Participation/Class Activity: (a) Discussion and Presentation on Article/news/links/Caselets given to each group; (b) e-learning (Harvard edX on-line courses / MIT online / Coursera / SWAYAM / Udemy), 3 to 6 Hours | Detailed schedule will be sent in advance (will begin from 2nd week; 17th December, 2019) | 20 Marks: 10 (group performance during specific slot) + 10 (individual performance in all the classes)
Quiz (Online on Portal) | 4th Feb, 2020 (17th Session) | 10 (MCQ based)
End Term Final Exam | | 40

If you need any help or a meeting for discussion, please feel free to contact the faculty by email: Preeti.khanna@sbm.nmims.edu.
SESSION 1: BIG DATA LANDSCAPE AND CROSS SECTION
Key Content:

• Exploring Big Data and its wholeness: telematics data, text data, geospatial data, smart-grid data, sensor data
• The value big data holds
• Big data life cycle



FACTS TO BEGIN WITH ABOUT BIG DATA:
• Between 2012 and 2020, the digital universe will double every two years. (Source: IDC)
• How much data is generated every day? Over 2.5 quintillion bytes, according to 2018 figures. (Source: Domo)
• In 2019, Twitter users send more than 500,000 tweets every minute. (Source: Domo)
• 90% of enterprise analytics and business professionals currently say data and analytics are key to their organization's digital transformation initiatives. (Source: MicroStrategy)
• The share of firms investing more than $500 million annually in big data has grown from 12.7% in 2018 to 21.1% in 2019. (Source: NewVantage Partners)
• How much do companies spend on data analytics? About $187 billion in 2019. (Source: IDC)
• IBM is the largest big data and analytics vendor in terms of revenue, with $2.66 billion in 2017. (Source: Statista)
• Amazon S3 is the most popular big data access method, with more than 50% of respondents considering it critical or very important. (Source: Dresner Advisory Services)
• The Hadoop and Big Data market is projected to grow from $17.1 billion in 2017 to $99.3 billion in 2022. (Source: Statista)
• By 2020, there will be 2.7 million job postings for data science and analytics roles in the US alone. (Source: PwC)
http://bigdata.stratebi.com/spark-streaming/index.htm;jsessionid=1CABCA5319BF88A43EC4405C701B3331.bd1
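
The growth figures quoted above can be sanity-checked with a little compound-growth arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the inputs are simply the numbers cited on this slide (a doubling every two years from 2012 to 2020, and the projected market growth from $17.1 billion in 2017 to $99.3 billion in 2022).

```python
def growth_factor(start_year, end_year, doubling_period_years):
    """Total multiplier for a quantity that doubles every `doubling_period_years`."""
    return 2 ** ((end_year - start_year) / doubling_period_years)


def cagr(start_value, end_value, years):
    """Compound annual growth rate implied by a start value, an end value and a span in years."""
    return (end_value / start_value) ** (1 / years) - 1


# Digital universe: doubling every two years between 2012 and 2020 (figure quoted from IDC).
digital_universe_factor = growth_factor(2012, 2020, 2)

# Hadoop and Big Data market: $17.1 billion (2017) to $99.3 billion (2022), as quoted above.
market_cagr = cagr(17.1, 99.3, 2022 - 2017)

print(f"Digital universe grows {digital_universe_factor:.0f}x over the period")
print(f"Implied market CAGR: {market_cagr:.1%}")
```

Under these assumptions, four doublings give a 16-fold increase, and the market projection implies growth of roughly 42% per year.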
VOLUME OF DATA GENERATED IN ONE MINUTE

Source: August, 2016 Article; Marcia Conner-Big data Statistics
Data volume of global consumer internet traffic from 2017 to 2022, by subsegment (in exabytes per month)
The statistic shows a forecast for the trend in internet traffic from 2017 to 2022, by segment. In 2020, consumer data traffic in the online gaming segment is expected to amount to 7 exabytes per month, up from 1 EB in 2017. The 2017-2022 CAGR of this subsegment amounts to 59 percent.
[Chart: data volume in exabytes per month, 2017 to 2022, for four subsegments: Internet video; Web, email and data; File sharing; Online gaming]
Note: Worldwide; 2017 to 2018; fixed and mobile
Source(s): Cisco Systems; ID 267194
Big data technology adoption plans in organizations worldwide as of 2018, by vertical
[Chart: share of respondents in each vertical answering "Yes. We use big data today", "We may use big data in the future" or "No. We have no plans to use big data at all". Verticals shown: Telecommunications; Insurance; Advertising; Financial Services; Healthcare; Technology; Education (K-12 & higher Ed); Government (federal, state & local); Retail and wholesale; Manufacturing]
Note: Worldwide; 2018; Research community of over 5,000 organizations as well as crowdsourcing and vendors' customer communities
Source(s): Dresner; Statista estimates; ID 919683
Big data market size revenue forecast worldwide from 2011 to 2027 (in billion U.S. dollars)
The global big data market is forecast to grow to 103 billion U.S. dollars by 2027, more than double its expected market size in 2018. With a share of 45 percent, the software segment would become the largest big data market segment by 2027.

Market volume in billion U.S. dollars, by year:
2011: 7.6
2012: 12.25
2013: 19.6
2014: 18.3
2015: 22.6
2016: 28
2017: 35
2018*: 42
2019*: 49
2020*: 56
2021*: 64
2022*: 70
2023*: 77
2024*: 84
2025*: 90
2026*: 96
2027*: 103

Note: Worldwide; 2014 to 2018
Source(s): Wikibon; SiliconANGLE; ID 254266
SEVERAL BIG DATA SOURCES



SCOPE OF BIG DATA : EVOLVING ROLE
EVOLUTION OF DATA: WHERE IT ALL STARTED
TYPE OF ANALYTICS KEEPS CHANGING



EXAMPLES OF 'BIG DATA'
The New York Stock Exchange generates about one terabyte of new trade data per day.

Social media: statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is mainly generated by photo and video uploads, message exchanges, comments, etc.

A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.


BIG DATA TIMELINE
1991 • The Internet (WWW) is born. HTTP becomes the standard means for sharing information in this new medium.
1995 • Sun releases the Java platform; Java is the 2nd most popular language behind C.
     • GPS becomes fully operational.
1998 • Carlo Strozzi develops an open-source relational database, NoSQL.
     • Google is founded by Larry Page and Sergey Brin, who worked for about a year on a Stanford search engine project called BackRub.
1999 • Kevin Ashton, cofounder of the Auto-ID Center at MIT, coins the term "Internet of Things".
2001 • Wikipedia is launched.
2002 • Version 1.1 of Bluetooth is ratified by IEEE (as IEEE 802.15.1).
2003 • LinkedIn launches. By 2013, the site had about 260 million users.
2004 • Facebook is founded by Mark Zuckerberg in Cambridge. In 2013, the site had more than 1.15 billion users.
BIG DATA TIMELINE

2005 • The Hadoop project is created by Doug Cutting and Mike Cafarella. The name of the project came from the toy elephant of Cutting's young son.
2007 • Apple releases the iPhone and creates a strong consumer market for smartphones.
2008 • The number of devices connected to the Internet exceeds the world's population.
2011 • IBM's Watson computer analyzes 4 TB (200 million pages) of data in seconds to defeat two human players on the television show Jeopardy!
     • The IPv4 address space (2^32, or about 4.3 billion unique addresses) has been fully assigned.
2012 • The Obama administration announces the Big Data Research and Development Initiative, consisting of 84 programs in six departments.
2013 • With smartphones, tablets and Wi-Fi, everyone generates data.
TECHNOLOGICAL CHALLENGES AND SOLUTIONS FOR BIG DATA
1. VOLUME
Challenge
• How to avoid the risk of data loss from machine failure in clusters of commodity machines
Solution
• Replicate segments of data on multiple machines; a master node keeps track of segment locations
Technology
• HDFS (Hadoop Distributed File System)
• This system is modeled on the Google File System (GFS), designed to store billions of pages and sort them to answer user search queries
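
The replication idea above can be sketched in a few lines. This is a toy in-memory model, not the HDFS API: the class name `MasterNode`, its methods and the machine names are all hypothetical, and the replication factor of 3 is used here only because it is the HDFS default.

```python
import random


class MasterNode:
    """Toy master node: remembers which machines hold a replica of each data segment."""

    def __init__(self, machines, replication_factor=3):
        if replication_factor > len(machines):
            raise ValueError("not enough machines for the requested replication factor")
        self.machines = list(machines)
        self.replication_factor = replication_factor
        self.segment_locations = {}  # segment id -> set of machines holding a copy

    def store(self, segment_id):
        """Place copies of one segment on distinct, randomly chosen machines."""
        replicas = random.sample(self.machines, self.replication_factor)
        self.segment_locations[segment_id] = set(replicas)
        return replicas

    def machine_failed(self, machine):
        """Forget a failed machine and every replica it was holding."""
        self.machines.remove(machine)
        for holders in self.segment_locations.values():
            holders.discard(machine)

    def readable(self, segment_id):
        """A segment survives as long as at least one replica is left."""
        return bool(self.segment_locations.get(segment_id))


master = MasterNode(["m1", "m2", "m3", "m4"], replication_factor=3)
master.store("segment-1")
master.machine_failed("m1")
print(master.readable("segment-1"))
```

With three replicas spread over four machines, losing any single machine still leaves at least two copies, which is the point of the design: commodity hardware is expected to fail, so durability comes from redundancy rather than from reliable machines.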



TECHNOLOGICAL CHALLENGES AND SOLUTIONS FOR BIG DATA
2. INGESTING STREAMS AT AN EXTREMELY FAST PACE
Challenge
• How to handle torrential streams of data?
• How to avoid choking network bandwidth when moving large volumes of data?
Solution
• Create special scalable ingesting systems that can open an unlimited number of channels for receiving data
• These systems hold data in queues and process it in parallel
Technology
• MapReduce
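
The MapReduce pattern named above can be illustrated with a minimal single-process word count. This is a sketch of the programming model only, not Hadoop's API: the function names are hypothetical, and in a real cluster the map and reduce steps would run in parallel on many machines.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle and reduce in sequence over a list of text records."""
    emitted = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(emitted)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


print(word_count(["big data", "Big streams of data"]))
```

Because each record is mapped independently and each key is reduced independently, both phases can be spread across machines, which is what makes the model suitable for large volumes.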
TECHNOLOGICAL CHALLENGES AND SOLUTIONS FOR BIG DATA
3. HANDLING A VARIETY OF FORMS AND FUNCTIONS OF DATA
Challenge
• How to structure and access all varieties of data?
Solution
• Store the data in non-relational systems, i.e. NoSQL databases
• These databases are optimized for certain tasks such as query processing, graph processing, document processing, etc.
Technology
• NoSQL databases:
• HBase (stores each data element separately along with its key identifying information)
• Hadoop query languages such as Hive and Pig are used to access this data
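
To make "stores each data element separately along with its key identifying information" concrete, here is a toy in-memory model of that storage idea. It is not the HBase client API: the class `CellStore`, its methods and the sample row keys and column names are hypothetical.

```python
class CellStore:
    """Toy store: every value is kept as its own cell, addressed by row key, column and version."""

    def __init__(self):
        self._cells = {}  # (row_key, column) -> list of (version, value)

    def put(self, row_key, column, value, version):
        """Write one cell; older versions of the same cell are kept alongside it."""
        self._cells.setdefault((row_key, column), []).append((version, value))

    def get(self, row_key, column):
        """Return the value with the highest version, or None if the cell does not exist."""
        versions = self._cells.get((row_key, column))
        if not versions:
            return None
        return max(versions, key=lambda pair: pair[0])[1]

    def row(self, row_key):
        """Assemble the latest value of every column stored for one row key."""
        return {
            column: self.get(stored_key, column)
            for (stored_key, column) in self._cells
            if stored_key == row_key
        }


store = CellStore()
store.put("user1", "info:name", "Asha", version=1)
store.put("user1", "info:city", "Mumbai", version=1)
store.put("user1", "info:city", "Pune", version=2)
print(store.row("user1"))
```

Because each cell carries its own key, rows need not share a fixed schema: one row can have columns that another row lacks, which is what lets such stores hold varied data.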
COMPARE BIG DATA WITH TRADITIONAL DATA
There are four categories of analytics that organizations need to consider:

DESCRIPTIVE: monitoring current state performance or results
DIAGNOSTIC: understanding (quantifying) drivers of performance
PREDICTIVE: forecasting likely outcomes
PRESCRIPTIVE: recommending actions for future decisions



Attribute | Traditional Data | Big Data
Representative Structure | Lake/Pool | Flowing Stream / River
Primary Purpose | Manage business activities | Communicate, Monitor
Source of data | Business transactions, documents | Social Media, Web, Sensors, IoT
Volume of data | Gigabytes, Terabytes | More than Exabytes
Velocity of data | Ingest level is controlled | Real-time, unpredictable ingest
Variety of data | Alphanumeric | Audio, Video, Graphs, Text
Veracity of data | Clean and more trustworthy | Varies depending on source
Structure of data | Mostly Structured | Unstructured
Physical Storage of data | In SAN | Distributed clusters of commodity computers
Database organization | Relational database | NoSQL database
Data Access | SQL | NoSQL such as Pig
Data Manipulation | Conventional data processing | Parallel processing
Database tools | Commercial systems | Hadoop, Spark
Total Cost of system | Medium to high | Very High
BIG DATA AND INDUSTRY
CASE OF GOOGLE FLU TRENDS
Google Flu Trends was an influenza forecasting service pioneered by Google, and was initially regarded as enormously successful.
The program aimed to predict flu outbreaks earlier than the US Centers for Disease Control and Prevention (CDC), using Google search data calibrated against CDC surveillance data.
The striking claim was that this application could detect the onset of flu almost two weeks before the CDC saw it coming.
From 2004 until about 2012 it was reported to predict the timing and geographical location of the arrival of the flu season around the world.

However, it badly overestimated flu prevalence in the 2012-2013 season, having earlier missed the 2009 H1N1 pandemic.

Reasons: big data hubris and its underlying concerns


1. CUSTOMER ANALYTICS
The more data sources retailers use, the more complete a picture they get. Say, for each of their 10+ million customers they can analyze:

• Demographic data (this customer is a woman, 35 years old, has two children, etc.)
• Transactional data (the products she buys each time, the time of purchases, etc.)
• Web behavior data (the products she puts into her basket when she shops online)
• Data from customer-created texts (comments about the retailer that this woman leaves on the internet)

To create a 360-degree customer view, retailers need to collect, store and analyze a plethora of data.



2. BUSINESS PROCESS ANALYTICS
Companies also use big data analytics to monitor the performance of their remote employees and improve the efficiency of their processes.

Let's take transportation as an example.

Companies can collect and store the telemetry data that comes from each truck in real time to identify the typical behavior of each driver.

Thus, the company can ensure safe working conditions (for example, by flagging when drivers should take a rest).
3. ANALYTICS FOR FRAUD DETECTION
Banks can detect unusual card behavior in real time and block suspicious activities, or at least postpone them in order to notify the owner.

For example, a user who resides in Texas tries to withdraw money in Spain.

In addition, the bank can verify whether this user has any linkage with fraud-related accounts or activities across all other channels.
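
The two checks described above (an unusual location, and a link to known fraud-related accounts) can be sketched as a simple rule set. Everything here is a hypothetical illustration: the `Withdrawal` fields, the `assess` function, the account identifiers and the three outcome labels are assumptions, and a production system would typically combine many more signals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Withdrawal:
    card_id: str
    home_region: str          # where the card owner resides
    attempt_region: str       # where the withdrawal is being attempted
    linked_accounts: tuple = ()  # accounts this user is connected to on other channels


def assess(withdrawal, flagged_accounts):
    """Return 'block', 'hold_and_notify' or 'allow' for one withdrawal attempt."""
    # A link to a known fraud-related account is treated as the strongest signal.
    if any(account in flagged_accounts for account in withdrawal.linked_accounts):
        return "block"
    # An attempt far from home is postponed so the owner can be notified.
    if withdrawal.attempt_region != withdrawal.home_region:
        return "hold_and_notify"
    return "allow"


attempt = Withdrawal(card_id="card-001", home_region="Texas", attempt_region="Spain")
print(assess(attempt, flagged_accounts={"acct-999"}))
```

The Texas-versus-Spain case from the slide falls into the middle outcome: the transaction is postponed rather than rejected outright, since the owner may simply be travelling.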



4. INDUSTRIAL ANALYTICS
To avoid expensive downtime that affects all the related processes, manufacturers can use sensor data to foster proactive maintenance.

Imagine that the analytical system has been collecting and analyzing sensor data for several months to form a history of observations.

Based on this historical data, the system has identified a set of patterns that are likely to end in a machine breakdown.

For instance, the system recognizes that the picture formed by temperature and load sensors is similar to a pre-failure situation and alerts the maintenance team to check the machinery.
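
One simple way to express "the current picture is similar to a pre-failure situation" is a distance check against stored patterns. The sketch below assumes readings are (temperature, load) pairs already scaled to the 0 to 1 range; the pattern values, the threshold and the function names are hypothetical, and real systems usually use richer models than a nearest-pattern test.

```python
import math


def distance(reading, pattern):
    """Euclidean distance between two sensor readings of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(reading, pattern)))


def should_alert(reading, pre_failure_patterns, threshold):
    """Alert when the reading is within `threshold` of any known pre-failure pattern."""
    return any(distance(reading, pattern) <= threshold for pattern in pre_failure_patterns)


# Hypothetical patterns learned from the history of observations: (temperature, load).
pre_failure_patterns = [(0.90, 0.80), (0.95, 0.60)]

print(should_alert((0.88, 0.79), pre_failure_patterns, threshold=0.1))  # close to the first pattern
print(should_alert((0.40, 0.50), pre_failure_patterns, threshold=0.1))  # normal operation
```

The threshold trades false alarms against missed failures: a wider threshold sends the maintenance team out more often, a narrower one risks missing a breakdown.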
5. SUPPLY CHAIN MONITORING
Containers on ships can communicate their status and location using RFID tags. Thus retailers and their suppliers can gain real-time visibility into inventory throughout the global supply chain.

Retailers can know exactly where the items are in the warehouses and so can bring them into the store at the right time.

This is particularly relevant for seasonal items that must be sold on time, or else they will have to be sold at a discount.

• Better visibility
• Better customer service
6. PREDICTIVE POLICING

The Los Angeles Police Department (LAPD) invented the concept of predictive policing.
LAPD worked with Berkeley researchers to analyze its large database of 13 million crimes recorded over 80 years, and predicted the likelihood of crimes of certain types, at certain times and in certain locations.
They identified hotspots of crime where crimes had occurred and where crime was likely to happen in the future.

Figure caption: A map generated by PredPol software, short for predictive policing, forecasting where crimes are likely to occur. Updated maps are printed daily for each shift of patrol officers at the Los Angeles Police Department (LAPD).
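
The hotspot idea can be illustrated by counting historical incidents per map grid cell. This is a deliberately naive stand-in, not the PredPol algorithm: the function names, the grid cell size and the sample coordinates are all hypothetical, and it only ranks places where incidents already occurred.

```python
from collections import Counter


def grid_cell(lat, lon, cell_size):
    """Map a coordinate to the (row, column) of the grid cell that contains it."""
    return (int(lat // cell_size), int(lon // cell_size))


def hotspots(incidents, cell_size, top_n):
    """Return the `top_n` grid cells with the most recorded incidents."""
    counts = Counter(grid_cell(lat, lon, cell_size) for lat, lon in incidents)
    return counts.most_common(top_n)


# Hypothetical incident coordinates (latitude, longitude).
incidents = [(34.1, -118.2), (34.2, -118.3), (34.3, -118.4), (34.9, -117.1)]
print(hotspots(incidents, cell_size=0.5, top_n=1))
```

A forecasting system would go further, weighting recent incidents and time of day, but the grid-and-count step shows how a large crime database is reduced to a map of candidate areas for patrol.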
CASE STUDY EXAMPLES BY INDUSTRY:
Marketing, A/B Testing, and Personalization
Advertising Technology
Sales and Sales Enablement
Customer Success and Support Technology
Infrastructure, Hosting, and CDNs
Security Software and Services
Big Data Software and Services
Application Performance Monitoring and Log File Analysis
Business Intelligence and Analytics
HR Software, ATS, and Recruiting

https://www.docsend.com/blog/best-b2b-case-study-examples/#data



SPECIFIC EXAMPLES OF COMPANIES THAT ARE LEVERAGING BIG DATA AND ANALYTICS
Google: PageRank® and Ad Serving
Yahoo: Behavioral Targeting and Retargeting
Facebook: Ad Serving and News Feed
Apple: iTunes® Recommendations
Netflix: Movie Recommendations
Amazon: “Customers Who Bought This Item”, 1-Click® ordering and Supply Chain & Logistics
Walmart: Demand Forecasting, Supply Chain Logistics and Retail Link®
Procter & Gamble: Brand and Category Management
Federal Express: Critical Inventory Logistics
American Express and Visa: Fraud Detection
GE: Asset Optimization and Operations Optimization (Predix®)