Unit 1

The document provides an overview of Data Analytics using R, focusing on Big Data, its characteristics, types, and the importance of analytics in various fields. It discusses the evolution of data management, the challenges associated with Big Data, and the different types of analytics (descriptive, predictive, prescriptive). Additionally, it highlights the significance of effectively utilizing data for business growth and decision-making.


Data Analytics using R

B20CA6010 – Academic Year 2024-2025 – Semester VI (EVEN)
BCA, School of CSA
Prof. Sneha N, Assistant Professor
UNIT 1 CONTENTS

• Introduction to Big Data:
• Types of Digital Data, Introduction to Big Data, Elements of Big Data (facts, capabilities, benefits, where it is used), Big Data Analytics, How to analyze Big Data, History of Big Data, Big Data in the real world (Myths, Challenges, Future), Big Data Management.
BIG DATA ANALYTICS

• What is Big Data?

• Big Data is also data, but of huge size.
• It is a term used to describe a collection of data that is huge in size and yet growing exponentially with time.
• The data is so large and complex that none of the traditional data management tools can store or process it efficiently.
BIG DATA ANALYTICS

• Big data analytics refers to the systematic processing and analysis of large
amounts of data and complex data sets, known as big data, to extract valuable
insights.
• Big data analytics allows for the uncovering of trends, patterns and correlations
in large amounts of raw data to help analysts make data-informed decisions.
• This process allows organizations to leverage the exponentially growing data
generated from diverse sources, including internet-of-things (IoT) sensors, social
media, financial transactions and smart devices to derive actionable intelligence
through advanced analytic techniques.

BIG DATA ANALYTICS

• It refers to the strategy of analysing large volumes of data, or big data.
• This big data is gathered from a wide variety of sources, including social networks, videos, digital images, sensors, and sales transaction records.
• The aim of analysing all this data is to uncover information, such as hidden patterns, unknown correlations, market trends and customer preferences, that can help organizations make informed business decisions.
IMPORTANCE OF BIG DATA

• The importance of big data does not revolve around how much data a company has
but how a company utilizes the collected data.
• Every company uses data in its own way; the more efficiently a company uses its
data, the more potential it has to grow.
• The company can take data from any source and analyse it to find answers that enable:
• Cost savings
• Time reductions
• Understanding of market conditions
• Control of online reputation
BIG DATA CAN BE USED IN

• Using Big Data Analytics to boost customer acquisition and retention
• Using Big Data Analytics to solve advertisers' problems and offer marketing insights
• Big Data Analytics as a driver of innovation and product development
• Fraud detection
• Credit risk
• Digital marketing
• Risk management
• Health care
• Advertising
CHARACTERISTICS OF BIG DATA

TYPES OF DIGITAL DATA
• Structured Data
• Semi-Structured Data
• Unstructured Data

Gartner estimates that 80% of the data generated in any enterprise today is unstructured data. Roughly 10% of data is in the structured and semi-structured category.
TYPES OF DIGITAL DATA

• Structured Data
• Data which is in an organized form (rows and columns) and can be easily used by a computer program (a small R sketch follows below).
• Conforms to a data model.
• E.g.: data stored in databases.
• Sources of Structured Data
• Databases such as Oracle, MySQL, DB2, Teradata, ….
• Spreadsheets
• OLTP systems
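To make the idea concrete, here is a minimal R sketch of structured data: a small table whose rows and columns conform to a fixed schema. The table, its column names and its values are invented purely for illustration.

  # Structured data: rows and columns conforming to a schema (hypothetical values).
  customers <- data.frame(
    id    = c(1L, 2L, 3L),
    name  = c("Asha", "Ravi", "Meera"),
    city  = c("Bengaluru", "Mysuru", "Mangaluru"),
    spend = c(1200.50, 845.00, 2310.75)
  )

  # Because the data conforms to a model, a program can query it directly,
  # e.g. total spend per city (the kind of operation OLTP/SQL systems perform).
  aggregate(spend ~ city, data = customers, FUN = sum)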
TYPES OF DIGITAL DATA

• Semi-Structured Data
• Data which does not conform to a data model but has some structure.
• It is not in a form which can be easily used by a computer program.
• E.g.: XML, HTML, e-mails, … (see the JSON sketch below).
• Sources of Semi-structured Data
• XML (eXtensible Markup Language)
• JSON (JavaScript Object Notation)
• Used to transmit data between a server and a web application.
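As a hedged illustration, the sketch below reads a small JSON document into R. It assumes the widely used jsonlite package, which is not prescribed by these notes, and the JSON content itself is invented.

  # Semi-structured data: JSON has some structure (keys, nesting) but no fixed schema.
  library(jsonlite)   # assumed to be installed

  json_txt <- '{
    "order_id": 101,
    "customer": {"name": "Asha", "city": "Bengaluru"},
    "items": [
      {"product": "Laptop", "qty": 1},
      {"product": "Mouse",  "qty": 2}
    ]
  }'

  # fromJSON() maps the JSON into R lists and data frames so that a program
  # can work with it, even though the source had no rigid schema.
  order <- fromJSON(json_txt)
  order$customer$name   # "Asha"
  order$items           # data frame of products and quantities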
TYPES OF DIGITAL DATA

• Unstructured Data
• Data which does not conform to a data model or is not in a form which can be used easily by a computer program.
• E.g.: memos, images, audio, video, letters, etc.
• Sources of Unstructured Data
• Web pages
• Body of e-mail
• Images
• Text messages
• Audio
• Chat
• Videos
• Social media data
• Word documents
TYPES OF DIGITAL DATA

Structured data:
• Rows & columns
• DBMS, RDBMS
• SQL

Unstructured data:
• Audio, video, analog data
• Data captured by sensors
• RFID
• Weather forecasting
• NoSQL

Semi-structured data:
• XML data
• Supports both structured and unstructured data
The Model of Generating/Consuming Data has Changed

Old Model: Few companies are generating data, all others are consuming data.

New Model: All of us are generating data, and all of us are consuming data.
DATA SIZE

Name       Symbol   Number of bytes                      Equal to
Kilobyte   KB       1,024                                1,024 bytes
Megabyte   MB       1,048,576                            1,024 KB
Gigabyte   GB       1,073,741,824                        1,024 MB
Terabyte   TB       1,099,511,627,776                    1,024 GB
Petabyte   PB       1,125,899,906,842,624                1,024 TB
Exabyte    EB       1,152,921,504,606,846,976            1,024 PB
Zettabyte  ZB       1,180,591,620,717,411,303,424        1,024 EB
Yottabyte  YB       1,208,925,819,614,629,174,706,176    1,024 ZB
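The table is simple arithmetic: each unit is 1,024 times the previous one. A short R check for anyone who wants to reproduce the numbers:

  # Each unit is 1,024 times the previous one (1 KB = 1,024 bytes, ..., 1 YB = 1,024 ZB).
  units <- c("KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB")
  setNames(sprintf("%.0f", 1024^(1:8)), units)   # bytes in each unit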
HOW TO DEAL WITH UNSTRUCTURED DATA?

• Data Mining
• Process of discovering knowledge hidden in large volumes of data.
• Text Analytics (or Text Mining)
• Process of gleaning high-quality, meaningful information from text (a small base-R example follows below).
• Includes tasks such as text categorization, text clustering, sentiment analysis, ….
• Natural Language Processing (NLP)
• Related to the area of human-computer interaction.
• Enabling computers to understand human (natural) language input.
• Noisy Text Analytics
• Process of extracting structured or semi-structured information from noisy unstructured data such as chats, blogs, emails, text messages, etc.
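As a rough illustration of text mining, the sketch below turns a few unstructured sentences into a structured word-frequency table using only base R. The sentences are invented; real text analytics would add stop-word removal, stemming and similar steps.

  # From unstructured text to a structured word-frequency table (toy example).
  docs <- c(
    "Big data tools make analysis faster and cheaper",
    "Text mining extracts meaning from unstructured text",
    "Sentiment analysis of chats and emails needs noisy text handling"
  )

  words <- unlist(strsplit(tolower(docs), "[^a-z]+"))   # lowercase and tokenize
  words <- words[nchar(words) > 0]                      # drop empty tokens
  head(sort(table(words), decreasing = TRUE))           # most frequent terms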
DATA GROWS

• “Every day, we create 2.5 quintillion bytes of data — so much that 90%
of the data in the world today has been created in the last few years
alone.”
• How to manage very large amounts of data and extract value and knowledge
from them ?

WHERE DO WE GET DATA

• This data comes from everywhere:


• sensors used to gather climate information,
• posts to social media sites,
• digital pictures and videos,
• purchase transaction records, and
• cell phone GPS signals to name a few.

• Definition of BIG DATA: a collection of datasets so large and complex that they cannot be processed by traditional data processing applications.
• Big data constitutes both structured and unstructured data that grows so large and so fast that it is not manageable by traditional RDBMS tools or conventional statistical tools.
BIG DATA

Data: facts and statistics collected together for reference or analysis.

• Volume
• Velocity
• Variety
• Value
THE V'S OF BIG DATA

• Volume
• Bits -> Bytes -> Kilobytes -> Megabytes -> Gigabytes -> Terabytes -> Petabytes -> Exabytes -> Zettabytes -> Yottabytes
THE V'S OF BIG DATA

• Velocity
• Refers to the increasing speed at which data is created, and the increasing speed at which it can be processed, stored and analysed by relational databases.
• Data is being generated fast and needs to be processed fast.
• Online data analytics
• Late decisions -> missed opportunities
• Examples
• E-Promotions: based on your current location, your purchase history and what you like -> send promotions right now for the store next to you.
• Healthcare monitoring: sensors monitoring your activities and body -> any abnormal measurements require immediate reaction.
THE V'S OF BIG DATA

• Variety (structured, unstructured and semi-structured)
• Various formats, types, and structures
• Text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc.
• Static data vs. streaming data
• A single application can be generating/collecting many types of data
SUMMARY OF THE V'S

• Veracity
• Refers to biases, noise and abnormality in data.
• Validity
• Refers to the accuracy and correctness of data.
• Volatility
• Deals with how long the data is valid and how long it should be stored.
• Variability
• Data whose meaning is constantly changing.
WHO IS GENERATING BIG DATA

• Internet - Google, Amazon, eBay, AOL, etc.
• Mobile gaming - online betting, multi-user games
• Marketing - social network analysis, digital advertising, etc.
• Telecom - call routing management, subscriber data, etc.
• Healthcare - maintaining patient records, etc.
BIG DATA ANALYTICS

“Big Data Analytics is the process of examining big data to uncover patterns,
unearth trends and find unknown correlations and other useful information to
make faster and better decisions”

“Process of collecting, organizing and analyzing of large sets of data (big data)
to discover patterns and other useful information”

CLASSIFICATION OF ANALYTICS

• First school of thought


• Basic, Operationalized, Advanced and Monetized
• Second school of thought
• Analytics 1.0, 2.0 and 3.0

FIRST SCHOOL OF THOUGHT

• Basic analytics
• Slicing and dicing of data to help with basic business insights.
• Reporting on historical data, basic visualization, etc.
• Operationalized analytics
• Analytics woven into the enterprise's business processes.
• Advanced analytics
• Forecasting the future by way of predictive and prescriptive modeling.
• Monetized analytics
• Analytics used to derive direct business revenue.
SECOND SCHOOL OF THOUGHT

• Analytics 1.0
• Mid-1950s to 2009
• Descriptive (and diagnostic) statistics
• Report on events, occurrences, etc. of the past.
• What happened?
• Why did it happen?
• Analytics 2.0
• 2005 to 2012
• Descriptive statistics + predictive statistics
• Use data from the past to make predictions for the future.
• What will happen?
• Why will it happen?
SECOND SCHOOL OF THOUGHT

• Analytics 3.0
• 2012 to present
• Descriptive + predictive + prescriptive statistics
• Use data from the past to make predictions for the future and, at the same time, make recommendations to leverage the situation to one's advantage.
• What will happen?
• When will it happen?
• Why will it happen?
• What action should be taken to take advantage of what will happen?
ANALYTICS 1.0, 2.0, 3.0

• Descriptive Analytics
• which use data aggregation and data mining to provide insight into the past and
answer: “What has happened?”
• Insight into the past
• Use Descriptive Analytics when you need to understand at an aggregate level what is
going on in your company, and when you want to summarize and describe different
aspects of your business.

• Predictive Analytics
• which use statistical models and forecasting techniques to understand the future
and answer: “What could happen?”
• Understanding the future
• Use Predictive Analytics any time you need to know something about the future, or
fill in the information that you do not have.
ANALYTICS 1.0, 2.0, 3.0

• Prescriptive Analytics
• which use optimization and simulation algorithms to advise on possible outcomes and answer: "What should we do?"
• Advise on possible outcomes
• Use Prescriptive Analytics any time you need to provide users with advice on what action to take (a short R illustration follows below).
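To ground the three flavours of analytics, here is a compact R sketch using the built-in cars dataset (speed vs. stopping distance). It is only illustrative; prescriptive analytics is noted in comments because it usually needs extra optimization tooling beyond base R.

  # Descriptive analytics: summarize what has happened in the data.
  summary(cars)

  # Predictive analytics: fit a simple model and ask "what could happen?"
  fit <- lm(dist ~ speed, data = cars)
  predict(fit, newdata = data.frame(speed = c(10, 20, 30)))

  # Prescriptive analytics would go one step further, e.g. feeding the model
  # into an optimization that recommends an action (not shown here).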
TRADITIONAL BI VS. BIG DATA

• In a traditional BI environment, all the enterprise's data is stored in a central server, whereas in a big data environment data resides in a distributed file system (DFS).
• A DFS scales out horizontally, as compared to a typical database server that scales up vertically.
• In traditional BI, data is generally analysed in an offline mode, whereas big data is analysed in both real-time and offline modes.
• Traditional BI is about structured data, and it is here that data is taken to the processing functions (move data to code).
• Big data is about variety; the processing functions are taken to the data (move code to the data).
READING DATA WITH A SINGLE MACHINE VS. PARALLEL PROCESSING
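The original slide illustrates this with a diagram; as a stand-in, the sketch below shows the same idea in R with the base parallel package: the identical computation run serially in one process and then split across local worker processes. The chunk sizes and worker count are arbitrary choices for illustration.

  # Serial vs. parallel processing of the same chunks of data.
  library(parallel)

  chunks <- split(rnorm(1e6), rep(1:4, each = 2.5e5))   # pretend: 4 data blocks

  ser_sums <- lapply(chunks, sum)          # one machine, one process

  cl <- makeCluster(2)                     # two local worker processes
  par_sums <- parLapply(cl, chunks, sum)   # same work, distributed
  stopCluster(cl)

  identical(ser_sums, par_sums)            # same answer either way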
BIG DATA CHALLENGES

THE EVOLUTION OF DATA MANAGEMENT

• The technology challenges are:
• How does your organization deal with massive amounts of data in a meaningful way?
• How do you make sense of that data when you cannot easily recognize the patterns that are the most meaningful for your business decisions?
• Each data management wave was born out of the necessity to solve a specific type of data management problem.
• These waves or phases evolved because of cause and effect.
• So, to understand big data, you have to understand the foundation of these waves.
DATA MANAGEMENT – WAVE 1
Creating manageable data structures
• The relational model added a level of abstraction (the structured query language [SQL], report generators, and data management tools) so that it was easier for programmers to satisfy the growing business demand to extract value from data.
• The relational model offered an ecosystem of tools from a large number of emerging software companies.
• Problems:
• Storing the growing volume of data was expensive and accessing it was slow.
• Lots of data duplication existed.
• The actual business value of the data was hard to measure.
DATA MANAGEMENT – WAVE 1
Creating manageable data structures
• Solution:
• When the volume of data that organizations needed to manage grew out of control, the data warehouse provided a solution.
• The data warehouse enabled the IT organization to select a subset of the data being stored so that it would be easier for the business to try to gain insights.
• Data warehouses and data marts solved many problems for companies needing a consistent way to manage massive transactional data.
• Problems:
• When managing huge volumes of unstructured or semi-structured data, the warehouse was not able to evolve enough to meet changing demands.
• It was too slow for increasingly real-time business and consumer environments.
DATA MANAGEMENT – WAVE 2
Web and content management
• Enterprise Content Management systems evolved in the 1980s to provide businesses with the capability to better manage unstructured data, mostly documents.
• These systems were used to store and manage documents, web content, images, audio, and video.
• They offered a platform that incorporated business process management, version control, information recognition, text management, and collaboration. This new generation of systems also added metadata.
DATA MANAGEMENT – WAVE 3
Managing big data
• With big data, it is now possible to virtualize data so that it can be stored efficiently and, utilizing cloud-based storage, more cost-effectively as well.
• Technologies at the heart of big data include virtualization, parallel processing, distributed file systems, and in-memory databases.
BEGINNING WITH CAPTURE, ORGANIZE,
INTEGRATE, ANALYZE, AND ACT
• Data must first be captured, and then organized and integrated.
• After this phase is successfully implemented, data can be analyzed based on the problem being addressed.
• Finally, management takes action based on the outcome of that analysis.
• For example, Amazon.com might recommend a book based on a past purchase, or a customer might receive a coupon for a discount on a future purchase of a product related to one that was just purchased.

(Figure: The cycle of Big Data Management)
QUESTIONS AN ORGANIZATION SHOULD ASK

• How much data will my organization need to manage today and in the
future?
• How often will my organization need to manage data in real time or near
real time?
• How much risk can my organization afford? Is my industry subject to strict
security, compliance, and governance requirements?
• How important is speed to my need to manage data?
• How certain or precise does the data need to be?

REQUIREMENTS OF BIG DATA
• Interfaces:
• A defining fact of big data is that it relies on picking up lots of data from lots of sources.
• Therefore, open application programming interfaces (APIs) will be core to any big data architecture.
• Physical infrastructure:
• Without the availability of robust physical infrastructure, big data would probably not have emerged.
• Data may be physically stored in many different locations and can be linked together through networks, the use of a distributed file system, and various big data analytic tools and applications.
REQUIREMENTS OF BIG DATA

• The more important big data analysis becomes to companies, the more important it will be to secure that data.
• For example, if you are a healthcare company, you will probably want to use big data applications to determine changes in demographics or shifts in patient needs.
• New approaches to data management are emerging in the big data world, including document, graph, columnar, and geospatial database architectures.
• Collectively, these are referred to as NoSQL, or "not only SQL", databases.
REQUIREMENTS OF BIG DATA

• MapReduce, Hadoop, and Big Table
• MapReduce was designed by Google as a way of efficiently executing a set of functions against a large amount of data in batch mode (a toy R illustration follows below).
• The "map" component distributes the programming problem or tasks across a large number of systems and handles the placement of the tasks in a way that balances the load and manages recovery from failures.
• After the distributed computation is completed, another function called "reduce" aggregates all the elements back together to provide a result.
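The sketch below is only a toy illustration of the map/reduce idea in plain R (it is not Hadoop or Google's implementation): each input chunk is mapped to partial word counts, and the partial results are then reduced into one combined count.

  # Toy map/reduce word count. 'docs' stands in for chunks of a large dataset.
  docs <- c("big data needs big tools", "tools for big data")

  # Map step: one partial word-count table per chunk.
  mapped <- Map(function(d) table(strsplit(d, " ")[[1]]), docs)

  # Reduce step: merge the partial counts into a single result.
  merge_counts <- function(a, b) {
    keys <- union(names(a), names(b))
    sapply(keys, function(k) sum(a[k], b[k], na.rm = TRUE))
  }
  Reduce(merge_counts, mapped)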
REQUIREMENTS OF BIG DATA
• Big Table
• Big Table was developed by Google as a distributed storage system intended to manage highly scalable structured data.
• Data is organized into tables with rows and columns. It is intended to store huge volumes of data across commodity servers.
• Hadoop
• An Apache-managed software framework derived from MapReduce and Big Table.
• Hadoop allows applications based on MapReduce to run on large clusters of commodity hardware.
BIG DATA IN REAL WORLD APPLICATIONS

• Big Data in the real world
• Nothing helps us understand Big Data more than examples of how the technology and approaches are being used in the real world.
• These examples help us learn how to apply ideas from other industries to our own business.
• Much of today's data comes from the online world.
• Examples include digital marketing, financial services, advertising, and healthcare.
BENEFITS OF BIG DATA

• Data accumulation from multiple sources, including the Internet, social media platforms, online shopping sites, company databases, external third-party sources, etc.
• Real-time forecasting and monitoring of the business as well as the market.
• Identify crucial points hidden within large datasets to influence business decisions.
• Promptly mitigate risks by optimizing complex decisions for unforeseen events and potential threats.
• Identify issues in systems and business processes in real time.
• Unlock the true potential of data-driven marketing.
• Dig into customer data to create tailor-made products, services, offers, discounts, etc.
• Facilitate speedy delivery of products/services that meet and exceed client expectations.
• Diversify revenue streams to boost company profits and ROI.
• Respond to customer requests, grievances, and queries in real time.
• Foster innovation of new business strategies, products, and services.
CHALLENGES OF BIG DATA

• One of the issues with Big data is the exponential growth of raw data. Data centres and databases store huge amounts of data, which is still growing rapidly. With this exponential growth, organizations often find it difficult to store the data correctly.
• The next challenge is choosing the right Big Data tool. There are various Big Data tools, and choosing the wrong one can result in wasted effort, time and money.
• The next challenge of Big Data is securing it. Organizations are often so busy understanding and analyzing the data that they leave data security for a later stage, and unprotected data ultimately becomes a breeding ground for hackers.
ADVANTAGES OF USING BIG DATA IN BUSINESS

• Better decision making


• Greater innovations
• Improvement in education sector
• Product price optimization
• Recommendation engines
• Life-Saving application in the healthcare industry

APPLICATIONS OF BIG DATA

• Tracking customer spending habits and shopping behavior:
• In big retail stores (like Amazon, Walmart, Big Bazaar, etc.), the management team has to keep data on customers' spending habits (which products customers spend on, which brands they prefer, how frequently they spend), shopping behavior, and customers' most-liked products (so that they can keep those products in the store). Based on which products are searched for or sold most, the production/collection rate of those products is set.
• Virtual personal assistant tools: Big data analysis helps virtual personal assistant tools (like Siri on Apple devices, Cortana on Windows, Google Assistant on Android) provide answers to the various questions asked by users. These tools track the user's location, their local time, the season, other data related to the question asked, etc. Analyzing all such data, they provide an answer.
APPLICATIONS OF BIG DATA

• Education sector:
• Organizations conducting online educational courses utilize big data to search for candidates interested in their courses. If someone searches for a YouTube tutorial video on a subject, then online or offline course providers on that subject send that person online ads about their courses.
• Media and entertainment sector:
• Media and entertainment service providers like Netflix, Amazon Prime and Spotify analyze data collected from their users. Data such as what type of videos or music users watch or listen to most, how long users spend on the site, etc., are collected and analyzed to set the next business strategy.
APPLICATIONS OF BIG DATA

• Smart traffic systems:
• Data about traffic conditions on different roads is collected through cameras placed beside roads and at the entry and exit points of the city, and through GPS devices placed in vehicles (Ola and Uber cabs, etc.). All such data is analyzed, and jam-free or less congested, faster routes are recommended. In this way a smart traffic system can be built in a city through Big data analysis. An added benefit is that fuel consumption can be reduced.
• Banking and securities industry:
• The Securities Exchange Commission (SEC) is using Big Data to monitor financial market activity. It currently uses network analytics and natural language processors to catch illegal trading activity in the financial markets. Big Data trade analytics are used in high-frequency trading, pre-trade decision-support analytics, sentiment measurement, predictive analytics, etc.
• This industry also relies heavily on Big Data for risk analytics, including anti-money laundering, enterprise risk management, "Know Your Customer", and fraud mitigation.
APPLICATIONS OF BIG DATA

• Healthcare:
• The healthcare sector has access to huge amounts of data but has been
plagued by failures in utilizing the data to curb the cost of rising healthcare
and by inefficient systems that stifle faster and better healthcare benefits
across the board.

APPLICATIONS OF BIG DATA

• Manufacturing and natural resources
• Increasing demand for natural resources, including oil, agricultural products, minerals, gas, metals, and so on, has led to an increase in the volume, complexity, and velocity of data that is a challenge to handle.
• Similarly, large volumes of data from the manufacturing industry are untapped. The underutilization of this information prevents improved product quality, energy efficiency, reliability, and better profit margins.
MYTHS IN BIG DATA

Myth 1: Big data is everywhere
• Fact: It is true that at present Big data technologies and services are the center of attention in industry, with record-high usage. However, Gartner's Big data facts and figures show that only 73 percent of organizations are planning or investing in Big data, and many of them are still at a budding stage of Big data adoption.
Myth 2: Big data is all about size
• Fact: Big data is characterized by the 5 V's – Volume, Velocity, Variety, Veracity, and Value. Though handling a massive amount of data is one of the main features of Big data, volume alone is not its defining characteristic.
MYTHS IN BIG DATA

Myth 3: Big data can predict everything about the future of the business
• Fact: Analytics can predict trends using Big data, but it is not the data alone that drives the business. A business stands on many factors, such as the economy, human resources, technology and many more. Hence, when it comes to predicting the future of a business, you cannot predict anything with certainty from data alone.
Myth 4: Big Data means a big budget and it is only for big companies
• Fact: It is true that organizations like multinational corporations and government bodies have invested huge amounts to set up large-scale data centers and high-end technologies for implementing Big data. Employing skilled Big data professionals and data scientists is also a very costly affair, as their demand is high due to the resource crunch in the market.
MYTHS IN BIG DATA

• Myth 5: There is no need for a Data warehouse once Big data is in place
• Fact: First of all, a Data warehouse is an architecture, whereas Big data is purely a technology; hence one cannot technically replace the other. A technology like Big data stores and manages an enormous scale of data for use in different Big data solutions at a reasonably low cost.
• Myth 6: Big Data technology will eliminate the necessity of data integration
• Fact: Big data technology uses a "schema on read" approach to process information. This enables organizations to use multiple data models for reading the same sources. It is commonly thought that this brings the flexibility to allow end users to determine how to interpret data assets on demand, and that Big data provides data access tailored to individual users.
MYTHS IN BIG DATA

• Myth 7: Big Data is always quality data
• Fact: Big data does not necessarily mean clean, quality data. On the contrary, in most cases Big data includes data quality errors. Furthermore, to derive better and correct insights from collected Big data, it is necessary to clean it. Hence, it is a wrong assumption that there is no need for data cleaning when collecting or analyzing Big data.
