UNIT – 1
What is Big Data?
According to Gartner, the definition of Big Data is:
“Big data is high-volume, high-velocity and high-variety information assets that
demand cost-effective, innovative forms of information processing for enhanced
insight and decision making.”
This definition clearly answers the “What is Big Data?” question – Big Data refers to
complex and large data sets that have to be processed and analyzed to uncover
valuable information that can benefit businesses and organizations.
However, there are certain basic tenets of Big Data that make the question even
simpler to answer:
• It refers to a massive amount of data that keeps on growing exponentially with time.
• It is so voluminous that it cannot be processed or analyzed using conventional data
processing techniques.
• Working with Big Data involves data mining, data storage, data analysis, data
sharing, and data visualization.
• The term is an all-comprehensive one, covering the data itself and the data
frameworks, along with the tools and techniques used to process and analyze the data.
Distributed file system:
In big data analytics (BDA), a distributed file system plays a crucial role in storing and
managing large volumes of data across a distributed cluster of computers or nodes. It
provides a scalable, fault-tolerant, and efficient solution for handling the massive
amounts of data involved in big data processing. A distributed file system allows data
to be stored across multiple nodes, enabling parallel processing and distributed data
access.
Here are some commonly used distributed file systems in the context of big data
analytics:
1. Hadoop Distributed File System (HDFS): HDFS is a distributed file system
designed for the Hadoop platform. It is highly scalable and fault-tolerant,
capable of handling large-scale data sets across a cluster of machines. HDFS
breaks data into blocks and replicates them across multiple nodes for
redundancy and high availability.
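The block-and-replica idea can be sketched in a few lines of Python. This is a toy illustration, not HDFS itself: the 128 MB block size and replication factor of 3 mirror common HDFS defaults, but the round-robin placement below is a simplifying assumption, not HDFS's actual rack-aware placement policy.

```python
# Toy sketch of HDFS-style block splitting and replica placement.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS default block size
REPLICATION = 3                 # each block is stored on 3 nodes by default

def place_blocks(file_size, nodes):
    """Split a file into fixed-size blocks and assign each block to
    REPLICATION distinct nodes (simple round-robin, for illustration)."""
    num_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
    return placement

nodes = ["node1", "node2", "node3", "node4"]
plan = place_blocks(400 * 1024 * 1024, nodes)  # a 400 MB file -> 4 blocks
print(len(plan))   # 4
print(plan[0])     # ['node1', 'node2', 'node3']
```

Because every block lives on several nodes, losing one machine loses no data, and readers can fetch different blocks from different nodes in parallel.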
2. Google File System (GFS): GFS is a distributed file system developed by Google
and serves as the foundation for various Google services. It is designed to
handle large files and is optimized for sequential reads and appends. GFS
provides fault tolerance, data replication, and efficient data access.
3. Apache Cassandra: Although not a traditional file system, Cassandra is a highly
scalable and distributed NoSQL database that provides a distributed file system-
like architecture. It offers high availability, fault tolerance, and linear scalability
across a distributed cluster. Cassandra is often used for real-time and highly
available data storage and processing.
4. Apache HBase: HBase is a distributed columnar database built on top of HDFS.
It provides random read and write access to large-scale data, making it suitable
for real-time applications. HBase is often used for low-latency data access, such
as serving as a real-time database for applications requiring quick data retrieval.
5. Amazon S3 (Simple Storage Service): Although not a traditional distributed file
system, Amazon S3 is a highly scalable and durable object storage service
provided by Amazon Web Services (AWS). It is widely used in cloud-based big
data analytics platforms to store and retrieve large datasets. S3 offers high
availability, durability, and scalability.
Explain data sources of big data in detail or Types:
Big data refers to large volumes of data that cannot be easily managed, processed, or
analyzed using traditional data processing tools and techniques. The sources of big
data can be categorized into three main types: structured, unstructured, and semi-
structured data sources. Here's a detailed explanation of each type:
1. Structured Data Sources:
Structured data refers to well-organized data that conforms to a specific schema or
data model. It is typically stored in relational databases and can be easily queried using
SQL.
Some examples of structured data sources in the context of big data include:
a) Relational Databases: Traditional relational databases store structured data in
tables with predefined schemas. These databases are widely used in various
industries, such as finance, healthcare, and retail. Examples include Oracle,
MySQL, Microsoft SQL Server, etc.
b) Enterprise Systems: Many organizations have enterprise systems like Customer
Relationship Management (CRM), Enterprise Resource Planning (ERP), and
Supply Chain Management (SCM) systems. These systems generate structured
data related to sales, inventory, customer information, and financial
transactions.
c) Log Files: Log files generated by servers, applications, or network devices often
contain structured data. These logs record events, errors, and activities and are
valuable for monitoring system performance, security, and troubleshooting.
d) Sensor Data: Internet of Things (IoT) devices and sensor networks generate
structured data. For example, temperature sensors in manufacturing plants or
GPS data from vehicles produce structured data that can be analyzed for various
purposes.
2. Unstructured Data Sources:
Unstructured data refers to data that does not have a predefined structure or format.
It does not fit into traditional relational databases and is often stored in raw or semi-
structured formats. Some examples of unstructured data sources in big data include:
a) Text Documents: Textual data from various sources such as emails, social media
posts, news articles, documents, and web pages. Analyzing this data can provide
valuable insights into customer sentiment, market trends, and opinions.
b) Multimedia Content: Images, videos, audio files, and other multimedia content
generate vast amounts of unstructured data. Analyzing this data can involve
techniques such as image recognition, video processing, and speech analysis.
c) Social Media Data: Social media platforms generate enormous volumes of
unstructured data, including posts, comments, tweets, images, and videos. This
data can be mined for sentiment analysis, social network analysis, and
understanding user behavior.
d) Web Logs and Clickstreams: Web logs and clickstream data capture user
interactions with websites, including page views, clicks, navigation paths, and
time spent on pages. Analyzing this data helps improve user experience,
optimize web content, and target advertisements.
3. Semi-Structured Data Sources:
Semi-structured data refers to data that does not have a rigid structure like traditional
databases but has some organizational properties. It contains elements of both
structured and unstructured data. Some examples of semi-structured data sources in
big data include:
a) XML and JSON Files: XML (eXtensible Markup Language) and JSON (JavaScript
Object Notation) are widely used formats for storing and exchanging data. They
provide some hierarchical structure and allow flexibility in defining data
elements. Many web APIs, web services, and data interchange formats use XML
or JSON.
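To see why JSON counts as semi-structured, consider the two records below (the field names and values are invented for this example): they share some fields but not others, and Python's standard json module parses both without any predefined schema.

```python
import json

# Two "customer" records with overlapping but not identical fields --
# typical of semi-structured data, where no rigid schema is enforced.
raw = '''
[
  {"id": 1, "name": "Asha", "email": "asha@example.com"},
  {"id": 2, "name": "Ravi", "tags": ["premium", "mobile"]}
]
'''

records = json.loads(raw)
for rec in records:
    # fields missing from a record are simply absent, not NULL columns
    print(rec["name"], "->", rec.get("email", "no email on file"))
```

A relational table would force both records into one fixed set of columns; here each record carries only the elements it actually has.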
b) NoSQL Databases: NoSQL databases like MongoDB, Cassandra, and CouchDB
are designed to handle semi-structured and unstructured data. They provide
flexible schemas and scalability, making them suitable for storing and
processing big data.
c) Web Scraping: Web scraping involves extracting data from websites, which may
be in various formats such as HTML, XML, or JSON. This data can be transformed
into a structured format for analysis.
d) Data Streams: Real-time data streams generated by IoT devices, social media
platforms, or financial markets often have a semi-structured format. Streaming
data processing frameworks like Apache Kafka and Apache Flink can handle
these data streams.
Characteristics or 5V’s of Big Data:
Big data in big data analytics (BDA) is characterized by several key attributes that
distinguish it from traditional data sources. These characteristics, often referred to as
the 5 V's of big data, highlight the unique aspects of big data that pose challenges and
require specialized approaches for storage, processing, and analysis. The key
characteristics of big data in BDA are:
1. Volume: Big data refers to extremely large volumes of data that exceed the
processing capabilities of traditional data management systems. It involves
terabytes, petabytes, or even exabytes of data, generated from various sources
such as sensors, social media, and transactional systems. The sheer volume of
data requires scalable storage solutions and parallel processing techniques.
2. Velocity: Big data is generated and collected at high speeds, often in real-time
or near real-time. This real-time nature presents challenges in terms of data
ingestion, processing, and analysis. Big data systems must be capable of
handling high data arrival rates, performing continuous processing, and
delivering timely insights.
3. Variety: Big data encompasses a variety of data types and formats, including
structured, unstructured, and semi-structured data. It includes text documents,
images, videos, audio files, social media posts, log files, and more. The diverse
nature of big data requires flexible data models and analysis techniques that
can handle different data formats.
4. Veracity: Veracity refers to the quality, accuracy, and reliability of big data. Big
data sources often have inherent uncertainties, noise, and inconsistencies,
posing challenges for analysis and decision-making. Data cleansing,
preprocessing, and validation techniques are required to ensure the
trustworthiness of the insights derived from big data analytics.
5. Value: The ultimate goal of big data analytics is to extract meaningful insights
and value from the data. Big data contains hidden patterns, trends, and
correlations that can drive decision-making, improve business processes,
enhance customer experiences, and unlock new opportunities. Extracting value
from big data requires advanced analytics techniques, machine learning
algorithms, and data visualization tools.
Why is Big Data Important?
The importance of big data does not revolve around how much data a company has
but how a company utilizes the collected data. Every company uses data in its own
way; the more efficiently a company uses its data, the more potential it has to grow.
The company can take data from any source and analyze it to find answers which will
enable:
1. Cost Savings: Big Data tools such as Hadoop and cloud-based analytics can bring
cost advantages to businesses when large amounts of data are to be stored, and these
tools also help in identifying more efficient ways of doing business.
2. Time Reductions: The high speed of tools like Hadoop and in-memory analytics makes
it easy to identify new sources of data, which helps businesses analyze data
immediately and make quick decisions based on what they learn.
3. Understand the market conditions: By analyzing big data you can get a better
understanding of current market conditions. For example, by analyzing customers’
purchasing behaviors, a company can find out the products that are sold the most and
produce products according to this trend. By this, it can get ahead of its competitors.
4. Control online reputation: Big data tools can perform sentiment analysis, so you
can get feedback about who is saying what about your company. If you want to monitor
and improve your business's online presence, big data tools can help.
5. Using Big Data Analytics to Boost Customer Acquisition and Retention:
The customer is the most important asset any business depends on. There is no single
business that can claim success without first having to establish a solid customer base.
However, even with a customer base, a business cannot afford to disregard the high
competition it faces. If a business is slow to learn what customers are looking for,
it can easily end up offering poor-quality products. In the end, loss of clientele
will result, and this creates an adverse overall effect on business success. The use of
big data allows businesses to observe various customer related patterns and trends.
Observing customer behavior is important to trigger loyalty.
6. Using Big Data Analytics to Solve Advertisers' Problems and Offer Marketing
Insights: Big data analytics can help reshape all business operations. This includes
the ability to match customer expectations, change the company's product line, and of
course ensure that marketing campaigns are powerful.
7. Big Data Analytics As a Driver of Innovations and Product Development: Another
huge advantage of big data is the ability to help companies innovate and redevelop
their products.
Drivers for Big data:
There are several key drivers that have fueled the emergence and adoption of big data.
These drivers have contributed to the growth and significance of big data in various
industries and sectors. Here are some of the main drivers for big data:
1. Increase in Data Volume: The exponential growth of digital data has been a
significant driver for big data. With the proliferation of connected devices, social
media platforms, sensors, and online transactions, vast amounts of data are
being generated at an unprecedented rate. The sheer volume of data has
necessitated the development of new technologies and approaches to store,
manage, and analyze such massive datasets.
2. Advancements in Data Storage and Processing Technologies: The evolution of
storage technologies, such as cloud computing and distributed file systems, has
made it more feasible and cost-effective to store and process large volumes of
data. Technologies like Hadoop and NoSQL databases have emerged as popular
solutions for distributed storage and parallel processing of big data. These
advancements have lowered the barriers to entry and facilitated the adoption
of big data analytics.
3. Growth of Internet of Things (IoT): The proliferation of IoT devices, including
sensors, wearables, and connected devices, has generated a massive amount of
data from various sources. IoT devices produce real-time data streams, enabling
organizations to collect and analyze data for insights, predictive analytics, and
process optimization. The IoT has significantly contributed to the exponential
growth of data and the need for big data analytics.
4. Increase in Data Variety: Big data encompasses a variety of data types,
including structured, unstructured, and semi-structured data. Traditional
databases were primarily designed for structured data, but the rise of big data
has brought unstructured and semi-structured data into the spotlight. Social
media posts, emails, videos, images, and other unstructured data sources hold
valuable insights, and the ability to extract meaning from such diverse data has
become a key driver for big data analytics.
5. Demand for Real-time Analytics: Real-time analytics has become increasingly
critical for organizations to gain a competitive edge. With the need for quick
decision-making, immediate insights, and actionable intelligence, big data
analytics provides the capability to process and analyze data in real-time or near
real-time. This real-time analytics demand has driven the development of
technologies and platforms that can handle high-velocity data streams and
deliver timely insights.
6. Competitive Advantage and Business Value: The ability to leverage big data
analytics provides organizations with a competitive advantage. By extracting
insights from big data, businesses can make data-driven decisions, optimize
operations, improve customer experiences, personalize marketing strategies,
and identify new revenue streams. The potential for gaining valuable insights
and driving business value has been a major driver for the adoption of big data
analytics.
7. Regulatory and Compliance Requirements: The introduction of data protection
and privacy regulations, such as the General Data Protection Regulation (GDPR),
has emphasized the need for organizations to effectively manage and protect
data. Big data analytics can help organizations comply with regulatory
requirements by ensuring proper data governance, privacy protection, and
security measures.
8. Technological Advancements in Data Analytics: The advancements in data
analytics technologies, such as machine learning, artificial intelligence, and data
mining, have greatly contributed to the growth of big data. These technologies
enable organizations to extract meaningful patterns, correlations, and
predictions from large datasets, enabling better decision-making and insights
generation.
Classification of big data analytics:
Big data analytics can be classified into different categories based on the objectives,
techniques, and approaches used in the analysis. Here are some common
classifications of big data analytics:
1. Descriptive Analytics: Descriptive analytics focuses on summarizing and
understanding historical data to provide insights into what has happened in the
past. It involves techniques such as data aggregation, data visualization, and
statistical analysis. Descriptive analytics helps in identifying trends, patterns,
and key metrics from large datasets.
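A minimal flavour of descriptive analytics, aggregating historical data into summary metrics, can be shown with only the Python standard library; the daily sales figures below are made up for the illustration.

```python
import statistics

# Hypothetical daily sales figures (the historical data to be summarized)
daily_sales = [120, 135, 128, 150, 142, 160, 155]

# Descriptive analytics: aggregate the past into a few key metrics
summary = {
    "total": sum(daily_sales),
    "mean": statistics.mean(daily_sales),
    "median": statistics.median(daily_sales),
    "stdev": round(statistics.stdev(daily_sales), 2),
}
print(summary)
```

At big data scale the same aggregations run over distributed frameworks rather than a single list, but the goal is identical: summarize what has already happened.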
2. Diagnostic Analytics: Diagnostic analytics aims to understand the causes and
reasons behind past events or trends. It involves deeper analysis and
investigation of data to uncover relationships, dependencies, and factors that
contribute to specific outcomes. Diagnostic analytics uses techniques such as
data exploration, root cause analysis, and correlation analysis.
3. Predictive Analytics: Predictive analytics uses historical data and statistical
modeling techniques to make predictions and forecasts about future events or
outcomes. It involves the development of predictive models using machine
learning algorithms and statistical methods. Predictive analytics helps in
estimating probabilities, making forecasts, and identifying patterns for future
predictions.
4. Prescriptive Analytics: Prescriptive analytics goes beyond predictions and
provides recommendations and optimal solutions. It uses advanced
optimization algorithms, simulation techniques, and decision models to suggest
actions and strategies to achieve desired outcomes. Prescriptive analytics helps
in making informed decisions, optimizing processes, and identifying the best
course of action.
5. Diagnostic-Predictive-Prescriptive (DPP) Analytics: This classification combines
diagnostic, predictive, and prescriptive analytics to provide a comprehensive
approach to data analysis. It involves first diagnosing the current state,
identifying patterns and relationships, then making predictions about future
outcomes, and finally prescribing optimal actions and strategies.
6. Text Analytics: Text analytics focuses on extracting meaningful insights and
information from unstructured textual data, such as documents, emails, social
media posts, and customer reviews. It involves techniques such as natural
language processing (NLP), sentiment analysis, and text mining to analyze and
interpret textual data.
7. Social Media Analytics: Social media analytics involves analyzing data from
social media platforms to gain insights into user behavior, sentiment, trends,
and brand perception. It includes techniques such as social network analysis,
text mining, and sentiment analysis to extract valuable information from social
media data.
8. Real-time Analytics: Real-time analytics focuses on processing and analyzing
data as it is generated in real-time or near real-time. It involves streaming data
processing techniques, complex event processing, and real-time analytics
engines to handle high-velocity data streams. Real-time analytics enables
immediate insights, real-time monitoring, and proactive decision-making.
Big data applications:
Big data applications are diverse and span across various industries and domains. Here
are some notable applications of big data:
1. E-commerce and Retail: Big data analytics is widely used in e-commerce and
retail industries for customer analytics, personalized recommendations,
demand forecasting, inventory management, and fraud detection. Retailers
analyze customer purchase patterns, browsing behavior, and social media data
to understand customer preferences and tailor their marketing strategies.
2. Healthcare: Big data analytics is transforming the healthcare industry by
enabling personalized medicine, disease prediction, patient monitoring, and
health outcome analysis. It helps healthcare providers analyze large volumes of
patient data, electronic health records (EHRs), medical images, and genomics
data to improve diagnosis accuracy, optimize treatment plans, and enhance
patient care.
3. Financial Services: Big data analytics is extensively used in the financial sector
for fraud detection, risk assessment, algorithmic trading, customer
segmentation, and credit scoring. Financial institutions analyze vast amounts of
transaction data, market data, social media feeds, and news articles to identify
fraudulent activities, make data-driven investment decisions, and personalize
financial services.
4. Manufacturing and Supply Chain: Big data analytics plays a crucial role in
optimizing manufacturing processes, supply chain management, and predictive
maintenance. Manufacturers analyze sensor data from equipment, production
line data, and inventory data to identify bottlenecks, reduce downtime,
optimize inventory levels, and improve overall operational efficiency.
5. Energy and Utilities: Big data analytics helps energy and utility companies
monitor and optimize energy consumption, predict equipment failures, and
improve energy efficiency. It involves analyzing smart meter data, sensor data
from power grids, weather data, and customer data to optimize energy
distribution, identify energy wastage, and enable demand response programs.
6. Transportation and Logistics: Big data analytics is applied in the transportation
and logistics industry for route optimization, fleet management, predictive
maintenance, and demand forecasting. It involves analyzing data from GPS
devices, telematics, traffic sensors, weather data, and shipment information to
optimize routes, improve delivery efficiency, and enhance supply chain visibility.
7. Social Media and Marketing: Big data analytics enables social media platforms
and marketers to analyze user-generated content, sentiment analysis, social
network analysis, and customer behavior to understand trends, target specific
customer segments, and personalize marketing campaigns. It helps in
identifying influencers, optimizing advertising strategies, and improving
customer engagement.
8. Government and Public Services: Big data analytics is employed by government
agencies for various purposes, including public safety, fraud detection, urban
planning, and policy-making. It involves analyzing diverse data sources such as
citizen data, social media feeds, satellite imagery, and IoT data to make data-
driven decisions, improve public services, and enhance public safety.
Algorithms using MapReduce:
MapReduce is a programming model and framework designed for processing and
analyzing large volumes of data in a distributed computing environment. It consists of
two main phases: the map phase and the reduce phase. MapReduce algorithms
leverage parallel processing and distributed storage to handle big data efficiently. Here
are some commonly used algorithms implemented using the MapReduce paradigm in
big data analytics:
1. Word Count: The Word Count algorithm is a simple yet powerful example of a
MapReduce algorithm. It counts the frequency of each word in a given text
corpus. The map phase splits the input data into key-value pairs, where the key
represents each word, and the value is set to 1. The reduce phase aggregates
the intermediate results by summing up the values for each unique word,
yielding the final word count.
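The two phases just described can be simulated in plain Python: map emits (word, 1) pairs, a shuffle step groups them by key, and reduce sums the values. This mimics the MapReduce data flow on one machine, without a cluster.

```python
from collections import defaultdict

def map_phase(line):
    # emit (word, 1) for every word in an input line
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # group intermediate values by key, as the framework does between phases
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, counts):
    # sum the 1s emitted for each unique word
    return word, sum(counts)

corpus = ["big data is big", "data is everywhere"]
pairs = [p for line in corpus for p in map_phase(line)]
result = dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())
print(result)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

In a real cluster the map calls run in parallel on different data blocks and the shuffle moves data across the network, but the logic per phase is exactly this.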
2. PageRank: PageRank is an algorithm used by search engines to rank web pages
based on their importance. It determines the relevance of a web page by
analyzing the structure of hyperlinks between pages. The MapReduce
implementation of PageRank involves iterative map and reduce phases to
calculate the page ranks for each web page based on the incoming links from
other pages.
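A toy version of this iterative computation can be written as repeated map (distribute rank over outgoing links) and reduce (sum contributions) steps. The three-page link graph is invented, and 0.85 is the damping factor commonly used in the literature.

```python
# Toy PageRank over a tiny, hypothetical 3-page link graph.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
DAMPING = 0.85  # standard damping factor from the PageRank literature

ranks = {page: 1.0 / len(links) for page in links}
for _ in range(50):  # iterate until the ranks stabilize
    # "map": each page distributes its rank equally over its outgoing links
    contributions = {page: 0.0 for page in links}
    for page, outlinks in links.items():
        for target in outlinks:
            contributions[target] += ranks[page] / len(outlinks)
    # "reduce": combine incoming contributions with the damping factor
    ranks = {page: (1 - DAMPING) / len(links) + DAMPING * contributions[page]
             for page in links}

print({p: round(r, 3) for p, r in sorted(ranks.items())})
```

Page C ends up ranked highest here because it receives links from both A and B, which is exactly the "importance flows along links" intuition behind the algorithm.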
3. K-Means Clustering: K-Means is an unsupervised machine learning algorithm
used for clustering data points into K distinct groups. The MapReduce
implementation of K-Means involves multiple iterations of map and reduce
phases. In the map phase, data points are assigned to the nearest cluster
centroid. In the reduce phase, new cluster centroids are computed based on the
assigned data points, and the process is repeated until convergence.
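One iteration of this loop can be sketched as: the map step assigns each point to its nearest centroid, and the reduce step averages the points assigned to each centroid. The 1-D data points and initial centroids below are arbitrary choices for the illustration.

```python
from collections import defaultdict

def kmeans_map(points, centroids):
    # map: emit (index of nearest centroid, point) for every data point
    pairs = []
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        pairs.append((nearest, p))
    return pairs

def kmeans_reduce(pairs):
    # reduce: the new centroid is the mean of the points assigned to it
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    return [sum(ps) / len(ps) for idx, ps in sorted(groups.items())]

points = [1.0, 2.0, 9.0, 10.0, 11.0]
centroids = [2.0, 9.0]              # arbitrary initial centroids
for _ in range(5):                  # repeat map/reduce until convergence
    centroids = kmeans_reduce(kmeans_map(points, centroids))
print(centroids)  # [1.5, 10.0]
```

Here the assignments stop changing after the first iteration, so the centroids settle at the means of the two natural clusters.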
4. Naive Bayes Classifier: Naive Bayes is a popular algorithm for text classification,
sentiment analysis, and spam filtering. The MapReduce implementation of
Naive Bayes involves the map phase, where the training data is processed to
calculate the conditional probabilities for each feature. In the reduce phase, the
probabilities are combined and stored, allowing the classifier to make
predictions for new data points.
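The counting done in the map phase can be sketched as below: for labelled training documents, emit ((class, word), 1) pairs, which the reduce step sums into the counts behind the conditional probability estimates. The tiny spam/ham corpus is invented for the example.

```python
from collections import Counter

# Hypothetical labelled training data: (label, document)
training = [
    ("spam", "win money now"),
    ("spam", "win a prize"),
    ("ham",  "meeting at noon"),
]

# map phase: emit ((label, word), 1) for each word of each document
pairs = [((label, word), 1) for label, doc in training for word in doc.split()]

# reduce phase: sum the counts per (label, word) key
counts = Counter()
for key, value in pairs:
    counts[key] += value

# these counts feed the P(word | class) estimates, e.g.
# count("spam", "win") / total words seen in spam documents
print(counts[("spam", "win")])  # 2
```

At prediction time the classifier multiplies these per-word probabilities (with smoothing) for each class and picks the class with the highest score.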
5. Apriori Algorithm: The Apriori algorithm is used for association rule mining,
which discovers frequent itemsets in a transactional dataset. The MapReduce
implementation of the Apriori algorithm involves generating candidate itemsets
in the map phase and counting their frequencies in the reduce phase. This
process iterates to discover frequent itemsets and generate association rules.
6. Collaborative Filtering: Collaborative Filtering is a technique used in
recommender systems to provide personalized recommendations based on
user behavior and preferences. The MapReduce implementation of
Collaborative Filtering involves the map phase, where user-item interactions are
processed to create user-item matrices. The reduce phase combines the
matrices to calculate user similarities or item similarities, which are used for
generating recommendations.
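An item-similarity step of this kind can be sketched with cosine similarity over the columns of a small user-item rating matrix; the users and ratings below are invented, and real systems operate on far larger, much sparser data.

```python
import math

# Hypothetical user-item rating matrix: ratings[user][item] = rating
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob":   {"A": 4, "B": 2, "C": 5},
    "cara":  {"A": 1, "B": 5, "C": 2},
}

def item_vector(item):
    # the column of ratings for one item, in a fixed user order
    return [ratings[u].get(item, 0) for u in sorted(ratings)]

def cosine(a, b):
    # cosine similarity: 1.0 means identically proportioned rating patterns
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

sim_ac = cosine(item_vector("A"), item_vector("C"))  # rated similarly
sim_ab = cosine(item_vector("A"), item_vector("B"))  # rated differently
print(round(sim_ac, 3), round(sim_ab, 3))
```

Items A and C score closer to 1 because the same users rate them similarly, so a user who liked A would be recommended C before B. In the MapReduce version, the map phase builds these item vectors from user-item interactions and the reduce phase computes the pairwise similarities.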