UNIT – 1
What is Big Data?
According to Gartner, the definition of Big Data is:
“Big data is high-volume, high-velocity and high-variety information assets that
demand cost-effective, innovative forms of information processing for enhanced
insight and decision making.”
This definition clearly answers the “What is Big Data?” question – Big Data refers to
complex and large data sets that have to be processed and analyzed to uncover
valuable information that can benefit businesses and organizations.
However, there are certain basic tenets of Big Data that make the question even
simpler to answer:
• It refers to a massive amount of data that keeps on growing exponentially with time.
• It is so voluminous that it cannot be processed or analyzed using conventional data
processing techniques.
• Working with Big Data involves data mining, data storage, data analysis, data
sharing, and data visualization.
• The term is an all-comprehensive one, covering the data itself and the data
frameworks, along with the tools and techniques used to process and analyze the data.
Distributed file system:
In big data analytics (BDA), a distributed file system plays a crucial role in storing and
managing large volumes of data across a distributed cluster of computers or nodes. It
provides a scalable, fault-tolerant, and efficient solution for handling the massive
amounts of data involved in big data processing. A distributed file system allows data
to be stored across multiple nodes, enabling parallel processing and distributed data
access.
Here are some commonly used distributed file systems in the context of big data
analytics:
1. Hadoop Distributed File System (HDFS): HDFS is a distributed file system
designed for the Hadoop platform. It is highly scalable and fault-tolerant,
capable of handling large-scale data sets across a cluster of machines. HDFS
breaks data into blocks and replicates them across multiple nodes for
redundancy and high availability.
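The block-and-replica idea can be sketched in a few lines of Python. This is a toy illustration, not HDFS itself: the 128 MB block size and replication factor of 3 mirror common HDFS defaults, but the round-robin placement below is a simplifying assumption, not HDFS's actual rack-aware placement policy.

```python
# Toy sketch of HDFS-style block splitting and replica placement.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS default block size
REPLICATION = 3                 # each block is stored on 3 nodes by default

def place_blocks(file_size, nodes):
    """Split a file into fixed-size blocks and assign each block to
    REPLICATION distinct nodes (simple round-robin, for illustration)."""
    num_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
    return placement

nodes = ["node1", "node2", "node3", "node4"]
plan = place_blocks(400 * 1024 * 1024, nodes)  # a 400 MB file -> 4 blocks
print(len(plan))   # 4
print(plan[0])     # ['node1', 'node2', 'node3']
```

Because every block lives on several nodes, losing one machine loses no data, and readers can fetch different blocks from different nodes in parallel.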
2. Google File System (GFS): GFS is a distributed file system developed by Google
and serves as the foundation for various Google services. It is designed to
handle large files and is optimized for sequential reads and appends. GFS
provides fault tolerance, data replication, and efficient data access.
3. Apache Cassandra: Although not a traditional file system, Cassandra is a highly
scalable and distributed NoSQL database that provides a distributed file system-
like architecture. It offers high availability, fault tolerance, and linear scalability
across a distributed cluster. Cassandra is often used for real-time and highly
available data storage and processing.
4. Apache HBase: HBase is a distributed columnar database built on top of HDFS.
It provides random read and write access to large-scale data, making it suitable
for real-time applications. HBase is often used for low-latency data access, such
as serving as a real-time database for applications requiring quick data retrieval.
5. Amazon S3 (Simple Storage Service): Although not a traditional distributed file
system, Amazon S3 is a highly scalable and durable object storage service
provided by Amazon Web Services (AWS). It is widely used in cloud-based big
data analytics platforms to store and retrieve large datasets. S3 offers high
availability, durability, and scalability.
Explain data sources of big data in detail or Types:
Big data refers to large volumes of data that cannot be easily managed, processed, or
analyzed using traditional data processing tools and techniques. The sources of big
data can be categorized into three main types: structured, unstructured, and semi-
structured data sources. Here's a detailed explanation of each type:
1. Structured Data Sources:
Structured data refers to well-organized data that conforms to a specific schema or
data model. It is typically stored in relational databases and can be easily queried using
SQL.
Some examples of structured data sources in the context of big data include:
a) Relational Databases: Traditional relational databases store structured data in
tables with predefined schemas. These databases are widely used in various
industries, such as finance, healthcare, and retail. Examples include Oracle,
MySQL, Microsoft SQL Server, etc.
b) Enterprise Systems: Many organizations have enterprise systems like Customer
Relationship Management (CRM), Enterprise Resource Planning (ERP), and
Supply Chain Management (SCM) systems. These systems generate structured
data related to sales, inventory, customer information, and financial
transactions.
c) Log Files: Log files generated by servers, applications, or network devices often
contain structured data. These logs record events, errors, and activities and are
valuable for monitoring system performance, security, and troubleshooting.
d) Sensor Data: Internet of Things (IoT) devices and sensor networks generate
structured data. For example, temperature sensors in manufacturing plants or
GPS data from vehicles produce structured data that can be analyzed for various
purposes.
2. Unstructured Data Sources:
Unstructured data refers to data that does not have a predefined structure or format.
It does not fit into traditional relational databases and is often stored in raw or semi-
structured formats. Some examples of unstructured data sources in big data include:
a) Text Documents: Textual data from various sources such as emails, social media
posts, news articles, documents, and web pages. Analyzing this data can provide
valuable insights into customer sentiment, market trends, and opinions.
b) Multimedia Content: Images, videos, audio files, and other multimedia content
generate vast amounts of unstructured data. Analyzing this data can involve
techniques such as image recognition, video processing, and speech analysis.
c) Social Media Data: Social media platforms generate enormous volumes of
unstructured data, including posts, comments, tweets, images, and videos. This
data can be mined for sentiment analysis, social network analysis, and
understanding user behavior.
d) Web Logs and Clickstreams: Web logs and clickstream data capture user
interactions with websites, including page views, clicks, navigation paths, and
time spent on pages. Analyzing this data helps improve user experience,
optimize web content, and target advertisements.
3. Semi-Structured Data Sources:
Semi-structured data refers to data that does not have a rigid structure like traditional
databases but has some organizational properties. It contains elements of both
structured and unstructured data. Some examples of semi-structured data sources in
big data include:
a) XML and JSON Files: XML (eXtensible Markup Language) and JSON (JavaScript
Object Notation) are widely used formats for storing and exchanging data. They
provide some hierarchical structure and allow flexibility in defining data
elements. Many web APIs, web services, and data interchange formats use XML
or JSON.
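To see why JSON counts as semi-structured, consider the two records below (the field names and values are invented for this example): they share some fields but not others, and Python's standard json module parses both without any predefined schema.

```python
import json

# Two "customer" records with overlapping but not identical fields --
# typical of semi-structured data, where no rigid schema is enforced.
raw = '''
[
  {"id": 1, "name": "Asha", "email": "asha@example.com"},
  {"id": 2, "name": "Ravi", "tags": ["premium", "mobile"]}
]
'''

records = json.loads(raw)
for rec in records:
    # fields missing from a record are simply absent, not NULL columns
    print(rec["name"], "->", rec.get("email", "no email on file"))
```

A relational table would force both records into one fixed set of columns; here each record carries only the elements it actually has.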
b) NoSQL Databases: NoSQL databases like MongoDB, Cassandra, and CouchDB
are designed to handle semi-structured and unstructured data. They provide
flexible schemas and scalability, making them suitable for storing and
processing big data.
c) Web Scraping: Web scraping involves extracting data from websites, which may
be in various formats such as HTML, XML, or JSON. This data can be transformed
into a structured format for analysis.
d) Data Streams: Real-time data streams generated by IoT devices, social media
platforms, or financial markets often have a semi-structured format. Streaming
data processing frameworks like Apache Kafka and Apache Flink can handle
these data streams.
Characteristics or 5V’s of Big Data:
Big data in big data analytics (BDA) is characterized by several key attributes that
distinguish it from traditional data sources. These characteristics, often referred to as
the 5 V's of big data, highlight the unique aspects of big data that pose challenges and
require specialized approaches for storage, processing, and analysis. The key
characteristics of big data in BDA are:
1. Volume: Big data refers to extremely large volumes of data that exceed the
processing capabilities of traditional data management systems. It involves
terabytes, petabytes, or even exabytes of data, generated from various sources
such as sensors, social media, and transactional systems. The sheer volume of
data requires scalable storage solutions and parallel processing techniques.
2. Velocity: Big data is generated and collected at high speeds, often in real-time
or near real-time. This real-time nature presents challenges in terms of data
ingestion, processing, and analysis. Big data systems must be capable of
handling high data arrival rates, performing continuous processing, and
delivering timely insights.
3. Variety: Big data encompasses a variety of data types and formats, including
structured, unstructured, and semi-structured data. It includes text documents,
images, videos, audio files, social media posts, log files, and more. The diverse
nature of big data requires flexible data models and analysis techniques that
can handle different data formats.
4. Veracity: Veracity refers to the quality, accuracy, and reliability of big data. Big
data sources often have inherent uncertainties, noise, and inconsistencies,
posing challenges for analysis and decision-making. Data cleansing,
preprocessing, and validation techniques are required to ensure the
trustworthiness of the insights derived from big data analytics.
5. Value: The ultimate goal of big data analytics is to extract meaningful insights
and value from the data. Big data contains hidden patterns, trends, and
correlations that can drive decision-making, improve business processes,
enhance customer experiences, and unlock new opportunities. Extracting value
from big data requires advanced analytics techniques, machine learning
algorithms, and data visualization tools.
Why is Big Data Important?
The importance of big data does not revolve around how much data a company has
but how a company utilizes the collected data. Every company uses data in its own
way; the more efficiently a company uses its data, the more potential it has to grow.
The company can take data from any source and analyze it to find answers which will
enable:
1. Cost Savings: Big Data tools such as Hadoop and cloud-based analytics can bring
cost advantages to businesses when large amounts of data are to be stored, and these
tools also help in identifying more efficient ways of doing business.
2. Time Reductions: The high speed of tools like Hadoop and in-memory analytics makes
it easy to identify new sources of data, which helps businesses analyze data
immediately and make quick decisions based on what they learn.
3. Understand the market conditions: By analyzing big data you can get a better
understanding of current market conditions. For example, by analyzing customers’
purchasing behaviors, a company can find out the products that are sold the most and
produce products according to this trend. By this, it can get ahead of its competitors.
4. Control online reputation: Big data tools can perform sentiment analysis, so you
can get feedback about who is saying what about your company. If you want to monitor
and improve your business's online presence, big data tools can help.
5. Using Big Data Analytics to Boost Customer Acquisition and Retention:
The customer is the most important asset any business depends on. There is no single
business that can claim success without first having to establish a solid customer base.
However, even with a customer base, a business cannot afford to disregard the high
competition it faces. If a business is slow to learn what customers are looking for,
it can easily end up offering poor-quality products. In the end, loss of clientele
will result, and this creates an adverse overall effect on business success. The use of
big data allows businesses to observe various customer related patterns and trends.
Observing customer behavior is important to trigger loyalty.
6. Using Big Data Analytics to Solve Advertisers' Problems and Offer Marketing
Insights: Big data analytics can help reshape all business operations. This includes
the ability to match customer expectations, change the company's product line, and of
course ensure that marketing campaigns are powerful.
7. Big Data Analytics As a Driver of Innovations and Product Development: Another
huge advantage of big data is the ability to help companies innovate and redevelop
their products.
Drivers for Big data:
There are several key drivers that have fueled the emergence and adoption of big data.
These drivers have contributed to the growth and significance of big data in various
industries and sectors. Here are some of the main drivers for big data:
1. Increase in Data Volume: The exponential growth of digital data has been a
significant driver for big data. With the proliferation of connected devices, social
media platforms, sensors, and online transactions, vast amounts of data are
being generated at an unprecedented rate. The sheer volume of data has
necessitated the development of new technologies and approaches to store,
manage, and analyze such massive datasets.
2. Advancements in Data Storage and Processing Technologies: The evolution of
storage technologies, such as cloud computing and distributed file systems, has
made it more feasible and cost-effective to store and process large volumes of
data. Technologies like Hadoop and NoSQL databases have emerged as popular
solutions for distributed storage and parallel processing of big data. These
advancements have lowered the barriers to entry and facilitated the adoption
of big data analytics.
3. Growth of Internet of Things (IoT): The proliferation of IoT devices, including
sensors, wearables, and connected devices, has generated a massive amount of
data from various sources. IoT devices produce real-time data streams, enabling
organizations to collect and analyze data for insights, predictive analytics, and
process optimization. The IoT has significantly contributed to the exponential
growth of data and the need for big data analytics.
4. Increase in Data Variety: Big data encompasses a variety of data types,
including structured, unstructured, and semi-structured data. Traditional
databases were primarily designed for structured data, but the rise of big data
has brought unstructured and semi-structured data into the spotlight. Social
media posts, emails, videos, images, and other unstructured data sources hold
valuable insights, and the ability to extract meaning from such diverse data has
become a key driver for big data analytics.
5. Demand for Real-time Analytics: Real-time analytics has become increasingly
critical for organizations to gain a competitive edge. With the need for quick
decision-making, immediate insights, and actionable intelligence, big data
analytics provides the capability to process and analyze data in real-time or near
real-time. This real-time analytics demand has driven the development of
technologies and platforms that can handle high-velocity data streams and
deliver timely insights.
6. Competitive Advantage and Business Value: The ability to leverage big data
analytics provides organizations with a competitive advantage. By extracting
insights from big data, businesses can make data-driven decisions, optimize
operations, improve customer experiences, personalize marketing strategies,
and identify new revenue streams. The potential for gaining valuable insights
and driving business value has been a major driver for the adoption of big data
analytics.
7. Regulatory and Compliance Requirements: The introduction of data protection
and privacy regulations, such as the General Data Protection Regulation (GDPR),
has emphasized the need for organizations to effectively manage and protect
data. Big data analytics can help organizations comply with regulatory
requirements by ensuring proper data governance, privacy protection, and
security measures.
8. Technological Advancements in Data Analytics: The advancements in data
analytics technologies, such as machine learning, artificial intelligence, and data
mining, have greatly contributed to the growth of big data. These technologies
enable organizations to extract meaningful patterns, correlations, and
predictions from large datasets, enabling better decision-making and insights
generation.
Classification of big data analytics:
Big data analytics can be classified into different categories based on the objectives,
techniques, and approaches used in the analysis. Here are some common
classifications of big data analytics:
1. Descriptive Analytics: Descriptive analytics focuses on summarizing and
understanding historical data to provide insights into what has happened in the
past. It involves techniques such as data aggregation, data visualization, and
statistical analysis. Descriptive analytics helps in identifying trends, patterns,
and key metrics from large datasets.
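A minimal flavour of descriptive analytics, aggregating historical data into summary metrics, can be shown with only the Python standard library; the daily sales figures below are made up for the illustration.

```python
import statistics

# Hypothetical daily sales figures (the historical data to be summarized)
daily_sales = [120, 135, 128, 150, 142, 160, 155]

# Descriptive analytics: aggregate the past into a few key metrics
summary = {
    "total": sum(daily_sales),
    "mean": statistics.mean(daily_sales),
    "median": statistics.median(daily_sales),
    "stdev": round(statistics.stdev(daily_sales), 2),
}
print(summary)
```

At big data scale the same aggregations run over distributed frameworks rather than a single list, but the goal is identical: summarize what has already happened.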
2. Diagnostic Analytics: Diagnostic analytics aims to understand the causes and
reasons behind past events or trends. It involves deeper analysis and
investigation of data to uncover relationships, dependencies, and factors that
contribute to specific outcomes. Diagnostic analytics uses techniques such as
data exploration, root cause analysis, and correlation analysis.
3. Predictive Analytics: Predictive analytics uses historical data and statistical
modeling techniques to make predictions and forecasts about future events or
outcomes. It involves the development of predictive models using machine
learning algorithms and statistical methods. Predictive analytics helps in
estimating probabilities, making forecasts, and identifying patterns for future
predictions.
4. Prescriptive Analytics: Prescriptive analytics goes beyond predictions and
provides recommendations and optimal solutions. It uses advanced
optimization algorithms, simulation techniques, and decision models to suggest
actions and strategies to achieve desired outcomes. Prescriptive analytics helps
in making informed decisions, optimizing processes, and identifying the best
course of action.
5. Diagnostic-Predictive-Prescriptive (DPP) Analytics: This classification combines
diagnostic, predictive, and prescriptive analytics to provide a comprehensive
approach to data analysis. It involves first diagnosing the current state,
identifying patterns and relationships, then making predictions about future
outcomes, and finally prescribing optimal actions and strategies.
6. Text Analytics: Text analytics focuses on extracting meaningful insights and
information from unstructured textual data, such as documents, emails, social
media posts, and customer reviews. It involves techniques such as natural
language processing (NLP), sentiment analysis, and text mining to analyze and
interpret textual data.
7. Social Media Analytics: Social media analytics involves analyzing data from
social media platforms to gain insights into user behavior, sentiment, trends,
and brand perception. It includes techniques such as social network analysis,
text mining, and sentiment analysis to extract valuable information from social
media data.
8. Real-time Analytics: Real-time analytics focuses on processing and analyzing
data as it is generated in real-time or near real-time. It involves streaming data
processing techniques, complex event processing, and real-time analytics
engines to handle high-velocity data streams. Real-time analytics enables
immediate insights, real-time monitoring, and proactive decision-making.
Big data applications:
Big data applications are diverse and span across various industries and domains. Here
are some notable applications of big data:
1. E-commerce and Retail: Big data analytics is widely used in e-commerce and
retail industries for customer analytics, personalized recommendations,
demand forecasting, inventory management, and fraud detection. Retailers
analyze customer purchase patterns, browsing behavior, and social media data
to understand customer preferences and tailor their marketing strategies.
2. Healthcare: Big data analytics is transforming the healthcare industry by
enabling personalized medicine, disease prediction, patient monitoring, and
health outcome analysis. It helps healthcare providers analyze large volumes of
patient data, electronic health records (EHRs), medical images, and genomics
data to improve diagnosis accuracy, optimize treatment plans, and enhance
patient care.
3. Financial Services: Big data analytics is extensively used in the financial sector
for fraud detection, risk assessment, algorithmic trading, customer
segmentation, and credit scoring. Financial institutions analyze vast amounts of
transaction data, market data, social media feeds, and news articles to identify
fraudulent activities, make data-driven investment decisions, and personalize
financial services.
4. Manufacturing and Supply Chain: Big data analytics plays a crucial role in
optimizing manufacturing processes, supply chain management, and predictive
maintenance. Manufacturers analyze sensor data from equipment, production
line data, and inventory data to identify bottlenecks, reduce downtime,
optimize inventory levels, and improve overall operational efficiency.
5. Energy and Utilities: Big data analytics helps energy and utility companies
monitor and optimize energy consumption, predict equipment failures, and
improve energy efficiency. It involves analyzing smart meter data, sensor data
from power grids, weather data, and customer data to optimize energy
distribution, identify energy wastage, and enable demand response programs.
6. Transportation and Logistics: Big data analytics is applied in the transportation
and logistics industry for route optimization, fleet management, predictive
maintenance, and demand forecasting. It involves analyzing data from GPS
devices, telematics, traffic sensors, weather data, and shipment information to
optimize routes, improve delivery efficiency, and enhance supply chain visibility.
7. Social Media and Marketing: Big data analytics enables social media platforms
and marketers to analyze user-generated content, sentiment analysis, social
network analysis, and customer behavior to understand trends, target specific
customer segments, and personalize marketing campaigns. It helps in
identifying influencers, optimizing advertising strategies, and improving
customer engagement.
8. Government and Public Services: Big data analytics is employed by government
agencies for various purposes, including public safety, fraud detection, urban
planning, and policy-making. It involves analyzing diverse data sources such as
citizen data, social media feeds, satellite imagery, and IoT data to make data-
driven decisions, improve public services, and enhance public safety.
Algorithms using MapReduce:
MapReduce is a programming model and framework designed for processing and
analyzing large volumes of data in a distributed computing environment. It consists of
two main phases: the map phase and the reduce phase. MapReduce algorithms
leverage parallel processing and distributed storage to handle big data efficiently. Here
are some commonly used algorithms implemented using the MapReduce paradigm in
big data analytics:
1. Word Count: The Word Count algorithm is a simple yet powerful example of a
MapReduce algorithm. It counts the frequency of each word in a given text
corpus. The map phase splits the input data into key-value pairs, where the key
represents each word, and the value is set to 1. The reduce phase aggregates
the intermediate results by summing up the values for each unique word,
yielding the final word count.
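The two phases just described can be simulated in plain Python: map emits (word, 1) pairs, a shuffle step groups them by key, and reduce sums the values. This mimics the MapReduce data flow on one machine, without a cluster.

```python
from collections import defaultdict

def map_phase(line):
    # emit (word, 1) for every word in an input line
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # group intermediate values by key, as the framework does between phases
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, counts):
    # sum the 1s emitted for each unique word
    return word, sum(counts)

corpus = ["big data is big", "data is everywhere"]
pairs = [p for line in corpus for p in map_phase(line)]
result = dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())
print(result)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

In a real cluster the map calls run in parallel on different data blocks and the shuffle moves data across the network, but the logic per phase is exactly this.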
2. PageRank: PageRank is an algorithm used by search engines to rank web pages
based on their importance. It determines the relevance of a web page by
analyzing the structure of hyperlinks between pages. The MapReduce
implementation of PageRank involves iterative map and reduce phases to
calculate the page ranks for each web page based on the incoming links from
other pages.
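A toy version of this iterative computation can be written as repeated map (distribute rank over outgoing links) and reduce (sum contributions) steps. The three-page link graph is invented, and 0.85 is the damping factor commonly used in the literature.

```python
# Toy PageRank over a tiny, hypothetical 3-page link graph.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
DAMPING = 0.85  # standard damping factor from the PageRank literature

ranks = {page: 1.0 / len(links) for page in links}
for _ in range(50):  # iterate until the ranks stabilize
    # "map": each page distributes its rank equally over its outgoing links
    contributions = {page: 0.0 for page in links}
    for page, outlinks in links.items():
        for target in outlinks:
            contributions[target] += ranks[page] / len(outlinks)
    # "reduce": combine incoming contributions with the damping factor
    ranks = {page: (1 - DAMPING) / len(links) + DAMPING * contributions[page]
             for page in links}

print({p: round(r, 3) for p, r in sorted(ranks.items())})
```

Page C ends up ranked highest here because it receives links from both A and B, which is exactly the "importance flows along links" intuition behind the algorithm.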
3. K-Means Clustering: K-Means is an unsupervised machine learning algorithm
used for clustering data points into K distinct groups. The MapReduce
implementation of K-Means involves multiple iterations of map and reduce
phases. In the map phase, data points are assigned to the nearest cluster
centroid. In the reduce phase, new cluster centroids are computed based on the
assigned data points, and the process is repeated until convergence.
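One iteration of this loop can be sketched as: the map step assigns each point to its nearest centroid, and the reduce step averages the points assigned to each centroid. The 1-D data points and initial centroids below are arbitrary choices for the illustration.

```python
from collections import defaultdict

def kmeans_map(points, centroids):
    # map: emit (index of nearest centroid, point) for every data point
    pairs = []
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        pairs.append((nearest, p))
    return pairs

def kmeans_reduce(pairs):
    # reduce: the new centroid is the mean of the points assigned to it
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    return [sum(ps) / len(ps) for idx, ps in sorted(groups.items())]

points = [1.0, 2.0, 9.0, 10.0, 11.0]
centroids = [2.0, 9.0]              # arbitrary initial centroids
for _ in range(5):                  # repeat map/reduce until convergence
    centroids = kmeans_reduce(kmeans_map(points, centroids))
print(centroids)  # [1.5, 10.0]
```

Here the assignments stop changing after the first iteration, so the centroids settle at the means of the two natural clusters.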
4. Naive Bayes Classifier: Naive Bayes is a popular algorithm for text classification,
sentiment analysis, and spam filtering. The MapReduce implementation of
Naive Bayes involves the map phase, where the training data is processed to
calculate the conditional probabilities for each feature. In the reduce phase, the
probabilities are combined and stored, allowing the classifier to make
predictions for new data points.
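The counting done in the map phase can be sketched as below: for labelled training documents, emit ((class, word), 1) pairs, which the reduce step sums into the counts behind the conditional probability estimates. The tiny spam/ham corpus is invented for the example.

```python
from collections import Counter

# Hypothetical labelled training data: (label, document)
training = [
    ("spam", "win money now"),
    ("spam", "win a prize"),
    ("ham",  "meeting at noon"),
]

# map phase: emit ((label, word), 1) for each word of each document
pairs = [((label, word), 1) for label, doc in training for word in doc.split()]

# reduce phase: sum the counts per (label, word) key
counts = Counter()
for key, value in pairs:
    counts[key] += value

# these counts feed the P(word | class) estimates, e.g.
# count("spam", "win") / total words seen in spam documents
print(counts[("spam", "win")])  # 2
```

At prediction time the classifier multiplies these per-word probabilities (with smoothing) for each class and picks the class with the highest score.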
5. Apriori Algorithm: The Apriori algorithm is used for association rule mining,
which discovers frequent itemsets in a transactional dataset. The MapReduce
implementation of the Apriori algorithm involves generating candidate itemsets
in the map phase and counting their frequencies in the reduce phase. This
process iterates to discover frequent itemsets and generate association rules.
6. Collaborative Filtering: Collaborative Filtering is a technique used in
recommender systems to provide personalized recommendations based on
user behavior and preferences. The MapReduce implementation of
Collaborative Filtering involves the map phase, where user-item interactions are
processed to create user-item matrices. The reduce phase combines the
matrices to calculate user similarities or item similarities, which are used for
generating recommendations.
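An item-similarity step of this kind can be sketched with cosine similarity over the columns of a small user-item rating matrix; the users and ratings below are invented, and real systems operate on far larger, much sparser data.

```python
import math

# Hypothetical user-item rating matrix: ratings[user][item] = rating
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob":   {"A": 4, "B": 2, "C": 5},
    "cara":  {"A": 1, "B": 5, "C": 2},
}

def item_vector(item):
    # the column of ratings for one item, in a fixed user order
    return [ratings[u].get(item, 0) for u in sorted(ratings)]

def cosine(a, b):
    # cosine similarity: 1.0 means identically proportioned rating patterns
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

sim_ac = cosine(item_vector("A"), item_vector("C"))  # rated similarly
sim_ab = cosine(item_vector("A"), item_vector("B"))  # rated differently
print(round(sim_ac, 3), round(sim_ab, 3))
```

Items A and C score closer to 1 because the same users rate them similarly, so a user who liked A would be recommended C before B. In the MapReduce version, the map phase builds these item vectors from user-item interactions and the reduce phase computes the pairwise similarities.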