BDA Unit 1

Define Big Data

"The large volume of data, generated at high velocity and in a wide variety of
formats, which cannot be stored, processed, or analyzed using traditional tools,
is known as Big Data."

Effects of Big Data on our daily lives


Big data has a significant impact on many aspects of our daily lives, influencing
how we work, communicate, shop, and make decisions:
1. Personalized Experiences:
• Online Services: Big data is used to analyze user preferences and
behavior, enabling personalized recommendations on platforms
like Netflix, Amazon, and social media.
• Targeted Advertising: Advertisers leverage big data to target
specific demographics, interests, and behaviors, resulting in more
personalized and relevant ads.
2. Healthcare:
• Precision Medicine: Big data analytics contribute to personalized
healthcare by analyzing large datasets of patient information,
genetics, and medical records to tailor treatments based on
individual characteristics.
• Disease Prevention: Analyzing health trends and patterns in large
datasets helps identify potential disease outbreaks and supports
preventive measures.
3. Smart Cities:
• Urban Planning: Big data is used to analyze traffic patterns, energy
consumption, and public transportation usage, helping city planners
optimize infrastructure and improve overall efficiency.
• Public Safety: Law enforcement agencies use big data analytics for
predictive policing, analyzing crime patterns to allocate resources
more effectively.
4. E-commerce:
• Recommendation Systems: Online retailers use big data to analyze
customer behavior and provide personalized product
recommendations, enhancing the overall shopping experience.
5. Finance:
• Fraud Detection: Financial institutions employ big data analytics to
detect and prevent fraudulent activities by analyzing transaction
patterns and identifying anomalies.
• Risk Management: Big data is used for assessing and managing
financial risks through the analysis of market trends and economic
indicators.
6. Education:
• Personalized Learning: Big data supports adaptive learning
platforms that tailor educational content to individual students
based on their learning styles and progress.
• Institutional Improvement: Educational institutions use data
analytics to enhance administrative processes, improve resource
allocation, and identify areas for improvement.
7. Social Media:
• Content Personalization: Social media platforms leverage big data
to customize content feeds, showing users posts and ads based on
their interests and engagement history.
Overall, the integration of big data into various aspects of our daily lives has the
potential to enhance efficiency, improve decision-making, and provide more
personalized and tailored experiences.
Data Sizes
Data sizes are typically measured in units such as bits, bytes, kilobytes,
megabytes, gigabytes, terabytes, petabytes, exabytes, zettabytes, and yottabytes,
with each successive unit roughly a thousand times larger than the previous one.
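A minimal Python sketch of how these units relate, assuming the decimal convention of 1,000 between successive units:

# Convert a raw byte count into the largest convenient unit (decimal convention assumed).
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes: float) -> str:
    """Repeatedly divide by 1,000 until the value fits the current unit."""
    for unit in UNITS:
        if num_bytes < 1000 or unit == UNITS[-1]:
            return f"{num_bytes:.2f} {unit}"
        num_bytes /= 1000.0

print(human_readable(3_500_000_000_000))  # prints "3.50 TB"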

Source of Big Data


Big Data in Big Data Analytics (BDA) comes from a wide range of sources. The
data itself falls into three main types (structured, semi-structured, and
unstructured), and the most common sources include the following:
1. Structured Data:
• Databases: Traditional relational databases, such as SQL databases,
store structured data in tables with predefined schemas. These
databases are used to store and manage structured data like
customer information, financial records, and transaction details.
• Data Warehouses: These are repositories for structured data that
consolidate information from various sources to support business
intelligence and analytics.
2. Semi-Structured Data:
• JSON (JavaScript Object Notation) and XML (eXtensible Markup
Language): These formats are commonly used for semi-structured
data. They provide a flexible way to represent data with some level
of hierarchy and can be found in web services, APIs, and
configuration files.
• Log Files: Logs from applications, servers, and systems often
contain semi-structured data, capturing events and activities in a
format that is not strictly tabular.
3. Unstructured Data:
• Text Data: Documents, articles, social media posts, and emails
contain unstructured text data. Natural Language Processing (NLP)
techniques are often used to analyze and derive insights from such
data.
• Multimedia Data: Images, videos, and audio files fall into the
category of unstructured data. Image and speech recognition
technologies are employed for analysis.
• Sensor Data: In the Internet of Things (IoT), sensors generate vast
amounts of unstructured data, capturing information about
temperature, humidity, motion, etc.
4. Transactional Data:
• E-commerce Transactions: Purchase history, customer interactions,
and online transactions provide valuable data for understanding
customer behavior and preferences.
• Financial Transactions: Banking and financial institutions generate
large volumes of data through transactions, providing insights into
financial patterns and risk management.
5. Social Media Data:
• Social Networks: Data from platforms like Facebook, Twitter, and
Instagram include user-generated content, social connections, and
interactions. Analyzing this data can reveal trends, sentiments, and
user preferences.
• User-generated Content: Blogs, forums, and reviews contribute to a
significant amount of unstructured data that can be mined for
insights.
6. Machine-generated Data:
• Sensor Networks: Industrial equipment, IoT devices, and smart
infrastructure generate continuous streams of data. This machine-
generated data is often used for real-time monitoring and predictive
maintenance.
• Server Logs: Data generated by servers and applications, including
error logs and access logs, can be valuable for troubleshooting,
security, and performance optimization.
7. Government and Public Data:
• Open Data Initiatives: Many governments release datasets related
to demographics, healthcare, transportation, and more as part of
open data initiatives. These datasets contribute to the wealth of
information available for analysis.
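To make the distinction between these source formats concrete, here is a minimal Python sketch (standard library only; the CSV row and JSON log entry are invented examples) that ingests one structured and one semi-structured record:

import csv, io, json

# Hypothetical sample records standing in for two of the source types above:
# a structured CSV transaction row and a semi-structured JSON application log entry.
csv_source = "txn_id,customer,amount\n1001,Asha,250.75\n"
json_log = '{"level": "ERROR", "service": "payments", "msg": "timeout", "ts": "2024-01-05T10:15:00Z"}'

# Structured: csv.DictReader applies the predefined schema (the header row) to each record.
for row in csv.DictReader(io.StringIO(csv_source)):
    print(row["txn_id"], row["customer"], float(row["amount"]))

# Semi-structured: json.loads recovers the nested key/value structure of the log entry.
event = json.loads(json_log)
print(event["level"], event["service"], event["msg"])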

Challenges of Big Data


While Big Data Analytics (BDA) offers significant opportunities for extracting
valuable insights from large and complex datasets, it also presents several
challenges. Some of the key challenges associated with big data in BDA
include:
1. Volume:
• Storage Capacity: Managing and storing massive volumes of data
can be costly and requires scalable and efficient storage solutions.
• Data Transfer: Moving large datasets between systems or over
networks can be time-consuming and may lead to bottlenecks.
2. Velocity:
• Real-time Processing: Analyzing and processing data in real-time
or near-real-time to keep up with the high velocity of data
generation can be challenging.
• Streaming Data: Handling continuous streams of data, such as
those from sensors and IoT devices, requires specialized processing
capabilities.
3. Variety:
• Data Integration: Combining and integrating data from diverse
sources with different formats and structures poses challenges in
creating a unified view for analysis.
• Data Quality: Ensuring the quality and accuracy of diverse data
types, including structured, semi-structured, and unstructured data,
is crucial for reliable insights.
4. Veracity:
• Data Accuracy: Dealing with inaccuracies, inconsistencies, and
errors in the data can impact the reliability of analytical results.
• Data Uncertainty: Managing uncertainties in data quality and
reliability is essential for making informed decisions.
5. Variability:
• Data Inconsistency: Changes in data formats, structures, or sources
over time can introduce inconsistencies and pose challenges for
analysis.
• Seasonal Variations: Variations in data patterns due to seasonal or
periodic factors may require specialized handling.
6. Complexity:
• Analytical Complexity: Developing and implementing complex
algorithms and models to extract meaningful insights from large
and complex datasets can be challenging.
• Skill Set: Finding and retaining skilled professionals with expertise
in big data technologies and analytics is a common challenge.
7. Security and Privacy:
• Data Security: Protecting sensitive information from unauthorized
access or cyber threats is a critical concern in big data
environments.
• Privacy Compliance: Adhering to data protection regulations and
ensuring ethical use of personal information require careful
consideration.
8. Scalability:
• Infrastructure Scalability: The underlying infrastructure must be able to
scale horizontally to handle growing volumes of data and increasing
computational demands.

5 V’s of Big Data


1. Volume
• Refers to the massive amount of data generated from various sources,
such as social media, sensors, financial transactions, and digital devices,
often measured in terabytes, petabytes, or even zettabytes.
• Managing such large datasets requires scalable storage solutions, like
distributed file systems (e.g., Hadoop HDFS) and cloud-based platforms,
which allow efficient data storage and retrieval.
• As the volume of data continues to grow exponentially, organizations
must adopt efficient data processing frameworks (e.g., Apache Spark) to
extract insights without compromising speed or performance.
2. Velocity
• Represents the speed at which data is generated, collected, and analyzed,
with real-time or near-real-time data streams coming from sources like
IoT devices, social media feeds, and financial markets.
• Handling high-velocity data requires advanced technologies, such as in-
memory computing and stream processing tools like Apache Kafka or
Flink, to process data as it arrives.
• The ability to act on fast-moving data streams is critical for applications
like fraud detection, dynamic pricing, and personalized recommendations.
3. Variety
• Big Data encompasses multiple formats, including structured data
(databases, spreadsheets), semi-structured data (JSON, XML), and
unstructured data (text, images, videos, audio).
• Processing and integrating this diverse data requires flexible data
management systems and tools capable of understanding different
schemas and extracting insights from each format.
• Variety brings complexity, as organizations must design pipelines to
clean, normalize, and merge heterogeneous datasets into a unified form
for analysis.

4. Veracity
• Refers to the accuracy, quality, and reliability of the data, which can often
be noisy, incomplete, or inconsistent, posing challenges for meaningful
analysis.
• Data cleaning techniques, such as handling missing values, detecting
outliers, and ensuring consistency, are critical to improving the veracity
of Big Data.
• High-veracity data is essential for building trustworthy predictive models,
as poor data quality can lead to incorrect insights and flawed business
decisions.
5. Value
• The ultimate goal of Big Data is to generate actionable insights and
business value, such as improving operational efficiency, enhancing
customer experiences, and driving innovation.
• Extracting value requires not just technical solutions but also a clear
understanding of business goals and the ability to translate raw data into
strategic advantages.
• Organizations often use machine learning, predictive analytics, and data
visualization tools to uncover hidden patterns and trends that provide
competitive benefits.
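As a small illustration of the Veracity point above, the following sketch (assuming pandas is installed; the sensor readings are invented) fills a missing value and flags an implausible outlier before analysis:

import pandas as pd

# Hypothetical sensor readings with a missing value and an obvious outlier.
df = pd.DataFrame({"sensor_id": [1, 2, 3, 4],
                   "temperature": [21.5, None, 22.1, 900.0]})

# Handle missing values: fill with the column median.
df["temperature"] = df["temperature"].fillna(df["temperature"].median())

# Detect outliers: flag readings far outside a plausible physical range.
df["is_outlier"] = ~df["temperature"].between(-40, 60)

print(df)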

Types of Digital Data in Big Data Analytics


Big Data Analytics (BDA) deals with vast amounts of data, which can be
categorized into three main types: structured, semi-structured, and
unstructured data. Each type has unique characteristics, sources, and
methods of processing.
1. Structured Data
Structured data refers to information that is highly organized and stored in
predefined formats, typically within relational databases. It follows a strict
schema, making it easy to store, query, and analyze.
• Characteristics: Organized into tables, rows, and columns with well-
defined relationships. Each data point has a specific data type (e.g.,
integer, string, date).
• Examples: Customer transaction records, employee databases, stock
market data, and financial reports.
• Processing: Structured data is processed using SQL-based systems (e.g.,
MySQL, PostgreSQL) and data warehouses (e.g., Amazon Redshift,
Google BigQuery).
2. Semi-Structured Data
Semi-structured data lies between structured and unstructured data. It does
not conform to the strict format of structured data but still contains markers
(tags or labels) that help organize it.
• Characteristics: Contains elements of both structured and unstructured
data, often using formats like XML, JSON, or YAML. The structure is not
rigid, allowing for flexibility.
• Examples: Email metadata, sensor data from IoT devices, log files, and
NoSQL database entries (e.g., MongoDB).
• Processing: Tools like Hadoop, Spark, and NoSQL databases are used to
process and analyze semi-structured data efficiently.
3. Unstructured Data
Unstructured data makes up the majority of digital information and lacks a
predefined format, making it the most challenging to manage and analyze.
This data type requires advanced tools and techniques for processing and
extracting insights.
• Characteristics: Does not follow a clear structure, often consisting of
free text, images, videos, and other media formats.
• Examples: Social media posts, multimedia files (videos, audio, images),
emails, customer reviews, and web pages.
• Processing: Requires advanced technologies like natural language
processing (NLP) for text analysis, computer vision for images, and deep
learning for audio and video recognition.
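A small sketch of this contrast in processing approaches, using only the Python standard library (the sales table and review text are made-up examples): structured data is queried directly with SQL, while unstructured text must first be tokenized before any NLP-style analysis.

import re
import sqlite3
from collections import Counter

# Structured: rows with a fixed schema, queried with SQL (here via an in-memory SQLite table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 120.0), ("south", 80.0), ("north", 45.5)])
for region, total in conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(region, total)

# Unstructured: free text has no schema; a first NLP-style step is tokenization and counting.
review = "Great phone, great battery, but the camera is disappointing."
tokens = re.findall(r"[a-z]+", review.lower())
print(Counter(tokens).most_common(3))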
Structured/Unstructured Data - Advantages and Sources
Structured Data:
Advantages:
1. Easy to store and manage using relational databases, making data
retrieval fast and efficient with SQL queries.
2. Ensures high data integrity and accuracy due to its predefined schema,
reducing the chances of inconsistencies or errors.
3. Enables seamless integration with business intelligence tools and data
visualization platforms for generating actionable insights.
4. Simplifies data analytics processes as it’s optimized for sorting, filtering,
and joining across multiple tables.
5. Suitable for handling large transactional systems, such as financial
databases, where precision and speed are critical.
Sources:
1. Relational databases like MySQL, Oracle, and SQL Server that store data
in tables with rows and columns.
2. Enterprise Resource Planning (ERP) and Customer Relationship
Management (CRM) systems that manage business operations and
customer interactions.
3. Point-of-sale (POS) systems that generate transaction records, inventory
management logs, and billing details.
4. Online booking systems in industries like travel, hospitality, and
healthcare that store customer appointments and reservations.
5. Sensor data from IoT devices that record timestamped measurements in
structured formats for analysis.
Unstructured Data:
Advantages:
1. Offers richer insights by capturing complex human behaviors, opinions,
and preferences from diverse formats like text, images, and videos.
2. Allows businesses to leverage advanced technologies such as natural
language processing (NLP) and computer vision for sentiment analysis,
content recommendations, and pattern recognition.
3. Scales easily with cloud storage solutions, enabling organizations to
collect massive datasets without worrying about rigid schema limitations.
4. Provides a more comprehensive view of business operations and
customer experiences by analyzing data from multiple, dynamic sources.
5. Supports innovation and competitive advantage by uncovering hidden
trends that structured data alone may not reveal.
Sources:
1. Social media platforms like Facebook, Twitter, and Instagram, where
user-generated content (posts, images, and videos) flows continuously.
2. Multimedia files, including videos from YouTube, images from digital
cameras, and audio recordings from podcasts or voice assistants.
3. Email communications, where the text body, attachments, and metadata
provide valuable unstructured information.
4. Website logs and clickstream data that track user journeys and behaviors
on digital platforms.
5. Customer feedback channels, such as online reviews, surveys, and
chatbot interactions, which capture subjective opinions and experiences.

Architecture of Data Warehouse


A data warehouse is a centralized repository designed to store, process, and
manage structured (and increasingly semi-structured) data consolidated from
multiple sources. It facilitates efficient querying, reporting, and data analysis
for decision-making.
Key Components of Data Warehouse Architecture
A data warehouse architecture consists of three main components:
1. Data Sources
2. Data Integration (ETL - Extract, Transform, Load)
3. Data Analytics & Business Intelligence (BI)
Each of these components plays a vital role in ensuring seamless data flow and
accessibility.
1. Data Sources
Data sources refer to the various systems that generate and collect data, which
can be classified into:
• Operational Databases (e.g., MySQL, PostgreSQL, Oracle)
• Application Logs (e.g., Web server logs, API logs)
• Flat Files (e.g., CSV, JSON, XML)
• Enterprise Systems (e.g., ERP, CRM, SCM)
• External Data Sources (e.g., Social Media, Market Data, IoT Sensors)
The data collected from these sources is often raw, inconsistent, and needs to be
processed before it can be useful for analytics.
2. Data Integration (ETL Process)
Data integration is a crucial process that involves:
• Extracting raw data from multiple sources
• Transforming data by cleansing, normalizing, aggregating, and
restructuring it
• Loading the processed data into the data warehouse
There are two main approaches to data integration:
• ETL (Extract, Transform, Load) – Data is transformed before loading it
into the warehouse.
• ELT (Extract, Load, Transform) – Data is loaded first and transformed
later, usually in cloud-based architectures.
Common ETL Tools: Apache NiFi, Talend, Informatica, Microsoft SSIS, AWS
Glue.
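The tools above are full platforms, but the ETL idea itself can be sketched in a few lines of Python (assuming pandas is installed; the input file, column names, and warehouse table are illustrative, with SQLite standing in for the warehouse):

import sqlite3
import pandas as pd

# Extract: read raw data from a source system (here, a hypothetical CSV export).
raw = pd.read_csv("orders_raw.csv")

# Transform: cleanse, normalize, and aggregate before loading.
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")
raw = raw.dropna(subset=["order_date", "amount"])
daily = raw.groupby(raw["order_date"].dt.date)["amount"].sum().reset_index()

# Load: write the processed result into the warehouse.
warehouse = sqlite3.connect("warehouse.db")
daily.to_sql("daily_sales", warehouse, if_exists="replace", index=False)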
3. Data Analytics & Business Intelligence
Once the data is processed and stored in the data warehouse, it can be used for
data analytics and reporting. This layer enables users to query and analyze the
data to gain meaningful insights.
Some key tools and techniques used at this stage:
• Query Languages: SQL, NoSQL
• Reporting & Dashboards: Power BI, Tableau, Looker, Google Data
Studio
• Machine Learning & Data Mining: Python (Pandas, Scikit-learn), R,
Apache Spark
• Business Intelligence: Tools that help in decision-making and forecasting
This layer ensures that businesses can extract actionable insights for improved
decision-making.
Types of Data Warehouse Architectures
Depending on the complexity and separation of components, data warehouses
can be classified into three main architectures:
1. Single-Tier Architecture
• The simplest architecture, where the data warehouse is directly connected
to analytics tools.
• Suitable for small-scale applications.
• Low latency but lacks scalability.
2. Two-Tier Architecture
• Data from multiple sources is stored in an intermediate staging area
before loading into the warehouse.
• Helps in efficient ETL processing but can face scalability issues.
3. Three-Tier Architecture (Most Common)
• Bottom Tier: Data sources and ETL processes
• Middle Tier: The data warehouse (central repository)
• Top Tier: Business intelligence and analytics tools
• Ensures better scalability, performance, and flexibility.

Analytical tools used for big data analytics?


1. Hadoop
Hadoop is an open-source framework that allows for distributed storage and
processing of large datasets using the MapReduce programming model. It
efficiently handles structured and unstructured data across multiple machines in
a scalable manner. Companies use Hadoop for big data analytics, machine
learning, and log processing.
2. MongoDB
MongoDB is a NoSQL database designed for handling large datasets that
change frequently, making it ideal for real-time applications. It stores data in a
flexible, JSON-like format, allowing for easy scaling and dynamic schema
changes. Many modern applications use MongoDB for storing user data,
product catalogs, and IoT data.
3. Talend
Talend is a data integration tool that helps in extracting, transforming, and
loading (ETL) data from different sources into a unified system. It supports
various databases, cloud platforms, and big data environments to ensure
seamless data management. Businesses use Talend to improve data quality,
automate workflows, and enhance reporting capabilities.
4. Cassandra
Apache Cassandra is a highly scalable NoSQL database designed for distributed
and fault-tolerant storage of large amounts of data. It provides high availability
with no single point of failure, making it ideal for applications requiring fast
read and write operations. Companies like Netflix and Facebook use Cassandra
for handling real-time analytics and massive datasets.
5. Spark
Apache Spark is a powerful big data processing engine known for its in-
memory computing capabilities, which accelerate data analysis. It supports real-
time and batch processing, making it useful for machine learning, graph
processing, and streaming analytics. Spark is widely used in financial services,
healthcare, and e-commerce for quick data insights.
6. Storm
Apache Storm is an open-source distributed real-time computation system that
processes streams of data continuously. It is highly scalable and fault-tolerant,
making it suitable for processing real-time events such as social media feeds
and financial transactions. Storm integrates with big data ecosystems like
Hadoop and Kafka for enhanced analytics.
7. Kafka
Apache Kafka is a distributed event streaming platform that enables real-time
data ingestion and processing at scale. It is used for building event-driven
architectures, handling large message queues, and ensuring fault-tolerant data
pipelines. Companies use Kafka for log aggregation, monitoring, and streaming
analytics in applications like fraud detection and recommendation systems.
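As a small illustration of how two of these tools are used from application code, the sketch below assumes a local MongoDB instance and Kafka broker, with the pymongo and kafka-python client libraries installed; the database, collection, and topic names are invented for the example.

import json
from pymongo import MongoClient
from kafka import KafkaProducer

# MongoDB: store a flexible, JSON-like document and query it back.
client = MongoClient("mongodb://localhost:27017")
products = client["shop"]["products"]          # database and collection names are illustrative
products.insert_one({"sku": "A-100", "name": "kettle", "tags": ["kitchen", "electric"]})
print(products.find_one({"sku": "A-100"}))

# Kafka: publish an event to a topic for downstream real-time consumers.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("page-views", {"user": "u42", "page": "/checkout"})
producer.flush()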

Hadoop environment and its components?


Hadoop is an open-source framework based on Java that manages the storage
and processing of large amounts of data for applications. Hadoop uses
distributed storage and parallel processing to handle big data and analytics jobs,
breaking workloads down into smaller workloads that can be run at the same
time. Following are the components that collectively form a Hadoop ecosystem:
1. HDFS (Hadoop Distributed File System)
HDFS is a distributed file system that stores large amounts of data across
multiple machines in a fault-tolerant manner. It splits files into blocks and
replicates them across different nodes to ensure reliability. HDFS follows
a master-slave architecture with a NameNode managing metadata and
DataNodes storing actual data.
2. YARN (Yet Another Resource Negotiator)
YARN is a resource management layer that allocates cluster resources
and schedules tasks for parallel execution. It consists of a
ResourceManager that oversees resource allocation and NodeManagers
that handle execution on worker nodes. YARN enhances Hadoop's
scalability and efficiency by decoupling resource management from data
processing.
3. MapReduce
MapReduce is a programming model for processing large datasets by
dividing tasks into Map (data filtering and sorting) and Reduce
(aggregation and summarization) phases. It distributes computations
across multiple nodes, optimizing parallel processing. Although powerful,
MapReduce can be slower than newer in-memory frameworks like
Apache Spark; a word-count sketch after this list illustrates the Map and
Reduce phases.
4. Spark
Spark is an in-memory data processing engine that performs
computations much faster than MapReduce by reducing disk I/O
operations. It supports batch processing, real-time analytics, machine
learning, and graph processing. Spark's Resilient Distributed Dataset
(RDD) enables fault tolerance and efficient data management.
5. PIG & HIVE
PIG is a high-level scripting language for processing large datasets using
a simplified syntax called Pig Latin. HIVE is a data warehouse
infrastructure that provides SQL-like querying (HiveQL) on top of
Hadoop. Both help non-programmers efficiently analyze and manipulate
big data without writing complex Java code.
6. HBase
HBase is a NoSQL distributed database that provides real-time read and
write access to large datasets. It is modeled after Google's Bigtable and
runs on top of HDFS, ensuring scalability and high availability. HBase is
ideal for handling unstructured or semi-structured data that require fast
retrieval.
7. Mahout & Spark MLlib
Mahout and Spark MLlib are libraries designed for scalable machine
learning on big data. Mahout focuses on algorithms like clustering,
classification, and recommendation systems, leveraging Hadoop’s
parallelism. Spark MLlib, integrated into Spark, offers more efficient
and faster machine learning due to in-memory processing.
8. Solr & Lucene
Solr and Lucene are search and indexing tools that help retrieve
information quickly from large datasets. Lucene is a core text search
library, while Solr builds upon it to provide a full-featured search
platform. They are widely used in applications requiring high-speed text
search and analytics.
9. Zookeeper
Zookeeper is a distributed coordination service that manages
configuration, synchronization, and leader election in Hadoop clusters. It
ensures consistency across distributed applications by maintaining a
centralized repository of metadata. Many Hadoop components, like
HBase and Kafka, rely on Zookeeper for smooth operation.
10. Oozie
Oozie is a workflow scheduler that automates job execution in Hadoop,
managing dependencies between tasks. It supports workflows written in
XML and can trigger jobs based on time or event-based conditions. Oozie
integrates well with MapReduce, Spark, and other Hadoop ecosystem
components to streamline data pipelines.
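To make the MapReduce model and Spark's role concrete, here is a minimal word-count sketch in PySpark (assuming the pyspark package is installed; the input file name and app name are illustrative):

from pyspark.sql import SparkSession

# Classic word count: flatMap/map is the "Map" phase, reduceByKey is the
# "Reduce" phase, and Spark executes both in memory across the cluster.
spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("input.txt")             # hypothetical file in HDFS or local storage
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

for word, count in counts.take(10):
    print(word, count)

spark.stop()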

Business Intelligence (BI)


Business Intelligence (BI) in Big Data Analytics refers to the process of
collecting, analyzing, and transforming large volumes of data into actionable
insights that help businesses make informed decisions. It involves using
technologies, tools, and methodologies to extract valuable information from
structured and unstructured data.
Key Aspects of Business Intelligence in Big Data
1. Data Collection & Integration
o BI tools gather data from multiple sources such as databases, cloud
platforms, IoT devices, social media, and transactional systems.
o Big Data technologies like Hadoop, Spark, and data warehouses
(e.g., Amazon Redshift, Google BigQuery) help manage large-
scale data.
2. Data Processing & Storage
o Big Data systems use distributed computing to process massive
datasets.
o Data lakes and warehouses store structured, semi-structured, and
unstructured data for analysis.
3. Data Analysis & Insights
o BI tools use descriptive, diagnostic, predictive, and prescriptive
analytics to analyze trends and patterns.
o Machine learning and AI can enhance insights by identifying
hidden correlations.
4. Data Visualization & Reporting
o Dashboards and reports (using tools like Power BI, Tableau, and
Looker) help businesses interpret data easily.
o Interactive graphs, charts, and KPIs provide real-time insights for
decision-making.
5. Decision-Making & Strategy
o Businesses use BI insights for forecasting, optimizing operations,
and improving customer experiences.
o Industries like finance, healthcare, retail, and marketing leverage
BI for data-driven strategies.
6. Self-Service BI & Automation
• Modern BI tools enable self-service analytics, allowing non-technical
users to generate reports and insights without IT assistance.
• Automation in BI reduces manual tasks, streamlining processes like data
cleansing, report generation, and alerting.

7. Scalability & Performance Optimization
• BI solutions must scale efficiently to handle increasing data volumes.
• Optimization techniques like in-memory computing, parallel
processing, and indexing improve query performance and reduce
processing time.
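As a toy example of the descriptive-analytics step that feeds such dashboards (assuming pandas is installed; the revenue figures are invented):

import pandas as pd

# Descriptive analytics: summarize historical sales by region and quarter,
# the kind of aggregate a Power BI or Tableau dashboard would visualize.
sales = pd.DataFrame({
    "region":  ["north", "north", "south", "south"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [120_000, 135_000, 90_000, 101_000],
})

summary = sales.pivot_table(index="region", columns="quarter",
                            values="revenue", aggfunc="sum")
summary["growth_%"] = (summary["Q2"] / summary["Q1"] - 1) * 100
print(summary.round(1))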
Compare business intelligence with traditional big data
