Unit 1
INTRODUCTION
Big data is commonly described in terms of five characteristics, known as the five Vs: Volume, Velocity, Variety, Veracity, and Value.
1. Volume:
Volume is the amount of data generated by organizations or individuals. Today, the volume of data in most organizations is approaching the exabyte scale. Some experts predict that the volume of data will reach zettabytes in the coming years. Organizations are doing their best to handle this
ever-increasing volume of data. For example, Google Inc. processes around 20 petabytes of
data, and Twitter feeds generate around 8 terabytes of data every day.
2. Velocity:
Velocity describes the rate at which data is generated, captured, and shared. Enterprises can capitalize on data only if it is captured and shared in real time. Information processing systems such as CRM and Enterprise Resource Planning (ERP) face problems associated with data that keeps accumulating but cannot be processed quickly. These systems are typically able to process data in batches every few hours; however, even this time lag causes the data to lose its importance, as new data is constantly being generated. For example, eBay analyzes around 5 million transactions per day in real time to detect and prevent fraud arising from the use of PayPal.
3. Variety:
We all know that data is being generated at a very fast pace. This data is generated from different types of sources, such as internal, external, social, and behavioral, and comes in different formats, such as images, text, videos, etc. Even a single source can generate data in varied formats: for example, GPS devices and social networking sites such as Facebook produce data of all types, including text, images, and videos.
4. Veracity:
Veracity generally refers to the uncertainty of data, i.e., whether the obtained data is correct and consistent. Out of the huge amount of data that is generated in almost every process, only the data that is correct and consistent can be used for further analysis. Data, when processed, becomes information; however, a lot of effort goes into processing the data. Big Data, especially in its unstructured and semi-structured forms, is messy in nature, and it takes a good amount of time and expertise to clean that data and make it suitable for analysis.
5. Value:
This fifth and final characteristic can be defined as the added value or utility that the
collected data can bring to a decision-making process, business activity or analysis.
However, for data to be useful, it is necessary to convert it into knowledge. This requires
the use and combination of different technologies such as data mining, predictive
analytics, text mining, etc. This aims to achieve three major business objectives: cost
reduction, quick and effective decision-making, and the design of new products or
services.
Evolution of Big Data:
The Emergence of Data Warehousing (1980s–1990s)
Key Development: Data warehouses and Online Analytical Processing (OLAP).
Technologies: Companies like Oracle, IBM, and Microsoft introduced systems for storing historical data for reporting and analysis.
Features: Centralized repositories allowed for structured data analysis using Business Intelligence (BI) tools.
Usage: Enterprises began using data to identify trends and patterns for decision-making.
The Internet Boom (1990s–2000s)
Key Development: Explosion of data from web applications.
Technologies: Emergence of XML, data mining, and the beginning of distributed systems.
Impact: Businesses collected more unstructured data (e.g., emails, documents, and web logs), and traditional databases struggled to handle growing volumes.
Usage: E-commerce, search engines, and web applications generated vast amounts of user data.
Big Data Era (2000s–2010s)
Key Development: Big data as a concept emerged with the "3Vs" (Volume, Velocity, Variety) coined by Doug Laney in 2001.
Technologies:
o Apache Hadoop (2006): Enabled distributed storage and processing of large datasets.
o NoSQL databases (e.g., MongoDB, Cassandra): Allowed flexibility for unstructured data.
o Apache Spark (2010): Enabled faster data processing compared to Hadoop.
Impact: Real-time analytics became possible, and industries like social media, finance, and healthcare adopted big data tools.
Usage: Predictive analytics, recommendation systems, and IoT data processing.
Advanced Analytics and AI (2010s–2020s)
Key Development: Integration of machine learning (ML) and artificial intelligence (AI) with big data.
Technologies:
o Cloud computing platforms (e.g., AWS, Azure, Google Cloud): Scaled big data processing.
o Tools like TensorFlow and PyTorch: Advanced ML and deep learning.
o Real-time streaming technologies (e.g., Kafka, Flink): Enabled continuous data ingestion and processing.
Impact: Big data shifted from descriptive analytics to predictive and prescriptive analytics.
Usage: Personalization, fraud detection, autonomous systems, and more.
Modern Big Data and Edge Computing (2020s–Present)
Key Development: Data generated by IoT, 5G, and edge computing.
Technologies:
o Edge computing: Processes data closer to its source to reduce latency.
o Data lakes and lakehouses: Combine structured and unstructured data for unified processing.
o AI-driven analytics: Automates insights using generative AI and advanced models.
Impact: Big data integrates seamlessly with AI, enabling real-time decision-making and hyper-personalization.
Usage: Smart cities, autonomous vehicles, and advanced predictive maintenance.
Big data offers immense potential, but it also comes with several challenges related to its
management, analysis, and ethical considerations. Here are the key challenges:
1. Data Volume
Issue: The sheer amount of data generated daily is overwhelming, making storage,
management, and processing complex.
Impact: Traditional systems cannot handle the exponential growth of data, requiring
scalable and cost-effective solutions.
Solution: Cloud storage, distributed systems (e.g., Hadoop, Apache Spark), and data
compression techniques.
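As a hedged illustration of the distributed-processing approach mentioned in the solution above, the sketch below uses PySpark (Apache Spark's Python API) to aggregate a large file in parallel. The file name events.csv and the column event_type are invented placeholders, not from the source.
```python
# Minimal PySpark sketch: distributed aggregation of a large dataset.
# Assumes PySpark is installed and "events.csv" (with an "event_type" column) exists.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("volume-demo").getOrCreate()

# Spark reads and partitions the file across workers instead of loading it in one process.
events = spark.read.csv("events.csv", header=True, inferSchema=True)

# The aggregation runs in parallel on the partitions; only the small result is collected.
events.groupBy("event_type").count().show()

spark.stop()
```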
2. Data Variety
Issue: Big data includes structured, semi-structured, and unstructured data from diverse
sources (e.g., text, images, videos, IoT sensors).
Impact: Integrating and analyzing heterogeneous data formats is challenging.
Solution: Use of NoSQL databases, data lakes, and advanced data integration tools.
3. Data Velocity
Issue: The speed at which data is generated and needs to be processed (e.g., IoT
sensors, social media feeds) can overwhelm traditional systems.
Impact: Difficulty in performing real-time analysis and decision-making.
Solution: Real-time streaming technologies like Apache Kafka and Flink.
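A minimal sketch of the streaming approach named above, assuming a Kafka broker reachable at localhost:9092, a topic called sensor-readings (an invented name), and the kafka-python package:
```python
# Minimal streaming-ingestion sketch using kafka-python.
# Assumes a Kafka broker at localhost:9092 and a topic named "sensor-readings".
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="latest",
)

# Each message is handled as it arrives instead of waiting for a batch window.
for message in consumer:
    reading = message.value
    if reading.get("temperature", 0) > 90:   # simple real-time rule
        print("alert:", reading)
```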
4. Data Quality
Issue: Big data often contains noisy, incomplete, inconsistent, or inaccurate information.
Impact: Poor data quality leads to unreliable insights and decision-making errors.
Solution: Data cleansing, preprocessing, and robust validation protocols.
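As a hedged example of the cleansing and preprocessing step, the pandas sketch below removes duplicates, drops rows missing a key field, and coerces a numeric column; the file customers.csv and its columns are assumed for illustration:
```python
# Basic data-cleansing sketch with pandas.
# "customers.csv" and its columns ("customer_id", "age") are hypothetical.
import pandas as pd

df = pd.read_csv("customers.csv")

df = df.drop_duplicates()                               # remove exact duplicate records
df = df.dropna(subset=["customer_id"])                  # discard rows missing the key field
df["age"] = pd.to_numeric(df["age"], errors="coerce")   # invalid ages become NaN
df = df[df["age"].between(0, 120)]                      # keep only plausible values

df.to_csv("customers_clean.csv", index=False)
```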
5. Data Privacy and Security
Issue: Handling sensitive data raises concerns about compliance with laws like GDPR, HIPAA, and CCPA.
Impact: Unauthorized access, data breaches, or misuse can lead to legal and reputational consequences.
Solution: Implement strong governance frameworks, encryption, and access control mechanisms.
6. Skill Gap
Issue: A shortage of professionals skilled in big data tools, frameworks, and analytics.
Impact: Organizations struggle to fully leverage big data capabilities.
Solution: Invest in training programs, certifications, and hiring specialized talent.
7. Cost Management
Issue: High costs associated with storage, processing, and analysis infrastructure.
Impact: Budget constraints for small and medium-sized enterprises.
Solution: Optimize resource allocation, leverage open-source tools, and adopt pay-as-
you-go cloud services.
8. Data Analysis Complexity
Issue: Turning raw data into meaningful insights is complex and requires advanced analytics tools and expertise.
Impact: Organizations may fail to derive value from their big data investments.
Solution: Invest in AI/ML-powered analytics, data visualization tools, and domain-specific algorithms.
9. Bias and Ethical Concerns
Issue: Algorithms may inherit biases from training data, leading to unfair or discriminatory outcomes.
Impact: Reputational damage and ethical violations.
Solution: Regular audits, diverse datasets, and bias-mitigation techniques.
Big data is essential because it transforms raw information into actionable insights. It
empowers organizations to innovate, optimize, and stay competitive in a rapidly changing
world. Whether it's improving customer experience, operational efficiency, or predictive
capabilities, big data is the key to unlocking the full potential of data. Here’s why big data
matters:
1. To Make Better Decisions
Example: Businesses use big data analytics to understand customer behavior, optimize
operations, and improve strategies.
Why: Big data enables data-driven decision-making, reducing guesswork and increasing
precision.
Example: Manufacturers use big data to monitor equipment performance and predict
failures (predictive maintenance).
Why: It reduces downtime, minimizes costs, and improves productivity.
4. To Drive Innovation
5. To Stay Competitive
Example: Financial institutions use big data to detect fraud in real time.
Why: It allows immediate actions that can prevent losses or improve outcomes.
Example: Weather forecasting systems analyze historical and real-time data to predict
natural disasters.
Why: Predictive analytics helps mitigate risks and prepare for the future.
Example: Amazon uses big data to create targeted advertising campaigns based on user
preferences.
Why: It increases the effectiveness of marketing efforts and ROI.
Example: Social media platforms process billions of posts, images, and videos daily.
Why: Big data systems are built to handle volume, velocity, and variety that traditional
methods can't manage.
Structuring Data:
Structuring data is essential because it transforms raw, unorganized information into a format
that is efficient, actionable, and aligned with business, regulatory, and technological
requirements. It is the foundation for accurate decision-making, operational efficiency, and
innovation. Structuring data is required to make it usable, efficient, and meaningful for various
applications. Here’s why it is essential:
1. To Enable Easier Analysis and Querying
Structured data is organized in a standardized format (e.g., tables with rows and
columns), allowing for easier querying and analytics.
Example: An e-commerce database with fields like "Order ID," "Product Name," and
"Purchase Date" enables quick trend analysis.
3. To Enable Scalability
Structured data systems are designed to scale efficiently as data volumes grow,
especially in databases.
Example: A relational database can handle increasing numbers of transactions
systematically when structured correctly.
4. To Improve Efficiency
Structured data reduces the time and computational resources needed for processing.
Example: A well-structured employee database speeds up payroll processing compared
to sifting through unstructured text files.
Types of data:
Data that comes from multiple sources, such as databases, Enterprise Resource Planning (ERP)
systems, weblogs, chat history, and GPS maps, varies in its format. However, different formats
of data need to be made consistent and clear to be used for analysis. Data is obtained primarily
from the following types of sources:
structured data, semi-structured data, and unstructured data.
1. Structured data
Structured data can be defined as the data that has a defined repeating pattern. This pattern
makes it easier for any program to sort, read, and process the data. Processing structured data
is much easier and faster than processing data without any specific repeating patterns.
Structured data:
Is organized data in a predefined format
Is stored in tabular form
Is the data that resides in fixed fields within a record or file
Is formatted data that has entities and their attributes mapped
Is used to query and report against predetermined data types
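A minimal sketch of how structured, fixed-field data can be queried and reported against, using Python's built-in sqlite3 module; the orders table and its columns are hypothetical:
```python
# Structured data lives in fixed fields, so it can be queried directly with SQL.
# The "orders" table and its columns are hypothetical examples.
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.execute("CREATE TABLE orders (order_id INTEGER, product_name TEXT, purchase_date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "Laptop", "2024-01-05"), (2, "Phone", "2024-01-06"), (3, "Laptop", "2024-01-07")],
)

# A predefined schema makes reporting queries straightforward.
for row in conn.execute(
    "SELECT product_name, COUNT(*) FROM orders GROUP BY product_name"
):
    print(row)
conn.close()
```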
2. Unstructured Data:
Unstructured data is a set of data that might or might not have any logical or repeating
patterns.
Unstructured data:
Text both internal and external to an organization- Documents, logs, survey results,
feedback, and emails from both within and across the organization
Social media- Data obtained from social networking platforms, including YouTube,
Facebook, Twitter, LinkedIn, and Flickr.
Mobile Data- Data such as text messages and location information.
3. Semi-structured Data
Semi-structured data, also known as schema-less or self-describing data, refers to a form of structured data that contains tags or markup elements in order to separate elements and generate hierarchies of records and fields in the given data. This type of data does not follow the strict structure of the data models used in relational databases; in other words, it cannot be stored consistently in the rows and columns of a database.
For example:
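The short JSON document below is an invented illustration, not taken from the source: keys such as "customer" and "orders" act as tags that separate elements and create a hierarchy of records without any fixed relational schema, and Python's built-in json module can parse it.
```python
# Semi-structured data: self-describing JSON with invented field names.
import json

document = """
{
  "customer": {"id": 101, "name": "Asha"},
  "orders": [
    {"order_id": 1, "items": ["laptop", "mouse"]},
    {"order_id": 2, "items": ["phone"]}
  ]
}
"""

# Keys describe the structure, so the data can be parsed without a fixed schema.
record = json.loads(document)
print(record["customer"]["name"])          # -> Asha
print(len(record["orders"][0]["items"]))   # -> 2
```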
Big Data Analytics:
Big data analytics refers to the methods, tools, and applications used to collect, process, and derive insights from varied, high-volume, high-velocity data sets. These data sets may come from a variety of sources, such as web, mobile, email, social media, and networked smart devices.
Big data analytics has transformed the way business is conducted in many ways; for example, it improves decision-making, business process management, etc. Business analytics uses data together with techniques from information technology, statistics, and quantitative methods, as well as different models, to provide results. The four main types of analytics are:
1. Descriptive Analytics
2. Diagnostic Analytics
3. Predictive Analytics
4. Prescriptive Analytics
1. Descriptive Analytics:
Descriptive analytics is the most prevalent form of analytics, and it serves as a base for
advanced analytics.
Descriptive analytics looks at data and analyzes past events for insights into how to approach future events. It examines past performance by mining historical data to understand the causes of success or failure in the past.
Techniques Used: Data aggregation, data visualization, and basic statistical analysis.
Example: Almost all management reporting such as sales, marketing, operations, and
finance uses this type of analysis.
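A minimal sketch of the data-aggregation technique behind this kind of management reporting, using pandas; the file sales.csv and its columns are assumed for illustration:
```python
# Descriptive analytics sketch: summarize what has already happened.
# "sales.csv" with columns "region", "month", "revenue" is a hypothetical input.
import pandas as pd

sales = pd.read_csv("sales.csv")

# Aggregate historical records into a management-style summary report.
summary = sales.groupby(["region", "month"])["revenue"].agg(["sum", "mean", "count"])
print(summary)
```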
2. Diagnostic Analytics:
In diagnostic analytics, we generally use historical data to answer a question or solve a problem, looking for dependencies and patterns in the historical data related to that problem. Its purpose is to determine the causes of past events and to identify patterns or anomalies, answering the question "Why did it happen?"
Techniques Used: Drill-down, data mining, and correlation analysis.
Example: 1. Analyzing why product sales dropped in a specific quarter. 2. Identifying the
reasons behind increased customer churn rates.
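As a hedged sketch of the correlation-analysis technique listed above, the pandas snippet below checks how strongly customer churn moves with other numeric columns; the file churn.csv and its columns are invented for illustration:
```python
# Diagnostic analytics sketch: look for factors that move together with churn.
# "churn.csv" with numeric columns including "churned" is a hypothetical input.
import pandas as pd

data = pd.read_csv("churn.csv")

# Correlation of every numeric column with the churn indicator,
# sorted so the strongest candidate drivers appear first.
correlations = data.corr(numeric_only=True)["churned"].drop("churned")
print(correlations.sort_values(key=abs, ascending=False))
```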
3. Predictive Analytics:
Predictive analytics is about understanding and predicting the future; it answers the question "What could happen?" by using statistical models and various forecasting techniques.
It predicts near-future probabilities and trends and supports what-if analysis. In predictive analytics, we use statistics, data mining techniques, and machine learning to analyze the future. The following figure shows the steps involved in predictive analytics.
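A minimal sketch of the statistical-modeling step, assuming scikit-learn is available; the monthly sales figures are invented purely to show the mechanics of fitting a model and forecasting the next value:
```python
# Predictive analytics sketch: fit a simple trend model and forecast ahead.
# The sales numbers are invented; real projects would use far richer features.
import numpy as np
from sklearn.linear_model import LinearRegression

months = np.arange(1, 13).reshape(-1, 1)                 # months 1..12
sales = np.array([110, 115, 123, 130, 135, 142,
                  150, 158, 163, 170, 176, 184])         # past observations

model = LinearRegression().fit(months, sales)

# "What could happen?" - project the trend one month forward.
next_month = np.array([[13]])
print("forecast for month 13:", model.predict(next_month)[0])
```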
4. Prescriptive Analytics:
Prescriptive analytics answers "What should we do?" on the basis of complex data obtained from descriptive and predictive analyses. By using optimization techniques, prescriptive analytics determines the best alternative to minimize or maximize some objective in finance, marketing, and many other areas.
For example, if we have to find the best way of shipping goods from factories to destinations while minimizing costs, we will use prescriptive analytics (see the optimization sketch after the techniques listed below).
The figure shows a diagrammatic representation of the stages involved in prescriptive analytics.
Techniques Used: Optimization models, simulation, and advanced machine learning
algorithms.
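As a hedged sketch of the optimization technique, assuming SciPy is installed, the snippet below solves a toy version of the shipping example as a linear program; all costs, supplies, and demands are invented numbers:
```python
# Prescriptive analytics sketch: a tiny transportation problem solved as a linear program.
# Costs, supplies, and demands are invented numbers purely for illustration.
from scipy.optimize import linprog

# Decision variables: x = [f1->d1, f1->d2, f2->d1, f2->d2] (units shipped).
cost = [4, 6, 5, 3]                       # shipping cost per unit on each route

# Each factory cannot ship more than its supply.
A_ub = [[1, 1, 0, 0],                     # factory 1 capacity
        [0, 0, 1, 1]]                     # factory 2 capacity
b_ub = [50, 40]

# Each destination must receive exactly its demand.
A_eq = [[1, 0, 1, 0],                     # destination 1 demand
        [0, 1, 0, 1]]                     # destination 2 demand
b_eq = [30, 45]

result = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                 bounds=[(0, None)] * 4, method="highs")

print("minimum total cost:", result.fun)
print("units per route:", result.x)       # the recommended shipping plan
```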
Big Data Analytics is critical for several reasons, spanning business, research, and societal
applications:
1. Enhanced Decision-Making
2. Improved Efficiency
3. Personalization
4. Competitive Advantage
5. Cost Reduction
6. Risk Management
Fuels research in fields such as healthcare, climate science, and genomics by uncovering
patterns in massive data sets.
Drives innovation by identifying new product opportunities or untapped markets.
The right analysis of the available data can improve major business processes in various ways.
For example, in a manufacturing unit, data analytics can improve the functioning of the
following processes:
1. Procurement-To find out which suppliers are more efficient and cost-effective in
delivering products on time
2. Product Development-To draw insights on innovative product and service formats and
designs for enhancing the development process and coming up with demanded
products
3. Manufacturing-To identify machinery and process variations that may be indicators of
quality problems
4. Distribution-To enhance supply chain activities and standardize optimal inventory levels on the basis of various external factors such as weather, holidays, economy, etc.
5. Marketing-To identify which marketing campaigns will be the most effective in driving
and engaging customers and understanding customer behaviors and channel behaviors
6. Price Management-To optimize prices based on the analysis of external factors
7. Merchandising-To improve merchandise breakdown on the basis of current buying
patterns and increase inventory levels and product interest insights on the basis of the
analysis of various customer behaviors
8. Sales-To optimize assignment of sales resources and accounts, product mix, and other
operations
9. Store Operations-To adjust inventory levels on the basis of predicted buying patterns,
study of demographics, weather, key events, and other factors
10. Human Resources-To find out the characteristics and behaviors of successful and effective employees, as well as other employee insights for managing talent better
Every business and industry today is affected by, and benefits from, Big Data analytics in multiple ways.
Data Science:
Data science is the study of data to extract meaningful insights for business. It is a
multidisciplinary approach that combines principles and practices from the fields of
mathematics, statistics, artificial intelligence, and computer engineering to analyze large
amounts of data.
This analysis helps data scientists to ask and answer questions like what happened, why
it happened, what will happen, and what can be done with the results.
Data science is important because it combines tools, methods, and technology to
generate meaning from data.
Data science helps businesses make informed decisions and improve their products and
services.
Data science can help solve global challenges, such as predicting the risk of hospital
readmission or which skin lesions are likely to become cancerous.
Responsibilities of a Data Scientist:
A data scientist can use a range of different techniques, tools, and technologies as part of the
data science process. Based on the problem, they pick the best combinations for faster and
more accurate results.
A data scientist is responsible for turning raw data into actionable insights that guide decision-
making. They blend technical expertise, analytical skills, and domain knowledge to tackle
diverse challenges, making them key contributors to data-driven organizations.
Their work typically spans data collection, processing, analysis, and communication. Here are
the primary responsibilities of a data scientist:
1. Data Collection and Preparation
Gather data from various sources, such as databases, APIs, web scraping, or third-party
providers.
Organize data into structured formats for analysis using tools like SQL, Python, or R.
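A hedged sketch of this collect-and-organize step in Python: the API URL is a placeholder, and pandas.json_normalize flattens the returned JSON records into a table ready for analysis.
```python
# Data-collection sketch: pull records from an API and organize them into a table.
# The URL is a placeholder; real endpoints, auth, and paging will differ.
import pandas as pd
import requests

response = requests.get("https://example.com/api/orders", timeout=30)
response.raise_for_status()

# Flatten the JSON payload into rows and columns for downstream analysis.
orders = pd.json_normalize(response.json())
orders.to_csv("orders.csv", index=False)
print(orders.head())
```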
2. Data Analysis
Collaborate with stakeholders to understand business problems and translate them into
data science solutions.
Define clear goals for projects and align analytical efforts with organizational objectives.
Propose data-driven strategies and recommendations for decision-making.
7. Research and Innovation
Stay updated with the latest advancements in data science, machine learning, and AI.
Experiment with new algorithms, frameworks, and tools to improve workflows and
outcomes.
Contribute to the development of proprietary methods or solutions.
8. Cross-Functional Collaboration
Work closely with engineers, analysts, product managers, and domain experts to
implement data-driven solutions.
Help data engineers design efficient data pipelines and architectures.
Collaborate with business teams to align insights with strategic goals.
In a big data environment, numerous terminologies are commonly used to describe processes,
tools, technologies, and concepts. Here are the key terminologies organized into categories:
Big Data: Large, complex datasets that traditional data processing tools cannot
efficiently manage.
3Vs/4Vs/5Vs: Characteristics of big data:
o Volume: The size of data.
o Velocity: The speed at which data is generated and processed.
o Variety: Different types and formats of data (structured, unstructured, semi-
structured).
o Veracity: Data accuracy and reliability.
o Value: The actionable insights derived from data.
Data Warehouse: A centralized storage system optimized for structured data and query
performance.
ETL (Extract, Transform, Load): The process of extracting data from multiple sources, transforming it into a usable format, and loading it into a storage system (a minimal sketch appears just after this group of terms).
Data Mining: The practice of discovering patterns and relationships in large datasets.
Machine Learning (ML): Using algorithms to enable systems to learn and make
predictions or decisions based on data.
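A minimal ETL sketch under assumed inputs: raw_orders.csv is a hypothetical source file, the transform step standardizes columns and drops unusable rows, and a local SQLite database stands in for the warehouse.
```python
# Minimal ETL sketch: extract from a CSV, transform with pandas, load into SQLite.
# "raw_orders.csv" and its columns are hypothetical; SQLite stands in for a warehouse.
import sqlite3
import pandas as pd

# Extract: read raw records from the source system.
raw = pd.read_csv("raw_orders.csv")

# Transform: normalize column names, parse dates, and drop unusable rows.
raw.columns = [c.strip().lower() for c in raw.columns]
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")
clean = raw.dropna(subset=["order_id", "order_date"])

# Load: append the cleaned records into the target table.
conn = sqlite3.connect("warehouse.db")
clean.to_sql("orders", conn, if_exists="append", index=False)
conn.close()
```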
6. Data Governance and Security
Data Governance: Policies and procedures to manage the availability, usability, and
security of data.
Data Encryption: Protecting sensitive data through encoding.
Compliance: Adhering to legal and regulatory requirements (e.g., GDPR, HIPAA).
7. Emerging Trends
IoT (Internet of Things): Devices that generate massive data streams from sensors and
connected systems.
AI (Artificial Intelligence): Advanced analytics for decision-making, often overlapping
with big data technologies.