BigData Techniques and Technologies
Introduction to Big Data
NGUYỄN Ngọc Hoá
Department of Information Systems
VNU University of Engineering and Technology
Hoa.Nguyen@vnu.edu.vn
Outline
1. Definitions
2. SOTA Data Models
3. Data Analytics
4. Big Data Stack
5. Big Data Potential Applications & Landscape
6. Big Data Jobs
2 Big Data @ DIS
1. Definitions
Big Data
"Big" in Big Data refers to:
– Big size is the primary definition.
– Big complexity rather than big volume: big data can be small, and not all
large datasets are big data.
– Size matters, but so do accessibility, interoperability, and
reusability.
Big Data is commonly defined using the 3Vs, namely:
– Volume, Variety, Velocity
Big Data: A buzz word?
Where Does Big Data Come From?
Big Data is not a specific application type, but rather a trend, or even a
collection of trends, spanning multiple application types
Data is growing in multiple ways:
– More data (volume of data)
– More types of data (variety of data)
– Faster ingest of data (velocity of data)
– More accessibility of data (internet, instruments, …)
– Data growth and availability exceed an organization's ability to make
intelligent decisions based on it
Who is generating Big Data
How much data?
– Processes 40 EB a day (2023)
– Search index: 100 EB (2023)
– Performs 8.5B searches/day (2023)
– Crawls 20B web pages a day (2023)
– 19 Hadoop clusters: 600 PB, 40k servers (9/2015)
– 550 PB on 50k+ servers running 15k apps (2024)
– Hadoop: 10K nodes, 150K cores, 150 PB (4/2014)
– 1,000 PB data in Hive + 4 PB/day (2024)
– S3: 2T objects, 1.1M requests/second (4/2013)
– LHC: ~15 PB a year
– LSST: 6-10 PB a year (~2020)
– SKA: 0.3 – 1.5 EB per year (~2020)
"640K ought to be enough for anybody."
From http://www.umiacs.umd.edu/~jimmylin/
How much data?
Airbus A350: equipped with ~6,000 sensors, generating ~2.5 TB of data per day.
Autonomous vehicles: generate 4 to 6 TB of data per day.
Smart factories: produce ~500 TB of data per day.
IoT data is projected to reach 73.1 ZB by 2025. (Source: IDC)
[Figure: pyramid of levels, top to bottom: Management, Planning, Supervision, Control, Field; labeled "Batch – More Compute" and "Stream – More Data"]
Volume, Variety, and Velocity
Data volumes that used to be measured in petabytes (PB) are
now measured in zettabytes (ZB).
– A zettabyte is a trillion gigabytes (GB)
– or a billion terabytes (TB)
In 2010 we crossed the 1 ZB mark, and by the end of
2025 the total was estimated to reach 175 ZB (source: IDC)
– Google: 100 PB/day processed, 15,000 PB storage
– eBay: 100 TB/day, 90 PB storage
– Baidu: 10-100 TB/day, 2000 PB storage
– Facebook: 600 TB/day, 300 PB storage
– Spotify: 2.2 TB/day, 100 PB storage
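The unit relationships on this slide can be checked with a few lines of arithmetic. This is a minimal sketch assuming decimal (SI) units; the `UNITS` table and the `convert` helper are illustrative names, not part of the original slides.

```python
# Decimal (SI) storage units expressed in bytes (assumption: powers of 10, not 2).
UNITS = {"GB": 10**9, "TB": 10**12, "PB": 10**15, "EB": 10**18, "ZB": 10**21}

def convert(value, src, dst):
    """Convert a quantity between two units listed in UNITS."""
    return value * UNITS[src] / UNITS[dst]

gb_per_zb = convert(1, "ZB", "GB")        # a trillion gigabytes
tb_per_zb = convert(1, "ZB", "TB")        # a billion terabytes
estimate_in_pb = convert(175, "ZB", "PB") # the 175 ZB estimate in petabytes

print(gb_per_zb, tb_per_zb, estimate_in_pb)
```

Expressed this way, the 175 ZB estimate is 175 million PB, which puts the per-company storage figures listed above in perspective.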
Volume, Variety, and Velocity
Different types: a single application can generate or
collect many types of data
– Relational data (relation/transactions/legacy data)
– Text data (Web)
– Semi-structured data (XML)
– Graph data: social network, semantic web, …
– Streaming data, …
Different sources:
– User shopping behaviors from Shopee/Lazada/Tiki/Amazon …
– Product reviews from different provider websites
To extract knowledge, all these types of data need to be
linked together
– Trying to capture all of the data that pertains to our decision-making process.
– Making sense of unstructured data, such as opinions or images.
A global view of linked big data
Volume, Variety, and Velocity
The rate at which data arrives at the enterprise and is
processed or well understood
In other terms: “How long does it take you to do something
about it, or to know it has even arrived?”
Data is generated fast and needs to be processed fast
Late decisions mean missed opportunities
Examples:
– e-Promotions: based on your current location, your purchase history,
and what you like, send promotions right now for the store next to you
– Healthcare monitoring: sensors monitor your activities and body;
any abnormal measurements require immediate reaction
– Disaster management and response
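The healthcare monitoring example can be sketched as a check that runs on each reading as it arrives, instead of waiting for a batch. This is a minimal illustration only: the rolling window, the 3-sigma threshold, and the sample heart-rate values are assumptions, not part of the slides.

```python
from collections import deque
from statistics import mean, pstdev

def flag_abnormal(readings, window=5, threshold=3.0):
    """Yield (index, value) for readings far from the recent rolling mean."""
    recent = deque(maxlen=window)  # only the last `window` readings are kept
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu = mean(recent)
            sigma = pstdev(recent)
            # Flag the reading if it deviates strongly from recent history.
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield i, value
        recent.append(value)

# Hypothetical heart-rate stream (beats per minute).
heart_rate = [72, 74, 71, 73, 72, 75, 140, 74, 73]
alerts = list(flag_abnormal(heart_rate))
print(alerts)
```

Because the function is a generator, an alert can be raised as soon as the abnormal reading is seen. One known limitation of this sketch: the abnormal value enters the window and inflates the spread for the next few readings.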
Realtime Analytics/Decision Requirements
Today, real-time analytics make it possible to optimize Like
buttons both on a website and on Facebook.
Facebook uses anonymised data to
show the number of times people:
– saw Like buttons,
– clicked Like buttons,
– saw Like stories on Facebook,
– and clicked Like stories to visit a given
website.
Extensions: 6Vs of Big Data
Volume: The amount of data being generated is growing rapidly and
becoming increasingly large, making it difficult to store and process using
traditional methods.
Variety: Big data comes in many different formats, including
structured, semi-structured, and unstructured data, such as text,
images, videos, and sensor data.
Velocity: Big data is generated and processed at a high speed, requiring
real-time processing capabilities.
Veracity: The quality of big data can be uncertain, making it important to
validate and clean the data before using it for analysis.
Value: Despite its challenges, big data holds the potential to deliver
valuable insights and drive business results, making it an important asset
for organizations.
Variability: How often the data changes. Managing this drift in the data
helps organizations come up with the latest products.
Other V’s
Visibility/Visualization: after big data has been processed, we
need a way to present it in a manner that is readable
and accessible.
Viscosity: describes the latency or lag time in the data relative
to the event being described. This can just as
easily be understood as an element of Velocity.
Virality: defined by some users as the rate at which the data
spreads; how often it is picked up and repeated by other
users or events.
Volatility: refers to how long data is valid and how long it
should be stored. You need to determine at what point data
is no longer relevant to the current analysis.
…
Who Cares About Big Data?
Government
Finance, Banking
Manufacturing
Education
Health
Traffic
IoT
….
How to deal with Big Data?
Advice from Jim Gray (Turing Award, 1998):
1. Analysing big data requires scale-out solutions, not scale-up
solutions
2. Move the analysis to the data.
3. Work with scientists to find the most common “20 queries” and make
them fast.
4. Go from “working to working.”
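Points 1 and 2 can be illustrated with a toy computation of a global mean. Each partition is summarised where it lives and only the small summaries travel to a coordinator. This is a sketch under stated assumptions: the partitions are plain Python lists standing in for data on separate nodes, and `local_summary` and `combine` are hypothetical helper names.

```python
def local_summary(partition):
    """Runs where the partition lives; returns a small (sum, count) pair."""
    return sum(partition), len(partition)

def combine(summaries):
    """Runs on the coordinator; merges the per-partition summaries."""
    total = sum(s for s, _ in summaries)
    count = sum(c for _, c in summaries)
    return total / count

# Hypothetical sensor temperatures, split across three "nodes".
partitions = [[21.5, 22.0, 20.5], [19.0, 23.0], [22.5, 21.0, 20.0, 24.0]]

# Only (sum, count) pairs move over the network, never the raw readings.
summaries = [local_summary(p) for p in partitions]
global_mean = combine(summaries)
print(global_mean)
```

Adding more nodes adds more partitions without changing the coordinator logic, which is the scale-out property the advice refers to.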
20 Common Queries
1. Data Retrieval & Search
• Find a specific record by ID (e.g., retrieve a user, product, or experiment result).
• Retrieve a list of recent records (e.g., latest 100 transactions, newest publications).
• Full-text search on large text fields (e.g., search scientific papers for a keyword).
• Find records with partial matches (e.g., autocomplete suggestions for search terms).
2. Aggregation & Statistics
• Compute total count (e.g., how many users made purchases this month?).
• Find the sum, average, min, max of a field (e.g., average sensor temperature, highest sales).
• Group by and aggregate (e.g., total sales per region, number of cases per category).
3. Time-Based Analysis
• Find records in a given time range (e.g., transactions from Jan 1–Feb 1).
• Compare trends over time (e.g., sales growth from last year to this year).
• Compute moving averages over time (e.g., rolling 7-day average of website visits).
4. Join & Relationships
• Join two or more datasets (e.g., link customer data with purchase history).
• Find co-occurring items (e.g., products frequently bought together).
• Detect missing or mismatched records (e.g., students registered for a course but missing from attendance logs).
5. Ranking & Sorting
• Find the top N results (e.g., top 10 best-selling books).
• Find anomalies/outliers (e.g., detect unusually high spending in credit card transactions).
7. Geospatial Processing
• Find records within a location radius (e.g., all hospitals within 10 km of a given point).
• Find the nearest neighbors (e.g., closest weather stations to a given location).
8. Machine Learning & Anomaly Detection
• Find similar records (e.g., customers with similar purchasing behavior).
• Detect duplicate records (e.g., merge duplicate user profiles).
• Find sudden spikes or drops (e.g., unusual traffic surge on a website).
• Detect missing expected values (e.g., expected daily report missing for a date).
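Three of the query shapes listed above (group-by aggregation, time-range filter, top N) can be tried on a small in-memory SQLite database. The `sales` table, its columns, and the sample rows are made up for illustration; a real big data system would run the same query shapes on a distributed engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, product TEXT,
                    amount REAL, sold_on TEXT);
INSERT INTO sales (region, product, amount, sold_on) VALUES
  ('North', 'book', 12.0, '2024-01-05'),
  ('North', 'pen',   3.0, '2024-01-20'),
  ('South', 'book', 15.0, '2024-02-02'),
  ('South', 'lamp', 40.0, '2024-01-15'),
  ('North', 'lamp', 38.0, '2024-03-01');
""")

# Group by and aggregate: total sales per region.
per_region = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Time range: records from Jan 1 up to (not including) Feb 1.
january = conn.execute(
    "SELECT product FROM sales WHERE sold_on >= ? AND sold_on < ? "
    "ORDER BY sold_on",
    ("2024-01-01", "2024-02-01"),
).fetchall()

# Top N: the two largest individual sales.
top2 = conn.execute(
    "SELECT product, amount FROM sales ORDER BY amount DESC LIMIT 2"
).fetchall()

print(per_region, january, top2)
```

Dates are stored as ISO 8601 strings here so that plain string comparison matches chronological order.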
Why study big data technologies?
Hot topic in both research and industry
In high demand in the real world
A promising future career
– Research and development of big data systems: distributed systems
(e.g. Hadoop), visualization tools, data warehouse/lake, OLAP, data
integration, data quality control, …
– Big data applications: social marketing, healthcare, …
– Data analysis: to get values out of big data, such as discovering and
applying patterns, predictive analysis, business intelligence, privacy
and security, …
2. SOTA Data Models
Data Is Driving Everything
“Big data” “Deep learning”
“Data science” “Statistical analysis”
“Data lakes” “Biomedical informatics”
“Visual analytics” “Business analytics”
Lots of trends in pursuit of the same goals!
Discovery, models, decision-making, …
Data Needs to Be Modeled, Cleaned, and Linked!
Most Scenarios: Lots of “Medium
Data” that Isn’t Ready for Analytics
Other than in Web and monitoring scenarios, we
typically don’t have all of the data in one place
– In different systems
– Bringing in public datasets
– Requiring access to Twitter APIs etc.
Also, it’s often not in a form where:
– It’s clean and regular – e.g., we may have missing values, spurious
values, etc.
– The features we want to use to make predictions are immediately
available to us
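The "missing values, spurious values" problem above can be sketched as a small filtering step. The record layout, the `temp_c` field, and the plausible range are assumptions chosen for the example; real pipelines usually log or repair bad records instead of silently dropping them.

```python
def clean(records, valid_range=(-40.0, 60.0)):
    """Drop records with missing or out-of-range temperature values."""
    low, high = valid_range
    kept = []
    for rec in records:
        value = rec.get("temp_c")
        if value is None:
            continue  # missing value (absent key or explicit None)
        if not (low <= value <= high):
            continue  # spurious value outside the plausible range
        kept.append(rec)
    return kept

# Hypothetical sensor readings with typical quality problems.
raw = [
    {"sensor": "a", "temp_c": 21.5},
    {"sensor": "b", "temp_c": None},   # missing
    {"sensor": "c", "temp_c": 999.0},  # spurious
    {"sensor": "d"},                   # field absent
    {"sensor": "e", "temp_c": -3.0},
]
cleaned = clean(raw)
print(cleaned)
```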
Data vs Structured Data
Structural relationships are
sometimes important features
[Figure: images, genes, and text pass through "Data + feature extraction, wrangling"]
Goal: raw data → structured data
– Fields, entities, objects, machine learning features
– May be very regular or semi-structured
Ultimately, the goal is data → information → knowledge
Linked Data: Find Patterns in Connectivity
(Clusters, Paths, …)
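Finding a path in linked data can be sketched with a breadth-first search over an adjacency list. The tiny friendship graph and the `shortest_path` helper are illustrative assumptions; graph databases provide this kind of traversal as a built-in query.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search over an adjacency dict; returns a node list or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None  # goal is not reachable from start

# Hypothetical social network: who is connected to whom.
friends = {
    "An": ["Binh", "Chi"],
    "Binh": ["An", "Dung"],
    "Chi": ["An", "Dung"],
    "Dung": ["Binh", "Chi", "Em"],
    "Em": ["Dung"],
}
path = shortest_path(friends, "An", "Em")
print(path)
```

Breadth-first search explores the graph level by level, so the first path that reaches the goal has the fewest hops.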
Knowledge Graphs
Classes, subclasses, instances, and properties
Dynamic Data: Track over Time,
Forecast the Future
Tabular (Relational) Data and Joins /
Lookups (e.g., to Web Services)
[Figure: New York Taxi Data, Reverse Geocode Data, Street View]
SOTA Data Models
RDBMS, NoSQL, NewSQL, Graph DB, Realtime DB, Vector
DB, and GPU DB.
Data Model      Best For                           Examples
RDBMS           Structured data, Transactions      MySQL, PostgreSQL
NoSQL           Scalability, Semi-structured data  MongoDB, Cassandra
NewSQL          High-performance transactions      Google Spanner, CockroachDB
Graph DB        Relationship-heavy data            Neo4j, TigerGraph
Realtime DB     Instant data updates               Firebase, Apache Ignite
Vector DB       AI, Similarity search              Pinecone, FAISS
GPU DB          Big data, Real-time analytics      Kinetica, OmniSci
AI-Driven DB    Automated insights, Optimization   Google BigQuery ML
Multi-Model DB  Handling diverse workloads         ArangoDB, MarkLogic
Traditional Relational Databases
(RDBMS)
Structured and tabular format (rows & columns)
Uses SQL for data manipulation
ACID-compliant (Atomicity, Consistency, Isolation, Durability)
Examples: MySQL, PostgreSQL, Oracle, SQL Server
Best suited for structured data and transactional applications
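The atomicity part of ACID can be demonstrated with Python's built-in `sqlite3` module: a transfer either applies both updates or neither. The `accounts` table, the names, and the `transfer` helper are hypothetical and only serve the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, "alice", "bob", 30)       # within alice's balance
failed = transfer(conn, "alice", "bob", 500)  # violates the CHECK constraint
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

In the failed transfer the credit to bob is executed first and then undone by the rollback, so no money appears out of nowhere.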
NoSQL vs NewSQL Databases
NoSQL
Designed for scalability and flexibility
Four main types:
– Key-Value Stores (Redis, DynamoDB)
– Document Stores (MongoDB, CouchDB)
– Column-Family Stores (Cassandra, HBase)
– Graph Databases (Neo4j, ArangoDB)
Suitable for semi-structured and unstructured data
NewSQL
Combines the consistency of RDBMS with the scalability of NoSQL
ACID compliance with high performance
Examples: Google Spanner, CockroachDB, TiDB
Best for large-scale transactional applications
Graph & Vector Databases
GraphDB
• Optimized for handling relationships and connected data
• Uses nodes and edges to represent entities and relationships
• Examples: Neo4j, ArangoDB, TigerGraph
• Ideal for social networks, fraud detection, recommendation engines
VectorDB
• Stores high-dimensional vector data
• Used in AI, machine learning, recommendation systems
• Examples: Pinecone, FAISS, Weaviate
• Best for similarity search, NLP, and computer vision
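The similarity search that vector databases are built for can be sketched as a brute-force cosine-similarity ranking. The three-dimensional vectors and item names are toy assumptions; production systems use learned embeddings with hundreds of dimensions and approximate indexes instead of scanning every vector.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest(query, index, k=2):
    """Return the names of the k stored vectors most similar to the query."""
    scored = sorted(index.items(),
                    key=lambda item: cosine(query, item[1]),
                    reverse=True)
    return [name for name, _ in scored[:k]]

# Hypothetical embeddings: two animals close together, one vehicle far away.
index = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}
result = nearest([1.0, 0.0, 0.0], index)
print(result)
```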
Realtime & GPU Databases
Realtime DB
• Supports low-latency and high-throughput data processing
• Often used in financial trading, gaming, real-time analytics
• Examples: Firebase, Apache Ignite, SingleStore
• Best for applications requiring instant data access and updates
GPU DB
• Uses GPUs for parallel processing of massive datasets
• Accelerates analytics and AI workloads
• Examples: Kinetica, BlazingDB, OmniSci
• Suitable for real-time big data processing and deep learning
AI-Driven & Multi-Model Databases
AI-Driven DB
• Integrates AI and machine learning for automation and optimization
• Features:
  – Automated indexing and query optimization
  – Predictive analytics and anomaly detection
  – Self-healing and auto-tuning capabilities
• Examples: Google BigQuery ML, Oracle Autonomous Database
• Best for enterprises leveraging AI for smarter decision-making
Multi-Model DB
• Supports multiple data models within a single system
• Benefits:
  – Flexibility to handle diverse data types
  – Reduces data silos and simplifies architecture
• Examples: ArangoDB, MarkLogic, OrientDB
• Best for applications requiring diverse workloads
3. Data Analytics
The process of examining large and varied data sets to
uncover hidden patterns, correlations, market trends,
and customer preferences.
The Goal of Data Analytics:
From Data to “Knowledge” or Action
Definition: the process of examining large and varied data sets to
uncover hidden patterns, correlations, market trends, and
customer preferences.
Pattern detection: Raw data → patterns → partial understanding
– “Show me sales by region by product category”
– “Show me clusters of documents by concept”
Given an observation: Hypothesis → experiment over sample →
significance
– “Behavioral factor F leads to higher risk of outcome O”
– Do statistical test, measure significance vs. null hypothesis
CERBA: Collect, Extrapolate, Recognize,
Build, Apply
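The "do statistical test, measure significance vs. null hypothesis" step can be sketched with a permutation test, one of several possible tests and not necessarily the one intended on the slide. The two groups of outcome scores are made-up numbers, and `permutation_p_value` is a hypothetical helper.

```python
import random

def permutation_p_value(group_a, group_b, trials=10_000, seed=0):
    """Estimate how often a random relabelling gives a mean gap at least
    as large as the observed one (two-sided)."""
    rng = random.Random(seed)
    n_a = len(group_a)

    def gap(a, b):
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = gap(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)  # null hypothesis: group labels do not matter
        if gap(pooled[:n_a], pooled[n_a:]) >= observed:
            hits += 1
    return hits / trials

# Hypothetical outcome scores with and without behavioral factor F.
with_factor = [8, 9, 7, 10, 9, 8]
without_factor = [5, 6, 4, 6, 5, 7]
p = permutation_p_value(with_factor, without_factor)
print(p)
```

A small p-value means the observed gap would rarely arise if factor F had no effect, which is evidence against the null hypothesis. It does not by itself show that F causes the outcome.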
What Does Big Data Analytics
Involve?
Acquisition, access – data may exist without being accessible (C)
Wrangling – data may be in the wrong form (CE)
Integration, representation – data relationships may not be captured (ER)
Cleaning, filtering – data may have variable quality (ER)
Hypothesizing, querying, analyzing, modeling – from data to info (ERB)
Understanding, iterating, exploring – helping build knowledge (A)
And: ethical obligations – need to protect data, follow good statistical
practices, present results in a non-misleading way (CERBA)
Examples: Netflix Movie, Amazon Product, Expedia Hotel
Recommendation, …
Big Data Analytics: From Data to
Action
Data Science / Data Analytics:
Beware Over-Hyped Expectations!
Data science myth:
• We’ll learn everything “bottom up” using fancy statistics and
machine learning
• Basically we “turn the crank” and out pop insights!
Data + algorithms → knowledge
Data science reality:
• We’ll typically rely on human expertise to impose models
over the data, the features, etc.
• Deep learning can do feature selection, but why throw away
what we know?
Data + human insight + algorithms + iteration → information → knowledge
Data Science Application Process
What question are you answering?
What is the right scope of the project?
What data will you use?
What techniques are you going to try?
How will you evaluate your results?
What maintenance will be required?
Before we even get to machine learning, at
least 80-90% of the work at data science companies involves:
• Working with experts to understand the
domain, assumptions, questions, etc.
• Trying to catalog and make sense of the
data sources
• Wrangling, extracting, and integrating the
data
• Cleaning the wrangled data
4. Big Data Stack
Big Data platform: six key imperatives
1. Discover, Explore, and Navigate Big Data Sources
• Federated discovery, search, and navigation.
• Ability to access structured and unstructured data from various sources.
• Metadata management and indexing for efficient retrieval.
2. Extreme Performance – Run Analytics Closer to Data
• Massively parallel processing (MPP) analytic appliances.
• Distributed computing frameworks like Apache Spark.
• Optimized query engines for real-time and batch analytics.
3. Manage and Analyze Unstructured Data
• Hadoop ecosystem, including HDFS, MapReduce, and text analytics.
• Natural Language Processing (NLP) and sentiment analysis.
• Scalable storage solutions for images, videos, and logs.
4. Analyze Data in Motion
• Stream computing technologies such as Apache Kafka and Flink.
• Real-time event processing for IoT and transaction monitoring.
• Low-latency decision-making capabilities.
5. Rich Library of Analytical Functions and Tools
• In-database analytics libraries for machine learning and deep learning.
• Data visualization and reporting tools.
• AI-driven insights and automation.
6. Integrate and Govern All Data Sources
• Data integration platforms ensuring seamless data flow across systems.
• Data quality management, security policies, and lifecycle governance.
• Master Data Management (MDM) to unify enterprise data.
Big Data Stack Components
1. Data Sources
• Structured Data (RDBMS, Data Warehouses)
• Semi-Structured Data (JSON, XML, Logs)
• Unstructured Data (Text, Images, Videos, Social Media)
2. Data Ingestion
• Batch Processing (ETL, Apache Sqoop, Apache Flume)
• Streaming Processing (Apache Kafka, Apache NiFi)
3. Data Storage
• Relational Databases (MySQL, PostgreSQL)
• NoSQL Databases (MongoDB, Cassandra, HBase)
• Data Lakes (Hadoop HDFS, Amazon S3)
4. Data Processing
• Batch Processing (Apache Spark, Hadoop MapReduce)
• Real-Time Processing (Apache Storm, Apache Flink)
5. Data Analytics
• Machine Learning (TensorFlow, Scikit-Learn)
• Business Intelligence (Tableau, Power BI)
• Search & Query (Elasticsearch, Apache Drill)
6. Data Visualization
• Reporting Tools (Tableau, Google Data Studio)
• Dashboards (Grafana, Kibana)
7. Data Security & Governance
• Data Encryption & Access Control
• Compliance & Auditing (GDPR, HIPAA)
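The batch processing model behind Hadoop MapReduce, listed in the stack above, can be imitated in a single process to show its three phases. This is a conceptual sketch in plain Python and does not use the Hadoop API; the function names and the two input lines are assumptions for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)

lines = ["big data is big", "data in motion"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

In a real cluster the map calls run in parallel on the nodes that hold the input blocks, and the shuffle moves each key's values to the node that reduces it.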
Big Data Stack
Target stack for this course
Big Data Tools & Frameworks
Apache Hadoop: Distributed storage and processing framework.
Apache Spark: Fast data processing engine for large-scale data
analytics.
Cloudera Data Platform: Comprehensive data management platform.
Coalesce: Data transformation platform for building data pipelines.
Other Tools:
– NoSQL databases (e.g., MongoDB, Cassandra).
– Data visualization tools (e.g., Tableau, Power BI).
5. Big Data Potential Applications &
Landscape
Potential Applications of Big Data
Healthcare & Medicine
– Predictive Disease Analytics: Early diagnosis of diseases like cancer and heart conditions.
– Medical Image Analysis: AI-powered interpretation of X-rays, MRIs, and CT scans.
– Genomic Data Processing: Personalized medicine and drug discovery.
– Epidemiology & Pandemic Management: Tracking and predicting disease outbreaks (e.g., COVID-19).
– Hospital & Resource Management: Optimizing hospital bed occupancy and medical supply chains.
Education & Research
– Personalized Learning: Adaptive learning systems and AI-powered tutoring.
– Academic Performance Prediction: Identifying at-risk students and improving teaching methods.
– Institutional Decision-Making: Data-driven policymaking for schools and universities.
– Education Market Trends: Analyzing enrollment patterns and future workforce needs.
– Research & Scientific Discovery: Accelerating breakthroughs in various disciplines using data analytics.
Economy & Business
– Market & Consumer Analytics: Predicting trends and customer behavior.
– Financial Risk Management: Fraud detection, credit scoring, and market analysis.
– Supply Chain Optimization: Real-time tracking, demand forecasting, and inventory management.
– Personalized Marketing: Targeted advertising and recommendation systems.
– Stock Market Predictions: Analyzing financial data for investment strategies.
Potential Applications …
Society & Public Services
– Smart Cities: Traffic optimization, waste management, and public safety improvements.
– Crime Prediction & Prevention: Identifying crime patterns and predicting high-risk areas.
– Social Media Analysis: Tracking public sentiment and misinformation detection.
– Disaster Management: Real-time monitoring of natural disasters and emergency response planning.
– Employment & Labor Market Analysis: Predicting job market trends and workforce planning.
National Security & Defense
– Cybersecurity & Threat Intelligence: Identifying cyber threats and anomalies in real-time.
– Military Strategy & Operations: Predictive analytics for tactical planning and logistics.
– Surveillance & Intelligence Gathering: Analyzing satellite imagery and communications data.
– Border Security & Immigration Control: Detecting illegal activities and managing migration patterns.
– Counterterrorism & Crime Prevention: Analyzing global threat networks and suspicious transactions.
Environment & Sustainability
– Climate Change Monitoring: Analyzing temperature trends and carbon emissions.
– Natural Disaster Prediction: Early warning systems for earthquakes, floods, and hurricanes.
– Agriculture & Precision Farming: Optimizing crop yields and resource use.
– Wildlife & Biodiversity Conservation: Tracking endangered species and deforestation patterns.
– Water & Air Quality Management: Monitoring pollution levels and ensuring regulatory compliance.
When to consider Big Data Solution
Data volume is growing rapidly: You’re limited by your current platform or
environment because you can’t process the amount of data that you
want to process.
– A retail business with millions of customer transactions daily.
Need for real-time data processing: when real-time insights are critical for
decision-making.
– Fraud detection in banking or real-time recommendations in e-commerce.
Unstructured or multi-format data: You want to involve new sources of data in
the analytics, but you can’t, because it doesn’t fit into schema-defined rows
and columns without sacrificing fidelity or the richness of the data.
– Social media sentiment analysis or medical image processing.
Performance issues: when existing relational databases struggle with
query speed and performance.
– A financial firm processing massive stock market data streams.
Advanced analytics and AI integration: when machine learning, predictive
analytics, or deep learning is required.
– Personalized marketing campaigns based on user behavior.
Need for scalability and flexibility: when data workloads fluctuate and
require dynamic scaling.
– Cloud-based Big Data platforms for startups and enterprises.
Data-driven decision making: when organizations want to leverage data to
gain a competitive edge.
– Healthcare providers optimizing patient treatment plans using data analytics.
The 2017 Big Data Landscape
The 2024 MAD (ML, AI & Data)
Landscape
https://mad.firstmark.com/
Hype Cycle for Data Management
2022
Hype Cycle for Data Management
2023
Hype Cycle for Data Management
2024
6. Big Data Jobs
Big Data Jobs
Data scientists: collect, analyze, manage, structure and interpret large
volumes of data from a range of sources. Data scientists then use
reporting tools to pinpoint patterns, trends and interrelationships between
the various data sets.
Big data engineers & architects: create the underpinning software
architecture; design, build, and manage the infrastructure and scalable
data management systems that data scientists need to perform their
analysis; outline business objectives and transform them into data-
processing workflows; can be found across industries.
Big data developers: apply their deep understanding of technologies such
as Hadoop and Apache Spark with programming languages such as
Java, Python and Scala to process data. By drawing on deep
proficiencies in functional programming paradigms, they can effectively
ingest data into broader big data platform ecosystems.
Big Data Jobs…
Big data analysts: detect and analyze actionable data, such as hidden
trends and patterns. By fusing these findings with their in-depth
knowledge of the market in which their organizations operate, they can
help leaders formulate informed strategic business decisions.
Big data specialists: interrogate, ingest, analyze and transform complex
sets of data. This ensures the necessary data is made available to the
other team members who use it to uncover actionable insights and
provide recommendations to improve business outcomes.
…
Skills required for Big Data Analytics
Store and process
– Large scale databases
– Software Engineering
– System/network Engineering
Analyse and model
– Reasoning
– Knowledge Representation
– Multimedia Retrieval
– Modelling and Simulation
– Machine Learning
– Information Retrieval
Understand and design
– Decision theory
– Visual analytics
– Perception and Cognition