
1 Introduction

The document provides an overview of Big Data, defining it through the 3Vs: Volume, Variety, and Velocity, and discusses its sources and implications for various sectors. It outlines different data models and technologies used in Big Data analytics, including RDBMS, NoSQL, NewSQL, and specialized databases like Graph and Vector DBs. The document emphasizes the importance of real-time data processing and the growing demand for Big Data skills in the job market.

Uploaded by

Việt Hưng
Copyright
© All Rights Reserved

BigData Techniques and Technologies

Introduction to Big Data

NGUYỄN Ngọc Hoá


Department of Information Systems
VNU University of Engineering and Technology

Hoa.Nguyen@vnu.edu.vn
Outline
1. Definitions

2. SOTA Data Models

3. Data Analytics

4. Big Data Stack

5. Big Data Potential Applications & Landscape

6. Big Data Jobs

2 Big Data @ DIS


1. Definitions

Big Data
• "Big" in Big Data refers to:
– Big size: the primary definition.
– Big complexity rather than big volume: a dataset can be small and still be big data, and not all large datasets are big data.
– Size matters, but so do accessibility, interoperability, and reusability.
• Big Data is commonly defined using the 3Vs:
– Volume, Variety, Velocity

Big Data: A buzzword?
Where Does Big Data Come From?
• Big Data is not a specific application type, but rather a trend, or even a collection of trends, spanning multiple application types.
• Data is growing in multiple ways:
– More data (volume of data)
– More types of data (variety of data)
– Faster ingest of data (velocity of data)
– More accessibility of data (internet, instruments, …)
– Data growth and availability exceed an organization's ability to make intelligent decisions based on it.


Who is generating Big Data

How much data?
• Processes 40 EB a day (2023)
• Search index: 100 EB (2023)
• Performs 8.5B searches/day (2023)
• Crawls 20B web pages a day (2023)
• 19 Hadoop clusters: 600 PB, 40k servers (9/2015)
• 550 PB on 50k+ servers running 15k apps (2024)
• Hadoop: 10K nodes, 150K cores, 150 PB (4/2014)
• 1,000 PB of data in Hive + 4 PB/day (2024)
• S3: 2T objects, 1.1M requests/second (4/2013)
• LHC: ~15 PB a year
• LSST: 6-10 PB a year (~2020)
• SKA: 0.3 – 1.5 EB per year (~2020)
"640K ought to be enough for anybody."


From http://www.umiacs.umd.edu/~jimmylin/
How much data?
• Airbus A350: equipped with ~6,000 sensors, generating ~2.5 TB of data per day.
• Autonomous vehicles: generate 4-6 TB of data per day.
• Smart factories: produce ~500 TB of data per day.
• IoT data is projected to reach 73.1 ZB by 2025. (Source: IDC)
[Figure: pyramid of Management, Planning, Supervision, Control, and Field levels, labelled "Batch – More Compute" at the top and "Stream – More Data" at the bottom]

Volume, Variety, and Velocity
• Aggregate data volumes that used to be measured in petabytes (PB) are now measured in zettabytes (ZB).
– A zettabyte is a trillion gigabytes (GB)
– or a billion terabytes (TB)
• In 2010 we crossed the 1 ZB mark, and by the end of 2025 that number was estimated to reach 175 ZB (source: IDC)
– Google: 100 PB/day processed, 15,000 PB storage
– eBay: 100 TB/day, 90 PB storage
– Baidu: 10-100 TB/day, 2,000 PB storage
– Facebook: 600 TB/day, 300 PB storage
– Spotify: 2.2 TB/day, 100 PB storage
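
The unit relationships above can be checked with a few lines of arithmetic. This is a minimal sketch assuming decimal (SI) units, where each step up is a factor of 1,000; the constant names and the 100 PB/day ingest rate used at the end are illustrative choices, not figures about any particular system.

```python
# Decimal (SI) storage units, expressed in bytes.
GB = 10**9
TB = 10**12
PB = 10**15
EB = 10**18
ZB = 10**21

# A zettabyte is a trillion gigabytes, or a billion terabytes.
gb_per_zb = ZB // GB
tb_per_zb = ZB // TB
print(gb_per_zb)  # 1000000000000
print(tb_per_zb)  # 1000000000

# Days needed to accumulate 1 ZB at a hypothetical ingest rate of 100 PB/day.
days_to_one_zb = ZB / (100 * PB)
print(days_to_one_zb)  # 10000.0
```

At that rate a single pipeline would need roughly 27 years to reach one zettabyte, which gives a sense of how large the global 175 ZB estimate is.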

Volume, Variety, and Velocity
• Different types: a single application can be generating/collecting many types of data
– Relational data (relations/transactions/legacy data)
– Text data (Web)
– Semi-structured data (XML)
– Graph data: social network, semantic web, …
– Streaming data, …
• Different sources:
– User shopping behaviors from Shopee/Lazada/Tiki/Amazon, …
– Product reviews from different provider websites
• To extract knowledge, all these types of data need to be linked together:
– Trying to capture all of the data that pertains to our decision-making process.
– Making sense of unstructured data, such as opinions, or analysing images.

A global view of linked big data


Volume, Variety, and Velocity
• The rate at which data arrives at the enterprise and is processed or well understood.
• In other words: "How long does it take you to do something about it, or even to know it has arrived?"
• Data is generated fast and needs to be processed fast.
• Late decisions → missed opportunities.

• Examples:
– e-Promotions: based on your current location, your purchase history, and what you like → send promotions right now for the store next to you.
– Healthcare monitoring: sensors monitor your activities and body → any abnormal measurement requires an immediate reaction.
– Disaster management and response
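
The healthcare monitoring example can be sketched as a stream processor that reacts to each reading as it arrives instead of waiting for a batch. This is an illustrative sketch only, not a clinical method: the function name `detect_abnormal`, the window size, the threshold of 3 standard deviations, and the sample readings are all assumptions.

```python
from collections import deque
from statistics import mean, pstdev


def detect_abnormal(readings, window=5, threshold=3.0):
    """Yield (index, value) for readings far from the recent rolling mean.

    `readings` may be any iterable, including an unbounded stream.
    """
    recent = deque(maxlen=window)  # only the last `window` values are kept
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu = mean(recent)
            sigma = pstdev(recent)
            # Flag the reading if it deviates by more than `threshold` sigmas.
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield i, value
                continue  # do not let the outlier distort the baseline
        recent.append(value)


# Hypothetical heart-rate readings (beats per minute) with one spike.
heart_rate = [72, 74, 71, 73, 75, 72, 140, 74, 73]
alerts = list(detect_abnormal(heart_rate))
print(alerts)  # [(6, 140)]
```

Because the function is a generator that keeps only a fixed-size window, memory use stays constant no matter how long the stream runs, which is the property that matters for high-velocity data.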

Realtime Analytics/Decision Requirements
• Today, it is possible to use real-time analytics to optimize Like buttons both on a website and on Facebook.
• Facebook uses anonymised data to show the number of times people:
– saw Like buttons,
– clicked Like buttons,
– saw Like stories on Facebook,
– and clicked Like stories to visit a given website.

Extensions: 6Vs of Big Data
• Volume: The amount of data being generated is growing rapidly and becoming increasingly large, making it difficult to store and process using traditional methods.
• Variety: Big data comes in many different formats, including structured, semi-structured, and unstructured data, such as text, images, videos, and sensor data.
• Velocity: Big data is generated and processed at a high speed, requiring real-time processing capabilities.
• Veracity: The quality of big data can be uncertain, making it important to validate and clean the data before using it for analysis.
• Value: Despite its challenges, big data holds the potential to deliver valuable insights and drive business results, making it an important asset for organizations.
• Variability: Refers to how often the data itself changes. Big Data techniques help manage these data drifts, which helps organizations keep their products up to date.
Other V’s
• Visibility/Visualization: after big data has been processed, we need a way to present it in a manner that is readable and accessible.
• Viscosity: describes the latency or lag time in the data relative to the event being described. This can just as easily be understood as an element of Velocity.
• Virality: defined by some users as the rate at which the data spreads; how often it is picked up and repeated by other users or events.
• Volatility: refers to how long data is valid and how long it should be stored. You need to determine at what point data is no longer relevant to the current analysis.
• …
Who Cares About Big Data?
• Government
• Finance, Banking
• Manufacturing
• Education
• Health
• Traffic
• IoT
• …


How to deal with Big Data?
Advice from Jim Gray (Turing Award, 1998):
1. Analysing big data requires scale-out solutions, not scale-up solutions.
2. Move the analysis to the data.
3. Work with scientists to find the most common “20 queries” and make
them fast.
4. Go from “working to working.”
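
Points 1 and 2 can be illustrated with a toy scale-out aggregation: each partition computes a small partial result where its data lives, and only those partial results are combined. The sketch below simulates partitions as in-memory lists; the partition contents and the function names `local_summary` and `combine` are hypothetical, and a real system would run `local_summary` on separate machines.

```python
def local_summary(partition):
    """Runs next to the data: reduce one partition to (count, total)."""
    return len(partition), sum(partition)


def combine(summaries):
    """Runs on the coordinator: merge the small partial results."""
    count = sum(c for c, _ in summaries)
    total = sum(t for _, t in summaries)
    return total / count if count else None


# Hypothetical sensor values spread across three nodes.
partitions = [
    [20.0, 22.0, 21.0],
    [19.0, 23.0],
    [25.0],
]

# Only one (count, total) pair per partition crosses the network,
# not the raw records.
summaries = [local_summary(p) for p in partitions]
global_mean = combine(summaries)
print(summaries)
print(global_mean)
```

Adding a fourth node only adds one more pair to combine, which is why this pattern scales out rather than up.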

20 Common Queries
1. Data retrieval & search
• Find a specific record by ID (e.g., retrieve a user, product, or experiment result).
• Retrieve a list of recent records (e.g., latest 100 transactions, newest publications).
• Full-text search on large text fields (e.g., search scientific papers for a keyword).
• Find records with partial matches (e.g., autocomplete suggestions for search terms).
2. Aggregation & Statistics
• Compute total count (e.g., how many users made purchases this month?).
• Find the sum, average, min, max of a field (e.g., average sensor temperature, highest sales).
• Group by and aggregate (e.g., total sales per region, number of cases per category).
3. Time-Based Analysis
• Find records in a given time range (e.g., transactions from Jan 1–Feb 1).
• Compare trends over time (e.g., sales growth from last year to this year).
• Compute moving averages over time (e.g., rolling 7-day average of website visits).
4. Join & Relationships
• Join two or more datasets (e.g., link customer data with purchase history).
• Find co-occurring items (e.g., products frequently bought together).
• Detect missing or mismatched records (e.g., students registered for a course but missing from attendance logs).
5. Ranking & Sorting
• Find the top N results (e.g., top 10 best-selling books).
• Find anomalies/outliers (e.g., detect unusually high spending in credit card transactions).
7. Geospatial Processing
• Find records within a location radius (e.g., all hospitals within 10 km of a given point).
• Find the nearest neighbors (e.g., closest weather stations to a given location).
8. Machine Learning & Anomaly Detection
• Find similar records (e.g., customers with similar purchasing behavior).
• Detect duplicate records (e.g., merge duplicate user profiles).
• Find sudden spikes or drops (e.g., unusual traffic surge on a website).
• Detect missing expected values (e.g., expected daily report missing for a date).
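
A few of these query shapes can be tried on a tiny in-memory database. The sketch below uses Python's built-in `sqlite3` module; the `customers` and `purchases` tables and all their rows are made up for illustration, and at real scale the same SQL patterns would run on a distributed engine instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE purchases (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    amount REAL,
    day TEXT  -- ISO date, so string comparison matches date order
);
INSERT INTO customers VALUES
    (1, 'An', 'North'), (2, 'Binh', 'South'), (3, 'Chi', 'North');
INSERT INTO purchases VALUES
    (1, 1, 30.0, '2024-01-05'),
    (2, 1, 20.0, '2024-02-10'),
    (3, 2, 45.0, '2024-01-20'),
    (4, 3, 10.0, '2024-01-25');
""")

# Retrieval: find a specific record by ID.
by_id = conn.execute("SELECT name FROM customers WHERE id = ?", (2,)).fetchone()

# Time range: purchases from Jan 1 up to (not including) Feb 1.
january = conn.execute(
    "SELECT COUNT(*) FROM purchases "
    "WHERE day >= '2024-01-01' AND day < '2024-02-01'"
).fetchone()[0]

# Join + group by + aggregate: total sales per region.
sales_by_region = conn.execute("""
    SELECT c.region, SUM(p.amount)
    FROM purchases AS p JOIN customers AS c ON c.id = p.customer_id
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()

# Ranking: top 2 customers by total spend.
top_customers = conn.execute("""
    SELECT c.name, SUM(p.amount) AS total
    FROM purchases AS p JOIN customers AS c ON c.id = p.customer_id
    GROUP BY c.id
    ORDER BY total DESC
    LIMIT 2
""").fetchall()

print(by_id, january, sales_by_region, top_customers)
```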
Why study big data technologies?
• Hot topic in both research and industry
• In high demand in the real world
• A promising future career
– Research and development of big data systems: distributed systems (e.g. Hadoop), visualization tools, data warehouse/lake, OLAP, data integration, data quality control, …
– Big data applications: social marketing, healthcare, …
– Data analysis: to get value out of big data, such as discovering and applying patterns, predictive analysis, business intelligence, privacy and security, …


2. SOTA Data Models

Data Is Driving Everything
• “Big data”
• “Data science”
• “Data lakes”
• “Visual analytics”
• “Deep learning”
• “Statistical analysis”
• “Biomedical informatics”
• “Business analytics”

Lots of trends in pursuit of the same goals!
Discovery, models, decision-making, …
Data Needs to Be Modeled, Cleaned, and Linked!


Most Scenarios: Lots of “Medium Data” that Isn’t Ready for Analytics
• Other than in the Web and in monitoring scenarios, we typically don’t have all of the data in one place:
– It sits in different systems
– It requires bringing in public datasets
– It requires access to Twitter APIs, etc.
• Also, it’s often not in a form where:
– It’s clean and regular: e.g., we may have missing values, spurious values, etc.
– The features we want to use to make predictions are immediately available to us
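
Handling missing and spurious values can be sketched in a few lines. The records, the `parse_age` helper, the 0 to 120 plausibility range, and the choice of median imputation are all assumptions made for illustration; the right cleaning rules depend on the domain.

```python
from statistics import median

# Hypothetical records gathered from different systems.
raw = [
    {"id": 1, "age": "34", "city": " Hanoi "},
    {"id": 2, "age": None, "city": "hanoi"},      # missing value
    {"id": 3, "age": "-5", "city": "Da Nang"},    # spurious value
    {"id": 4, "age": "41", "city": None},
]


def parse_age(value):
    """Return a plausible age as int, or None if missing or spurious."""
    try:
        age = int(value)
    except (TypeError, ValueError):
        return None
    return age if 0 <= age <= 120 else None


ages = [parse_age(r["age"]) for r in raw]
# Impute missing ages with the median of the valid ones (one simple choice).
fill = median(a for a in ages if a is not None)

clean = [
    {
        "id": r["id"],
        "age": age if age is not None else fill,
        # Normalise whitespace and capitalisation so equal cities compare equal.
        "city": (r["city"] or "unknown").strip().title(),
    }
    for r, age in zip(raw, ages)
]
print(clean)
```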

Data vs Structured Data
Structural relationships are sometimes important features
[Figure: images, genes, and text as raw data, combined with feature extraction and wrangling]
• Goal: raw data → structured data
– Fields, entities, objects, machine learning features
– May be very regular or semi-structured
Ultimately, the goal is data → information → knowledge
Linked Data: Find Patterns in Connectivity
(Clusters, Paths, …)

Knowledge Graphs
Classes, subclasses, instances, and properties

Dynamic Data: Track over Time,
Forecast the Future


Tabular (Relational) Data and Joins / Lookups (e.g., to Web Services)
[Figure: New York Taxi Data, Reverse Geocode Data, Street View]


SOTA Data Models
• RDBMS, NoSQL, NewSQL, Graph DB, Realtime DB, Vector DB, GPU DB, AI-Driven DB, and Multi-Model DB.

Data Model       Best For                             Examples
RDBMS            Structured data, Transactions        MySQL, PostgreSQL
NoSQL            Scalability, Semi-structured data    MongoDB, Cassandra
NewSQL           High-performance transactions        Google Spanner, CockroachDB
Graph DB         Relationship-heavy data              Neo4j, TigerGraph
Realtime DB      Instant data updates                 Firebase, Apache Ignite
Vector DB        AI, Similarity search                Pinecone, FAISS
GPU DB           Big data, Real-time analytics        Kinetica, OmniSci
AI-Driven DB     Automated insights, Optimization     Google BigQuery ML
Multi-Model DB   Handling diverse workloads           ArangoDB, MarkLogic


Traditional Relational Databases (RDBMS)
• Structured and tabular format (rows & columns)
• Uses SQL for data manipulation
• ACID-compliant (Atomicity, Consistency, Isolation, Durability)
• Examples: MySQL, PostgreSQL, Oracle, SQL Server
• Best suited for structured data and transactional applications
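
The "A" in ACID can be demonstrated with Python's built-in `sqlite3` module. In this sketch the `accounts` table, the `transfer` helper, and the balances are hypothetical; the point is that when the second update violates a constraint, the first update is rolled back as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # violates CHECK, rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

The destination is credited first on purpose: in the failing transfer the credit has already been applied when the debit is rejected, so the unchanged final balances show that the partial work was undone.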


NoSQL vs NewSQL Databases
NoSQL
• Designed for scalability and flexibility
• Four main types:
– Key-Value Stores (Redis, DynamoDB)
– Document Stores (MongoDB, CouchDB)
– Column-Family Stores (Cassandra, HBase)
– Graph Databases (Neo4j, ArangoDB)
• Suitable for semi-structured and unstructured data
NewSQL
• Combines the consistency of RDBMS with the scalability of NoSQL
• ACID compliance with high performance
• Examples: Google Spanner, CockroachDB, TiDB
• Best for large-scale transactional applications


Graph & Vector Databases
GraphDB
• Optimized for handling relationships and connected data
• Uses nodes and edges to represent entities and relationships
• Examples: Neo4j, ArangoDB, TigerGraph
• Ideal for social networks, fraud detection, recommendation engines
VectorDB
• Stores high-dimensional vector data
• Used in AI, machine learning, recommendation systems
• Examples: Pinecone, FAISS, Weaviate
• Best for similarity search, NLP, and computer vision
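
The core operation behind similarity search can be sketched with a brute-force scan using cosine similarity. This is a toy illustration in plain Python, not the API of any product named above: the `top_k` function and the 3-dimensional embeddings are made up, and production systems typically rely on approximate nearest-neighbour indexes to avoid scoring every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Brute-force nearest neighbours: score every vector, keep the best k."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in index.items()]
    scored.sort(reverse=True)
    return [key for _, key in scored[:k]]


# Hypothetical 3-dimensional embeddings; real ones have hundreds of dimensions.
index = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.2],
    "truck": [0.0, 0.8, 0.3],
}

nearest = top_k([1.0, 0.0, 0.0], index)
print(nearest)  # ['cat', 'kitten']
```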

Realtime & GPU Databases
Realtime DB
• Supports low-latency and high-throughput data processing
• Often used in financial trading, gaming, real-time analytics
• Examples: Firebase, Apache Ignite, SingleStore
• Best for applications requiring instant data access and updates
GPU DB
• Uses GPUs for parallel processing of massive datasets
• Accelerates analytics and AI workloads
• Examples: Kinetica, BlazingDB, OmniSci
• Suitable for real-time big data processing and deep learning


AI-Driven & Multi-Model Databases
AI-Driven DB
• Integrates AI and machine learning for automation and optimization
• Features:
– Automated indexing and query optimization
– Predictive analytics and anomaly detection
– Self-healing and auto-tuning capabilities
• Examples: Google BigQuery ML, Oracle Autonomous Database
• Best for enterprises leveraging AI for smarter decision-making
Multi-Model DB
• Supports multiple data models within a single system
• Benefits:
– Flexibility to handle diverse data types
– Reduces data silos and simplifies architecture
• Examples: ArangoDB, MarkLogic, OrientDB
• Best for applications requiring diverse workloads


3. Data Analytics

The process of examining large and varied data sets to uncover hidden patterns, correlations, market trends, and customer preferences.


The Goal of Data Analytics: From Data to “Knowledge” or Action
• Definition: the process of examining large and varied data sets to uncover hidden patterns, correlations, market trends, and customer preferences.
• Pattern detection: raw data → patterns → partial understanding
– “Show me sales by region by product category”
– “Show me clusters of documents by concept”
• Given an observation: hypothesis → experiment over sample → significance
– “Behavioral factor F leads to higher risk of outcome O”
– Do a statistical test, measure significance vs. the null hypothesis
• CORBA: Collect → Extrapolate → Recognize → Build → Apply
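
The "hypothesis → experiment over sample → significance" step can be sketched with an exact permutation test, one of several possible statistical tests and chosen here only because it needs no external library. The function name and the two sample groups are hypothetical.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact one-sided permutation test on the difference of means.

    Null hypothesis: group labels are irrelevant, so every way of splitting
    the pooled values into groups of the same sizes is equally likely.
    """
    observed = mean(group_a) - mean(group_b)
    pooled = group_a + group_b
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)
    extreme = 0
    count = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = sum_a / n_a - (total - sum_a) / n_b
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            extreme += 1
        count += 1
    return extreme / count


# Hypothetical outcome scores with and without behavioral factor F.
with_factor = [8.1, 7.9, 8.4, 8.0, 8.3]
without_factor = [6.9, 7.2, 7.0, 7.4, 7.1]

p = permutation_p_value(with_factor, without_factor)
print(p)
```

With these samples only the observed split, out of 252 possible ones, produces a difference this large, so the p-value is 1/252 (about 0.004) and the null hypothesis would be rejected at the usual 0.05 level.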
What Does Big Data Analytics Involve?
(Letters in parentheses refer to the CORBA stages: Collect, Extrapolate, Recognize, Build, Apply.)
• Acquisition, access: data may exist without being accessible (C)
• Wrangling: data may be in the wrong form (CE)
• Integration, representation: data relationships may not be captured (ER)
• Cleaning, filtering: data may have variable quality (ER)
• Hypothesizing, querying, analyzing, modeling: from data to info (ERB)
• Understanding, iterating, exploring: helping build knowledge (A)
And ethical obligations: need to protect data, follow good statistical practices, and present results in a non-misleading way (CERBA)
Examples: Netflix movie, Amazon product, and Expedia hotel recommendations, …
Big Data Analytics: From Data to Action


Data Science / Data Analytics: Beware Over-Hyped Expectations!
Data science myth:
• We’ll learn everything “bottom up” using fancy statistics and machine learning
• Basically we “turn the crank” and out pop insights!
Data + algorithms → knowledge
Data science reality:
• We’ll typically rely on human expertise to impose models over the data, the features, etc.
• Deep learning can do feature selection, but why throw away what we know!
Data + human insight + algorithms + iteration → information → knowledge


Data Science Application Process
• What question are you answering?
• What is the right scope of the project?
• What data will you use?
• What techniques are you going to try?
• How will you evaluate your results?
• What maintenance will be required?
Before we even get to machine learning, at least 80-90% of data science companies' work involves:
• Working with experts to understand the domain, assumptions, questions, etc.
• Trying to catalog and make sense of the data sources
• Wrangling, extracting, and integrating the data
• Cleaning the wrangled data
4. Big Data Stack

Big Data platform: six key imperatives
1. Discover, Explore, and Navigate Big Data Sources
• Federated discovery, search, and navigation.
• Ability to access structured and unstructured data from various sources.
• Metadata management and indexing for efficient retrieval.
2. Extreme Performance – Run Analytics Closer to Data
• Massively parallel processing (MPP) analytic appliances.
• Distributed computing frameworks like Apache Spark.
• Optimized query engines for real-time and batch analytics.
3. Manage and Analyze Unstructured Data
• Hadoop ecosystem, including HDFS, MapReduce, and text analytics.
• Natural Language Processing (NLP) and sentiment analysis.
• Scalable storage solutions for images, videos, and logs.
4. Analyze Data in Motion
• Stream computing technologies such as Apache Kafka and Flink.
• Real-time event processing for IoT and transaction monitoring.
• Low-latency decision-making capabilities.
5. Rich Library of Analytical Functions and Tools
• In-database analytics libraries for machine learning and deep learning.
• Data visualization and reporting tools.
• AI-driven insights and automation.
6. Integrate and Govern All Data Sources
• Data integration platforms ensuring seamless data flow across systems.
• Data quality management, security policies, and lifecycle governance.
• Master Data Management (MDM) to unify enterprise data.
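
The MapReduce model mentioned under the third imperative can be illustrated with the classic word-count example. This is a single-process sketch of the programming model only, with hypothetical function names and input; it does not use Hadoop's API, and in a real cluster the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key."""
    return key, sum(values)


# Hypothetical input split into documents (one per mapper in a real cluster).
documents = ["big data big value", "data in motion", "big data stack"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```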
Big Data Stack Components
1. Data Sources
• Structured Data (RDBMS, Data Warehouses)
• Semi-Structured Data (JSON, XML, Logs)
• Unstructured Data (Text, Images, Videos, Social Media)
2. Data Ingestion
• Batch Processing (ETL, Apache Sqoop, Apache Flume)
• Streaming Processing (Apache Kafka, Apache NiFi)
3. Data Storage
• Relational Databases (MySQL, PostgreSQL)
• NoSQL Databases (MongoDB, Cassandra, HBase)
• Data Lakes (Hadoop HDFS, Amazon S3)
4. Data Processing
• Batch Processing (Apache Spark, Hadoop MapReduce)
• Real-Time Processing (Apache Storm, Apache Flink)
5. Data Analytics
• Machine Learning (TensorFlow, Scikit-Learn)
• Business Intelligence (Tableau, Power BI)
• Search & Query (Elasticsearch, Apache Drill)
6. Data Visualization
• Reporting Tools (Tableau, Google Data Studio)
• Dashboards (Grafana, Kibana)
7. Data Security & Governance
• Data Encryption & Access Control
• Compliance & Auditing (GDPR, HIPAA)
Big Data Stack

Target stack for this course


Big Data Tools & Frameworks
• Apache Hadoop: distributed storage and processing framework.
• Apache Spark: fast data processing engine for large-scale data analytics.
• Cloudera Data Platform: comprehensive data management platform.
• Coalesce: data transformation platform for building data pipelines.
• Other tools:
– NoSQL databases (e.g., MongoDB, Cassandra).
– Data visualization tools (e.g., Tableau, Power BI).
5. Big Data Potential Applications & Landscape


Potential Applications of Big Data
• Healthcare & Medicine
– Predictive Disease Analytics: early diagnosis of diseases like cancer and heart conditions.
– Medical Image Analysis: AI-powered interpretation of X-rays, MRIs, and CT scans.
– Genomic Data Processing: personalized medicine and drug discovery.
– Epidemiology & Pandemic Management: tracking and predicting disease outbreaks (e.g., COVID-19).
– Hospital & Resource Management: optimizing hospital bed occupancy and medical supply chains.
• Education & Research
– Personalized Learning: adaptive learning systems and AI-powered tutoring.
– Academic Performance Prediction: identifying at-risk students and improving teaching methods.
– Institutional Decision-Making: data-driven policymaking for schools and universities.
– Education Market Trends: analyzing enrollment patterns and future workforce needs.
– Research & Scientific Discovery: accelerating breakthroughs in various disciplines using data analytics.
• Economy & Business
– Market & Consumer Analytics: predicting trends and customer behavior.
– Financial Risk Management: fraud detection, credit scoring, and market analysis.
– Supply Chain Optimization: real-time tracking, demand forecasting, and inventory management.
– Personalized Marketing: targeted advertising and recommendation systems.
– Stock Market Predictions: analyzing financial data for investment strategies.
Potential Applications …
• Society & Public Services
– Smart Cities: traffic optimization, waste management, and public safety improvements.
– Crime Prediction & Prevention: identifying crime patterns and predicting high-risk areas.
– Social Media Analysis: tracking public sentiment and misinformation detection.
– Disaster Management: real-time monitoring of natural disasters and emergency response planning.
– Employment & Labor Market Analysis: predicting job market trends and workforce planning.
• National Security & Defense
– Cybersecurity & Threat Intelligence: identifying cyber threats and anomalies in real time.
– Military Strategy & Operations: predictive analytics for tactical planning and logistics.
– Surveillance & Intelligence Gathering: analyzing satellite imagery and communications data.
– Border Security & Immigration Control: detecting illegal activities and managing migration patterns.
– Counterterrorism & Crime Prevention: analyzing global threat networks and suspicious transactions.
• Environment & Sustainability
– Climate Change Monitoring: analyzing temperature trends and carbon emissions.
– Natural Disaster Prediction: early warning systems for earthquakes, floods, and hurricanes.
– Agriculture & Precision Farming: optimizing crop yields and resource use.
– Wildlife & Biodiversity Conservation: tracking endangered species and deforestation patterns.
– Water & Air Quality Management: monitoring pollution levels and ensuring regulatory compliance.


When to Consider a Big Data Solution
• Data volume is growing rapidly: you’re limited by your current platform or environment because you can’t process the amount of data that you want to process.
– A retail business with millions of customer transactions daily.
• Need for real-time data processing: when real-time insights are critical for decision-making.
– Fraud detection in banking or real-time recommendations in e-commerce.
• Unstructured or multi-format data: you want to involve new sources of data in the analytics, but you can’t, because it doesn’t fit into schema-defined rows and columns without sacrificing fidelity or the richness of the data.
– Social media sentiment analysis or medical image processing.
• Performance issues: when existing relational databases struggle with query speed and performance.
– A financial firm processing massive stock market data streams.
• Advanced analytics and AI integration: when machine learning, predictive analytics, or deep learning is required.
– Personalized marketing campaigns based on user behavior.
• Need for scalability and flexibility: when data workloads fluctuate and require dynamic scaling.
– Cloud-based Big Data platforms for startups and enterprises.
• Data-driven decision making: when organizations want to leverage data to gain a competitive edge.
– Healthcare providers optimizing patient treatment plans using data analytics.


The 2017 Big Data Landscape

The 2024 MAD (ML, AI & Data) Landscape
https://mad.firstmark.com/

Hype Cycle for Data Management 2022

Hype Cycle for Data Management 2023

Hype Cycle for Data Management 2024
6. Big Data Jobs

Big Data Jobs
• Data scientists: collect, analyze, manage, structure and interpret large volumes of data from a range of sources. Data scientists then use reporting tools to pinpoint patterns, trends and interrelationships between the various data sets.
• Big data engineers & architects: create the underpinning software architecture; design, build, and manage the infrastructure and scalable data management systems that data scientists need to perform their analysis; outline business objectives and transform them into data-processing workflows; can be found across industries.
• Big data developers: apply their deep understanding of technologies such as Hadoop and Apache Spark with programming languages such as Java, Python and Scala to process data. By drawing on deep proficiencies in functional programming paradigms, they can effectively ingest data into broader big data platform ecosystems.
Big Data Jobs…
• Big data analysts: detect and analyze actionable data, such as hidden trends and patterns. By fusing these findings with their in-depth knowledge of the market in which their organizations operate, they can help leaders formulate informed strategic business decisions.
• Big data specialists: interrogate, ingest, analyze and transform complex sets of data. This ensures the necessary data is made available to the other team members who use it to uncover actionable insights and provide recommendations to improve business outcomes.
• …
Skills required for Big Data Analytics
• Store and process
– Large-scale databases
– Software engineering
– System/network engineering
• Analyse and model
– Reasoning
– Knowledge representation
– Multimedia retrieval
– Modelling and simulation
– Machine learning
– Information retrieval
• Understand and design
– Decision theory
– Visual analytics
– Perception and cognition
