Big data Introduction
Real cases and facts: Big data Tsunami !!!
Big data use case
Big Data & industries
Big Data Vs
Exercises
MUST, FSB, Anis Ben Aicha 1
Jeff Reed (2017), Data Analytics: Applicable Data Analysis to Advance Any Business Using
the Power of Data Driven Analytics
Real cases and facts !!
Every two days we create as much information as we did from the beginning of time until 2003
Over 90% of all data in the world was created in the past 2 years
Amount of digital information in 2020 = 40 zettabytes (1 zettabyte = 10^21 bytes, roughly 2^70 bytes)
The amount of data doubles every 1.2 years
Every minute: 204 million emails, 1.8 million Facebook likes
Google processes 40,000 search queries per second, about 3.5 × 10^9 per day
YouTube: 100 hours of video are uploaded per minute
Data created in one day: burned onto DVDs, the stack would reach the moon
Largest volume of data: AT&T, 312 terabytes
1,570 new websites per minute
Companies monitoring Twitter for sentiment analysis: 12 terabytes per day
More than 50 × 10^9 connected devices
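The orders of magnitude quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the constant and function names are assumptions of this example, and the input figures are the ones quoted on the slide.

```python
# Back-of-the-envelope checks for the figures quoted above.
# All names are illustrative; the inputs come from the slide itself.

SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds in a day
ZETTABYTE = 10**21               # 1 ZB in bytes (decimal definition)


def queries_per_day(queries_per_second: int) -> int:
    """Convert a per-second query rate into a per-day count."""
    return queries_per_second * SECONDS_PER_DAY


# Google: 40,000 queries per second, expressed per day.
google_per_day = queries_per_day(40_000)

# How close is the binary value 2^70 to the decimal zettabyte 10^21?
binary_to_decimal_ratio = 2**70 / ZETTABYTE

# 40 zettabytes written out in bytes.
digital_universe_2020 = 40 * ZETTABYTE

print(f"{google_per_day:,} queries per day")
print(f"2^70 / 10^21 = {binary_to_decimal_ratio:.3f}")
print(f"40 ZB = {digital_universe_2020:.1e} bytes")
```

The per-day figure comes out slightly under 3.5 × 10^9, and 2^70 is about 18% larger than 10^21, so the two byte counts on the slide are close but not identical.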
Data tsunami
We are witnessing a tsunami of data:
- Huge volumes
- Data of different types and formats
- New data arriving at increasing speeds
The challenges:
- Capturing, transporting, and moving the data
- Managing the data, the hardware involved, and the software
- Processing: managing and programming to provide insight into the data
- Storing: safeguarding and securing
Big Data examples
• Science
  • Astronomy
  • Atmospheric science
  • Genomics
  • Biogeochemical
  • Biological
• Social
  • Social networks
  • Social data
  • Medical records
• Commercial
  • Web / event / database logs
  • Sensor networks
  • Internet text and documents
  • Internet search indexing
  • Photographic archives
  • Video / audio archives
• Large scale eCommerce
• Government
  • Regular government business and commerce needs
  • Military and homeland security surveillance
• ……
Big Data examples: use case (Financial )
• Problem:
Manage several petabytes of data, growing at 40-100% per year, under increasing pressure to prevent fraud and complaints to regulators
• How big data analytics can help:
Fraud detection
Credit issuance
Risk management
360° view of the customer
Big Data examples: use case (Financial )
• Problem (Visa card fraud)
Credit card fraud costs a lot of money every year
Fraud schemes are constantly changing
Understanding a fraud pattern months after the fact is only partially helpful
Fraud detection models need to evolve faster
• If only Visa could …
Reinvent how to detect fraud patterns
Stop new fraud patterns before they can rack up significant losses
• Solution
Revolutionize the speed of detection
Visa loaded two years of test records, or 73 billion transactions, amounting to 36 terabytes of data, into Hadoop. The processing time fell from one month with traditional methods to a mere 13 minutes.
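To put the Visa figures in perspective, the short sketch below derives the average record size and the speed-up factor. It is a rough illustration only: the names are this example's own, terabytes are taken as decimal (10^12 bytes), and "one month" is assumed to mean 30 days.

```python
# Rough arithmetic on the Visa figures quoted above.
# Assumptions (not from the slide): decimal terabytes, a 30-day month.

TRANSACTIONS = 73 * 10**9        # 73 billion test records
DATA_BYTES = 36 * 10**12         # 36 terabytes
MINUTES_BEFORE = 30 * 24 * 60    # one month of processing, in minutes
MINUTES_AFTER = 13               # processing time on Hadoop


def bytes_per_record(total_bytes: int, records: int) -> float:
    """Average storage footprint of one transaction record."""
    return total_bytes / records


def speedup(before_minutes: float, after_minutes: float) -> float:
    """How many times faster the new processing time is."""
    return before_minutes / after_minutes


avg_record_size = bytes_per_record(DATA_BYTES, TRANSACTIONS)
gain = speedup(MINUTES_BEFORE, MINUTES_AFTER)

print(f"about {avg_record_size:.0f} bytes per transaction")
print(f"about {gain:.0f} times faster")
```

Under these assumptions each record weighs roughly half a kilobyte and the job runs more than three thousand times faster.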
Big Data examples: use case (Healthcare )
• Problem:
Vast quantities of real-time information are starting to come from wireless
monitoring devices that postoperative patients and those with chronic diseases
are wearing at home and in their daily lives.
Example: The U.S. produces 1.2 billion clinical care documents each year.
These documents contain information about a patient’s medical history,
doctor’s visits, hospital visits, previous treatments, procedures, test results and
prescription medications.
• How big data analytics can help:
Epidemic early warning
Intensive Care Unit and remote monitoring
A Complete Picture of Patients for Effective Care
An Accurate Patient Profile for Correct Care
A Growing Data Laboratory for Precise and Practice-Based Care
The Data Is In: 3 Ways Analytics Will Improve Healthcare
http://dataconomy.com/the-data-is-in-3-ways-analytics-will-improve-healthcare
Big Data examples: use case (Telecommunications)
• Problem:
Legacy systems used to gain insights from internally generated data face high storage costs, long data loading times, and long administration and processing times…
• How big data analytics can help:
Combat fraud
Churn prediction
Geomapping / marketing
Network monitoring
Big Data examples: use case (transportation)
• Problem:
Traffic congestion has increased worldwide as a result of urbanization and population growth, reducing the efficiency of transportation infrastructure and increasing travel time and fuel consumption.
• How big data analytics can help:
Urban planning & monitoring
Real-time analysis of weather and traffic congestion data streams to identify traffic patterns and reduce transportation costs.
Big Data examples: use case (Retail & social media)
• Problem:
Retailers want to use “big data” to predict trends, prepare for demand, pinpoint customers, optimize pricing & promotions, and monitor real-time analytics & results by combining huge amounts of data from web browsing patterns, social media, industry forecasts, existing customer records, etc.
• How big data analytics can help:
Access social media to gain insight
Federate data between Big Data and RDBMSs
Apply graph analysis to the available data
Work to understand demand and engage customers
The Impact of Big Data on The Retail Sector: Examples And Use-Cases
https://www.datapine.com/blog/big-data-in-retail-examples/
Big Data examples: use case (Retail & social media)
• Path analysis
• Connectivity analysis
• Community analysis
• Centrality analysis
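The four kinds of graph analysis listed above can be illustrated on a toy social graph. This is a minimal sketch in plain Python, not a production approach: the graph and its node names are hypothetical, and connected components are used here only as a crude stand-in for community analysis, which normally calls for a dedicated community-detection algorithm.

```python
from collections import deque

# Toy undirected "who is connected to whom" graph; names are hypothetical.
GRAPH = {
    "ana": {"bob", "cy"},
    "bob": {"ana", "cy"},
    "cy": {"ana", "bob", "dee"},
    "dee": {"cy"},
    "eve": {"fay"},
    "fay": {"eve"},
}


def shortest_path(graph, start, goal):
    """Path analysis: breadth-first search for a shortest path."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in sorted(graph[node]):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None  # the two nodes are not connected


def connected_components(graph):
    """Connectivity (and crude community) analysis: groups of mutually reachable nodes."""
    seen, components = set(), []
    for node in sorted(graph):
        if node in seen:
            continue
        component, stack = set(), [node]
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(graph[current] - component)
        seen |= component
        components.append(component)
    return components


def degree_centrality(graph):
    """Centrality analysis: fraction of the other nodes each node is linked to."""
    others = len(graph) - 1
    return {node: len(neighbours) / others for node, neighbours in graph.items()}


path = shortest_path(GRAPH, "ana", "dee")
components = connected_components(GRAPH)
centrality = degree_centrality(GRAPH)
most_central = max(centrality, key=centrality.get)
print(path, components, most_central)
```

In a retail setting the nodes could be customers and the edges interactions, so the most central node hints at an influential customer.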
Big Data & industries
• Problem:
The world of production will become more and more networked until everything is interlinked with everything else. The complexity of production and supplier networks has grown enormously. Previously, networks and processes were limited to one factory, but the boundaries of individual factories will most likely disappear in favor of interconnecting multiple factories or even geographical regions.
• How big data analytics can help:
The Internet of Things (IoT)
Industry 4.0
Big Data & industries
• Fourth industrial revolution
Industry 1.0: Water/steam power
Industry 2.0: Electric power
Industry 3.0: Computing power
Industry 4.0: Internet of Things (IoT) power
Big Data & industries
• The Eras of Data
0 Flat files
1 Relational databases (RDBMS) - 1970s - OLTP (Online Transaction Processing)
2 Data warehouses - 1990s - OLAP (Online Analytical Processing) or DSS (Decision Support Systems) workloads
3 Big Data - 2000s - Batch, with a movement towards real-time
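The difference between the OLTP and OLAP workloads named above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the `sales` table, its columns, and the helper names are hypothetical, and a real data warehouse would use a dedicated analytical engine rather than SQLite.

```python
import sqlite3

# In-memory database with a hypothetical sales table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)


def record_sale(conn, region, amount):
    """OLTP-style work: one short transaction that writes a single row."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO sales (region, amount) VALUES (?, ?)",
            (region, amount),
        )


def revenue_by_region(conn):
    """OLAP-style work: a read-only aggregate that scans many rows."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    )
    return dict(rows.fetchall())


# Many small writes (transactional side)...
for region, amount in [("north", 10.0), ("south", 5.0), ("north", 2.5)]:
    record_sale(conn, region, amount)

# ...followed by one analytical question over all of them.
totals = revenue_by_region(conn)
print(totals)
```

The same table serves both roles here only because the example is tiny; the eras listed above exist precisely because the two access patterns stop coexisting comfortably at scale.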
Big Data & industries
• Different types of data
• Each of them requires different tools and techniques.
• The main categories of data:
• Structured
• Semi-Structured
• Unstructured
• Natural language
• Machine-generated
• Graph-based
• Audio, video, and image
• Streaming
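Why each category calls for different tools becomes clearer when the same kind of information is read in three shapes. The sketch below uses only the Python standard library; the sample records and field names are invented for illustration.

```python
import csv
import io
import json
import re

# Structured: fixed columns known in advance (hypothetical sample).
csv_text = "id,name,balance\n1,Amal,250.0\n2,Karim,90.5\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: self-describing, fields may vary from record to record.
json_text = '{"user": "amal", "tags": ["iot", "health"], "device": {"id": 7}}'
record = json.loads(json_text)

# Unstructured (natural language): no schema, so features must be extracted.
review = "Great phone, great battery. Delivery was slow."
words = re.findall(r"[a-z]+", review.lower())
word_counts = {word: words.count(word) for word in set(words)}

print(rows[1]["name"])         # column access by name
print(record["device"]["id"])  # nested field access
print(word_counts["great"])    # a feature derived from raw text
```

The structured rows can be queried directly, the JSON record needs navigation through nested keys, and the free text yields nothing until some processing step turns it into countable features.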
How ???
Figure: big data platform
Figure: big data architecture
Big Data Vs
4 classic dimensions of big data
V1: Volume
• Big data's main attribute is its huge volume, collected from several sources.
• Data is collected from diverse sources: business transactions, social media, sensors, browsing history, etc.
• Data is often measured in gigabytes or terabytes. However, many analyses indicate that the total amount of big data generated to date is measured in zettabytes: an enormous amount of data that is accessible for company research and analysis.
• Data is expanding dramatically with each new day: every minute, data worth millions of TBs is generated globally through Facebook, tweets, instant messaging, emails, mobile usage, product evaluations, etc. Hundreds of new Twitter accounts are established every minute, tens of thousands of apps are downloaded, and thousands of fresh tweets and advertisements are published. Every two years, the quantity of big data generated globally will double.
• Traditional database technology cannot meet the demand for effective data management, including storage and analysis, because the volume of data keeps growing so rapidly.
• Wide-scale adoption of modern tools like Hadoop and MongoDB is crucial right now. These tools rely on distributed systems to make it easier to store and analyze this massive amount of big data across several databases.
• The modern era now has a wider range of opportunities thanks to the information explosion.
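The idea behind distributed processing tools such as Hadoop can be sketched as a map, shuffle, and reduce pipeline. The code below is a single-machine illustration in plain Python and does not use the Hadoop API: the function names and sample chunks are assumptions, and each chunk stands in for a block of data that a real cluster would process on a separate node.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the values of each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


# Each string plays the role of a data block stored on a different node.
chunks = ["big data big volume", "data velocity", "big variety"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```

Because every chunk is mapped independently, the map step can run in parallel on as many machines as there are chunks, which is what lets this style of processing scale with volume.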
V2: Variety
• Big data is collected and created in various formats and from various sources. It includes structured data as well as unstructured data like text, multimedia, social media, business reports, etc.
• Structured data: traditional data management and analysis techniques may be used to store and analyze structured data, such as bank records, demographic data, inventory databases, company data, and product data streams.
• Unstructured data: collected information such as photos, tweets or Facebook status updates, instant messenger discussions, blogs, uploaded videos, voice recordings, and sensor data. There is no clear pattern in these kinds of data. Unstructured data frequently reflects human ideas, sentiments, and emotions that are sometimes difficult to articulate in precise terms.
• One of the main objectives of big data is to collect all this unstructured data and analyze it using the appropriate technology.
• Variety of data helps to get insights from different sets of samples, users, and demographics.
It helps to bring different perspectives to the same information.
It also allows analyzing and understanding the impact of different forms and sources of data collection from a ‘larger picture’ point of view.
V3: Velocity
• Speed is one of the key drivers of business success. Fast turnaround is one of the prerequisites for surviving fierce competition. Expectations of quick results and quick deliverables create considerable pressure.
• In these situations, it becomes essential to quickly collect and analyze huge amounts of heterogeneous data in order to make accurate decisions.
• Low velocity, even of high-quality data, may hinder a business's decision making.
• Velocity is the speed or frequency at which data is collected in various forms and from different sources for processing.
• It ranges from batch updates, to periodic updates, to real-time flow of the data.
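The contrast between batch and real-time processing can be shown on a single statistic. In this minimal sketch, with hypothetical function names and sample readings, the batch version waits for the whole data set, while the streaming version updates its answer as each value arrives and never stores the history.

```python
from typing import Iterable, Iterator, List


def batch_mean(readings: List[float]) -> float:
    """Batch: the full data set must be available before computing."""
    return sum(readings) / len(readings)


def streaming_mean(readings: Iterable[float]) -> Iterator[float]:
    """Real-time: yield an updated mean after every incoming value."""
    count = 0
    mean = 0.0
    for value in readings:
        count += 1
        mean += (value - mean) / count  # incremental update, constant memory
        yield mean


sensor_values = [10.0, 20.0, 30.0, 40.0]  # hypothetical sensor readings

final_answer = batch_mean(sensor_values)
running_answers = list(streaming_mean(sensor_values))
print(final_answer, running_answers)
```

Both approaches agree once all the data has been seen, but only the streaming version has a usable answer after the first reading, which is the point of high-velocity processing.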
V4: Veracity
• It is very likely that vast amounts of data include some ambiguity.
• Big data has to be filtered for clean and pertinent information if we want to provide the company with insights that will help it grow. The data used as input should be properly prepared, conformed, verified, and made consistent in order to support reliable judgments.
• Causes: there are several causes of data contamination, including incorrect references or associations, waste data, fake data, data entry mistakes or typos (primarily in structured data), etc.
• In automated data collection, analysis, report generation, and decision making, it is essential to have a foolproof system in place to avoid any lapses.
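A small validation step illustrates what preparing and verifying input data can look like in practice. The sketch below is only one possible approach: the records, field names, and rules (numeric non-negative amount, real calendar date, no repeated id) are assumptions chosen to mirror the contamination causes listed above.

```python
from datetime import datetime

# Hypothetical raw input containing typical contamination.
RAW = [
    {"id": "1", "amount": "120.50", "date": "2023-01-15"},
    {"id": "2", "amount": "abc", "date": "2023-01-16"},     # typo in amount
    {"id": "1", "amount": "120.50", "date": "2023-01-15"},  # duplicate record
    {"id": "3", "amount": "-40", "date": "2023-02-30"},     # impossible date
    {"id": "4", "amount": "75", "date": "2023-03-01"},
]


def clean(records):
    """Split records into (valid, rejected) using type, range, and duplicate checks."""
    valid, rejected, seen_ids = [], [], set()
    for rec in records:
        try:
            amount = float(rec["amount"])
            date = datetime.strptime(rec["date"], "%Y-%m-%d").date()
        except ValueError:
            rejected.append(rec)  # unparseable number or non-existent date
            continue
        if amount < 0 or rec["id"] in seen_ids:
            rejected.append(rec)  # out-of-range value or repeated id
            continue
        seen_ids.add(rec["id"])
        valid.append({"id": rec["id"], "amount": amount, "date": date})
    return valid, rejected


valid, rejected = clean(RAW)
print(len(valid), "valid,", len(rejected), "rejected")
```

Keeping the rejected records, rather than silently dropping them, makes it possible to audit how much of the input was untrustworthy.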
More Vs
• Volume - how much data is there?
• Velocity - how quickly is the data being created, moved, or accessed?
• Variety - how many different types of sources are there?
• Veracity - can we trust the data?
• Validity - is the data accurate and correct?
• Viability - is the data relevant to the use case at hand?
• Volatility - how often does the data change?
• Vulnerability - can we keep the data secure?
• Visualization - how can the data be presented to the user?
• Value - can this data produce a meaningful return on investment?
Understanding the Many V’s of Healthcare Big Data Analytics
https://healthitanalytics.com/news/understanding-the-many-vs-of-healthcare-big-data-analytics
Exercises
Exercise 1:
Analyze the following use cases with respect to the four Vs:
• Case 1: Facebook
• Case 2: Skype
• Case 3: Fraud detection in banking transactions
• Case 4: Jumia
Exercise 2:
- Problem statement: Health organizations, such as the World Health Organization (WHO) and the Centers for Disease Control and Prevention (CDC), need to monitor and predict disease outbreaks to take timely preventive actions. Traditional methods of disease surveillance may not provide real-time insights.
1- What constraints must a big data solution face?
2- Propose an architecture for a big data solution.
3- What are the expected benefits?
Annexe A: Byte multiples
Annexe B: OLTP Vs OLAP
https://bugssiufam.wixsite.com/bugados/single-post/2016/08/18/OLTPOnline-Transaction-Processing-e-OLAPOnline-Analytical-Processing
Annexe: Data professions
Data engineer profile (skills)
Networking (infrastructure, administration, security, …)
System administration
Programming languages (Python, Java, Scala, etc.)
Scripting languages (Bash, shell scripting)
Database technologies (SQL, NoSQL, data warehousing)
Cloud computing platforms (AWS, Azure, GCP)
Big data technologies (Hadoop, Spark, Kafka)
Data modeling and ETL (Extract, Transform, Load) tools
Problem-solving and analytical skills
Data engineer responsibilities
Designing and building data pipelines
Developing and maintaining data storage solutions
Data cleaning and preparation
Building data processing tools and scripts
Monitoring and performance optimization
Data scientist profile (skills)
Programming languages (Python, R, SQL)
Statistics and probability
Machine learning algorithms and libraries (e.g., TensorFlow, Scikit-learn)
Data visualization tools (e.g., Tableau, Power BI)
Database technologies (SQL, NoSQL)
Cloud computing platforms (AWS, Azure, GCP)
Strong analytical and problem-solving skills
Excellent communication and presentation skills
Curiosity and passion for data
Creativity and critical thinking
Team player with strong collaboration skills
Data scientist Responsibilities
Formulating data-driven questions and hypotheses
Data acquisition and wrangling
Exploratory data analysis (EDA)
Modeling and machine learning
Data visualization and storytelling
Evaluation and interpretation
Collaboration and communication
Machine learning engineer profile (skills)
Programming languages (Python, Java, C++, etc.)
Machine learning libraries and frameworks (TensorFlow, PyTorch, etc.)
Deep learning expertise for complex models
Cloud computing platforms (AWS, Azure, GCP)
Software engineering concepts and principles
Data engineering tools and pipelines
Version control systems (Git)
DevOps, MLOps
Machine learning engineer responsibilities
Deployment and monitoring
Software engineering and automation
Data engineering and infrastructure