Data Engineering
DAT204
Chapter 1
Introduction to Data Engineering
By Dr. Ghazi Al-Naymat
Data Science vs Data Engineering
• Data Science is the field of study that
combines domain expertise, programming
skills, and knowledge of mathematics and
statistics to extract meaningful insights from
data.
• Data engineering deals with a variety of
data formats, storage, data extraction,
and transformation.
Data Science vs Data Engineering
Source: https://e2eml.school/data_science_archetypes.html
Question & Answer
Q: What are the differences between a Data Scientist and a Data Engineer?
A: Data engineers are usually more technical, with strong data warehousing and programming backgrounds. Data scientists tend to be more mathematical, but there is a lot of crossover between the roles, notably in programming, as machine learning models usually require writing small applications and heavy data manipulation.
The DE Position Compared with DS
Data Engineer Responsibilities
• Analyzing and organizing raw data
• Raw data is data in its most basic, unprocessed form
• Building data systems and pipelines
• Data pipelines refer to the design of systems for processing and storing data. These systems
capture, cleanse, transform and route data to destination systems
• Data engineers build data pipelines that enable the organization to collect data points from
millions of users and process the results in near real-time
• Evaluating business needs and objectives
• To make raw data useful to the organization, data engineers must understand business
objectives
• Data engineers should understand business requirements and where data fits into the business
model so they can build a data ecosystem that serves the organization’s needs
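The capture, cleanse, transform, and route steps described above can be sketched as a chain of plain Python generators. This is a minimal illustration, not a real pipeline framework; all record fields and names are invented for the example.

```python
# A minimal capture -> cleanse -> transform -> route pipeline built from
# Python generators. Each stage consumes the previous stage's stream.

def capture(records):
    # Capture: yield raw records from a source (here, an in-memory list).
    yield from records

def cleanse(stream):
    # Cleanse: drop records missing required fields.
    for rec in stream:
        if rec.get("user_id") is not None:
            yield rec

def transform(stream):
    # Transform: normalize field formats.
    for rec in stream:
        rec["email"] = rec.get("email", "").strip().lower()
        yield rec

def route(stream, destination):
    # Route: deliver records to a destination system (here, a list).
    for rec in stream:
        destination.append(rec)

raw = [
    {"user_id": 1, "email": " Alice@Example.COM "},
    {"user_id": None, "email": "bad@example.com"},  # dropped by cleanse
]
warehouse = []
route(transform(cleanse(capture(raw))), warehouse)
print(warehouse)  # one clean, normalized record
```

Because generators are lazy, each record flows through the whole chain one at a time, which is the same idea real pipelines use to process data in near real time without holding everything in memory.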
Data Engineer Responsibilities
• Interpreting trends and patterns
• Performing complex data analysis to find trends and patterns and reporting on the results in
the form of dashboards, reports and data visualizations
• Preparing data for prescriptive and predictive modeling
• Data engineers must ensure the data is complete (no missing values), has been cleansed, and
that rules have been established for outliers (eliminate, ignore, average out, and so on)
• Building algorithms and prototypes
• Data pipelines represent an automated set of actions that extract data from various sources for
analysis and visualization. These processes are powered by algorithms
• Developing analytical tools and programs
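The three outlier rules mentioned above (eliminate, ignore, average out) can be sketched in a few lines of standard-library Python. The cutoff used here, more than 2 standard deviations from the mean, is an assumed rule chosen for the example, not a universal definition of an outlier.

```python
# Sketch of outlier-handling rules: eliminate, ignore, average out.
# Assumed cutoff: values beyond 2 population standard deviations of the mean.
import statistics

def outlier_bounds(values):
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return mean - 2 * sd, mean + 2 * sd

def eliminate(values):
    # Rule 1: drop outliers entirely.
    lo, hi = outlier_bounds(values)
    return [v for v in values if lo <= v <= hi]

def average_out(values):
    # Rule 2: replace each outlier with the mean of the remaining values.
    # (The "ignore" rule would simply leave the list unchanged.)
    lo, hi = outlier_bounds(values)
    kept_mean = statistics.mean(eliminate(values))
    return [v if lo <= v <= hi else kept_mean for v in values]

data = [10, 12, 11, 9, 10, 500]   # 500 is the outlier
print(eliminate(data))            # [10, 12, 11, 9, 10]
print(average_out(data))          # [10, 12, 11, 9, 10, 10.4]
```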
A Bit more on Data Engineering
How do Data Scientists spend their time?
• Cleaning & organizing data - 60%
• Collecting data sets - 19%
• Mining data for patterns - 9%
• Refining algorithms - 4%
• Building training sets - 3%
• Other - 5%
Source: CrowdFlower
Gartner estimates that poor data quality costs the average organization
$13.5 million per year, and yet data governance problems,
which all organizations suffer from, are worsening.
Data Engineering Skills
Coding. Proficiency in coding languages is essential to this role, so
consider taking courses to learn and practice your skills. Common
programming languages include SQL, NoSQL, Python, Java, R, and
Scala.
Source: https://www.coursera.org/articles/what-does-a-data-engineer-do-and-how-do-i-become-one
Why Python?
1. Python is easy and simple: you do not need to be a hardcore programmer to be productive in it.
2. Python is efficient: it performs bulky tasks using fewer lines of code.
3. Python has diverse libraries and frameworks: it offers an abundant ecosystem of both.
4. Python is versatile: it can be applied to almost any software, task, or infrastructure.
5. Python has a vast community: a large community supports Python learners.
6. Python is portable and extensible: code runs on other platforms without significant changes.
7. Python is flexible: developers can choose between object-oriented and scripting styles.
8. Documentation: Python ships with extensive documentation, lessons, and tutorials.
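Point 2 above ("fewer lines of code") is easy to demonstrate. As one hedged example, a word-frequency count, which takes dozens of lines in many languages, fits in a few lines of standard-library Python:

```python
# Word-frequency count in a few lines, using only the standard library.
from collections import Counter

text = "data engineering turns raw data into usable data"
counts = Counter(text.split())
print(counts.most_common(1))  # [('data', 3)]
```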
Data Engineering Skills
Relational and non-relational databases. Databases rank among
the most common solutions for data storage. Data engineers should be
familiar with both relational and non-relational databases and how
they work.
Relational and Non-relational Databases
• A relational database is a collection of data items with pre-defined
relationships between them. The items are organized as a set of
tables with columns and rows; tables hold information
about the objects represented in the database.
• A non-relational database is a database that does not use the tabular
schema of rows and columns. NoSQL stands for "Not only SQL".
Google’s Bigtable is a well-known example of a NoSQL data store built on
top of the Google File System (GFS). Non-relational databases are a natural fit for:
• Documents
• Semistructured data
• Large and unstructured data, e.g., the output of the Internet of Things (IoT), social
networks, and the rise of AI.
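The contrast between the two models can be sketched with the same record stored both ways: relationally with Python's built-in sqlite3 (rows and columns under a pre-defined schema), and as a JSON-style document of the kind a document store such as MongoDB would hold. The table, field names, and values are invented for the example.

```python
# One customer record, stored relationally and as a document.
import json
import sqlite3

# Relational: pre-defined schema, data split across typed columns.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
con.execute("INSERT INTO customers VALUES (1, 'Alice', 'Amman')")
row = con.execute("SELECT name, city FROM customers WHERE id = 1").fetchone()

# Non-relational (document): no fixed tabular schema; nested,
# semistructured fields live inside a single record.
doc = {
    "id": 1,
    "name": "Alice",
    "city": "Amman",
    "orders": [{"sku": "A-100", "qty": 2}],  # nesting has no single-table twin
}
print(row)
print(json.dumps(doc))
```

In the relational version, the nested `orders` list would require a second table and a foreign key; the document version simply embeds it, which is why semistructured data fits NoSQL stores so naturally.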
Data Engineering Skills
ETL (extract, transform, and load) systems. ETL is the process by
which you’ll move data from databases and other sources into a
single repository, like a data warehouse.
Data storage. Not all types of data should be stored the same way,
especially when it comes to big data. As you design data solutions for
a company, you’ll want to know when to use a data lake versus a data
warehouse, for example.
The ETL process
Data Extraction
• Extract Data → Getting Data
• Data is copied or exported from source locations to a staging area.
• The data can come from virtually any structured or unstructured
source—SQL or NoSQL servers, CRM and ERP systems, text and
document files, emails, web pages, and more.
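A minimal extract step, along the lines described above, copies records from heterogeneous sources into a common staging area without transforming them. Here a CSV export and a JSON API payload are simulated as strings, and a plain list stands in for the staging area; all names and values are illustrative.

```python
# Extract: copy raw records from two source formats into one staging area.
import csv
import io
import json

csv_source = "id,amount\n1,9.50\n2,3.25\n"          # e.g. a CRM/SQL export
json_source = '[{"id": 3, "amount": 7.00}]'          # e.g. a web API payload

staging = []
staging.extend(csv.DictReader(io.StringIO(csv_source)))  # rows become dicts
staging.extend(json.loads(json_source))                  # parsed as-is
print(len(staging))  # 3 raw records staged, untransformed
```

Note that the CSV values are still strings ("9.50"); cleaning and casting are deliberately deferred to the transform step.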
Data Transformation
• 80% of data science work is data preparation; 75% of data scientists
find this to be the most boring aspect of the job.
• Raw data is transformed to be useful for analysis and to fit the
schema of the eventual target data warehouse.
• Data engineers bring their data-manipulation skills to a project, for example:
• Filtering, cleansing, de-duplicating, validating, and authenticating the data.
• Performing calculations, translations, or summaries based on the raw data.
• Formatting the data into tables or joined tables to match the schema of the target data
warehouse.
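The transformation bullets above can be sketched on toy records: de-duplicate, validate and cast, compute a derived value, and reshape each record to match an assumed target schema. The field names (`order_id`, `amount_cents`) are invented for the example.

```python
# Transform: de-duplicate, validate, derive, and reshape to a target schema.
raw = [
    {"id": "1", "amount": "9.50", "currency": "usd"},
    {"id": "1", "amount": "9.50", "currency": "usd"},   # duplicate
    {"id": "2", "amount": "oops", "currency": "usd"},   # fails validation
]

seen, transformed = set(), []
for rec in raw:
    if rec["id"] in seen:              # de-duplicate on the record key
        continue
    try:
        amount = float(rec["amount"])  # validate and cast the raw string
    except ValueError:
        continue                       # drop records that fail validation
    seen.add(rec["id"])
    transformed.append({               # format to the target schema
        "order_id": int(rec["id"]),
        "amount_cents": round(amount * 100),  # derived calculation
        "currency": rec["currency"].upper(),
    })
print(transformed)  # one clean record survives
```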
Data Loading
• Store data to the target location.
• Load is one of the more basic and less exciting data engineering job functions because it
is literally just the movement and storage of data.
• Data engineers sometimes swap the load and transform steps (ELT) when dealing
with big data technologies such as Hadoop/Spark, because the extraction process
is much cheaper to run and the processing burden can be spread across multiple machines
(a cluster).
• Data engineers can choose from many data storage options available both in the cloud
and on-premises, including NoSQL databases, relational data stores, data warehouses,
and data lakes.
• Cloud computing → separates storage and computational machines, meaning you
can simply scale down (switch off) expensive machines that are used for processing
data without affecting the stored data.
• A data lake is a centralized repository that allows you to store all your structured and
unstructured data at any scale. You can store your data as-is, without having to
structure it first, and run different types of analytics on it, from dashboards and
visualizations to big data processing, real-time analytics, and machine learning, to
guide better decisions.
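The load step can be sketched by inserting already-transformed records into a warehouse table, with Python's built-in sqlite3 standing in for a real data warehouse. The table and column names are invented; in an ELT variant, the raw text would be loaded first and the casting done by SQL inside the store.

```python
# Load: move transformed records into the target store (sqlite3 as a stand-in).
import sqlite3

transformed = [(1, 950, "USD"), (2, 325, "USD")]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE fact_orders (order_id INTEGER, amount_cents INTEGER, currency TEXT)"
)
con.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", transformed)

# Once loaded, analysis is a query against the warehouse, not the pipeline.
total = con.execute("SELECT SUM(amount_cents) FROM fact_orders").fetchone()[0]
print(total)  # 1275
```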
Data Warehouses vs Data Lakes
Source: https://aws.amazon.com/ar/big-data/datalakes-and-analytics/what-is-a-data-lake/
Data Engineering Skills
Machine learning. While machine learning is more the concern of data
scientists, it can be helpful to have a grasp of the basic concepts to better
understand the needs of data scientists on your team.
Big data tools. Data engineers don’t just work with regular data. They’re
often tasked with managing big data. Tools and technologies are evolving
and vary by company, but some popular ones include Hadoop, MongoDB,
and Kafka.
Cloud computing. You’ll need to understand cloud storage and cloud
computing as companies increasingly trade physical servers for cloud
services. Beginners may consider a course in Amazon Web Services
(AWS) or Google Cloud.
Data security. While some companies might have dedicated data security
teams, many data engineers are still tasked with securely managing and
storing data to protect it from loss or theft.
Data Engineering
Open source tools