Data Engineering
DAT204
Chapter 1
Introduction to Data Engineering
By Dr. Ghazi Al-Naymat
Data Science vs Data Engineering
• Data Science is the field of study that
combines domain expertise, programming
skills, and knowledge of mathematics and
statistics to extract meaningful insights from
data.
• Data engineering deals with a variety of
data formats, storage, data extraction,
and transformation.
Data Science vs Data Engineering
Source: https://e2eml.school/data_science_archetypes.html
Question & Answer
Q: What are the differences between a Data Scientist and a Data Engineer?
A: Data engineers are usually more technical, with strong data warehousing and programming backgrounds. Data scientists tend to be more mathematical, but there is a lot of crossover between the roles, notably in programming, as machine learning models usually require writing small applications and heavy data manipulation.
The DE Position Compared with DS
Data Engineer Responsibilities
• Analyzing and organizing raw data
• Raw data is data in its most basic, unprocessed form
• Building data systems and pipelines
• Data pipelines refer to the design of systems for processing and storing data. These systems
capture, cleanse, transform and route data to destination systems
• Data engineers build data pipelines that enable the organization to collect data points from
millions of users and process the results in near real-time
• Evaluating business needs and objectives
• To make raw data useful to the organization, data engineers must understand business
objectives
• Data engineers should understand business requirements and where data fits into the business
model so they can build a data ecosystem that serves the organization’s needs
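The capture, cleanse, transform, and route steps described above can be sketched as a chain of plain Python generators. This is a minimal illustration, not a real pipeline framework; all record fields and names are invented for the example.

```python
# A minimal capture -> cleanse -> transform -> route pipeline built from
# Python generators. Each stage consumes the previous stage's stream.

def capture(records):
    # Capture: yield raw records from a source (here, an in-memory list).
    yield from records

def cleanse(stream):
    # Cleanse: drop records missing required fields.
    for rec in stream:
        if rec.get("user_id") is not None:
            yield rec

def transform(stream):
    # Transform: normalize field formats.
    for rec in stream:
        rec["email"] = rec.get("email", "").strip().lower()
        yield rec

def route(stream, destination):
    # Route: deliver records to a destination system (here, a list).
    for rec in stream:
        destination.append(rec)

raw = [
    {"user_id": 1, "email": " Alice@Example.COM "},
    {"user_id": None, "email": "bad@example.com"},  # dropped by cleanse
]
warehouse = []
route(transform(cleanse(capture(raw))), warehouse)
print(warehouse)  # one clean, normalized record
```

Because generators are lazy, each record flows through the whole chain one at a time, which is the same idea real pipelines use to process data in near real time without holding everything in memory.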
Data Engineer Responsibilities
• Interpreting trends and patterns
• Performing complex data analysis to find trends and patterns and reporting on the results in
the form of dashboards, reports and data visualizations
• Preparing data for prescriptive and predictive modeling
• Data engineers must ensure the data is complete (no missing values), has been cleansed, and
that rules have been established for outliers (eliminate, ignore, average out, and so on)
• Building algorithms and prototypes
• Data pipelines represent an automated set of actions that extract data from various sources for
analysis and visualization. These processes are powered by algorithms
• Developing analytical tools and programs
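The three outlier rules mentioned above (eliminate, ignore, average out) can be sketched in a few lines of standard-library Python. The cutoff used here, more than 2 standard deviations from the mean, is an assumed rule chosen for the example, not a universal definition of an outlier.

```python
# Sketch of outlier-handling rules: eliminate, ignore, average out.
# Assumed cutoff: values beyond 2 population standard deviations of the mean.
import statistics

def outlier_bounds(values):
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return mean - 2 * sd, mean + 2 * sd

def eliminate(values):
    # Rule 1: drop outliers entirely.
    lo, hi = outlier_bounds(values)
    return [v for v in values if lo <= v <= hi]

def average_out(values):
    # Rule 2: replace each outlier with the mean of the remaining values.
    # (The "ignore" rule would simply leave the list unchanged.)
    lo, hi = outlier_bounds(values)
    kept_mean = statistics.mean(eliminate(values))
    return [v if lo <= v <= hi else kept_mean for v in values]

data = [10, 12, 11, 9, 10, 500]   # 500 is the outlier
print(eliminate(data))            # [10, 12, 11, 9, 10]
print(average_out(data))          # [10, 12, 11, 9, 10, 10.4]
```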
A Bit more on Data Engineering
How do Data Scientists spend their time?
• Cleaning & organizing data - 60%
• Collecting data sets - 19%
• Mining data for patterns - 9%
• Refining algorithms - 4%
• Building training sets - 3%
• Other - 5%
Source: CrowdFlower
Gartner estimates that poor data quality costs the average organization
$13.5 million per year, and yet data governance problems,
which all organizations suffer from, are worsening.
Data Engineering Skills
Coding. Proficiency in coding languages is essential to this role, so
consider taking courses to learn and practice your skills. Common
programming languages include SQL, NoSQL, Python, Java, R, and
Scala.
Source: https://www.coursera.org/articles/what-does-a-data-engineer-do-and-how-do-i-become-one
Why Python?
1. Python is easy and simple: you do not need to be a hardcore programmer to be productive in it.
2. Python is efficient: it performs bulky tasks using fewer lines of code.
3. Python has diverse libraries and frameworks: it offers an abundant ecosystem of both.
4. Python is versatile: it can be applied to almost any software, task, or infrastructure.
5. Python has a vast community: a large community supports Python learners.
6. Python is portable and extensible: code runs on other platforms without significant changes.
7. Python is flexible: developers can choose between object-oriented and scripting styles.
8. Documentation: Python ships with extensive documentation, lessons, and tutorials.
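Point 2 above ("fewer lines of code") is easy to demonstrate. As one hedged example, a word-frequency count, which takes dozens of lines in many languages, fits in a few lines of standard-library Python:

```python
# Word-frequency count in a few lines, using only the standard library.
from collections import Counter

text = "data engineering turns raw data into usable data"
counts = Counter(text.split())
print(counts.most_common(1))  # [('data', 3)]
```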
Data Engineering Skills
Relational and non-relational databases. Databases rank among
the most common solutions for data storage. Data engineers should be
familiar with both relational and non-relational databases and how
they work.
Relational and Non-relational Databases
• A relational database is a collection of data items with pre-defined
relationships between them. The items are organized as a set of
tables with columns and rows; tables hold information
about the objects represented in the database.
• A non-relational database is a database that does not use the tabular
schema of rows and columns. NoSQL stands for "Not only SQL".
Google’s Bigtable is a well-known example of a NoSQL data store built on
top of the Google File System (GFS). Non-relational databases are a natural fit for:
• Documents
• Semistructured data
• Large and unstructured data, e.g., the output of the Internet of Things (IoT), social
networks, and the rise of AI.
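The contrast between the two models can be sketched with the same record stored both ways: relationally with Python's built-in sqlite3 (rows and columns under a pre-defined schema), and as a JSON-style document of the kind a document store such as MongoDB would hold. The table, field names, and values are invented for the example.

```python
# One customer record, stored relationally and as a document.
import json
import sqlite3

# Relational: pre-defined schema, data split across typed columns.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
con.execute("INSERT INTO customers VALUES (1, 'Alice', 'Amman')")
row = con.execute("SELECT name, city FROM customers WHERE id = 1").fetchone()

# Non-relational (document): no fixed tabular schema; nested,
# semistructured fields live inside a single record.
doc = {
    "id": 1,
    "name": "Alice",
    "city": "Amman",
    "orders": [{"sku": "A-100", "qty": 2}],  # nesting has no single-table twin
}
print(row)
print(json.dumps(doc))
```

In the relational version, the nested `orders` list would require a second table and a foreign key; the document version simply embeds it, which is why semistructured data fits NoSQL stores so naturally.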
Data Engineering Skills
ETL (extract, transform, and load) systems. ETL is the process by
which you’ll move data from databases and other sources into a
single repository, like a data warehouse.
Data storage. Not all types of data should be stored the same way,
especially when it comes to big data. As you design data solutions for
a company, you’ll want to know when to use a data lake versus a data
warehouse, for example.
The ETL process
Data Extraction
• Extract Data → Getting Data
• Data is copied or exported from source locations to a staging area.
• The data can come from virtually any structured or unstructured
source—SQL or NoSQL servers, CRM and ERP systems, text and
document files, emails, web pages, and more.
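A minimal extract step, along the lines described above, copies records from heterogeneous sources into a common staging area without transforming them. Here a CSV export and a JSON API payload are simulated as strings, and a plain list stands in for the staging area; all names and values are illustrative.

```python
# Extract: copy raw records from two source formats into one staging area.
import csv
import io
import json

csv_source = "id,amount\n1,9.50\n2,3.25\n"          # e.g. a CRM/SQL export
json_source = '[{"id": 3, "amount": 7.00}]'          # e.g. a web API payload

staging = []
staging.extend(csv.DictReader(io.StringIO(csv_source)))  # rows become dicts
staging.extend(json.loads(json_source))                  # parsed as-is
print(len(staging))  # 3 raw records staged, untransformed
```

Note that the CSV values are still strings ("9.50"); cleaning and casting are deliberately deferred to the transform step.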
Data Transformation
• 80% of data science work is data preparation; 75% of data scientists
find this to be the most boring aspect of the job.
• Raw data is transformed to be useful for analysis and to fit the
schema of the eventual target data warehouse.
• Data engineers bring their data-manipulation skills to a project, for example:
• Filtering, cleansing, de-duplicating, validating, and authenticating the data.
• Performing calculations, translations, or summaries based on the raw data.
• Formatting the data into tables or joined tables to match the schema of the target data
warehouse.
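The transformation bullets above can be sketched on toy records: de-duplicate, validate and cast, compute a derived value, and reshape each record to match an assumed target schema. The field names (`order_id`, `amount_cents`) are invented for the example.

```python
# Transform: de-duplicate, validate, derive, and reshape to a target schema.
raw = [
    {"id": "1", "amount": "9.50", "currency": "usd"},
    {"id": "1", "amount": "9.50", "currency": "usd"},   # duplicate
    {"id": "2", "amount": "oops", "currency": "usd"},   # fails validation
]

seen, transformed = set(), []
for rec in raw:
    if rec["id"] in seen:              # de-duplicate on the record key
        continue
    try:
        amount = float(rec["amount"])  # validate and cast the raw string
    except ValueError:
        continue                       # drop records that fail validation
    seen.add(rec["id"])
    transformed.append({               # format to the target schema
        "order_id": int(rec["id"]),
        "amount_cents": round(amount * 100),  # derived calculation
        "currency": rec["currency"].upper(),
    })
print(transformed)  # one clean record survives
```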
Data Loading
• Store data to the target location.
• Load is one of the more basic and less exciting data engineering job functions because it
is literally just the movement and storage of data.
• Data engineers sometimes swap the load and transform steps (ELT) when dealing
with big data technologies such as Hadoop/Spark, because the extraction process
is much cheaper to run and the processing burden can be spread across multiple machines
(a cluster).
• Data engineers can choose from many data storage options available both in the cloud
and on-premises, including NoSQL databases, relational data stores, data warehouses,
and data lakes.
• Cloud computing → separates storage and computational machines, meaning you
can simply scale down (switch off) expensive machines that are used for processing
data without affecting the stored data.
• A data lake is a centralized repository that allows you to store all your structured and
unstructured data at any scale. You can store your data as-is, without having to
structure it first, and run different types of analytics on it, from dashboards and
visualizations to big data processing, real-time analytics, and machine learning, to
guide better decisions.
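The load step can be sketched by inserting already-transformed records into a warehouse table, with Python's built-in sqlite3 standing in for a real data warehouse. The table and column names are invented; in an ELT variant, the raw text would be loaded first and the casting done by SQL inside the store.

```python
# Load: move transformed records into the target store (sqlite3 as a stand-in).
import sqlite3

transformed = [(1, 950, "USD"), (2, 325, "USD")]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE fact_orders (order_id INTEGER, amount_cents INTEGER, currency TEXT)"
)
con.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", transformed)

# Once loaded, analysis is a query against the warehouse, not the pipeline.
total = con.execute("SELECT SUM(amount_cents) FROM fact_orders").fetchone()[0]
print(total)  # 1275
```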
Data Warehouses vs Data Lakes
Source: https://aws.amazon.com/ar/big-data/datalakes-and-analytics/what-is-a-data-lake/
Data Engineering Skills
Machine learning. While machine learning is more the concern of data
scientists, it can be helpful to have a grasp of the basic concepts to better
understand the needs of data scientists on your team.
Big data tools. Data engineers don’t just work with regular data. They’re
often tasked with managing big data. Tools and technologies are evolving
and vary by company, but some popular ones include Hadoop, MongoDB,
and Kafka.
Cloud computing. You’ll need to understand cloud storage and cloud
computing as companies increasingly trade physical servers for cloud
services. Beginners may consider a course in Amazon Web Services
(AWS) or Google Cloud.
Data security. While some companies might have dedicated data security
teams, many data engineers are still tasked with securely managing and
storing data to protect it from loss or theft.
Data Engineering
Open source tools