CREATE TABLE Table1 (
A INT
);
-- Insert the data
INSERT INTO Table1 (A) VALUES
(1),
(1),
(1),
(1),
(2),
(2),
(NULL);
SELECT * FROM Table1;
CREATE TABLE Table2 (
B INT
);
-- Insert the data
INSERT INTO Table2 (B) VALUES
(1),
(1),
(2),
(2),
(3),
(4),
(NULL),
(NULL);
SELECT * FROM Table2;
-- LEFT JOIN: every Table1 row is kept; NULL never equals NULL, so the NULL row gets no match
SELECT a.*, b.*
FROM Table1 AS a
LEFT JOIN Table2 AS b
ON a.A = b.B;
-- LEFT JOIN the other way: every Table2 row is kept, including 3, 4, and both NULLs
SELECT a.*, b.*
FROM Table2 AS a
LEFT JOIN Table1 AS b
ON a.B = b.A;
-- FULL OUTER JOIN: unmatched rows from both sides survive (not supported by MySQL)
SELECT a.*, b.*
FROM Table1 AS a
FULL OUTER JOIN Table2 AS b
ON a.A = b.B;
-- INNER JOIN: only matching pairs; duplicates multiply (four 1s x two 1s = 8 rows for value 1)
SELECT a.*, b.*
FROM Table1 AS a
INNER JOIN Table2 AS b
ON a.A = b.B;
→ 1. Batch vs Streaming
• Batch gives cheaper compute, streaming gives fresher dashboards; pick the wrong one
and costs or SLAs explode.
• Typical interview Q: “How would late-arriving events hurt a daily aggregation?”
• How to practice: build a Spark batch job that writes to S3, then re-implement it
with Kafka + Flink; compare cost and latency. A toy version of the late-event
problem is sketched below.
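To make the interview question concrete, here is a toy illustration in plain Python (timestamps and amounts are invented): the daily batch has already run when a late event finally lands, so the day’s total comes up short.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical events: event_time is when it happened, arrived is when we saw it.
events = [
    {"event_time": "2024-01-01T23:58:00", "arrived": "2024-01-01T23:59:00", "amount": 10},
    {"event_time": "2024-01-01T23:59:00", "arrived": "2024-01-02T01:30:00", "amount": 5},  # late
]

# The daily batch for 2024-01-01 ran shortly after midnight.
batch_ran_at = datetime.fromisoformat("2024-01-02T00:05:00")

daily_totals = defaultdict(int)
for e in events:
    if datetime.fromisoformat(e["arrived"]) <= batch_ran_at:  # only what had arrived
        daily_totals[e["event_time"][:10]] += e["amount"]

print(dict(daily_totals))  # {'2024-01-01': 10} -- the late 5 never made it in
```

Re-running the partition fixes the number but doubles the compute; that trade-off is exactly what the question is probing.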
→ 2. Columnar Formats (Parquet, ORC)
• Column pruning + compression can cut bytes scanned by roughly 90%.
• Pitfall: writing tiny Parquet files kills performance (the “small-file problem”).
• How to practice: run `EXPLAIN ANALYZE` on the same SQL over CSV and Parquet;
watch scan time and I/O stats. A quick local comparison is sketched below.
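If a database isn’t handy, a rough local comparison with pandas (assuming pandas and pyarrow are installed) already shows both effects: compression shrinks the file, and column pruning lets a reader touch only the columns it needs.

```python
import os
import pandas as pd

# Synthetic table: one numeric column plus a padded string column.
df = pd.DataFrame({
    "user_id": range(1_000_000),
    "amount": 1.0,
    "note": "x" * 20,
})
df.to_csv("events.csv", index=False)
df.to_parquet("events.parquet")  # columnar layout + compression

print("csv bytes:    ", os.path.getsize("events.csv"))
print("parquet bytes:", os.path.getsize("events.parquet"))

# Column pruning: Parquet can deserialize just 'amount'; a CSV reader still
# has to scan every byte of every row to find that one column.
amounts = pd.read_parquet("events.parquet", columns=["amount"])
```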
→ 3. Partitioning & Clustering
• Scanners skip whole folders instead of filtering individual rows.
• Pitfall: too many partitions = millions of tiny tasks.
• How to practice: load a year of logs, add a `dt=YYYY-MM-DD` partition, then bucket
by `user_id`; observe query fan-out. See the PySpark sketch below.
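One possible PySpark shape for that exercise (paths, column names, and the bucket count are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-demo").getOrCreate()
logs = spark.read.json("s3://my-bucket/raw-logs/")  # assumed to carry dt and user_id

# One dt=YYYY-MM-DD folder per day: a filter on dt prunes whole directories.
logs.write.mode("overwrite").partitionBy("dt").parquet("s3://my-bucket/logs_by_day/")

# Bucketing by user_id requires a metastore table (saveAsTable), not a bare path.
(logs.write
     .mode("overwrite")
     .partitionBy("dt")
     .bucketBy(32, "user_id")
     .sortBy("user_id")
     .saveAsTable("logs_bucketed"))
```

If `dt` has thousands of distinct values, each write fans out into thousands of tiny files: the pitfall from the second bullet.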
→ 4. Schema Evolution
• Product teams add columns weekly; brittle jobs break.
• Interview Q: “How would you backfill a non-nullable column without downtime?”
• How to practice: evolve an Avro schema in Confluent; validate consumer
compatibility modes. A minimal local example follows below.
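A minimal local stand-in for the registry exercise, using fastavro instead of Confluent (the schemas are invented): adding a field with a default is what keeps the change backward compatible.

```python
import io
from fastavro import writer, reader

schema_v1 = {"type": "record", "name": "User",
             "fields": [{"name": "id", "type": "int"}]}
# v2 adds a field WITH a default; that default is what makes old data readable.
schema_v2 = {"type": "record", "name": "User",
             "fields": [{"name": "id", "type": "int"},
                        {"name": "plan", "type": "string", "default": "free"}]}

buf = io.BytesIO()
writer(buf, schema_v1, [{"id": 1}])               # producer still on v1
buf.seek(0)
for rec in reader(buf, reader_schema=schema_v2):  # consumer already on v2
    print(rec)  # {'id': 1, 'plan': 'free'} -- the default fills the gap
```

Delete the `"default"` key and the same read fails, which is the breakage a registry’s BACKWARD compatibility check is meant to catch up front.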
→ 5. Change-Data-Capture (CDC)
• Keeps a warehouse fresh without nightly full dumps.
• Pitfall: out-of-order updates if binlog positions aren’t respected.
• How to practice: stream the MySQL binlog with Debezium to Kafka, pipe it to
PostgreSQL, and verify row-level parity. A connector-registration sketch follows
below.
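Registering the Debezium source is one HTTP call to Kafka Connect’s REST API. A hedged sketch follows; host names, credentials, and table names are placeholders, and a production setup needs a few more properties (e.g. the schema-history topic).

```python
import json
import requests

connector = {
    "name": "mysql-cdc",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "secret",
        "database.server.id": "184054",   # must be unique among MySQL replicas
        "topic.prefix": "shop",           # Debezium 2.x topic naming
        "table.include.list": "shop.orders",
    },
}

resp = requests.post(
    "http://localhost:8083/connectors",   # default Kafka Connect REST port
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()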
→ 6. Watermarks & Late Data
• Windows must close, but not too early.
• Interview Q: “How do you charge users if events arrive 2 hours late?”
• How to practice: simulate late messages in Flink, set `allowedLateness`, and
inspect state size. A toy watermark model is sketched below.
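The mechanics fit in a few lines of plain Python before you ever open Flink (all numbers invented): a window is only final once the watermark passes its end plus the allowed lateness, so a moderately late event still counts.

```python
MAX_OUT_OF_ORDERNESS = 300    # watermark trails the max event time by 5 minutes
ALLOWED_LATENESS = 2 * 3600   # mirrors Flink's allowedLateness(Time.hours(2))
WINDOW_END = 3600             # one tumbling window covering [0, 3600)

watermark = 0
buffered = []   # this buffer is the "state size" the bullet above mentions

def on_event(event_time, value):
    global watermark
    buffered.append((event_time, value))
    watermark = max(watermark, event_time - MAX_OUT_OF_ORDERNESS)
    if watermark >= WINDOW_END + ALLOWED_LATENESS:  # only now is the window final
        print("window [0, 3600) total:",
              sum(v for t, v in buffered if t < WINDOW_END))

on_event(3700, 10)  # on-time event in the NEXT window; drags the watermark up
on_event(3500, 5)   # late for [0, 3600) but within allowedLateness: still counted
on_event(WINDOW_END + ALLOWED_LATENESS + MAX_OUT_OF_ORDERNESS, 0)  # closes the window
```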
→ 7. Idempotent Writes
• Retries shouldn’t double your revenue numbers.
• How to practice: write to a Delta Lake table with `MERGE`/UPSERT semantics,
replay the same Kafka topic twice, and verify the counts match. See the MERGE
sketch below.
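A hedged PySpark + Delta Lake sketch of that upsert (path, key column, and values are illustrative; it assumes the Delta jars are configured and the target table already exists): replaying the same batch leaves the table unchanged because matched rows simply overwrite themselves.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("idempotent-merge").getOrCreate()

# Pretend this batch was read from the Kafka topic a second time.
updates = spark.createDataFrame([(1, 100.0), (2, 250.0)], ["order_id", "amount"])

target = DeltaTable.forPath(spark, "/tmp/orders_delta")
(target.alias("t")
       .merge(updates.alias("s"), "t.order_id = s.order_id")
       .whenMatchedUpdateAll()      # a replayed row overwrites itself: no dupes
       .whenNotMatchedInsertAll()   # a genuinely new row is inserted exactly once
       .execute())
```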
→ 8. Data Lineage & Metadata
• “What changed this metric?” must never be a guess.
• How to practice: hook OpenLineage or Marquez into Airflow; track field-level
lineage for one KPI. A bare-bones lineage-event sketch follows below.
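To see what a lineage event even looks like, you can post a minimal OpenLineage run event to a local Marquez instance by hand; the endpoint, namespace, and dataset names below are assumptions based on a default local deployment, and the Airflow integration normally emits these for you.

```python
import uuid
from datetime import datetime, timezone

import requests

event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "producer": "https://example.com/manual-demo",
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "demo", "name": "daily_kpi"},
    "inputs": [{"namespace": "warehouse", "name": "events"}],
    "outputs": [{"namespace": "warehouse", "name": "kpi_daily"}],
}

# Assumed default Marquez ingestion endpoint in a local docker-compose setup.
requests.post("http://localhost:5000/api/v1/lineage", json=event).raise_for_status()
```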
→ 9. Orchestration (Airflow, Dagster)
• DAGs give retries, SLA alerts, and one-click backfills.
• Interview Q: “How would you prevent a daily job from firing twice?”
• How to practice: migrate two shell scripts into Airflow with `depends_on_past`
and `max_active_runs=1`; a minimal DAG is sketched below.
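A minimal DAG showing both guards (the dag id, schedule, and script path are made up; the `schedule` argument assumes Airflow 2.4+):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nightly_export",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    max_active_runs=1,                        # at most one run in flight
    default_args={"depends_on_past": True},   # today's run waits for yesterday's
):
    # Trailing space stops Airflow treating the .sh path as a Jinja template file.
    BashOperator(task_id="export", bash_command="/opt/scripts/export.sh ")
```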
→ 10. Incremental vs Full Loads
• Full reloads waste bandwidth, but increments need ordering guarantees.
• How to practice: switch a Redshift copy from a full S3 pull to a `manifest`-based
incremental load; compare runtimes. A manifest-driven sketch follows below.
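A hedged end-to-end sketch of the manifest approach (bucket, table, role, and credentials are placeholders): write a manifest listing only the new keys, then let one COPY load exactly those files.

```python
import json

import boto3
import psycopg2

# Only today's delta, not the whole prefix.
new_keys = ["s3://my-bucket/events/2024-01-02/part-000.csv"]
manifest = {"entries": [{"url": k, "mandatory": True} for k in new_keys]}

boto3.client("s3").put_object(
    Bucket="my-bucket",
    Key="manifests/2024-01-02.json",
    Body=json.dumps(manifest),
)

conn = psycopg2.connect("dbname=dw host=my-cluster.example.com user=loader password=...")
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY events
        FROM 's3://my-bucket/manifests/2024-01-02.json'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
        CSV
        MANIFEST;
    """)
```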