
SQL Important Revision

The document outlines SQL operations involving two tables, Table1 and Table2, including various types of joins. It also discusses best practices and pitfalls in data processing, such as batch vs streaming, columnar formats, partitioning, schema evolution, change-data-capture, and data lineage. Additionally, it provides practical exercises to enhance understanding of these concepts in data engineering.

Uploaded by Shashwat Dev

CREATE TABLE Table1 (
    A INT
);

-- Insert the data
INSERT INTO Table1 (A) VALUES
(1),
(1),
(1),
(1),
(2),
(2),
(NULL);

SELECT * FROM Table1;

CREATE TABLE Table2 (
    B INT
);

-- Insert the data
INSERT INTO Table2 (B) VALUES
(1),
(1),
(2),
(2),
(3),
(4),
(NULL),
(NULL);

SELECT * FROM Table2;

-- Table1.A: 1, 1, 1, 1, 2, 2, NULL (7 rows)
-- Table2.B: 1, 1, 2, 2, 3, 4, NULL, NULL (8 rows)

-- LEFT JOIN from Table1: every match multiplies (four 1s x two 1s = 8,
-- two 2s x two 2s = 4), and Table1's NULL row survives unmatched.
-- Result: 8 + 4 + 1 = 13 rows.
SELECT t1.*, t2.*
FROM Table1 AS t1
LEFT JOIN Table2 AS t2
    ON t1.A = t2.B;

-- LEFT JOIN from Table2: 3, 4, and both NULLs find no match in Table1.
-- Result: 12 matched + 4 unmatched = 16 rows.
SELECT t2.*, t1.*
FROM Table2 AS t2
LEFT JOIN Table1 AS t1
    ON t2.B = t1.A;

-- FULL OUTER JOIN: matched pairs plus unmatched rows from both sides.
-- NULL never equals NULL, so the NULL rows stay unmatched.
-- Result: 12 + 1 + 4 = 17 rows.
SELECT t1.*, t2.*
FROM Table1 AS t1
FULL OUTER JOIN Table2 AS t2
    ON t1.A = t2.B;

-- INNER JOIN: only the matching 1s and 2s. Result: 12 rows.
SELECT t1.*, t2.*
FROM Table1 AS t1
INNER JOIN Table2 AS t2
    ON t1.A = t2.B;

→ 1. Batch vs Streaming
• Batch gives cheaper compute, streaming gives fresher dashboards—pick wrong and
costs or SLAs explode.
• Typical interview Q: “How would late-arriving events hurt a daily aggregation?”
• How to practice: build a Spark batch that writes to S3, then re-implement with
Kafka + Flink; compare cost and latency.
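The late-arrival pitfall from the interview question can be shown without Spark or Flink at all. Below is a toy pure-Python sketch (the event fields and amounts are invented for illustration): grouping by the day an event *arrived* under-counts the true business day, while grouping by the day it *happened* requires re-running the affected partition.

```python
from collections import defaultdict
from datetime import date

# Each event carries the day it happened (event_date) and the day the
# pipeline saw it (arrival_date). One payment arrives a day late.
events = [
    {"event_date": date(2024, 1, 1), "arrival_date": date(2024, 1, 1), "amount": 100},
    {"event_date": date(2024, 1, 1), "arrival_date": date(2024, 1, 1), "amount": 50},
    {"event_date": date(2024, 1, 1), "arrival_date": date(2024, 1, 2), "amount": 25},  # late
]

def daily_totals(events, key):
    """Aggregate amounts by the chosen date column."""
    totals = defaultdict(int)
    for e in events:
        totals[e[key]] += e["amount"]
    return dict(totals)

# Keyed on arrival day: Jan 1 is under-counted and a phantom Jan 2 total appears.
by_arrival = daily_totals(events, "arrival_date")
# Keyed on event day (re-processing the Jan 1 partition): the true total.
by_event = daily_totals(events, "event_date")

print(by_arrival)  # {date(2024, 1, 1): 150, date(2024, 1, 2): 25}
print(by_event)    # {date(2024, 1, 1): 175}
```

A daily batch keyed on event date must therefore re-open "closed" days when stragglers show up, which is exactly the cost/freshness trade-off streaming watermarks try to manage.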

→ 2. Columnar Formats (Parquet, ORC)
• Column pruning + compression → 90% fewer bytes scanned.
• Pitfall: writing tiny Parquet files kills performance (“small-file problem”).
• How to practice: run `EXPLAIN ANALYZE` on the same SQL over CSV and Parquet;
watch scan time and I/O stats.
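Why column pruning cuts bytes scanned can be modeled in a few lines of plain Python (a deliberately crude sketch: string length stands in for serialized size, and the data is made up). A row store must touch every byte of every row to answer a one-column query; a column store touches only that column.

```python
# Toy model: string length approximates serialized bytes.
rows = [{"id": i, "name": "user%d" % i, "score": i % 100} for i in range(1000)]

def row_store_bytes(rows, column):
    # Row-oriented: the whole serialized row is scanned per record,
    # regardless of which column the query asked for.
    return sum(len(str(r)) for r in rows)

def column_store_bytes(rows, column):
    # Column-oriented: only the requested column's values are scanned.
    return sum(len(str(r[column])) for r in rows)

scanned_row = row_store_bytes(rows, "score")
scanned_col = column_store_bytes(rows, "score")
print(scanned_col / scanned_row)  # a small fraction: column pruning in action
```

Real Parquet adds encoding and compression on top, which is where the "90% fewer bytes" figure comes from on wide tables.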

→ 3. Partitioning & Clustering
• Scanners skip folders instead of rows.
• Pitfall: too many partitions = millions of tiny tasks.
• How to practice: load a year of logs, add `dt=YYYY-MM-DD` partition, then bucket
by `user_id`; observe query fan-out.
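"Skipping folders instead of rows" is just predicate matching on partition keys. A minimal sketch (file names and layout are invented): files live under `dt=` partitions, and a date filter never opens the other partitions' files.

```python
# Files laid out as dt=YYYY-MM-DD partitions, as in a Hive-style table.
files = {
    "dt=2024-01-01": ["a.parquet", "b.parquet"],
    "dt=2024-01-02": ["c.parquet"],
    "dt=2024-01-03": ["d.parquet", "e.parquet"],
}

def files_to_scan(files, dt_filter):
    # Partition pruning: whole partitions whose key fails the predicate
    # are skipped without listing or opening their files.
    return [f for part, fs in files.items()
            if part == "dt=" + dt_filter
            for f in fs]

print(files_to_scan(files, "2024-01-02"))  # ['c.parquet']
```

The pitfall follows directly: if `dt` were replaced by a high-cardinality key like `user_id`, the dict above would hold millions of one-file partitions, and pruning savings would be swamped by metadata and task overhead.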

→ 4. Schema Evolution
• Product teams add columns weekly; brittle jobs break.
• Interview Q: “How would you backfill a non-nullable column without downtime?”
• How to practice: evolve an Avro schema in Confluent; validate consumer
compatibility modes.
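The core of backward-compatible schema evolution (what Avro's reader-schema defaults give you) fits in a tiny sketch. Field names and defaults here are invented for illustration: old records were written before `country` existed, and the new reader fills the gap with a default instead of failing.

```python
# The reader's schema, with a default for every field. Defaults cover
# columns the original writer never knew about.
NEW_SCHEMA_DEFAULTS = {"user_id": None, "amount": 0, "country": "unknown"}

old_record = {"user_id": 1, "amount": 30}                      # old writer
new_record = {"user_id": 2, "amount": 40, "country": "DE"}     # new writer

def read_with_schema(record, defaults):
    # Backward-compatible read: missing fields get the schema default,
    # present fields pass through unchanged.
    return {field: record.get(field, default)
            for field, default in defaults.items()}

print(read_with_schema(old_record, NEW_SCHEMA_DEFAULTS))
# {'user_id': 1, 'amount': 30, 'country': 'unknown'}
```

This is also why the interview question about backfilling a non-nullable column is tricky: a default lets reads succeed immediately, while the real backfill rewrites history in the background.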

→ 5. Change-Data-Capture (CDC)
• Keeps a warehouse fresh without nightly full dumps.
• Pitfall: out-of-order updates if binlog markers aren’t respected.
• How to practice: stream MySQL binlog with Debezium to Kafka, pipe to PostgreSQL,
verify row-level parity.
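The out-of-order pitfall comes down to one rule: never apply a change whose log position is older than what you already applied for that key. A minimal sketch (the `lsn` numbers and row shapes are made up; real binlog positions are more structured):

```python
table = {}           # primary key -> current row image
applied_lsn = {}     # primary key -> last applied log position

def apply_change(pk, row, lsn):
    # Respect the binlog marker: a stale or duplicate event (lsn not
    # newer than what we applied) must be a no-op, or a retry could
    # resurrect old data.
    if lsn <= applied_lsn.get(pk, -1):
        return
    table[pk] = row
    applied_lsn[pk] = lsn

apply_change(1, {"email": "old@x.com"}, lsn=10)
apply_change(1, {"email": "new@x.com"}, lsn=12)
apply_change(1, {"email": "old@x.com"}, lsn=10)  # late duplicate: ignored

print(table[1]["email"])  # new@x.com
```

Debezium-style pipelines carry this position in every event envelope; the sink's job is exactly this comparison before writing.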

→ 6. Watermarks & Late Data
• Windows must close, but not too early.
• Interview Q: “How do you charge users if events arrive 2 hours late?”
• How to practice: simulate late messages in Flink, set `allowedLateness`, and
inspect state size.
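A watermark is just "max event time seen, minus an allowed lateness"; events behind it hit windows that have already closed. The sketch below (tumbling windows of width 10 and integer event times, all invented) shows two stragglers being dropped:

```python
ALLOWED_LATENESS = 5   # how far behind max event time we still accept
WINDOW_SIZE = 10       # tumbling windows: [0,10), [10,20), ...

windows = {}   # window start -> event count
max_seen = 0
dropped = []

def process(event_time):
    global max_seen
    max_seen = max(max_seen, event_time)
    watermark = max_seen - ALLOWED_LATENESS
    if event_time < watermark:
        dropped.append(event_time)   # its window already closed
        return
    window_start = (event_time // WINDOW_SIZE) * WINDOW_SIZE
    windows[window_start] = windows.get(window_start, 0) + 1

for t in [1, 3, 12, 14, 2, 25, 4]:   # 2 and 4 arrive after later events
    process(t)

print(windows, dropped)  # {0: 2, 10: 2, 20: 1} [2, 4]
```

Raising `ALLOWED_LATENESS` rescues the stragglers, but every still-open window is state the engine must hold, which is the trade-off the Flink exercise makes visible.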

→ 7. Idempotent Writes
• Retries shouldn't double your revenue numbers.
• How to practice: write to a Delta Lake table with `MERGE`/UPSERT semantics,
replay the same Kafka topic twice, verify counts.
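MERGE/UPSERT semantics in miniature: key every write on a stable event id, so inserting is really insert-or-overwrite and a replay changes nothing. (The event ids and amounts below are invented for illustration.)

```python
table = {}  # event_id -> row; keying on event_id makes writes idempotent

def merge_batch(batch):
    for row in batch:
        # Insert or overwrite, never duplicate: the MERGE idea in one line.
        table[row["event_id"]] = row

batch = [
    {"event_id": "e1", "amount": 10},
    {"event_id": "e2", "amount": 20},
]
merge_batch(batch)
merge_batch(batch)   # replayed topic / retried task: a no-op

print(len(table), sum(r["amount"] for r in table.values()))  # 2 30
```

An append-only write of the same batch would report 4 rows and 60 in revenue, which is exactly the double-counting the bullet warns about.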

→ 8. Data Lineage & Metadata
• "What changed this metric?" must never be a guess.
• How to practice: hook OpenLineage or Marquez into Airflow; track field-level
lineage for one KPI.

→ 9. Orchestration (Airflow, Dagster)
• DAGs give retries, SLA alerts, and one-click backfills.
• Interview Q: “How would you prevent a daily job from firing twice?”
• How to practice: migrate two shell scripts into Airflow with `depends_on_past`
and `max_active_runs=1`.
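What `max_active_runs=1` plus run-id dedup buys you can be reduced to a run lock keyed on the logical date. This is a sketch of the idea, not Airflow's internals; the function and return strings are invented:

```python
completed_runs = set()  # logical run dates that have already fired

def trigger(run_date):
    # A second trigger for the same logical date is rejected, so a
    # scheduler hiccup or manual re-trigger can't fire the job twice.
    if run_date in completed_runs:
        return "skipped: already ran"
    completed_runs.add(run_date)
    return "started"

print(trigger("2024-01-01"))  # started
print(trigger("2024-01-01"))  # skipped: already ran
```

In a real deployment the "set" must live somewhere shared and transactional (Airflow uses its metadata database), or two scheduler processes could both pass the check.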

→ 10. Incremental vs Full Loads
• Full reloads waste bandwidth, but increments need ordering guarantees.
• How to practice: switch a Redshift copy from full S3 pull to `manifest`-based
incremental; compare runtimes.
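The standard incremental pattern is a high watermark: remember the largest `updated_at` copied so far and pull only rows newer than it. A toy sketch with invented rows (note how it silently depends on `updated_at` being monotonic, which is the "ordering guarantees" caveat):

```python
source = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 200},
    {"id": 3, "updated_at": 300},
]
high_watermark = 0  # largest updated_at we've already copied

def incremental_pull():
    global high_watermark
    # Only rows strictly newer than the watermark are transferred.
    new_rows = [r for r in source if r["updated_at"] > high_watermark]
    if new_rows:
        high_watermark = max(r["updated_at"] for r in new_rows)
    return new_rows

print(len(incremental_pull()))  # 3  (first run degenerates to a full load)
source.append({"id": 4, "updated_at": 400})
print(len(incremental_pull()))  # 1  (only the new row)
```

A row updated with a backdated `updated_at` would slip under the watermark and be missed forever, which is why CDC (section 5) is often the safer incremental mechanism.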
