CREATE TABLE Table1 (
A INT
);
-- Insert the data
INSERT INTO Table1 (A) VALUES
(1),
(1),
(1),
(1),
(2),
(2),
(NULL);
SELECT * FROM Table1;
CREATE TABLE Table2 (
B INT
);
-- Insert the data
INSERT INTO Table2 (B) VALUES
(1),
(1),
(2),
(2),
(3),
(4),
(NULL),
(NULL);
SELECT * FROM Table2;
-- LEFT JOIN: every Table1 row is kept; NULL never equals NULL, so the NULL row gets no match
SELECT a.*, b.*
FROM Table1 AS a
LEFT JOIN Table2 AS b
ON a.A = b.B;
-- LEFT JOIN the other way: every Table2 row is kept, including 3, 4, and both NULLs
SELECT a.*, b.*
FROM Table2 AS a
LEFT JOIN Table1 AS b
ON a.B = b.A;
-- FULL OUTER JOIN: unmatched rows from both sides survive (not supported by MySQL)
SELECT a.*, b.*
FROM Table1 AS a
FULL OUTER JOIN Table2 AS b
ON a.A = b.B;
-- INNER JOIN: only matching pairs; duplicates multiply (four 1s x two 1s = 8 rows for value 1)
SELECT a.*, b.*
FROM Table1 AS a
INNER JOIN Table2 AS b
ON a.A = b.B;
→ 1. Batch vs Streaming
• Batch gives cheaper compute, streaming gives fresher dashboards; pick the wrong one
and costs or SLAs explode.
• Typical interview Q: “How would late-arriving events hurt a daily aggregation?”
• How to practice: build a Spark batch job that writes to S3, then re-implement it
with Kafka + Flink; compare cost and latency. A toy version of the late-event
problem is sketched below.
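To make the interview question concrete, here is a toy illustration in plain Python (timestamps and amounts are invented): the daily batch has already run when a late event finally lands, so the day’s total comes up short.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical events: event_time is when it happened, arrived is when we saw it.
events = [
    {"event_time": "2024-01-01T23:58:00", "arrived": "2024-01-01T23:59:00", "amount": 10},
    {"event_time": "2024-01-01T23:59:00", "arrived": "2024-01-02T01:30:00", "amount": 5},  # late
]

# The daily batch for 2024-01-01 ran shortly after midnight.
batch_ran_at = datetime.fromisoformat("2024-01-02T00:05:00")

daily_totals = defaultdict(int)
for e in events:
    if datetime.fromisoformat(e["arrived"]) <= batch_ran_at:  # only what had arrived
        daily_totals[e["event_time"][:10]] += e["amount"]

print(dict(daily_totals))  # {'2024-01-01': 10} -- the late 5 never made it in
```

Re-running the partition fixes the number but doubles the compute; that trade-off is exactly what the question is probing.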
→ 2. Columnar Formats (Parquet, ORC)
• Column pruning + compression can cut bytes scanned by roughly 90%.
• Pitfall: writing tiny Parquet files kills performance (the “small-file problem”).
• How to practice: run `EXPLAIN ANALYZE` on the same SQL over CSV and Parquet;
watch scan time and I/O stats. A quick local comparison is sketched below.
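If a database isn’t handy, a rough local comparison with pandas (assuming pandas and pyarrow are installed) already shows both effects: compression shrinks the file, and column pruning lets a reader touch only the columns it needs.

```python
import os
import pandas as pd

# Synthetic table: one numeric column plus a padded string column.
df = pd.DataFrame({
    "user_id": range(1_000_000),
    "amount": 1.0,
    "note": "x" * 20,
})
df.to_csv("events.csv", index=False)
df.to_parquet("events.parquet")  # columnar layout + compression

print("csv bytes:    ", os.path.getsize("events.csv"))
print("parquet bytes:", os.path.getsize("events.parquet"))

# Column pruning: Parquet can deserialize just 'amount'; a CSV reader still
# has to scan every byte of every row to find that one column.
amounts = pd.read_parquet("events.parquet", columns=["amount"])
```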
→ 3. Partitioning & Clustering
• Scanners skip whole folders instead of filtering individual rows.
• Pitfall: too many partitions = millions of tiny tasks.
• How to practice: load a year of logs, add a `dt=YYYY-MM-DD` partition, then bucket
by `user_id`; observe query fan-out. See the PySpark sketch below.
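One possible PySpark shape for that exercise (paths, column names, and the bucket count are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-demo").getOrCreate()
logs = spark.read.json("s3://my-bucket/raw-logs/")  # assumed to carry dt and user_id

# One dt=YYYY-MM-DD folder per day: a filter on dt prunes whole directories.
logs.write.mode("overwrite").partitionBy("dt").parquet("s3://my-bucket/logs_by_day/")

# Bucketing by user_id requires a metastore table (saveAsTable), not a bare path.
(logs.write
     .mode("overwrite")
     .partitionBy("dt")
     .bucketBy(32, "user_id")
     .sortBy("user_id")
     .saveAsTable("logs_bucketed"))
```

If `dt` has thousands of distinct values, each write fans out into thousands of tiny files: the pitfall from the second bullet.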
→ 4. Schema Evolution
• Product teams add columns weekly; brittle jobs break.
• Interview Q: “How would you backfill a non-nullable column without downtime?”
• How to practice: evolve an Avro schema in Confluent; validate consumer
compatibility modes. A minimal local example follows below.
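A minimal local stand-in for the registry exercise, using fastavro instead of Confluent (the schemas are invented): adding a field with a default is what keeps the change backward compatible.

```python
import io
from fastavro import writer, reader

schema_v1 = {"type": "record", "name": "User",
             "fields": [{"name": "id", "type": "int"}]}
# v2 adds a field WITH a default; that default is what makes old data readable.
schema_v2 = {"type": "record", "name": "User",
             "fields": [{"name": "id", "type": "int"},
                        {"name": "plan", "type": "string", "default": "free"}]}

buf = io.BytesIO()
writer(buf, schema_v1, [{"id": 1}])               # producer still on v1
buf.seek(0)
for rec in reader(buf, reader_schema=schema_v2):  # consumer already on v2
    print(rec)  # {'id': 1, 'plan': 'free'} -- the default fills the gap
```

Delete the `"default"` key and the same read fails, which is the breakage a registry’s BACKWARD compatibility check is meant to catch up front.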
→ 5. Change-Data-Capture (CDC)
• Keeps a warehouse fresh without nightly full dumps.
• Pitfall: out-of-order updates if binlog positions aren’t respected.
• How to practice: stream the MySQL binlog with Debezium to Kafka, pipe it to
PostgreSQL, and verify row-level parity. A connector-registration sketch follows
below.
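Registering the Debezium source is one HTTP call to Kafka Connect’s REST API. A hedged sketch follows; host names, credentials, and table names are placeholders, and a production setup needs a few more properties (e.g. the schema-history topic).

```python
import json
import requests

connector = {
    "name": "mysql-cdc",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "secret",
        "database.server.id": "184054",   # must be unique among MySQL replicas
        "topic.prefix": "shop",           # Debezium 2.x topic naming
        "table.include.list": "shop.orders",
    },
}

resp = requests.post(
    "http://localhost:8083/connectors",   # default Kafka Connect REST port
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()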
→ 6. Watermarks & Late Data
• Windows must close, but not too early.
• Interview Q: “How do you charge users if events arrive 2 hours late?”
• How to practice: simulate late messages in Flink, set `allowedLateness`, and
inspect state size. A toy watermark model is sketched below.
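The mechanics fit in a few lines of plain Python before you ever open Flink (all numbers invented): a window is only final once the watermark passes its end plus the allowed lateness, so a moderately late event still counts.

```python
MAX_OUT_OF_ORDERNESS = 300    # watermark trails the max event time by 5 minutes
ALLOWED_LATENESS = 2 * 3600   # mirrors Flink's allowedLateness(Time.hours(2))
WINDOW_END = 3600             # one tumbling window covering [0, 3600)

watermark = 0
buffered = []   # this buffer is the "state size" the bullet above mentions

def on_event(event_time, value):
    global watermark
    buffered.append((event_time, value))
    watermark = max(watermark, event_time - MAX_OUT_OF_ORDERNESS)
    if watermark >= WINDOW_END + ALLOWED_LATENESS:  # only now is the window final
        print("window [0, 3600) total:",
              sum(v for t, v in buffered if t < WINDOW_END))

on_event(3700, 10)  # on-time event in the NEXT window; drags the watermark up
on_event(3500, 5)   # late for [0, 3600) but within allowedLateness: still counted
on_event(WINDOW_END + ALLOWED_LATENESS + MAX_OUT_OF_ORDERNESS, 0)  # closes the window
```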
→ 7. Idempotent Writes
• Retries shouldn’t double your revenue numbers.
• How to practice: write to a Delta Lake table with `MERGE`/UPSERT semantics,
replay the same Kafka topic twice, and verify the counts match. See the MERGE
sketch below.
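A hedged PySpark + Delta Lake sketch of that upsert (path, key column, and values are illustrative; it assumes the Delta jars are configured and the target table already exists): replaying the same batch leaves the table unchanged because matched rows simply overwrite themselves.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("idempotent-merge").getOrCreate()

# Pretend this batch was read from the Kafka topic a second time.
updates = spark.createDataFrame([(1, 100.0), (2, 250.0)], ["order_id", "amount"])

target = DeltaTable.forPath(spark, "/tmp/orders_delta")
(target.alias("t")
       .merge(updates.alias("s"), "t.order_id = s.order_id")
       .whenMatchedUpdateAll()      # a replayed row overwrites itself: no dupes
       .whenNotMatchedInsertAll()   # a genuinely new row is inserted exactly once
       .execute())
```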
→ 8. Data Lineage & Metadata
• “What changed this metric?” must never be a guess.
• How to practice: hook OpenLineage or Marquez into Airflow; track field-level
lineage for one KPI. A bare-bones lineage-event sketch follows below.
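To see what a lineage event even looks like, you can post a minimal OpenLineage run event to a local Marquez instance by hand; the endpoint, namespace, and dataset names below are assumptions based on a default local deployment, and the Airflow integration normally emits these for you.

```python
import uuid
from datetime import datetime, timezone

import requests

event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "producer": "https://example.com/manual-demo",
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "demo", "name": "daily_kpi"},
    "inputs": [{"namespace": "warehouse", "name": "events"}],
    "outputs": [{"namespace": "warehouse", "name": "kpi_daily"}],
}

# Assumed default Marquez ingestion endpoint in a local docker-compose setup.
requests.post("http://localhost:5000/api/v1/lineage", json=event).raise_for_status()
```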
→ 9. Orchestration (Airflow, Dagster)
• DAGs give retries, SLA alerts, and one-click backfills.
• Interview Q: “How would you prevent a daily job from firing twice?”
• How to practice: migrate two shell scripts into Airflow with `depends_on_past`
and `max_active_runs=1`; a minimal DAG is sketched below.
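A minimal DAG showing both guards (the dag id, schedule, and script path are made up; the `schedule` argument assumes Airflow 2.4+):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nightly_export",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    max_active_runs=1,                        # at most one run in flight
    default_args={"depends_on_past": True},   # today's run waits for yesterday's
):
    # Trailing space stops Airflow treating the .sh path as a Jinja template file.
    BashOperator(task_id="export", bash_command="/opt/scripts/export.sh ")
```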
→ 10. Incremental vs Full Loads
• Full reloads waste bandwidth, but increments need ordering guarantees.
• How to practice: switch a Redshift copy from a full S3 pull to a `manifest`-based
incremental load; compare runtimes. A manifest-driven sketch follows below.
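A hedged end-to-end sketch of the manifest approach (bucket, table, role, and credentials are placeholders): write a manifest listing only the new keys, then let one COPY load exactly those files.

```python
import json

import boto3
import psycopg2

# Only today's delta, not the whole prefix.
new_keys = ["s3://my-bucket/events/2024-01-02/part-000.csv"]
manifest = {"entries": [{"url": k, "mandatory": True} for k in new_keys]}

boto3.client("s3").put_object(
    Bucket="my-bucket",
    Key="manifests/2024-01-02.json",
    Body=json.dumps(manifest),
)

conn = psycopg2.connect("dbname=dw host=my-cluster.example.com user=loader password=...")
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY events
        FROM 's3://my-bucket/manifests/2024-01-02.json'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
        CSV
        MANIFEST;
    """)
```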