Skip to content

Human-readable optimizations#12

Open
peterboncz wants to merge 2 commits into
mainfrom
pb/human-readable-optimizations
Open

Human-readable optimizations#12
peterboncz wants to merge 2 commits into
mainfrom
pb/human-readable-optimizations

Conversation

@peterboncz

@peterboncz peterboncz commented Jun 15, 2026

Copy link
Copy Markdown

Make LPTS-generated SQL compact and human-readable

Motivation

LPTS emitted one CTE per logical-plan operator with opaque, fully-prefixed column names, e.g.:

WITH scan_0 (t0_a, t0_d, t0_b) AS (SELECT a, d, b FROM memory.main.t WHERE a>0), aggregate_1 (t2_d, t3_aggregate_0) AS (SELECT t0_d, sum(t0_a) FROM scan_0 GROUP BY t0_d), order_2 (...) AS (...), limit_3 (...) AS
(...) SELECT t2_d AS d, t3_aggregate_0 AS aggregate_0 FROM limit_3;

This branch reworks serialization so the output becomes (hopefully) better readable while staying exactly round-trip-equivalent, and while keeping execution order legible (the plan is cut only at joins, set ops, and subqueries):

WITH
t0_scan (a, d) AS ( 
    SELECT a, d
    FROM   memory.main.t
    WHERE  (a > 0)   
) 
SELECT   d, sum(a) AS sum_a 
FROM     t0_scan 
GROUP BY d
HAVING   ((sum(a)) > (1))
ORDER BY sum(a) ASC NULLS LAST
LIMIT    5;    

What changed

  1. Pipeline fusion (lpts_merge_pipeline, default ON) — a maximal chain of single-child pipeline operators (Limit → OrderBy → Filter → Project → Aggregate → Project → Filter, plus a pushdown-free base-table scan) is
    folded into one flat SELECT .. FROM .. WHERE .. GROUP BY .. HAVING .. ORDER BY .. LIMIT .. via whole-identifier expression substitution, instead of one CTE each. Cuts the plan at joins/set-ops/subqueries; scans with a pushed-down predicate stay their own CTE. New setting;
    legacy one-CTE-per-operator output is still available with SET lpts_merge_pipeline=false.
  2. Semantic column identifiers — scans use the physical name; pure pass-throughs reuse the source name; user aliases win; aggregates are named <func>_<col> (count_star, count_distinct_b, sum_a); only genuine
    computations fall back to scalar_N. At serialization, a tX_ name collapses to its bare form when that bare name is globally unique and a safe unquoted identifier, so prefixes remain only where needed to
    disambiguate (e.g. across joins/self-joins/keywords).
  3. Final-SELECT inlining — the redundant closing SELECT … FROM <last_cte> is removed; the last CTE's body becomes the result directly (re-aliased to the user-facing names), so trivial queries emit no WITH at all.
  4. Pretty-printing — each clause on its own line, keyword left-aligned at the block indent with the expression area at +8 (or +10 with GROUP BY/ORDER BY); comma/AND lists wrap at width 100, continuing at the expression column and never splitting an expression; WITH on its own line, CTE definitions at column 0, bodies indented 4. Built on a structured SelectParts that backs both the single-line and pretty renderers. CTE names are t<index>_<operator>.
  5. Join build-side marker — the right (inner) side of a join, which is always materialized in its own CTE, is suffixed _materialized_for_join (DuckDB output only). This is done to help DuckDB users understand which CTEs will be materialized in memory.

Testing

  • Full unit suite green (added merge_pipeline, identifiers, pretty_print test files and a join-suffix check; updated dialect tests for the new whitespace and added a Postgres "no suffix" assertion).
  • Every change verified against the SQLStorm TPC-H round-trip benchmark (--tpch_sf 0.001) with incorrect=0 throughout; success rose monotonically: 15724 → 15727 → 15729 → 15757 → 15760 / 17036 (92.51%).

unambiguous about execution order, and itry getting  human-readable column names.

how?

combine multiple operators with obvious execution order in one "block"
for Limit-OrderBy-Project-Aggr-Project-Filter-Scan
- we only include Scan if it does not have predicate pushdown.
- we accep[t any subsequence of this of size > 1 (simplify)

rename the CTE blocks tX_operator such that tX is very visible
(column names may have tx_ prefix and that is the CTE that introdduces it)

when serializing, swallow "AS x" in a SELECT x AS x

use tx_something as the standard column expression name, but try to keep it semantic
- Scan injects the column name (t1_colname)
- operators that pass through keep passing through
- if the user introduced an alias with "AS alias", use it: t3_alias
- for aggregations, use t4_sum_name (* becomes star, distinct itself)

finally, try to remove tx_ from tx_name by a check at the end that
name is a unique name in the query.
Refactor serialization around a structured SelectParts (BuildSelectParts)
shared by the single-line and pretty renderers, and make the default output
pretty-printed:

- Pretty-printing: each clause (SELECT / FROM / WHERE / GROUP BY / HAVING /
ORDER BY / LIMIT) on its own line, keyword left-aligned at the block indent
with the expression area at +8 (or +10 when GROUP BY/ORDER BY is present so
the 8-char keywords fit). Comma/AND-separated lists wrap at width 100,
continuing at the expression column and never breaking inside an expression.
WITH sits on its own line, CTE definitions start at column 0, and CTE bodies
are indented 4. CTE bodies are named by the header (no redundant body AS).

- HAVING fusion: a filter directly above an aggregate now folds into the
merged block's HAVING clause (admitted only when a GROUP BY/aggregate is
present) instead of becoming a separate filter CTE.

- Joins: the right (inner) side, which is always materialized in its own CTE,
is suffixed with _materialized_for_join. DuckDB output only; other dialects
keep the plain name.

- Tests: add pretty_print/identifiers/merge_pipeline suites and a join-suffix
check; update the dialect tests' LIKE patterns for the new whitespace and
add a Postgres "no suffix" assertion.
@peterboncz peterboncz requested a review from ila June 17, 2026 22:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant