Skip to content

fix(web): mechanical repair of GFM tables with off-by-one separator rows#902

Open
heavygee wants to merge 5 commits into
tiann:mainfrom
heavygee:feat/gfm-table-repair
Open

fix(web): mechanical repair of GFM tables with off-by-one separator rows#902
heavygee wants to merge 5 commits into
tiann:mainfrom
heavygee:feat/gfm-table-repair

Conversation

@heavygee

@heavygee heavygee commented Jun 14, 2026

Copy link
Copy Markdown
Collaborator

Summary

Adds a remark plugin (`remarkRepairTables`) that silently fixes the dominant broken-table pattern in agent output: the GFM separator row has fewer pipe-delimited cells than the header row.

`remark-gfm` follows the GFM spec and truncates the table to the separator column count, silently dropping header and data cells. This plugin reads the original source via `file.value` + `node.position` offsets, detects the column mismatch, pads the separator with `---` cells, and re-parses just that table block. Only broken tables are touched; valid tables, prose with pipes, and code spans are unaffected.

The plugin runs immediately after `remarkGfm` in both `MARKDOWN_PLUGINS` and `MARKDOWN_PLUGINS_WITH_BREAKS`, where `node.position` source offsets are still accurate.

Analysis of real session data: 975 apparent table blocks over 7 days; 879 flagged for separator mismatch; 744 (84.6%) were false positives (inline pipes in prose/shell). The separator off-by-one pattern accounted for ~94 of 135 confirmed broken tables.

Test plan

  • `bun typecheck` — clean (cli / hub / web)
  • `bun run test:web` — 10 new table-repair tests pass (valid pass-through, off-by-one, off-by-N, no surrounding pipes, alignment hints preserved, header-only table, multiple mixed tables, prose/code-span false-positive immunity), no regressions in 974 total tests
  • Structural column-count assertions on all repair test cases (not just content presence)

Disclosure

This PR was developed with assistance from Claude Sonnet 4.6 (claude-sonnet-4-6) acting as a peer agent in a multi-agent workflow.

heavygee and others added 2 commits June 14, 2026 02:05
…rows

Adds a remark plugin (remarkRepairTables) that runs after remark-gfm and
silently fixes the dominant broken-table pattern seen in agent output:
the separator row has fewer pipe-delimited cells than the header row.

remark-gfm follows the GFM spec and silently truncates the table to the
separator column count, dropping header and data cells. This plugin reads
the original source via file.value position data, detects the mismatch,
pads the separator row, and re-parses the corrected block so all columns
are preserved.

Analysis of 7 days of session data: 975 apparent table blocks, 879 flagged
broken. Of those, 744 (84.6%) were false positives (inline pipes in prose
and shell commands). The separator off-by-one pattern accounted for the
majority of genuine failures (~94 of 135 real broken tables).

The plugin is wired into MARKDOWN_PLUGINS and MARKDOWN_PLUGINS_WITH_BREAKS,
immediately after remarkGfm where position data is available.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
- Remove unused parseTableCols helper
- Assert alignment markers (:-- / --:) are preserved in repaired separator
- Add header-only table test (header + broken separator, no data rows)

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

  • [Major] Raw pipe splitting rewrites valid escaped-pipe tables — countSourceCells() treats every | byte as a column boundary, but GFM allows escaped pipes inside a cell. A valid two-column table like | A\|B | C | / |---|---| is detected as a three-column header, so the plugin pads the delimiter and re-parses it as a different table shape. Evidence: web/src/lib/remark-repair-tables.ts:36.

Summary

  • Review mode: initial
  • Found one correctness regression in the new table repair path.

Testing

  • Not run (automation): bun is not installed in this runner environment.

Comment thread web/src/lib/remark-repair-tables.ts Outdated
// Strip optional surrounding pipes before splitting
const inner = (trimmed.startsWith('|') ? trimmed.slice(1) : trimmed)
const stripped = inner.endsWith('|') ? inner.slice(0, -1) : inner
return stripped.split('|').length

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MAJOR] countSourceCells() treats every raw | as a delimiter, but valid GFM table cells can contain escaped pipes. For example, | A\|B | C | with a two-cell separator is already a valid two-column table; this helper counts three header cells, pads the separator, and changes the rendered table shape. Please split only on unescaped pipe delimiters (and add a regression test for escaped pipes in headers/cells).

Suggested fix:

function splitTableRowCells(line: string): string[] {
    const trimmed = line.trim()
    const body = trimmed.startsWith('|') ? trimmed.slice(1) : trimmed
    const source = body.endsWith('|') ? body.slice(0, -1) : body
    const cells: string[] = []
    let current = ''
    let escaped = false

    for (const ch of source) {
        if (escaped) {
            current += ch
            escaped = false
            continue
        }
        if (ch === '\\') {
            current += ch
            escaped = true
            continue
        }
        if (ch === '|') {
            cells.push(current)
            current = ''
            continue
        }
        current += ch
    }

    cells.push(current)
    return cells
}

function countSourceCells(line: string): number {
    return splitTableRowCells(line).length
}

heavygee and others added 2 commits June 14, 2026 02:13
…tion

- Drop repairCount from visitTables — increment was never read at call site
- Add per-row cell count assertion to the 3-column repair test to catch
  structural regressions that content-presence checks would miss

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
… tests

Off-by-N (4-column), alignment-hints, and header-only tests now verify
each output row has the correct number of cells, not just content presence.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

  • [Major] Escaped header pipes trigger false repairs — countSourceCells splits every |, so a valid two-column GFM table like | A \| B | C | with |---|---| is treated as three source columns. The plugin then pads the separator and re-parses, changing an already-valid table by adding a bogus empty column. Evidence web/src/lib/remark-repair-tables.ts:36
    Suggested fix:
    function countSourceCells(line: string): number {
        const trimmed = line.trim()
        const inner = trimmed.startsWith('|') ? trimmed.slice(1) : trimmed
        const stripped = inner.endsWith('|') ? inner.slice(0, -1) : inner
    
        let cells = 1
        let escaped = false
        for (const ch of stripped) {
            if (escaped) {
                escaped = false
                continue
            }
            if (ch === '\\') {
                escaped = true
                continue
            }
            if (ch === '|') cells++
        }
        return cells
    }

Summary

  • Review mode: initial
  • One issue found: valid tables with escaped literal pipes can be misclassified and rewritten. Residual risk remains around broader GFM table tokenization edge cases unless covered by tests.

Testing

  • Not run (automation; Bun unavailable in this runner PATH)

HAPI Bot

Comment thread web/src/lib/remark-repair-tables.ts Outdated
// Strip optional surrounding pipes before splitting
const inner = (trimmed.startsWith('|') ? trimmed.slice(1) : trimmed)
const stripped = inner.endsWith('|') ? inner.slice(0, -1) : inner
return stripped.split('|').length

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Major] Escaped literal pipes in a header cell are valid GFM (\|), but this helper counts every pipe byte. That makes a valid two-column table such as | A \| B | C | / |---|---| look like three source columns, so the repair path pads the delimiter row and re-parses it as a different table with an extra empty column.

Suggested fix:

function countSourceCells(line: string): number {
    const trimmed = line.trim()
    const inner = trimmed.startsWith('|') ? trimmed.slice(1) : trimmed
    const stripped = inner.endsWith('|') ? inner.slice(0, -1) : inner

    let cells = 1
    let escaped = false
    for (const ch of stripped) {
        if (escaped) {
            escaped = false
            continue
        }
        if (ch === '\\') {
            escaped = true
            continue
        }
        if (ch === '|') cells++
    }
    return cells
}

…airs

\| inside a GFM table cell is a literal pipe character, not a cell
delimiter. The previous split('|') approach miscounted cells in headers
like | A \| B | C |, treating a valid 2-column table as 3-column and
padding the separator unnecessarily.

Replaces the split with a character-scan that tracks escape state.
Adds a test asserting the separator column count stays at 2 for tables
with escaped pipes in the header.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

  • [Major] Repaired tables lose parser-extension nodes — parseTableBlock() builds a fresh parser with only remarkParse + remarkGfm, while the app markdown pipeline also has parser extensions such as remarkMath. When a broken table contains inline math, the original parse would produce math nodes, but the repaired replacement table is parsed as plain text, so rehypeKatex no longer renders it. Evidence web/src/lib/remark-repair-tables.ts:74
    Suggested fix:
    import type { Processor } from "unified"
    
    function parseTableBlock(processor: Processor, source: string): Table | null {
        const tree = processor.parse(source) as Root
        return tree.children.find((node): node is Table => node.type === "table") ?? null
    }
    
    export default function remarkRepairTables(this: Processor) {
        const processor = this
        return (tree: Root, file: VFile) => {
            const source = String(file.value)
            visitTables(tree as unknown as Parameters<typeof visitTables>[0], null, 0, source, processor)
        }
    }

Questions

  • None.

Summary

  • Review mode: follow-up after new commits
  • The previous escaped-pipe finding was addressed. One remaining issue: the repair path reparses with a reduced markdown parser, which can regress rendering for tables that also use enabled syntax extensions. Residual risk is around parser-extension coverage for repaired table blocks.

Testing

  • Not run (automation; bun is unavailable in this runner PATH)

HAPI Bot


/** Re-parse a repaired raw table string and extract the first table node. */
function parseTableBlock(source: string): Table | null {
const tree = unified().use(remarkParse).use(remarkGfm).parse(source) as Root

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Major] parseTableBlock() reparses the repaired table with a fresh processor that only installs remarkParse and remarkGfm. The app-level markdown parser also has syntax extensions registered by later plugins, notably remarkMath; because this replacement node is parsed without those extensions, a broken table that contains $x$/$$...$$ cells is repaired but those cells become plain text instead of math nodes, so rehypeKatex will not render them.

Suggested fix:

import type { Processor } from "unified"

function parseTableBlock(processor: Processor, source: string): Table | null {
    const tree = processor.parse(source) as Root
    return tree.children.find((node): node is Table => node.type === "table") ?? null
}

export default function remarkRepairTables(this: Processor) {
    const processor = this
    return (tree: Root, file: VFile) => {
        const source = String(file.value)
        visitTables(tree as unknown as Parameters<typeof visitTables>[0], null, 0, source, processor)
    }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant