fix(web): mechanical repair of GFM tables with off-by-one separator rows#902
fix(web): mechanical repair of GFM tables with off-by-one separator rows#902heavygee wants to merge 5 commits into
Conversation
…rows Adds a remark plugin (remarkRepairTables) that runs after remark-gfm and silently fixes the dominant broken-table pattern seen in agent output: the separator row has fewer pipe-delimited cells than the header row. remark-gfm follows the GFM spec and silently truncates the table to the separator column count, dropping header and data cells. This plugin reads the original source via file.value position data, detects the mismatch, pads the separator row, and re-parses the corrected block so all columns are preserved. Analysis of 7 days of session data: 975 apparent table blocks, 879 flagged broken. Of those, 744 (84.6%) were false positives (inline pipes in prose and shell commands). The separator off-by-one pattern accounted for the majority of genuine failures (~94 of 135 real broken tables). The plugin is wired into MARKDOWN_PLUGINS and MARKDOWN_PLUGINS_WITH_BREAKS, immediately after remarkGfm where position data is available. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
- Remove unused parseTableCols helper - Assert alignment markers (:-- / --:) are preserved in repaired separator - Add header-only table test (header + broken separator, no data rows) via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
There was a problem hiding this comment.
Findings
- [Major] Raw pipe splitting rewrites valid escaped-pipe tables —
countSourceCells()treats every|byte as a column boundary, but GFM allows escaped pipes inside a cell. A valid two-column table like| A\|B | C |/|---|---|is detected as a three-column header, so the plugin pads the delimiter and re-parses it as a different table shape. Evidence:web/src/lib/remark-repair-tables.ts:36.
Summary
- Review mode: initial
- Found one correctness regression in the new table repair path.
Testing
- Not run (automation):
bunis not installed in this runner environment.
| // Strip optional surrounding pipes before splitting | ||
| const inner = (trimmed.startsWith('|') ? trimmed.slice(1) : trimmed) | ||
| const stripped = inner.endsWith('|') ? inner.slice(0, -1) : inner | ||
| return stripped.split('|').length |
There was a problem hiding this comment.
[MAJOR] countSourceCells() treats every raw | as a delimiter, but valid GFM table cells can contain escaped pipes. For example, | A\|B | C | with a two-cell separator is already a valid two-column table; this helper counts three header cells, pads the separator, and changes the rendered table shape. Please split only on unescaped pipe delimiters (and add a regression test for escaped pipes in headers/cells).
Suggested fix:
function splitTableRowCells(line: string): string[] {
const trimmed = line.trim()
const body = trimmed.startsWith('|') ? trimmed.slice(1) : trimmed
const source = body.endsWith('|') ? body.slice(0, -1) : body
const cells: string[] = []
let current = ''
let escaped = false
for (const ch of source) {
if (escaped) {
current += ch
escaped = false
continue
}
if (ch === '\\') {
current += ch
escaped = true
continue
}
if (ch === '|') {
cells.push(current)
current = ''
continue
}
current += ch
}
cells.push(current)
return cells
}
function countSourceCells(line: string): number {
return splitTableRowCells(line).length
}…tion - Drop repairCount from visitTables — increment was never read at call site - Add per-row cell count assertion to the 3-column repair test to catch structural regressions that content-presence checks would miss via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
… tests Off-by-N (4-column), alignment-hints, and header-only tests now verify each output row has the correct number of cells, not just content presence. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
There was a problem hiding this comment.
Findings
- [Major] Escaped header pipes trigger false repairs —
countSourceCellssplits every|, so a valid two-column GFM table like| A \| B | C |with|---|---|is treated as three source columns. The plugin then pads the separator and re-parses, changing an already-valid table by adding a bogus empty column. Evidenceweb/src/lib/remark-repair-tables.ts:36
Suggested fix:function countSourceCells(line: string): number { const trimmed = line.trim() const inner = trimmed.startsWith('|') ? trimmed.slice(1) : trimmed const stripped = inner.endsWith('|') ? inner.slice(0, -1) : inner let cells = 1 let escaped = false for (const ch of stripped) { if (escaped) { escaped = false continue } if (ch === '\\') { escaped = true continue } if (ch === '|') cells++ } return cells }
Summary
- Review mode: initial
- One issue found: valid tables with escaped literal pipes can be misclassified and rewritten. Residual risk remains around broader GFM table tokenization edge cases unless covered by tests.
Testing
- Not run (automation; Bun unavailable in this runner PATH)
HAPI Bot
| // Strip optional surrounding pipes before splitting | ||
| const inner = (trimmed.startsWith('|') ? trimmed.slice(1) : trimmed) | ||
| const stripped = inner.endsWith('|') ? inner.slice(0, -1) : inner | ||
| return stripped.split('|').length |
There was a problem hiding this comment.
[Major] Escaped literal pipes in a header cell are valid GFM (\|), but this helper counts every pipe byte. That makes a valid two-column table such as | A \| B | C | / |---|---| look like three source columns, so the repair path pads the delimiter row and re-parses it as a different table with an extra empty column.
Suggested fix:
function countSourceCells(line: string): number {
const trimmed = line.trim()
const inner = trimmed.startsWith('|') ? trimmed.slice(1) : trimmed
const stripped = inner.endsWith('|') ? inner.slice(0, -1) : inner
let cells = 1
let escaped = false
for (const ch of stripped) {
if (escaped) {
escaped = false
continue
}
if (ch === '\\') {
escaped = true
continue
}
if (ch === '|') cells++
}
return cells
}…airs
\| inside a GFM table cell is a literal pipe character, not a cell
delimiter. The previous split('|') approach miscounted cells in headers
like | A \| B | C |, treating a valid 2-column table as 3-column and
padding the separator unnecessarily.
Replaces the split with a character-scan that tracks escape state.
Adds a test asserting the separator column count stays at 2 for tables
with escaped pipes in the header.
via [HAPI](https://hapi.run)
Co-Authored-By: HAPI <noreply@hapi.run>
There was a problem hiding this comment.
Findings
- [Major] Repaired tables lose parser-extension nodes —
parseTableBlock()builds a fresh parser with onlyremarkParse+remarkGfm, while the app markdown pipeline also has parser extensions such asremarkMath. When a broken table contains inline math, the original parse would produce math nodes, but the repaired replacement table is parsed as plain text, sorehypeKatexno longer renders it. Evidenceweb/src/lib/remark-repair-tables.ts:74
Suggested fix:import type { Processor } from "unified" function parseTableBlock(processor: Processor, source: string): Table | null { const tree = processor.parse(source) as Root return tree.children.find((node): node is Table => node.type === "table") ?? null } export default function remarkRepairTables(this: Processor) { const processor = this return (tree: Root, file: VFile) => { const source = String(file.value) visitTables(tree as unknown as Parameters<typeof visitTables>[0], null, 0, source, processor) } }
Questions
- None.
Summary
- Review mode: follow-up after new commits
- The previous escaped-pipe finding was addressed. One remaining issue: the repair path reparses with a reduced markdown parser, which can regress rendering for tables that also use enabled syntax extensions. Residual risk is around parser-extension coverage for repaired table blocks.
Testing
- Not run (automation;
bunis unavailable in this runner PATH)
HAPI Bot
|
|
||
| /** Re-parse a repaired raw table string and extract the first table node. */ | ||
| function parseTableBlock(source: string): Table | null { | ||
| const tree = unified().use(remarkParse).use(remarkGfm).parse(source) as Root |
There was a problem hiding this comment.
[Major] parseTableBlock() reparses the repaired table with a fresh processor that only installs remarkParse and remarkGfm. The app-level markdown parser also has syntax extensions registered by later plugins, notably remarkMath; because this replacement node is parsed without those extensions, a broken table that contains $x$/$$...$$ cells is repaired but those cells become plain text instead of math nodes, so rehypeKatex will not render them.
Suggested fix:
import type { Processor } from "unified"
function parseTableBlock(processor: Processor, source: string): Table | null {
const tree = processor.parse(source) as Root
return tree.children.find((node): node is Table => node.type === "table") ?? null
}
export default function remarkRepairTables(this: Processor) {
const processor = this
return (tree: Root, file: VFile) => {
const source = String(file.value)
visitTables(tree as unknown as Parameters<typeof visitTables>[0], null, 0, source, processor)
}
}
Summary
Adds a remark plugin (`remarkRepairTables`) that silently fixes the dominant broken-table pattern in agent output: the GFM separator row has fewer pipe-delimited cells than the header row.
`remark-gfm` follows the GFM spec and truncates the table to the separator column count, silently dropping header and data cells. This plugin reads the original source via `file.value` + `node.position` offsets, detects the column mismatch, pads the separator with `---` cells, and re-parses just that table block. Only broken tables are touched; valid tables, prose with pipes, and code spans are unaffected.
The plugin runs immediately after `remarkGfm` in both `MARKDOWN_PLUGINS` and `MARKDOWN_PLUGINS_WITH_BREAKS`, where `node.position` source offsets are still accurate.
Analysis of real session data: 975 apparent table blocks over 7 days; 879 flagged for separator mismatch; 744 (84.6%) were false positives (inline pipes in prose/shell). The separator off-by-one pattern accounted for ~94 of 135 confirmed broken tables.
Test plan
Disclosure
This PR was developed with assistance from Claude Sonnet 4.6 (claude-sonnet-4-6) acting as a peer agent in a multi-agent workflow.