Skip to content

alganet/pars

Repository files navigation

pars

Pure POSIX shell parser generator. Reads Extended BNF grammars and generates table-driven parsers as shell scripts with zero external dependencies.

Features

  • Pure shell -- PATH= is cleared; no external commands used at all
  • 13 shells tested -- ash, bash, dash, ksh93, loksh, mksh, oksh, yash, zsh (multiple versions each; see pshells.txt)
  • Zero dependencies -- only shell builtins and command -p fallbacks
  • Table-driven LL(1) state machine with character-level dispatch
  • Precedence climbing for binary, unary, postfix, and ternary operators
  • AST output as V<n>/X<n> shell variable assignments
  • Built-in emitter reconstructs source from AST (round-trip support)
  • Error reporting with line:column and visual caret

Quick Start

1. Write a grammar

Here is bnf/sexp.bnf, the simplest included grammar (S-expressions):

#!name sexp
#!prefix Sx
#!whitespace skip

value = list | str | number | symbol ;
list = '(' { value } ')' ;
str = '"' strcontent '"' ;
strcontent = [^"]* ;
number = [0-9.]+ ;
symbol = [a-zA-Z_+\-*!?]+ ;

2. Generate a parser

All commands should be run from the project root directory:

sh entrypoint.sh bnf_parser < bnf/sexp.bnf | sh entrypoint.sh bnf_gen generated/lang/sexp

The pipeline is: bnf_parser parses the grammar into AST, then bnf_gen generates the parser module (parser.sh, unast.sh, reast.sh) into the target directory.

3. Parse input

$ echo '(+ 1 2)' | sh entrypoint.sh lang_sexp_parser
X0='Sa 2'
X2='Sc 3 4 5'
V3='+'
X3='Sg'
V4='1'
X4='Sf'
V5='2'
X5='Sf'

Each X<n> line defines a node: X<n>='<type> [<child> ...]'. Each V<n> line holds the text value of a node.

Reading the output above: node 0 (document, type Sa) has child 2. Node 2 (list, type Sc) has children 3, 4, 5 -- the symbol +, number 1, and number 2.

4. Round-trip (parse then reconstruct)

$ echo '(+ 1 2)' | sh entrypoint.sh lang_sexp_parser | sh entrypoint.sh lang_sexp_unast
(+ 1 2)

5. Regenerate all parsers from grammars

sh bnf/regen.sh

This regenerates every generated/lang/<name>/ parser from its corresponding bnf/<name>.bnf.

Grammar Reference

Rule Syntax

Rules have the form:

name = body ;

Names are [a-zA-Z_][a-zA-Z0-9_]*. Each rule defines a parsing rule that becomes a state in the generated parser.

Comments

Lines starting with # (that are not directives) are comments and ignored. Comments work anywhere in the grammar -- between rules, after rule terminators, or mid-rule after a token:

# This is a comment
value = 'true'   # inline comment
       | 'false' ;

Elements

Syntax Name Meaning
'x' or "kw" Terminal Match literal character(s)
name Identifier Reference another rule
a | b Alternation Match one of several alternatives
a b Sequence Match elements in order
[ expr ] Optional Match zero or one time (space after [)
{ expr } Repetition Match zero or more times
( expr ) Grouping Override precedence
[a-z] Character class Match character set (no space after [)
[^...] Negated class Match any character NOT in set
* Star Zero or more (postfix quantifier)
+ Plus One or more (postfix quantifier)
? Question Zero or one (postfix quantifier)

The [ character is disambiguated by what follows it:

  • [ expr ] (space after [) is an optional group
  • [a-z] (no space) is a character class

Directives

Directives are lines starting with #! at the top of the grammar file.

#!name <name>

Sets the grammar name. Used in generated function names: <name>_parser, <name>_unast, <name>_reast.

#!name json

#!prefix <prefix>

Sets the 1-2 character prefix for state codes. Each rule gets a unique 2-character code starting with this prefix (e.g., J produces Ja, Jb, Jc, ...). Must be unique across parsers loaded together.

#!prefix J

#!whitespace <mode>

Controls whitespace handling. Modes:

  • token -- skip spaces, tabs, and newlines between tokens (default)
  • skip -- like token, plus whitespace around precedence climbing operators
  • line -- skip spaces and tabs between tokens (not newlines)
#!whitespace skip

#!string <rule> [<close> [escape]]

Marks a rule as a string accumulator. The generated parser uses a fast path to bulk-accumulate characters between delimiters.

  • Default close delimiter is ", default escape mode is JSON-style (\ + "/\///n/r/t/b/f/u)
  • Specify <close> to use a different delimiter
  • Specify escape for simple backslash escaping (any char after \ is literal)
# JSON strings: "..." with JSON escape sequences
#!string string

# Single-quoted strings: '...' with simple \ escapes
#!string sqstr ' escape

The rule itself should define only the delimiters:

string = '"' '"' ;
sqstr = "'" "'" ;

Content between delimiters is accumulated by the fast-path, not parsed character-by-character.

#!number <rule>

Marks a rule as a number accumulator. The generated parser uses a fast path to bulk-accumulate digit characters, similar to #!string for strings.

A rule literally named number is auto-detected, so this directive is only needed if your number rule has a different name (e.g., #!number digits).

#!number digits

#!validate

Enable RFC 8259 number format validation: [-](0|[1-9]\d*)[.\d+][eE[+-]\d+]. Works in conjunction with #!number.

#!validate number

#!strict

Enable strict mode: reject trailing commas in arrays/objects, require colons in members.

#!strict

#!comment <start> [<end>]

Define comments to be skipped during parsing. Supports both line comments and block comments:

# Line comment: skip from start to end of line
#!comment #
#!comment //

# Block comment: skip from start to end delimiter
#!comment /* */

Multi-character start tokens work (e.g., //). For block comments, provide the end delimiter as a second argument. Only one #!comment directive is supported per grammar.

#!keywords <ident_rule> <KW1> <KW2> ...

Define case-insensitive keywords. The first argument is the identifier rule that keywords fall back to. Keywords are stored uppercase internally and matched case-insensitively.

#!keywords ident SELECT FROM WHERE AND OR NOT NULL TRUE FALSE

When input matches a keyword, it produces a keyword node type. When it does not match any keyword, it falls back to the identifier rule.

#!precedence <rule> <op> <prec> <assoc> ...

Define binary operator precedence climbing. Arguments are repeated triples of operator, precedence level (integer), and associativity (left or right).

#!precedence expr + 1 left - 1 left * 2 left / 2 left ** 3 right

Higher precedence numbers bind tighter. The <rule> must have atom as its body (e.g., expr = atom ;). The generator creates the precedence-climbing loop automatically.

#!unary <rule> <op> <prec>

Define a unary prefix operator.

#!unary expr - 4
#!unary expr ! 4

#!postfix <rule> <open> [<inner> [<close>]]

Define postfix operators like function calls, member access, or suffix operators.

# Function call: expr(args)
#!postfix expr ( arg_list )

# Member access: expr.ident
#!postfix expr . ident

# Suffix operator (no inner rule or close): expr++
#!postfix expr ++

When no inner rule or close token is provided, the postfix node is immediately closed after creation (the operand is the sole child).

#!ternary <rule> <open> <sep> <prec>

Define a ternary conditional operator.

#!ternary expr ? : 0

#!extern <rule>

Mark a rule as externally implemented. The generator skips code generation for this rule; you provide the implementation yourself.

Example Grammars

S-expressions (minimal)

#!name sexp
#!prefix Sx
#!whitespace skip

value = list | str | number | symbol ;
list = '(' { value } ')' ;
str = '"' strcontent '"' ;
strcontent = [^"]* ;
number = [0-9.]+ ;
symbol = [a-zA-Z_+\-*!?]+ ;

JSON (data format with validation)

#!name json
#!prefix J
#!whitespace skip
#!string string
#!validate number
#!strict

value = object | array | string | number | 'true' | 'false' | 'null' ;
object = '{' [ member { ',' member } ] '}' ;
member = string ':' value ;
array = '[' [ value { ',' value } ] ']' ;
string = '"' '"' ;
number = [0-9eE.+\-]+ ;

Sample parse:

$ printf '{"a": 1}' | sh entrypoint.sh lang_json_parser
X0='Jd 1'
X1='Jo 2'
X2='Jm 3 4'
V3='a'
X3='Js'
V4='1'
X4='Jn'

Node 0 (Jd, document) has child 1 (Jo, object). The object contains node 2 (Jm, member) with children: node 3 (Js, string "a") and node 4 (Jn, number 1).

Expressions (operators, precedence, postfix, ternary)

#!name expr
#!prefix Ex
#!whitespace skip
#!precedence expr + 1 left - 1 left * 2 left / 2 left ** 3 right
#!postfix expr . ident
#!postfix expr ( arg_list )
#!unary expr - 4
#!unary expr ! 4
#!ternary expr ? : 0
#!string dqstr
#!string sqstr ' escape

expr = atom ;
atom = ident | number | group | dqstr | sqstr ;
ident = [a-zA-Z_]+ ;
group = '(' expr ')' ;
number = [0-9]+ ;
arg_list = expr { ',' expr } ;
dqstr = '"' '"' ;
sqstr = "'" "'" ;

Sample parse showing precedence:

$ printf '2 + 3 * 4' | sh entrypoint.sh lang_expr_parser
X0='Ea 1'
X1='Eb 4'
V4='+'
X4='El 3 7'
V3='2'
X3='Ef'
V7='*'
X7='El 6 9'
V6='3'
X6='Ef'
V9='4'
X9='Ef'

Node 4 (El, binary op +) has children 3 (number 2) and 7. Node 7 (El, binary op *) has children 6 (number 3) and 9 (number 4). The * binds tighter than +, as expected from the precedence declaration.

AST Format

Parsers output AST as shell variable assignments on stdout:

V<n>='<value>'    -- text content of node n
X<n>='<type> [<child_indices>...]'  -- node type and children
  • Node 0 is always the root (document node)
  • V<n> is only present if the node has text content
  • X<n> is always present; the first word is the type code, remaining words are child node indices
  • Type codes are 2-character strings matching the #!prefix + a letter (e.g., Ja = JSON document, Jf = JSON string, Jg = JSON number)

Evaluating AST output

Since the output is valid shell, you can eval it to load the tree:

eval "$(printf '{"a": 1}' | sh entrypoint.sh lang_json_parser)"
echo "$X0"   # Jd 1
echo "$X1"   # Jo 2
echo "$V3"   # a

Emitter (AST to source)

Each parser module includes an emitter that reconstructs source from AST:

  • <name>_unast -- reads AST assignments from stdin, prints source
  • <name>_reast -- round-trip: parse stdin then emit source
# Reconstruct from AST
echo '{"a": 1}' | sh entrypoint.sh lang_json_parser | sh entrypoint.sh lang_json_unast

# Round-trip in one command
echo '{"a": 1}' | sh entrypoint.sh lang_json_reast

Using Parsers as Libraries

Use entrypoint.sh to load modules, then call parser functions directly:

. ./core/header.sh
. ./core/use.sh
_LIB=./modules:./generated

use ast_engine
use lang_json_parser
use lang_json_unast

# Now call parser functions directly
echo '42' | lang_json_parser     # prints AST
echo '42' | lang_json_reast      # round-trip

The use function loads modules by converting underscores to paths (e.g., lang_json_parser loads modules/lang/json/parser.sh) and tracks loaded modules to avoid double-sourcing.

Testing

Run tests for a single parser

sh tests_entrypoint.sh json

Filter tests by name

sh tests_entrypoint.sh json "string"    # only run tests matching "string"

Multi-shell testing via Docker

sh ptest                   # all test files, all shells
sh ptest json bnf          # specific test modules

Without arguments, ptest runs all test files (infrastructure, hand-written parsers, and generated parsers).

This uses alganet/shell-versions:all to run tests across 13 shells defined in pshells.txt.

Test functions

Tests use the framework in test/sh.sh:

Function Usage Purpose
is is "name" "input" "expected_ast" Assert parse output matches
rt rt "name" "input" ["expected"] Assert round-trip reconstructs input (or expected)
fail fail "name" "input" Assert that parsing fails
err err "name" "input" "msg" line col Assert specific error message and position

Example test file (test/json.sh):

#!/bin/sh
set -euf

_mod_add _MOD core_header; _mod_add _MOD core_use

. "$_DIR/test/sh.sh"
use ast_engine
use lang_json_parser
use lang_json_unast

_PARS=lang_json_parser _UNAST=lang_json_unast _PRINTIN=_printn1

test_json () {
	PASS=0 FAIL=0 SKIP=0 FILTER="${1:-}"

	is "null" 'null' "X0='Jd 1'
X1='Ju'"

	rt "object" '{"a": 1}'

	fail "trailing comma" '{"a": 1,}'

	_printr1 "# $PASS passed, $FAIL failed, $SKIP skipped"
}

Tests are wrapped in a test_<name> function and loaded via tests_entrypoint.sh, which handles module setup and invocation.

Shell Compatibility

Tested across 13 shell versions via Docker:

Shell Versions
ash (BusyBox) 1.26.2, 1.37.0
bash 2.05b.13, 5.3.9
dash 0.5.5.1, 0.5.13
ksh93 93u+m v1.0.3, v1.0.10
loksh 6.8.1
mksh R45
oksh 6.8.1
yash 2.60
zsh 5.9

Key portability mechanisms:

  • PATH= cleared so only builtins are used
  • shopt -s expand_aliases for bash; setopt options for zsh
  • alias local=typeset fallback for ksh/mksh
  • Output detection cascade: printf > print > echo > command -p printf
  • core/footer.sh fixes ksh93 recursive function scoping

See INTERNALS.md for details.

Additional

C89 Compiler (modules/lang/c89/compiler.sh)

Reads C89 AST from stdin, emits x86-64 ELF binary to stdout.

Grammars (bnf/)

File Purpose
*.bnf Grammar definitions for generated parsers
regen.sh Regenerate all generated/lang/<name>/ parsers from grammars

Infrastructure

File Purpose
ptest Multi-shell test runner (Docker)
pshells.txt Shell version manifest for ptest

License

ISC

About

Experimental prototype. UNMAINTAINED.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors