Pure POSIX shell parser generator. Reads Extended BNF grammars and generates table-driven parsers as shell scripts with zero external dependencies.
- Pure shell --
PATH=is cleared; no external commands used at all - 13 shells tested -- ash, bash, dash, ksh93, loksh, mksh, oksh, yash, zsh
(multiple versions each; see
pshells.txt) - Zero dependencies -- only shell builtins and
command -pfallbacks - Table-driven LL(1) state machine with character-level dispatch
- Precedence climbing for binary, unary, postfix, and ternary operators
- AST output as
V<n>/X<n>shell variable assignments - Built-in emitter reconstructs source from AST (round-trip support)
- Error reporting with line:column and visual caret
Here is bnf/sexp.bnf, the simplest included grammar (S-expressions):
#!name sexp
#!prefix Sx
#!whitespace skip
value = list | str | number | symbol ;
list = '(' { value } ')' ;
str = '"' strcontent '"' ;
strcontent = [^"]* ;
number = [0-9.]+ ;
symbol = [a-zA-Z_+\-*!?]+ ;
All commands should be run from the project root directory:
sh entrypoint.sh bnf_parser < bnf/sexp.bnf | sh entrypoint.sh bnf_gen generated/lang/sexpThe pipeline is: bnf_parser parses the grammar into AST, then bnf_gen
generates the parser module (parser.sh, unast.sh, reast.sh) into the target
directory.
$ echo '(+ 1 2)' | sh entrypoint.sh lang_sexp_parser
X0='Sa 2'
X2='Sc 3 4 5'
V3='+'
X3='Sg'
V4='1'
X4='Sf'
V5='2'
X5='Sf'Each X<n> line defines a node: X<n>='<type> [<child> ...]'.
Each V<n> line holds the text value of a node.
Reading the output above: node 0 (document, type Sa) has child 2.
Node 2 (list, type Sc) has children 3, 4, 5 -- the symbol +, number 1,
and number 2.
$ echo '(+ 1 2)' | sh entrypoint.sh lang_sexp_parser | sh entrypoint.sh lang_sexp_unast
(+ 1 2)sh bnf/regen.shThis regenerates every generated/lang/<name>/ parser from its corresponding
bnf/<name>.bnf.
Rules have the form:
name = body ;
Names are [a-zA-Z_][a-zA-Z0-9_]*. Each rule defines a parsing rule that
becomes a state in the generated parser.
Lines starting with # (that are not directives) are comments and ignored.
Comments work anywhere in the grammar -- between rules, after rule
terminators, or mid-rule after a token:
# This is a comment
value = 'true' # inline comment
| 'false' ;
| Syntax | Name | Meaning |
|---|---|---|
'x' or "kw" |
Terminal | Match literal character(s) |
name |
Identifier | Reference another rule |
a | b |
Alternation | Match one of several alternatives |
a b |
Sequence | Match elements in order |
[ expr ] |
Optional | Match zero or one time (space after [) |
{ expr } |
Repetition | Match zero or more times |
( expr ) |
Grouping | Override precedence |
[a-z] |
Character class | Match character set (no space after [) |
[^...] |
Negated class | Match any character NOT in set |
* |
Star | Zero or more (postfix quantifier) |
+ |
Plus | One or more (postfix quantifier) |
? |
Question | Zero or one (postfix quantifier) |
The [ character is disambiguated by what follows it:
[ expr ](space after[) is an optional group[a-z](no space) is a character class
Directives are lines starting with #! at the top of the grammar file.
Sets the grammar name. Used in generated function names: <name>_parser,
<name>_unast, <name>_reast.
#!name json
Sets the 1-2 character prefix for state codes. Each rule gets a unique
2-character code starting with this prefix (e.g., J produces Ja, Jb,
Jc, ...). Must be unique across parsers loaded together.
#!prefix J
Controls whitespace handling. Modes:
token-- skip spaces, tabs, and newlines between tokens (default)skip-- liketoken, plus whitespace around precedence climbing operatorsline-- skip spaces and tabs between tokens (not newlines)
#!whitespace skip
Marks a rule as a string accumulator. The generated parser uses a fast path to bulk-accumulate characters between delimiters.
- Default close delimiter is
", default escape mode is JSON-style (\+"/\///n/r/t/b/f/u) - Specify
<close>to use a different delimiter - Specify
escapefor simple backslash escaping (any char after\is literal)
# JSON strings: "..." with JSON escape sequences
#!string string
# Single-quoted strings: '...' with simple \ escapes
#!string sqstr ' escape
The rule itself should define only the delimiters:
string = '"' '"' ;
sqstr = "'" "'" ;
Content between delimiters is accumulated by the fast-path, not parsed character-by-character.
Marks a rule as a number accumulator. The generated parser uses a fast path
to bulk-accumulate digit characters, similar to #!string for strings.
A rule literally named number is auto-detected, so this directive is only
needed if your number rule has a different name (e.g., #!number digits).
#!number digits
Enable RFC 8259 number format validation:
[-](0|[1-9]\d*)[.\d+][eE[+-]\d+]. Works in conjunction with #!number.
#!validate number
Enable strict mode: reject trailing commas in arrays/objects, require colons in members.
#!strict
Define comments to be skipped during parsing. Supports both line comments and block comments:
# Line comment: skip from start to end of line
#!comment #
#!comment //
# Block comment: skip from start to end delimiter
#!comment /* */
Multi-character start tokens work (e.g., //). For block comments, provide
the end delimiter as a second argument. Only one #!comment directive is
supported per grammar.
Define case-insensitive keywords. The first argument is the identifier rule that keywords fall back to. Keywords are stored uppercase internally and matched case-insensitively.
#!keywords ident SELECT FROM WHERE AND OR NOT NULL TRUE FALSE
When input matches a keyword, it produces a keyword node type. When it does not match any keyword, it falls back to the identifier rule.
Define binary operator precedence climbing. Arguments are repeated triples
of operator, precedence level (integer), and associativity (left or right).
#!precedence expr + 1 left - 1 left * 2 left / 2 left ** 3 right
Higher precedence numbers bind tighter. The <rule> must have atom as its
body (e.g., expr = atom ;). The generator creates the precedence-climbing
loop automatically.
Define a unary prefix operator.
#!unary expr - 4
#!unary expr ! 4
Define postfix operators like function calls, member access, or suffix operators.
# Function call: expr(args)
#!postfix expr ( arg_list )
# Member access: expr.ident
#!postfix expr . ident
# Suffix operator (no inner rule or close): expr++
#!postfix expr ++
When no inner rule or close token is provided, the postfix node is immediately closed after creation (the operand is the sole child).
Define a ternary conditional operator.
#!ternary expr ? : 0
Mark a rule as externally implemented. The generator skips code generation for this rule; you provide the implementation yourself.
#!name sexp
#!prefix Sx
#!whitespace skip
value = list | str | number | symbol ;
list = '(' { value } ')' ;
str = '"' strcontent '"' ;
strcontent = [^"]* ;
number = [0-9.]+ ;
symbol = [a-zA-Z_+\-*!?]+ ;
#!name json
#!prefix J
#!whitespace skip
#!string string
#!validate number
#!strict
value = object | array | string | number | 'true' | 'false' | 'null' ;
object = '{' [ member { ',' member } ] '}' ;
member = string ':' value ;
array = '[' [ value { ',' value } ] ']' ;
string = '"' '"' ;
number = [0-9eE.+\-]+ ;
Sample parse:
$ printf '{"a": 1}' | sh entrypoint.sh lang_json_parser
X0='Jd 1'
X1='Jo 2'
X2='Jm 3 4'
V3='a'
X3='Js'
V4='1'
X4='Jn'Node 0 (Jd, document) has child 1 (Jo, object). The object contains
node 2 (Jm, member) with children: node 3 (Js, string "a") and
node 4 (Jn, number 1).
#!name expr
#!prefix Ex
#!whitespace skip
#!precedence expr + 1 left - 1 left * 2 left / 2 left ** 3 right
#!postfix expr . ident
#!postfix expr ( arg_list )
#!unary expr - 4
#!unary expr ! 4
#!ternary expr ? : 0
#!string dqstr
#!string sqstr ' escape
expr = atom ;
atom = ident | number | group | dqstr | sqstr ;
ident = [a-zA-Z_]+ ;
group = '(' expr ')' ;
number = [0-9]+ ;
arg_list = expr { ',' expr } ;
dqstr = '"' '"' ;
sqstr = "'" "'" ;
Sample parse showing precedence:
$ printf '2 + 3 * 4' | sh entrypoint.sh lang_expr_parser
X0='Ea 1'
X1='Eb 4'
V4='+'
X4='El 3 7'
V3='2'
X3='Ef'
V7='*'
X7='El 6 9'
V6='3'
X6='Ef'
V9='4'
X9='Ef'Node 4 (El, binary op +) has children 3 (number 2) and 7.
Node 7 (El, binary op *) has children 6 (number 3) and 9 (number 4).
The * binds tighter than +, as expected from the precedence declaration.
Parsers output AST as shell variable assignments on stdout:
V<n>='<value>' -- text content of node n
X<n>='<type> [<child_indices>...]' -- node type and children
- Node 0 is always the root (document node)
V<n>is only present if the node has text contentX<n>is always present; the first word is the type code, remaining words are child node indices- Type codes are 2-character strings matching the
#!prefix+ a letter (e.g.,Ja= JSON document,Jf= JSON string,Jg= JSON number)
Since the output is valid shell, you can eval it to load the tree:
eval "$(printf '{"a": 1}' | sh entrypoint.sh lang_json_parser)"
echo "$X0" # Jd 1
echo "$X1" # Jo 2
echo "$V3" # aEach parser module includes an emitter that reconstructs source from AST:
<name>_unast-- reads AST assignments from stdin, prints source<name>_reast-- round-trip: parse stdin then emit source
# Reconstruct from AST
echo '{"a": 1}' | sh entrypoint.sh lang_json_parser | sh entrypoint.sh lang_json_unast
# Round-trip in one command
echo '{"a": 1}' | sh entrypoint.sh lang_json_reastUse entrypoint.sh to load modules, then call parser functions directly:
. ./core/header.sh
. ./core/use.sh
_LIB=./modules:./generated
use ast_engine
use lang_json_parser
use lang_json_unast
# Now call parser functions directly
echo '42' | lang_json_parser # prints AST
echo '42' | lang_json_reast # round-tripThe use function loads modules by converting underscores to paths
(e.g., lang_json_parser loads modules/lang/json/parser.sh) and tracks loaded
modules to avoid double-sourcing.
sh tests_entrypoint.sh jsonsh tests_entrypoint.sh json "string" # only run tests matching "string"sh ptest # all test files, all shells
sh ptest json bnf # specific test modulesWithout arguments, ptest runs all test files (infrastructure, hand-written
parsers, and generated parsers).
This uses alganet/shell-versions:all to run tests across 13 shells
defined in pshells.txt.
Tests use the framework in test/sh.sh:
| Function | Usage | Purpose |
|---|---|---|
is |
is "name" "input" "expected_ast" |
Assert parse output matches |
rt |
rt "name" "input" ["expected"] |
Assert round-trip reconstructs input (or expected) |
fail |
fail "name" "input" |
Assert that parsing fails |
err |
err "name" "input" "msg" line col |
Assert specific error message and position |
Example test file (test/json.sh):
#!/bin/sh
set -euf
_mod_add _MOD core_header; _mod_add _MOD core_use
. "$_DIR/test/sh.sh"
use ast_engine
use lang_json_parser
use lang_json_unast
_PARS=lang_json_parser _UNAST=lang_json_unast _PRINTIN=_printn1
test_json () {
PASS=0 FAIL=0 SKIP=0 FILTER="${1:-}"
is "null" 'null' "X0='Jd 1'
X1='Ju'"
rt "object" '{"a": 1}'
fail "trailing comma" '{"a": 1,}'
_printr1 "# $PASS passed, $FAIL failed, $SKIP skipped"
}Tests are wrapped in a test_<name> function and loaded via
tests_entrypoint.sh, which handles module setup and invocation.
Tested across 13 shell versions via Docker:
| Shell | Versions |
|---|---|
| ash (BusyBox) | 1.26.2, 1.37.0 |
| bash | 2.05b.13, 5.3.9 |
| dash | 0.5.5.1, 0.5.13 |
| ksh93 | 93u+m v1.0.3, v1.0.10 |
| loksh | 6.8.1 |
| mksh | R45 |
| oksh | 6.8.1 |
| yash | 2.60 |
| zsh | 5.9 |
Key portability mechanisms:
PATH=cleared so only builtins are usedshopt -s expand_aliasesfor bash;setoptoptions for zshalias local=typesetfallback for ksh/mksh- Output detection cascade:
printf>print>echo>command -p printf core/footer.shfixes ksh93 recursive function scoping
See INTERNALS.md for details.
Reads C89 AST from stdin, emits x86-64 ELF binary to stdout.
| File | Purpose |
|---|---|
*.bnf |
Grammar definitions for generated parsers |
regen.sh |
Regenerate all generated/lang/<name>/ parsers from grammars |
| File | Purpose |
|---|---|
ptest |
Multi-shell test runner (Docker) |
pshells.txt |
Shell version manifest for ptest |
ISC