pars

Pure POSIX shell parser generator. Reads Extended BNF grammars and generates table-driven parsers as shell scripts with zero external dependencies.

Features

Pure shell -- PATH= is cleared; no external commands used at all
13 shells tested -- ash, bash, dash, ksh93, loksh, mksh, oksh, yash, zsh (multiple versions each; see pshells.txt)
Zero dependencies -- only shell builtins and command -p fallbacks
Table-driven LL(1) state machine with character-level dispatch
Precedence climbing for binary, unary, postfix, and ternary operators
AST output as V<n>/X<n> shell variable assignments
Built-in emitter reconstructs source from AST (round-trip support)
Error reporting with line:column and visual caret

Quick Start

1. Write a grammar

Here is bnf/sexp.bnf, the simplest included grammar (S-expressions):

#!name sexp
#!prefix Sx
#!whitespace skip

value = list | str | number | symbol ;
list = '(' { value } ')' ;
str = '"' strcontent '"' ;
strcontent = [^"]* ;
number = [0-9.]+ ;
symbol = [a-zA-Z_+\-*!?]+ ;

2. Generate a parser

All commands should be run from the project root directory:

sh entrypoint.sh bnf_parser < bnf/sexp.bnf | sh entrypoint.sh bnf_gen generated/lang/sexp

The pipeline is: bnf_parser parses the grammar into AST, then bnf_gen generates the parser module (parser.sh, unast.sh, reast.sh) into the target directory.

3. Parse input

$ echo '(+ 1 2)' | sh entrypoint.sh lang_sexp_parser
X0='Sa 2'
X2='Sc 3 4 5'
V3='+'
X3='Sg'
V4='1'
X4='Sf'
V5='2'
X5='Sf'

Each X<n> line defines a node: X<n>='<type> [<child> ...]'. Each V<n> line holds the text value of a node.

Reading the output above: node 0 (document, type Sa) has child 2. Node 2 (list, type Sc) has children 3, 4, 5 -- the symbol +, number 1, and number 2.

4. Round-trip (parse then reconstruct)

$ echo '(+ 1 2)' | sh entrypoint.sh lang_sexp_parser | sh entrypoint.sh lang_sexp_unast
(+ 1 2)

5. Regenerate all parsers from grammars

sh bnf/regen.sh

This regenerates every generated/lang/<name>/ parser from its corresponding bnf/<name>.bnf.

Grammar Reference

Rule Syntax

Rules have the form:

name = body ;

Names are [a-zA-Z_][a-zA-Z0-9_]*. Each rule defines a parsing rule that becomes a state in the generated parser.

Comments

Lines starting with # (that are not directives) are comments and ignored. Comments work anywhere in the grammar -- between rules, after rule terminators, or mid-rule after a token:

# This is a comment
value = 'true'   # inline comment
       | 'false' ;

Elements

Syntax	Name	Meaning
`'x'` or `"kw"`	Terminal	Match literal character(s)
`name`	Identifier	Reference another rule
`a \| b`	Alternation	Match one of several alternatives
`a b`	Sequence	Match elements in order
`[ expr ]`	Optional	Match zero or one time (space after `[`)
`{ expr }`	Repetition	Match zero or more times
`( expr )`	Grouping	Override precedence
`[a-z]`	Character class	Match character set (no space after `[`)
`[^...]`	Negated class	Match any character NOT in set
`*`	Star	Zero or more (postfix quantifier)
`+`	Plus	One or more (postfix quantifier)
`?`	Question	Zero or one (postfix quantifier)

The [ character is disambiguated by what follows it:

[ expr ] (space after [) is an optional group
[a-z] (no space) is a character class

Directives

Directives are lines starting with #! at the top of the grammar file.

`#!name <name>`

Sets the grammar name. Used in generated function names: <name>_parser, <name>_unast, <name>_reast.

#!name json

`#!prefix <prefix>`

Sets the 1-2 character prefix for state codes. Each rule gets a unique 2-character code starting with this prefix (e.g., J produces Ja, Jb, Jc, ...). Must be unique across parsers loaded together.

#!prefix J

`#!whitespace <mode>`

Controls whitespace handling. Modes:

token -- skip spaces, tabs, and newlines between tokens (default)
skip -- like token, plus whitespace around precedence climbing operators
line -- skip spaces and tabs between tokens (not newlines)

#!whitespace skip

`#!string <rule> [<close> [escape]]`

Marks a rule as a string accumulator. The generated parser uses a fast path to bulk-accumulate characters between delimiters.

Default close delimiter is ", default escape mode is JSON-style (\ + "/\///n/r/t/b/f/u)
Specify <close> to use a different delimiter
Specify escape for simple backslash escaping (any char after \ is literal)

# JSON strings: "..." with JSON escape sequences
#!string string

# Single-quoted strings: '...' with simple \ escapes
#!string sqstr ' escape

The rule itself should define only the delimiters:

string = '"' '"' ;
sqstr = "'" "'" ;

Content between delimiters is accumulated by the fast-path, not parsed character-by-character.

`#!number <rule>`

Marks a rule as a number accumulator. The generated parser uses a fast path to bulk-accumulate digit characters, similar to #!string for strings.

A rule literally named number is auto-detected, so this directive is only needed if your number rule has a different name (e.g., #!number digits).

#!number digits

`#!validate`

Enable RFC 8259 number format validation: [-](0|[1-9]\d*)[.\d+][eE[+-]\d+]. Works in conjunction with #!number.

#!validate number

`#!strict`

Enable strict mode: reject trailing commas in arrays/objects, require colons in members.

#!strict

`#!comment <start> [<end>]`

Define comments to be skipped during parsing. Supports both line comments and block comments:

# Line comment: skip from start to end of line
#!comment #
#!comment //

# Block comment: skip from start to end delimiter
#!comment /* */

Multi-character start tokens work (e.g., //). For block comments, provide the end delimiter as a second argument. Only one #!comment directive is supported per grammar.

`#!keywords <ident_rule> <KW1> <KW2> ...`

Define case-insensitive keywords. The first argument is the identifier rule that keywords fall back to. Keywords are stored uppercase internally and matched case-insensitively.

#!keywords ident SELECT FROM WHERE AND OR NOT NULL TRUE FALSE

When input matches a keyword, it produces a keyword node type. When it does not match any keyword, it falls back to the identifier rule.

`#!precedence <rule> <op> <prec> <assoc> ...`

Define binary operator precedence climbing. Arguments are repeated triples of operator, precedence level (integer), and associativity (left or right).

#!precedence expr + 1 left - 1 left * 2 left / 2 left ** 3 right

Higher precedence numbers bind tighter. The <rule> must have atom as its body (e.g., expr = atom ;). The generator creates the precedence-climbing loop automatically.

`#!unary <rule> <op> <prec>`

Define a unary prefix operator.

#!unary expr - 4
#!unary expr ! 4

`#!postfix <rule> <open> [<inner> [<close>]]`

Define postfix operators like function calls, member access, or suffix operators.

# Function call: expr(args)
#!postfix expr ( arg_list )

# Member access: expr.ident
#!postfix expr . ident

# Suffix operator (no inner rule or close): expr++
#!postfix expr ++

When no inner rule or close token is provided, the postfix node is immediately closed after creation (the operand is the sole child).

`#!ternary <rule> <open> <sep> <prec>`

Define a ternary conditional operator.

#!ternary expr ? : 0

`#!extern <rule>`

Mark a rule as externally implemented. The generator skips code generation for this rule; you provide the implementation yourself.

Example Grammars

S-expressions (minimal)

#!name sexp
#!prefix Sx
#!whitespace skip

value = list | str | number | symbol ;
list = '(' { value } ')' ;
str = '"' strcontent '"' ;
strcontent = [^"]* ;
number = [0-9.]+ ;
symbol = [a-zA-Z_+\-*!?]+ ;

JSON (data format with validation)

#!name json
#!prefix J
#!whitespace skip
#!string string
#!validate number
#!strict

value = object | array | string | number | 'true' | 'false' | 'null' ;
object = '{' [ member { ',' member } ] '}' ;
member = string ':' value ;
array = '[' [ value { ',' value } ] ']' ;
string = '"' '"' ;
number = [0-9eE.+\-]+ ;

Sample parse:

$ printf '{"a": 1}' | sh entrypoint.sh lang_json_parser
X0='Jd 1'
X1='Jo 2'
X2='Jm 3 4'
V3='a'
X3='Js'
V4='1'
X4='Jn'

Node 0 (Jd, document) has child 1 (Jo, object). The object contains node 2 (Jm, member) with children: node 3 (Js, string "a") and node 4 (Jn, number 1).

Expressions (operators, precedence, postfix, ternary)

#!name expr
#!prefix Ex
#!whitespace skip
#!precedence expr + 1 left - 1 left * 2 left / 2 left ** 3 right
#!postfix expr . ident
#!postfix expr ( arg_list )
#!unary expr - 4
#!unary expr ! 4
#!ternary expr ? : 0
#!string dqstr
#!string sqstr ' escape

expr = atom ;
atom = ident | number | group | dqstr | sqstr ;
ident = [a-zA-Z_]+ ;
group = '(' expr ')' ;
number = [0-9]+ ;
arg_list = expr { ',' expr } ;
dqstr = '"' '"' ;
sqstr = "'" "'" ;

Sample parse showing precedence:

$ printf '2 + 3 * 4' | sh entrypoint.sh lang_expr_parser
X0='Ea 1'
X1='Eb 4'
V4='+'
X4='El 3 7'
V3='2'
X3='Ef'
V7='*'
X7='El 6 9'
V6='3'
X6='Ef'
V9='4'
X9='Ef'

Node 4 (El, binary op +) has children 3 (number 2) and 7. Node 7 (El, binary op *) has children 6 (number 3) and 9 (number 4). The * binds tighter than +, as expected from the precedence declaration.

AST Format

Parsers output AST as shell variable assignments on stdout:

V<n>='<value>'    -- text content of node n
X<n>='<type> [<child_indices>...]'  -- node type and children

Node 0 is always the root (document node)
V<n> is only present if the node has text content
X<n> is always present; the first word is the type code, remaining words are child node indices
Type codes are 2-character strings matching the #!prefix + a letter (e.g., Ja = JSON document, Jf = JSON string, Jg = JSON number)

Evaluating AST output

Since the output is valid shell, you can eval it to load the tree:

eval "$(printf '{"a": 1}' | sh entrypoint.sh lang_json_parser)"
echo "$X0"   # Jd 1
echo "$X1"   # Jo 2
echo "$V3"   # a

Emitter (AST to source)

Each parser module includes an emitter that reconstructs source from AST:

<name>_unast -- reads AST assignments from stdin, prints source
<name>_reast -- round-trip: parse stdin then emit source

# Reconstruct from AST
echo '{"a": 1}' | sh entrypoint.sh lang_json_parser | sh entrypoint.sh lang_json_unast

# Round-trip in one command
echo '{"a": 1}' | sh entrypoint.sh lang_json_reast

Using Parsers as Libraries

Use entrypoint.sh to load modules, then call parser functions directly:

. ./core/header.sh
. ./core/use.sh
_LIB=./modules:./generated

use ast_engine
use lang_json_parser
use lang_json_unast

# Now call parser functions directly
echo '42' | lang_json_parser     # prints AST
echo '42' | lang_json_reast      # round-trip

The use function loads modules by converting underscores to paths (e.g., lang_json_parser loads modules/lang/json/parser.sh) and tracks loaded modules to avoid double-sourcing.

Testing

Run tests for a single parser

sh tests_entrypoint.sh json

Filter tests by name

sh tests_entrypoint.sh json "string"    # only run tests matching "string"

Multi-shell testing via Docker

sh ptest                   # all test files, all shells
sh ptest json bnf          # specific test modules

Without arguments, ptest runs all test files (infrastructure, hand-written parsers, and generated parsers).

This uses alganet/shell-versions:all to run tests across 13 shells defined in pshells.txt.

Test functions

Tests use the framework in test/sh.sh:

Function	Usage	Purpose
`is`	`is "name" "input" "expected_ast"`	Assert parse output matches
`rt`	`rt "name" "input" ["expected"]`	Assert round-trip reconstructs input (or expected)
`fail`	`fail "name" "input"`	Assert that parsing fails
`err`	`err "name" "input" "msg" line col`	Assert specific error message and position

Example test file (test/json.sh):

#!/bin/sh
set -euf

_mod_add _MOD core_header; _mod_add _MOD core_use

. "$_DIR/test/sh.sh"
use ast_engine
use lang_json_parser
use lang_json_unast

_PARS=lang_json_parser _UNAST=lang_json_unast _PRINTIN=_printn1

test_json () {
	PASS=0 FAIL=0 SKIP=0 FILTER="${1:-}"

	is "null" 'null' "X0='Jd 1'
X1='Ju'"

	rt "object" '{"a": 1}'

	fail "trailing comma" '{"a": 1,}'

	_printr1 "# $PASS passed, $FAIL failed, $SKIP skipped"
}

Tests are wrapped in a test_<name> function and loaded via tests_entrypoint.sh, which handles module setup and invocation.

Shell Compatibility

Tested across 13 shell versions via Docker:

Shell	Versions
ash (BusyBox)	1.26.2, 1.37.0
bash	2.05b.13, 5.3.9
dash	0.5.5.1, 0.5.13
ksh93	93u+m v1.0.3, v1.0.10
loksh	6.8.1
mksh	R45
oksh	6.8.1
yash	2.60
zsh	5.9

Key portability mechanisms:

PATH= cleared so only builtins are used
shopt -s expand_aliases for bash; setopt options for zsh
alias local=typeset fallback for ksh/mksh
Output detection cascade: printf > print > echo > command -p printf
core/footer.sh fixes ksh93 recursive function scoping

See INTERNALS.md for details.

Additional

C89 Compiler (`modules/lang/c89/compiler.sh`)

Reads C89 AST from stdin, emits x86-64 ELF binary to stdout.

Grammars (`bnf/`)

File	Purpose
`*.bnf`	Grammar definitions for generated parsers
`regen.sh`	Regenerate all `generated/lang/<name>/` parsers from grammars

Infrastructure

File	Purpose
`ptest`	Multi-shell test runner (Docker)
`pshells.txt`	Shell version manifest for ptest

License

ISC

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
bnf		bnf
core		core
generated/lang		generated/lang
modules		modules
stage0-posix-1.9.1		stage0-posix-1.9.1
test		test
INTERNALS.md		INTERNALS.md
LICENSE		LICENSE
README.md		README.md
demo_hs.sh		demo_hs.sh
entrypoint.sh		entrypoint.sh
pshells.txt		pshells.txt
ptest		ptest

Folders and files

Latest commit

History

Repository files navigation

pars

Features

Quick Start

1. Write a grammar

2. Generate a parser

3. Parse input

4. Round-trip (parse then reconstruct)

5. Regenerate all parsers from grammars

Grammar Reference

Rule Syntax

Comments

Elements

Directives

#!name <name>

#!prefix <prefix>

#!whitespace <mode>

#!string <rule> [<close> [escape]]

#!number <rule>

#!validate

#!strict

#!comment <start> [<end>]

#!keywords <ident_rule> <KW1> <KW2> ...

#!precedence <rule> <op> <prec> <assoc> ...

#!unary <rule> <op> <prec>

#!postfix <rule> <open> [<inner> [<close>]]

#!ternary <rule> <open> <sep> <prec>

#!extern <rule>

Example Grammars

S-expressions (minimal)

JSON (data format with validation)

Expressions (operators, precedence, postfix, ternary)

AST Format

Evaluating AST output

Emitter (AST to source)

Using Parsers as Libraries

Testing

Run tests for a single parser

Filter tests by name

Multi-shell testing via Docker

Test functions

Shell Compatibility

Additional

C89 Compiler (modules/lang/c89/compiler.sh)

Grammars (bnf/)

Infrastructure

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`#!name <name>`

`#!prefix <prefix>`

`#!whitespace <mode>`

`#!string <rule> [<close> [escape]]`

`#!number <rule>`

`#!validate`

`#!strict`

`#!comment <start> [<end>]`

`#!keywords <ident_rule> <KW1> <KW2> ...`

`#!precedence <rule> <op> <prec> <assoc> ...`

`#!unary <rule> <op> <prec>`

`#!postfix <rule> <open> [<inner> [<close>]]`

`#!ternary <rule> <open> <sep> <prec>`

`#!extern <rule>`

C89 Compiler (`modules/lang/c89/compiler.sh`)

Grammars (`bnf/`)

Packages