2 unstable releases
Uses new Rust 2024
| new 0.2.0 | Apr 13, 2026 |
|---|---|
| 0.1.0 | Apr 10, 2025 |
#42 in Parser tooling
59KB
1K
SLoC
ShiftKit
A generic LALR(1) parser generator with a flat, index-based AST representation.
Design Philosophy
Unlike traditional parser generators that produce recursive AST structures with Box<AstNode> references, ShiftKit takes a different approach: the parser returns a flat Vec of AST nodes, where nodes reference their children by index rather than by pointer.
This design offers several advantages:
- Memory efficiency: No pointer overhead, better cache locality
- Predictable memory layout: All nodes stored contiguously
- Easy traversal: Iterate through the entire AST with a simple loop
Key Concepts
Token Types
Tokens require unique IDs for grammar definitions via TokenType(u32).
Your custom token type must implement:
pub trait HasTokenType {
fn token_type(&self) -> TokenType;
}
This lets you define tokens however you want (enums, structs, etc.) while the parser uses token_type() to identify them during parsing.
AST Node Types - Grammar Only
AstNodeType(u32) is used only for defining grammar rules, not for the AST nodes themselves.
Key insight: The parser tracks which AstNodeType each node has internally based on which grammar rule created it.
Your AST nodes don't need to store or know their type - that's the parser's job.
This means your AST can be extremely simple:
#[derive(Debug, Clone)]
enum AstNode {
Number(i64),
BinOp(AstNodeId, BinOpType, AstNodeId),
}
You control precedence through grammar structure (e.g., separate VALUE, PRODUCT, SUM node types in the grammar), but all binary operations map to the same simple BinOp variant in your AST.
Reduction Results: New Nodes vs Pass-Through
Reduction functions receive indices corresponding to the grammar rule symbols and return a ReductionResult:
// Reduction function signature
type ReductionFn<T, A> = fn(indices: &[Index], tokens: &[T], nodes: &[A]) -> ReductionResult<A>;
// Index is either a token position or AST node position
pub enum Index {
Token(TokenId),
AstNode(AstNodeId),
}
// Result is either a new node or a forwarded node
pub enum ReductionResult<A> {
/// Create a new AST node and append it to the Vec
NewNode(A),
/// Reuse an existing AST node (for pass-through rules like `Sum -> Product`)
Forward(AstNodeId),
}
This avoids duplicating nodes in the Vec for pass-through grammar rules.
Putting It Together
The key design insights:
-
AstNodeTypefor precedence: The grammar uses types likeNODE_VALUE,NODE_PRODUCT,NODE_SUMto encode precedence, but your actual AST is simple (justNumberandBinOpin this example). -
Parser tracks types internally: When a reduction fires, the parser knows which
AstNodeTypeit produced (from the grammar rule's left-hand side). You never need to store or compute this in your AST. -
ReductionResultenables pass-through: Rules likeSum -> Productjust forward the existing node instead of cloning it, keeping the Vec compact. -
Simple AST, complex grammar: Control precedence and parsing behavior through grammar structure, while keeping your AST focused on semantics.
Index-Based AST Structure
When the parser processes input, it:
- Takes a slice of tokens as input
- Returns a
Vecof AST nodes - The last element in the Vec is the root/outermost node
- Child nodes appear earlier in the Vec
AST nodes reference their children using AstNodeId (index into the AST Vec) and can optionally reference TokenId (index into the token slice).
Usage
1. Define Your Token and AST Node Types
Tokens need HasTokenType implemented.
use shiftkit::{TokenType, HasTokenType, AstNodeId};
// Token type IDs (for grammar definition)
const TOKEN_NUMBER: TokenType = TokenType(0);
const TOKEN_PLUS: TokenType = TokenType(1);
const TOKEN_MINUS: TokenType = TokenType(2);
const TOKEN_STAR: TokenType = TokenType(3);
const TOKEN_SLASH: TokenType = TokenType(4);
const TOKEN_LPAREN: TokenType = TokenType(5);
const TOKEN_RPAREN: TokenType = TokenType(6);
// Your custom token type
#[derive(Debug, Clone)]
enum Token {
Number(i64),
Plus,
Minus,
Star,
Slash,
LParen,
RParen,
}
impl HasTokenType for Token {
fn token_type(&self) -> TokenType {
match self {
Self::Number(_) => TOKEN_NUMBER,
Self::Plus => TOKEN_PLUS,
Self::Minus => TOKEN_MINUS,
Self::Star => TOKEN_STAR,
Self::Slash => TOKEN_SLASH,
Self::LParen => TOKEN_LPAREN,
Self::RParen => TOKEN_RPAREN,
}
}
}
// Your custom AST node types
#[derive(Debug, Clone)]
enum BinOpType {
Add,
Sub,
Mul,
Div,
}
#[derive(Debug, Clone)]
enum AstNode {
Number(i64),
BinOp(AstNodeId, BinOpType, AstNodeId),
}
2. Define Production Rules
Control precedence through grammar structure, not AST structure:
use shiftkit::{Grammar, AstNodeType, ReductionResult, Index};
// AST node types (for grammar only - not stored in actual AST!)
const NODE_VALUE: AstNodeType = AstNodeType(0); // Atoms: numbers, parenthesized exprs
const NODE_PRODUCT: AstNodeType = AstNodeType(1); // *, / (higher precedence)
const NODE_SUM: AstNodeType = AstNodeType(2); // +, - (lower precedence)
let mut grammar = Grammar::new();
// Value -> NUMBER
grammar.add_rule(
NODE_VALUE,
&[TOKEN_NUMBER.into()],
|indices: &[Index], tokens: &[Token], nodes: &[AstNode]| {
let tok_id = indices[0].as_token_id();
let Token::Number(num) = tokens[tok_id] else { unreachable!() };
ReductionResult::NewNode(AstNode::Number(num))
}
);
// Value -> ( Sum )
grammar.add_rule(
NODE_VALUE,
&[TOKEN_LPAREN.into(), NODE_SUM.into(), TOKEN_RPAREN.into()],
|indices: &[Index], tokens: &[Token], nodes: &[AstNode]| {
// Just forward the inner expression - don't duplicate it!
ReductionResult::Forward(indices[1].as_ast_node_id())
}
);
// Product -> Value (pass-through)
grammar.add_rule(
NODE_PRODUCT,
&[NODE_VALUE.into()],
|indices: &[Index], tokens: &[Token], nodes: &[AstNode]| {
ReductionResult::Forward(indices[0].as_ast_node_id())
}
);
// Product -> Product * Value
grammar.add_rule(
NODE_PRODUCT,
&[NODE_PRODUCT.into(), TOKEN_STAR.into(), NODE_VALUE.into()],
|indices: &[Index], tokens: &[Token], nodes: &[AstNode]| {
let lhs = indices[0].as_ast_node_id();
let rhs = indices[2].as_ast_node_id();
ReductionResult::NewNode(AstNode::BinOp(lhs, BinOpType::Mul, rhs))
}
);
// Sum -> Product (pass-through)
grammar.add_rule(
NODE_SUM,
&[NODE_PRODUCT.into()],
|indices: &[Index], tokens: &[Token], nodes: &[AstNode]| {
ReductionResult::Forward(indices[0].as_ast_node_id())
}
);
// Sum -> Sum + Product
grammar.add_rule(
NODE_SUM,
&[NODE_SUM.into(), TOKEN_PLUS.into(), NODE_PRODUCT.into()],
|indices: &[Index], tokens: &[Token], nodes: &[AstNode]| {
let lhs = indices[0].as_ast_node_id();
let rhs = indices[2].as_ast_node_id();
ReductionResult::NewNode(AstNode::BinOp(lhs, BinOpType::Add, rhs))
}
);
Key points:
AstNodeTypevariants (NODE_VALUE,NODE_PRODUCT,NODE_SUM) encode precedence in the grammar- Your actual
AstNodeenum is simple - justNumberandBinOp - Use
ReductionResult::Forward()for pass-through rules to avoid duplicating nodes - Use
ReductionResult::NewNode()to create new AST nodes
3. Parse Input
Build the parser from your grammar and use it to parse token slices:
use shiftkit::Parser;
// Build the LALR(1) parser from the grammar
let parser: Parser<Token, AstNode> = Parser::from_grammar(grammar)?;
// Parse your input tokens
let tokens = tokenize("1 + 2 * 3");
let ast_nodes = parser.parse(&tokens)?;
// The root node is at the end of the Vec
let root_id = ast_nodes.len() - 1;
// Traverse the AST using indices
fn evaluate(node_id: AstNodeId, nodes: &[AstNode]) -> i64 {
match &nodes[node_id] {
AstNode::Number(n) => *n,
AstNode::BinOp(lhs, op, rhs) => {
let left = evaluate(*lhs, nodes);
let right = evaluate(*rhs, nodes);
match op {
BinOpType::Add => left + right,
BinOpType::Sub => left - right,
BinOpType::Mul => left * right,
BinOpType::Div => left / right,
}
}
}
}
let result = evaluate(root_id, &ast_nodes);
The parser is generic over tokens (requiring HasTokenType) and any AST node type:
pub struct Parser<T: HasTokenType, A> {
// ... parsing tables
}
impl<T: HasTokenType, A> Parser<T, A> {
pub fn parse(&self, tokens: &[T]) -> Result<Vec<A>, ParseError> {
// Uses `token.token_type()` for parsing decisions
// Tracks `AstNodeType` internally based on which grammar rules fire
}
}
Benefits
Memory Efficiency
- No
Boxallocations for each node - Contiguous memory layout improves cache performance
- Smaller memory footprint for large ASTs
Simplicity
- No lifetime management for AST references
- Easy to implement
Copysemantics for indices - Straightforward serialization (just write the Vec)
Flexibility
- Easy to implement multiple passes over the AST
- Can maintain separate metadata vectors indexed by node ID
- Simple to implement AST transformations (just modify the Vec)
Debugging
- Print the entire AST as a simple Vec
- Node indices provide stable identifiers across transformations
- Easy to visualize the parsing order (nodes are added in reduction order)
Example: Complete Calculator Grammar
A full example showing precedence control with a simple AST:
use shiftkit::{Grammar, TokenType, AstNodeType, ReductionResult, Index, AstNodeId};
// Token type IDs (for grammar)
const TOKEN_NUMBER: TokenType = TokenType(0);
const TOKEN_PLUS: TokenType = TokenType(1);
const TOKEN_MINUS: TokenType = TokenType(2);
const TOKEN_STAR: TokenType = TokenType(3);
const TOKEN_SLASH: TokenType = TokenType(4);
const TOKEN_LPAREN: TokenType = TokenType(5);
const TOKEN_RPAREN: TokenType = TokenType(6);
// AST node type IDs (for grammar only - control precedence)
const NODE_VALUE: AstNodeType = AstNodeType(0); // Atoms
const NODE_PRODUCT: AstNodeType = AstNodeType(1); // *, /
const NODE_SUM: AstNodeType = AstNodeType(2); // +, -
// Your actual AST types (no precedence info needed!)
#[derive(Debug, Clone)]
enum BinOpType { Add, Sub, Mul, Div }
#[derive(Debug, Clone)]
enum AstNode {
Number(i64),
BinOp(AstNodeId, BinOpType, AstNodeId),
}
let mut grammar = Grammar::new();
// Value -> NUMBER
grammar.add_rule(NODE_VALUE, &[TOKEN_NUMBER.into()],
|indices, tokens, nodes| {
let Token::Number(n) = tokens[indices[0].as_token_id()] else { unreachable!() };
ReductionResult::NewNode(AstNode::Number(n))
}
);
// Value -> ( Sum )
grammar.add_rule(NODE_VALUE,
&[TOKEN_LPAREN.into(), NODE_SUM.into(), TOKEN_RPAREN.into()],
|indices, tokens, nodes| {
ReductionResult::Forward(indices[1].as_ast_node_id())
}
);
// Product -> Value
grammar.add_rule(NODE_PRODUCT, &[NODE_VALUE.into()],
|indices, tokens, nodes| {
ReductionResult::Forward(indices[0].as_ast_node_id())
}
);
// Product -> Product * Value
grammar.add_rule(NODE_PRODUCT,
&[NODE_PRODUCT.into(), TOKEN_STAR.into(), NODE_VALUE.into()],
|indices, tokens, nodes| {
ReductionResult::NewNode(AstNode::BinOp(
indices[0].as_ast_node_id(),
BinOpType::Mul,
indices[2].as_ast_node_id()
))
}
);
// Product -> Product / Value
grammar.add_rule(NODE_PRODUCT,
&[NODE_PRODUCT.into(), TOKEN_SLASH.into(), NODE_VALUE.into()],
|indices, tokens, nodes| {
ReductionResult::NewNode(AstNode::BinOp(
indices[0].as_ast_node_id(),
BinOpType::Div,
indices[2].as_ast_node_id()
))
}
);
// Sum -> Product
grammar.add_rule(NODE_SUM, &[NODE_PRODUCT.into()],
|indices, tokens, nodes| {
ReductionResult::Forward(indices[0].as_ast_node_id())
}
);
// Sum -> Sum + Product
grammar.add_rule(NODE_SUM,
&[NODE_SUM.into(), TOKEN_PLUS.into(), NODE_PRODUCT.into()],
|indices, tokens, nodes| {
ReductionResult::NewNode(AstNode::BinOp(
indices[0].as_ast_node_id(),
BinOpType::Add,
indices[2].as_ast_node_id()
))
}
);
// Sum -> Sum - Product
grammar.add_rule(NODE_SUM,
&[NODE_SUM.into(), TOKEN_MINUS.into(), NODE_PRODUCT.into()],
|indices, tokens, nodes| {
ReductionResult::NewNode(AstNode::BinOp(
indices[0].as_ast_node_id(),
BinOpType::Sub,
indices[2].as_ast_node_id()
))
}
);
Result: Full precedence control (parentheses, *//, +/-) with a simple two-variant AST!
License
MIT