#grammar #compiler #parser-grammar

build shiftkit

A Rust library for building parsers and grammars

2 unstable releases

Uses new Rust 2024

new 0.2.0 Apr 13, 2026
0.1.0 Apr 10, 2025

#42 in Parser tooling

MIT license

59KB
1K SLoC

ShiftKit

A generic LALR(1) parser generator with a flat, index-based AST representation.

Design Philosophy

Unlike traditional parser generators that produce recursive AST structures with Box<AstNode> references, ShiftKit takes a different approach: the parser returns a flat Vec of AST nodes, where nodes reference their children by index rather than by pointer.

This design offers several advantages:

  • Memory efficiency: No pointer overhead, better cache locality
  • Predictable memory layout: All nodes stored contiguously
  • Easy traversal: Iterate through the entire AST with a simple loop

Key Concepts

Token Types

Tokens require unique IDs for grammar definitions via TokenType(u32). Your custom token type must implement:

pub trait HasTokenType {
    fn token_type(&self) -> TokenType;
}

This lets you define tokens however you want (enums, structs, etc.) while the parser uses token_type() to identify them during parsing.

AST Node Types - Grammar Only

AstNodeType(u32) is used only for defining grammar rules, not for the AST nodes themselves.

Key insight: The parser tracks which AstNodeType each node has internally based on which grammar rule created it. Your AST nodes don't need to store or know their type - that's the parser's job.

This means your AST can be extremely simple:

#[derive(Debug, Clone)]
enum AstNode {
    Number(i64),
    BinOp(AstNodeId, BinOpType, AstNodeId),
}

You control precedence through grammar structure (e.g., separate VALUE, PRODUCT, SUM node types in the grammar), but all binary operations map to the same simple BinOp variant in your AST.

Reduction Results: New Nodes vs Pass-Through

Reduction functions receive indices corresponding to the grammar rule symbols and return a ReductionResult:

// Reduction function signature
type ReductionFn<T, A> = fn(indices: &[Index], tokens: &[T], nodes: &[A]) -> ReductionResult<A>;

// Index is either a token position or AST node position
pub enum Index {
    Token(TokenId),
    AstNode(AstNodeId),
}

// Result is either a new node or a forwarded node
pub enum ReductionResult<A> {
    /// Create a new AST node and append it to the Vec
    NewNode(A),
    /// Reuse an existing AST node (for pass-through rules like `Sum -> Product`)
    Forward(AstNodeId),
}

This avoids duplicating nodes in the Vec for pass-through grammar rules.

Putting It Together

The key design insights:

  1. AstNodeType for precedence: The grammar uses types like NODE_VALUE, NODE_PRODUCT, NODE_SUM to encode precedence, but your actual AST is simple (just Number and BinOp in this example).

  2. Parser tracks types internally: When a reduction fires, the parser knows which AstNodeType it produced (from the grammar rule's left-hand side). You never need to store or compute this in your AST.

  3. ReductionResult enables pass-through: Rules like Sum -> Product just forward the existing node instead of cloning it, keeping the Vec compact.

  4. Simple AST, complex grammar: Control precedence and parsing behavior through grammar structure, while keeping your AST focused on semantics.

Index-Based AST Structure

When the parser processes input, it:

  1. Takes a slice of tokens as input
  2. Returns a Vec of AST nodes
  3. The last element in the Vec is the root/outermost node
  4. Child nodes appear earlier in the Vec

AST nodes reference their children using AstNodeId (index into the AST Vec) and can optionally reference TokenId (index into the token slice).

Usage

1. Define Your Token and AST Node Types

Tokens need HasTokenType implemented.

use shiftkit::{TokenType, HasTokenType, AstNodeId};

// Token type IDs (for grammar definition)
const TOKEN_NUMBER: TokenType = TokenType(0);
const TOKEN_PLUS: TokenType = TokenType(1);
const TOKEN_MINUS: TokenType = TokenType(2);
const TOKEN_STAR: TokenType = TokenType(3);
const TOKEN_SLASH: TokenType = TokenType(4);
const TOKEN_LPAREN: TokenType = TokenType(5);
const TOKEN_RPAREN: TokenType = TokenType(6);

// Your custom token type
#[derive(Debug, Clone)]
enum Token {
    Number(i64),
    Plus,
    Minus,
    Star,
    Slash,
    LParen,
    RParen,
}

impl HasTokenType for Token {
    fn token_type(&self) -> TokenType {
        match self {
            Self::Number(_) => TOKEN_NUMBER,
            Self::Plus => TOKEN_PLUS,
            Self::Minus => TOKEN_MINUS,
            Self::Star => TOKEN_STAR,
            Self::Slash => TOKEN_SLASH,
            Self::LParen => TOKEN_LPAREN,
            Self::RParen => TOKEN_RPAREN,
        }
    }
}

// Your custom AST node types
#[derive(Debug, Clone)]
enum BinOpType {
    Add,
    Sub,
    Mul,
    Div,
}

#[derive(Debug, Clone)]
enum AstNode {
    Number(i64),
    BinOp(AstNodeId, BinOpType, AstNodeId),
}

2. Define Production Rules

Control precedence through grammar structure, not AST structure:

use shiftkit::{Grammar, AstNodeType, ReductionResult, Index};

// AST node types (for grammar only - not stored in actual AST!)
const NODE_VALUE: AstNodeType = AstNodeType(0);    // Atoms: numbers, parenthesized exprs
const NODE_PRODUCT: AstNodeType = AstNodeType(1);  // *, / (higher precedence)
const NODE_SUM: AstNodeType = AstNodeType(2);      // +, - (lower precedence)

let mut grammar = Grammar::new();

// Value -> NUMBER
grammar.add_rule(
    NODE_VALUE,
    &[TOKEN_NUMBER.into()],
    |indices: &[Index], tokens: &[Token], nodes: &[AstNode]| {
        let tok_id = indices[0].as_token_id();
        let Token::Number(num) = tokens[tok_id] else { unreachable!() };
        ReductionResult::NewNode(AstNode::Number(num))
    }
);

// Value -> ( Sum )
grammar.add_rule(
    NODE_VALUE,
    &[TOKEN_LPAREN.into(), NODE_SUM.into(), TOKEN_RPAREN.into()],
    |indices: &[Index], tokens: &[Token], nodes: &[AstNode]| {
        // Just forward the inner expression - don't duplicate it!
        ReductionResult::Forward(indices[1].as_ast_node_id())
    }
);

// Product -> Value (pass-through)
grammar.add_rule(
    NODE_PRODUCT,
    &[NODE_VALUE.into()],
    |indices: &[Index], tokens: &[Token], nodes: &[AstNode]| {
        ReductionResult::Forward(indices[0].as_ast_node_id())
    }
);

// Product -> Product * Value
grammar.add_rule(
    NODE_PRODUCT,
    &[NODE_PRODUCT.into(), TOKEN_STAR.into(), NODE_VALUE.into()],
    |indices: &[Index], tokens: &[Token], nodes: &[AstNode]| {
        let lhs = indices[0].as_ast_node_id();
        let rhs = indices[2].as_ast_node_id();
        ReductionResult::NewNode(AstNode::BinOp(lhs, BinOpType::Mul, rhs))
    }
);

// Sum -> Product (pass-through)
grammar.add_rule(
    NODE_SUM,
    &[NODE_PRODUCT.into()],
    |indices: &[Index], tokens: &[Token], nodes: &[AstNode]| {
        ReductionResult::Forward(indices[0].as_ast_node_id())
    }
);

// Sum -> Sum + Product
grammar.add_rule(
    NODE_SUM,
    &[NODE_SUM.into(), TOKEN_PLUS.into(), NODE_PRODUCT.into()],
    |indices: &[Index], tokens: &[Token], nodes: &[AstNode]| {
        let lhs = indices[0].as_ast_node_id();
        let rhs = indices[2].as_ast_node_id();
        ReductionResult::NewNode(AstNode::BinOp(lhs, BinOpType::Add, rhs))
    }
);

Key points:

  • AstNodeType variants (NODE_VALUE, NODE_PRODUCT, NODE_SUM) encode precedence in the grammar
  • Your actual AstNode enum is simple - just Number and BinOp
  • Use ReductionResult::Forward() for pass-through rules to avoid duplicating nodes
  • Use ReductionResult::NewNode() to create new AST nodes

3. Parse Input

Build the parser from your grammar and use it to parse token slices:

use shiftkit::Parser;

// Build the LALR(1) parser from the grammar
let parser: Parser<Token, AstNode> = Parser::from_grammar(grammar)?;

// Parse your input tokens
let tokens = tokenize("1 + 2 * 3");
let ast_nodes = parser.parse(&tokens)?;

// The root node is at the end of the Vec
let root_id = ast_nodes.len() - 1;

// Traverse the AST using indices
fn evaluate(node_id: AstNodeId, nodes: &[AstNode]) -> i64 {
    match &nodes[node_id] {
        AstNode::Number(n) => *n,
        AstNode::BinOp(lhs, op, rhs) => {
            let left = evaluate(*lhs, nodes);
            let right = evaluate(*rhs, nodes);
            match op {
                BinOpType::Add => left + right,
                BinOpType::Sub => left - right,
                BinOpType::Mul => left * right,
                BinOpType::Div => left / right,
            }
        }
    }
}

let result = evaluate(root_id, &ast_nodes);

The parser is generic over tokens (requiring HasTokenType) and any AST node type:

pub struct Parser<T: HasTokenType, A> {
    // ... parsing tables
}

impl<T: HasTokenType, A> Parser<T, A> {
    pub fn parse(&self, tokens: &[T]) -> Result<Vec<A>, ParseError> {
        // Uses `token.token_type()` for parsing decisions
        // Tracks `AstNodeType` internally based on which grammar rules fire
    }
}

Benefits

Memory Efficiency

  • No Box allocations for each node
  • Contiguous memory layout improves cache performance
  • Smaller memory footprint for large ASTs

Simplicity

  • No lifetime management for AST references
  • Easy to implement Copy semantics for indices
  • Straightforward serialization (just write the Vec)

Flexibility

  • Easy to implement multiple passes over the AST
  • Can maintain separate metadata vectors indexed by node ID
  • Simple to implement AST transformations (just modify the Vec)

Debugging

  • Print the entire AST as a simple Vec
  • Node indices provide stable identifiers across transformations
  • Easy to visualize the parsing order (nodes are added in reduction order)

Example: Complete Calculator Grammar

A full example showing precedence control with a simple AST:

use shiftkit::{Grammar, TokenType, AstNodeType, ReductionResult, Index, AstNodeId};

// Token type IDs (for grammar)
const TOKEN_NUMBER: TokenType = TokenType(0);
const TOKEN_PLUS: TokenType = TokenType(1);
const TOKEN_MINUS: TokenType = TokenType(2);
const TOKEN_STAR: TokenType = TokenType(3);
const TOKEN_SLASH: TokenType = TokenType(4);
const TOKEN_LPAREN: TokenType = TokenType(5);
const TOKEN_RPAREN: TokenType = TokenType(6);

// AST node type IDs (for grammar only - control precedence)
const NODE_VALUE: AstNodeType = AstNodeType(0);    // Atoms
const NODE_PRODUCT: AstNodeType = AstNodeType(1);  // *, /
const NODE_SUM: AstNodeType = AstNodeType(2);      // +, -

// Your actual AST types (no precedence info needed!)
#[derive(Debug, Clone)]
enum BinOpType { Add, Sub, Mul, Div }

#[derive(Debug, Clone)]
enum AstNode {
    Number(i64),
    BinOp(AstNodeId, BinOpType, AstNodeId),
}

let mut grammar = Grammar::new();

// Value -> NUMBER
grammar.add_rule(NODE_VALUE, &[TOKEN_NUMBER.into()],
    |indices, tokens, nodes| {
        let Token::Number(n) = tokens[indices[0].as_token_id()] else { unreachable!() };
        ReductionResult::NewNode(AstNode::Number(n))
    }
);

// Value -> ( Sum )
grammar.add_rule(NODE_VALUE,
    &[TOKEN_LPAREN.into(), NODE_SUM.into(), TOKEN_RPAREN.into()],
    |indices, tokens, nodes| {
        ReductionResult::Forward(indices[1].as_ast_node_id())
    }
);

// Product -> Value
grammar.add_rule(NODE_PRODUCT, &[NODE_VALUE.into()],
    |indices, tokens, nodes| {
        ReductionResult::Forward(indices[0].as_ast_node_id())
    }
);

// Product -> Product * Value
grammar.add_rule(NODE_PRODUCT,
    &[NODE_PRODUCT.into(), TOKEN_STAR.into(), NODE_VALUE.into()],
    |indices, tokens, nodes| {
        ReductionResult::NewNode(AstNode::BinOp(
            indices[0].as_ast_node_id(),
            BinOpType::Mul,
            indices[2].as_ast_node_id()
        ))
    }
);

// Product -> Product / Value
grammar.add_rule(NODE_PRODUCT,
    &[NODE_PRODUCT.into(), TOKEN_SLASH.into(), NODE_VALUE.into()],
    |indices, tokens, nodes| {
        ReductionResult::NewNode(AstNode::BinOp(
            indices[0].as_ast_node_id(),
            BinOpType::Div,
            indices[2].as_ast_node_id()
        ))
    }
);

// Sum -> Product
grammar.add_rule(NODE_SUM, &[NODE_PRODUCT.into()],
    |indices, tokens, nodes| {
        ReductionResult::Forward(indices[0].as_ast_node_id())
    }
);

// Sum -> Sum + Product
grammar.add_rule(NODE_SUM,
    &[NODE_SUM.into(), TOKEN_PLUS.into(), NODE_PRODUCT.into()],
    |indices, tokens, nodes| {
        ReductionResult::NewNode(AstNode::BinOp(
            indices[0].as_ast_node_id(),
            BinOpType::Add,
            indices[2].as_ast_node_id()
        ))
    }
);

// Sum -> Sum - Product
grammar.add_rule(NODE_SUM,
    &[NODE_SUM.into(), TOKEN_MINUS.into(), NODE_PRODUCT.into()],
    |indices, tokens, nodes| {
        ReductionResult::NewNode(AstNode::BinOp(
            indices[0].as_ast_node_id(),
            BinOpType::Sub,
            indices[2].as_ast_node_id()
        ))
    }
);

Result: Full precedence control (parentheses, *//, +/-) with a simple two-variant AST!

License

MIT

No runtime deps