A C++ compiler implementation built with Python for educational purposes, developed as a project for the Data Structures and Algorithms course at Iran University of Science and Technology (IUST).
This project was created as part of the Data Structures and Algorithms curriculum to demonstrate fundamental compiler design principles and implement key algorithms used in programming language processing. It showcases practical applications of concepts taught in the course, including:
- Tree data structures for parse tree representation
- Stack-based algorithms for parsing
- Hash tables for token management
- Graph algorithms for grammar analysis (FIRST and FOLLOW sets)
- Complexity analysis of compiler algorithms
CPPiler is an educational C++ parser and compiler that performs lexical analysis, tokenization, and syntax parsing on a restricted subset of C++ code. It demonstrates the complete compilation pipeline from source code to parse tree visualization, making it an excellent learning tool for understanding how compilers work.
- Lexical Analysis: Tokenizes C++ source code into meaningful tokens (keywords, identifiers, numbers, symbols, strings)
- Syntax Validation: Checks for basic syntax errors like missing semicolons and type mismatches
- LL(1) Parser: Implements a non-predictive parser using an LL(1) parse table
- Parse Tree Visualization: Generates and displays parse trees showing the syntactic structure of code
- Interactive CLI: User-friendly menu interface for exploring different compiler stages
- Grammar Analysis: Computes FIRST and FOLLOW sets for context-free grammar
CPPiler supports a limited but meaningful subset of C++ syntax:
#include <iostream>
using namespace std;
int main() {
int x = 5;
float y = 3.14;
int sum = x + y * 2;
while(x != 0) {
cin >> x;
cout << "Value: " << x;
}
return 0;
}Supported constructs:
#includedirectivesusing namespace std;int main()function- Variable declarations:
int,float - Arithmetic operations:
+,-,* - Comparison operators:
==,!=,>=,<= - Input:
cin >> variable - Output:
cout << value whileloopsreturn 0;statement
CPPiler follows a classic compiler architecture with the following components:
- Uses regex-based pattern matching to break source code into tokens
- Recognizes 6 token types: reserved words, identifiers, numbers, symbols, strings, and whitespace
- Time Complexity: O(m × n) where m is input length and n is the number of patterns
- Defines a context-free grammar (CFG) with 19 non-terminals
- Contains a pre-computed LL(1) parse table (22 rows × 26 columns)
- Maps grammar symbols to table indices for efficient parsing
- Recursively computes FIRST sets for all non-terminals
- Essential for constructing the LL(1) parser
- Time Complexity: O(N × M × L) where N is non-terminals count, M is productions per non-terminal, and L is average production length
- Computes FOLLOW sets for non-terminals
- Handles epsilon (ε) productions correctly
- Uses visited tracking to avoid infinite loops
- Time Complexity: O(N × P × L) where N is non-terminals count, P is productions count, and L is average production length
- Reads C++ source files and tokenizes them
- Performs basic syntax validation
- Displays tokenized output to the user
- Organizes tokens by type into a formatted table
- Uses SHA256 hashing for token representation
- Time Complexity: O(n) where n is the number of tokens
- Visualizes the LL(1) parse table
- Shows all possible grammar derivations
- Time Complexity: O(n) where n is table size
- Implements stack-based LL(1) parsing algorithm
- Generates parse trees using the
anytreelibrary - Visualizes the syntactic structure of the code
- Time Complexity: O(n + t + r × c) where n is input length, t is tokens count, r is rows, and c is columns in parse table
- Provides a user-friendly CLI using the Rich library
- Allows users to explore different compilation stages
- Options: Tokenize, Token Table, Parse Table, Parse Tree, Exit
Source Code (.cpp)
↓
Lexer (lexer.py)
↓
Tokens
↓
Parser (nonpredective.py)
↓
Parse Tree
↓
Visualization
- Python 3.x
- Required libraries:
rich,anytree
pip install rich anytreepython main.py- Tokenize: View all tokens extracted from your C++ code
- Token Table: See tokens organized by type (keywords, identifiers, numbers, etc.)
- Parse Table: Display the LL(1) parse table used for syntax analysis
- Parse Tree: Generate and visualize the parse tree for your code
- Exit: Close the application
- Python 3: Main implementation language
- Rich: Terminal formatting and beautiful console output
- anytree: Tree data structure for parse tree representation
- re: Regular expressions for lexical analysis
- hashlib: SHA256 hashing for token display
This project demonstrates:
- How compilers process source code from text to structured data
- Implementation of fundamental algorithms from Data Structures and Algorithms course
- Practical application of trees, stacks, and hash tables
- Understanding of formal languages and automata theory
- Complexity analysis and algorithm optimization
- Supports only a restricted subset of C++
- No semantic analysis or code generation
- Limited error recovery mechanisms
- Pre-built parse table (not dynamically generated)
Potential improvements for learning:
- Support for more C++ constructs (functions, arrays, classes)
- Dynamic parse table generation from grammar
- Better error messages and recovery
- Semantic analysis phase
- Intermediate code generation
- Optimization techniques
Course: Data Structures and Algorithms Institution: Iran University of Science and Technology (IUST) Purpose: Educational demonstration of compiler concepts and algorithms
This is an educational project developed for academic purposes.
Note: This compiler is designed for learning purposes and demonstrates fundamental compiler concepts. It is not intended for production use or compiling real-world C++ applications.