Limited expressiveness, unlimited portability
Very small cross compiler for a subset of C.
Features
- Supported target and host architectures: OpenRISC, WebAssembly, RISC-V RV32IM, ARMv6-M (Thumb-2) and x86-32.
- Valid source code for Puny C is also valid C99 and can be written in a way that gcc or clang compile it without any warning.
- Code generation is designed to be easily portable to other target architectures.
- Fast compilation, small code size.
PunyCC can compile itself. There is a separate compiler executable for every host and target combination. Host is the architecture where the compiler runs and target is the ISA of the compiled binary. Each compiler is smaller than 10 KByte:
| target \ host | wasm | x86 | armv6m | rv32 | or1k |
|---|---|---|---|---|---|
| wasm | 6049 | 7259 | 7214 | 7476 | 9244 |
| x86 | 6145 | 7442 | 7518 | 7624 | 9560 |
| armv6m | 6219 | 7454 | 7586 | 7908 | 9736 |
| rv32 | 6448 | 8041 | 8094 | 8152 | 9912 |
| or1k | 6379 | 7791 | 7994 | 8028 | 9784 |
- No linker.
- No preprocessor.
- No standard library.
- No typedef.
- No type checking. Variable types are always unsigned int, except if indexed with [], then the type is char *.
- Any combination of unsigned, long, int, char, void and * is accepted as a valid type.
- Type casts are allowed, but ignored.
- Constants: only decimal, character and string without backslash escape.
- Statements: if, while, return.
- Variable declaration: C99-style statements.
- Operators: no unary, ternary or compound assignment operators.
- Operator precedence: simplified, use parentheses instead.
| level | operator | description |
|---|---|---|
| 1 | [] () | array and function call |
| 2 | + - << >> & ^ \| | binary operation |
| 3 | < <= > >= == != | comparison |
| 4 | = | assignment |
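As an illustration of the restrictions above, the following sketch stays within the listed subset: only if, while and return, no unary or compound assignment operators, decimal and character constants, and explicit parentheses wherever two operators meet. The function names puny_strlen and puny_atoi are made up for this example. The code is written as plain C99 and has not been verified against PunyCC itself. Multiplication is avoided because the precedence table does not list it.

```c
/* Hypothetical helpers written in the Puny C subset (also valid C99). */

/* Length of a zero-terminated string. */
unsigned int puny_strlen(char *s)
{
    unsigned int n = 0;
    while (s[n] != 0) {
        n = n + 1;              /* no ++ and no += in the subset */
    }
    return n;
}

/* Parse leading decimal digits; stops at the first non-digit. */
unsigned int puny_atoi(char *s)
{
    unsigned int value = 0;
    unsigned int i = 0;
    while ((s[i] >= '0') & (s[i] <= '9')) {
        /* value * 10 expressed with shifts; every sub-expression is
           parenthesized because all binary operators share one level */
        value = ((value << 3) + (value << 1)) + (s[i] - '0');
        i = i + 1;
    }
    return value;
}
```

Because no preprocessor directive is used, the same text can be handed to gcc or clang unchanged.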
Inspired by
- cc500 - a tiny self-hosting C compiler by Edmund Grimley Evans
- Obfuscated Tiny C Compiler - very small self compiling C compiler by Fabrice Bellard
- Tiny C Compiler - a small but hyper fast C compiler.
- Compiler Construction - brief but comprehensive book by Niklaus Wirth.
To build punycc for all target architectures use
./make.sh compile_native
The executables are named build/punycc_ARCH.native.
They read C source code from stdin and write an executable to stdout:
./punycc_x86.native < foo.c > foo.x86
To run the compiled program, first mark it as executable:
chmod +x foo.x86
./foo.x86
A cross compiled executable can be emulated with qemu:
./punycc_rv32.clang < foo.c > foo.rv32
chmod +x foo.rv32
qemu-riscv32 foo.rv32
There is no standard library or standard include files. Everything must be in
the single source code file. The host_ARCH.c files have some rudimentary
implementations of standard functions that are needed for the compiler.
Use them by concatenating files:
cat host_rv32.c hello.c | ./punycc_rv32.x86 > hello.rv32
Build every host and target combination and check that each compiler produces the same output regardless of the host it runs on:
./make.sh test_full
Show the compiler sizes of all combinations:
./make.sh stats
There is no inline assembler for functions that directly access the operating system (e.g. file I/O). Instead, such functions can be written as raw machine code:
void exit(int) _Pragma("PunyC emit \x58\x5b\x31\xc0\x40\xcd\x80");
/* 58 pop eax
5b pop ebx
31 c0 xor eax, eax
40 inc eax
cd 80 int 128 */
Other compilers ignore the unknown _Pragma, which turns the line into a plain
forward declaration that can be resolved by linking against libc.
Each compiler consists of three parts:
- Host-specific standard functions for I/O in host_ARCH.c
- Target-specific code generation in emit_ARCH.c
- Architecture-independent compiler parts (scanner, parser and symbol table) in punycc.c

Concatenate the three files and compile the result, for example:
cat host_x86.c emit_x86.c punycc.c | ./punycc_x86.clang > punycc_x86.x86
Cross compilers can be built by using a different ARCH for host_ and emit_:
cat host_x86.c emit_armv6m.c punycc.c | ./punycc_x86.clang > punycc_armv6m.x86
There is only one buffer buf.
The code grows from 0 upwards, the symbol table grows from the top downwards.
The token buffer for strings and identifiers is dynamically allocated in the
space between them:
0      code_pos        code_pos+256        sym_head-256    sym_head       buf_size
                       token_buf           token_buf+token_size
+------+---------------+-------------------+---------------+--------------+
| code | 256 bytes     | identifier/string | 256 bytes     | symbol table |
+------+---------------+-------------------+---------------+--------------+
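The arithmetic behind this layout can be sketched as follows. This is only an illustration of the diagram, not code taken from PunyCC: the helper names token_buf_start and token_capacity and the GUARD macro are assumptions, and the purpose of the two 256-byte gaps is not stated here, so the sketch treats them simply as fixed distances.

```c
#include <assert.h>
#include <stddef.h>

/* Distance kept on each side of the token buffer (taken from the diagram). */
#define GUARD 256

/* Start of the token buffer for the current code position. */
size_t token_buf_start(size_t code_pos)
{
    return code_pos + GUARD;
}

/* Largest token_size that still fits between code and symbol table.
   Returns 0 when the two regions are already too close together. */
size_t token_capacity(size_t code_pos, size_t sym_head)
{
    assert(code_pos <= sym_head);   /* code must not overlap the symbol table */
    if (sym_head - code_pos < 2 * GUARD) {
        return 0;
    }
    return (sym_head - GUARD) - token_buf_start(code_pos);
}
```

Since code_pos rises and sym_head falls during compilation, the available token space shrinks from both sides.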
The symbol table starts at sym_head and ends at the end of the buffer. It is the concatenation of symbol entries with the following format:
| offset | size | description |
|---|---|---|
| 0 | 4 bytes | address (little endian) |
| 4 | 1 byte | symbol type |
| 5 | 1 byte | n: length of name |
| 6 | n bytes | name |
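A minimal sketch of how entries in this format could be written and searched is shown below. The function names sym_add, sym_find and sym_address are hypothetical and not taken from punycc.c. The sketch assumes that a new entry is placed directly below sym_head, which matches the table growing downwards.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Read the 4-byte little-endian address at the start of an entry. */
unsigned long sym_address(const unsigned char *entry)
{
    return (unsigned long)entry[0]
         | ((unsigned long)entry[1] << 8)
         | ((unsigned long)entry[2] << 16)
         | ((unsigned long)entry[3] << 24);
}

/* Prepend an entry below sym_head and return the new sym_head. */
size_t sym_add(unsigned char *buf, size_t sym_head, unsigned long addr,
               unsigned char type, const char *name)
{
    size_t n = strlen(name);
    assert(n <= 255);               /* the length field is a single byte */
    assert(sym_head >= 6 + n);      /* entry must fit below sym_head */
    size_t pos = sym_head - (6 + n);
    buf[pos + 0] = (unsigned char)(addr & 255);
    buf[pos + 1] = (unsigned char)((addr >> 8) & 255);
    buf[pos + 2] = (unsigned char)((addr >> 16) & 255);
    buf[pos + 3] = (unsigned char)((addr >> 24) & 255);
    buf[pos + 4] = type;
    buf[pos + 5] = (unsigned char)n;
    memcpy(buf + pos + 6, name, n);
    return pos;
}

/* Walk the entries from sym_head to buf_size.
   Returns the offset of the first match, or buf_size if none is found. */
size_t sym_find(const unsigned char *buf, size_t sym_head, size_t buf_size,
                const char *name)
{
    size_t n = strlen(name);
    size_t pos = sym_head;
    while (pos + 6 <= buf_size) {
        size_t len = buf[pos + 5];
        if (len == n && memcmp(buf + pos + 6, name, n) == 0) {
            return pos;
        }
        pos += 6 + len;             /* skip header and name */
    }
    return buf_size;
}
```

With this arrangement the most recently added symbol is found first, so a later definition of the same name would shadow an earlier one.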
The functions prefixed by emit_ are used to generate the machine code. The
template in emit_template.c documents all functions and can be used as
starting point for a new architecture backend. The steps to create the OpenRISC backend
are documented in codegen/or1k/steps.md and may be helpful, too.