bobbl/punycc

Puny C Compiler

Limited expressiveness, unlimited portability

Very small cross compiler for a subset of C.

Features

  • Supported target and host architectures: OpenRISC, WebAssembly, RISC-V RV32IM, ARMv6-M (Thumb-2) and x86-32.
  • Valid Puny C source code is also valid C99 and can be written so that gcc and clang compile it without any warnings.
  • Code generation is designed to be easily portable to other target architectures.
  • Fast compilation, small code size.

Compiler Size

PunyCC can compile itself. There is a separate compiler executable for every combination of host and target: the host is the architecture the compiler runs on, and the target is the ISA of the binaries it produces. Each compiler is smaller than 10 KByte. The sizes in the table are in bytes:

target \ host    wasm     x86  armv6m    rv32    or1k
wasm             6049    7259    7214    7476    9244
x86              6145    7442    7518    7624    9560
armv6m           6219    7454    7586    7908    9736
rv32             6448    8041    8094    8152    9912
or1k             6379    7791    7994    8028    9784

Language Restrictions

  • No linker.
  • No preprocessor.
  • No standard library.
  • No typedef.
  • No type checking. Every variable is treated as unsigned int, unless it is indexed with [], in which case it is treated as char *.
  • Any combination of unsigned, long int, char, void and * is accepted as a valid type.
  • Type casts are allowed, but ignored.
  • Constants: only decimal, character and string constants, without backslash escapes.
  • Statements: if, while, return.
  • Variable declaration: C99-style statements.
  • Operators: no unary operators, no ternary operator, no extended (compound) assignment.
  • Operator precedence: simplified, use parentheses instead.
level  operator           description
1      [] ()              array and function call
2      + - << >> & ^ |    binary operation
3      < <= > >= == !=    comparison
4      =                  assignment
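
As a rough illustration of these restrictions, the following sketch tries to stay within the subset listed above while remaining ordinary C99. The function names str_len and pack16 are hypothetical and not part of punycc, and it is assumed that braces and function parameters behave as they do in C.

```c
/* Hypothetical helpers, written to stay inside the Puny C subset. */

/* Length of a zero-terminated string.
   Indexing with [] makes s behave as char *. */
unsigned int str_len(char *s)
{
    unsigned int n = 0;
    while (s[n] != 0) {
        n = n + 1;              /* no ++ and no += available */
    }
    return n;
}

/* Combine two byte values into one 16-bit value. */
unsigned int pack16(unsigned int hi, unsigned int lo)
{
    /* All binary operators share one precedence level in Puny C,
       so every subexpression is parenthesized explicitly. */
    return ((hi & 255) << 8) | (lo & 255);
}
```

Because every subexpression is parenthesized, the result does not depend on whether the C precedence rules or the simplified Puny C rules are applied.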

Inspired by

Usage

To build punycc for all target architectures use:

./make.sh compile_native

The executables are named build/punycc_ARCH.native. They read C source code from stdin and write an executable to stdout:

./punycc_x86.native < foo.c > foo.x86

Before foo.x86 can be run, it must be marked as executable:

chmod +x foo.x86
./foo.x86

A cross-compiled executable can be run under qemu:

./punycc_rv32.clang < foo.c > foo.rv32
chmod +x foo.rv32
qemu-riscv32 foo.rv32

There is no standard library and there are no standard include files. Everything must be in a single source file. The host_ARCH.c files contain rudimentary implementations of the standard functions that the compiler itself needs. Use them by concatenating files:

cat host_rv32.c hello.c | ./punycc_rv32.x86 > hello.rv32

To compile every host and target combination and check that the compilers produce the same output regardless of the host they run on, use:

./make.sh test_full

Show the compiler sizes of all combinations:

./make.sh stats

Low-Level Functions

There is no inline assembler. Functions that access the operating system directly (e.g. for file I/O) can instead be written as raw machine code:

void exit(int) _Pragma("PunyC emit \x58\x5b\x31\xc0\x40\xcd\x80");
/*  58      pop eax         ; remove return address
    5b      pop ebx         ; ebx = exit status (the argument)
    31 c0   xor eax, eax
    40      inc eax         ; eax = 1 (Linux exit system call)
    cd 80   int 128         ; invoke the system call */

Other compilers ignore the _Pragma operator, so for them the line is a plain forward declaration that can be resolved by linking against libc.

Implementation Details

Each compiler consists of three parts:

  1. Host-specific standard functions for i/o in host_ARCH.c
  2. Target-specific code generation in emit_ARCH.c
  3. Architecture independent compiler parts (scanner, parser and symbol table)

Concatenate the three files and compile the result, for example:

cat host_x86.c emit_x86.c punycc.c | ./punycc_x86.clang > punycc_x86.x86

Cross compilers can be built by using a different ARCH for host_ and emit_:

cat host_x86.c emit_armv6m.c punycc.c | ./punycc_x86.clang > punycc_armv6m.x86

Memory Management

There is only one buffer buf. The code grows from 0 upwards, the symbol table grows from the top downwards. The token buffer for strings and identifiers is dynamically allocated in the space between them:

0   code_pos     code_pos+256         sym_head-256      sym_head   buf_size
                   token_buf      token_buf+token_size
+------+---------------+-------------------+---------------+--------------+
| code |   256 bytes   | identifier/string |   256 bytes   | symbol table |
+------+---------------+-------------------+---------------+--------------+
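
The arithmetic behind this layout can be sketched as follows. This reads the diagram as placing token_buf at code_pos+256 and bounding it at sym_head-256. The helper names token_buf_start and token_buf_room are hypothetical and do not appear in punycc.

```c
/* Sketch only: variable names mirror the diagram,
   the helper functions themselves are hypothetical. */

/* Start of the token buffer: 256 bytes above the current end of the code. */
unsigned int token_buf_start(unsigned int code_pos)
{
    return code_pos + 256;
}

/* Bytes available for one identifier or string, or 0 if the two
   256-byte gaps no longer fit between code and symbol table. */
unsigned int token_buf_room(unsigned int code_pos, unsigned int sym_head)
{
    if (sym_head < (code_pos + 512)) {
        return 0;
    }
    return (sym_head - 256) - (code_pos + 256);
}
```

As code_pos grows and sym_head shrinks, the room for a single token shrinks with them, which is why the token buffer has no fixed address.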

Symbol Table

The symbol table starts at sym_head and ends at the end of the buffer. It is the concatenation of symbol entries with the following format:

offset  size     description
0       4 bytes  address (little endian)
4       1 byte   symbol type
5       1 byte   n: length of name
6       n bytes  name
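
A lookup over entries of this format could look roughly like the sketch below. It is written in plain C99 rather than the Puny C subset, and the names read_le32 and sym_find are hypothetical; the actual lookup in punycc.c may be organized differently.

```c
/* Sketch only: read_le32 and sym_find are hypothetical names. */

/* Read the 4-byte little-endian address field starting at p. */
unsigned int read_le32(const unsigned char *p)
{
    return (unsigned int)p[0] | ((unsigned int)p[1] << 8)
         | ((unsigned int)p[2] << 16) | ((unsigned int)p[3] << 24);
}

/* Search the entries in buf[sym_head .. buf_size) for a name of length len.
   Returns the offset of the matching entry, or buf_size if none matches. */
unsigned int sym_find(const unsigned char *buf, unsigned int sym_head,
                      unsigned int buf_size, const char *name,
                      unsigned int len)
{
    unsigned int pos = sym_head;
    while (pos < buf_size) {
        unsigned int n = buf[pos + 5];      /* offset 5: length of name */
        if (n == len) {
            unsigned int i = 0;
            while ((i < n) && (buf[pos + 6 + i] == (unsigned char)name[i])) {
                i = i + 1;
            }
            if (i == n) {
                return pos;
            }
        }
        pos = pos + 6 + n;                  /* 6-byte header plus name */
    }
    return buf_size;
}
```

Since each entry stores its own name length, the table can be walked without any separate index: the next entry always begins 6 + n bytes after the current one.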

Code Generation

The functions prefixed by emit_ generate the machine code. The template in emit_template.c documents all of these functions and can be used as a starting point for a new architecture backend. The steps taken to create the OpenRISC backend are documented in codegen/or1k/steps.md and may be helpful, too.
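
To give a feel for what such a backend function does, here is a minimal sketch that appends the x86-32 encoding of mov eax, imm32 to a code buffer. All names in it (sketch_buf, sketch_code_pos, sketch_emit8, sketch_emit32, sketch_emit_load_imm) are hypothetical and are not taken from emit_template.c, which remains the reference for the real interface.

```c
/* Sketch only: hypothetical names, not the interface of emit_template.c. */

unsigned char sketch_buf[64];       /* stands in for the compiler's buffer */
unsigned int  sketch_code_pos = 0;  /* index of the next free code byte */

/* Append one byte of machine code (no bounds check in this sketch). */
void sketch_emit8(unsigned int b)
{
    sketch_buf[sketch_code_pos] = (unsigned char)(b & 255);
    sketch_code_pos = sketch_code_pos + 1;
}

/* Append a 32-bit value in little-endian byte order. */
void sketch_emit32(unsigned int v)
{
    sketch_emit8(v);
    sketch_emit8(v >> 8);
    sketch_emit8(v >> 16);
    sketch_emit8(v >> 24);
}

/* x86-32: mov eax, imm32 is encoded as B8 followed by the immediate. */
void sketch_emit_load_imm(unsigned int value)
{
    sketch_emit8(0xB8);
    sketch_emit32(value);
}
```

A backend for another architecture would keep the same kind of entry points and change only the bytes that are written.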
