A collection of Run-Length Encoders and Decoders, and associated tooling for exploring this space. So far there are only four animals in the zoo. It's a very small zoo.
- WHILE THIS NOTE PERSISTS, I MAY FORCE PUSH TO MASTER
- The codecs are written foremost to be robust, correct, and clear and easy to understand, not for performance.
- I'm mostly interested in exploring legacy formats.
- Single-Header Libraries.
- Releases are:
- Valgrind clean,
- scan-build clean, and
- AFL++ clean (for some reasonable run-time).
"Run-length encoding (RLE) is a form of lossless data compression in which runs of data (sequences in which the same data value occurs in many consecutive data elements) are stored as a single data value and count, rather than as the original run." -- Wikipedia
At its most basic, a Run-Length Encoder process input into a series of REP (repeat) and CPY (copy) operations. Sometimes the CPY operation is explicit, sometimes it's in the form of literals (LIT).
The rle_*.h
files are single-header libraries. If you just need one, any one, I recommend downloading rle_packbits.h
and
looking at test_example.c
for how to use it.
#define RLE_ZOO_PACKBITS_IMPLEMENTATION
#include "rle_packbits.h"
int main(void) {
const uint8_t input[] = "ABBBBA";
size_t len = sizeof(input) - 1;
// Call with NULL for dest buffer to calculate output size.
ssize_t expected_size = packbits_compress(input, len, NULL, 0);
...
rle-zoo
can encode and decode files using any of the supplied variants.
rle-genops
can be used to generate complete code word/OPs lists for supported variants, and contains code that verifies
the encoding and decoding scheme for a variant is consistent. Post-implementation this is mostly useful for debugging,
'manual parsing' and reverse-engineering of unknown RLE streams. It can also generate C tables for implementing table-driven
encoders and decoders.
rle-parser
can be used to parse a file using the available RLE variants, which could help identify the
variant used on some unknown data. It also acts as a demonstrator for using rle-genops
tables. It
is a work in progress though, and encoding is broken for some tables.
Usage: ./rle-parser [-d|-e] [-s] [-o offset] [-n len] [-t variant|all] <file>
options:
-d|-e decode / encode(broken)
-s silent -- no debug print
-o file offset to start at
-n number of bytes to process
-t codec name, or 'all'
Available variants:
goldbox
packbits
pcx
icns
Example parsing a packbits encoded file:
$ hexdump -C tests/packbits/tn1023.rle
00000000 fe aa 02 80 00 2a fd aa 03 80 00 2a 22 f7 aa |.....*.....*...|
0000000f
$ ./rle-parser -d -t packbits tests/packbits/tn1023.rle
Reading input from 'tests/packbits/tn1023.rle' (offset=0x0, max len=0x2000)
Parsing 15 byte buffer with 'packbits'
00000000: <fe> REP 3 'aa'
00000002: <02> CPY 3 ; 80 00 2a
00000006: <fd> REP 4 'aa'
00000008: <03> CPY 4 ; 80 00 2a 22
0000000d: <f7> REP 10 'aa'
Parse: rp=15, wp=24
Example parsing a file into packbits format:
$ hexdump -C tests/packbits/tn1023
00000000 aa aa aa 80 00 2a aa aa aa aa 80 00 2a 22 aa aa |.....*......*...|
00000010 aa aa aa aa aa aa aa aa |........|
$ ./rle-parser -e -t packbits tests/packbits/tn1023
Reading input from 'tests/packbits/tn1023' (offset=0x0, max len=0x2000)
Encoding 24 byte buffer with 'packbits'
Encode params = { cpy:{ 1, 128 }, rep:{2, 128} }
00000000: <fe> REP 3 'aa'
00000003: <02> CPY 3 ; 80 00 2a
00000006: <fd> REP 4 'aa'
0000000a: <03> CPY 4 ; 80 00 2a 22
0000000e: <f7> REP 10 'aa'
rp=24, wp=15
Currently the following extraordinary specimens are grazing the fertile grounds of this most amazing Zoo:
Variant | Type | Code Use | REP Range | CPY/LIT Range | Spec. | Notes |
---|---|---|---|---|---|---|
Packbits | CPY | Near-optimal | 2 - 128 | 1 - 128 | ref1 | One code (0x80) wasted on NOP. |
Goldbox | CPY | Sub-optimal | 1 - 127 | 1 - 126 | n/a | Used by SSI Goldbox titles. Many quirks. |
PCX | LIT | Sub-optimal | 0 - 63 | 0 - 191 | link | |
Apple ICNS | CPY | Optimal | 3 - 130 | 1 - 128 | ref |
The packbits
variant reportedly originates from the Apple Macintosh1, but spread far and wide from there, including
into Electronic Arts IFF ILBM.
-
One OP byte encoding the operation and
length
:- 0x00 => CPY 1
- 0x01 => CPY 2
- ..
- 0x7e => CPY 127
- 0x7f => CPY 128
- 0x80 => reserved
- 0x81 => REP 128
- 0x82 => REP 127
- ..
- 0xfe => REP 3
- 0xff => REP 2
-
CPY: If OP high-bit is CLEAR (equally; signed OP is non-negative), then
OP + 1
literal bytes follows. -
REP: If OP high-bit is SET (equally; signed OP is negative), the next byte is repeated 257 - OP (or 1 - signed OP) times.
The encoder should not generate 0x80, which is reserved. The technical note from Apple states that a decoder should "[handle] this situation by skipping any flag-counter byte with this value and interpreting the next byte as the next flag-counter byte."
The existance of the NOP
and REP 2
-- which requires two bytes to encode a run of two bytes -- are inefficiencies in the coding. Not terrible, but you can do better.
The goldbox
variant has been organically hand-crafted to be 100% compatible with the classic Goldbox games Pool of Radiance and Curse of the Azure Bonds,
and probably more. I have round-trip verified the decompressor and compressor on all resource files in both of those games, though the tests provided
in this repository are much more modest.
This variant is mostly interesting in the ways it's suboptimal; It's slightly short-stroked, and doesn't deal optimally with the end of the input.
- One OP byte encoding the operation and
length
:- 0x00 => CPY 1
- 0x01 => CPY 2
- ..
- 0x7d => CPY 126
- 0x7e => CPY 127 ; not used by encoder
- 0x7f => CPY 128 ; not used by encoder
- 0x80 => REP 128 ; not used by encoder
- 0x81 => REP 127
- 0x82 => REP 126
- ..
- 0xfe => REP 2
- 0xff => REP 1
- CPY: If OP high-bit is CLEAR (equally; signed OP is non-negative), then
OP + 1
literal bytes follows. - REP: If OP high-bit is SET (equally; signed OP is negative), then the next byte is to be repeated 1 + the bitwise complement of OP (or equally; the negation of the signed
OP
) times.
To the best of my knowledge, the encoder will never use OPs 0x7e, 0x7f or 0x80, i.e it will never output a literal run of 127 or 128 characters, nor a run of 128 repeated characters.
In addition to not using all the encodings, having both CPY 1 and REP 1 is obviously suboptimal.
It will also insist on encoding the tail of the input as a REP OP if it hits the end while counting literals to CPY.
To the last point, the input 1234
will be encoded as 02 31 32 33 ff 34
, i.e CPY 3 characters, then REP 1 character.
These limits and quirks were determined experimentally using existing game data files. The decoders in the games MAY accept more optimal encodings, but the encoder provided here was specifically crafted such that encoding the output of the decoder is bit-identitical to the original input.
The PCX
variant comes from the ZSoft IBM PC Paintbrush software and its associated PCX image format.
This image format was popular on the PC platform in the mid-1980s and well into the 1990s2, but is now thoroughly obsolete.
The only compression ever defined for this format is a simple RLE variant which uses two bits of a byte to encode REPs, and leaves the rest as literals.
- One OP byte encoding REP and
length
, or used as-is:- 0x00 => LIT 0
- 0x01 => LIT 1
- ..
- 0xbd => LIT 189
- 0xbe => LIT 190
- 0xbf => LIT 191
- 0xc0 => REP 0
- 0xc1 => REP 1
- 0xc2 => REP 2
- ..
- 0xfe => REP 62
- 0xff => REP 63
- REP: If the top two bits of OP are SET (equally; OP is >= 192), then the next byte is repeated
OP & 0x3F
times (0-63). - LIT: Any other byte value is used as-is (0-191).
Obviously any single input byte larger than 191 MUST be encoded as a REP 1 -- expanding the output with one byte -- but an encoder COULD encode any value using REP 1. In addition, depending on how the encoder is structured, it may encode two repeated literals as-is, or as a REP 2. An encoder that scans for runs of literals will probably do the first, while one that output literals as they are seen will probably do the former.
The existance of a REP 0
operation is an inefficiency, and allows the encoder to encode data that is not included in the decoded output.
Used by Apple to compress icon resources. This RLE variant uses a sensible encoding scheme, recognizing that
any REP
with a count of 2 or less is redundant, and doesn't waste precious code space on a NOP
.
- One OP byte encoding the operation and
length
:- 0x00 => CPY 1
- 0x01 => CPY 2
- ..
- 0x7d => CPY 126
- 0x7e => CPY 127
- 0x7f => CPY 128
- 0x80 => REP 3
- 0x81 => REP 4
- 0x82 => REP 5
- ..
- 0xfe => REP 129
- 0xff => REP 130
- CPY: If high-bit is clear, then
OP + 1
literal bytes follow. - REP: If high-bit is set, then repeat next byte
OP - 125
times. Alternatively, clear the high bit and add three to get the count.
The CPY
OPs (0x00-0x7f) are the same as in packbits, but the REP
OPs (0x80-) have been adjusted up to a minimum count of three.
They also come in ascending order compared with packbits; more characters are copied as the OP increase in value, versus fewer in packbits.
- Add 'all' variant compression reporting to
rle-zoo
- Make the rle-parse encoder follow limits/correct.
- Perhaps abandon table idea, generate C source for the codec functions instead.
- Support n-bit variants. At least nibbles + 16-bit, but why not everything.
- Add more animals. Potential candidates: BMP(?), TGA, EXEPACK(?!), many examples here...
- Improve
rle-zoo
to behave more like a standard UNIX filter.
All code is provided under the MIT License.
Footnotes
-
"Understanding PackBits", Apple Technical Note TN1023. ↩ ↩2
-
The MAME project was using PCX up till May 1999; "Switched to PNG from PCX as the main screenshot image format", Project Milestones ↩