MIT license

pgzf-rs

A Rust implementation of PGZF (Parallel GZip Format), a blocked compression format that extends standard GZIP (RFC 1952) with parallel compression/decompression and random access support.

The PGZF format was designed and created by Jue Ruan (ruanjue@gmail.com). This crate is a Rust reimplementation of his original C implementation. All credit for the format design belongs to the original author.

If you use PGZF, please cite: Parallel random access GZIP format file. Jue Ruan. https://github.com/ruanjue/pgzf

Features

  • Parallel compression -- blocks within a group are compressed concurrently via rayon
  • Parallel decompression -- readahead buffer with batched parallel decompression and a configurable readahead size
  • Block cache -- LRU cache for decompressed blocks, survives seeks to avoid redundant I/O and decompression
  • Random access -- seek by byte offset or block index using the built-in index
  • Low-level raw block API -- inspect or process raw gzip members without decompression
  • GZIP compatible -- every PGZF file is a valid sequence of gzip members; gzip -d can decompress it
  • Streaming API -- implements std::io::Write (compressor) and std::io::Read + std::io::Seek (decompressor)
  • Auto-detection -- reader automatically detects PGZF vs standard gzip files
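
Because every PGZF block is an ordinary gzip member, its header can be inspected with nothing but the layout defined in RFC 1952. The sketch below is not part of the pgzf API: GzipHeader, parse_gzip_header and extra_subfields are hypothetical names, and it only reads the fixed header and the optional FEXTRA field. Which FEXTRA subfields PGZF itself writes is not stated here, so the code reports whatever it finds.

```rust
/// Fixed part of a gzip member header, as laid out in RFC 1952.
/// (Hypothetical helper type, not part of the pgzf crate.)
#[derive(Debug, PartialEq)]
pub struct GzipHeader {
    pub flags: u8,
    pub mtime: u32,
    /// Raw bytes of the optional FEXTRA field (empty if FEXTRA is not set).
    pub extra: Vec<u8>,
}

/// FLG bit that signals the presence of an extra field.
const FEXTRA: u8 = 0x04;

/// Parses the header at the start of `buf`. Returns `None` if `buf` does not
/// start with a deflate-compressed gzip member or is too short.
pub fn parse_gzip_header(buf: &[u8]) -> Option<GzipHeader> {
    // ID1, ID2, CM (8 = deflate), FLG, MTIME (4 bytes), XFL, OS = 10 bytes.
    if buf.len() < 10 || buf[0] != 0x1f || buf[1] != 0x8b || buf[2] != 8 {
        return None;
    }
    let flags = buf[3];
    let mtime = u32::from_le_bytes([buf[4], buf[5], buf[6], buf[7]]);
    let mut extra = Vec::new();
    if flags & FEXTRA != 0 {
        // XLEN is a little-endian u16 that follows the fixed header.
        let xlen = u16::from_le_bytes([*buf.get(10)?, *buf.get(11)?]) as usize;
        extra = buf.get(12..12 + xlen)?.to_vec();
    }
    Some(GzipHeader { flags, mtime, extra })
}

/// Splits an FEXTRA payload into (SI1, SI2, data) subfields.
/// Stops at the first truncated subfield.
pub fn extra_subfields(extra: &[u8]) -> Vec<(u8, u8, &[u8])> {
    let mut out = Vec::new();
    let mut pos = 0;
    while pos + 4 <= extra.len() {
        let len = u16::from_le_bytes([extra[pos + 2], extra[pos + 3]]) as usize;
        let Some(data) = extra.get(pos + 4..pos + 4 + len) else { break };
        out.push((extra[pos], extra[pos + 1], data));
        pos += 4 + len;
    }
    out
}
```

Calling parse_gzip_header on the first bytes of a file is also a cheap way to tell gzip-family input from anything else before handing it to a reader.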

Install

[dependencies]
pgzf = "0.7"

CLI Usage

# Compress (file -> file.gz)
pgzf input.txt

# Decompress (file.gz -> file)
pgzf -d input.txt.gz

# Stdin/stdout
echo "hello" | pgzf > out.gz
pgzf -d < out.gz

# Keep original files
pgzf -k input.txt

# Write to stdout
pgzf -c input.txt

# Compression level 9 with 4 threads
pgzf -l 9 -t 4 input.txt

# Random access: read 100 bytes at offset 1000
pgzf -d -s 1000 -q 100 input.txt.gz

# Inspect file info
pgzf -i input.txt.gz

CLI Options

Usage: pgzf [OPTIONS] [FILE]...

Options:
  -d                  Decompress
  -c                  Write to stdout, keep original files
  -k                  Keep input files
  -f                  Force overwrite
  -o <OUTPUT>         Output file
  -t <THREADS>        Number of threads [default: 8]
  -b <BLOCK_SIZE_MB>  Block size in MB (1-256) [default: 1]
  -g <GROUP_BLOCKS>   Number of blocks per group [default: 8000]
  -s <SEEK_BYTE>      Seek to byte offset (decompress only)
  -q <LIMIT>          Limit output bytes (decompress only)
  -l <LEVEL>          Compression level (1-9) [default: 6]
  -i                  Inspect compressed file info
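
To get a feel for how -b and -g interact, the following sketch estimates how many blocks and groups an input of a given size produces. It assumes fixed-size uncompressed blocks, 1 MB = 2^20 bytes, and that the last block and last group may be partially filled; the real on-disk layout may differ, and layout is a hypothetical helper, not a pgzf function.

```rust
/// Returns (blocks, groups) for an input of `input_len` bytes.
/// Assumption: every block holds `block_size_mb` MB of uncompressed data
/// except possibly the last one.
pub fn layout(input_len: u64, block_size_mb: u64, group_blocks: u64) -> (u64, u64) {
    assert!((1..=256).contains(&block_size_mb), "block size must be 1-256 MB");
    assert!(group_blocks > 0, "a group needs at least one block");
    let block_bytes = block_size_mb * 1024 * 1024; // assumption: 1 MB = 2^20 bytes
    let blocks = input_len.div_ceil(block_bytes);
    let groups = blocks.div_ceil(group_blocks);
    (blocks, groups)
}
```

Under these assumptions a 10 GiB input with the default settings (1 MB blocks, 8000 blocks per group) yields 10240 blocks in 2 groups.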

Library Usage

Compress

use pgzf::{PgzfWriter, PgzfConfig};
use std::io::{Write, Cursor};

let config = PgzfConfig::builder()
    .block_size_mb(1)
    .group_blocks(8000)
    .compression_level(6)
    .build();

let mut writer = PgzfWriter::with_config(Cursor::new(Vec::new()), config);
writer.write_all(b"Hello, PGZF!").unwrap();
let cursor = writer.finish().unwrap();
let compressed = cursor.into_inner();

Decompress

use pgzf::PgzfReader;
use std::io::Read;

let mut reader = PgzfReader::new(std::io::Cursor::new(compressed)).unwrap();
let mut output = String::new();
reader.read_to_string(&mut output).unwrap();
assert_eq!(output, "Hello, PGZF!");

Random Access

use pgzf::PgzfReader;
use std::io::{Read, Seek, SeekFrom};

let mut reader = PgzfReader::new(file).unwrap();

// Seek by byte offset
reader.seek_to_byte(1000).unwrap();
let mut buf = [0u8; 100];
// read() may return fewer bytes than requested; read_exact() fills the whole buffer
reader.read_exact(&mut buf).unwrap();

// Seek by block index
reader.seek_to_block(5).unwrap();

// Standard Seek trait
reader.seek(SeekFrom::Start(500)).unwrap();
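
Since the reader implements the standard Read and Seek traits, range reads can be written against those traits alone and tested without any compressed data. The helper below is a minimal sketch with a hypothetical name (read_range); it is demonstrated on std::io::Cursor and should work the same way with any other Read + Seek type.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Reads up to `len` bytes starting at uncompressed offset `offset`.
/// Returns fewer bytes only if the end of the stream is reached.
pub fn read_range<R: Read + Seek>(reader: &mut R, offset: u64, len: u64) -> io::Result<Vec<u8>> {
    reader.seek(SeekFrom::Start(offset))?;
    let mut out = Vec::new();
    // take() caps the read at `len` bytes; read_to_end() loops over short reads.
    reader.by_ref().take(len).read_to_end(&mut out)?;
    Ok(out)
}
```

Unlike a single read() call, this loops until the requested length or the end of the stream is reached, so short reads do not truncate the result.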

Read Block Range

Use read_blocks to decompress a contiguous range of blocks at once. The blocks are decompressed in parallel internally.

use pgzf::PgzfReader;
use std::io::Read;

let mut reader = PgzfReader::new(file).unwrap();

// Read 4 blocks starting at block index 2 (blocks 2-5)
let data = reader.read_blocks(2, 4).unwrap();
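
The idea behind decompressing a block range in parallel while still returning the bytes in order can be sketched with scoped threads from the standard library. This is an illustration only, not how the crate is implemented: process_blocks_parallel is a hypothetical name, the closure stands in for the per-block decompression step, and one thread per block is spawned for simplicity where a thread pool would bound the thread count.

```rust
use std::thread;

/// Applies `f` to every block on its own scoped thread and concatenates
/// the results in block order.
pub fn process_blocks_parallel<F>(blocks: &[Vec<u8>], f: F) -> Vec<u8>
where
    F: Fn(&[u8]) -> Vec<u8> + Sync,
{
    let f = &f;
    let parts: Vec<Vec<u8>> = thread::scope(|s| {
        let handles: Vec<_> = blocks
            .iter()
            .map(|b| s.spawn(move || f(b.as_slice())))
            .collect();
        // Joining in spawn order keeps the output in block order,
        // regardless of which thread finishes first.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    parts.concat()
}
```

Because each block is an independent gzip member, no state has to be shared between the workers; only the final concatenation is sequential.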

Raw Block Iteration

Use read_one_raw_block to iterate over raw gzip members without decompression. Each block is returned as a RawBlock struct with the full gzip member, block type, and block index.

use pgzf::{PgzfReader, RawBlock};

let mut reader = PgzfReader::new(file).unwrap();
while let Some(RawBlock { block_index, block_type, raw, .. }) = reader.read_one_raw_block()? {
    println!("block {block_index}: type={block_type:?}, size={}", raw.len());
}

Configure Reader Threading

Control the readahead batch size (number of blocks decompressed in parallel) and the global rayon thread pool:

use pgzf::PgzfReader;

// Configure the global rayon thread pool first, before creating any reader.
// build_global() returns an error if the global pool was already initialized.
rayon::ThreadPoolBuilder::new()
    .num_threads(4)
    .build_global()
    .unwrap();

// Builder-style (consumes reader)
let reader = PgzfReader::new(file)?.with_readahead(16);

// Or mutate an existing reader
let mut reader = PgzfReader::new(file)?;
reader.set_readahead_size(16);
println!("readahead: {}", reader.readahead_size());

Block Cache

The reader maintains an LRU cache of decompressed blocks that survives seeks. This avoids redundant I/O and decompression when seeking back to previously accessed positions. The cache is enabled by default with a capacity of 64 blocks.

use pgzf::PgzfReader;

// Default cache (64 blocks)
let reader = PgzfReader::new(file)?;

// Custom cache capacity
let reader = PgzfReader::new(file)?.with_block_cache(256);

// Disable cache
let reader = PgzfReader::new(file)?.with_block_cache(0);

// Runtime control
let mut reader = PgzfReader::new(file)?;
reader.set_block_cache_capacity(128);
println!("cache capacity: {}", reader.block_cache_capacity());
println!("cached blocks: {}", reader.block_cache_len());

Cache performance characteristics:

  Access pattern                          Speedup
  Repeated seek to same block             ~460x
  Random seek (cache fits working set)    ~2x
  Random seek (working set > cache)       ~1.1-1.5x
  Sequential read                         No overhead
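
For readers unfamiliar with LRU eviction, here is a minimal standard-library sketch of a block cache keyed by block index. It is not the crate's implementation: BlockCache and its methods are hypothetical, and the recency list is updated in O(n) for brevity. It mirrors two behaviors described above: the least recently used block is evicted first, and a capacity of 0 disables caching.

```rust
use std::collections::{HashMap, VecDeque};

/// Toy LRU cache of decompressed blocks (hypothetical, for illustration).
pub struct BlockCache {
    capacity: usize,
    blocks: HashMap<u64, Vec<u8>>,
    /// Block indices ordered from least to most recently used.
    order: VecDeque<u64>,
}

impl BlockCache {
    pub fn new(capacity: usize) -> Self {
        Self { capacity, blocks: HashMap::new(), order: VecDeque::new() }
    }

    pub fn len(&self) -> usize {
        self.blocks.len()
    }

    /// Marks `index` as the most recently used block.
    fn touch(&mut self, index: u64) {
        self.order.retain(|&i| i != index);
        self.order.push_back(index);
    }

    /// Returns the cached block, refreshing its recency on a hit.
    pub fn get(&mut self, index: u64) -> Option<&[u8]> {
        if self.blocks.contains_key(&index) {
            self.touch(index);
        }
        self.blocks.get(&index).map(Vec::as_slice)
    }

    /// Stores a block, evicting least recently used blocks if over capacity.
    pub fn insert(&mut self, index: u64, data: Vec<u8>) {
        if self.capacity == 0 {
            return; // capacity 0 disables caching
        }
        self.blocks.insert(index, data);
        self.touch(index);
        while self.blocks.len() > self.capacity {
            let Some(oldest) = self.order.pop_front() else { break };
            self.blocks.remove(&oldest);
        }
    }
}
```

With this policy, repeated seeks within a small working set keep hitting the cache, while a working set larger than the capacity keeps evicting blocks before they are reused, which is consistent with the pattern of speedups listed above.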

Inspect Index

use pgzf::PgzfIndex;
use std::fs::File;

let mut file = File::open("output.gz").unwrap();
let index = PgzfIndex::build(&mut file).unwrap();

println!("Groups: {}", index.group_count());
println!("Data blocks: {}", index.block_count());
println!("Uncompressed size: {} bytes", index.total_uncompressed());
println!("Compressed size: {} bytes", index.total_compressed());
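
An index like this is what makes seeking by byte offset cheap: given the cumulative uncompressed end offset of each block, the block containing a target offset can be found by binary search. The sketch below assumes a hypothetical in-memory representation (a sorted slice of end offsets) and a hypothetical function name (locate); the layout of PgzfIndex itself is not described here.

```rust
/// Finds the block containing uncompressed `offset`.
/// `ends[i]` is the uncompressed offset one past the last byte of block `i`
/// (assumed sorted ascending). Returns (block index, offset within block),
/// or `None` if `offset` is at or past the end of the data.
pub fn locate(ends: &[u64], offset: u64) -> Option<(usize, u64)> {
    // Index of the first block whose end lies beyond `offset`.
    let block = ends.partition_point(|&end| end <= offset);
    if block == ends.len() {
        return None;
    }
    let start = if block == 0 { 0 } else { ends[block - 1] };
    Some((block, offset - start))
}
```

For blocks ending at offsets 100, 250 and 400, offset 100 maps to block 1 at inner offset 0, and offset 399 maps to block 2 at inner offset 149.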

Specification
