Static hash table that scales via disk-backed expansion to trim memory usage while ensuring each lookup needs exactly one I/O.
- During the build phase, split the raw data into enough buckets so each bucket stays small, which reduces disk I/O during lookups.
- During lookups, touch only a single bucket and load all of its data into memory at once for comparisons.
- Keep only the bucket index in memory; by tuning the number of buckets you can balance memory usage against lookup speed.
Add massmap to your Cargo.toml:
cargo add massmapor manually:
[dependencies]
massmap = "0.1"use massmap::{MassMap, MassMapBuilder};
fn main() -> std::io::Result<()> {
let entries = [
("apple".to_string(), 1u32),
("banana".to_string(), 2u32),
("cherry".to_string(), 3u32),
];
// Offline build step.
let file = std::fs::File::create("fruits.massmap")?;
MassMapBuilder::default()
.with_hash_seed(42)
.with_bucket_count(1024)
.with_bucket_size_limit(16 << 10)
.build(file, entries.iter())?;
// Read-only lookup phase.
let file = std::fs::File::open("fruits.massmap")?;
let map = MassMap::<String, u32, _>::load(file)?;
assert_eq!(map.get("banana")?, Some(2));
Ok(())
}A command-line utility massmap is provided for inspecting and querying massmap files.
# 1. install the CLI tool
cargo install massmap --examples
# 2. convert a JSON file to massmap format
massmap convert -i examples/demo.json -o examples/demo.massmap --bucket-count 32
#> {
#> "header": {
#> "meta_offset": 486,
#> "meta_length": 239
#> },
#> "meta": {
#> "entry_count": 47,
#> "bucket_count": 32,
#> "occupied_bucket_count": 27,
#> "occupied_bucket_range": {
#> "start": 0,
#> "end": 32
#> },
#> "hash_config": {
#> "name": "foldhash",
#> "parameters": {
#> "seed": 0
#> }
#> },
#> "key_type": "alloc::string::String",
#> "value_type": "serde_json::value::Value"
#> }
#> }
# 3. query a key from the massmap file
massmap info examples/demo.massmap -k 1999
#> {
#> "header": {
#> "meta_offset": 486,
#> "meta_length": 239
#> },
#> "meta": {
#> "entry_count": 47,
#> "bucket_count": 32,
#> "occupied_bucket_count": 27,
#> "occupied_bucket_range": {
#> "start": 0,
#> "end": 32
#> },
#> "hash_config": {
#> "name": "foldhash",
#> "parameters": {
#> "seed": 0
#> }
#> },
#> "key_type": "alloc::string::String",
#> "value_type": "serde_json::value::Value"
#> }
#> }
#> Get 1999: Some(Number(7229))
# 4. view the raw bytes of the massmap file
hexdump -C examples/demo.massmap
#> 00000000 4d 41 53 53 4d 41 50 21 00 00 00 00 00 00 01 e6 |MASSMAP!........|
#> 00000010 00 00 00 00 00 00 00 ef 92 92 a4 31 39 39 34 cd |...........1994.|
#> 00000020 0f f1 92 a4 32 30 32 30 ce 00 01 18 94 91 92 a4 |....2020........|
#> 00000030 32 30 32 34 ce 00 01 76 05 92 92 a4 31 39 38 39 |2024...v....1989|
#> 00000040 cd 06 00 92 a4 32 30 30 39 cd 66 44 92 92 a4 31 |.....2009.fD...1|
#> 00000050 39 37 38 cd 01 81 92 a4 31 39 39 36 cd 17 0a 94 |978.....1996....|
#> 00000060 92 a4 31 39 39 31 cd 07 78 92 a4 31 39 39 38 cd |..1991..x..1998.|
#> 00000070 1a cc 92 a4 32 30 30 31 cd 22 0d 92 a4 32 30 31 |....2001."...201|
#> 00000080 38 cd ff fe 91 92 a4 32 30 31 37 cd e8 c8 93 92 |8......2017.....|
#> 00000090 a4 31 39 37 39 cd 01 a7 92 a4 32 30 31 30 cd 78 |.1979.....2010.x|
#> 000000a0 58 92 a4 32 30 31 35 cd c3 02 92 92 a4 31 39 39 |X..2015......199|
#> 000000b0 35 cd 13 e3 92 a4 31 39 39 39 cd 1c 3d 91 92 a4 |5.....1999..=...|
#> 000000c0 31 39 38 30 cd 01 d4 91 92 a4 32 30 30 38 cd 5e |1980......2008.^|
#> 000000d0 24 92 92 a4 32 30 30 35 cd 38 20 92 a4 32 30 32 |$...2005.8 ..202|
#> 000000e0 31 ce 00 01 3c 50 92 92 a4 32 30 30 33 cd 29 aa |1...<P...2003.).|
#> 000000f0 92 a4 32 30 31 32 cd 9b 5b 94 92 a4 31 39 39 33 |..2012..[...1993|
#> 00000100 cd 0b d3 92 a4 32 30 30 30 cd 1f 06 92 a4 32 30 |.....2000.....20|
#> 00000110 31 31 cd 8d b5 92 a4 32 30 31 36 cd d2 17 91 92 |11.....2016.....|
#> 00000120 a4 32 30 30 37 cd 50 0e 92 92 a4 31 39 38 37 cd |.2007.P....1987.|
#> 00000130 04 63 92 a4 32 30 30 34 cd 30 c7 91 92 a4 31 39 |.c..2004.0....19|
#> 00000140 38 38 cd 05 62 92 92 a4 31 39 38 36 cd 03 cd 92 |88..b...1986....|
#> 00000150 a4 31 39 39 37 cd 19 51 92 92 a4 31 39 38 31 cd |.1997..Q...1981.|
#> 00000160 01 f1 92 a4 31 39 38 34 cd 02 be 91 92 a4 32 30 |....1984......20|
#> 00000170 32 33 ce 00 01 66 62 91 92 a4 31 39 39 32 cd 09 |23...fb...1992..|
#> 00000180 1e 92 92 a4 31 39 38 35 cd 03 62 92 a4 32 30 31 |....1985..b..201|
#> 00000190 34 cd b7 40 92 92 a4 32 30 30 32 cd 25 22 92 a4 |4..@...2002.%"..|
#> 000001a0 32 30 30 36 cd 41 62 91 92 a4 32 30 31 33 cd a9 |2006.Ab...2013..|
#> 000001b0 e9 91 92 a4 31 39 38 33 cd 02 4c 91 92 a4 31 39 |....1983..L...19|
#> 000001c0 38 32 cd 02 15 92 92 a4 31 39 39 30 cd 06 7f 92 |82......1990....|
#> 000001d0 a4 32 30 31 39 ce 00 01 11 be 91 92 a4 32 30 32 |.2019........202|
#> 000001e0 32 ce 00 01 4e c2 92 97 2f 20 1b 92 00 20 92 a8 |2...N.../ ... ..|
#> 000001f0 66 6f 6c 64 68 61 73 68 81 a4 73 65 65 64 00 b5 |foldhash..seed..|
#> 00000200 61 6c 6c 6f 63 3a 3a 73 74 72 69 6e 67 3a 3a 53 |alloc::string::S|
#> 00000210 74 72 69 6e 67 b8 73 65 72 64 65 5f 6a 73 6f 6e |tring.serde_json|
#> 00000220 3a 3a 76 61 6c 75 65 3a 3a 56 61 6c 75 65 dc 00 |::value::Value..|
#> 00000230 20 93 18 15 02 93 2d 0c 01 93 39 13 02 93 4c 13 | .....-...9...L.|
#> 00000240 02 93 5f 25 04 93 cc 84 0a 01 93 cc 8e 1c 03 93 |.._%............|
#> 00000250 cc aa 13 02 93 cc bd 0a 01 93 cc c7 0a 01 93 00 |................|
#> 00000260 00 00 93 00 00 00 93 cc d1 15 02 93 cc e6 13 02 |................|
#> 00000270 93 00 00 00 93 cc f9 25 04 93 cd 01 1e 0a 01 93 |.......%........|
#> 00000280 cd 01 28 13 02 93 cd 01 3b 0a 01 93 cd 01 45 13 |..(.....;.....E.|
#> 00000290 02 93 00 00 00 93 cd 01 58 13 02 93 00 00 00 93 |........X.......|
#> 000002a0 cd 01 6b 0c 01 93 cd 01 77 0a 01 93 cd 01 81 13 |..k.....w.......|
#> 000002b0 02 93 cd 01 94 13 02 93 cd 01 a7 0a 01 93 cd 01 |................|
#> 000002c0 b1 0a 01 93 cd 01 bb 0a 01 93 cd 01 c5 15 02 93 |................|
#> 000002d0 cd 01 da 0c 01 |.....|with_hash_seed(seed): choose deterministic sharding.with_bucket_count(count): trade memory for faster lookups.with_writer_buffer_size(bytes): tune streaming IO throughput.with_field_names(true): emit MessagePack maps with named fields for easier debugging.with_bucket_size_limit(bytes): guard against oversized buckets.- Replace the default
MassMapHashLoaderto plug in custom hashers.
MassMapReader and MassMapWriter abstract over positional IO. The traits are implemented for std::fs::File out of the box, but they can also wrap network storage, memory-mapped regions, or custom paged backends. Override MassMapReader::batch_read_at to dispatch vectored reads when available.
Every massmap file begins with a 24-byte header containing a magic number, metadata offset, and metadata length. Key/value buckets are stored contiguously after the header in MessagePack format, followed by a serialized MassMapMeta structure that records bucket locations and hashing parameters.
Comprehensive API documentation is available on docs.rs.
This project is dual-licensed under the MIT License and Apache License 2.0.