Skip to content

Static hash table that scales via disk-backed expansion to trim memory usage while ensuring each lookup needs exactly one I/O.

License

Apache-2.0, MIT licenses found

Licenses found

Apache-2.0
LICENSE-APACHE
MIT
LICENSE-MIT
Notifications You must be signed in to change notification settings

SF-Zhou/massmap

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

massmap

Rust codecov Crates.io Documentation

Static hash table that scales via disk-backed expansion to trim memory usage while ensuring each lookup needs exactly one I/O.

  1. During the build phase, split the raw data into enough buckets so each bucket stays small, which reduces disk I/O during lookups.
  2. During lookups, touch only a single bucket and load all of its data into memory at once for comparisons.
  3. Keep only the bucket index in memory; by tuning the number of buckets you can balance memory usage against lookup speed.

Installation

Add massmap to your Cargo.toml:

cargo add massmap

or manually:

[dependencies]
massmap = "0.1"

Quick Start

use massmap::{MassMap, MassMapBuilder};

fn main() -> std::io::Result<()> {
    let entries = [
        ("apple".to_string(), 1u32),
        ("banana".to_string(), 2u32),
        ("cherry".to_string(), 3u32),
    ];

    // Offline build step.
    let file = std::fs::File::create("fruits.massmap")?;
    MassMapBuilder::default()
        .with_hash_seed(42)
        .with_bucket_count(1024)
        .with_bucket_size_limit(16 << 10)
        .build(file, entries.iter())?;

    // Read-only lookup phase.
    let file = std::fs::File::open("fruits.massmap")?;
    let map = MassMap::<String, u32, _>::load(file)?;
    assert_eq!(map.get("banana")?, Some(2));
    Ok(())
}

CLI tool

A command-line utility massmap is provided for inspecting and querying massmap files.

# 1. install the CLI tool
cargo install massmap --examples

# 2. convert a JSON file to massmap format
massmap convert -i examples/demo.json -o examples/demo.massmap --bucket-count 32
#> {
#>   "header": {
#>     "meta_offset": 486,
#>     "meta_length": 239
#>   },
#>   "meta": {
#>     "entry_count": 47,
#>     "bucket_count": 32,
#>     "occupied_bucket_count": 27,
#>     "occupied_bucket_range": {
#>       "start": 0,
#>       "end": 32
#>     },
#>     "hash_config": {
#>       "name": "foldhash",
#>       "parameters": {
#>         "seed": 0
#>       }
#>     },
#>     "key_type": "alloc::string::String",
#>     "value_type": "serde_json::value::Value"
#>   }
#> }

# 3. query a key from the massmap file
massmap info examples/demo.massmap -k 1999
#> {
#>   "header": {
#>     "meta_offset": 486,
#>     "meta_length": 239
#>   },
#>   "meta": {
#>     "entry_count": 47,
#>     "bucket_count": 32,
#>     "occupied_bucket_count": 27,
#>     "occupied_bucket_range": {
#>       "start": 0,
#>       "end": 32
#>     },
#>     "hash_config": {
#>       "name": "foldhash",
#>       "parameters": {
#>         "seed": 0
#>       }
#>     },
#>     "key_type": "alloc::string::String",
#>     "value_type": "serde_json::value::Value"
#>   }
#> }
#> Get 1999: Some(Number(7229))

# 4. view the raw bytes of the massmap file
hexdump -C examples/demo.massmap
#> 00000000  4d 41 53 53 4d 41 50 21  00 00 00 00 00 00 01 e6  |MASSMAP!........|
#> 00000010  00 00 00 00 00 00 00 ef  92 92 a4 31 39 39 34 cd  |...........1994.|
#> 00000020  0f f1 92 a4 32 30 32 30  ce 00 01 18 94 91 92 a4  |....2020........|
#> 00000030  32 30 32 34 ce 00 01 76  05 92 92 a4 31 39 38 39  |2024...v....1989|
#> 00000040  cd 06 00 92 a4 32 30 30  39 cd 66 44 92 92 a4 31  |.....2009.fD...1|
#> 00000050  39 37 38 cd 01 81 92 a4  31 39 39 36 cd 17 0a 94  |978.....1996....|
#> 00000060  92 a4 31 39 39 31 cd 07  78 92 a4 31 39 39 38 cd  |..1991..x..1998.|
#> 00000070  1a cc 92 a4 32 30 30 31  cd 22 0d 92 a4 32 30 31  |....2001."...201|
#> 00000080  38 cd ff fe 91 92 a4 32  30 31 37 cd e8 c8 93 92  |8......2017.....|
#> 00000090  a4 31 39 37 39 cd 01 a7  92 a4 32 30 31 30 cd 78  |.1979.....2010.x|
#> 000000a0  58 92 a4 32 30 31 35 cd  c3 02 92 92 a4 31 39 39  |X..2015......199|
#> 000000b0  35 cd 13 e3 92 a4 31 39  39 39 cd 1c 3d 91 92 a4  |5.....1999..=...|
#> 000000c0  31 39 38 30 cd 01 d4 91  92 a4 32 30 30 38 cd 5e  |1980......2008.^|
#> 000000d0  24 92 92 a4 32 30 30 35  cd 38 20 92 a4 32 30 32  |$...2005.8 ..202|
#> 000000e0  31 ce 00 01 3c 50 92 92  a4 32 30 30 33 cd 29 aa  |1...<P...2003.).|
#> 000000f0  92 a4 32 30 31 32 cd 9b  5b 94 92 a4 31 39 39 33  |..2012..[...1993|
#> 00000100  cd 0b d3 92 a4 32 30 30  30 cd 1f 06 92 a4 32 30  |.....2000.....20|
#> 00000110  31 31 cd 8d b5 92 a4 32  30 31 36 cd d2 17 91 92  |11.....2016.....|
#> 00000120  a4 32 30 30 37 cd 50 0e  92 92 a4 31 39 38 37 cd  |.2007.P....1987.|
#> 00000130  04 63 92 a4 32 30 30 34  cd 30 c7 91 92 a4 31 39  |.c..2004.0....19|
#> 00000140  38 38 cd 05 62 92 92 a4  31 39 38 36 cd 03 cd 92  |88..b...1986....|
#> 00000150  a4 31 39 39 37 cd 19 51  92 92 a4 31 39 38 31 cd  |.1997..Q...1981.|
#> 00000160  01 f1 92 a4 31 39 38 34  cd 02 be 91 92 a4 32 30  |....1984......20|
#> 00000170  32 33 ce 00 01 66 62 91  92 a4 31 39 39 32 cd 09  |23...fb...1992..|
#> 00000180  1e 92 92 a4 31 39 38 35  cd 03 62 92 a4 32 30 31  |....1985..b..201|
#> 00000190  34 cd b7 40 92 92 a4 32  30 30 32 cd 25 22 92 a4  |4..@...2002.%"..|
#> 000001a0  32 30 30 36 cd 41 62 91  92 a4 32 30 31 33 cd a9  |2006.Ab...2013..|
#> 000001b0  e9 91 92 a4 31 39 38 33  cd 02 4c 91 92 a4 31 39  |....1983..L...19|
#> 000001c0  38 32 cd 02 15 92 92 a4  31 39 39 30 cd 06 7f 92  |82......1990....|
#> 000001d0  a4 32 30 31 39 ce 00 01  11 be 91 92 a4 32 30 32  |.2019........202|
#> 000001e0  32 ce 00 01 4e c2 92 97  2f 20 1b 92 00 20 92 a8  |2...N.../ ... ..|
#> 000001f0  66 6f 6c 64 68 61 73 68  81 a4 73 65 65 64 00 b5  |foldhash..seed..|
#> 00000200  61 6c 6c 6f 63 3a 3a 73  74 72 69 6e 67 3a 3a 53  |alloc::string::S|
#> 00000210  74 72 69 6e 67 b8 73 65  72 64 65 5f 6a 73 6f 6e  |tring.serde_json|
#> 00000220  3a 3a 76 61 6c 75 65 3a  3a 56 61 6c 75 65 dc 00  |::value::Value..|
#> 00000230  20 93 18 15 02 93 2d 0c  01 93 39 13 02 93 4c 13  | .....-...9...L.|
#> 00000240  02 93 5f 25 04 93 cc 84  0a 01 93 cc 8e 1c 03 93  |.._%............|
#> 00000250  cc aa 13 02 93 cc bd 0a  01 93 cc c7 0a 01 93 00  |................|
#> 00000260  00 00 93 00 00 00 93 cc  d1 15 02 93 cc e6 13 02  |................|
#> 00000270  93 00 00 00 93 cc f9 25  04 93 cd 01 1e 0a 01 93  |.......%........|
#> 00000280  cd 01 28 13 02 93 cd 01  3b 0a 01 93 cd 01 45 13  |..(.....;.....E.|
#> 00000290  02 93 00 00 00 93 cd 01  58 13 02 93 00 00 00 93  |........X.......|
#> 000002a0  cd 01 6b 0c 01 93 cd 01  77 0a 01 93 cd 01 81 13  |..k.....w.......|
#> 000002b0  02 93 cd 01 94 13 02 93  cd 01 a7 0a 01 93 cd 01  |................|
#> 000002c0  b1 0a 01 93 cd 01 bb 0a  01 93 cd 01 c5 15 02 93  |................|
#> 000002d0  cd 01 da 0c 01                                    |.....|

Configuration

  • with_hash_seed(seed): choose deterministic sharding.
  • with_bucket_count(count): trade memory for faster lookups.
  • with_writer_buffer_size(bytes): tune streaming IO throughput.
  • with_field_names(true): emit MessagePack maps with named fields for easier debugging.
  • with_bucket_size_limit(bytes): guard against oversized buckets.
  • Replace the default MassMapHashLoader to plug in custom hashers.

Readers and Writers

MassMapReader and MassMapWriter abstract over positional IO. The traits are implemented for std::fs::File out of the box, but they can also wrap network storage, memory-mapped regions, or custom paged backends. Override MassMapReader::batch_read_at to dispatch vectored reads when available.

Format Layout

Every massmap file begins with a 24-byte header containing a magic number, metadata offset, and metadata length. Key/value buckets are stored contiguously after the header in MessagePack format, followed by a serialized MassMapMeta structure that records bucket locations and hashing parameters.

Documentation

Comprehensive API documentation is available on docs.rs.

License

This project is dual-licensed under the MIT License and Apache License 2.0.

About

Static hash table that scales via disk-backed expansion to trim memory usage while ensuring each lookup needs exactly one I/O.

Resources

License

Apache-2.0, MIT licenses found

Licenses found

Apache-2.0
LICENSE-APACHE
MIT
LICENSE-MIT

Stars

Watchers

Forks

Contributors 3

  •  
  •  
  •  

Languages