wd-fuse

wd-fuse is a read-only Wikidata FUSE prototype built on libfuse's high-level API.

The mount is intentionally hybrid:

  • FUSE serves a read-only namespace.
  • A Python materializer fetches Special:EntityData on first access.
  • Fetched entities are frozen into a mount-private backing tree for snapshot stability.
  • Reverse edges come from a local prebuilt index, never from remote on-demand queries.

v0 Semantics

  • Default graph view: truthy/
  • Full statement view: full/
  • Entity-to-entity values: symlinks into /entities/<id>
  • Literal values: .txt or tiny .json
  • Ranks, qualifiers, references: nested under full statement directories
  • Raw source views: raw.json, raw.ttl
  • Large value sets: paginated under pages/<nnnn>/
  • Writes: rejected
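
The pagination rule can be pictured with a small Python sketch. It is not taken from the wd-fuse source: the function name `paginate`, the zero-based page numbering, and the choice to leave small sets unpaginated are all assumptions made for illustration. Only the `pages/<nnnn>/` naming and the page-size threshold come from the list above.

```python
def paginate(values, page_size):
    """Split values into pages of at most page_size entries.

    Returns a dict mapping a relative directory name to the values it holds.
    A set that fits within one page stays flat under "." (assumption).
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    if len(values) <= page_size:
        return {".": list(values)}
    pages = {}
    for start in range(0, len(values), page_size):
        page_no = start // page_size  # zero-based numbering is an assumption
        pages[f"pages/{page_no:04d}"] = list(values[start:start + page_size])
    return pages
```

With `page_size=2`, five values would land in `pages/0000`, `pages/0001`, and `pages/0002` under these assumptions.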

Layout

/
├── README.txt
├── snapshot.json
└── entities/
    └── Q42/
        ├── id.txt
        ├── type.txt
        ├── modified.txt
        ├── revision.txt
        ├── labels/
        ├── descriptions/
        ├── aliases/
        ├── truthy/
        │   └── by-property/
        │       └── P31/
        │           ├── property -> ../../../../../entities/P31
        │           └── values/
        ├── full/
        │   └── by-property/
        │       └── P31/
        │           ├── property -> ../../../../../entities/P31
        │           └── statements/
        ├── incoming/
        │   └── by-property/
        ├── raw.json
        └── raw.ttl

Truthy values use the best-rank Wikidata projection: preferred statements if any exist for a property, otherwise all non-deprecated statements.
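
As a rough illustration of the best-rank rule, the following Python sketch filters the statements of one property. It assumes each statement is a dict with a "rank" key holding "preferred", "normal", or "deprecated", as in the Wikidata JSON export. The function name `truthy_statements` is hypothetical and not part of wd-fuse.

```python
def truthy_statements(statements):
    """Return the best-rank statements for a single property."""
    preferred = [s for s in statements if s.get("rank") == "preferred"]
    if preferred:
        # At least one preferred statement exists: only those are truthy.
        return preferred
    # Otherwise every statement that is not deprecated is truthy.
    return [s for s in statements if s.get("rank") != "deprecated"]
```

A property with one preferred and two normal statements therefore exposes a single truthy value, while full/ still lists all three.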

snapshot.json schema

Written once at mount time. Fields:

Field           Type           Description
kind            string         Always "wd-fuse-snapshot"
cache_root      string         Absolute path to the backing tree
created_at      string         ISO 8601 UTC timestamp of mount creation
page_size       number         Pagination threshold used for this generation
revision_pin    string | null  Revision pinned via --revision-pin, or null
incoming_index  string | null  Path passed via --incoming-index, or null
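
A client that wants to check a mount before relying on it could validate snapshot.json against the fields listed above. The sketch below is one way to do that in Python. The helper name `load_snapshot` and the strictness of the checks are assumptions, not something wd-fuse ships.

```python
import json

# Field name -> accepted Python type(s), mirroring the schema table.
EXPECTED_FIELDS = {
    "kind": str,
    "cache_root": str,
    "created_at": str,
    "page_size": (int, float),
    "revision_pin": (str, type(None)),
    "incoming_index": (str, type(None)),
}

def load_snapshot(text):
    """Parse snapshot.json content and check the documented fields."""
    snap = json.loads(text)
    if not isinstance(snap, dict):
        raise ValueError("snapshot.json must contain a JSON object")
    for name, types in EXPECTED_FIELDS.items():
        if name not in snap:
            raise ValueError(f"missing field: {name}")
        if not isinstance(snap[name], types):
            raise ValueError(f"unexpected type for field: {name}")
    if snap["kind"] != "wd-fuse-snapshot":
        raise ValueError("not a wd-fuse snapshot")
    return snap
```

Typical use would be `load_snapshot(open("/tmp/wd-mount/snapshot.json").read())` once the mount is up.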

Dependencies

Runtime:

# Debian / Ubuntu
sudo apt install fuse3 libfuse3-3 python3

# Fedora / RHEL
sudo dnf install fuse3 fuse3-libs python3

Build:

# Debian / Ubuntu
sudo apt install cmake gcc libfuse3-dev

# Fedora / RHEL
sudo dnf install cmake gcc fuse3-devel

Python 3.10 or newer is required because the materializer uses the str | None union type syntax.

Build

The repository vendors the public libfuse headers because the environment it was developed in exposes only the runtime library. You still need the libfuse shared library and fusermount3 on the host.

cmake -S . -B build
cmake --build build

Run

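Mount with the default backing tree location and an explicit page size:

mkdir -p /tmp/wd-mount
./build/wd-fuse \
  --incoming-index examples/incoming.jsonl \
  --page-size 256 \
  /tmp/wd-mount

Or place the backing tree in a directory of your choice with --cache-root:

mkdir -p /tmp/wd-cache /tmp/wd-mount
./build/wd-fuse \
  --cache-root /tmp/wd-cache \
  --incoming-index examples/incoming.jsonl \
  /tmp/wd-mount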

Run in the foreground to keep error output visible (recommended for first-time use and debugging):

./build/wd-fuse --incoming-index examples/incoming.jsonl /tmp/wd-mount -f

Enable FUSE-level debug output:

./build/wd-fuse --incoming-index examples/incoming.jsonl /tmp/wd-mount -d

Then browse:

ls /tmp/wd-mount/entities/Q42
readlink /tmp/wd-mount/entities/Q42/truthy/by-property/P31/values/000000
cat /tmp/wd-mount/entities/Q42/full/by-property/P31/statements/count.txt

Unmount with:

fusermount3 -u /tmp/wd-mount

Incoming Index

--incoming-index accepts either:

  • a JSON Lines file with records like {"target":"Q42","property":"P50","source":"Q25169"}
  • a directory of per-target .json or .jsonl shard files

Flat JSONL file

A single file where every line is one reverse-edge record:

{"target":"Q42","property":"P50","source":"Q25169"}
{"target":"Q42","property":"P31","source":"Q463035"}
{"target":"Q5","property":"P31","source":"Q42"}
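
To see how such records map onto incoming/by-property/, here is a minimal Python sketch that groups them by target and property. The function name `load_incoming` and the nested-dict result are illustrative assumptions. The actual index loader in wd-fuse may be organized differently.

```python
import json
from collections import defaultdict

def load_incoming(lines):
    """Group reverse-edge records as index[target][property] -> [source, ...]."""
    index = defaultdict(lambda: defaultdict(list))
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        rec = json.loads(line)
        index[rec["target"]][rec["property"]].append(rec["source"])
    return index
```

For the three records above, `index["Q42"]` would hold one source under P50 and one under P31, and `index["Q5"]["P31"]` would be `["Q42"]`.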

Sharded directory

For large indexes, split into per-entity files under one of these layouts (tried in order):

index/
├── Q42.jsonl          # flat: index/Q42.jsonl
├── Q/
│   └── Q42.jsonl      # one-char prefix: index/Q/<id>.jsonl
└── QA/
    └── Q42.jsonl      # two-char prefix: index/QA/<id>.jsonl

Each shard file may be .jsonl (one record per line) or .json (array or {"edges":[...]} object). The "target" field may be omitted in per-entity shards; it is inferred from the filename.
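
The three accepted shard shapes can be normalized into the same record form as the flat file. The Python sketch below shows one possible approach. `parse_shard` is a hypothetical helper, and it deliberately covers only the parsing of one shard, not the directory lookup order.

```python
import json
from pathlib import PurePath

def parse_shard(filename, text):
    """Return full reverse-edge records from the content of one shard file."""
    path = PurePath(filename)
    default_target = path.stem  # "Q42.jsonl" -> "Q42"
    if path.suffix == ".jsonl":
        records = [json.loads(l) for l in text.splitlines() if l.strip()]
    elif path.suffix == ".json":
        data = json.loads(text)
        # Either a bare array or an object wrapping the array under "edges".
        records = data["edges"] if isinstance(data, dict) else data
    else:
        raise ValueError(f"unsupported shard extension: {path.suffix}")
    # An explicit "target" in a record wins over the filename-derived one.
    return [{"target": default_target, **rec} for rec in records]
```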

The prototype reads only local index data for reverse edges; no remote queries are issued.

Notes

  • Supported entity ids in v0: Q... and P...
  • raw.ttl uses Special:EntityData/<id>.ttl?flavor=dump
  • Without --revision-pin, each entity is fixed at the revision first fetched during the mount generation
  • Entity materialization may take 1–3 seconds per entity (two HTTP requests to www.wikidata.org). Transient errors are retried up to three times with exponential back-off.
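
The retry behavior mentioned in the last note follows a common pattern, sketched here in Python. This is not the materializer's actual code: the helper name `with_retries`, the 0.5-second base delay, the choice of OSError as the transient error class, and reading "three times" as three attempts in total are all assumptions.

```python
import time

def with_retries(fetch, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError:  # urllib.error.URLError is a subclass of OSError
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 0.5 s, 1 s, ...
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real delays.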

Troubleshooting

Transport endpoint is not connected — The mount process exited without cleanly unmounting. Force-unmount with:

fusermount3 -uz /tmp/wd-mount

Input/output error on a specific entity — The materializer failed (network error, unexpected API response, etc.). Check stderr output by running with -f. The entity directory will not be created, so retrying the access will attempt materialization again.

Nothing appears at the mountpoint — FUSE daemonized and may have exited immediately. Run with -f to keep output in the terminal and see any startup errors.
