jankdc/bote

bote

A fast, modern, low-memory approach to processing large JSON documents:

npm install @botejs/core

import { fileURLToPath } from 'node:url';
import { open, fromFile } from '@botejs/core';

// 181 MB GeoJSON:
// { type: "...", features: [{ properties: { STREET: "..." }}] }
const filePath = fileURLToPath(new URL('../citylots.json', import.meta.url));

await using cursor = await open(fromFile(filePath));

const byStreet = await cursor
  .iter('features', {
    select: ['properties', 'STREET'],
  })
  .reduce((tally, street) => {
    if (typeof street === 'string') {
      tally.set(street, (tally.get(street) ?? 0) + 1);
    }
    return tally;
  }, new Map());

console.log([...byStreet].sort((a, b) => b[1] - a[1]).slice(0, 10));

Given a seekable or forward source and a path, it retrieves values from a JSON document without loading the whole thing into memory.

Here's a run of the snippet above (Apple M1 Pro 2021, default settings, RUNS=100, Node v26):

| method | mean time | mean peak footprint (MB) |
| --- | --- | --- |
| bote v0.9 | 0.447 ± 0.002 s | 34.1 ± 1.9 |
| JSON.parse | 0.828 ± 0.023 s | 508.9 ± 2.4 |
| @discoveryjs/json-ext v1.1.0 | 1.303 ± 0.012 s | 397.4 ± 2.7 |
| JSONStream v1.3 | 4.448 ± 0.055 s | 62.2 ± 0.7 |
| @streamparser/json v0.0.22 | 4.935 ± 0.021 s | 60.1 ± 6.4 |
| oboe.js v2.1 | 8.041 ± 0.340 s | 97.0 ± 1.3 |
| stream-json v3.4.0 | 12.323 ± 0.876 s | 149.0 ± 6.7 |

For comparison notes, go here.

Features

  • Modern AsyncIterator API with helpers that emulate the TC39 ones
  • Validate with Standard Schema, avoiding those pesky unknowns
  • Supports multiple sources of data (e.g. file, network, stream), and you can write a custom one (see sources.js for the built-in ones)
  • For forward-only sources, there's support for replaying/buffering, allowing navigation to previous values

Supported

  • Node.js >= 22.18.0
  • ESM-only
  • Platforms
    • macOS (Apple Silicon aarch64 and Intel x86_64)
    • Linux x64 (x86_64, glibc)
    • Windows x64 (x86_64, MSVC)
    • More if requested :)

API

open(source, options?)

const cursor = await open(source: Source, options?: SeekableOpenOptions): Promise<RootCursor>;

Opens a cursor over a source. The returned RootCursor owns the underlying reader: close() (or letting an await using scope end) releases it exactly once. A seekable source supports the index cache and repeated, out-of-order queries; a forward source is a single pass and rejects the cache knobs.

import { open, fromFile } from '@botejs/core';

await using cursor = await open(fromFile('./users.json'));
console.log(await cursor.get('users', 0, 'name'));

Cache options (seekable sources only)

The cache remembers where members live as bote walks through the JSON. It caches structure, never source bytes. The defaults are good; reach for these only to bound memory tighter or to turn the cache off.

| option | default | meaning |
| --- | --- | --- |
| indexCacheEntries | 1024 | index entries kept in memory; 0 disables the cache |
| objectMemberCap | unlimited | keys indexed per object; 0 skips indexing object keys |
| arrayIndexInterval | 16 | index every Nth array position; 0 skips indexing array positions |

await using cursor = await open(fromFile('./big.json'), {
  arrayIndexInterval: 8,
  indexCacheEntries: 4096,
  objectMemberCap: 256,
});

Passing any of these to a forward source throws a RangeError.
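To make the idea behind arrayIndexInterval concrete, here is a minimal sketch of a sparse position index: it records the byte offset of every Nth element, and a lookup jumps to the nearest recorded offset at or before the target and scans forward from there. The names buildSparseIndex and locate are hypothetical, and this illustrates the general technique under those assumptions, not bote's actual cache.

```typescript
// Hypothetical illustration of "index every Nth array position".
// A sketch of the general sparse-index idea, not bote's internals.
interface SparseIndex {
  interval: number;
  // offsets[k] is the byte offset of element k * interval
  offsets: number[];
}

// elementOffsets: the byte offset of each element, as a scanner would
// discover them while walking the array once.
function buildSparseIndex(elementOffsets: number[], interval: number): SparseIndex {
  // The README's "0 skips indexing" case is left out of this sketch.
  if (!Number.isInteger(interval) || interval <= 0) {
    throw new RangeError('interval must be a positive integer');
  }
  const offsets: number[] = [];
  elementOffsets.forEach((offset, position) => {
    if (position % interval === 0) offsets.push(offset);
  });
  return { interval, offsets };
}

// Returns the offset to seek to and how many elements to skip from there,
// or undefined when no indexed slot covers the target position.
function locate(
  index: SparseIndex,
  target: number,
): { startOffset: number; skip: number } | undefined {
  const slot = Math.floor(target / index.interval);
  const startOffset = index.offsets[slot];
  if (startOffset === undefined) return undefined;
  return { startOffset, skip: target - slot * index.interval };
}
```

A smaller interval means fewer elements to skip per lookup at the cost of more index entries, which is the trade-off the option exposes.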

fromFile(path, options?)

const source = fromFile(path: string, options?: { chunkBytes?: number }): SeekableSource;

A seekable source over a local file. It opens a handle on open() and reads byte ranges on demand, so a large file is never read in full; only the chunks a query touches are read. chunkBytes (a non-zero multiple of 64) overrides the read granularity.

await using cursor = await open(fromFile('./users.json', { chunkBytes: 128 * 1024 }));
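As a rough illustration of what "only the chunks a query touches" means, the sketch below validates a chunk size and computes which aligned chunks cover a byte window. The helper names assertChunkBytes and chunksFor are hypothetical, and throwing a RangeError for a bad size is a choice made for this example, not documented bote behavior.

```typescript
// Hypothetical helpers; they model chunk-aligned reads in general,
// not bote's reader.
function assertChunkBytes(chunkBytes: number): void {
  if (!Number.isInteger(chunkBytes) || chunkBytes <= 0 || chunkBytes % 64 !== 0) {
    throw new RangeError(`chunkBytes must be a non-zero multiple of 64, got ${chunkBytes}`);
  }
}

interface ChunkRange {
  start: number; // inclusive byte offset
  end: number; // exclusive byte offset
}

// Lists the aligned chunks that cover [offset, offset + length).
function chunksFor(offset: number, length: number, chunkBytes: number): ChunkRange[] {
  assertChunkBytes(chunkBytes);
  if (length <= 0) return [];
  const first = Math.floor(offset / chunkBytes);
  const last = Math.floor((offset + length - 1) / chunkBytes);
  const chunks: ChunkRange[] = [];
  for (let i = first; i <= last; i += 1) {
    chunks.push({ start: i * chunkBytes, end: (i + 1) * chunkBytes });
  }
  return chunks;
}
```

A 100-byte window starting at offset 100 with 128-byte chunks straddles a boundary, so it touches two chunks even though it is smaller than one.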

fromBuffer(bytes, options?)

const source = fromBuffer(bytes: Uint8Array | ArrayBuffer, options?: { chunkBytes?: number }): SeekableSource;

A seekable source over JSON already resident in memory - data you fetched or built yourself rather than a file on disk.

await using cursor = await open(fromBuffer(new TextEncoder().encode('{"ok":true}')));
console.log(await cursor.get('ok')); // -> true

fromHttpRange(url, options?)

const source = fromHttpRange(url: string, options?: { chunkBytes?: number; init?: RequestInit }): SeekableSource;

A seekable source over a remote file using HTTP range requests. A HEAD discovers the length and confirms Accept-Ranges: bytes; each read then fetches only its byte window. init is merged into every request (headers, credentials, an AbortSignal).

await using cursor = await open(
  fromHttpRange('https://example.com/big.json', {
    init: { headers: { authorization: 'Bearer ...' } },
  }),
);
console.log(await cursor.get('users', 1000, 'name'));
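For readers unfamiliar with range requests, here is a small sketch of how a byte window maps onto an HTTP Range header. HTTP byte ranges are inclusive at both ends, so a window of length bytes starting at offset ends at offset + length - 1. The helpers rangeHeaderValue and withRange are hypothetical names for this example and say nothing about how bote builds its requests internally.

```typescript
// Hypothetical helpers showing the shape of a byte-range request header.
function rangeHeaderValue(offset: number, length: number): string {
  if (!Number.isInteger(offset) || offset < 0 || !Number.isInteger(length) || length <= 0) {
    throw new RangeError('offset must be >= 0 and length must be > 0');
  }
  // Both ends are inclusive, hence the "- 1".
  return `bytes=${offset}-${offset + length - 1}`;
}

// Merges caller-supplied headers (for example authorization) with the range.
function withRange(
  baseHeaders: Record<string, string>,
  offset: number,
  length: number,
): Record<string, string> {
  return { ...baseHeaders, range: rangeHeaderValue(offset, length) };
}
```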

fromReadable(produce, options?)

const source = fromReadable(produce: () => ReadableStream | NodeReadable, options?: ReadableOptions): ForwardSource;

A forward-only source backed by a re-openable readable stream. Pass a thunk that produces a fresh stream (a live Readable cannot be re-streamed), not the stream itself. Each cursor operation is an independent scan from the start, so a second query rewinds, which by default throws (see rewind).

import { createReadStream } from 'node:fs';

await using cursor = await open(fromReadable(() => createReadStream('./events.json')));
for await (const event of cursor.iter('events')) {
  console.log(event.type);
}

Forward options

| option | default | meaning |
| --- | --- | --- |
| size | discovered | known total length, if any; lets the engine skip rediscovering the end |
| decode | none | transform applied to each (re)acquired stream, e.g. to decompress |
| rewind | 'forbid' | what a query needing an earlier offset does (see below) |
| chunkBytes | 262144 | read granularity (non-zero multiple of 64) |

rewind trades resident memory for re-read ability:

  • 'forbid' - a single forward pass; a rewind throws ForwardReplayError.
  • 'replay' - re-acquire the stream from the start. Safe only when the producer is idempotent (yields the same bytes each call). No extra memory.
  • 'buffer' - snapshot the whole stream into memory on first read, enabling random access at O(n) resident memory.

await using cursor = await open(
  fromReadable(() => createReadStream('./events.json.gz'), {
    decode: (raw) => raw.pipeThrough(new DecompressionStream('gzip')),
    rewind: 'replay',
  }),
);
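The three rewind policies can be modelled with a toy reader. In this sketch a "stream" is simplified to a string returned by a producer function, and ForwardReaderSketch is a hypothetical class written for illustration; it is not bote's implementation, and a plain Error stands in for ForwardReplayError.

```typescript
// Hypothetical model of the three rewind policies.
type RewindPolicy = 'forbid' | 'replay' | 'buffer';

class ForwardReaderSketch {
  acquisitions = 0; // how many times produce() was called
  private readonly produce: () => string;
  private readonly policy: RewindPolicy;
  // Stands in for the live stream. A real forward reader would hold only
  // the current chunk unless the policy is 'buffer'.
  private current: string | undefined;
  private position = 0; // how far the current pass has advanced

  constructor(produce: () => string, policy: RewindPolicy) {
    this.produce = produce;
    this.policy = policy;
  }

  readAt(offset: number, length: number): string {
    if (this.current === undefined) {
      this.current = this.acquire(); // first read: open the stream
    } else if (offset < this.position && this.policy !== 'buffer') {
      if (this.policy === 'forbid') {
        throw new Error(`cannot rewind from ${this.position} to ${offset}`);
      }
      this.current = this.acquire(); // 'replay': start over with a fresh stream
    }
    // 'buffer' keeps the first snapshot, so any offset is served from memory.
    this.position = offset + length;
    return this.current.slice(offset, offset + length);
  }

  private acquire(): string {
    this.acquisitions += 1;
    return this.produce();
  }
}
```

Counting acquisitions shows the trade: 'replay' calls the producer again on every rewind, while 'buffer' calls it once and pays in memory instead.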

fromHttpRequest(url, options?)

const source = fromHttpRequest(url: string, options?: HttpRequestOptions): ForwardSource;

A forward-only source over an HTTP response body, streamed in one pass (GET by default). The forward counterpart to fromHttpRange: prefer it when you scan once and the server has no range support. Takes the same decode/rewind options as fromReadable, plus init merged into every fetch.

await using cursor = await open(
  fromHttpRequest('https://example.com/events.json', {
    init: { headers: { authorization: 'Bearer ...' } },
    rewind: 'buffer',
  }),
);
console.log(await cursor.get('events', 0, 'type'));

cursor.get(...path, schema?)

const value = await cursor.get(...path: Segment[], schema?: StandardSchema): Promise<unknown>;

Reads and decodes the value at path, returning a real JS value, or undefined if the path is absent (distinct from a present JSON null). With no segments it decodes the whole document. Reading a whole container materializes all of it, so prefer iter for large arrays/objects.

const name = await cursor.get('users', 0, 'name'); // -> "Ada"
const missing = await cursor.get('users', 0, 'nope'); // -> undefined
const nulled = await cursor.get('users', 0, 'deletedAt'); // -> null (the member exists)
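The difference between an absent path and a present null can be sketched against a plain in-memory value. The resolvePath helper and the Segment alias (string | number) below are assumptions made for this example; bote resolves paths against the source without materializing the document, so this only mirrors the observable semantics described above.

```typescript
// Hypothetical in-memory model of path resolution semantics.
type Segment = string | number;

function resolvePath(root: unknown, path: Segment[]): { found: boolean; value: unknown } {
  let current: unknown = root;
  for (const segment of path) {
    if (Array.isArray(current) && typeof segment === 'number') {
      // An out-of-range index is absent, not an error.
      if (!Number.isInteger(segment) || segment < 0 || segment >= current.length) {
        return { found: false, value: undefined };
      }
      current = current[segment];
    } else if (
      current !== null &&
      typeof current === 'object' &&
      !Array.isArray(current) &&
      typeof segment === 'string' &&
      Object.prototype.hasOwnProperty.call(current, segment)
    ) {
      current = (current as Record<string, unknown>)[segment];
    } else {
      return { found: false, value: undefined };
    }
  }
  // A member set to null is found, and its value is null rather than undefined.
  return { found: true, value: current };
}
```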

Validating with a schema

Pass a Standard Schema (zod, valibot, arktype, ...) as the trailing argument to validate and parse the value. The return type is inferred from the schema's output, and a validation miss throws a ValidationError.

import { z } from 'zod';

const age = await cursor.get('users', 0, 'age', z.number()); // typed as number
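A Standard Schema is any object exposing a '~standard' property with a version, a vendor, and a validate function, so a library is not strictly required. The sketch below hand-writes one and a small parseWith helper that throws on failure. It is simplified to synchronous validation (the Standard Schema spec also allows validate to return a promise), parseWith and finiteNumber are hypothetical names, and a plain Error stands in for bote's ValidationError.

```typescript
// Simplified, synchronous-only subset of the Standard Schema v1 shape.
interface SchemaIssue {
  readonly message: string;
}

type SchemaResult<T> =
  | { readonly value: T; readonly issues?: undefined }
  | { readonly issues: ReadonlyArray<SchemaIssue> };

interface SyncStandardSchema<T> {
  readonly '~standard': {
    readonly version: 1;
    readonly vendor: string;
    readonly validate: (value: unknown) => SchemaResult<T>;
  };
}

// A hand-written schema that accepts finite numbers only.
const finiteNumber: SyncStandardSchema<number> = {
  '~standard': {
    version: 1,
    vendor: 'readme-example',
    validate: (value) =>
      typeof value === 'number' && Number.isFinite(value)
        ? { value }
        : { issues: [{ message: 'expected a finite number' }] },
  },
};

// Returns the typed output or throws with the collected issue messages.
function parseWith<T>(schema: SyncStandardSchema<T>, value: unknown): T {
  const result = schema['~standard'].validate(value);
  if (result.issues !== undefined) {
    throw new Error(result.issues.map((issue) => issue.message).join('; '));
  }
  return result.value;
}
```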

cursor.has(...path, schema?)

const exists = await cursor.has(...path: Segment[], schema?: StandardSchema): Promise<boolean>;

Reports whether a value exists at path without decoding it. A member explicitly set to JSON null still counts as present; an out-of-range array index is absent.

if (await cursor.has('users', 0, 'email')) {
  console.log(await cursor.get('users', 0, 'email'));
}
console.log(await cursor.has('users', 999)); // -> false on a shorter array

Validating with a schema

With a trailing schema, has also requires the value to validate. Unlike get, a parse or validation miss yields false instead of throwing.

import { z } from 'zod';

if (await cursor.has('users', 0, 'email', z.string().email())) {
  console.log('has a well-formed email');
}

cursor.hop(...path)

const child = await cursor.hop(...path: Segment[]): Promise<Cursor | null>;

Resolves path to a container and hands back a new cursor anchored there, so further get/has/iter/hop run relative to it. Returns null when nothing lives at the path. A child shares the root's source and lifetime. Closing the root closes it too, and there is nothing to close on the child itself.

const user = await cursor.hop('users', 0);
if (user) {
  console.log(await user.get('name'));
  const city = await (await user.hop('address'))?.get('city');
}

cursor.iter(...path, options?)

const stream = cursor.iter(...path: Segment[], options?: IterOptions | StandardSchema): IterStream;

Streams the members of the array or object at path one item at a time, so a million-element array never lands in memory all at once. An empty path iterates the root container; iterating an object yields its values (use withKey for the names). Returns an IterStream.

for await (const user of cursor.iter('users')) {
  console.log(user.name);
}

Options

A trailing Standard Schema is shorthand for { schema }. The full options object:

| option | default | meaning |
| --- | --- | --- |
| select | none | project each member: a segment/path picks a sub-value, a field map builds an object |
| schema | none | validate each item (after select) |
| withKey | false | yield [key, value] tuples (key = member name or array index) |
| onInvalid | 'throw' | policy for items failing schema; 'skip' drops them |
| maxBatchCount | 1000 | max items fetched across the native boundary per pull |
| maxBatchBytes | 262144 | max serialized bytes held per pull (caps peak memory for large items) |

for await (const row of cursor.iter('users', {
  select: { id: 'id', email: ['contact', 'email'] },
  schema: z.object({ id: z.number(), email: z.string() }),
  onInvalid: 'skip',
})) {
  console.log(row.id, row.email);
}
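The two forms of select (a single segment or path versus a field map) can be pictured as a projection over each item. The sketch below applies that projection to plain in-memory values; projectItem, pickPath and the type aliases are hypothetical, and returning undefined for a missing sub-value is an assumption of this example rather than documented behavior.

```typescript
// Hypothetical in-memory model of what a select projection produces.
type PathSegment = string | number;
type PathSpec = PathSegment | PathSegment[];
type FieldMap = { [field: string]: PathSpec };
type SelectSpec = PathSpec | FieldMap;

function isFieldMap(select: SelectSpec): select is FieldMap {
  return typeof select === 'object' && select !== null && !Array.isArray(select);
}

// Follows a segment or path into an item; yields undefined if it runs out.
function pickPath(item: unknown, spec: PathSpec): unknown {
  const path = Array.isArray(spec) ? spec : [spec];
  let current: unknown = item;
  for (const segment of path) {
    if (current === null || typeof current !== 'object') return undefined;
    current = (current as Record<PathSegment, unknown>)[segment];
  }
  return current;
}

// A path picks one sub-value; a field map builds a new object field by field.
function projectItem(item: unknown, select: SelectSpec): unknown {
  if (isFieldMap(select)) {
    const out: Record<string, unknown> = {};
    for (const [field, spec] of Object.entries(select)) {
      out[field] = pickPath(item, spec);
    }
    return out;
  }
  return pickPath(item, select);
}
```

Projecting before validating keeps the schema small: it only has to describe the fields that were selected, not the whole member.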

cursor.close()

await cursor.close(): Promise<void>;

Releases the underlying source (file handle, fetch body, etc.). Idempotent, and available only on the root cursor. Prefer await using so it runs automatically when the scope ends; call it directly when you cannot use that syntax.

const cursor = await open(fromFile('./users.json'));
try {
  console.log(await cursor.get('users', 0, 'name'));
} finally {
  await cursor.close();
}

IterStream

iter returns a lazy, single-pass pipeline that mirrors the TC39 async iterator helpers.

const firstFive = await cursor
  .iter('users', { select: 'name' })
  .filter((name) => name.startsWith('A'))
  .take(5)
  .toArray();

Supports map, filter, take, drop, toArray, forEach, reduce, find, some and every.
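To show what "lazy, single-pass" means in practice, here is a sketch of filter, take and toArray built from plain async generators. The function names and the demoLazyPipeline driver are hypothetical and independent of bote; the point is only that the source is pulled just far enough to satisfy take.

```typescript
// Hypothetical stand-ins for lazy async iterator helpers.
async function* filterAsync<T>(
  source: AsyncIterable<T>,
  keep: (item: T) => boolean,
): AsyncGenerator<T> {
  for await (const item of source) {
    if (keep(item)) yield item;
  }
}

async function* takeAsync<T>(source: AsyncIterable<T>, limit: number): AsyncGenerator<T> {
  if (limit <= 0) return;
  let taken = 0;
  for await (const item of source) {
    yield item;
    taken += 1;
    if (taken >= limit) return; // leaving the loop also closes the source
  }
}

async function toArrayAsync<T>(source: AsyncIterable<T>): Promise<T[]> {
  const out: T[] = [];
  for await (const item of source) out.push(item);
  return out;
}

// Counts how many items the source actually had to produce.
async function demoLazyPipeline(): Promise<{ firstTwo: string[]; pulled: number }> {
  let pulled = 0;
  async function* names(): AsyncGenerator<string> {
    for (const name of ['Ada', 'Bob', 'Alan', 'Amy', 'Eve']) {
      pulled += 1;
      yield name;
    }
  }
  const startsWithA = filterAsync(names(), (name) => name.startsWith('A'));
  const firstTwo = await toArrayAsync(takeAsync(startsWithA, 2));
  return { firstTwo, pulled };
}
```

In this run the source stops after its third item: once take has its two matches, the remaining names are never produced.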

Errors

Everything bote throws extends BoteError (catch that to catch anything; branch on .code for the kind). The concrete types are PathError, ValidationError, MalformedJsonError, SourceReadError, ForwardReplayError, and ClosedCursorError. Most carry the path where the fault occurred.

import { z } from 'zod';
import { BoteError, ValidationError } from '@botejs/core';

try {
  await cursor.get('users', 0, 'age', z.number());
} catch (err) {
  if (err instanceof ValidationError) {
    console.error(err.issues);
  }
}

Status

Pre-1.0. Still in development and APIs may change based on feedback, bugs and holy divinations from the coding gods.

After a lot of chaos, I'm finally-kinda-sorta happy with the public API. Major breaking changes seem to be slowing down, so feedback from the community and dogfooding on my end are what's next.

License

MIT.
