piika

A reusable, reproducible pi search-agent workspace

There are many search agents, but this one is yours.

piika is the successor to pi-serini, the original benchmark-driven search-agent workspace started by Jheng-Hong (Matt) Yang. This project carries that path forward under the new repository name.

piika is a reusable, benchmark-driven pi search-agent workspace for index-driven BM25 retrieval, agentic search, and benchmark-aware evaluation.

Current update: this repository has been rebranded to piika from the original pi-serini. Historical release notes, published datasets, and compatibility identifiers may still use the pi-serini name.

Current release status: v0.3.0 supports index-driven benchmark and agentic search workflows for MS MARCO v1 Passage (dl19, dl20) and BrowseComp-Plus, with benchmark-template included as a tiny local end-to-end demo benchmark. This release tracks the Pi package namespace migration to @earendil-works/* and requires pi 0.74.0 or newer.

The repo is now manifest-driven rather than BrowseComp-Plus-only:

benchmark defaults live in typed registry entries under src/benchmarks/
each run snapshots its resolved benchmark condition into benchmark_manifest_snapshot.json
active Node.js/TypeScript control-plane entrypoints live under src/orchestration/
compatibility-only TypeScript entrypoints live under src/legacy/
shared runtime primitives live under src/runtime/
legacy shell scripts remain available as compatibility shims
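
Because each run writes its resolved benchmark condition to benchmark_manifest_snapshot.json, a quick way to review what a run actually used is to pretty-print that file. The helper below is a minimal sketch: show_manifest is a hypothetical name, not a script shipped with this repo. It assumes only that the snapshot is valid JSON and that python3 (already a requirement) is on PATH; it assumes nothing about the snapshot's fields.

```shell
# Pretty-print the manifest snapshot stored in a run directory.
# show_manifest is a hypothetical helper name, not part of the repo.
show_manifest() {
  # $1: run directory, e.g. runs/<run>
  python3 -m json.tool "$1/benchmark_manifest_snapshot.json"
}

# Example invocation:
#   show_manifest "runs/<run>"
```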

BrowseComp-Plus remains the default benchmark for reproducibility, but the same control plane now also supports MS MARCO v1 Passage and a tiny local benchmark-template demo benchmark.

Supported benchmarks

browsecomp-plus — default packaged benchmark with query sets q9, q100, q300, and qfull
msmarco-v1-passage — index-driven MS MARCO v1 passage benchmark with query sets dl19 and dl20
benchmark-template — tiny local end-to-end demo benchmark for development and validation

To inspect the registered benchmark catalog from the CLI:

npm run bench -- benchmarks

Requirements

pi 0.74.0 or newer installed and logged in
Node.js with npx
Java 21+
python3
uv
curl or wget
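
A quick preflight check for the tools listed above could look like the sketch below. missing_tools is a hypothetical helper, and the command names passed to it (pi, node, npx, java, python3, uv) are assumptions based on the list; the check covers presence on PATH only, not the version requirements.

```shell
# Print the name of every listed command that is not found on PATH.
# missing_tools is a hypothetical helper name, not part of the repo.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example invocation (command names are assumptions based on the list above):
#   missing_tools pi node npx java python3 uv
# Only one of curl or wget is needed:
#   command -v curl >/dev/null 2>&1 || command -v wget >/dev/null 2>&1 || echo "need curl or wget"
```

Empty output means every listed command was found.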

Supported developer environments:

macOS
Linux

If Java is installed in a non-standard location, set JAVA_HOME explicitly before running setup or benchmark commands.
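
As a sketch of that step, the snippet below shows one way to point JAVA_HOME at a specific JDK and confirm that the java binary being picked up meets the Java 21+ requirement. The path is a placeholder, java_major is a hypothetical helper, and prepending to PATH is an optional extra so that the check sees the same JDK; the setup scripts themselves may only need JAVA_HOME.

```shell
# Extract the major version from the first line of `java -version` output.
# java_major is a hypothetical helper name, not part of the repo.
java_major() {
  # $1: a line such as: openjdk version "21.0.2" 2024-01-16
  ver=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p')
  case "$ver" in
    1.*) ver=${ver#1.} ;;  # legacy scheme: 1.8.0_292 becomes 8.0_292
  esac
  # Keep only the leading digits.
  printf '%s\n' "${ver%%[!0-9]*}"
}

# Example usage (the JDK path is a placeholder):
#   export JAVA_HOME=/path/to/jdk
#   export PATH="$JAVA_HOME/bin:$PATH"
#   [ "$(java_major "$(java -version 2>&1 | head -n 1)")" -ge 21 ] || echo "Java 21+ required"
```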

Pi packages now live under the @earendil-works/* npm namespace. This repo depends on @earendil-works/pi-coding-agent and @earendil-works/pi-tui; use that namespace for any local extension or SDK imports rather than the retired @mariozechner/* package names.

Model note: gpt-5.3-codex was used for some historical judge runs, but OpenAI's Codex model documentation now lists it as deprecated for users who sign in with ChatGPT. Use the current recommended Codex models for new subscription-backed runs, and label any replacement judge model explicitly in reports.

Quickstart

1. Set up benchmark assets

BrowseComp-Plus base assets:

npm run setup:browsecomp-plus

BrowseComp-Plus decrypted ground truth is a separate opt-in step and requires an explicit decryption secret from the operator:

BROWSECOMP_PLUS_CANARY='...your secret...' \
npm run setup:ground-truth:browsecomp-plus

MS MARCO v1 Passage:

npm run setup:msmarco-v1-passage

Tiny local demo benchmark:

npm run setup:benchmark -- --benchmark benchmark-template

2. Run a benchmark query set

Use the same generic command surface for every benchmark; only BENCHMARK and QUERY_SET change.

Default single-process launch:

BENCHMARK=msmarco-v1-passage \
QUERY_SET=dl19 \
MODEL=openai-codex/gpt-5.4-mini \
npm run run:benchmark:query-set

Shared BM25 daemon (preferred package alias):

BENCHMARK=browsecomp-plus \
QUERY_SET=q9 \
MODEL=openai-codex/gpt-5.4-mini \
PI_BM25_RPC_PORT=50455 \
npm run run:benchmark:query-set:shared-bm25

Sharded shared-daemon launch (preferred package alias):

BENCHMARK=browsecomp-plus \
QUERY_SET=q100 \
SHARD_COUNT=4 \
MODEL=openai-codex/gpt-5.4-mini \
npm run run:benchmark:query-set:sharded-shared-bm25

Tiny local demo run:

BENCHMARK=benchmark-template \
QUERY_SET=test \
MODEL=openai-codex/gpt-5.4-mini \
npm run run:benchmark:query-set
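
Because only BENCHMARK and QUERY_SET change between launches, running several query sets back to back reduces to a small loop. The function below is a sketch under stated assumptions: run_query_sets and the DRY_RUN switch are hypothetical and not recognized by the repo, and MODEL and any other settings are expected to be exported in the caller's environment.

```shell
# Launch (or, with DRY_RUN=1, only print) one run per query set.
# run_query_sets and DRY_RUN are hypothetical, not part of the repo.
# Usage: run_query_sets <benchmark> <query-set>...
run_query_sets() {
  benchmark=$1
  shift
  for query_set in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "BENCHMARK=$benchmark QUERY_SET=$query_set npm run run:benchmark:query-set"
    else
      # Stop at the first failing run instead of continuing silently.
      BENCHMARK=$benchmark QUERY_SET=$query_set npm run run:benchmark:query-set || return 1
    fi
  done
}

# Example invocation:
#   export MODEL=openai-codex/gpt-5.4-mini
#   run_query_sets msmarco-v1-passage dl19 dl20
```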

BM25 tuning during benchmark runs

Benchmark runs accept BM25 tuning through environment variables:

PI_BM25_K1 — default 0.9
PI_BM25_B — default 0.4
PI_BM25_THREADS — default 1
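
For orientation, PI_BM25_K1 and PI_BM25_B set the k_1 and b parameters of BM25. The textbook form of the scoring function is reproduced below as a reminder of what the two knobs control; the Anserini/Lucene implementation behind this repo may differ in details such as the IDF variant or constant factors.

```latex
\[
\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{IDF}(t)\,
\frac{f(t, d)\,(k_1 + 1)}
     {f(t, d) + k_1 \left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}
\]
```

Here f(t, d) is the frequency of term t in document d, |d| is the length of d, and avgdl is the average document length in the index. A larger k_1 lets repeated occurrences of a term keep adding to the score before it saturates, while b ranges from 0 (no length normalization) to 1 (full length normalization).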

Example with explicit BM25 tuning:

PI_BM25_K1=0.82 \
PI_BM25_B=0.68 \
BENCHMARK=msmarco-v1-passage \
QUERY_SET=dl19 \
MODEL=openai-codex/gpt-5.4-mini \
npm run run:benchmark:query-set

Example with shared BM25 daemon tuning:

PI_BM25_K1=0.82 \
PI_BM25_B=0.68 \
PI_BM25_THREADS=4 \
BENCHMARK=browsecomp-plus \
QUERY_SET=q9 \
MODEL=openai-codex/gpt-5.4-mini \
npm run run:benchmark:query-set:shared-bm25

Suggested BrowseComp-Plus parameters:

PI_BM25_K1=25
PI_BM25_B=1

Example:

PI_BM25_K1=25 \
PI_BM25_B=1 \
BENCHMARK=browsecomp-plus \
QUERY_SET=q9 \
MODEL=openai-codex/gpt-5.4-mini \
npm run run:benchmark:query-set:shared-bm25

For systematic BM25 parameter search rather than manual overrides, use:

npm run tune:bm25

3. Summarize and evaluate a run

Summarize:

RUN_DIR=runs/<run> npm run summarize:run

Retrieval evaluation:

RUN_DIR=runs/<run> npm run evaluate:retrieval

Judge evaluation:

INPUT_DIR=runs/<run> npm run evaluate:run

Generate a Markdown report:

RUN_DIR=runs/<run> npm run report:run
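
The four steps above can be chained for a single run directory, as in the sketch below. postprocess_run is a hypothetical helper, and running the steps in the order listed is an assumption rather than a documented requirement. Note that the judge evaluation step reads INPUT_DIR while the other three read RUN_DIR.

```shell
# Summarize, evaluate, and report on one run directory, stopping at the
# first step that fails. postprocess_run is a hypothetical helper name.
postprocess_run() {
  run_dir=$1
  RUN_DIR=$run_dir npm run summarize:run &&
  RUN_DIR=$run_dir npm run evaluate:retrieval &&
  INPUT_DIR=$run_dir npm run evaluate:run &&
  RUN_DIR=$run_dir npm run report:run
}

# Example invocation:
#   postprocess_run "runs/<run>"
```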

benchctl operator workflow

Use the direct run:benchmark:* entrypoints when you want low-level benchmark execution with explicit benchmark and query-set control.

Use benchctl when you want the higher-level operator surface for:

listing registered benchmarks and managed presets
launching supervisor-managed runs
checking run status and managed process state
monitoring runs in the live terminal dashboard

Common commands:

List registered benchmarks and presets:

npm run bench -- benchmarks

Launch a managed shared run:

npm run bench -- run --preset q9_shared --model openai-codex/gpt-5.4-mini

Launch a managed sharded run:

npm run bench -- run --preset browsecomp-plus/qfull_sharded --model openai-codex/gpt-5.4-mini --shards 8

Inspect current run status:

npm run bench:status
npm run bench:managed

Open the live operator TUI:

npm run bench:tui

For the full managed-run and monitoring workflow, see Running benchmarks.

Preferred entrypoints

Preferred operator-facing commands are the Node-first package scripts:

npm run setup:benchmark
npm run run:benchmark:query-set
npm run run:benchmark:query-set:shared-bm25
npm run run:benchmark:query-set:sharded-shared-bm25
npm run summarize:run
npm run evaluate:retrieval
npm run evaluate:run
npm run report:run
npm run bench:tui

Legacy shell scripts under scripts/ still work, but they are compatibility shims rather than the preferred control plane. The older package aliases run:benchmark:query-set:shared and run:benchmark:query-set:sharded also remain available for compatibility; the preferred names state explicitly that these paths use a shared BM25 daemon. Two shell-level implementation boundaries remain intentionally: the benchmark-scoped setup scripts and the thin BM25 JVM bootstrap script used by the typed BM25 launch helpers.

Repo layout

src/orchestration/ — active benchmark-first launch/setup/tuning control-plane entrypoints
src/legacy/ — compatibility-only TypeScript entrypoints that are still intentionally preserved for historical low-level contracts
src/runtime/ — shared runtime primitives such as prompt construction, artifact-path helpers, and isolated agent-dir handling
src/benchmarks/ — typed benchmark definitions, registry helpers, run-manifest snapshot logic
src/wrappers/ — downstream summarize/eval/report wrapper entrypoints and precedence helpers
src/operator/ — monitor, supervisor, TUI, and benchctl operator surfaces
src/evaluation/ — retrieval and judge evaluation backends plus metric helpers
src/report/ — Markdown report generation and report-data helpers
src/bm25/ — BM25 subprocess startup and local transport helpers
src/pi-search/ — pi search extension and helpers
scripts/ — compatibility wrappers plus benchmark-scoped setup implementations and the thin BM25 JVM bootstrap script
jvm/ — JVM BM25 RPC server
data/<dataset>/... — benchmark-scoped local dataset assets
indexes/<index-name>/ — benchmark-scoped local Lucene indexes
vendor/anserini/ — Anserini fatjar prepared locally by setup scripts
runs/ — benchmark run outputs
evals/ — evaluation outputs
notes/ — local notes and experiment writeups

Citation

@misc{hsu2026rethinkingagenticsearchpiserini,
  title         = {Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient?},
  author        = {Tz-Huan Hsu and Jheng-Hong Yang and Jimmy Lin},
  year          = {2026},
  eprint        = {2605.10848},
  archivePrefix = {arXiv},
  primaryClass  = {cs.IR},
  url           = {https://arxiv.org/abs/2605.10848}
}

Notes

Runs snapshot their resolved benchmark condition into <run>/benchmark_manifest_snapshot.json.
Reports now prefer structured run setup metadata from <run>/run_setup.json and fall back to legacy launcher logs when needed.
Do not track generated benchmark content under data/, indexes/, runs/, evals/, or scratch/.
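
One way to catch accidentally staged generated content is to filter the list of staged paths for those directories. The sketch below assumes nothing about the repo's existing .gitignore or hooks, which may already cover this; generated_paths is a hypothetical helper name.

```shell
# Read file paths on stdin and print those under generated-content directories.
# generated_paths is a hypothetical helper name, not part of the repo.
generated_paths() {
  grep -E '^(data|indexes|runs|evals|scratch)/'
}

# Example use in a pre-commit check:
#   if git diff --cached --name-only | generated_paths; then
#     echo "generated benchmark content is staged" >&2
#     exit 1
#   fi
```

The function exits non-zero when no path matches, so the if branch runs only when something generated is staged.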

License

MIT
