A reusable, reproducible pi search-agent workspace
There are many search agents, but this one is yours.
piika is the successor to pi-serini, the original benchmark-driven search-agent workspace started by Jheng-Hong (Matt) Yang. This project carries that path forward under the new repository name.
piika is a reusable, benchmark-driven pi search-agent workspace for index-driven BM25 retrieval, agentic search, and benchmark-aware evaluation.
Current update: this repository has been rebranded to piika from the original pi-serini. Historical release notes, published datasets, and compatibility identifiers may still use the pi-serini name.
Current release status: v0.3.0 supports index-driven benchmark and agentic search workflows for MS MARCO v1 Passage (dl19, dl20) and BrowseComp-Plus, with benchmark-template included as a tiny local end-to-end demo benchmark. This release tracks the Pi package namespace migration to @earendil-works/* and requires pi 0.74.0 or newer.
The repo is now manifest-driven rather than BrowseComp-Plus-only:
- benchmark defaults live in typed registry entries under src/benchmarks/
- each run snapshots its resolved benchmark condition into benchmark_manifest_snapshot.json
- active Node.js/TypeScript control-plane entrypoints live under src/orchestration/
- compatibility-only TypeScript entrypoints live under src/legacy/
- shared runtime primitives live under src/runtime/
- legacy shell scripts remain available as compatibility shims
BrowseComp-Plus remains the default benchmark for reproducibility, but the same control plane now also supports MS MARCO v1 Passage and a tiny local benchmark-template demo benchmark.
- browsecomp-plus: default packaged benchmark with query sets q9, q100, q300, and qfull
- msmarco-v1-passage: index-driven MS MARCO v1 passage benchmark with query sets dl19 and dl20
- benchmark-template: tiny local end-to-end demo benchmark for development and validation
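The catalog above can be mirrored in a small lookup table, for example to check a BENCHMARK / QUERY_SET pair before launching a run. This is only an illustrative sketch, not the typed registry under src/benchmarks/: the names `CATALOG` and `isKnownQuerySet` are hypothetical, and the `test` query set for benchmark-template is taken from the demo run command further down in this README.

```typescript
// Illustrative only: a hand-written mirror of the catalog listed above.
// The real source of truth is the typed registry under src/benchmarks/.
const CATALOG: Record<string, readonly string[]> = {
  "browsecomp-plus": ["q9", "q100", "q300", "qfull"],
  "msmarco-v1-passage": ["dl19", "dl20"],
  "benchmark-template": ["test"],
};

// Returns true when the benchmark is listed and offers the given query set.
function isKnownQuerySet(benchmark: string, querySet: string): boolean {
  // Own-property check so inherited keys such as "toString" are rejected.
  if (!Object.prototype.hasOwnProperty.call(CATALOG, benchmark)) {
    return false;
  }
  return CATALOG[benchmark].includes(querySet);
}

console.log(isKnownQuerySet("msmarco-v1-passage", "dl19")); // true
console.log(isKnownQuerySet("browsecomp-plus", "dl19")); // false
```

For the authoritative list, prefer the CLI catalog command rather than a hard-coded table like this one.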
To inspect the registered benchmark catalog from the CLI:
npm run bench -- benchmarks
Prerequisites:
- pi 0.74.0 or newer installed and logged in
- Node.js with npx
- Java 21+
- python3
- uv
- curl or wget
Supported developer environments:
- macOS
- Linux
If Java is installed in a non-standard location, set JAVA_HOME explicitly before running setup or benchmark commands.
Pi packages now live under the @earendil-works/* npm namespace. This repo depends on @earendil-works/pi-coding-agent and @earendil-works/pi-tui; use that namespace for any local extension or SDK imports rather than the retired @mariozechner/* package names.
Model note: gpt-5.3-codex was used for some historical judge runs, but OpenAI's Codex model documentation now lists it as deprecated when signing in with ChatGPT. Use the current recommended Codex models for new subscription-backed runs, and label any replacement judge model explicitly in reports.
BrowseComp-Plus base assets:
npm run setup:browsecomp-plus
BrowseComp-Plus decrypted ground truth is a separate opt-in step and requires an explicit decryption secret from the operator:
BROWSECOMP_PLUS_CANARY='...your secret...' \
npm run setup:ground-truth:browsecomp-plus
MS MARCO v1 Passage:
npm run setup:msmarco-v1-passage
Tiny local demo benchmark:
npm run setup:benchmark -- --benchmark benchmark-template
Use the same generic command surface for every benchmark; only BENCHMARK and QUERY_SET change.
Default single-process launch:
BENCHMARK=msmarco-v1-passage \
QUERY_SET=dl19 \
MODEL=openai-codex/gpt-5.4-mini \
npm run run:benchmark:query-set
Shared BM25 daemon (preferred package alias):
BENCHMARK=browsecomp-plus \
QUERY_SET=q9 \
MODEL=openai-codex/gpt-5.4-mini \
PI_BM25_RPC_PORT=50455 \
npm run run:benchmark:query-set:shared-bm25
Sharded shared-daemon launch (preferred package alias):
BENCHMARK=browsecomp-plus \
QUERY_SET=q100 \
SHARD_COUNT=4 \
MODEL=openai-codex/gpt-5.4-mini \
npm run run:benchmark:query-set:sharded-shared-bm25
Tiny local demo run:
BENCHMARK=benchmark-template \
QUERY_SET=test \
MODEL=openai-codex/gpt-5.4-mini \
npm run run:benchmark:query-set
Benchmark runs accept BM25 tuning through environment variables:
- PI_BM25_K1: default 0.9
- PI_BM25_B: default 0.4
- PI_BM25_THREADS: default 1
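As a rough illustration of how these variables and their defaults could be resolved, here is a minimal sketch. Only the variable names and default values come from this README; the helper names (`resolveBm25Params`, `readNumber`) and the validation behavior are assumptions, not the repository's actual parsing code.

```typescript
// Hypothetical helpers: only the variable names and defaults come from the README.
interface Bm25Params {
  k1: number;
  b: number;
  threads: number;
}

type Env = Record<string, string | undefined>;

const BM25_DEFAULTS: Bm25Params = { k1: 0.9, b: 0.4, threads: 1 };

// Reads one numeric variable, using the fallback when it is unset or blank
// and rejecting values that do not parse to a finite number.
function readNumber(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") {
    return fallback;
  }
  const value = Number(raw);
  if (!Number.isFinite(value)) {
    throw new Error(`${name} must be a finite number, got "${raw}"`);
  }
  return value;
}

function resolveBm25Params(env: Env): Bm25Params {
  return {
    k1: readNumber(env, "PI_BM25_K1", BM25_DEFAULTS.k1),
    b: readNumber(env, "PI_BM25_B", BM25_DEFAULTS.b),
    threads: readNumber(env, "PI_BM25_THREADS", BM25_DEFAULTS.threads),
  };
}

// In a Node.js script the argument would typically be process.env.
console.log(resolveBm25Params({ PI_BM25_K1: "0.82", PI_BM25_B: "0.68" }));
```

Passing the environment in as a plain object keeps the sketch easy to test without touching the real process environment.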
Example with explicit BM25 tuning:
PI_BM25_K1=0.82 \
PI_BM25_B=0.68 \
BENCHMARK=msmarco-v1-passage \
QUERY_SET=dl19 \
MODEL=openai-codex/gpt-5.4-mini \
npm run run:benchmark:query-set
Example with shared BM25 daemon tuning:
PI_BM25_K1=0.82 \
PI_BM25_B=0.68 \
PI_BM25_THREADS=4 \
BENCHMARK=browsecomp-plus \
QUERY_SET=q9 \
MODEL=openai-codex/gpt-5.4-mini \
npm run run:benchmark:query-set:shared-bm25
Suggested BrowseComp-Plus parameters:
- PI_BM25_K1=25
- PI_BM25_B=1
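To get a feel for what these values change, the sketch below evaluates the textbook BM25 term-frequency factor, tf·(k1+1) / (tf + k1·(1 − b + b·dl/avgdl)), for the default and the suggested settings. It assumes the textbook form of BM25; the Lucene/Anserini scorer behind this workspace may differ in constant factors and IDF handling, and `tfFactor` is a hypothetical name used only for illustration.

```typescript
// Textbook BM25 term-frequency factor (IDF omitted). This is an assumption
// about the general shape of the formula, not the repo's scoring code.
function tfFactor(
  tf: number,
  docLen: number,
  avgDocLen: number,
  k1: number,
  b: number,
): number {
  // b controls how strongly document length is normalized (0 = off, 1 = full).
  const lengthNorm = 1 - b + b * (docLen / avgDocLen);
  // k1 controls how quickly repeated occurrences of a term saturate.
  return (tf * (k1 + 1)) / (tf + k1 * lengthNorm);
}

// Compare the documented defaults with the suggested BrowseComp-Plus values
// for a document of average length.
const avgDocLen = 100;
for (const tf of [1, 2, 5, 10]) {
  const defaults = tfFactor(tf, avgDocLen, avgDocLen, 0.9, 0.4);
  const suggested = tfFactor(tf, avgDocLen, avgDocLen, 25, 1);
  console.log(
    `tf=${tf} defaults=${defaults.toFixed(2)} suggested=${suggested.toFixed(2)}`,
  );
}
```

Under this formula, a large k1 lets the factor keep growing with repeated terms where a small k1 has nearly saturated, and b=1 applies full document-length normalization.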
Example:
PI_BM25_K1=25 \
PI_BM25_B=1 \
BENCHMARK=browsecomp-plus \
QUERY_SET=q9 \
MODEL=openai-codex/gpt-5.4-mini \
npm run run:benchmark:query-set:shared-bm25
For systematic BM25 parameter search rather than manual overrides, use:
npm run tune:bm25
Summarize:
RUN_DIR=runs/<run> npm run summarize:run
Retrieval evaluation:
RUN_DIR=runs/<run> npm run evaluate:retrieval
Judge evaluation:
INPUT_DIR=runs/<run> npm run evaluate:run
Generate a Markdown report:
RUN_DIR=runs/<run> npm run report:run
Use the direct run:benchmark:* entrypoints when you want low-level benchmark execution with explicit benchmark and query-set control.
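The four post-run commands above (summarize, retrieval evaluation, judge evaluation, report) can be described as data so that a script can print or run them in order. This is a sketch under assumptions: `postRunSteps` and `toShellLine` are hypothetical helpers, the ordering simply mirrors this README rather than a documented dependency between the steps, and `runs/example-run` is a placeholder path. Note that the judge evaluation reads INPUT_DIR while the other steps read RUN_DIR.

```typescript
// Hypothetical description of the post-run steps listed in this README.
interface PostRunStep {
  script: string; // npm script name, as documented above
  env: Record<string, string>; // environment variables the step reads
}

function postRunSteps(runDir: string): PostRunStep[] {
  return [
    { script: "summarize:run", env: { RUN_DIR: runDir } },
    { script: "evaluate:retrieval", env: { RUN_DIR: runDir } },
    // The judge evaluation takes INPUT_DIR rather than RUN_DIR.
    { script: "evaluate:run", env: { INPUT_DIR: runDir } },
    { script: "report:run", env: { RUN_DIR: runDir } },
  ];
}

// Renders one step as the shell line an operator would type.
function toShellLine(step: PostRunStep): string {
  const assignments = Object.entries(step.env)
    .map(([name, value]) => `${name}=${value}`)
    .join(" ");
  return `${assignments} npm run ${step.script}`;
}

// "runs/example-run" is a placeholder, not a real run directory.
for (const step of postRunSteps("runs/example-run")) {
  console.log(toShellLine(step));
}
```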
Use benchctl when you want the higher-level operator surface for:
- listing registered benchmarks and managed presets
- launching supervisor-managed runs
- checking run status and managed process state
- monitoring runs in the live terminal dashboard
Common commands:
List registered benchmarks and presets:
npm run bench -- benchmarks
Launch a managed shared run:
npm run bench -- run --preset q9_shared --model openai-codex/gpt-5.4-mini
Launch a managed sharded run:
npm run bench -- run --preset browsecomp-plus/qfull_sharded --model openai-codex/gpt-5.4-mini --shards 8
Inspect current run status:
npm run bench:status
npm run bench:managed
Open the live operator TUI:
npm run bench:tui
For the full managed-run and monitoring workflow, see Running benchmarks.
Preferred operator-facing commands are the Node-first package scripts:
- npm run setup:benchmark
- npm run run:benchmark:query-set
- npm run run:benchmark:query-set:shared-bm25
- npm run run:benchmark:query-set:sharded-shared-bm25
- npm run summarize:run
- npm run evaluate:retrieval
- npm run evaluate:run
- npm run report:run
- npm run bench:tui
Legacy shell scripts under scripts/ still work, but they are compatibility shims rather than the preferred control plane. The older package aliases run:benchmark:query-set:shared and run:benchmark:query-set:sharded also still work as compatibility aliases, but the preferred operator-facing names now say explicitly that these paths use a shared BM25 daemon. The two intentional shell-level implementation boundaries that remain are benchmark-scoped setup scripts and the thin BM25 JVM bootstrap script used by the typed BM25 launch helpers.
- src/orchestration/: active benchmark-first launch/setup/tuning control-plane entrypoints
- src/legacy/: compatibility-only TypeScript entrypoints that are still intentionally preserved for historical low-level contracts
- src/runtime/: shared runtime primitives such as prompt construction, artifact-path helpers, and isolated agent-dir handling
- src/benchmarks/: typed benchmark definitions, registry helpers, run-manifest snapshot logic
- src/wrappers/: downstream summarize/eval/report wrapper entrypoints and precedence helpers
- src/operator/: monitor, supervisor, TUI, and benchctl operator surfaces
- src/evaluation/: retrieval and judge evaluation backends plus metric helpers
- src/report/: Markdown report generation and report-data helpers
- src/bm25/: BM25 subprocess startup and local transport helpers
- src/pi-search/: pi search extension and helpers
- scripts/: compatibility wrappers plus benchmark-scoped setup implementations and the thin BM25 JVM bootstrap script
- jvm/: JVM BM25 RPC server
- data/<dataset>/...: benchmark-scoped local dataset assets
- indexes/<index-name>/: benchmark-scoped local Lucene indexes
- vendor/anserini/: Anserini fatjar prepared locally by setup scripts
- runs/: benchmark run outputs
- evals/: evaluation outputs
- notes/: local notes and experiment writeups
- Paper
- Project page
- Running benchmarks
- Evaluation semantics
- Reproducibility
- Adding a benchmark
- BM25 backend interface
- Released Run on BrowseComp-Plus (Canary to prevent leakage: piserini-a-minimal-search-agent)
@misc{hsu2026rethinkingagenticsearchpiserini,
title = {Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient?},
author = {Tz-Huan Hsu and Jheng-Hong Yang and Jimmy Lin},
year = {2026},
eprint = {2605.10848},
archivePrefix = {arXiv},
primaryClass = {cs.IR},
url = {https://arxiv.org/abs/2605.10848}
}
- Runs snapshot their resolved benchmark condition into <run>/benchmark_manifest_snapshot.json.
- Reports now prefer structured run setup metadata from <run>/run_setup.json and fall back to legacy launcher logs when needed.
- Do not track generated benchmark content under data/, indexes/, runs/, evals/, or scratch/.
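The report-metadata preference described in the notes above (structured run_setup.json first, legacy launcher logs as a fallback) can be sketched as a small decision function. This is a hypothetical illustration, not the report code under src/report/: `pickSetupSource` and the `SetupSource` type are invented names, and the location and format of the legacy launcher logs are deliberately left unspecified because this README does not describe them.

```typescript
// Hypothetical result type: which source a report would read its setup from.
type SetupSource =
  | { kind: "run_setup_json"; path: string }
  | { kind: "legacy_launcher_logs" };

// Prefers <run>/run_setup.json when it exists, otherwise signals the legacy
// fallback. The existence check is injected so the sketch stays testable.
function pickSetupSource(
  runDir: string,
  exists: (path: string) => boolean,
): SetupSource {
  const structuredPath = `${runDir}/run_setup.json`;
  if (exists(structuredPath)) {
    return { kind: "run_setup_json", path: structuredPath };
  }
  return { kind: "legacy_launcher_logs" };
}

// Example with an in-memory file list; "runs/example-run" is a placeholder.
const knownFiles = new Set(["runs/example-run/run_setup.json"]);
const source = pickSetupSource("runs/example-run", (p) => knownFiles.has(p));
console.log(source.kind);
```

In a Node.js script, the `exists` argument could be `fs.existsSync` from `node:fs`.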
MIT