cója is a local search engine over Wikipedia dump data, implemented in Go.
This document explains the system design, core concepts, indexing model, ranking logic, and internal data flow.
The project is built to implement a practical information-retrieval system from first principles, with:
- streamed corpus ingestion,
- fielded inverted indexing,
- lexical ranking with BM25,
- phrase/proximity-aware relevance adjustments,
- and reproducible offline evaluation.
The design favors clarity of IR concepts and debuggability over distributed scale.
The index maps normalized terms to posting lists.
A posting represents one document-level occurrence summary:
- DocID
- Frequency (term frequency in that field/document)
- Positions (token positions, enabling phrase/proximity logic)
The same document is indexed across multiple fields:
- title
- intro
- body
Each field has separate posting lists and separate length statistics.
This enables field-aware scoring (for example, a match in title can be weighted more than body).
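As a rough illustration of the posting and per-field layout described above, here is a minimal sketch. The type and method names (`Posting`, `FieldIndex`, `Add`) are assumptions made for this example and may not match the project's actual schema.

```go
package main

import "fmt"

// Posting summarizes one term's occurrences in one field of one document.
// The shape follows the description above; the names are illustrative.
type Posting struct {
	DocID     uint32
	Frequency int   // term frequency in this field of this document
	Positions []int // token positions, kept for phrase/proximity checks
}

// FieldIndex holds posting lists and length statistics for a single field.
type FieldIndex struct {
	Postings    map[string][]Posting // normalized term -> posting list
	DocLen      map[uint32]int       // token count per document in this field
	TotalTokens int                  // sum of all document lengths in this field
}

func NewFieldIndex() *FieldIndex {
	return &FieldIndex{
		Postings: make(map[string][]Posting),
		DocLen:   make(map[uint32]int),
	}
}

// Add records the already-normalized tokens of one document field.
func (f *FieldIndex) Add(docID uint32, tokens []string) {
	positions := make(map[string][]int)
	for i, tok := range tokens {
		positions[tok] = append(positions[tok], i)
	}
	for term, pos := range positions {
		f.Postings[term] = append(f.Postings[term], Posting{
			DocID:     docID,
			Frequency: len(pos),
			Positions: pos,
		})
	}
	f.DocLen[docID] = len(tokens)
	f.TotalTokens += len(tokens)
}

func main() {
	title := NewFieldIndex()
	title.Add(1, []string{"new", "york", "new"})
	p := title.Postings["new"][0]
	fmt.Println(p.DocID, p.Frequency, p.Positions) // prints: 1 2 [0 2]
}
```

Keeping one `FieldIndex` per field (title, intro, body) is what lets each field carry its own length statistics for scoring.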
BM25 is used as the base lexical scoring function.
Scoring is computed independently per field and combined as a weighted sum:
score = w_title * BM25(title) + w_intro * BM25(intro) + w_body * BM25(body) + phraseBoost
Current relative weighting emphasizes title and intro over body.
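The weighted sum above can be sketched as follows. This assumes the common BM25 form with `k1 = 1.2`, `b = 0.75`, and a smoothed IDF; the field weights in `main` are placeholder values chosen for illustration, not the project's tuned settings.

```go
package main

import (
	"fmt"
	"math"
)

// Conventional BM25 defaults, assumed here for illustration.
const (
	k1 = 1.2
	b  = 0.75
)

// bm25 scores one term in one field of one document.
// avgLen is the average length of that field across the corpus.
func bm25(tf, docLen int, avgLen float64, docFreq, totalDocs int) float64 {
	if tf == 0 || avgLen == 0 {
		return 0
	}
	idf := math.Log(1 + (float64(totalDocs-docFreq)+0.5)/(float64(docFreq)+0.5))
	norm := k1 * (1 - b + b*float64(docLen)/avgLen)
	return idf * float64(tf) * (k1 + 1) / (float64(tf) + norm)
}

// fieldMatch carries the statistics needed to score one term in one field.
type fieldMatch struct {
	weight  float64 // hypothetical per-field weight
	tf      int
	docLen  int
	avgLen  float64
	docFreq int
}

// combine adds the weighted per-field scores and the phrase boost.
func combine(matches []fieldMatch, totalDocs int, phraseBoost float64) float64 {
	score := phraseBoost
	for _, m := range matches {
		score += m.weight * bm25(m.tf, m.docLen, m.avgLen, m.docFreq, totalDocs)
	}
	return score
}

func main() {
	matches := []fieldMatch{
		{weight: 3.0, tf: 1, docLen: 2, avgLen: 3, docFreq: 5},      // title
		{weight: 2.0, tf: 1, docLen: 20, avgLen: 25, docFreq: 40},   // intro
		{weight: 1.0, tf: 4, docLen: 900, avgLen: 600, docFreq: 80}, // body
	}
	fmt.Printf("%.3f\n", combine(matches, 1000, 0))
}
```

Because each field is normalized against its own average length, a short title is not penalized relative to a long body.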
Because postings store token positions, the system can reward exact in-order adjacency for query phrases.
Phrase boosts are field-sensitive:
- strongest in title,
- then intro,
- then body.
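A minimal sketch of in-order adjacency detection over positional postings, under stated assumptions: the function names and the per-field boost values are hypothetical, and each inner slice is assumed to be a sorted position list for one phrase term within a single field of a single document.

```go
package main

import (
	"fmt"
	"sort"
)

// Hypothetical boost values; only their ordering reflects the text above.
var fieldBoost = map[string]float64{"title": 3.0, "intro": 2.0, "body": 1.0}

// contains reports whether target is present in a sorted position list.
func contains(sorted []int, target int) bool {
	i := sort.SearchInts(sorted, target)
	return i < len(sorted) && sorted[i] == target
}

// hasPhrase reports whether the phrase terms occur adjacent and in order.
// positions[i] is the sorted position list of the i-th phrase term.
func hasPhrase(positions [][]int) bool {
	if len(positions) < 2 {
		return false // a phrase needs at least two terms
	}
	for _, start := range positions[0] {
		matched := true
		for i := 1; i < len(positions); i++ {
			if !contains(positions[i], start+i) {
				matched = false
				break
			}
		}
		if matched {
			return true
		}
	}
	return false
}

// phraseBoost sums the boost of every field in which the phrase occurs.
func phraseBoost(byField map[string][][]int) float64 {
	total := 0.0
	for field, positions := range byField {
		if hasPhrase(positions) {
			total += fieldBoost[field]
		}
	}
	return total
}

func main() {
	// "new" at positions 0 and 5, "york" at positions 1 and 9.
	fmt.Println(hasPhrase([][]int{{0, 5}, {1, 9}})) // prints: true
}
```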
For multi-term queries:
- strict stage requires all query terms,
- fallback stage allows partial matches if strict stage yields no results.
This balances precision and recall for entity-style queries.
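The two-stage selection can be sketched like this. `selectCandidates` and its input shape (document ID mapped to the number of distinct query terms matched) are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// selectCandidates returns documents matching all query terms when any exist.
// Otherwise it falls back to documents matching at least one term.
// The boolean result reports whether the strict stage produced the results.
func selectCandidates(matched map[uint32]int, numTerms int) ([]uint32, bool) {
	var strict, partial []uint32
	for id, n := range matched {
		if n <= 0 {
			continue
		}
		partial = append(partial, id)
		if n == numTerms {
			strict = append(strict, id)
		}
	}
	chosen, isStrict := partial, false
	if len(strict) > 0 {
		chosen, isStrict = strict, true
	}
	// Sort IDs so the output does not depend on map iteration order.
	sort.Slice(chosen, func(i, j int) bool { return chosen[i] < chosen[j] })
	return chosen, isStrict
}

func main() {
	docs, strict := selectCandidates(map[uint32]int{1: 2, 2: 1}, 2)
	fmt.Println(docs, strict) // prints: [1] true
}
```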
A benchmark runner evaluates ranked output using judged query sets and computes:
- MRR@10
- nDCG@10
- Recall@50
This provides objective tuning feedback for ranking changes.
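For reference, the three metrics can be computed per query roughly as below, then averaged over the judged set. This sketch assumes graded judgments (grade greater than zero means relevant), unique document IDs in the ranking, and the exponential-gain form of DCG; the benchmark runner may use different conventions.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// reciprocalRank returns 1/rank of the first relevant result in the top k.
func reciprocalRank(ranked []string, grades map[string]int, k int) float64 {
	for i, id := range ranked {
		if i >= k {
			break
		}
		if grades[id] > 0 {
			return 1 / float64(i+1)
		}
	}
	return 0
}

// recallAt returns the fraction of relevant documents found in the top k.
func recallAt(ranked []string, grades map[string]int, k int) float64 {
	relevant := 0
	for _, g := range grades {
		if g > 0 {
			relevant++
		}
	}
	if relevant == 0 {
		return 0
	}
	found := 0
	for i, id := range ranked {
		if i >= k {
			break
		}
		if grades[id] > 0 {
			found++
		}
	}
	return float64(found) / float64(relevant)
}

// dcg computes discounted cumulative gain over the first k grades.
func dcg(grades []int, k int) float64 {
	sum := 0.0
	for i, g := range grades {
		if i >= k {
			break
		}
		sum += (math.Pow(2, float64(g)) - 1) / math.Log2(float64(i)+2)
	}
	return sum
}

// ndcgAt normalizes DCG by the DCG of the ideal ordering of judged documents.
func ndcgAt(ranked []string, grades map[string]int, k int) float64 {
	got := make([]int, len(ranked))
	for i, id := range ranked {
		got[i] = grades[id]
	}
	ideal := make([]int, 0, len(grades))
	for _, g := range grades {
		ideal = append(ideal, g)
	}
	sort.Sort(sort.Reverse(sort.IntSlice(ideal)))
	best := dcg(ideal, k)
	if best == 0 {
		return 0
	}
	return dcg(got, k) / best
}

func main() {
	ranked := []string{"b", "a", "c"}
	grades := map[string]int{"a": 1, "d": 1}
	fmt.Printf("RR@10=%.3f nDCG@10=%.3f Recall@50=%.3f\n",
		reciprocalRank(ranked, grades, 10), ndcgAt(ranked, grades, 10), recallAt(ranked, grades, 50))
}
```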
The system has three major runtime paths:
- Indexing path: converts Wikipedia dump pages into persisted index segments.
- Serving path: loads persisted segments into memory and answers queries.
- Evaluation path: runs judged queries and computes quality metrics.
- Input stream
  - Wikipedia XML is consumed either from a file stream or stdin stream.
  - Parsing is sequential because XML token order is linear.
- Document extraction
  - Only article namespace pages are kept.
  - Redirect pages are skipped.
- Text normalization
  - MediaWiki markup and HTML-like artifacts are removed.
  - Entities and whitespace are normalized.
- Field preparation
  - title comes from page metadata.
  - body comes from cleaned article text.
  - intro is a compact first-sentence-like snippet derived from body.
- Tokenization and stemming
  - lowercasing
  - boundary-based token splitting
  - stopword and numeric filtering
  - lightweight stemming
  - positional emission
- Index construction
  - for each field, per-term positions/frequencies are aggregated per document.
  - posting lists are updated.
  - document length and corpus stats are updated.
- Segment persistence
  - documents are indexed in bounded chunks.
  - each chunk is serialized as a segment file.
  - a manifest records segment inventory and corpus-level metadata.
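The chunk-then-serialize step might look like the sketch below. It uses `encoding/gob`, which the design notes name as the segment format, but the `Segment` layout, the helper names, and the in-memory buffer (standing in for a segment file) are all assumptions for illustration.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// Posting mirrors the posting shape described earlier (illustrative names).
type Posting struct {
	DocID     uint32
	Frequency int
	Positions []int
}

// Segment is a hypothetical on-disk shard covering one bounded chunk of documents.
type Segment struct {
	ID       int
	Postings map[string][]Posting
	DocLen   map[uint32]int
}

// chunk splits document IDs into groups of at most size elements.
func chunk(docIDs []uint32, size int) [][]uint32 {
	var out [][]uint32
	for size > 0 && len(docIDs) > 0 {
		n := size
		if n > len(docIDs) {
			n = len(docIDs)
		}
		out = append(out, docIDs[:n])
		docIDs = docIDs[n:]
	}
	return out
}

// encodeSegment serializes a segment with gob.
func encodeSegment(s Segment) ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(s); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decodeSegment restores a segment from gob-encoded bytes.
func decodeSegment(data []byte) (Segment, error) {
	var s Segment
	err := gob.NewDecoder(bytes.NewReader(data)).Decode(&s)
	return s, err
}

func main() {
	seg := Segment{
		ID:       1,
		Postings: map[string][]Posting{"turing": {{DocID: 1, Frequency: 2, Positions: []int{0, 7}}}},
		DocLen:   map[uint32]int{1: 12},
	}
	data, err := encodeSegment(seg)
	if err != nil {
		panic(err)
	}
	back, err := decodeSegment(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(chunk([]uint32{1, 2, 3, 4, 5}, 2)), back.Postings["turing"][0].Positions)
}
```

In a real indexer the encoder would write to a file per chunk, and the manifest would record each file alongside the corpus-level totals.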
- Manifest is read.
- Referenced segments are loaded and merged into one in-memory index.
- Query text is parsed into:
  - normalized terms,
  - optional quoted phrases,
  - optional soft phrase for unquoted multi-term queries.
- Candidate scoring is computed with weighted fielded BM25.
- Phrase/proximity boosts are applied using positional postings.
- Strict-or-fallback retrieval mode selects result set.
- Ranked results are returned with canonical Wikipedia URLs.
All query text goes through the same normalization pipeline used during indexing to ensure term-space consistency.
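A shared normalization pipeline of this kind could look like the sketch below. The stopword list and the suffix-stripping rule are deliberately tiny placeholders, not the project's actual stemmer, and the choice to number positions over kept tokens only is an assumption.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Token is a normalized term with its position in the kept-token stream.
type Token struct {
	Term string
	Pos  int
}

// A tiny placeholder stopword list for illustration.
var stopwords = map[string]bool{"the": true, "of": true, "a": true, "and": true, "in": true}

// isNumeric reports whether the word consists only of digits.
func isNumeric(w string) bool {
	for _, r := range w {
		if !unicode.IsDigit(r) {
			return false
		}
	}
	return true
}

// stem is a stand-in for lightweight stemming: it strips one trailing "s".
func stem(w string) string {
	if len(w) > 3 && strings.HasSuffix(w, "s") && !strings.HasSuffix(w, "ss") {
		return w[:len(w)-1]
	}
	return w
}

// tokenize lowercases, splits on non-alphanumeric boundaries, filters
// stopwords and pure numbers, stems, and emits positions.
func tokenize(text string) []Token {
	words := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	var out []Token
	for _, w := range words {
		if stopwords[w] || isNumeric(w) {
			continue
		}
		out = append(out, Token{Term: stem(w), Pos: len(out)})
	}
	return out
}

func main() {
	fmt.Println(tokenize("The Kings of England, 1066!")) // prints: [{king 0} {england 1}]
}
```

Calling the same `tokenize` on both documents and queries is what keeps the two term spaces aligned.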
Quoted substrings are parsed as explicit phrase intents and receive targeted positional boosts.
When a query has multiple unquoted terms, the full term sequence is treated as a soft phrase candidate, improving name/entity ranking without requiring explicit quotes.
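The parsing rules above can be sketched as follows. `Query` and `parseQuery` are hypothetical names, normalization is reduced to lowercasing and whitespace splitting for brevity, and the sketch assumes the soft phrase is only formed when the query contains no quoted phrase.

```go
package main

import (
	"fmt"
	"strings"
)

// Query is the parsed form of a search string (illustrative shape).
type Query struct {
	Terms      []string   // all normalized terms, in order
	Phrases    [][]string // explicit quoted phrases
	SoftPhrase []string   // full term sequence for unquoted multi-term queries
}

// parseQuery splits on double quotes: odd-numbered segments that have a
// closing quote are treated as explicit phrases.
func parseQuery(q string) Query {
	var out Query
	parts := strings.Split(q, `"`)
	for i, part := range parts {
		words := strings.Fields(strings.ToLower(part))
		if len(words) == 0 {
			continue
		}
		out.Terms = append(out.Terms, words...)
		quoted := i%2 == 1 && i < len(parts)-1
		if quoted {
			out.Phrases = append(out.Phrases, words)
		}
	}
	if len(out.Phrases) == 0 && len(out.Terms) > 1 {
		out.SoftPhrase = out.Terms
	}
	return out
}

func main() {
	fmt.Printf("%+v\n", parseQuery(`"new york" pizza`))
	fmt.Printf("%+v\n", parseQuery("Alan Turing"))
}
```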
For each query term:
- compute field-specific BM25 using field-specific document length normalization.
- accumulate weighted contribution across fields.
Documents that match a higher fraction of query terms are scaled upward.
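One simple way to express this scaling is a multiplier that grows with the matched fraction. The formula and the constant `alpha` below are assumptions for illustration only; the actual scaling rule is not specified here.

```go
package main

import "fmt"

// alpha controls how strongly coverage scales the score (hypothetical value).
const alpha = 0.5

// coverageFactor returns a multiplier in [1, 1+alpha] that grows with the
// fraction of query terms a document matches.
func coverageFactor(matched, total int) float64 {
	if total <= 0 {
		return 1
	}
	return 1 + alpha*float64(matched)/float64(total)
}

func main() {
	base := 4.0
	fmt.Println(base*coverageFactor(1, 2), base*coverageFactor(2, 2)) // prints: 5 6
}
```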
For each phrase candidate:
- detect adjacent in-order term positions by field.
- apply additive boosts with highest priority for title matches.
Results are sorted by descending score, with deterministic tie-break behavior.
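Deterministic ordering can be achieved by breaking score ties on a stable key. The sketch below assumes ties are broken by ascending document ID; the `Result` type and the tie-break key are illustrative choices.

```go
package main

import (
	"fmt"
	"sort"
)

// Result is one scored document (illustrative shape).
type Result struct {
	DocID uint32
	Title string
	Score float64
}

// rank sorts by descending score, then ascending DocID for equal scores,
// so repeated runs return the same order.
func rank(results []Result) {
	sort.Slice(results, func(i, j int) bool {
		if results[i].Score != results[j].Score {
			return results[i].Score > results[j].Score
		}
		return results[i].DocID < results[j].DocID
	})
}

func main() {
	results := []Result{
		{DocID: 9, Title: "B", Score: 2.0},
		{DocID: 4, Title: "A", Score: 2.0},
		{DocID: 7, Title: "C", Score: 3.5},
	}
	rank(results)
	for _, r := range results {
		fmt.Println(r.DocID, r.Title, r.Score)
	}
}
```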
- Posting lists
  - PostingLists (body)
  - TitlePostingLists (title)
  - IntroPostingLists (intro)
- Document store
  - document title
  - per-field lengths
- Corpus stats
  - total docs
  - per-field total token counts
  - derived average field lengths
- Segment files
  - serialized index shards
  - intended for bounded-memory indexing and resumable artifacts
- Manifest
  - segment inventory
  - indexing parameters
  - corpus aggregate stats
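To show how corpus stats combine when segments are merged and how average field lengths are derived, here is a small sketch. `CorpusStats`, `Merge`, and `AvgLen` are hypothetical names for this example.

```go
package main

import "fmt"

// CorpusStats holds corpus-level totals (illustrative shape).
type CorpusStats struct {
	TotalDocs   int
	TotalTokens map[string]int // field name -> total token count
}

// Merge folds another segment's totals into c.
func (c *CorpusStats) Merge(other CorpusStats) {
	if c.TotalTokens == nil {
		c.TotalTokens = make(map[string]int)
	}
	c.TotalDocs += other.TotalDocs
	for field, n := range other.TotalTokens {
		c.TotalTokens[field] += n
	}
}

// AvgLen derives the average length of a field across all documents.
func (c CorpusStats) AvgLen(field string) float64 {
	if c.TotalDocs == 0 {
		return 0
	}
	return float64(c.TotalTokens[field]) / float64(c.TotalDocs)
}

func main() {
	var stats CorpusStats
	stats.Merge(CorpusStats{TotalDocs: 2, TotalTokens: map[string]int{"title": 6, "body": 200}})
	stats.Merge(CorpusStats{TotalDocs: 3, TotalTokens: map[string]int{"title": 9, "body": 300}})
	fmt.Println(stats.TotalDocs, stats.AvgLen("title"), stats.AvgLen("body")) // prints: 5 3 100
}
```

Averages are derived from the merged totals rather than stored per segment, since averaging per-segment averages would be wrong when segments differ in size.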
- cmd/indexer/main.go: indexing orchestration, worker/collector pipeline, segment writing.
- cmd/server/main.go: search serving, query handling, result projection.
- cmd/benchmark/main.go: judged-query evaluation and metrics reporting.
- parser.go: streaming XML page extraction.
- wikitext.go: wiki markup cleaning and intro extraction.
- tokenizer.go: normalization/tokenization pipeline with positions.
- stemmer.go: stemming logic.
- index.go: core index schema, per-field stats, merge.
- search.go: query parsing + scoring + ranking.
- persist.go: segment serialization/deserialization.
- manifest.go: manifest parsing and segment loading.
UI layer consuming /search responses for result presentation.
Judged query files used by the evaluation runner.
- Startup-heavy, query-fast serving
  - all segments are loaded into RAM before serving.
- Lexical and explainable ranking
  - BM25 + explicit boosts rather than opaque ML ranking.
- Simple local persistence
  - gob-based segments for straightforward serialization.
These choices carry trade-offs:
- restart requires reload/merge of persisted segments,
- memory usage grows with indexed corpus size,
- quality is bounded by lexical features and judged-set coverage.
- Single-node in-memory serving model.
- No typo correction or synonym/alias expansion yet.
- No learned reranking stage.
- No distributed indexing/search execution.
- No production hardening layer (auth, quotas, multi-tenant controls).
cója currently represents a complete lexical IR stack:
- streamed ingestion,
- fielded positional indexing,
- weighted BM25 + phrase-aware ranking,
- persistent segments,
- online serving,
- and offline quality measurement.
Its architecture is intentionally modular so ranking and storage strategies can evolve independently.