gsaini/lucene-by-example

Lucene by Example

A hands-on learning project that walks through the core features of Apache Lucene one self-contained module at a time.

Everything runs entirely in memory against a tiny built-in book catalogue, so you can read a module, run it, tweak it, and immediately see how the output changes — no external services, no setup.


Requirements

  • Java 25+ (current LTS)
  • Maven 3.8+
  • Lucene 10.4.0 (declared in pom.xml, pulled by Maven)

Running

Run every module in order:

mvn -q compile exec:java

Run a single module by number (1–10):

mvn -q compile exec:java -Dexec.args=3

Run a few modules in sequence:

mvn -q compile exec:java -Dexec.args="1 3 7"

Running the tests

Each module has a matching integration-test class under src/test/java/com/example/lucene/ that builds a real in-memory index, runs real queries, and asserts on real results — no mocks.

# Run every test
mvn -q test

# Run a single test class
mvn -q test -Dtest=Module03_QueryTypesIT

# Run a single method
mvn -q test -Dtest=Module03_QueryTypesIT#fuzzy_query

The tests double as executable documentation: each @DisplayName describes the Lucene behaviour the assertion locks in, so reading the test list is another way to learn what each module covers.

What each module covers

| # | Module | What you'll learn |
|---|--------|-------------------|
| 1 | Module01_HelloLucene.java | Directory, Analyzer, IndexWriter, IndexSearcher, TermQuery: the minimum viable pipeline. |
| 2 | Module02_FieldsAndAnalyzers.java | StringField vs TextField vs StoredField vs Point vs DocValues; how analyzers produce different tokens. |
| 3 | Module03_QueryTypes.java | TermQuery, PhraseQuery, BooleanQuery (MUST / SHOULD / MUST_NOT / FILTER), WildcardQuery, PrefixQuery, FuzzyQuery, RegexpQuery, numeric range queries. |
| 4 | Module04_QueryParser.java | Lucene's classic query-string syntax, including MultiFieldQueryParser with per-field boosts. |
| 5 | Module05_Highlighting.java | Generating snippet fragments with matched terms wrapped in HTML tags. |
| 6 | Module06_Faceting.java | Sidebar-style facet counts using FacetField + Taxonomy index. |
| 7 | Module07_SortingAndScoring.java | Sort by doc-values fields; FunctionScoreQuery to blend BM25 with a numeric signal. |
| 8 | Module08_UpdatesAndDeletes.java | updateDocument by primary key, deleteDocuments by Term and Query, deleteAll. |
| 9 | Module09_CustomAnalyzer.java | Building an Analyzer pipeline with stop-words, synonyms, stemming, edge n-grams, ASCII folding. |
| 10 | Module10_Suggester.java | AnalyzingInfixSuggester for fast autocomplete. |

Architecture

End-to-end pipeline

The big picture: the write path at the top turns a domain object into immutable index segments, and the read path at the bottom runs queries against a snapshot of those segments.

   ┌──────────────────────────────────────────────────────────────────────────┐
   │                          WRITE PATH (indexing)                           │
   └──────────────────────────────────────────────────────────────────────────┘

   ┌────────────┐    field      ┌────────────┐   analyze    ┌────────────┐
   │  Domain    │ ───mapping──▶ │  Document  │ ───tokens──▶ │  Analyzer  │
   │  object    │               │  + Fields  │              │   chain    │
   │ (Book,     │               │            │              │            │
   │  Product…) │               └─────┬──────┘              └─────┬──────┘
   └────────────┘                     │                           │
                                      ▼                           ▼
                              ┌──────────────────┐       ┌──────────────────┐
                              │   IndexWriter    │◀──────│   Token stream   │
                              │ (transactional,  │       │  + attributes    │
                              │  one per index)  │       └──────────────────┘
                              └────────┬─────────┘
                                       │ flush / commit
                                       ▼
                              ┌──────────────────┐
                              │     Directory    │   (FSDirectory, MMapDirectory,
                              │  ┌────────────┐  │    ByteBuffersDirectory, …)
                              │  │ segment_1  │  │
                              │  │ segment_2  │  │   ── segments are immutable
                              │  │ segment_3  │  │      and merged in the background
                              │  └────────────┘  │
                              └────────┬─────────┘
                                       │
   ┌───────────────────────────────────┼──────────────────────────────────────┐
   │                                   ▼                READ PATH (searching) │
   └───────────────────────────────────┬──────────────────────────────────────┘
                                       │
                              ┌────────▼─────────┐
                              │  IndexReader     │   point-in-time snapshot
                              │ (DirectoryReader │   of all segments
                              │  .open(dir))     │
                              └────────┬─────────┘
                                       │
                              ┌────────▼─────────┐       ┌────────────────────┐
                              │  IndexSearcher   │◀──────│      Query         │
                              │ (BM25Similarity, │       │ (TermQuery,        │
                              │  collectors,     │       │  BooleanQuery,     │
                              │  rewrites)       │       │  PhraseQuery, …)   │
                              └────────┬─────────┘       └────────────────────┘
                                       │
                                       ▼
                              ┌──────────────────┐
                              │     TopDocs      │   ranked ScoreDoc[]
                              │  (scores + ids)  │   + optional facets,
                              │                  │     highlights, sorts
                              └──────────────────┘

Indexing pipeline (write path)

Inside IndexWriter.addDocument(...), each analyzed Field flows through the analyzer chain, while non-analyzed fields such as StringField are indexed as a single exact term. The resulting data is recorded in postings, doc-values, points and stored fields, depending on which FieldType flags were set.

   Document
   ├── StringField "id"       ─────▶ exact-term postings        (no analysis)
   ├── TextField "title"      ─────▶ Analyzer ─▶ tokens ─▶ postings
   ├── TextField "description"─────▶ Analyzer ─▶ tokens ─▶ postings
   ├── IntPoint  "year"       ─────▶ BKD tree                    (range queries)
   ├── DoubleDocValuesField   ─────▶ columnar doc-values         (sort / facet / function)
   ├── SortedDocValuesField   ─────▶ columnar doc-values         (sort / facet)
   ├── FacetField "Category"  ─────▶ Taxonomy index              (facet counts)
   └── StoredField "raw"      ─────▶ stored-fields blob          (retrieval only)

                                  Analyzer chain
                                  ───────────────
   raw text ──▶ Tokenizer ──▶ TokenFilter ──▶ TokenFilter ──▶ … ──▶ indexed tokens
                  ▲              ▲              ▲
                  │              │              │
                  │       LowerCaseFilter  StopFilter   SynonymGraphFilter   PorterStemFilter
                  │
              e.g. StandardTokenizer (Unicode word breaks)
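
To make the analyzer chain above concrete, here is a minimal plain-JDK sketch of the same shape: a tokenizer followed by two filters. It is a toy model and does not use Lucene's Analyzer, Tokenizer or TokenFilter classes; the class name ToyAnalyzerChain, its method names and the stop-word set are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy analyzer chain (hypothetical names, plain JDK, not Lucene's API). */
final class ToyAnalyzerChain {

    /** Tokenizer stage: split on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {   // a leading delimiter yields one empty piece
                tokens.add(piece);
            }
        }
        return tokens;
    }

    /** Filter stage: lower-case every token. */
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    /** Filter stage: drop tokens that appear in the stop-word set. */
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                out.add(token);
            }
        }
        return out;
    }

    /** The whole chain: tokenizer, then filters in a fixed order. */
    static List<String> analyze(String text, Set<String> stopWords) {
        return removeStopWords(lowerCase(tokenize(text)), stopWords);
    }

    public static void main(String[] args) {
        // prints [java, concurrency, practice]
        System.out.println(analyze("Java Concurrency in Practice", Set.of("in", "the")));
    }
}
```

Filter order matters even in this toy: the stop-word filter only sees lower-cased tokens, so "In" and "in" are treated alike. Running query text through the same chain is what lets a search for "Java" match the indexed token java.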

Searching pipeline (read path)

   user input ──▶ QueryParser ──▶ Query tree ──▶ rewrite ──▶ Weight ──▶ Scorer
                                                                          │
                                                  per-segment iteration ──┘
                                                                          │
                                          BM25 score + Similarity         │
                                                                          ▼
                                                                  TopDocsCollector
                                                                          │
                                                                          ▼
                                            ┌─────────────────────────────┴───┐
                                            │ TopDocs (ScoreDoc[] + totalHits)│
                                            └─────────────────────────────────┘
                                                          │
                            ┌─────────────────────────────┼─────────────────────────────┐
                            ▼                             ▼                             ▼
                  StoredFields (retrieval)        Highlighter (snippets)        Facets (counts)
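
The scoring and collection steps in the diagram above can be sketched in plain Java. The formula below is the BM25 variant with k1 = 1.2 and b = 0.75, which are Lucene's documented defaults; everything else (the class ToyBm25, the Hit record, the method names) is hypothetical, and the sketch ignores details such as Lucene's compressed length norms and tie-breaking by document id.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Toy BM25 scorer plus top-k collection (hypothetical names, not Lucene's API). */
final class ToyBm25 {
    static final double K1 = 1.2;   // term-frequency saturation
    static final double B = 0.75;   // strength of length normalisation

    record Hit(int docId, double score) {}

    /** Rare terms (low docFreq) get a higher weight. */
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** Score of one term in one document. */
    static double score(int freq, int docLength, double avgDocLength,
                        long docCount, long docFreq) {
        double lengthNorm = 1.0 - B + B * docLength / avgDocLength;
        return idf(docCount, docFreq) * freq / (freq + K1 * lengthNorm);
    }

    /** Keep only the k best hits using a min-heap, then return them best-first. */
    static List<Hit> topK(List<Hit> hits, int k) {
        PriorityQueue<Hit> heap = new PriorityQueue<>(Comparator.comparingDouble(Hit::score));
        for (Hit hit : hits) {
            heap.add(hit);
            if (heap.size() > k) {
                heap.poll();   // evict the current weakest hit
            }
        }
        List<Hit> best = new ArrayList<>(heap);
        best.sort(Comparator.comparingDouble(Hit::score).reversed());
        return best;
    }

    public static void main(String[] args) {
        // Three titles of 3, 2 and 4 tokens; "java" occurs once in docs 2 and 3.
        double avgLength = (3 + 2 + 4) / 3.0;
        List<Hit> hits = List.of(
                new Hit(2, score(1, 2, avgLength, 3, 2)),
                new Hit(3, score(1, 4, avgLength, 3, 2)));
        // The shorter document ranks first: prints 2
        System.out.println(topK(hits, 1).get(0).docId());
    }
}
```

The min-heap keeps memory proportional to k rather than to the number of matching documents, which is the same idea behind collecting only the top hits instead of sorting every match.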

Inverted index — the core data structure

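This is the idea every Lucene feature is built on: instead of storing "doc → words", Lucene flips it to "word → docs". Looking up a term is then a term-dictionary lookup followed by a walk over that term's postings list. Lucene's default term dictionary is a sorted block tree with an FST index rather than a hash table, so the lookup cost depends on the length of the term, not on the number of documents.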

                           Forward (what we wrote)
                           ───────────────────────
   docId=1   "Lucene in Action"
   docId=2   "Effective Java"
   docId=3   "Java Concurrency in Practice"

                           Inverted (what Lucene stores)
                           ─────────────────────────────
   term            postings list (docId → freq, positions, offsets)
   ─────────       ───────────────────────────────────────────────
   action     ──▶  [ (1, freq=1, pos=[2]) ]
   concurrency──▶  [ (3, freq=1, pos=[1]) ]
   effective  ──▶  [ (2, freq=1, pos=[0]) ]
   in         ──▶  [ (1, freq=1, pos=[1]), (3, freq=1, pos=[2]) ]
   java       ──▶  [ (2, freq=1, pos=[1]), (3, freq=1, pos=[0]) ]
   lucene     ──▶  [ (1, freq=1, pos=[0]) ]
   practice   ──▶  [ (3, freq=1, pos=[3]) ]

   ▲ stored in segment files: .tim/.tip (term dictionary), .doc/.pos (postings)

A TermQuery on the term java walks the postings list for java and returns docs [2, 3]. A PhraseQuery additionally walks the positions to verify that the words appear next to each other and in order. BM25 scoring uses the term frequencies and document-length normalisation from this same index.
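
The structure above is small enough to model directly. The following sketch builds a positional inverted index over the three example titles and answers a term lookup and a two-word phrase lookup. It is a toy under simplifying assumptions (whitespace tokenisation, everything in memory, no segment files); ToyInvertedIndex and its methods are hypothetical names, not Lucene classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy positional inverted index (hypothetical names, not Lucene's API). */
final class ToyInvertedIndex {
    /** term -> (docId -> positions); both maps are kept sorted. */
    private final Map<String, TreeMap<Integer, List<Integer>>> postings = new TreeMap<>();

    /** Index one document: lower-case, split on whitespace, record positions. */
    void add(int docId, String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    /** TermQuery analogue: ids of the documents that contain the term. */
    List<Integer> termQuery(String term) {
        return new ArrayList<>(postings.getOrDefault(term, new TreeMap<>()).keySet());
    }

    /** PhraseQuery analogue for two words: second must sit right after first. */
    List<Integer> phraseQuery(String first, String second) {
        List<Integer> hits = new ArrayList<>();
        TreeMap<Integer, List<Integer>> a = postings.getOrDefault(first, new TreeMap<>());
        TreeMap<Integer, List<Integer>> b = postings.getOrDefault(second, new TreeMap<>());
        for (Map.Entry<Integer, List<Integer>> entry : a.entrySet()) {
            List<Integer> secondPositions = b.get(entry.getKey());
            if (secondPositions == null) {
                continue;   // the document lacks the second word entirely
            }
            for (int pos : entry.getValue()) {
                if (secondPositions.contains(pos + 1)) {
                    hits.add(entry.getKey());
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(1, "Lucene in Action");
        index.add(2, "Effective Java");
        index.add(3, "Java Concurrency in Practice");
        System.out.println(index.termQuery("java"));          // prints [2, 3]
        System.out.println(index.phraseQuery("in", "action")); // prints [1]
    }
}
```

Note that a phrase lookup for "java in" finds nothing: doc 3 contains both words, but at positions 0 and 2, so the position check rejects it.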

Field-type decision matrix

A quick reference for which field type to pick for which purpose:

| Need | Use this field |
|------|----------------|
| Exact-match on an ID/code | StringField |
| Full-text search | TextField (with the right Analyzer) |
| Just return it with the hit | StoredField |
| Numeric range query | IntPoint / LongPoint / DoublePoint |
| Sort or facet | SortedDocValuesField, NumericDocValuesField, DoubleDocValuesField |
| Facet counts (taxonomy) | FacetField (+ FacetsConfig.build(...)) |
| Autocomplete | feed source into AnalyzingInfixSuggester |

One logical field often becomes 2–3 Lucene fields. For example, year is usually IntPoint (range query) + NumericDocValuesField (sort) + StoredField (retrieval).
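
Why one logical field fans out into several physical ones is easier to see with a toy model. In the sketch below, a single year value is written to three separate structures, each serving one access pattern: a sorted map stands in for the points index, a docId-to-value map for doc-values, and a second map for stored fields. The class ToyYearField, its methods and the sample years are invented for illustration and do not reflect how Lucene lays these structures out on disk.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of one logical field stored three ways (hypothetical names). */
final class ToyYearField {
    // value -> docIds, sorted by value: stands in for the points (BKD) index
    private final TreeMap<Integer, List<Integer>> points = new TreeMap<>();
    // docId -> value: stands in for columnar doc-values
    private final Map<Integer, Integer> docValues = new HashMap<>();
    // docId -> original text: stands in for stored fields
    private final Map<Integer, String> stored = new HashMap<>();

    /** Writing one logical value touches all three structures. */
    void add(int docId, int year) {
        points.computeIfAbsent(year, y -> new ArrayList<>()).add(docId);
        docValues.put(docId, year);
        stored.put(docId, Integer.toString(year));
    }

    /** Range query: only the value-sorted structure is consulted. */
    List<Integer> rangeQuery(int fromInclusive, int toInclusive) {
        List<Integer> hits = new ArrayList<>();
        for (List<Integer> ids : points.subMap(fromInclusive, true, toInclusive, true).values()) {
            hits.addAll(ids);
        }
        return hits;
    }

    /** Sorting: only the per-document column is consulted. */
    List<Integer> sortByYearDescending(List<Integer> docIds) {
        List<Integer> sorted = new ArrayList<>(docIds);
        sorted.sort(Comparator.comparing((Integer id) -> docValues.get(id)).reversed());
        return sorted;
    }

    /** Retrieval: only the stored value is consulted. */
    String retrieve(int docId) {
        return stored.get(docId);
    }

    public static void main(String[] args) {
        ToyYearField year = new ToyYearField();
        year.add(1, 2010);   // sample values, made up for this sketch
        year.add(2, 2018);
        year.add(3, 2006);
        System.out.println(year.rangeQuery(2008, 2020));                  // prints [1, 2]
        System.out.println(year.sortByYearDescending(List.of(1, 2, 3)));  // prints [2, 1, 3]
    }
}
```

Dropping any one of the three structures removes exactly one capability, which mirrors the trade-off in the table above: each extra Lucene field costs index space but enables one more kind of access.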

Project module map

How the 10 modules fit on the architecture diagram:

                ┌─────────────────────────────────────────────────────────────┐
                │                       INDEX BUILDING                        │
                │                                                             │
                │   Module 1  Hello Lucene        Module 8  Update/Delete    │
                │   Module 2  Field types         Module 9  Custom Analyzer  │
                │   Module 6  Facet indexing                                  │
                └────────────────────────┬────────────────────────────────────┘
                                         │
                                         ▼
                ┌─────────────────────────────────────────────────────────────┐
                │                          QUERYING                           │
                │                                                             │
                │   Module 3  Query types         Module 4  QueryParser      │
                │   Module 7  Sort / function     Module 10 Suggester        │
                └────────────────────────┬────────────────────────────────────┘
                                         │
                                         ▼
                ┌─────────────────────────────────────────────────────────────┐
                │                    POST-PROCESSING                          │
                │                                                             │
                │   Module 5  Highlighting        Module 6  Facet counts     │
                └─────────────────────────────────────────────────────────────┘

Three rules to remember

  1. Field type decides what queries are possible. A field that is not indexed cannot be searched. A field that is not stored cannot be retrieved. Sorting and faceting need a doc-values flavour of the field.
  2. Use the same Analyzer for indexing and searching. Otherwise your query terms may not match the tokens you wrote to the index. Module 2 makes this obvious by showing the token output of four analyzers side-by-side.
  3. Documents are immutable. "Update" means delete + add, keyed off a unique field. See Module 8.
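
Rule 3 can be illustrated with a small model of an append-only segment. In this sketch an update never rewrites an existing entry: it marks the old entry as deleted and appends a new one, which is the delete + add behaviour described above. The names ToySegment, Doc, update and slotsUsed are hypothetical, and the model leaves out merging, which is where Lucene actually reclaims the space held by deleted documents.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy append-only segment with deletion markers (hypothetical names). */
final class ToySegment {
    record Doc(String id, String title) {}

    private final List<Doc> docs = new ArrayList<>();   // entries are only ever appended
    private final BitSet deleted = new BitSet();        // one "deleted" flag per entry

    void add(Doc doc) {
        docs.add(doc);
    }

    /** Mark every entry with this id as deleted; nothing is rewritten. */
    void deleteById(String id) {
        for (int i = 0; i < docs.size(); i++) {
            if (docs.get(i).id().equals(id)) {
                deleted.set(i);
            }
        }
    }

    /** "Update" is delete + add, keyed off the unique id. */
    void update(Doc doc) {
        deleteById(doc.id());
        add(doc);
    }

    /** What a search would see: entries that are not marked deleted. */
    List<Doc> liveDocs() {
        List<Doc> live = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i)) {
                live.add(docs.get(i));
            }
        }
        return live;
    }

    /** Entries physically present, including deleted ones. */
    int slotsUsed() {
        return docs.size();
    }

    public static void main(String[] args) {
        ToySegment segment = new ToySegment();
        segment.add(new Doc("b1", "Lucene in Action"));
        segment.update(new Doc("b1", "Lucene in Action, 2nd edition"));
        System.out.println(segment.liveDocs().size());   // prints 1
        System.out.println(segment.slotsUsed());         // prints 2
    }
}
```

The gap between liveDocs().size() and slotsUsed() is the cost of immutability in this model: heavy update traffic leaves deleted entries behind until something compacts them.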

Where to go next

  • The official Lucene 10.4.0 demo shows indexing of real files from disk.
  • Lucene's MIGRATE.md is the best place to see what changes between major versions (e.g. 9.x → 10.x removed the static FacetsCollector.search(...) helper in favour of FacetsCollectorManager, used in Module 6 of this project).
  • Real-world systems built on Lucene worth studying: Elasticsearch, OpenSearch, Solr — they reuse the APIs you've practised in this project.
