A simple Clojure wrapper for Apache Lucene, currently targeting Lucene 10.4.0.
Key usage scenarios
- Search: The core use-case of Lucene.
- Suggest: Prefix-queries for content in any field.
- Query flexibility: Search supports fielded maps, OR via sets, AND via vectors/sequences, plain strings, and classic Lucene query syntax.
- Analyzer composition: Convenience helpers exist for standard, keyword, simple, and per-field analyzers.
Both in-memory and on-disk indexes can be used, depending on the dataset size.
Disk indexes can also be rebuilt in place with :re-create? true.
Note: UNSTABLE API. No releases yet.
Inspired by other example wrappers I’ve come across. Notably
[org.msync/lucene-clj "0.3.0-SNAPSHOT"]
Available via Clojars.
Current development baseline:
- Clojure 1.12.4
- Apache Lucene 10.4.0
- Java 21 or newer
The primary use-case is in-process text search over read-only data-sets that can be managed on single-instance deployments. For multi-instance deployments, keeping data modifications in sync takes extra effort.
Use this library when you need light-weight text-search support without the hassle of setting up something like Solr. You may update the index if you wish, but you must handle any race conditions yourself. Because the index is in-process, you also need to update every instance in a multi-instance deployment.
The objectives are loosely as follows.
- Stick to core Lucene. No script- or language-specific dependencies are part of the core library, but users can add them as needed.
- Support prefix-based suggestions, a feature of Lucene I found poorly documented and lacking good examples.
- Track the latest Lucene versions.
I am thankful to the above library authors for their liberal licensing. I’ve used their ideas/code in places.
lucene-clj is an opinionated embedded retrieval library for Clojure applications.
It is not trying to wrap all of Lucene. Instead, it focuses on a narrow slice of Lucene that is especially valuable for application search and agentic workflows:
- map-first document indexing
- analyzer composition
- ergonomic lexical retrieval
- suggestions and completion
- explainable ranking
- future vector and hybrid retrieval
The design principle is to keep the human-facing API compact and idiomatic. Where possible, Clojure data shapes carry query intent instead of multiplying API entry points.
For retrieval strategy, the intended order of importance is:
- BM25-style lexical retrieval as the primary baseline
- optional classic TF-IDF-style scoring when comparison or compatibility is useful
- vector retrieval as a complement, not a replacement, for lexical retrieval
- hybrid retrieval when both lexical precision and semantic recall matter
At least for now, lucene-clj is not trying to be:
- a complete wrapper over all Lucene modules
- a distributed search system
- a replacement for Solr, Elasticsearch, or OpenSearch
- a schema-heavy document database
- a host for every Lucene storage, codec, faceting, spatial, or analytics feature
- a direct exposure of low-level Lucene internals when a smaller Clojure abstraction is sufficient
This focus is intentional. Lucene is vast; lucene-clj aims to be small, opinionated, and excellent at embedded retrieval rather than broad and thin.
lucene-clj includes a small benchmark harness for the indexing hot path.
Run it with:
lein bench benchmarks/manual.edn manual $(git rev-parse --short HEAD)
This writes EDN captures under benchmarks/ so performance changes can be reviewed and committed.
Current measured improvements from the first indexing hot-path refactor:
- map->document for one document: 33.2 us -> 6.15 us
- compiled one-document encode: 7.53 us -> 4.10 us
- compiled batch encode: 7.04 ms -> 5.18 ms
- batch indexing: 31.6 ms -> 25.8 ms
The benchmark harness separates encoder cost from end-to-end indexing cost so hot-path changes can be evaluated more precisely.
Recent benchmark captures use the same directory strategy as lucene-clj’s :memory indexes.
Older captures that used a different Lucene directory implementation are useful historically, but not directly comparable to the current ones.
The later clean break to the canonical :fields schema stayed in the same performance range:
- compiled batch encode: 5.18 ms -> 4.94 ms
- batch indexing: 25.8 ms -> 27.1 ms
That trade-off is intentional: schema clarity moved up substantially while indexing throughput stayed close to the optimized baseline.
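To put the quoted figures side by side, here is a small sketch that turns each before/after pair into a speedup factor and a percent change. The numbers are copied from the lists above; the names `timings`, `speedup`, and `percent-change` are made up for this illustration and are not part of the benchmark harness.

```clojure
;; Before/after timings copied from the figures quoted in this README.
(def timings
  [{:label "map->document, one document"           :unit "us" :before 33.2 :after 6.15}
   {:label "compiled one-document encode"          :unit "us" :before 7.53 :after 4.10}
   {:label "compiled batch encode"                 :unit "ms" :before 7.04 :after 5.18}
   {:label "batch indexing"                        :unit "ms" :before 31.6 :after 25.8}
   {:label "compiled batch encode (:fields schema)" :unit "ms" :before 5.18 :after 4.94}
   {:label "batch indexing (:fields schema)"       :unit "ms" :before 25.8 :after 27.1}])

(defn speedup
  "How many times faster the new timing is (values below 1 mean slower)."
  [{:keys [before after]}]
  (/ before after))

(defn percent-change
  "Signed change in elapsed time; negative means less time was spent."
  [{:keys [before after]}]
  (* 100.0 (/ (- after before) before)))

;; Print one summary line per measurement.
(doseq [{:keys [label] :as t} timings]
  (println (format "%-40s %5.2fx %+6.1f%%" label (speedup t) (percent-change t))))
```

A positive percent change on the last row is the small cost paid for the schema change; the other rows show time saved.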
There’s sample data in the repository that we use in our examples. A hand-created sample with fictional and non-fictional characters is here and one from Kaggle on music albums is here. These are also used in the tests.
A complete scenario from index creation to search actions is described below.
- Albums - Kaggle - [local]
- Hand-created, real + fictional characters here
When dealing with Lucene and the data it processes, the key terms to note are:
- Document: A unit of related text. It may have many fields, and it is the unit of consumption and also of each search result. A Document is a collection of Fields.
- Field: Every field is a container of indexable content. Field types range from simple text to latitude and longitude.
- Analyzer: Analyzes the input documents and preprocesses terms appropriately. Decisions on tokenizing, stemming, stop-word removal, or treating input as-is are controlled by the choice of analyzer.
This is a pretty hand-wavy description, but useful enough for our purpose.
Lucene consumes documents, each of which is made up of fields having values. As is natural in Clojure, we represent all such things as maps.
{
:title-field "This is a title"
:abstract-field "This is an abstract of what is to follow"
:author-field "Lekhak Sampaadak"
:body-field "And here's the crux of the article with all the gory details"
}
To prepare our content for ingestion and indexing, we do some straightforward CSV parsing and conversion of each row into a map. Each column has a name and is used as the key for the field name in the document-map. All the preparation code is in the msync.lucene.tests-common test namespace, which we’ll refer to as the common namespace where required for clarity. We use two CSV data-sets as our sources of documents to create two indexes, to demonstrate some distinct use-cases. All data files are in the test-resources subdirectory.
We use two simple datasets, stored as CSV. Loading is straightforward CSV parsing and converting to maps – the first rows in each file are the header rows, holding names of respective columns.
- Sample, hand-coded documents. Plain, simple data.
;; In the common namespace
(take 5 (read-csv-resource-file sample-data-file))
| first-name | last-name | age | real | gender | bio |
| Suppandi | Varadarajan | 16 | false | m | A wonderful, innocent soul. You’ll enjoy his antics. |
| Shikari | Shambhu | 32 | False | m | Carries a gun. But no bullets. Animals love him. |
| Chacha | Chaudhary | 64 | FalSe | m | The supercomputer. And then some more! |
| Sabu | Jupiterwala | 2 | false | m | Yes, of legal age. Just a different age-scale because of the planet he comes from. Strong, powerful, but kind. Because, not an earthling. Children love him. |
- Albums data. From Kaggle.
  - The columns Genre and Subgenre are comma-separated values themselves.
    - They are to be pre-processed before feeding to lucene-clj.
    - These are multi-valued fields.
;; In the common namespace
(take 5 (read-csv-resource-file albums-file))
| Number | Year | Album | Artist | Genre | Subgenre |
| 1 | 1967 | Sgt. Pepper’s Lonely Hearts Club Band | The Beatles | Rock | Rock & Roll, Psychedelic Rock |
| 2 | 1966 | Pet Sounds | The Beach Boys | Rock | Pop Rock, Psychedelic Rock |
| 3 | 1966 | Revolver | The Beatles | Rock | Psychedelic Rock, Pop Rock |
| 4 | 1965 | Highway 61 Revisited | Bob Dylan | Rock | Folk Rock, Blues Rock |
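The preparation described above (header row becomes the keys, comma-separated cells become multi-valued fields) can be sketched in plain Clojure as follows. This is only an illustration under stated assumptions: `rows->maps`, `split-multi`, and `prepare-album` are hypothetical names, the real code lives in the common namespace, and the input is assumed to be rows already parsed by a CSV reader into vectors of strings.

```clojure
(require '[clojure.string :as str])

(defn rows->maps
  "Turns [header-row & data-rows] into a seq of maps keyed by column name."
  [[header & rows]]
  (let [ks (map keyword header)]
    (map #(zipmap ks %) rows)))

(defn split-multi
  "Splits a comma-separated cell into trimmed, non-empty values."
  [cell]
  (->> (str/split cell #",")
       (map str/trim)
       (remove str/blank?)))

(defn prepare-album
  "Makes :Genre and :Subgenre multi-valued (hypothetical helper)."
  [album]
  (-> album
      (update :Genre split-multi)
      (update :Subgenre split-multi)))

;; Rows as a CSV reader might hand them over: header first, then data.
(def parsed-rows
  [["Number" "Year" "Album" "Artist" "Genre" "Subgenre"]
   ["2" "1966" "Pet Sounds" "The Beach Boys" "Rock" "Pop Rock, Psychedelic Rock"]])

(def prepared (map prepare-album (rows->maps parsed-rows)))
```

After this step each album is a map whose :Genre and :Subgenre values are sequences of strings, which is the shape the sample of album data further down shows.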
Analyzers process each field’s content in whatever manner the programmer or domain expert decides is appropriate.
Some fields need to be tokenized and stemmed, while others are to be treated verbatim: natural-language text on one hand, proper nouns such as a company name or music genre on the other.
In the albums dataset, the Year, Genre and Subgenre fields’ texts are not to be tokenized, stemmed, or filtered for stop-words. Hence, they are configured to be analyzed with the keyword analyzer. Other fields can be treated like normal text. So, in this case, we use a composed analyzer that treats each field in its own way.
Note that the analyzers used while creating an index should also be used when querying it for search and suggest; otherwise query terms may not match what was indexed.
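To make the verbatim-versus-tokenized distinction concrete, here is a toy model in plain Clojure. It does not use Lucene at all: `toy-keyword-analyze`, `toy-text-analyze`, and `toy-per-field-analyzer` are invented for this sketch, and the text analyzer only lower-cases and splits on non-alphanumeric characters, which is far simpler than what real Lucene analyzers do.

```clojure
(require '[clojure.string :as str])

(defn toy-keyword-analyze
  "Treats the whole value as a single, unchanged token."
  [s]
  [s])

(defn toy-text-analyze
  "Lower-cases the value and splits it on anything that is not a letter or digit."
  [s]
  (->> (str/split (str/lower-case s) #"[^\p{L}\p{N}]+")
       (remove str/blank?)
       vec))

(defn toy-per-field-analyzer
  "Returns a function of [field value] that uses a field-specific analyzer
  when one is given in overrides, and default-fn otherwise."
  [default-fn overrides]
  (fn [field value]
    ((get overrides field default-fn) value)))

(def analyze
  (toy-per-field-analyzer toy-text-analyze
                          {:Year     toy-keyword-analyze
                           :Genre    toy-keyword-analyze
                           :Subgenre toy-keyword-analyze}))

(analyze :Subgenre "Psychedelic Rock") ;; one verbatim token
(analyze :Album "Pet Sounds")          ;; two lower-cased tokens
```

The same reasoning shows why index-time and query-time analysis should agree: a query for "rock" analyzed as text would never equal the verbatim token "Psychedelic Rock".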
Here’s how we create analyzers.
;; In the common namespace
;; This is the default analyzer, an instance of the StandardAnalyzer
;; of Lucene
(defonce default-analyzer (analyzers/standard-analyzer))
;; This analyzer considers field values verbatim
;; Will not tokenize and stem
(defonce keyword-analyzer (analyzers/keyword-analyzer))
;; A per-field analyzer, which composes other kinds of analyzers
;; For album data, we have marked some fields as verbatim
;; Takes a default analyzer, and then a map of field to field-specific analyzer
(defonce album-data-analyzer
  (analyzers/per-field-analyzer default-analyzer
                                {:Year     keyword-analyzer
                                 :Genre    keyword-analyzer
                                 :Subgenre keyword-analyzer}))
With the background setup done and explained, let us move ahead to demonstrating indexing and searching. You may want to try the following in a REPL by requiring the namespace the prior code is in and then playing along. I’ve used the dev namespace below, the code for which can be found here.
(ns dev
  (:require [msync.lucene :as lucene]
            [msync.lucene
             [document :as ld]
             [tests-common :as common]]))
In memory
(defonce album-index (lucene/create-index! :type :memory
                                           :analyzer common/album-data-analyzer))
Or, on disk
(defonce album-index (lucene/create-index! :type :disk
                                           :path "/path/to/index/directory"
                                           :analyzer common/album-data-analyzer))
If you want to rebuild an existing on-disk index from scratch, pass :re-create? true.
(defonce album-index (lucene/create-index! :type :disk
                                           :path "/path/to/index/directory"
                                           :analyzer common/album-data-analyzer
                                           :re-create? true))
A sample of the album data for reference.
The Genre and Subgenre columns are pre-processed, as mentioned above, and split further.
(drop 2 (take 5 common/album-data))
({:Number "3",
:Year "1966",
:Album "Revolver",
:Artist "The Beatles",
:Genre ("Rock"),
:Subgenre ("Psychedelic Rock" "Pop Rock")}
{:Number "4",
:Year "1965",
:Album "Highway 61 Revisited",
:Artist "Bob Dylan",
:Genre ("Rock"),
:Subgenre ("Folk Rock" "Blues Rock")}
{:Number "5",
:Year "1965",
:Album "Rubber Soul",
:Artist "The Beatles",
:Genre ("Rock" "Pop"),
:Subgenre ("Pop Rock")})
Documents are Clojure maps. Each key-value pair in the map represents one logical field. The options map passed to index! carries a canonical :fields schema, where each field is defined in one place.
The core field options are:
- :type - currently :text, :keyword, :long, :boolean, :double, or :instant
- :stored? - whether the field value is stored and can be returned later
- :indexed? - whether the field participates in search
- :multi-valued? - whether the field accepts multiple values
- :suggest - optional completion settings
  - :weight adjusts ranking of suggestions
  - :contexts-from uses one field, several fields, or a function to derive suggestion contexts
Use :keyword for exact string matching, :long and :double for exact numeric matching, :boolean for true-or-false fields, and :instant for exact time matching.
Typed fields work naturally with map queries:
{:fields {:rating       {:type :double}
          :published-at {:type :instant}
          :active       {:type :boolean}}}
(lucene/search idx {:rating 4.5})
(lucene/search idx {:published-at (java.time.Instant/parse "1977-02-04T00:00:00Z")})
(lucene/search idx {:active true})
In the following schema:
- :Year, :Genre, and :Subgenre are exact-match keyword fields
- :Genre and :Subgenre are marked multi-valued
- :Album and :Artist are suggest-enabled
- :Album suggestions are given more weight than :Artist
- suggestion contexts come from the :Genre field
(def album-fields
  {:Number   {:type :text
              :stored? true
              :indexed? true}
   :Year     {:type :keyword
              :stored? true
              :indexed? true}
   :Album    {:type :text
              :stored? true
              :indexed? true
              :suggest {:weight 5
                        :contexts-from :Genre}}
   :Artist   {:type :text
              :stored? true
              :indexed? true
              :suggest {:contexts-from :Genre}}
   :Genre    {:type :keyword
              :stored? true
              :indexed? true
              :multi-valued? true}
   :Subgenre {:type :keyword
              :stored? true
              :indexed? true
              :multi-valued? true}})
(lucene/index! album-index common/album-data
               {:fields common/album-fields})
A simple search example, in which we pass a map specifying the field, and the value we are looking for. The result includes the :hit, a :score for that :hit, and the :doc-id which is an identifier that Lucene manages. Notice that the result - :hit - is a Lucene Document object.
(lucene/search album-index {:Year "1977"}
               {:results-per-page 2})
[{:doc-id 25,
:score 1.4994705,
:hit
#object[org.apache.lucene.document.Document 0x24750f97 "Document<stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Number:26> stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Year:1977> stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Album:Rumours> stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Artist:Fleetwood Mac> stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Genre:Rock> stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Subgenre:Pop Rock>>"]}
{:doc-id 40,
:score 1.4994705,
:hit
#object[org.apache.lucene.document.Document 0x6d6a6fe4 "Document<stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Number:41> stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Year:1977> stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Album:Never Mind the Bollocks Here's the Sex Pistols> stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Artist:Sex Pistols> stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Genre:Rock> stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<Subgenre:Punk>>"]}]
For convenience, lucene-clj has a function that converts the Lucene Document into a Clojure map. It supports both :multi-fields and :fields-to-keep to shape the result. When you use typed stored fields such as :boolean or :instant, pass :field-specs so document->map can decode them back to typed values. Beyond basic use-cases, supply your own conversion function.
(lucene/search album-index {:Year "1977"}
               {:results-per-page 2
                :hit->doc ld/document->map})
[{:doc-id 25,
:score 1.4994705,
:hit
{:Number "26",
:Year "1977",
:Album "Rumours",
:Artist "Fleetwood Mac",
:Genre "Rock",
:Subgenre "Pop Rock"}}
{:doc-id 40,
:score 1.4994705,
:hit
{:Number "41",
:Year "1977",
:Album "Never Mind the Bollocks Here's the Sex Pistols",
:Artist "Sex Pistols",
:Genre "Rock",
:Subgenre "Punk"}}]
Notice, though, that the :Genre and :Subgenre fields did not come back as collections. The document->map function isn’t smart enough to identify that, and needs a hint to make that happen. With the modified :hit->doc argument, the two fields come back as vectors with possibly multiple values.
(lucene/search album-index
               {:Year "1977"}
               {:results-per-page 2
                :hit->doc #(ld/document->map % :multi-fields [:Genre :Subgenre])})
[{:doc-id 25,
:score 1.4994705,
:hit
{:Number "26",
:Year "1977",
:Album "Rumours",
:Artist "Fleetwood Mac",
:Genre ["Rock"],
:Subgenre ["Pop Rock"]}}
{:doc-id 40,
:score 1.4994705,
:hit
{:Number "41",
:Year "1977",
:Album "Never Mind the Bollocks Here's the Sex Pistols",
:Artist "Sex Pistols",
:Genre ["Rock"],
:Subgenre ["Punk"]}}]
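Why a hint is needed can be seen in a small model. A stored document is essentially a flat list of field/value entries, and a repeated field simply appears more than once, so a converter cannot tell a multi-valued field that happens to hold one value from a single-valued one. The sketch below is not the library's document->map; `pairs->map` is a hypothetical stand-in, and its choice to keep the last value for non-multi fields is an assumption made for the illustration.

```clojure
(defn pairs->map
  "Collapses [field value] pairs into a map. Fields named in :multi-fields
  always come back as vectors; any other field keeps a single value
  (the last one seen, which is an assumption of this sketch)."
  [pairs & {:keys [multi-fields]}]
  (let [multi? (set multi-fields)]
    (reduce (fn [m [k v]]
              (if (multi? k)
                (update m k (fnil conj []) v)
                (assoc m k v)))
            {}
            pairs)))

;; A flat view of one stored document, with :Genre repeated.
(def stored-pairs
  [[:Album "Parallel Lines"]
   [:Genre "Electronic"]
   [:Genre "Rock"]])

(pairs->map stored-pairs)                         ;; :Genre is a single string
(pairs->map stored-pairs :multi-fields [:Genre])  ;; :Genre is a vector
```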
Paginated query results are supported via the :page option. Also, the following example projects a subset of the document fields by passing a modified function as the :hit->doc argument.
(lucene/search album-index
               {:Year "1968"} ;; Map of field-values to search with
               {:results-per-page 5 ;; Control the number of results returned
                :page 2 ;; Page number, starting 0 as default
                :hit->doc #(-> %
                               ld/document->map
                               (select-keys [:Year :Album]))})
[{:doc-id 160,
:score 1.4311604,
:hit {:Year "1968", :Album "The Dock of the Bay"}}
{:doc-id 170,
:score 1.4311604,
:hit {:Year "1968", :Album "The Notorious Byrd Brothers"}}
{:doc-id 204,
:score 1.4311604,
:hit {:Year "1968", :Album "Wheels of Fire"}}
{:doc-id 233,
:score 1.4311604,
:hit {:Year "1968", :Album "Bookends"}}
{:doc-id 257,
:score 1.4311604,
:hit
{:Year "1968",
:Album "The Kinks Are The Village Green Preservation Society"}}]
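As a mental model for :page and :results-per-page, paging can be thought of as slicing the ranked hit list. The sketch below assumes zero-based page numbers, as the comment in the example above suggests; `page-slice` is a hypothetical helper, and its default page size of 10 is a placeholder rather than a documented default.

```clojure
(defn page-slice
  "Returns the hits that fall on the given zero-based page."
  [hits {:keys [results-per-page page]
         :or   {results-per-page 10 page 0}}]
  (->> hits
       (drop (* page results-per-page))
       (take results-per-page)))

;; 23 pretend hits, identified by their rank 0..22.
(def ranked (range 23))

(page-slice ranked {:results-per-page 5 :page 2}) ;; ranks 10 through 14
(page-slice ranked {:results-per-page 5 :page 4}) ;; the last, partial page
```

Under this model, page 2 with five results per page skips the first ten hits, and a page past the end of the hit list is simply empty.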
The same projection can be expressed with ld/document->map directly by using :fields-to-keep.
(lucene/search album-index
               {:Year "1968"}
               {:results-per-page 5
                :page 2
                :hit->doc #(ld/document->map % :fields-to-keep #{:Year :Album})})
For one-off calls, the :page option is still convenient.
For stable pagination across repeated calls, prefer a reusable search session plus :search-after.
The continuation token can simply be the last result map from the previous page, since it already carries :doc-id and :score.
(with-open [session (lucene/open-session album-index)]
  (let [page-0 (lucene/search session
                              {:Year "1968"}
                              {:results-per-page 5
                               :hit->doc #(ld/document->map % :fields-to-keep #{:Year :Album})})
        page-1 (lucene/search session
                              {:Year "1968"}
                              {:results-per-page 5
                               :search-after (last page-0)
                               :hit->doc #(ld/document->map % :fields-to-keep #{:Year :Album})})]
    [page-0 page-1]))
The session pins a single Lucene reader snapshot.
That means repeated search and suggest calls within the same with-open block see a stable view of the index.
If the index changes and you open a new session, result ordering and pagination boundaries may change accordingly.
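The idea behind a :search-after continuation token can be modelled without Lucene. The sketch below assumes hits are ordered by descending score with ties broken by ascending doc-id, and that the next page is everything that sorts strictly after the token. `ranked-hits`, `after?`, and `next-page` are invented for this illustration and do not describe how lucene-clj implements the option.

```clojure
;; Pretend result list, already ordered by descending score, then ascending doc-id.
(def ranked-hits
  [{:doc-id 3 :score 2.0}
   {:doc-id 7 :score 2.0}
   {:doc-id 1 :score 1.5}
   {:doc-id 4 :score 1.5}
   {:doc-id 9 :score 1.0}])

(defn after?
  "True when hit sorts strictly after the continuation token."
  [token hit]
  (or (< (:score hit) (:score token))
      (and (= (:score hit) (:score token))
           (> (:doc-id hit) (:doc-id token)))))

(defn next-page
  "Returns up to n hits following token; a nil token means the first page."
  [hits token n]
  (->> hits
       (filter #(or (nil? token) (after? token %)))
       (take n)))

(def page-0 (next-page ranked-hits nil 2))
(def page-1 (next-page ranked-hits (last page-0) 2))
```

Because the token names a position in the ordering rather than an offset, this only stays consistent while the underlying list does not change, which is what pinning a reader snapshot in a session provides.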
Searching in a single field, for a single value
(lucene/search album-index {:Year "1967"} {:results-per-page 2 :hit->doc ld/document->map})
When the query form is a plain string, pass :field-name.
- A single word becomes a term query.
- A string containing spaces becomes a phrase query.
(lucene/search album-index "the sun"
               {:field-name :Album
                :hit->doc ld/document->map})
For Lucene’s query parser syntax, parse the DSL explicitly and pass the resulting Query object to lucene/search.
(lucene/search album-index
               (msync.lucene.query/parse-dsl "Album:\"the sun\" AND Year:1976"
                                             common/album-data-analyzer)
               {:hit->doc ld/document->map})
Searching in a single field, where any of the values in the set are allowed
(lucene/search album-index {:Year #{"1960" "1965"}}
               {:results-per-page 5
                :hit->doc #(-> % ld/document->map (select-keys [:Year :Album]))})
[{:doc-id 118,
:score 2.2562923,
:hit {:Year "1960", :Album "At Last!"}}
{:doc-id 347,
:score 2.2562923,
:hit {:Year "1960", :Album "Muddy Waters at Newport 1960"}}
{:doc-id 357,
:score 2.2562923,
:hit {:Year "1960", :Album "Sketches of Spain"}}
{:doc-id 3,
:score 1.6102078,
:hit {:Year "1965", :Album "Highway 61 Revisited"}}
{:doc-id 4,
:score 1.6102078,
:hit {:Year "1965", :Album "Rubber Soul"}}]
When looking for multiple terms in a single field, pass a vector.
(lucene/search album-index {:Album ["complete" "unbelievable"]} {:hit->doc ld/document->map})
[{:doc-id 253,
:score 3.0571077,
:hit
{:Number "254",
:Year "1966",
:Album
"Complete & Unbelievable: The Otis Redding Dictionary of Soul",
:Artist "Otis Redding",
:Genre "Funk / Soul",
:Subgenre "Soul"}}]
Be sure that your queries are semantically right for the data-set. For example, AND-ing over two different years will lead to an empty result-set, obviously.
(lucene/search album-index {:Year ["1964" "1965"]})
[]
Spaces in the query string are inferred to mean a phrase search operation
(lucene/search album-index {:Album "the sun"} {:hit->doc ld/document->map})
[{:doc-id 10,
:score 2.8861985,
:hit
{:Number "11",
:Year "1976",
:Album "The Sun Sessions",
:Artist "Elvis Presley",
:Genre "Rock",
:Subgenre "Rock & Roll"}}
{:doc-id 287,
:score 2.544825,
:hit
{:Number "288",
:Year "1968",
:Album "Anthem of the Sun",
:Artist "Grateful Dead",
:Genre "Rock",
:Subgenre "Psychedelic Rock"}}
{:doc-id 310,
:score 2.544825,
:hit
{:Number "311",
:Year "1994",
:Album "The Sun Records Collection",
:Artist "Various",
:Genre "& Country",
:Subgenre "Rockabilly"}}]
This is an AND operation
(lucene/search album-index {:Album "the sun" :Year "1976"} {:hit->doc ld/document->map})
[{:doc-id 10,
:score 4.56387,
:hit
{:Number "11",
:Year "1976",
:Album "The Sun Sessions",
:Artist "Elvis Presley",
:Genre "Rock",
:Subgenre "Rock & Roll"}}]
Notice that the suggest call takes the field and the suggestion prefix as separate arguments rather than as a map. Unlike search, suggest works over a single field only.
As shown above, the fields Album and Artist have a :suggest entry in their field specs. Suggestion weight and contexts are part of the field definition instead of being spread across separate indexing options.
(lucene/suggest album-index :Album "par"
                {:hit->doc #(ld/document->map % :multi-fields [:Genre :Subgenre])
                 :contexts ["Electronic"]})
[{:hit
{:Number "140",
:Year "1978",
:Album "Parallel Lines",
:Artist "Blondie",
:Genre ["Electronic" "Rock"],
:Subgenre ["New Wave" "Pop Rock" "Punk" "Disco"]},
:score 1.0,
:doc-id 139}]
Use :max-results to cap the number of suggestions returned, and :skip-duplicates? true when duplicate suggestions are not useful for the caller.
(lucene/suggest album-index :Album "s"
                {:max-results 2
                 :skip-duplicates? true
                 :hit->doc #(ld/document->map % :fields-to-keep #{:Album :Artist})})
We can ask for fuzzy matching when querying for suggestions.
(lucene/suggest album-index :Album "per"
                {:hit->doc #(ld/document->map % :multi-fields [:Genre :Subgenre])
                 :fuzzy? true
                 :contexts ["Electronic"]})
[{:hit
{:Number "140",
:Year "1978",
:Album "Parallel Lines",
:Artist "Blondie",
:Genre ["Electronic" "Rock"],
:Subgenre ["New Wave" "Pop Rock" "Punk" "Disco"]},
:score 2.0,
:doc-id 139}
{:hit
{:Number "76",
:Year "1984",
:Album "Purple Rain",
:Artist "Prince and the Revolution",
:Genre ["Electronic" "Rock" "Funk / Soul" "Stage & Screen"],
:Subgenre ["Pop Rock" "Funk" "Soundtrack" "Synth-pop"]},
:score 2.0,
:doc-id 75}]
Notice how forever matches fever too.
(lucene/search album-index {:Album "forever"}
               {:hit->doc #(ld/document->map % :multi-fields [:Genre :Subgenre])
                :fuzzy? true})
[{:doc-id 39,
:score 3.0850303,
:hit
{:Number "40",
:Year "1967",
:Album "Forever Changes",
:Artist "Love",
:Genre ["Rock"],
:Subgenre ["Folk Rock" "Psychedelic Rock"]}}
{:doc-id 131,
:score 0.9592955,
:hit
{:Number "132",
:Year "1977",
:Album "Saturday Night Fever: The Original Movie Sound Track",
:Artist "Various Artists",
:Genre ["Electronic" "Stage & Screen"],
:Subgenre ["Soundtrack" "Disco"]}}]
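Fuzzy matching in Lucene is based on edit distance, and Lucene's FuzzyQuery allows at most two edits. The sketch below computes the classic Levenshtein distance in plain Clojure to show that "forever" and "fever" are exactly two edits apart. Whether lucene-clj's :fuzzy? option uses Lucene's default settings is an assumption here, and `edit-distance` is a helper written for this illustration only; Lucene's own matching additionally treats a transposition as a single edit by default.

```clojure
(defn edit-distance
  "Levenshtein distance between strings a and b, computed row by row."
  [a b]
  (let [first-row (vec (range (inc (count b))))]
    (peek
     (reduce
      (fn [prev [i ca]]
        ;; Build the row for the first (inc i) characters of a.
        (reduce
         (fn [row [j cb]]
           (conj row
                 (min (inc (peek row))                      ; insert cb
                      (inc (nth prev (inc j)))              ; delete ca
                      (+ (nth prev j) (if (= ca cb) 0 1))))) ; match or substitute
         [(inc i)]
         (map-indexed vector b)))
      first-row
      (map-indexed vector a)))))

(edit-distance "forever" "fever") ;; two deletions: the "o" and the first "r"
```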
lucene-clj leans on Clojure data shapes so that one search function can cover the common query kinds.
- {:field "value"} - fielded search
- {:field ["a" "b"]} - AND within one field
- {:field #{"a" "b"}} - OR within one field
- "single term" with :field-name - single-field term search
- "multiple words" with :field-name - phrase search
- Query from msync.lucene.query/parse-dsl - explicit Lucene query syntax
Examples:
(lucene/search album-index {:Album "rumours"})
(lucene/search album-index {:Album ["complete" "unbelievable"]})
(lucene/search album-index {:Year #{"1960" "1965"}})
(lucene/search album-index "the sun" {:field-name :Album})
(lucene/search album-index
               (msync.lucene.query/parse-dsl "Album:\"the sun\" AND Year:1976"
                                             common/album-data-analyzer))
This API is intentionally shape-driven: the kind of Clojure value you pass determines the query behavior. That keeps the public surface compact, but it also means that changing a query from a set to a vector, or from a single word to a spaced string, changes the semantics.
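One way to internalise the shape-driven rules is a small classifier that maps a query value to a description of what it means. This is a sketch of the documented semantics expressed as plain data, not the library's implementation; `value->clause` and `describe-query` are hypothetical names, and the returned maps are an invented notation.

```clojure
(require '[clojure.string :as str])

(defn value->clause
  "Describes what a value means when searched in one field."
  [field v]
  (cond
    (set? v)        {:op :or  :clauses (mapv #(value->clause field %) v)}
    (sequential? v) {:op :and :clauses (mapv #(value->clause field %) v)}
    (and (string? v) (str/includes? v " "))
                    {:op :phrase :field field :text v}
    :else           {:op :term :field field :value v}))

(defn describe-query
  "Describes a query form: a map of fields, a plain string, or anything else
  (treated as a pre-built query object)."
  ([q] (describe-query q nil))
  ([q {:keys [field-name]}]
   (cond
     (map? q)    {:op :and :clauses (mapv (fn [[f v]] (value->clause f v)) q)}
     (string? q) (value->clause field-name q)
     :else       {:op :prebuilt :query q})))

(describe-query {:Album ["complete" "unbelievable"]}) ;; AND of two terms
(describe-query {:Year #{"1960" "1965"}})             ;; OR of two terms
(describe-query "the sun" {:field-name :Album})       ;; phrase
```

Swapping the vector for a set in the first call flips the inner operator from :and to :or, which is the kind of silent semantic change the paragraph above warns about.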
- Some minimal technical overview of Lucene internals for this project can be found here.
Copyright © 2018-2020 Ravindra R. Jaju
Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.