Adrien Grand
jpountz.bsky.social
#Lucene developer
It's interesting how the Elasticsearch and Datadog (www.datadoghq.com/blog/enginee...) approaches to wildcard search differ. Both use n-gram indexes, but with different strategies to contain storage amplification. Datadog hashes 4-grams while ES aggressively normalizes 3-grams.
Inside Husky’s query engine: Real-time access to 100 trillion events | Datadog
See how Husky enables interactive querying across 100 trillion events daily by combining caching, smart indexing, and query pruning.
www.datadoghq.com
October 13, 2025 at 8:27 AM
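To make the trade-off concrete, here is a minimal sketch of the two ideas: shrinking the n-gram vocabulary by normalizing 3-grams, or by hashing 4-grams into a fixed number of buckets. Everything here is an assumption for illustration: the normalization (lowercasing only), the hash (CRC32), the bucket count and the function names are hypothetical and are not taken from Elasticsearch or Datadog. In both cases the index only produces candidates, which then need to be verified against the original text.

```python
import zlib


def ngrams(text: str, n: int) -> set[str]:
    """Return the set of character n-grams of `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def normalized_trigrams(text: str) -> set[str]:
    # Hypothetical normalization: lowercase only. Fewer distinct 3-grams
    # means fewer terms in the index, at the cost of more false positives.
    return ngrams(text.lower(), 3)


def hashed_fourgrams(text: str, num_buckets: int = 1024) -> set[int]:
    # Hypothetical hashing: each 4-gram is mapped to one of `num_buckets`
    # buckets, which bounds the number of distinct terms in the index.
    return {zlib.crc32(g.encode("utf-8")) % num_buckets for g in ngrams(text, 4)}


def wildcard_matches(docs: list[str], infix: str) -> list[int]:
    """Answer the wildcard query *infix* with a trigram index as pre-filter."""
    index: dict[str, set[int]] = {}
    for doc_id, text in enumerate(docs):
        for gram in normalized_trigrams(text):
            index.setdefault(gram, set()).add(doc_id)

    grams = normalized_trigrams(infix)
    if grams:
        # A matching document must contain every n-gram of the infix.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        # Infix shorter than 3 characters: the index cannot help.
        candidates = set(range(len(docs)))

    # Normalization loses information, so candidates must be verified.
    return sorted(d for d in candidates if infix in docs[d])
```

With `docs = ["Connection timeout", "connection refused", "disk full"]`, the query `Connect` selects the first two documents as candidates because lowercasing makes them indistinguishable, and the verification step keeps only the first one.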
Ge Song merged a good ~15% speedup for BM25F queries in Lucene benchmarks.mikemccandless.com/CombinedOrHi... (last data point) github.com/apache/lucen...
Lucene CombinedOrHighMed queries/sec
benchmarks.mikemccandless.com
October 12, 2025 at 7:15 AM
New blog: vectorized evaluation of disjunctive queries jpountz.github.io/2025/10/11/v... It explains how Lucene manages to be fast at evaluating top hits by BM25 score, even with hard queries that have only stop words or tens of terms.
Vectorized evaluation of disjunctive queries
In a previous blog post, I explained how Lucene significantly improved query evaluation efficiency by migrating to a vectorized execution model, and described the algorithm that Lucene uses to evaluat...
jpountz.github.io
October 11, 2025 at 8:27 PM
Reposted by Adrien Grand
BM25F is an adjustment to BM25 that accounts for multiple fields, beating out naive summing of BM25 scores

softwaredoug.com/blog/2025/09...
BM25F from scratch
BM25 run across multiple fields isn’t as simple as summing a bunch of field-level BM25 scores.
softwaredoug.com
September 18, 2025 at 5:33 PM
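As a rough illustration of the difference, the sketch below scores a single term with a simplified BM25F (term frequencies are length-normalized and combined across fields before saturation is applied once) and with a naive sum of per-field BM25 scores. The `FieldStats` type, the parameter defaults and the shared `idf` value are assumptions made for this example; this is not the code from the linked post nor Lucene's implementation.

```python
from typing import NamedTuple


class FieldStats(NamedTuple):
    """Per-field statistics for one term in one document (hypothetical type)."""
    tf: float          # term frequency in this field
    length: float      # field length in this document
    avg_length: float  # average field length across the index
    weight: float = 1.0


def length_norm(f: FieldStats, b: float) -> float:
    return 1.0 - b + b * f.length / f.avg_length


def bm25_term(f: FieldStats, idf: float, k1: float = 1.2, b: float = 0.75) -> float:
    """Classic single-field BM25 contribution of one term."""
    return idf * f.tf / (f.tf + k1 * length_norm(f, b))


def summed_bm25(fields: list[FieldStats], idf: float) -> float:
    """Naive approach: saturate each field separately, then add up."""
    return sum(f.weight * bm25_term(f, idf) for f in fields)


def bm25f_term(fields: list[FieldStats], idf: float,
               k1: float = 1.2, b: float = 0.75) -> float:
    """Simplified BM25F: combine normalized frequencies, then saturate once."""
    pseudo_tf = sum(f.weight * f.tf / length_norm(f, b) for f in fields)
    return idf * pseudo_tf / (pseudo_tf + k1)
```

Because saturation is applied once, the BM25F contribution of a term stays below `idf` however many fields it matches in, whereas the summed variant can exceed it when the term appears in several fields.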
Lucene 10.3 is out with 40% faster lexical search, 15% faster dense vector search and 30% faster terms dictionary lookups. lucene.apache.org/core/corenew...
Lucene™ Core News
Apache Lucene is a high-performance, full-featured search engine library written entirely in Java. It is a technology suitable for...
lucene.apache.org
September 14, 2025 at 7:26 AM
Lucene just bumped the block size of its postings lists from 128 to 256. This gave very good speedups (up to 45%) on most queries, at the cost of slowdowns of up to 10-15% on filtered term queries. benchmarks.mikemccandless.com/2025.09.10.1...
benchmarks.mikemccandless.com
September 11, 2025 at 1:46 PM
#Lucene just switched from a binary heap to a ternary heap to collect top hits by score. This helps a little when computing top-100 hits (~2% on the fastest queries) but up to 15% when computing top-1000 hits, thanks to better cache efficiency github.com/apache/lucen...
Adding 3-ary LongHeap to speed up collectors like TopDoc*Collectors by RamakrishnaChilaka · Pull Request #15140 · apache/lucene
Description This PR updates LongHeap from a fixed 2-ary heap to a 3-ary heap (the code is generic with n-ary Heap). The change improves cache locality and reduces heap operations for larger heaps, ...
github.com
September 4, 2025 at 2:04 PM
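For readers unfamiliar with n-ary heaps, here is a minimal sketch of a min-heap whose arity is a parameter. It only illustrates the data structure; the class name and API are hypothetical and unrelated to Lucene's `LongHeap`. With arity 3 the tree is shallower than a binary heap, so each insertion or removal visits fewer levels, and the children of a node sit next to each other in the backing array.

```python
class DaryMinHeap:
    """Array-backed min-heap where every node has up to `arity` children."""

    def __init__(self, arity: int = 3):
        if arity < 2:
            raise ValueError("arity must be at least 2")
        self.arity = arity
        self.items: list = []

    def __len__(self) -> int:
        return len(self.items)

    def push(self, value) -> None:
        items = self.items
        items.append(value)
        i = len(items) - 1
        # Sift up: swap with the parent while the parent is larger.
        while i > 0:
            parent = (i - 1) // self.arity
            if items[parent] <= items[i]:
                break
            items[parent], items[i] = items[i], items[parent]
            i = parent

    def pop(self):
        """Remove and return the smallest value (IndexError if empty)."""
        items = self.items
        top = items[0]
        last = items.pop()
        if items:
            items[0] = last
            self._sift_down(0)
        return top

    def _sift_down(self, i: int) -> None:
        items = self.items
        n = len(items)
        while True:
            first_child = i * self.arity + 1
            if first_child >= n:
                break
            # The children of node i are contiguous in the array.
            last_child = min(first_child + self.arity, n)
            smallest = min(range(first_child, last_child), key=items.__getitem__)
            if items[i] <= items[smallest]:
                break
            items[i], items[smallest] = items[smallest], items[i]
            i = smallest
```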
I just ran the Tantivy benchmark (tantivy-search.github.io/bench/) on Lucene 10.2 vs a Lucene 10.3 snapshot build. Lucene 10.2 already performed very well, but Lucene 10.3 is on another level. Very exciting.
August 30, 2025 at 8:12 PM
I just blogged about how Lucene improved query evaluation efficiency by ~40% through vectorization: jpountz.github.io/2025/08/28/c...
Compilation vs. vectorization, search engine edition
Virtual function calls are quite expensive, which is why database systems have been looking into ways to avoid performing one or more virtual function calls per record when processing a query. Two mai...
jpountz.github.io
August 28, 2025 at 1:10 PM
I spent some time looking at the Vespa source code to see how it compares with Lucene jpountz.github.io/2025/07/25/m...
More on Vespa vs. Lucene/Elasticsearch
In a previous post, I took a look at the Vespa vs. Elasticsearch benchmark that the Vespa people run. The results made me want to dig a little deeper to see how Vespa and Lucene/Elasticsearch differ i...
jpountz.github.io
July 26, 2025 at 6:36 PM
This small change yielded a ~5% speedup on several queries of Lucene's nightly benchmarks (see last data point at benchmarks.mikemccandless.com/OrStopWords....). Can you guess why?
July 9, 2025 at 6:24 AM
Last month, Lucene changed query evaluation to work in a more term-at-a-time fashion within small-ish windows of doc IDs. This yielded a good speedup on its own (annotation IL benchmarks.mikemccandless.com/OrHighMed.html).
Lucene BooleanQuery (OR, high freq, medium freq term) queries/sec
benchmarks.mikemccandless.com
July 4, 2025 at 11:28 AM
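A minimal sketch of what evaluating a disjunction term-at-a-time within windows of doc IDs can look like: for each window, every term's postings are drained into a small score accumulator before moving to the next term, and matches are collected once the window is complete. The function name, the input layout (one sorted list of `(doc_id, score)` pairs per term) and the window size are assumptions for illustration; Lucene's real implementation is considerably more involved (dynamic pruning, bit sets, block decoding).

```python
def score_disjunction(postings: list[list[tuple[int, float]]],
                      max_doc: int,
                      window_size: int = 4096) -> list[tuple[int, float]]:
    """Score an OR query over `postings`, one window of doc IDs at a time."""
    hits = []
    positions = [0] * len(postings)  # read position within each term's postings

    for window_start in range(0, max_doc, window_size):
        window_end = min(window_start + window_size, max_doc)
        scores = [0.0] * (window_end - window_start)
        matched = [False] * (window_end - window_start)

        # Term-at-a-time within the window: a tight loop per term.
        for term, plist in enumerate(postings):
            pos = positions[term]
            while pos < len(plist) and plist[pos][0] < window_end:
                doc, score = plist[pos]
                scores[doc - window_start] += score
                matched[doc - window_start] = True
                pos += 1
            positions[term] = pos

        # Collect matching docs of this window in doc ID order.
        for offset, is_match in enumerate(matched):
            if is_match:
                hits.append((window_start + offset, scores[offset]))

    return hits
```

The accumulator only ever covers one window, so it can stay small enough to remain in CPU caches regardless of index size.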
Lucene is getting an increasing number of high-quality contributions from ByteDance employees, especially around performance. Good to see that this project keeps attracting contributors from all around the world.
June 26, 2025 at 3:40 PM
Another common point I did not expect: Vespa's distinction between strict and unstrict iterators is quite similar to Lucene's two-phase iteration. And both projects use this feature to effectively combine dynamic pruning with filtering (a hard and underappreciated problem IMO).
June 25, 2025 at 12:53 PM
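For context, here is a minimal sketch of the two-phase idea: each clause exposes a cheap approximation (a sorted list of candidate doc IDs) plus a costlier per-document confirmation, and a conjunction intersects the approximations first, then confirms candidates starting with the cheapest check. The `TwoPhase` type and the `conjunction` function are hypothetical names for this example; the sketch covers only the two-phase part and does not attempt dynamic pruning.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TwoPhase:
    approximation: list[int]        # sorted candidate doc IDs, cheap to produce
    matches: Callable[[int], bool]  # costly confirmation for one candidate
    match_cost: float               # estimated cost of calling `matches`


def conjunction(clauses: list[TwoPhase]) -> list[int]:
    """Return doc IDs that match every clause."""
    # Phase 1: intersect the cheap approximations.
    candidates = set(clauses[0].approximation)
    for clause in clauses[1:]:
        candidates &= set(clause.approximation)

    # Phase 2: confirm candidates, cheapest check first, so that costly
    # checks are skipped whenever a cheaper one already rejects the doc.
    by_cost = sorted(clauses, key=lambda c: c.match_cost)
    return [doc for doc in sorted(candidates)
            if all(clause.matches(doc) for clause in by_cost)]
```

The benefit is that an expensive confirmation, such as checking positions for a phrase, only runs on documents that every other clause has already agreed on.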
Someone asked me for my opinion on the Vespa vs. Elasticsearch performance comparison today at Berlin Buzzwords, so I gave it a try: jpountz.github.io/2025/06/17/a...
A look at the Vespa vs. Elasticsearch benchmark
I was attending Berlin Buzzwords today and someone asked me about the Elasticsearch vs. Vespa comparison produced by the Vespa people, so I thought I’d publish my thoughts.
jpountz.github.io
June 17, 2025 at 8:17 PM
Andrei Dan kindly captured pictures of Luca and me telling the story of how the Lucene 10 release went
June 16, 2025 at 1:43 PM
Via @rmuir.org : Linux 6.15 introduced a big speedup for Lucene on AMD processors benchmarks.mikemccandless.com/FilteredOrHi... (last data point, not annotated yet) thanks to faster TLB invalidation www.phoronix.com/review/amd-i...
Lucene FilteredOrHighMed queries/sec
benchmarks.mikemccandless.com
June 16, 2025 at 1:34 PM
Uwe now explains how Lucene takes advantage of the Panama foreign memory and vector support in spite of the fact that these features are still preview/incubating in the JDK
June 16, 2025 at 10:21 AM
Uwe Schindler gives a short history of Apache Lucene at #bbuzz
June 16, 2025 at 10:08 AM
Lucene is getting faster at deep search by switching to a more efficient heap implementation to collect top hits. github.com/apache/lucen...
Move HitQueue in TopScoreDocCollector to a LongHeap by gf2121 · Pull Request #14714 · apache/lucene
This tries to encode ScoreDoc#score and ScoreDoc#doc to a comparable long and use a LongHeap instead of HitQueue. This seems to help apparently when i increase topN = 1000 (mikemccand/luceneutil#35...
github.com
June 6, 2025 at 8:02 AM
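A minimal sketch of the encoding idea described in the pull request: pack a score and a doc ID into a single integer whose natural ordering matches the ordering of hits, so that a heap of plain integers can replace a heap of objects. The exact bit layout used here (score bits in the high 32 bits, inverted doc ID in the low 32 bits so that the lower doc ID wins ties) is an assumption for illustration and may differ from what Lucene does. It relies on the fact that the IEEE 754 bit patterns of non-negative floats sort in the same order as the floats themselves.

```python
import heapq
import math
import struct

MAX_DOC = 0xFFFFFFFF


def encode(score: float, doc: int) -> int:
    """Pack (score, doc) into one integer that compares like the hit."""
    if math.isnan(score) or math.copysign(1.0, score) < 0:
        raise ValueError("score must be a non-negative float")
    if not 0 <= doc <= MAX_DOC:
        raise ValueError("doc must fit in 32 bits")
    # Bits of the score as a 32-bit float, reinterpreted as an unsigned int.
    score_bits = struct.unpack(">I", struct.pack(">f", score))[0]
    # Assumed tie-break: invert the doc ID so lower doc IDs compare greater.
    return (score_bits << 32) | (MAX_DOC - doc)


def decode(key: int) -> tuple[float, int]:
    score = struct.unpack(">f", struct.pack(">I", key >> 32))[0]
    return score, MAX_DOC - (key & MAX_DOC)


def top_k(hits: list[tuple[float, int]], k: int) -> list[tuple[float, int]]:
    """Return the k best (score, doc) pairs, best first."""
    heap: list[int] = []  # min-heap of encoded hits; heap[0] is the worst kept
    for score, doc in hits:
        key = encode(score, doc)
        if len(heap) < k:
            heapq.heappush(heap, key)
        elif key > heap[0]:
            heapq.heapreplace(heap, key)
    return [decode(key) for key in sorted(heap, reverse=True)]
```

Comparing two hits then costs a single integer comparison instead of a score comparison followed by a doc ID comparison on ties.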
A nice optimization landed on the hash table that Lucene uses to build inverted indexes: github.com/apache/lucen.... Some previously unused bits are now used to cache hash codes, effectively making collisions cheaper to resolve.
Cache high-order bits of hashcode to speed up BytesRefHash by bugmakerrrrrr · Pull Request #14720 · apache/lucene
Description This PR tries to utilize the unused part of the id to cache the high-order bits of the hashcode to speed up BytesRefHash. I used 1 million 16-byte UUIDs to benchmark this change, and t...
github.com
June 4, 2025 at 4:33 PM
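To illustrate the trick, here is a minimal sketch of an open-addressing hash table that assigns sequential IDs to byte strings. Since IDs are always smaller than the table size, the bits of each slot above the ID are free, and they are used to cache the high-order bits of the hash code. On a collision, the stored bytes are only compared when the cached bits agree. The class name, the fixed capacity (no resizing), linear probing and the default CRC32 hash are all assumptions for this example; this is not the actual `BytesRefHash` code.

```python
import zlib
from typing import Callable


class CachedHashTable:
    """Maps byte strings to sequential IDs; slots hold `id | high hash bits`."""

    def __init__(self, capacity_bits: int = 4,
                 hash_fn: Callable[[bytes], int] = zlib.crc32):
        self.size = 1 << capacity_bits
        self.mask = self.size - 1
        self.slots = [-1] * self.size  # -1 marks an empty slot
        self.keys: list[bytes] = []
        self.hash_fn = hash_fn
        self.key_comparisons = 0       # counts full byte comparisons

    def add(self, key: bytes) -> int:
        """Return the ID of `key`, inserting it if it is new."""
        h = self.hash_fn(key)
        high = h & ~self.mask          # hash bits not used for the position
        pos = h & self.mask
        for _ in range(self.size):
            slot = self.slots[pos]
            if slot == -1:
                new_id = len(self.keys)
                self.keys.append(key)
                # The ID fits in the low bits; cache hash bits above it.
                self.slots[pos] = new_id | high
                return new_id
            if (slot & ~self.mask) == high:
                # Cached bits agree: only now pay for comparing the bytes.
                self.key_comparisons += 1
                existing_id = slot & self.mask
                if self.keys[existing_id] == key:
                    return existing_id
            pos = (pos + 1) & self.mask  # linear probing
        raise RuntimeError("table is full (this sketch does not resize)")
```

Two keys that land in the same slot but differ in their high hash bits are told apart without touching the stored bytes, which is what makes collisions cheaper to resolve.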
There has been a big regression in Lucene's nightly benchmarks recently after a kernel upgrade. Mike and @rmuir.org found that it was caused by a change in the Linux scheduler configuration. github.com/apache/lucen...
Nightly benchmark regression on 2025.05.01 · Issue #14630 · apache/lucene
Description I'm seeing a big performance change (mostly regression) on 2025.05.01 benchmark, without an annotation. There are many commits diff for this run, i have not managed to identify but mayb...
github.com
May 19, 2025 at 5:23 AM
I wanted to share what I learned from Tantivy's "Search Benchmark, the Game", so I set up GitHub Pages and wrote two blog posts: one with general observations on the benchmark jpountz.github.io/2025/05/12/a... and one on how it helped drive performance improvements in Lucene jpountz.github.io/2025/04/12/w...
An analysis of Search Benchmark, the Game
“Search Benchmark, the Game” is maintained at https://github.com/quickwit-oss/search-benchmark-game by the Tantivy folks and published at https://tantivy-search.github.io/bench/. I don’t know the full...
jpountz.github.io
May 12, 2025 at 5:47 PM
Yelp's nrtSearch was just upgraded to Lucene 10. It also switched from persistent storage to object storage as a source of truth, and plans on doing NRT replication via object storage instead of over the network. Very similar to Elasticsearch Serverless. engineeringblog.yelp.com/2025/05/nrts...
Nrtsearch 1.0.0: Incremental Backups, Lucene 10, and More
Sarthak Nandi and Andrew Prudhomme, May 8, 2025. It has been over 3 years since we published our Nrtsearch blog post and...
engineeringblog.yelp.com
May 10, 2025 at 8:08 AM