Lightnews — Scholar-powered news

Vectorized evaluation of disjunctive queries

October 12, 2025 at 7:15 AM

Adrien Grand

@jpountz.bsky.social

New blog: vectorized evaluation of disjunctive queries jpountz.github.io/2025/10/11/v... It explains how Lucene manages to be fast at evaluating top hits by BM25 score, even with hard queries that have only stop words or tens of terms.

In a previous blog post, I explained how Lucene significantly improved query evaluation efficiency by migrating to a vectorized execution model, and described the algorithm that Lucene uses to evaluat...

BM25 run across multiple fields isn’t as simple as summing a bunch of field-level BM25 scores.

October 11, 2025 at 8:27 PM

Reposted by Adrien Grand

Doug Turnbull

@softwaredoug.bsky.social

BM25F is an adjustment to BM25 that accounts for multiple fields, beating out naive summing of BM25 scores

softwaredoug.com/blog/2025/09...

BM25F from scratch

softwaredoug.com
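
To make the contrast with naive summing concrete, here is a minimal sketch of one common BM25F formulation, in which field-weighted, length-normalized term frequencies are combined before a single saturation step. The function names, the shared `idf_value` across fields, and the default `k1` and `b` values are assumptions for illustration; they are not taken from the linked post.

```python
import math


def idf(doc_count, doc_freq):
    """A common BM25 inverse document frequency variant."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term(tf, field_len, avg_len, idf_value, k1=1.2, b=0.75):
    """Single-field BM25 contribution of one term."""
    norm = 1 - b + b * field_len / avg_len
    return idf_value * tf / (tf + k1 * norm)


def bm25f_term(fields, idf_value, k1=1.2):
    """BM25F contribution of one term.

    `fields` is a list of (tf, field_len, avg_len, weight, b) tuples.
    Per-field frequencies are merged into one pseudo-frequency first,
    and saturation is applied once to the merged value.
    """
    pseudo_tf = sum(
        weight * tf / (1 - b + b * field_len / avg_len)
        for tf, field_len, avg_len, weight, b in fields
    )
    return idf_value * pseudo_tf / (pseudo_tf + k1)


# Hypothetical term that occurs 10 times in both title and body.
term_idf = idf(doc_count=1000, doc_freq=10)
naive_sum = (bm25_term(10, 100, 100, term_idf)
             + bm25_term(10, 100, 100, term_idf))
combined = bm25f_term([(10, 100, 100, 1.0, 0.75),
                       (10, 100, 100, 1.0, 0.75)], term_idf)
```

With naive summing, a term that saturates in two fields can contribute almost twice its IDF, whereas the combined score stays below a single IDF because saturation is applied only once.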

September 18, 2025 at 5:33 PM

Adrien Grand

@jpountz.bsky.social

Lucene 10.3 is out with 40% faster lexical search, 15% faster dense vector search and 30% faster terms dictionary lookups. lucene.apache.org/core/corenew...

Lucene™ Core News

Apache Lucene is a high-performance, full-featured search engine library written entirely in Java. It is a technology suitable for...

lucene.apache.org

September 14, 2025 at 7:26 AM

Adrien Grand

@jpountz.bsky.social

Lucene just bumped the block size of its postings lists from 128 to 256. This gave very good speedups (up to 45%) to most queries, and up to 10-15% slowdowns to filtered term queries. benchmarks.mikemccandless.com/2025.09.10.1...

Adding 3-ary LongHeap to speed up collectors like TopDoc*Collectors by RamakrishnaChilaka · Pull Request #15140 · apache/lucene

September 11, 2025 at 1:46 PM

Adrien Grand

@jpountz.bsky.social

#Lucene just switched from a binary heap to a ternary heap to collect top hits by score. This helps a little when computing top-100 hits (~2% on the fastest queries) but up to 15% when computing top-1000 hits, thanks to better cache efficiency: github.com/apache/lucen...

Description This PR updates LongHeap from a fixed 2-ary heap to a 3-ary heap (the code is generic with n-ary Heap). The change improves cache locality and reduces heap operations for larger heaps, ...
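
As an illustration of the idea, here is a minimal sketch of an array-backed heap with configurable arity, used to keep the top-k values of a stream. It is not Lucene's `LongHeap`; the class and function names are hypothetical. With a higher arity the tree is shallower and the children of a node sit next to each other in the array, which is one plausible reading of the cache-locality argument.

```python
class DaryMinHeap:
    """Array-backed min-heap where each node has `arity` children."""

    def __init__(self, arity=3):
        if arity < 2:
            raise ValueError("arity must be at least 2")
        self.arity = arity
        self.items = []

    def push(self, value):
        self.items.append(value)
        self._sift_up(len(self.items) - 1)

    def top(self):
        return self.items[0]

    def replace_top(self, value):
        """Overwrite the smallest element and restore heap order."""
        self.items[0] = value
        self._sift_down(0)

    def _sift_up(self, i):
        items, d = self.items, self.arity
        while i > 0:
            parent = (i - 1) // d
            if items[i] >= items[parent]:
                break
            items[i], items[parent] = items[parent], items[i]
            i = parent

    def _sift_down(self, i):
        items, d = self.items, self.arity
        n = len(items)
        while True:
            first = d * i + 1
            if first >= n:
                break
            # The children of node i occupy a contiguous slice of the array.
            smallest = min(range(first, min(first + d, n)),
                           key=items.__getitem__)
            if items[smallest] >= items[i]:
                break
            items[i], items[smallest] = items[smallest], items[i]
            i = smallest


def top_k(values, k, arity=3):
    """Return the k largest values in descending order."""
    if k <= 0:
        return []
    heap = DaryMinHeap(arity)
    for v in values:
        if len(heap.items) < k:
            heap.push(v)
        elif v > heap.top():
            heap.replace_top(v)
    return sorted(heap.items, reverse=True)
```

Calling `top_k(scores, 1000, arity=2)` and `top_k(scores, 1000, arity=3)` returns the same hits; only the shape of the heap, and therefore the memory access pattern, differs.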

Compilation vs. vectorization, search engine edition

September 4, 2025 at 2:04 PM

Adrien Grand

@jpountz.bsky.social

I just ran the Tantivy benchmark (tantivy-search.github.io/bench/) on Lucene 10.2 vs a Lucene 10.3 snapshot build. Lucene 10.2 already performed very well, but Lucene 10.3 is on another level. Very exciting.

August 30, 2025 at 8:12 PM

Adrien Grand

@jpountz.bsky.social

I just blogged about how Lucene improved query evaluation efficiency by ~40% through vectorization: jpountz.github.io/2025/08/28/c...

Virtual function calls are quite expensive, which is why database systems have been looking into ways to avoid performing one or more virtual function calls per record when processing a query. Two mai...

Why you should configure an index sort on your Lucene indexes

August 28, 2025 at 1:10 PM

Adrien Grand

@jpountz.bsky.social

Why you probably should configure an index sort on your Lucene indexes: jpountz.github.io/2025/07/26/w...

Some time ago, I wrote that “if you do not configure an index sort on your Lucene indexes, you are missing search-time efficiency benefits that are almost certainly worth the (low) index-time overhead...

More on Vespa vs. Lucene/Elasticsearch

July 26, 2025 at 9:06 PM

Adrien Grand

@jpountz.bsky.social

I spent some time looking at the Vespa source code to see how it compares with Lucene jpountz.github.io/2025/07/25/m...

In a previous post, I took a look at the Vespa vs. Elasticsearch benchmark that the Vespa people run. The results made me want to dig a little deeper to see how Vespa and Lucene/Elasticsearch differ i...

Lucene BooleanQuery (OR, high freq, medium freq term) queries/sec

July 26, 2025 at 6:36 PM

Adrien Grand

@jpountz.bsky.social

This small change yielded a ~5% speedup on several queries of Lucene's nightly benchmarks (see last data point at benchmarks.mikemccandless.com/OrStopWords....). Can you guess why?

July 9, 2025 at 6:24 AM

Adrien Grand

@jpountz.bsky.social

Last month, Lucene changed query evaluation to work in a more term-at-a-time fashion within small-ish windows of doc IDs. This yielded a good speedup on its own (see annotation IL at benchmarks.mikemccandless.com/OrHighMed.html).

A look at the Vespa vs. Elasticsearch benchmark

July 4, 2025 at 11:28 AM

Adrien Grand

@jpountz.bsky.social

Lucene is getting an increasing number of high-quality contributions from ByteDance employees, especially around performance. Good to see that this project keeps attracting contributors from all around the world.

June 26, 2025 at 3:40 PM

Adrien Grand

@jpountz.bsky.social

Another common point I did not expect: Vespa's strict vs. unstrict iterators are quite similar to Lucene's two-phase iteration. And both projects use this feature to effectively combine dynamic pruning with filtering (a hard and underappreciated problem IMO).

June 25, 2025 at 12:53 PM

Adrien Grand

@jpountz.bsky.social

Someone asked me for my opinion on the Vespa vs. Elasticsearch performance comparison today at Berlin Buzzwords, so I gave it a try: jpountz.github.io/2025/06/17/a...

I was attending Berlin Buzzwords today and someone asked me about the Elasticsearch vs. Vespa comparison produced by the Vespa people, so I thought I’d publish my thoughts.

Lucene FilteredOrHighMed queries/sec

June 17, 2025 at 8:17 PM

Adrien Grand

@jpountz.bsky.social

Andrei Dan kindly captured pictures of Luca and me telling the story of how the Lucene 10 release went

June 16, 2025 at 1:43 PM

Adrien Grand

@jpountz.bsky.social

Via @rmuir.org: Linux 6.15 introduced a big speedup for Lucene on AMD processors benchmarks.mikemccandless.com/FilteredOrHi... (last data point, not annotated yet) thanks to faster TLB invalidation www.phoronix.com/review/amd-i...

Move HitQueue in TopScoreDocCollector to a LongHeap by gf2121 · Pull Request #14714 · apache/lucene

June 16, 2025 at 1:34 PM

Adrien Grand

@jpountz.bsky.social

Uwe now explains how Lucene takes advantage of the Panama foreign memory and vector support in spite of the fact that these features are still preview/incubating in the JDK

June 16, 2025 at 10:21 AM

Adrien Grand

@jpountz.bsky.social

Uwe Schindler gives a short history of Apache Lucene at #bbuzz

June 16, 2025 at 10:08 AM

Adrien Grand

@jpountz.bsky.social

Lucene is getting faster at deep search by switching to a more efficient heap implementation to collect top hits. github.com/apache/lucen...

This tries to encode ScoreDoc#score and ScoreDoc#doc to a comparable long and use a LongHeap instead of HitQueue. This seems to help apparently when i increase topN = 1000 (mikemccand/luceneutil#35...
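
One way to picture the encoding described above is the sketch below, which packs a score and a doc ID into a single integer whose natural ordering matches "higher score first, then lower doc ID". It assumes non-negative float32 scores and 31-bit doc IDs; the helper names are hypothetical and the exact bit layout in the pull request may differ.

```python
import struct

MAX_DOC = 0x7FFFFFFF  # largest 31-bit doc ID, used to invert doc order


def encode(score, doc):
    """Pack a non-negative float32 score and a doc ID into one integer."""
    if score != score or score < 0:
        raise ValueError("score must be a non-negative number")
    if not 0 <= doc <= MAX_DOC:
        raise ValueError("doc ID out of range")
    if score == 0:
        score = 0.0  # normalize -0.0, whose sign bit would break ordering
    # For non-negative floats, IEEE 754 bit patterns sort like the floats.
    score_bits = struct.unpack(">I", struct.pack(">f", score))[0]
    # Invert the doc ID so that, on equal scores, the lower doc ID wins.
    return (score_bits << 32) | (MAX_DOC - doc)


def decode(value):
    """Recover (score, doc) from an encoded integer."""
    score = struct.unpack(">f", struct.pack(">I", value >> 32))[0]
    doc = MAX_DOC - (value & 0xFFFFFFFF)
    return score, doc
```

Because the sign bit of a non-negative float is zero, the packed value also fits in a signed 64-bit long, so a heap of primitive longs can compare hits with a single integer comparison instead of comparing objects field by field.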

Cache high-order bits of hashcode to speed up BytesRefHash by bugmakerrrrrr · Pull Request #14720 · apache/lucene

June 6, 2025 at 8:02 AM

Adrien Grand

@jpountz.bsky.social

A nice optimization landed on the hash table that Lucene uses to build inverted indexes: github.com/apache/lucen.... Some previously unused bits are now used to cache hash codes, effectively making collisions cheaper to resolve.

Description This PR tries to utilize the unused part of the id to cache the high-order bits of the hashcode to speed up BytesRefHash. I used 1 million 16-byte UUIDs to benchmark this change, and t...
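
The general technique can be sketched as follows: an open-addressing table stores each key's ID in the low bits of a slot and a fragment of its hash code in the otherwise unused high bits, so most colliding probes are rejected without comparing keys. This is not Lucene's `BytesRefHash`; the class name, the 20-bit ID width, the fixed capacity, and the CRC32 default hash are all assumptions made for illustration.

```python
import zlib

ID_BITS = 20                        # assumption: IDs fit in 20 bits
ID_MASK = (1 << ID_BITS) - 1
HASH_MASK = 0xFFFFFFFF & ~ID_MASK   # high-order bits of a 32-bit hash
EMPTY = -1


class CachedHashSet:
    """Fixed-size linear-probing table that maps keys to dense IDs."""

    def __init__(self, capacity=16, hash_fn=zlib.crc32):
        if capacity < 2 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two")
        if capacity > 1 << ID_BITS:
            raise ValueError("capacity exceeds the ID range")
        self.slots = [EMPTY] * capacity
        self.keys = []              # key for each assigned ID
        self.hash_fn = hash_fn
        self.key_comparisons = 0    # counts full key comparisons

    def add(self, key):
        """Return the ID of `key`, inserting it if it is absent."""
        h = self.hash_fn(key) & 0xFFFFFFFF
        mask = len(self.slots) - 1
        pos = h & mask
        while True:
            slot = self.slots[pos]
            if slot == EMPTY:
                # Keep one slot free so that probing always terminates.
                if len(self.keys) + 1 >= len(self.slots):
                    raise OverflowError("table is full")
                new_id = len(self.keys)
                self.keys.append(key)
                self.slots[pos] = (h & HASH_MASK) | new_id
                return new_id
            # Cheap check first: compare the cached high-order hash bits.
            if (slot & HASH_MASK) == (h & HASH_MASK):
                self.key_comparisons += 1
                if self.keys[slot & ID_MASK] == key:
                    return slot & ID_MASK
            pos = (pos + 1) & mask
```

Two keys that land in the same slot but differ in their high-order hash bits are told apart by an integer comparison on the slot value alone; the full key comparison only runs when the cached bits match.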

Nightly benchmark regression on 2025.05.01 · Issue #14630 · apache/lucene

June 4, 2025 at 4:33 PM

Adrien Grand

@jpountz.bsky.social

There has been a big regression in Lucene's nightly benchmarks recently after a kernel upgrade. Mike and @rmuir.org found that it was caused by a change in the Linux scheduler configuration. github.com/apache/lucen...

Description I'm seeing a big performance change (mostly regression) on 2025.05.01 benchmark, without an annotation. There are many commits diff for this run, i have not managed to identify but mayb...

An analysis of Search Benchmark, the Game

May 19, 2025 at 5:23 AM

Adrien Grand

@jpountz.bsky.social

I wanted to share what I learned from Tantivy's "Search Benchmark, the Game", so I set up GitHub Pages and wrote two blog posts: one with general observations on the benchmark (jpountz.github.io/2025/05/12/a...) and one on how it helped drive performance improvements in Lucene (jpountz.github.io/2025/04/12/w...).

“Search Benchmark, the Game” is maintained at https://github.com/quickwit-oss/search-benchmark-game by the Tantivy folks and published at https://tantivy-search.github.io/bench/. I don’t know the full...