Marcel Böhme
@mboehme.bsky.social
Software Security @MPI, PhD @NUS, Dipl.-Inf. @TUDresden.


Research Group: http://mpi-softsec.github.io
The AI reviewer lists several other items as weaknesses and the corresponding suggestions for improvement. These are summarily deemed to be fixable. Yay!

11/
November 8, 2025 at 7:51 PM
The fourth weakness is a set of presentation issues. These are helpful but easily fixed.

10/
November 8, 2025 at 7:51 PM
The third weakness is a matter of preference.

Our theorem expresses what (and how efficiently) we can learn about detecting non-zero incoherence from the algorithm's output: "If after n(δ,ε) samples we detect no disagreement, then incoherence is at most ε with prob. at least 1-δ".
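As a rough illustration of the shape of such a guarantee (not the paper's exact bound), here is a minimal Python sketch: the sample size n(δ,ε) below comes from the standard Bernoulli tail bound (1-ε)^n ≤ δ, and any observed disagreement between sampled programs is a proof of error, with no oracle needed. The function names and the toy programs are hypothetical.

```python
import math

def sample_size(delta: float, eps: float) -> int:
    """Smallest n with (1 - eps)**n <= delta: if true incoherence exceeds eps,
    then n independent disagreement checks all pass with probability at most
    delta (standard Bernoulli tail bound; the paper's n(delta, eps) may differ)."""
    return math.ceil(math.log(delta) / math.log(1.0 - eps))

def coherent_on_samples(programs, inputs):
    """Return True iff all sampled programs agree on every test input.
    A disagreement proves at least one program is wrong -- no false positives."""
    for x in inputs:
        if len({p(x) for p in programs}) > 1:
            return False
    return True

# Toy example: two sampled "programs" for absolute value, one buggy.
p_ok = abs
p_bug = lambda x: x  # wrong for negative inputs
print(sample_size(delta=0.05, eps=0.1))                 # → 29
print(coherent_on_samples([p_ok, p_bug], [-3, 0, 4]))   # → False
```

With δ=0.05 and ε=0.1, about 29 agreement checks already suffice to bound the incoherence at 10% with 95% confidence.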

9/
November 8, 2025 at 7:51 PM
The second weakness is incorrect.

Our incoherence-based detection indeed reports no false positives: non-zero incoherence implies non-zero error, even empirically. To cite the AI reviewer: "If two sampled programs disagree on an input, at least one of them is wrong".

8/
November 8, 2025 at 7:51 PM
Using our own formalization, the AI reviewer literally proposes a fix where we compute pass@1 as the proportion of tasks with non-zero error. Nice!

7/
November 8, 2025 at 7:51 PM
The first weakness seems critical. The AI reviewer finds an error in a "key equation".

Nothing reject-worthy: not a key equation but a remark, and the error is just a typo; e.g., 1-\bar{E}(S,1,1) fixes it.

But YES, the AI reviewer found a bug in our Equation (12). Wow!!

6/
November 8, 2025 at 7:51 PM
The three other strengths are all items we would ourselves consider important strengths of our paper.

Interestingly, one item (highlighted in blue) is never mentioned in our paper, but something we are now actively pursuing.

4/
November 8, 2025 at 7:51 PM
Our AI reviewer establishes 4 strengths. Most human reviewers might have a hard time establishing strengths so succinctly and at this level of detail. Great!

Apart from a minor error (no executable semantics needed; only ability to execute), the first strength looks good.

3/
November 8, 2025 at 7:51 PM
Let's find out which strengths the AI reviewer establishes for our paper. We'll look at weaknesses (those that get papers rejected) later.

The summary of review definitely hits the nail on the head. We can see motivation and main contributions. Nice!

2/
November 8, 2025 at 7:51 PM
🧵 A human review of our AI review at #AAAI26.

📝: arxiv.org/abs/2507.00057
🦋 : bsky.app/profile/did:...

We are off to a good start. While the synopsis misses the motivation (*why* this is interesting), it offers the most important points. Good abstract-length summary.

1/
November 8, 2025 at 7:51 PM
What's most interesting is that we linked pass@1, a common measure of LLM performance, to incoherence, and found that a ranking of LLMs by our metric is similar to a pass@1-based ranking, which requires an explicit definition of correctness, e.g., manually-provided "golden" programs.
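One standard way to quantify such ranking agreement is a rank correlation like Kendall's tau. A minimal sketch (the scores below are purely illustrative, NOT the paper's data):

```python
def kendall_tau(a, b):
    """Kendall rank correlation between two score lists (toy version, no ties)."""
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                concordant += 1   # pair ranked the same way by both metrics
            elif s < 0:
                discordant += 1   # pair ranked in opposite ways
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical scores for four models (illustrative numbers only):
pass_at_1 = [0.62, 0.48, 0.71, 0.35]  # oracle-based: fraction of tasks solved
coherence = [0.58, 0.44, 0.69, 0.30]  # oracle-less: 1 - estimated incoherence
print(kendall_tau(pass_at_1, coherence))  # → 1.0 (identical ranking)
```

A tau of 1.0 means the oracle-less metric orders the models exactly as pass@1 does; values near 0 would mean the rankings are unrelated.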
November 8, 2025 at 8:00 AM
Just accepted at #AAAI26 in Singapore: Our paper on estimating the *correctness* of LLM-generated code in the absence of oracles (e.g., a ground-truth implementation).

📝 arxiv.org/abs/2507.00057

with Thomas Valentin (ENS Paris-Saclay), Ardi Madadi, and Gaetano Sapia (#MPI_SP).
November 8, 2025 at 8:00 AM
🧗 Manually writing fuzz drivers doesn't scale.
🚩 Auto-generating them gives false positives.
👩‍💻 In-vivo fuzzing requires a user to configure the system and to execute the target.

🤖 Can we substitute the user and auto-generate configuration and executions?

Find out @ gpsapia.github.io/files/ICSE_2...
November 3, 2025 at 6:24 PM
Gaetano's paper on Scaling Security Testing by Addressing the Reachability Gap has been accepted at #ICSE26!

📝 gpsapia.github.io/files/ICSE_2...
🧑‍💻 github.com/GPSapia/Reac...

How to scale automatic security testing to arbitrary systems?
November 3, 2025 at 6:24 PM
AAAI'26 has been adopting AI reviews in two stages of the review process. I can see that we have to handle the reviewer overload, but I don't think AI reviews are beneficial at all for our scientific progress.

If our paper gets accepted at #AAAI26, I will review our AI-generated review here 🤠
October 25, 2025 at 12:18 PM
We believe that our probabilistic perspective, which treats the correctness of an LLM-generated program as a random variable, gives rise to a proliferation of new techniques for trustworthy code generation with probabilistic guarantees.

Comments and feedback welcome!
July 2, 2025 at 7:30 AM
This work on "Estimating Correctness Without Oracles in LLM-Based Code Generation" was led by Thomas Valentin (ENS Paris-Saclay) with generous advice and help from Ardi Madadi (MPI-SP) and Gaetano Sapia (MPI-SP).
July 2, 2025 at 7:30 AM
A traditional pass@1-based evaluation of the code generation abilities of LLMs can be reliably substituted with our oracle-less evaluation. This brings substantial benefits. For instance, it removes the reliance on human-written oracles (reducing data-leakage and overfitting problems).
July 2, 2025 at 7:29 AM
It's been a lot of fun! Up here in Trondheim the sun never really sets at this time of the year. This is a picture from 9:30pm which feels like an eternal 4pm.

See y'all next year!
June 29, 2025 at 7:31 AM
Will Wilson (@AntithesisHQ.bsky.social) talked about the four professional paths with a beautiful historical metaphor, from being a member of a guild (academia) to being a siege engineer (startup founder). He also talked about his efforts at Antithesis to build a deterministic VM for fuzzing.
June 29, 2025 at 7:30 AM
Miryung Kim (UCLA) talked about challenges in domain-specific fuzzing beyond those of general-purpose fuzzing, including very slow targets (from HW circuits to distributed systems), and her approach to developing domain-specific program transformations, mutation operators, feedback, etc.
June 29, 2025 at 7:29 AM
It was great to see the community come together again at our 4th #FUZZING workshop in Trondheim this year! We drew a big crowd. Enjoyed the super lively discussions.

Thanks to the organizers:
* @rohan.padhye.org
* @yannicnoller.bsky.social
* @ruijiemeng.bsky.social and
* László Szekeres (Google)
June 29, 2025 at 7:25 AM
Thrilled to share a recent opinion piece at the IEEE Security and Privacy (Vol. 23, Issue 3).

Basically a long-term perspective on the field meant for both researchers and practitioners.

📝 ieeexplore.ieee.org/stamp/stamp....
June 19, 2025 at 9:40 AM
My amazing co-chair Lingming Zhang and I, 12 area chairs in 9 areas, and 260 PC members are looking forward to your submissions to the 40th IEEE/ACM International Conference on Automated Software Engineering in Seoul!

📝 conf.researchr.org/track/ase-20... (CfP)
📝 ase25.hotcrp.com/u/0/ (Submission)
May 3, 2025 at 4:54 PM
Policy on LLM-assisted Reviews @ #ASE25
May 3, 2025 at 2:28 PM