Marcel Böhme
@mboehme.bsky.social
Software Security @MPI, PhD @NUS, Dipl.-Inf. @TUDresden.
Research Group: http://mpi-softsec.github.io
Awesome! Also, I'll be happy to catch up in Seoul in the week after next if you are around for ASE :)
November 9, 2025 at 1:29 PM
On the negative side, the AI reviewer seems to be worse at setting priorities, i.e., at distinguishing between critical and insubstantial problems w.r.t. the main claims. Moreover, it was convincingly incorrect, whereas a human reviewer might be incorrect but detectably "silent" on the rationale.
2/2
November 9, 2025 at 8:59 AM
Great question!

On the positive side, I found the AI reviewer *way* more elaborate in spelling out both the positive and negative points. The review is more objective, less/not opinionated. It is more constructive and, for every weakness, makes suggestions for improvement.

1/
November 9, 2025 at 8:50 AM
Exactly. This is our assumption. Also, there can be infinitely many ways to implement that function.
November 9, 2025 at 7:44 AM
🧵 A human review of our AI review at #AAAI26.

📝: arxiv.org/abs/2507.00057
🦋 : bsky.app/profile/did:...

We are off to a good start. While the synopsis misses the motivation (*why* this is interesting), it covers the most important points. Good abstract-length summary.

1/
November 8, 2025 at 8:04 PM
Overall, the AI reviewer is super impressive! I think it would help me tremendously to identify points to improve before the paper is submitted.

However, it does make errors, and I wouldn't trust it as an actual (co)-reviewer.

12/12
November 8, 2025 at 7:51 PM
The AI reviewer lists several other items as weaknesses and the corresponding suggestions for improvement. These are summarily deemed to be fixable. Yay!

11/
November 8, 2025 at 7:51 PM
The fourth weakness is a set of presentation issues. These are helpful but easily fixed.

10/
November 8, 2025 at 7:51 PM
The third weakness is a matter of preference.

Our theorem expresses what (and how efficiently) we can learn about non-zero incoherence from the algorithm's output: "If after n(δ,ε) samples we detect no disagreement, then the incoherence is at most ε with probability at least 1-δ".
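
For intuition, here is a minimal sketch of how such a bound can arise (a standard geometric-trials argument; the paper's exact n(δ,ε) may differ):

import math

def n_samples(delta: float, eps: float) -> int:
    # If the true incoherence exceeds eps, each independent pairwise
    # comparison reveals a disagreement with probability at least eps,
    # so n "silent" trials occur with probability at most (1 - eps)^n.
    # Return the smallest n with (1 - eps)^n <= delta.
    return math.ceil(math.log(delta) / math.log(1.0 - eps))

print(n_samples(delta=0.01, eps=0.05))  # 90 comparisons suffice here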

9/
November 8, 2025 at 7:51 PM
The second weakness is incorrect.

Our incoherence-based detection indeed reports no false positives: a non-zero incoherence implies a non-zero error, even empirically. To cite the AI reviewer: "If two sampled programs disagree on an input, at least one of them is wrong".

8/
November 8, 2025 at 7:51 PM
Using our own formalization, the AI reviewer literally proposes a fix where we compute pass@1 as the proportion of tasks with non-zero error. Nice!

7/
November 8, 2025 at 7:51 PM
The first weakness seems critical. The AI reviewer finds an error in a "key equation".

Nothing reject-worthy: it's not a key equation but a remark, and the error is just a typo; e.g., 1-\bar{E}(S,1,1) fixes it.

But YES, the AI reviewer found a bug in our Equation (12). Wow!!

6/
November 8, 2025 at 7:51 PM
Let's now look at the weaknesses that our AI reviewer has found in our paper. Weaknesses are usually the reasons why a paper is rejected. They had better be correct.

5/
November 8, 2025 at 7:51 PM
The other three strengths are also all items that we ourselves would consider important strengths of our paper.

Interestingly, one item (highlighted in blue) is never mentioned in our paper, but something we are now actively pursuing.

4/
November 8, 2025 at 7:51 PM
Our AI reviewer establishes 4 strengths. Most human reviewers might have a hard time establishing strengths so succinctly and at this level of detail. Great!

Apart from a minor error (no executable semantics needed; only the ability to execute), the first strength looks good.

3/
November 8, 2025 at 7:51 PM
Let's find out which strengths the AI reviewer establishes for our paper. We'll look at weaknesses (those that get papers rejected) later.

The review's summary definitely hits the nail on the head: we can see the motivation and the main contributions. Nice!

2/
November 8, 2025 at 7:51 PM
For any programming task d, there exists only one (unknown) ground-truth function f.
November 8, 2025 at 10:07 AM
Thanks Toby! Yes. Our assumption (which resolves both of your cases) is that for any programming task d there exists a correct, deterministic function f that the LLM is meant to implement. That function can be unknown and there can be many (correct) implementations of f in any programming language.
November 8, 2025 at 10:03 AM
What's most interesting is that we linked pass@1, a common measure of LLM performance, to incoherence and found that a ranking of LLMs by our metric is similar to a pass@1-based ranking, which requires some explicit definition of correctness, e.g., manually provided "golden" programs.
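
A hypothetical sketch of the two rankings' ingredients (function names and the test-input interface are made up for illustration; the paper's experimental setup is the authority):

from itertools import combinations

def incoherence(programs, inputs):
    # Fraction of sampled program pairs that disagree on at least one
    # test input (needs >= 2 sampled programs, but no ground truth).
    pairs = list(combinations(programs, 2))
    return sum(any(p(x) != q(x) for x in inputs) for p, q in pairs) / len(pairs)

def pass_at_1(programs, golden, inputs):
    # Fraction of sampled programs agreeing with a golden program on
    # all test inputs (needs an explicit definition of correctness).
    return sum(all(p(x) == golden(x) for x in inputs) for p in programs) / len(programs)

Ranking LLMs by their mean incoherence across tasks then requires no golden programs at all.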
November 8, 2025 at 8:00 AM
This is easy to see. For instance, if we prompt an LLM twice to write a program that reverses a list and both programs disagree on the output for an input, then at least one of those programs must be incorrect.
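
A toy illustration of that argument (both candidate programs are invented for the example):

def rev_a(xs):
    return xs[::-1]    # correct reversal

def rev_b(xs):
    return xs[-2::-1]  # buggy: silently drops the last element

for xs in ([], [1], [1, 2, 3]):
    if rev_a(xs) != rev_b(xs):
        # The programs implement different functions, so at most one
        # of them can be the (unique) correct one.
        print("disagree on", xs, ":", rev_a(xs), "vs", rev_b(xs))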
November 8, 2025 at 8:00 AM
We define *incorrectness* as the probability that the generated program does not implement the correct function.

We define *incoherence* as the probability that any two generated programs implement different functions and prove that incoherence is an upper bound on incorrectness.
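
To make the two quantities concrete, a minimal numeric sketch (the distribution is invented; the proof is in the paper): suppose the LLM emits function A with probability 0.7, B with 0.2, C with 0.1, and A is the correct one.

probs = {"A": 0.7, "B": 0.2, "C": 0.1}  # hypothetical sampling distribution
correct = "A"

incorrectness = 1.0 - probs[correct]                    # P(sample != f) = 0.30
incoherence = 1.0 - sum(p * p for p in probs.values())  # P(two samples differ) = 0.46

assert incorrectness <= incoherence  # the bound holds here: 0.30 <= 0.46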
November 8, 2025 at 8:00 AM