Andrés Corrada
andrescorrada.bsky.social
Scientist, Inventor, author of the NTQR Python package for AI safety through formal verification of unsupervised evaluations. On a mission to eliminate Majority Voting from AI systems. E Pluribus Unum.
count given true label. Yes, there are; algebraically, they are in the appendix of my paper. The illustration at the top of the post is the second term in these equations, Q_label_true. So the full visual proof will need illustrations for every one of those sums in the very same three squares.
October 13, 2025 at 12:26 PM
I'm working on visual demonstrations of the logical computations for the M=2 axioms of unsupervised evaluation for classifiers. This example is for 3 labels (a, b, c). Each illustration shows the decomposition that must be true for any pair of classifiers given an assumed number of questions per label.
October 13, 2025 at 12:26 PM
If we are at (Q_a=8, Q_b=9, Q_c=8) for a 3-label test and observe a classifier saying (R_a=3, R_b=10, R_c=12), you can start parsing out the logical consequences. For example, there is no way it can be better than 75% on the c-label. All of those bespoke deductions are the M=1 axioms. arxiv.org/abs/2510.00821
October 13, 2025 at 11:56 AM
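A minimal Python sketch of the simplest counting deductions available at a fixed Q-point, using the numbers from the post above. This is not the NTQR API, and it only shows the coarsest bounds; the full 75% deduction uses the complete M=1 axioms from the paper.

```python
# Hypothetical sketch: simplest counting deductions for one classifier
# on a 3-label test. Q[l] = assumed number of questions with true label l;
# R[l] = observed number of l-responses.
Q = {"a": 8, "b": 9, "c": 8}
R = {"a": 3, "b": 10, "c": 12}

# A classifier cannot be correct on label l more times than it said l,
# nor more times than l actually appears: max correct on l is min(R_l, Q_l).
max_correct = {l: min(R[l], Q[l]) for l in Q}
print(max_correct)  # {'a': 3, 'b': 9, 'c': 8}

# Any l-responses beyond Q_l are necessarily wrong: at least R_l - Q_l of
# them landed on questions whose true label is something else.
forced_errors = {l: max(R[l] - Q[l], 0) for l in Q}
print(forced_errors)  # {'a': 0, 'b': 1, 'c': 4}
```

So at this Q-point the classifier can be correct on at most 3 of the 8 a-questions, and at least 4 of its 12 c-responses must be wrong.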
So all computations of possible evaluations given observations occur at the same fixed Q-point. Note that the logic cannot tell us anything about what this Q-point could be. Other knowledge or assumptions must be invoked for that - science and engineering.
The M=1 axioms then come into play.
October 13, 2025 at 11:56 AM
The plot also illustrates the "atomic" logical operations in unsupervised evaluation. We cannot know anything about the answer key. Hence, all possible values of a statistic of the answer key must be possible. For three labels, these are the points (Q_a, Q_b, Q_c), where Q_a is the number of 'a' questions, and so on.
October 13, 2025 at 11:56 AM
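A minimal sketch of what enumerating those atomic points looks like; the function name is illustrative, not part of any package.

```python
from itertools import product

# Sketch: enumerate every possible answer-key statistic (Q_a, Q_b, Q_c)
# for a 3-label test with Q questions. These are the atomic points at
# which all unsupervised-evaluation logic is carried out.
def q_points(Q, labels=3):
    """All tuples of non-negative label counts summing to Q."""
    return [p for p in product(range(Q + 1), repeat=labels) if sum(p) == Q]

points = q_points(10)
print(len(points))  # 66 points for Q=10 and three labels: C(10+2, 2)
```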
For three or more labels, there are more error modes than correct responses, so there are many ways to be X% correct on any given label. In binary classification, there is only one. This is shown here for three-label classification by three LLMs-as-Judges grading the output of two other LLMs.
October 13, 2025 at 11:56 AM
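A small sketch of the counting behind "more error modes than correct responses": the errors on a label must be distributed among the R-1 wrong labels, and for binary classification there is exactly one way to do that. The function here is illustrative only.

```python
from math import comb

# Sketch: number of ways a classifier can score k correct out of Q_l
# questions whose true label is l. The Q_l - k errors are distributed
# among the R - 1 wrong labels (compositions of the error count into
# R - 1 non-negative parts); for binary (R=2) there is exactly one way.
def error_patterns(Q_l: int, k: int, R: int) -> int:
    e = Q_l - k  # number of errors on label l
    return comb(e + (R - 2), R - 2)

print(error_patterns(8, 6, 2))  # 1: binary errors have a single mode
print(error_patterns(8, 6, 3))  # 3: two errors split over two wrong labels
```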
In general, knowing and applying the M=1 axioms for R-label classification gives you a reduction in uncertainty that goes as (1 - 1 / Q^(R-1)). The M=1 axioms are explained in my latest paper arxiv.org/abs/2510.00821
October 10, 2025 at 6:00 PM
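The stated reduction factor is easy to compute directly. A one-function sketch, taking the formula in the post at face value:

```python
# Sketch of the stated uncertainty-reduction factor for R-label
# classification on a test of Q questions: (1 - 1/Q**(R-1)).
# The factor approaches 1 as either Q or R grows.
def reduction(Q: int, R: int) -> float:
    return 1 - 1 / Q ** (R - 1)

print(reduction(10, 2))  # 0.9  for a binary test with Q=10
print(reduction(10, 3))  # 0.99 for the same test size with three labels
```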
The central question in any logic of unsupervised evaluation is: what are the group evaluations logically consistent with how we observe experts agreeing/disagreeing on a test for which we have no answer key? This series of plots for a single binary classifier that labeled (Q=10) items shows how.
October 10, 2025 at 6:00 PM
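A hedged sketch of the single-classifier binary case shown in those plots. At an assumed point Q_a (so Q_b = Q - Q_a), the observed count of 'a'-responses decomposes as R_a = (correct on a) + (b-questions answered as a); enumerating the evaluations that satisfy this decomposition is the whole computation. The function is illustrative, not the NTQR implementation.

```python
# Sketch: all single-classifier evaluations (Q_a, correct_a, correct_b)
# logically consistent with observing R_a 'a'-responses on a binary test
# of Q questions. The binary M=1 axiom at each assumed Q_a is
#   R_a = correct_a + (Q_b - correct_b).
def consistent_evals(Q: int, R_a: int):
    evals = []
    for Q_a in range(Q + 1):
        Q_b = Q - Q_a
        for correct_a in range(Q_a + 1):
            errors_b_as_a = R_a - correct_a  # forced by the axiom
            if 0 <= errors_b_as_a <= Q_b:
                evals.append((Q_a, correct_a, Q_b - errors_b_as_a))
    return evals

# Q=10 items, classifier said 'a' 6 times:
evals = consistent_evals(10, 6)
print(len(evals), evals[:3])
```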
In any debate, it would help if the participants had some measure of the reliability of the opinions of others and of their own. In addition, logical consistency can act as a way to warn us that at least one member of the debate is actually violating our accuracy specification.
October 10, 2025 at 12:29 PM
Any dialogue framework would benefit from understanding the logic of unsupervised evaluation - what are the group evaluations logically consistent with how we observe experts agreeing/disagreeing in their decisions? Evaluate, then collectively decide.
October 10, 2025 at 12:29 PM
When no one in the room can be trusted, who judges the judges can be partially answered by using the principle of logical consistency between disagreeing experts. The test itself, i.e. its answer key, should also be a suspect in unlabeled settings. arxiv.org/abs/2510.00821
October 10, 2025 at 12:20 PM
Thank you. The logic is so basic that I believe it precedes information-theoretic assumptions, but I may be wrong. There is a geometry to the possible set of evaluations. In the space of label accuracies and prevalences, the axioms become polynomials. Algebraic geometry is the tool to use then.
October 10, 2025 at 12:14 PM
Someday it will seem strange that seminars on the theoretical foundations of machine learning did not discuss the logic of unsupervised evaluation and its axioms. Simple linear relations that anyone with knowledge of high-school math can comprehend. arxiv.org/abs/2510.00821
October 9, 2025 at 5:57 PM
This example actually does not require any iid assumptions. It comes solely from the single-classifier set of R axioms.
So again, @beenwrekt.bsky.social, the observation from decision theory does not apply to evaluation theory. Universal laws without iid assumptions that say more is better do exist.
October 9, 2025 at 5:06 PM
There is an algebraic solution to the problem of inferring the performance of three independent classifiers that labeled Q items. It is the evaluation converse to Condorcet's theorem. ntqr.readthedocs.io/en/latest/no...
October 9, 2025 at 5:00 PM
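A sketch of the moment system that such an algebraic solution inverts, for the binary case. This is not the NTQR API: under the independence assumption, the frequency of every three-classifier voting pattern is a polynomial in the unknown prevalence and per-label accuracies, and solving those polynomial equations from observed frequencies is the evaluation converse of Condorcet.

```python
from itertools import product

# Sketch (binary labels 'a'/'b'): probability of each voting pattern of
# three independent classifiers, as a polynomial in the prevalence p of
# label 'a' and the per-label accuracies acc_a[i], acc_b[i].
def pattern_probs(p, acc_a, acc_b):
    probs = {}
    for votes in product("ab", repeat=3):
        pr_a = p          # contribution from true-label-'a' questions
        pr_b = 1 - p      # contribution from true-label-'b' questions
        for i, v in enumerate(votes):
            pr_a *= acc_a[i] if v == "a" else 1 - acc_a[i]
            pr_b *= (1 - acc_b[i]) if v == "a" else acc_b[i]
        probs[votes] = pr_a + pr_b
    return probs

# Hypothetical true values; the algebraic solution runs this map backwards.
probs = pattern_probs(0.4, [0.8, 0.7, 0.9], [0.6, 0.75, 0.85])
print(probs[("a", "a", "a")])  # 0.4*0.504 + 0.6*0.015 = 0.2106
```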
Logical consistency between disagreeing experts can be used to keep us safer when we use their noisy decisions. My latest on this overlooked tool to tame AIs -- submitted to IEEE SaTML 2026 in Berlin. arxiv.org/abs/2510.00821
October 2, 2025 at 12:08 PM
Formalization is considered the sine qua non of maturity in many fields. How far could you go with your work, @far.ai , if you understood and used the logic of unsupervised evaluation I have developed? How far could you inspire others? Anybody can try this logic. ntqr.readthedocs.io/en/latest
October 1, 2025 at 9:12 PM
Verification of evaluations for classifiers in unlabeled settings exists. You can read about it or try it now. ntqr.readthedocs.io/en/latest
October 1, 2025 at 9:05 PM
@far.ai is not the only institution or person clamoring for verification of AI systems. The authors of this paper, who include @yoshuabengio.bsky.social, are also tone-deaf to the existence of a logic for unsupervised evaluation that uses no probability. There is no "towards" in the logic.
October 1, 2025 at 9:03 PM
And we need @far.ai to understand that the verification of evaluations of classifiers exists. This is my work on the logic of unsupervised evaluation for classifiers. So simple anybody who understands high-school math will get it. ntqr.readthedocs.io/en/latest
October 1, 2025 at 9:03 PM
I was invited to give a talk at @mcgilluniversity.bsky.social this May at Trusted and Automated Decision Making (TADM). One of the other speakers, Bill Farmer, wrote this book on Type Theory. He told me a story about the quixotic quest for the universal code verifier he was involved in.
September 28, 2025 at 2:55 PM
Everything about carrying out statistical evaluation of experts taking any multiple choice exam is complete. These pictures show one aspect of that completeness -- we can enumerate all the possible evaluations for any number of experts on any test, on any subject given summaries of their decisions.
September 28, 2025 at 2:43 PM
There have been calls for using formal verification of software programs as a model for formally verifying AI systems, for example by @yoshuabengio.bsky.social and his co-authors. But, as you know, formal verification of code is a non-trivial task. A universal verifier for arbitrary code does not exist.
September 28, 2025 at 2:43 PM
@adolfont.github.io, it may interest you to know that we can take concepts from theorem provers to formalize the computation of the group evaluations that are logically consistent with how we observe experts, human or robotic, agreeing/disagreeing on a test. ntqr.readthedocs.io/en/latest
September 28, 2025 at 2:30 PM
You can formalize evaluations of classifiers or any expert answering a multiple-choice exam. It is mathematically complete and relies on one logical property alone - logical consistency between disagreeing experts.
September 27, 2025 at 7:39 PM