Andrés Corrada
andrescorrada.bsky.social
Scientist, Inventor, author of the NTQR Python package for AI safety through formal verification of unsupervised evaluations. On a mission to eliminate Majority Voting from AI systems. E Pluribus Unum.
count given true label. Yes, there are; algebraically, they are in the appendix of my paper. The illustration at the top of the post is the second term in these equations, Q_label_true. So the full visual proof will need illustrations for every one of those sums in the very same three squares.
October 13, 2025 at 12:26 PM
I'm working on visual demonstrations of the logical computations for the M=2 axioms of unsupervised evaluation for classifiers. This example is for 3 labels (a, b, c). Each illustration shows the decomposition that must be true for any pair of classifiers given an assumed number of questions per label.
October 13, 2025 at 12:26 PM
If we are at (Q_a=8, Q_b=9, Q_c=8) for a 3-label test and observe a classifier saying (R_a=3, R_b=10, R_c=12), you can start parsing out the logical consequences. For example, there is no way it can be better than 75% on the c-label. All of those bespoke deductions are the M=1 axioms. arxiv.org/abs/2510.00821
October 13, 2025 at 11:56 AM
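A minimal Python sketch of the simplest counting deductions available at a fixed Q-point, using the numbers from the post above. This is not the NTQR API, and it only shows the coarsest bounds; the full 75% deduction uses the complete M=1 axioms from the paper.

```python
# Hypothetical sketch: simplest counting deductions for one classifier
# on a 3-label test. Q[l] = assumed number of questions with true label l;
# R[l] = observed number of l-responses.
Q = {"a": 8, "b": 9, "c": 8}
R = {"a": 3, "b": 10, "c": 12}

# A classifier cannot be correct on label l more times than it said l,
# nor more times than l actually appears: max correct on l is min(R_l, Q_l).
max_correct = {l: min(R[l], Q[l]) for l in Q}
print(max_correct)  # {'a': 3, 'b': 9, 'c': 8}

# Any l-responses beyond Q_l are necessarily wrong: at least R_l - Q_l of
# them landed on questions whose true label is something else.
forced_errors = {l: max(R[l] - Q[l], 0) for l in Q}
print(forced_errors)  # {'a': 0, 'b': 1, 'c': 4}
```

So at this Q-point the classifier can be correct on at most 3 of the 8 a-questions, and at least 4 of its 12 c-responses must be wrong.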
So all computations of possible evaluations given observations occur at the same fixed Q-point. Note that the logic cannot tell us anything about what this Q-point could be. Other knowledge or assumptions must be invoked for that - science and engineering.
The M=1 axioms then come into play.
October 13, 2025 at 11:56 AM
The plot also illustrates the "atomic" logical operations in unsupervised evaluation. We cannot know anything about the answer key. Hence, all possible values of a statistic of the answer key must be possible. For three labels, these are the points (Q_a, Q_b, Q_c), where Q_a is the number of 'a' questions, and so on.
October 13, 2025 at 11:56 AM
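A minimal sketch of what enumerating those atomic points looks like; the function name is illustrative, not part of any package.

```python
from itertools import product

# Sketch: enumerate every possible answer-key statistic (Q_a, Q_b, Q_c)
# for a 3-label test with Q questions. These are the atomic points at
# which all unsupervised-evaluation logic is carried out.
def q_points(Q, labels=3):
    """All tuples of non-negative label counts summing to Q."""
    return [p for p in product(range(Q + 1), repeat=labels) if sum(p) == Q]

points = q_points(10)
print(len(points))  # 66 points for Q=10 and three labels: C(10+2, 2)
```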
For three or more labels, there are more error modes than correct responses, so there are many ways to be X% correct on any given label. In binary classification, there is only one. This is shown here for three-label classification by three LLMs-as-Judges grading the output of two other LLMs.
October 13, 2025 at 11:56 AM
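A small sketch of the counting behind "more error modes than correct responses": the errors on a label must be distributed among the R-1 wrong labels, and for binary classification there is exactly one way to do that. The function here is illustrative only.

```python
from math import comb

# Sketch: number of ways a classifier can score k correct out of Q_l
# questions whose true label is l. The Q_l - k errors are distributed
# among the R - 1 wrong labels (compositions of the error count into
# R - 1 non-negative parts); for binary (R=2) there is exactly one way.
def error_patterns(Q_l: int, k: int, R: int) -> int:
    e = Q_l - k  # number of errors on label l
    return comb(e + (R - 2), R - 2)

print(error_patterns(8, 6, 2))  # 1: binary errors have a single mode
print(error_patterns(8, 6, 3))  # 3: two errors split over two wrong labels
```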
In general, knowing and applying the M=1 axioms for R-label classification gives you a reduction in uncertainty that goes as (1 - 1 / Q^(R-1)). The M=1 axioms are explained in my latest paper arxiv.org/abs/2510.00821
October 10, 2025 at 6:00 PM
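The stated reduction factor is easy to compute directly. A one-function sketch, taking the formula in the post at face value:

```python
# Sketch of the stated uncertainty-reduction factor for R-label
# classification on a test of Q questions: (1 - 1/Q**(R-1)).
# The factor approaches 1 as either Q or R grows.
def reduction(Q: int, R: int) -> float:
    return 1 - 1 / Q ** (R - 1)

print(reduction(10, 2))  # 0.9  for a binary test with Q=10
print(reduction(10, 3))  # 0.99 for the same test size with three labels
```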
The central question in any logic of unsupervised evaluation is: what are the group evaluations logically consistent with how we observe experts agreeing/disagreeing on a test for which we have no answer key? This series of plots for a single binary classifier that labeled (Q=10) items shows how.
October 10, 2025 at 6:00 PM
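A hedged sketch of the single-classifier binary case shown in those plots. At an assumed point Q_a (so Q_b = Q - Q_a), the observed count of 'a'-responses decomposes as R_a = (correct on a) + (b-questions answered as a); enumerating the evaluations that satisfy this decomposition is the whole computation. The function is illustrative, not the NTQR implementation.

```python
# Sketch: all single-classifier evaluations (Q_a, correct_a, correct_b)
# logically consistent with observing R_a 'a'-responses on a binary test
# of Q questions. The binary M=1 axiom at each assumed Q_a is
#   R_a = correct_a + (Q_b - correct_b).
def consistent_evals(Q: int, R_a: int):
    evals = []
    for Q_a in range(Q + 1):
        Q_b = Q - Q_a
        for correct_a in range(Q_a + 1):
            errors_b_as_a = R_a - correct_a  # forced by the axiom
            if 0 <= errors_b_as_a <= Q_b:
                evals.append((Q_a, correct_a, Q_b - errors_b_as_a))
    return evals

# Q=10 items, classifier said 'a' 6 times:
evals = consistent_evals(10, 6)
print(len(evals), evals[:3])
```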
In any debate, it would help if the participants had some measure of the reliability of the opinions of others and of their own. In addition, logical consistency can act as a way to warn us that at least one member of the debate is actually violating our accuracy specification.
October 10, 2025 at 12:29 PM
Any dialogue framework would benefit from understanding the logic of unsupervised evaluation - what are the group evaluations logically consistent with how we observe experts agreeing/disagreeing in their decisions? Evaluate, then collectively decide.
October 10, 2025 at 12:29 PM
When no one in the room can be trusted, who judges the judges can be partially answered by using the principle of logical consistency between disagreeing experts. The test itself, i.e. its answer key, should also be a suspect in unlabeled settings. arxiv.org/abs/2510.00821
October 10, 2025 at 12:20 PM
Thank you. The logic is so basic that I believe it precedes information-theoretic assumptions, but I may be wrong. There is a geometry to the possible set of evaluations. In the space of label accuracies and prevalences, the axioms become polynomials. Algebraic geometry is the tool to use then.
October 10, 2025 at 12:14 PM
Someday it will seem strange that seminars on the theoretical foundations of machine learning did not discuss the logic of unsupervised evaluation and its axioms. Simple linear relations that anyone with knowledge of high-school math can comprehend. arxiv.org/abs/2510.00821
October 9, 2025 at 5:57 PM
This example actually does not require any iid assumptions. It comes solely from the single-classifier set of R axioms.
So again, @beenwrekt.bsky.social, the observation from decision theory does not apply to evaluation theory. Universal laws without iid assumptions that say more is better do exist.
October 9, 2025 at 5:06 PM
There is an algebraic solution to the problem of inferring the performance of three independent classifiers that labeled Q items. It is the evaluation converse to Condorcet's theorem. ntqr.readthedocs.io/en/latest/no...
October 9, 2025 at 5:00 PM
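A sketch of the moment system that such an algebraic solution inverts, for the binary case. This is not the NTQR API: under the independence assumption, the frequency of every three-classifier voting pattern is a polynomial in the unknown prevalence and per-label accuracies, and solving those polynomial equations from observed frequencies is the evaluation converse of Condorcet.

```python
from itertools import product

# Sketch (binary labels 'a'/'b'): probability of each voting pattern of
# three independent classifiers, as a polynomial in the prevalence p of
# label 'a' and the per-label accuracies acc_a[i], acc_b[i].
def pattern_probs(p, acc_a, acc_b):
    probs = {}
    for votes in product("ab", repeat=3):
        pr_a = p          # contribution from true-label-'a' questions
        pr_b = 1 - p      # contribution from true-label-'b' questions
        for i, v in enumerate(votes):
            pr_a *= acc_a[i] if v == "a" else 1 - acc_a[i]
            pr_b *= (1 - acc_b[i]) if v == "a" else acc_b[i]
        probs[votes] = pr_a + pr_b
    return probs

# Hypothetical true values; the algebraic solution runs this map backwards.
probs = pattern_probs(0.4, [0.8, 0.7, 0.9], [0.6, 0.75, 0.85])
print(probs[("a", "a", "a")])  # 0.4*0.504 + 0.6*0.015 = 0.2106
```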
Logical consistency between disagreeing experts can be used to keep us safer when we use their noisy decisions. My latest on this overlooked tool to tame AIs -- submitted to IEEE SaTML 2026 in Berlin. arxiv.org/abs/2510.00821
October 2, 2025 at 12:08 PM
Formalization is considered the sine qua non of maturity in many fields. How far could you go with your work, @far.ai , if you understood and used the logic of unsupervised evaluation I have developed? How far could you inspire others? Anybody can try this logic. ntqr.readthedocs.io/en/latest
October 1, 2025 at 9:12 PM
Verification of evaluations for classifiers in unlabeled settings exists. You can read about it or try it now. ntqr.readthedocs.io/en/latest
October 1, 2025 at 9:05 PM
@far.ai is not the only institution or person clamoring for verification of AI systems. The authors of this paper, who include @yoshuabengio.bsky.social, are also tone-deaf to the existence of a logic for unsupervised evaluation that uses no probability. There is no "towards" in the logic.
October 1, 2025 at 9:03 PM
And we need @far.ai to understand that the verification of evaluations of classifiers exists. This is my work on the logic of unsupervised evaluation for classifiers. So simple anybody who understands high-school math will get it. ntqr.readthedocs.io/en/latest
October 1, 2025 at 9:03 PM
I was invited to give a talk at @mcgilluniversity.bsky.social this May at Trusted and Automated Decision Making (TADM). One of the other speakers, Bill Farmer, wrote this book on Type Theory. He told me a story about the quixotic quest for the universal code verifier he was involved in.
September 28, 2025 at 2:55 PM
Everything about carrying out statistical evaluation of experts taking any multiple choice exam is complete. These pictures show one aspect of that completeness -- we can enumerate all the possible evaluations for any number of experts on any test, on any subject given summaries of their decisions.
September 28, 2025 at 2:43 PM
There have been calls for using formal verification of software programs as a model for formally verifying AI systems, for example by @yoshuabengio.bsky.social and his co-authors. But, as you know, formal verification of code is a non-trivial task. A universal verifier for arbitrary code does not exist.
September 28, 2025 at 2:43 PM
@adolfont.github.io, it may interest you to know that we can take concepts from theorem provers to formalize the computation of the group evaluations that are logically consistent with how we observe experts, human or robotic, agreeing/disagreeing on a test. ntqr.readthedocs.io/en/latest
September 28, 2025 at 2:30 PM
You can formalize evaluations of classifiers or any expert answering a multiple-choice exam. It is mathematically complete and relies on one logical property alone - logical consistency between disagreeing experts.
September 27, 2025 at 7:39 PM