Csaba Szepesvari
@skiandsolve.bsky.social
⛷️ ML Theorist carving equations and mountain trails | 🚴‍♂️ Biker, Climber, Adventurer | 🧠 Reinforcement Learning: Always seeking higher peaks, steeper walls and better policies.
https://ualberta.ca/~szepesva
Reposted by Csaba Szepesvari
Stay tuned for updates by following us or the organizers:
Alberto Metelli, @antoine-mln.bsky.social, Dirk van der Hoeven, Felix Berkenkamp, Francesco Trovò, @gioramponi.bsky.social, Marco Mussi, @skiandsolve.bsky.social, and @tillfreihaut.bsky.social
ARLET
arlet-workshop.github.io
July 28, 2025 at 5:46 PM
...actually, not only to match standard notation, but also to be able to speak about the loss (= log-loss) used to train today's LLMs.
July 10, 2025 at 3:19 PM
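For concreteness, the log-loss meant here, written for an autoregressive model p_theta over token sequences (standard definition; the notation is mine, not the paper's):

```latex
% Log-loss (negative log-likelihood) of an autoregressive model p_\theta
% on a sequence x_1, ..., x_T; this is the training loss of today's LLMs.
\ell(\theta) = -\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_1, \dots, x_{t-1}\right)
```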
No, it is not information retrieval. It is deducing new things from old things. You can do this by running a blind, breadth-first (unintelligent) search that produces all proofs of all possible statements. You just don't want errors. But this is not retrieval; it is computation.
July 10, 2025 at 5:21 AM
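A minimal sketch of what such a blind search could look like, assuming statements are hashable and each inference rule maps a pair of known statements to finitely many new ones (all names here are illustrative, not from any library):

```python
from collections import deque

def enumerate_theorems(axioms, rules, max_steps=10_000):
    """Blind breadth-first search over derivations: repeatedly apply every
    inference rule to every pair of already-derived statements, with no
    heuristic guidance whatsoever. Slow, but every output is a theorem."""
    known = set(axioms)
    frontier = deque(axioms)
    steps = 0
    while frontier and steps < max_steps:
        s = frontier.popleft()
        steps += 1
        for rule in rules:
            for t in list(known):
                for new in rule(s, t):  # a rule maps two statements to derived ones
                    if new not in known:
                        known.add(new)
                        frontier.append(new)
    return known
```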
Of course approximations are useful. The paper is narrowly focused on deductive reasoning, which seems to require the exactness we talk about. The point is that regardless of whether you use quantum mechanics or Newtonian mechanics, you don't want your derivations to be mistake-ridden.
July 10, 2025 at 5:19 AM
Worst-case vs. average case: yes!
But I would not necessarily connect these to minimax vs. Bayes.
July 10, 2025 at 5:14 AM
Yeah, admittedly, not a focus point of the paper. How about this: if the model produces a single response, the loss is the zero-one loss. Then the model had better choose the label with the highest probability, which is OK. The point of having mu: not much, just matching standard notation.
July 10, 2025 at 5:13 AM
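Spelled out, with mu(.|x) the conditional label distribution: the expected zero-one loss of answering y-hat is 1 - mu(y-hat|x), so the single best response is

```latex
% Bayes-optimal single response under the zero-one loss:
\hat{y}(x) \in \operatorname*{arg\,max}_{y} \mu(y \mid x),
\qquad \text{with risk } 1 - \max_{y} \mu(y \mid x).
```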
I am curious about these examples.. (and yes, I can construct a few, too, but I want to add more)
July 10, 2025 at 4:51 AM
No, this is not correct: learning 1[A>B], interestingly, has the same complexity (provably). This is because 1[A>B] is in the "orbit" of 1[A>=B]. So a symmetric learner that is being taught 1[A>B] needs to figure out that it is not being taught 1[A>=B].
July 10, 2025 at 4:50 AM
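A quick toy check (mine, not from the paper) of why the two targets are so hard to tell apart: for uniform m-bit integers they disagree only on ties, i.e., on a 2^-m fraction of inputs:

```python
import random

def frac_disagree(m, n_samples=100_000, seed=0):
    """Estimate how often 1[A > B] and 1[A >= B] disagree for uniform
    m-bit integers A, B. They differ exactly on ties (A == B), so the
    disagreement probability is 2^-m -- vanishing, yet the two targets
    are provably equally hard to learn."""
    rng = random.Random(seed)
    hi = 2**m - 1
    hits = 0
    for _ in range(n_samples):
        a, b = rng.randint(0, hi), rng.randint(0, hi)
        hits += (a > b) != (a >= b)  # the two labels disagree iff a == b
    return hits / n_samples

print(frac_disagree(4))   # about 1/16
print(frac_disagree(10))  # about 1/1024
```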
Maybe. I am asking for much less here from the machines. I am asking them just to be correct (or stay silent). No intelligence, just good old-fashioned computation.
July 9, 2025 at 2:44 AM
the solution is found..
July 9, 2025 at 2:42 AM
Yes, transformers do not have "working memory". Also, I don't believe that using them in AR mode is powerful enough for challenging problems. In a way, without "working memory" or an external "loop", we are saying the model should solve problems by free association ad infinitum, or at least until
July 9, 2025 at 2:42 AM
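A minimal sketch of the kind of external loop meant here; `model.generate` and `check` are hypothetical stand-ins for an AR sampler and an exact verifier, not any real API:

```python
def solve_with_external_loop(model, problem, check, max_attempts=1000):
    """One way to supply the missing outer 'loop': repeatedly sample
    candidate solutions from an autoregressive model and keep only those
    that pass an exact checker, stopping once a solution is found."""
    for _ in range(max_attempts):
        candidate = model.generate(problem)  # free-associate one candidate
        if check(problem, candidate):        # exact, error-free verification
            return candidate
    return None  # stay silent rather than return an unchecked answer
```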
On the paper: interesting, but indeed there is little in common. On the problem studied in the paper: would not a slightly more general statistical framework solve your problem? I.e., measure error differently than through the prediction loss (AR models: parameters, spectral measure, etc.).
July 9, 2025 at 2:39 AM
Yeah, I don't see the exactness happening that much on its own through statistical learning, neither experimentally nor theoretically. We have an example illustrating this: use the uniform distribution for good coverage, and teach transformers to compare m-bit integers using GD. You need 2^m examples.
July 9, 2025 at 2:39 AM
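A sketch of that setup as I read it (my reconstruction, not the paper's code): uniform coverage of the m-bit comparison task, with the 1[A >= B] label convention being my assumption:

```python
import random

def comparison_dataset(m, n, seed=0):
    """Sample n training pairs for the comparison task: inputs are two
    uniform m-bit integers (as bit strings), the label is 1[A >= B].
    Uniform sampling gives good coverage, yet gradient descent still
    needs on the order of 2^m such examples to learn the task exactly."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        a, b = rng.getrandbits(m), rng.getrandbits(m)
        x = (format(a, f"0{m}b"), format(b, f"0{m}b"))
        data.append((x, int(a >= b)))
    return data
```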
Yeah, we cite this and this was a paper that got me started on this project!
July 9, 2025 at 2:32 AM
Glad to see someone remembers these:)
April 4, 2025 at 2:05 AM
should be distinguished. The reason they should not be is that they are indistinguishable. So at least those need to be collapsed. So yes, one can start with redundant models, where it will appear that you could have epistemic uncertainty, but this is easy to rule out. 2/2
March 20, 2025 at 10:49 PM
I guess with a worst-case hat on, we just all die:) In other words, indeed, the distinction is useful inasmuch as the modelling assumptions are valid. And there, the mixture of two Diracs over 0 and 1 is actually a bad example, because that says that two models that are identical as distributions 1/x
March 20, 2025 at 10:47 PM
I guess I stop here:) 5/5
March 20, 2025 at 10:43 PM
Well, yes, to the degree that the model you use correctly reflects what's going on. Example: drug trials with randomized patient allocation, where the result is effectiveness. The meaning of aleatoric and epistemic uncertainty should be clear there, and they help with explaining the outcomes of the trial. 4/x
March 20, 2025 at 10:41 PM
One observes 1: there is epistemic uncertainty (the model could be the first or the second). Of course, nothing is ever black and white like this. And we talk about models here. Models are.. made up.. Usual blurb about the usefulness of models. Should you care about this distinction? 3/x
March 20, 2025 at 10:35 PM
Epistemic uncertainty refers to whether given the data (and prior information), we can surely identify the data generating model. Example: Model class has two distributions; one has support {0,1}, the other has support {1}. One observes 0. There is no epistemic uncertainty. 2/X
March 20, 2025 at 10:33 PM
I don't get this:
In the context of this terminology, data comes from a model. Aleatoric uncertainty refers to the case when this model is not a Dirac! In the second case, the model is a mixture of two Diracs. This is not a Dirac. Hence, there is aleatoric uncertainty. 1/X
March 20, 2025 at 10:31 PM
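To make the thread's example concrete, a toy Bayesian reading (the fair-coin choice for the first model and the uniform prior are my assumptions, not part of the thread):

```python
def posterior(prior, likelihoods, obs):
    """Bayes update over a finite model class given one observation."""
    weights = [p * lik[obs] for p, lik in zip(prior, likelihoods)]
    z = sum(weights)
    return [w / z for w in weights]

# Model A has support {0, 1} (here: a fair coin); model B has support {1}.
lik_A = {0: 0.5, 1: 0.5}
lik_B = {0: 0.0, 1: 1.0}
prior = [0.5, 0.5]

print(posterior(prior, [lik_A, lik_B], 0))  # [1.0, 0.0]: model identified, no epistemic uncertainty
print(posterior(prior, [lik_A, lik_B], 1))  # [1/3, 2/3]: both models remain possible, epistemic uncertainty
```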