Hidde Fokkema
@hiddefokkema.bsky.social
PhD candidate in Mathematical Machine Learning with @tverven | Researching formal XAI | Maths nerd | Occasional producer of electronic music

https://www.hidde-fokkema.com
I did some googling and this article has a surprisingly nice and pedagogical discussion on this, with a similar conclusion to your idea.

tinyurl.com/52a3whac

And I realise I missed the opportunity to joke that the posterior of the simpler model is "sharper", keeping with the razor theme.
Ockham's Razor and Bayesian Analysis on JSTOR
William H. Jefferys, James O. Berger, Ockham's Razor and Bayesian Analysis, American Scientist, Vol. 80, No. 1 (January-February 1992), pp. 64-72
November 3, 2025 at 5:03 PM
(2/2) If we see the complicated model and the simple model as 2 different hypothesis classes, with 2 separate priors, then the posterior for the more complicated class will be flatter than the posterior for the simple class, which is what you want, I think.
November 3, 2025 at 4:38 PM
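A toy numerical illustration of that effect (the Bayesian Ockham's razor discussed in the Jefferys–Berger article linked above): a flexible model spreads its prior over many possible datasets, so on data that both models explain well, the simpler model gets the higher marginal likelihood. The coin-flip setup, the data (11 heads in 20 flips), and the Beta(1, 1) prior are assumptions made purely for this sketch.

```python
# Toy Bayesian Ockham's razor: compare the marginal likelihood (evidence) of a
# "simple" model (fair coin, no free parameters) against a "complex" model
# (unknown bias theta with a uniform Beta(1, 1) prior) on the same data.
import numpy as np
from scipy.special import betaln

n, k = 20, 11  # hypothetical data: 11 heads in 20 flips, close to fair

# Simple model M0: P(heads) = 0.5 exactly, so the evidence of the sequence is 0.5^n.
log_evidence_simple = n * np.log(0.5)

# Complex model M1: P(heads) = theta, theta ~ Beta(1, 1).
# Evidence = integral of theta^k (1 - theta)^(n - k) dtheta = B(k + 1, n - k + 1).
log_evidence_complex = betaln(k + 1, n - k + 1)

print("log evidence M0 (fair):  ", round(log_evidence_simple, 3))
print("log evidence M1 (biased):", round(log_evidence_complex, 3))
# Bayes factor > 1 here: the evidence automatically favours the simpler model.
print("Bayes factor M0 / M1:    ", round(np.exp(log_evidence_simple - log_evidence_complex), 2))
```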
(1/2) Fair point. My point was that anything Bayesian is prior-related, so with the correct prior you could at least recover Ockham's razor, but not really derive it. My thinking is a bit different, though: in my points above the hypothesis classes are the same.

In your idea, if ..
November 3, 2025 at 4:36 PM
(7/n=7) So, in the end, you can get Ockham's razor if your prior is that simple explanations (read: explanations with fewer parameters) are more likely than complicated ones. For binary parameters you could write the prior explicitly. For real-valued parameters this becomes impossible (I am guessing).
November 3, 2025 at 4:14 PM
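One hypothetical way to write such a prior explicitly for the binary case (the 2^(-2k) weighting and the cap K_MAX are my own choices for the sketch, not something from the thread):

```python
# An explicit "simplicity prior" over models with k binary parameters: give each
# concrete k-bit parameter setting weight 2^(-2k), i.e. a uniform 2^(-k) over the
# 2^k settings times an extra 2^(-k) penalty per parameter. Model classes with
# fewer parameters then receive strictly more total prior mass.
K_MAX = 10  # cap on model size so the prior can be normalised

class_mass = {}
for k in range(K_MAX + 1):
    n_settings = 2 ** k                        # number of distinct k-bit parameter vectors
    class_mass[k] = n_settings * 2 ** (-2 * k)  # unnormalised mass of the whole class = 2^(-k)

Z = sum(class_mass.values())                   # normalising constant (finite geometric sum)
for k in range(4):
    print(f"P(class with {k} binary params) = {class_mass[k] / Z:.3f}")
```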
(6/n) Now if you really want to derive Ockham's razor, in the sense of minimal assumptions (or really, the number of parameters), you would need a prior distribution that assigns more probability mass to simple models.
November 3, 2025 at 4:11 PM
(5/n) Similarly, if β ~ Laplace(0, σ^2), then you get the Lasso objective

min ||y - <β, x>||^2 + λ||β||_1

where we now have the 1-norm as the regularization penalty. This one has the added benefit that irrelevant parameters are set exactly to 0, which resembles the original Ockham's razor principle more closely.
November 3, 2025 at 4:09 PM
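A quick sketch of that sparsity effect, assuming scikit-learn is available (the toy data with only 2 relevant features out of 10, and alpha=0.1, are illustrative choices):

```python
# Fit Lasso (L1 penalty, MAP under a Laplace prior) and Ridge (L2 penalty, MAP
# under a Gaussian prior) on data where only 2 of 10 features matter, and
# compare which coefficients get driven exactly to 0.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
beta_true = np.zeros(10)
beta_true[:2] = [3.0, -2.0]                  # only the first two features are relevant
y = X @ beta_true + 0.5 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print("lasso:", np.round(lasso.coef_, 2))    # irrelevant coefficients should be exactly 0
print("ridge:", np.round(ridge.coef_, 2))    # irrelevant coefficients are small but nonzero
```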
(4/n) writing out the posterior density and maximising it over the parameters (maximum a posteriori, i.e. MAP, inference). How much you regularise is determined by σ, and there is a direct relation between σ and λ.
November 3, 2025 at 4:07 PM
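Spelling out that relation, under the extra assumption of Gaussian observation noise with variance σ_n² and design matrix X (notation mine):

```latex
\[
\hat{\beta}_{\mathrm{MAP}}
  = \arg\min_{\beta}\left[ \tfrac{1}{2\sigma_n^2}\lVert y - X\beta\rVert^2
      + \tfrac{1}{2\sigma^2}\lVert \beta\rVert^2 \right]
  = \arg\min_{\beta}\left[ \lVert y - X\beta\rVert^2 + \lambda \lVert \beta\rVert^2 \right],
\qquad \lambda = \frac{\sigma_n^2}{\sigma^2}.
\]
```

So a tighter prior (small σ) or noisier data (large σ_n) means stronger regularisation.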
(3/n) Let's say we consider as possible models all linear models, and as complexity measure the Euclidean norm of the parameters (this is ridge regression). Then we would retrieve the optimisation problem:

min ||y - <β, x>||^2 + λ||β||^2

By assuming that β ~ N(0, σ^2) and ...
November 3, 2025 at 4:04 PM
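A small numerical sketch of that ridge objective, using its closed-form solution (the toy data and λ = 1.0 are illustrative choices):

```python
# Penalised least squares: min ||y - X beta||^2 + lam * ||beta||^2 has the
# closed-form solution beta = (X^T X + lam * I)^{-1} X^T y.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta_true + 0.3 * rng.normal(size=100)

lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# Increasing lam shrinks all coefficients towards 0, i.e. penalises complexity harder.
print(np.round(beta_ridge, 3))
```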
(2/n) In particular, this would give you the model with the fewest assumptions: if you consider 2 models that explain the data equally well, but one has fewer assumptions, and the number of assumptions is the complexity measure you use, then the simpler model wins.
November 3, 2025 at 4:01 PM
Sure! Here are some thoughts

(1/n) I would see Ockham's razor as the following optimisation problem:

min Error(data, model) + Complexity(model)

where you minimise over all models.
November 3, 2025 at 3:59 PM
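One toy way to make that optimisation concrete, with polynomial degree as a stand-in complexity measure (the penalty of 0.02 per parameter is an arbitrary choice for the sketch):

```python
# "min over models of Error(data, model) + Complexity(model)": select a
# polynomial degree by penalising the mean squared error with the parameter count.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 40)
y = 1.5 * x - 0.8 * x**2 + 0.1 * rng.normal(size=x.size)   # "true" model has degree 2

def penalised_score(degree, penalty=0.02):
    coeffs = np.polyfit(x, y, degree)
    error = np.mean((np.polyval(coeffs, x) - y) ** 2)       # Error(data, model)
    complexity = penalty * (degree + 1)                      # Complexity(model) = #parameters
    return error + complexity

best = min(range(10), key=penalised_score)
print("selected degree:", best)   # higher degrees fit marginally better but pay for the extra parameters
```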
If you see Ockham's razor as a regularization mechanism, because you optimize to fit the data while penalizing the parameters, then there are explicit connections. For example, ridge regression follows from assuming a Gaussian prior on the parameters, and Lasso regression follows from a Laplace prior.
November 3, 2025 at 3:29 PM
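For completeness, that explicit connection in compact form (a standard MAP derivation; the Laplace scale b and the rest of the notation are mine):

```latex
\[
\hat{\beta}_{\mathrm{MAP}}
  = \arg\max_{\beta}\, p(\beta \mid \mathrm{data})
  = \arg\min_{\beta}\Big[ -\log p(\mathrm{data}\mid\beta) - \log p(\beta) \Big]
\]
% Gaussian prior \beta_i \sim \mathcal{N}(0,\sigma^2):
%   -\log p(\beta) = \tfrac{1}{2\sigma^2}\lVert\beta\rVert_2^2 + \mathrm{const} \;\Rightarrow\; \text{ridge (L2) penalty}
% Laplace prior \beta_i \sim \mathrm{Laplace}(0, b) with scale b:
%   -\log p(\beta) = \tfrac{1}{b}\lVert\beta\rVert_1 + \mathrm{const} \;\Rightarrow\; \text{lasso (L1) penalty}
```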
Oh, and the timing of the Q2B conference being this week probably also factors in, so they can hype it a bit more there.
December 10, 2024 at 9:25 PM
My guess would be because the Nature version of the article was just published?
December 10, 2024 at 9:22 PM
Aren't these dual numbers? I think Julia has some autodiff packages based on this idea.
November 29, 2024 at 3:34 PM
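A minimal sketch of the dual-number idea (forward-mode autodiff) in Python; Julia's ForwardDiff.jl is, as far as I know, built on exactly this, and the class and function names below are made up for the example:

```python
# Forward-mode autodiff with dual numbers: carry (value, derivative) pairs and
# propagate them through arithmetic using the rule eps^2 = 0.
from dataclasses import dataclass

@dataclass
class Dual:
    val: float   # function value
    eps: float   # derivative part (coefficient of the nilpotent unit eps)

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(float(other), 0.0)
        return Dual(self.val + other.val, self.eps + other.eps)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(float(other), 0.0)
        # product rule: (a + a' eps)(b + b' eps) = ab + (a b' + a' b) eps, since eps^2 = 0
        return Dual(self.val * other.val, self.val * other.eps + self.eps * other.val)

    __rmul__ = __mul__

def derivative(f, x):
    # Seed the derivative part with 1.0 and read off the eps coefficient.
    return f(Dual(float(x), 1.0)).eps

f = lambda x: 3 * x * x + 2 * x + 1
print(derivative(f, 4.0))   # 6 * 4 + 2 = 26.0
```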