anish144.bsky.social
@anish144.bsky.social
PhD researcher in Machine Learning at Imperial College. Visiting at the University of Oxford.

Interested in all things involving causality and Bayesian machine learning. Recently I have also been interested in scaling theory.

https://anish144.github.io/
We will be presenting this work at #ICML2025 and are happy to discuss it further.

🗓️: Tue 15 Jul 4:30 p.m. PDT
📍: East Exhibition Hall A-B #E-1912

Joint 1st author: @ruby-sedgwick.bsky.social.
With: Avinash Kori, Ben Glocker, @mvdw.bsky.social.

🧵14/14
July 10, 2025 at 6:07 PM
Finally, we note that the flexibility of our model comes at the cost of a harder optimisation problem. However, random restarts and keeping the model with the highest score reliably improve the structure recovery metrics (a common practice with GPs).

🧵13/14
July 10, 2025 at 6:07 PM
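Not from the thread: a minimal sketch of the restart strategy described in the post above. `fit_model` is a hypothetical training routine that returns a fitted model and its final score (e.g. an ELBO / marginal-likelihood estimate).

```python
import numpy as np

def fit_with_restarts(fit_model, data, n_restarts=10, seed=0):
    """Run `fit_model` from several random initialisations and keep the best run.

    `fit_model(data, init_seed)` is assumed to return (fitted_model, score),
    where score is the final objective value of that run.
    """
    rng = np.random.default_rng(seed)
    best_model, best_score = None, -np.inf
    for _ in range(n_restarts):
        model, score = fit_model(data, init_seed=int(rng.integers(1 << 31)))
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score
```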
We also test our method on semi-synthetic data generated from the SynTReN gene regulatory network simulator.

🧵12/14
July 10, 2025 at 6:07 PM
When data are generated from an identifiable model (an ANM), our more flexible model performs as well as an ANM-restricted Bayesian model (CGP). Both Bayesian models again outperform the non-Bayesian approaches, even those that make the correct ANM assumption.

🧵11/14
July 10, 2025 at 6:07 PM
With a larger number of variables (50), where the discrete search blows up, and with complex data, our approach performs well. SDCD uses the same acyclicity regulariser but trains neural networks by maximum likelihood, so the comparison shows the advantage of the Bayesian approach.

🧵10/14
July 10, 2025 at 6:07 PM
We first test on data generated from our model itself, where discrete model selection is tractable (3 variables). Here, we show that while the discrete model (DGP-CDE) recovers the true structure reliably, our continuous approximation (CGP-CDE) results in higher error.

🧵9/14
July 10, 2025 at 6:07 PM
We enforce acyclicity of the learnt adjacency by adding an acyclicity constraint to the optimisation; variational inference trains the remaining parameters.

The final objective returns the adjacency of the causal structure that maximises the posterior.

🧵8/14
July 10, 2025 at 6:07 PM
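Not from the thread: a minimal sketch of one well-known differentiable acyclicity penalty (the NOTEARS trace-of-matrix-exponential), purely to illustrate how acyclicity can enter a continuous objective. The paper's own regulariser (shared with SDCD, per the post above on 50 variables) may take a different form.

```python
import numpy as np
from scipy.linalg import expm

def notears_acyclicity(W: np.ndarray) -> float:
    """NOTEARS-style penalty: h(W) = tr(exp(W * W)) - d.

    h(W) == 0 exactly when the weighted adjacency W encodes a DAG, so adding
    h(W) as a constraint or penalty lets a continuous optimiser search over DAGs.
    """
    d = W.shape[0]
    return float(np.trace(expm(W * W)) - d)

# A 2-cycle gets a strictly positive penalty; a DAG gets (numerically) zero.
cyclic = np.array([[0.0, 1.0], [1.0, 0.0]])
dag = np.array([[0.0, 1.0], [0.0, 0.0]])
print(notears_acyclicity(cyclic))  # > 0
print(notears_acyclicity(dag))     # ~ 0
```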
Therefore, we can construct an adjacency matrix from the kernel hyperparameters. This amounts to Automatic Relevance Determination: maximising the marginal likelihood uncovers the dependency structure among the variables. However, the learnt adjacency still needs to be constrained to be acyclic.

🧵7/14
July 10, 2025 at 6:07 PM
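A toy illustration of reading an adjacency off ARD lengthscales, as in the post above; the indexing convention and the thresholding remark are my own assumptions, not the paper's exact construction.

```python
import numpy as np

def soft_adjacency_from_lengthscales(lengthscales: np.ndarray) -> np.ndarray:
    """Turn ARD lengthscales into a soft adjacency matrix.

    lengthscales[i, j] is assumed to be the lengthscale of input j in the GP
    mechanism for variable i. A very large lengthscale makes the kernel flat in
    that input (j is effectively not a parent of i); a small one makes j relevant.
    """
    relevance = 1.0 / lengthscales**2   # small lengthscale => high relevance
    A_soft = relevance.T                # A_soft[j, i]: strength of edge j -> i
    np.fill_diagonal(A_soft, 0.0)       # no self-edges
    return A_soft

# A hard graph can then be read off by thresholding, e.g. (A_soft > 0.1).astype(int).
```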
Next, we construct a latent variable Gaussian process model that can represent non-Gaussian densities, with each variable taking inputs according to a causal graph. To parametrise the space of graphs continuously, we note that the kernel hyperparameters control which inputs a function depends on.

🧵6/14
July 10, 2025 at 6:07 PM
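A bare-bones ARD kernel (my own simplification, not the paper's latent variable GP) to make the "hyperparameters control input dependence" point concrete: sending a lengthscale to infinity makes the GP ignore that input.

```python
import numpy as np

def ard_rbf_kernel(X1, X2, lengthscales, variance=1.0):
    """Squared-exponential kernel with one lengthscale per input dimension.

    As lengthscales[j] grows, the kernel becomes constant in input j, so the GP
    output no longer depends on it. This is the continuous "is j a parent?" knob.
    """
    diff = (X1[:, None, :] - X2[None, :, :]) / lengthscales  # scaled pairwise differences
    return variance * np.exp(-0.5 * np.sum(diff**2, axis=-1))

# Toy check: with a huge lengthscale on the second input, the kernel ignores it.
X = np.random.default_rng(0).normal(size=(5, 2))
K = ard_rbf_kernel(X, X, lengthscales=np.array([1.0, 1e6]))
```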
We first show that the guarantees of Bayesian model selection (BMS) carry over to the multivariate case: 1) when the underlying model is identifiable, BMS identifies the true DAG; 2) with more flexible models, different graphs remain distinguishable.

🧵5/14
July 10, 2025 at 6:07 PM
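For context (generic notation, not taken from the paper): Bayesian model selection scores each candidate graph by its marginal likelihood and keeps the maximum a posteriori graph,

```latex
p(\mathcal{D} \mid G) = \int p(\mathcal{D} \mid \theta, G)\, p(\theta \mid G)\, \mathrm{d}\theta,
\qquad
\hat{G} = \operatorname*{arg\,max}_{G \,\in\, \mathrm{DAGs}} \; p(G \mid \mathcal{D}) \;\propto\; p(\mathcal{D} \mid G)\, p(G).
```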
However, naive Bayesian model selection scales poorly: the number of DAGs grows super-exponentially with the number of variables.

We propose a continuous Bayesian model selection approach that scales and allows more flexible model assumptions.

🧵4/14
July 10, 2025 at 6:07 PM
While current causal discovery methods impose unrealistic model restrictions to ensure identifiability, Bayesian models relax strict identifiability while still allowing causal and more realistic assumptions, yielding performance gains: arxiv.org/abs/2306.02931

🧵3/14
July 10, 2025 at 6:07 PM
Bayesian models encode soft restrictions in the form of priors. These priors also allow us to encode causal assumptions, chiefly that causal mechanisms do not inform each other. This is achieved simply by ensuring that the prior factorises over the mechanisms.

🧵2/14
July 10, 2025 at 6:07 PM
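In symbols (my notation, one mechanism $f_i$ per variable): the assumption that mechanisms do not inform each other corresponds to a prior that factorises,

```latex
p(f_1, \ldots, f_d \mid G) \;=\; \prod_{i=1}^{d} p(f_i \mid G),
```

so that specifying or learning one mechanism carries no information about the others.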
Excited to be presenting this work at #ICLR2025. Please do reach out if you are interested in a similar space!

📍: Hall 3 + Hall 2B #471
🕐: Fri 25 Apr, 3 p.m.
📜: openreview.net/forum?id=eeJ...

This was a great collaboration w/ @mashtro.bsky.social, James Requeima, @mvdw.bsky.social
A Meta-Learning Approach to Bayesian Causal Discovery (openreview.net)
April 19, 2025 at 5:39 PM
Why did that work? We are approximating the posterior under an assumed causal model (the one data is presumed to be generated from), which may differ from the true data generating process. Improving the causal model (more flexible, wider prior) and increasing the capacity of the neural process can both help. 14/15
April 19, 2025 at 5:39 PM
What if we don't know the data distribution? Our approach here is to encode a "wide prior", training on mixtures of all possible models (that we can think of). We show that this approach leads to good performance on datasets whose generation process was not seen during training. 13/15
April 19, 2025 at 5:39 PM
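Not the authors' code: a toy sketch of what training on a mixture of generative models can look like, where each task draws a random DAG and a random mechanism family before simulating data. The names, edge probability, and the two families here are hypothetical.

```python
import numpy as np

def sample_training_task(n_nodes=5, n_samples=200, seed=None):
    """Draw one synthetic (data, graph) pair from a mixture of generative models."""
    rng = np.random.default_rng(seed)
    # Random DAG: random ordering + Bernoulli edges that respect the ordering.
    order = rng.permutation(n_nodes)
    adj = np.zeros((n_nodes, n_nodes))
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):
            if rng.random() < 0.3:
                adj[order[i], order[j]] = 1.0  # edge order[i] -> order[j]
    # Mixture over mechanism families: the "wide prior".
    family = rng.choice(["linear", "nonlinear"])
    X = np.zeros((n_samples, n_nodes))
    for j in order:  # simulate in topological order
        parents = np.where(adj[:, j] == 1.0)[0]
        noise = rng.normal(size=n_samples)
        if len(parents) == 0:
            X[:, j] = noise
        elif family == "linear":
            X[:, j] = X[:, parents] @ rng.normal(size=len(parents)) + noise
        else:
            X[:, j] = np.tanh(X[:, parents] @ rng.normal(size=len(parents))) + noise
    return X, adj
```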
Next, we test with more nodes (20), denser graphs, and more complicated functions. Here, our model outperforms the other baselines. Notably, a single model trained on all the data (labelled BCNP All Data) does not lose performance on specific datasets. 12/15
April 19, 2025 at 5:39 PM
We first show that our model outputs reasonable posterior samples on a 2-node graph with a single edge, where the structure is not identifiable from the data. Here we can see that the AVICI model, which does not correlate entries of the adjacency matrix, fails to output reasonable samples. 11/15
April 19, 2025 at 5:39 PM
We test against two types of baselines: 1) posterior approximations via the marginal likelihood (DiBS, BayesDAG); 2) NP-like methods that find single structures and can be used to obtain posterior samples, but miss key properties of the posterior (AVICI, CSIvA). 10/15
April 19, 2025 at 5:39 PM
The loss, which targets the KL divergence, simplifies to maximising the log probability of the true causal graph under our model. The final scheme: a model that efficiently outputs samples of causal structures approximating the true posterior, with just a forward pass! 9/15
April 19, 2025 at 5:39 PM
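Spelled out in generic notation (mine, not the paper's): with $(\mathcal{D}, G)$ pairs drawn from the simulator and $q_\phi$ the neural process, the expected forward KL reduces to a cross-entropy on the true graph, since the true posterior's entropy does not depend on $\phi$,

```latex
\mathbb{E}_{p(\mathcal{D})}\Big[\mathrm{KL}\big(p(G \mid \mathcal{D}) \,\|\, q_\phi(G \mid \mathcal{D})\big)\Big]
\;=\; -\,\mathbb{E}_{p(\mathcal{D}, G)}\big[\log q_\phi(G \mid \mathcal{D})\big] \;+\; \text{const},
```

so training amounts to maximising $\log q_\phi(G \mid \mathcal{D})$ over simulated pairs.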
Our decoder uses a lower-triangular matrix (A) and a permutation matrix (Q) to construct DAGs. A Gumbel-Sinkhorn distribution is parameterised, from which permutations (Q) can be sampled. The representation is further processed to parameterise the lower-triangular matrix (A). 8/15
April 19, 2025 at 5:39 PM
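A minimal sketch of the construction in the post above, using a hard permutation for readability; in the paper the permutation comes from a Gumbel-Sinkhorn relaxation so it can be sampled differentiably, and the exact parameterisation may differ.

```python
import numpy as np

def dag_from_triangular_and_permutation(A: np.ndarray, Q: np.ndarray) -> np.ndarray:
    """Combine a strictly lower-triangular matrix A and a permutation matrix Q into a DAG.

    Q A Q^T is always acyclic: A respects the fixed ordering 1..d, and Q merely
    relabels the nodes, which cannot introduce a cycle.
    """
    assert np.allclose(A, np.tril(A, k=-1)), "A must be strictly lower triangular"
    return Q @ A @ Q.T

# Toy usage with a hard permutation.
d = 3
A = np.tril(np.random.default_rng(0).random((d, d)) > 0.5, k=-1).astype(float)
Q = np.eye(d)[np.random.default_rng(1).permutation(d)]
adjacency = dag_from_triangular_and_permutation(A, Q)
```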
We embed each node-sample pair and append a query vector of 0s along the sample axis. Our encoder alternates between attention over samples and attention over nodes to preserve equivariance. We then perform cross-attention with the query vector to encode permutation invariance over samples. 7/15
April 19, 2025 at 5:39 PM
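A rough PyTorch sketch (my own simplification) of alternating attention over the sample and node axes; the layer counts, normalisation, and query-token handling in the actual model may differ.

```python
import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    """One block of self-attention over samples, then over nodes.

    Input shape: (batch, n_samples, n_nodes, dim). Self-attention without
    positional encodings is permutation-equivariant along the axis it attends
    over, and treating the other axis as batch leaves that axis untouched, so
    the block is equivariant to permutations of both samples and nodes.
    """

    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.sample_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.node_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, n, d = x.shape
        # Attention over the sample axis: fold nodes into the batch dimension.
        h = x.permute(0, 2, 1, 3).reshape(b * n, s, d)
        h = h + self.sample_attn(h, h, h)[0]
        h = h.reshape(b, n, s, d).permute(0, 2, 1, 3)
        # Attention over the node axis: fold samples into the batch dimension.
        g = h.reshape(b * s, n, d)
        g = g + self.node_attn(g, g, g)[0]
        return g.reshape(b, s, n, d)

# Toy usage: 2 datasets, 16 samples, 5 nodes, 32-dimensional embeddings.
out = AlternatingAttentionBlock(32)(torch.randn(2, 16, 5, 32))
```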
What does our model look like? We encode key properties of the posterior: 1) permutation invariance with respect to samples, 2) permutation equivariance with respect to nodes, 3) correlations between adjacency entries. We do this with a transformer encoder-decoder structure. 6/15
April 19, 2025 at 5:39 PM
Our training objective reflects this: we minimise the KL between the true posterior and the neural process. The key property is that we only require samples of data and the corresponding true causal graph. These samples form the "prior", which can be synthetic or come from real examples. 5/15
April 19, 2025 at 5:39 PM