Lightnews — Scholar-powered news

John Chodera

@jchodera.bsky.social

4.9K followers 5.9K following 140 posts

Achira | http://achira.ai

Research laboratory | http://choderalab.org

Antiviral drug discovery for pandemics | http://asapdiscovery.org

OpenADMET | http://openadmet.org

Employer-mandated disclaimer: http://choderalab.org/disclaimer

Pronouns: he/him

John Chodera

@jchodera.bsky.social

Chat, is this good?

New York Times headline, "In Retaliatory Move, Trump Threatens 100% Tariffs on Chinese Goods", with graph of the stock market plummeting immediately below.

October 11, 2025 at 1:53 AM

John Chodera

@jchodera.bsky.social

From Science:

On a Nature news article reporting that NIH planned to suspend subawards for foreign collaborations:
“No, that’s false. There’s going to be a policy on tracking subawards. The NIH and the government should be able to see where the money’s going.”

“I’m really uncomfortable with this conversation, because you’re like, actually spreading rumors that you don’t know anything about. … Nature also is spreading rumors. Halt foreign collaborations, that’s not true.”

“We’re working on the policy, Jocelyn. You shouldn’t be reporting rumors. I know there’s leaks all over here, but the leaks don’t actually reflect what’s happening. Don’t write about rumors. It actually makes the things that you and I care about worse. Like it spreads panic.”

Later that day, NIH released a policy that halted future subawards to foreign scientists and said they will need to apply directly for money under a system still in development.

May 6, 2025 at 11:48 AM

John Chodera

@jchodera.bsky.social

Excited to be joining the AI for Science Symposium in San Francisco on May 16 organized by ML Foundry!

Check it out here:
mlfoundry.com/ai-for-scien...

ASCII art poster for:

AI for Science Symposium
May 16

Connecting researchers, funders, and compute providers to accelerate AI for Science

Organized by Foundry Technologies

Breakthrough Research from Stanford, Berkeley, UT Austin, Lawrence Berkeley Labs, Gladstone Institutes, Deep Forest Sciences, Broad Institute, Enigma Project, IAIFI, Achira, Kerna Labs, Pacific Fusion

Keynote Panelists from Lawrence Berkeley Labs, OpenAI, NVIDIA, Foundry, IAIFI, Khosla Ventures, Xaira Therapeutics, Columbia, Amaranth Foundation, DARPA, Nightingale

Co-Sponsors include Open Athena, Invisible Technologies, Enigma Project

May 2, 2025 at 4:54 PM

John Chodera

@jchodera.bsky.social

I have so many of these!

Red sticker, inscribed with

EVERY "BOND" IS AN IMAGINARY CONSTRUCT
achira.ai

A parody of the BIRDS AREN'T REAL van.

February 21, 2025 at 9:31 PM

John Chodera

@jchodera.bsky.social

Oh, have we got coffee cup ideas for you...

Sign reading BONDS AREN'T REAL, a parody of BIRDS AREN'T REAL, with a link to achira.ai

February 21, 2025 at 5:11 PM

John Chodera

@jchodera.bsky.social

As a peek toward where we're headed:

Right now, CADD scientists are forced to use the same model week after week, even if new experimental data says the model is inaccurate.

If we can fine-tune models, we can exploit that data to systematically improve our predictions week by week!

Illustration showing how in 2025, CADD scientists are forced to use the same published force field model week after week in a manner that cannot learn from new experimental data that contradicts it.

In the future (2027?), CADD scientists will be able to make good general predictions with a foundation simulation model, but will be able to fine-tune that model after every new batch of data to deliver systematically more accurate predictions week after week.

February 19, 2025 at 7:36 PM

John Chodera

@jchodera.bsky.social

This strategy is still highly limited by the constraints of legacy molecular mechanics force fields, which were developed for late-1970s-era hardware.

Since then, compute per dollar has increased by 160 billion times. We need a totally new approach. Stay tuned for something exciting...

Current-generation simulation models were built for ancient technology

Left: MM potentials in use today were built for the DEC PDP-11 (1976), $160,000 for 50 KFLOP/s (in today's dollars)

Right: A new generation of simulation models is needed for new hardware like the NVIDIA RTX 5090 (2025), $2000 for 100 TFLOP/s.
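The "160 billion times" figure follows directly from the two price/performance points in the slide above. A quick arithmetic check, using only the numbers quoted there (variable names are illustrative):

```python
# Price/performance points quoted in the slide (costs in today's dollars).
pdp11_flops = 50e3      # DEC PDP-11 (1976): 50 KFLOP/s
pdp11_cost = 160_000    # dollars
rtx_flops = 100e12      # NVIDIA RTX 5090 (2025): 100 TFLOP/s
rtx_cost = 2_000        # dollars

# Compute per dollar for each machine, then the improvement factor.
pdp11_flops_per_dollar = pdp11_flops / pdp11_cost
rtx_flops_per_dollar = rtx_flops / rtx_cost
improvement = rtx_flops_per_dollar / pdp11_flops_per_dollar

print(f"Improvement in compute per dollar: {improvement:.2e}")
```

The ratio comes out at 1.6 × 10^11, which is the 160 billion quoted in the post.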

February 19, 2025 at 7:30 PM

John Chodera

@jchodera.bsky.social

This simple regularization strategy was surprisingly effective at ensuring the reweighted free energy estimates at new parameters matched free energies from expensive new simulations at the new parameter set.

Figure 5. Fine-tuned Optimized/Reweighted hydration free energies are highly consistent with Optimized/Recalculated hydration free energies and demonstrate statistically significant accuracy improvements over the foundation model espaloma-0.3.2. (A) Correlation plots between Optimized/Reweighted free energies and Optimized/Resimulated free energies show RMSEs of 0.06 kcal mol−1, demonstrating agreement between Zwanzig reweighting used in fine-tuning optimization and the Bennett Acceptance Ratio (BAR) [7], which was used in the free energy recalculation experiments at optimized molecular partial charges. (B) Experimental vs calculated hydration free energy plots by dataset show consistent improvements in the Optimized/Resimulated calculations over the baseline foundation model espaloma-0.3.2. (C) Absolute hydration free energy residual CDFs for each data split and experiment show high agreement between residuals of the Optimized/Resimulated (green) and Optimized/Reweighted (orange) data (consistent with panel A) and reliable improvement over the baseline foundation model (green) in all data splits.

From https://doi.org/10.1101/2025.01.06.631610
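For readers unfamiliar with the Zwanzig reweighting named in the caption, here is a minimal sketch of the standard exponential-averaging estimator. It assumes you already have, for each stored frame, the energy difference between the new and old parameter sets; the function name `zwanzig_delta_f` and the sample numbers are hypothetical, and this is not the preprint's code.

```python
import math

def zwanzig_delta_f(delta_u, kT):
    """Estimate F_new - F_old from per-frame energy differences.

    delta_u: U_new(x) - U_old(x) for frames x sampled with the OLD parameters.
    kT: thermal energy in the same units as delta_u.
    Implements dF = -kT * ln < exp(-delta_u / kT) >_old.
    """
    x = [-du / kT for du in delta_u]
    m = max(x)  # subtract the maximum before exponentiating for numerical stability
    log_mean = m + math.log(sum(math.exp(xi - m) for xi in x) / len(x))
    return -kT * log_mean

# Illustrative use with made-up energy differences (kcal/mol) at 298.15 K.
kT = 0.0019872041 * 298.15  # gas constant in kcal/(mol K) times temperature
example_delta_u = [0.10, -0.05, 0.20, 0.00, 0.15]
print(zwanzig_delta_f(example_delta_u, kT))
```

No new simulation is needed to evaluate this, which is what makes reweighting attractive during optimization; its reliability depends on how much the old and new ensembles overlap.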

February 19, 2025 at 7:30 PM

John Chodera

@jchodera.bsky.social

Dominic showed that a regularized low-rank adjustment to the espaloma-0.3 MM foundation simulation model dramatically reduced errors in predicted hydration free energies in the held-out test set, shifting the error CDF far to the left, improving the RMSE by 0.82 kcal/mol--in just ten minutes!

A representative absolute free energy residual empirical CDF plot (75% training data, 100 principal components) from which the aggregate statistics in panels A, B were calculated shows significant qualitative improvement in the free energy residuals (solid lines) as compared to baselines (dotted). Model parameters Θ of this optimization experiment correspond to the largest mean improvement in validation RMSE (0.82 kcal mol−1) of all the experiments with >100 test data points left after data partitioning.

From Figure 4 of https://doi.org/10.1101/2025.01.06.631610

February 19, 2025 at 7:30 PM

John Chodera

@jchodera.bsky.social

But this can be avoided! By adding a regularization term that penalizes effective sample size collapse, optimization takes a different path that maintains the ability of reweighting strategies to estimate accurate free energies for the new model parameters.

Figure 3 continued: (B) Left: Iterative changes in the probability density function upon reweighting cause the ESS of a molecule’s sampled conformations to fall as 𝚯 approaches the boundary of the trust region. Without ESS regularization (blue circles), 𝚯 may move outside the trust region and sample size collapse ensues, which would require generating new simulation data to continue reliable optimization. With ESS regularization (orange triangles), the ESS is restrained above some ESS threshold (black dotted horizontal line), enabling reliable optimization progress in directions that do not immediately lead to sample size collapse. Plot shown for FreeSolv molecule "mobley_1781152". Right: Optimizing the fine-tuning parameters 𝚯 to minimize training loss causes the magnitude of the free energy perturbation (blue/orange circles corresponding to with and without ESS regularization, respectively) to increase as 𝚯 (𝐿2 norm shown in blue/orange triangles corresponding to with and without ESS regularization) is perturbed from its initial zero matrix (at training iteration 1). (C) As the ESS for reweighting collapses, the combined reweighting bias and error begins to increase rapidly beyond a critical ESS threshold. We define the threshold for the acceptable mean total error using the original mean (taken over all molecules) hydration free energy calculation BAR uncertainty (0.02 kcal mol−1). The upper 95% CI of the total error envelope falls below the error threshold near 500 samples (10% of all frames collected for each molecule from an original sample size of 5000). (D) Optimization using ESS regularization affords more improvement in the residuals than optimization without regularization, as can be seen from the empirical cumulative distribution function (CDF) of absolute free energy residuals of the Optimized/Reweighted (solid lines) train/test/validate data partitions using ESS regularization as compared to that without (dashed lines).
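A rough sketch of how an effective sample size could be monitored and penalized during reweighting. Two assumptions are made here that the caption does not confirm: the ESS is computed with the common Kish formula, (Σw)² / Σw², and the penalty is a simple squared hinge on the shortfall below a threshold. The names `effective_sample_size` and `ess_penalty` are hypothetical.

```python
import math

def effective_sample_size(delta_u, kT):
    """Kish ESS of reweighting weights w_i proportional to exp(-delta_u_i / kT).

    delta_u: U_new(x) - U_old(x) for frames sampled with the old parameters.
    """
    x = [-du / kT for du in delta_u]
    m = max(x)  # shift for numerical stability; the ESS is invariant to this
    w = [math.exp(xi - m) for xi in x]
    return sum(w) ** 2 / sum(wi * wi for wi in w)

def ess_penalty(ess, ess_min, strength=1.0):
    """Assumed penalty form: zero above the threshold, quadratic below it."""
    shortfall = max(0.0, ess_min - ess)
    return strength * shortfall ** 2

# With identical energy shifts every frame counts equally, so ESS equals the frame count.
print(effective_sample_size([0.3] * 5000, kT=0.6))
# A threshold of 500 out of 5000 frames mirrors the numbers quoted in panel C.
print(ess_penalty(ess=400.0, ess_min=500.0))
```

Adding a term like this to the training loss steers the optimizer away from parameter changes that would leave only a handful of frames carrying all the weight.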

February 19, 2025 at 7:30 PM

John Chodera

@jchodera.bsky.social

Dominic also wanted it to be FAST. Could we do this in one step, avoiding the need for additional simulations?

Unfortunately, once you start training the LoRA model, the conformational probability density can change rapidly, quickly escaping the region where free energy reweighting gives low errors.

Figure 3. Single-shot model fine-tuning with effective sample size (ESS) regularization allows for parameter optimization via reweighting whilst maintaining a small reweighting error and avoiding sample size collapse. In order to fine-tune a model without requiring extensive re-simulation at updated parameter values, we develop a strategy to prevent ESS collapse (and therefore maintain small reweighting error) during parameter fine-tuning. (A) Left: Optimization of fine-tuning parameters 𝚯 on the loss surface (cartoon, color map with contour plot) should remain inside some well-defined trust region (black circle) to avoid ESS size collapse for reweighted free energy estimates as model parameters are adjusted to fit training data. Right: free energy reweighting by optimizing model parameters 𝚯 necessarily changes the probability density function of the conformation space of a representative small molecule, which can lead to a collapse in ESS when reweighting the original simulation data to estimate free energies at updated model parameters.

From https://doi.org/10.1101/2025.01.06.631610

February 19, 2025 at 7:30 PM

John Chodera

@jchodera.bsky.social

Borrowing ideas from fine-tuning LLMs, Dominic asked whether a low-rank adaptation (LoRA) could map from atom embedding vectors to learned perturbations to the charge-equilibration model parameters (electronegativity and hardness) to modify the small molecule partial charges.

Figure 1. Our approach to rapidly fine-tune an existing espaloma “foundation model” employs a fast, low-rank perturbation to electrostatic parameters. An existing espaloma graph model for MM parameters (here, espaloma-0.3.2), typically trained with large quantum chemical datasets of molecular energies and/or forces, provides an excellent foundation for fine-tuning to further improve its accuracy on a specific or heterogeneous dataset, possibly of experimental data. We adopt a rapid fine-tuning approach that uses a low-rank approximation to perturb the electrostatics component of the model without needing to re-optimize all model parameters. Fine-tuning proceeds in sequential steps:
1. Pool atom embeddings: the dataset’s embedding vectors 𝐡𝑖 from the foundation espaloma model are pooled into a data matrix 𝐇.
2. Principal component analysis/truncate covariance: the covariance matrix 𝚺ℎ of the data matrix 𝐇 is diagonalized and the first r principal component vectors are extracted as a truncated matrix 𝐐𝑚×𝑟 of orthonormal principal components. The cumulative variance as a function of the number of principal components is shown as an inset.
3. Change of basis: each 𝐡𝑖 is projected into the lower dimensional basis 𝐐𝑚×𝑟, yielding ̂𝐡𝑖.
4. Electrostatic perturbations: for each atom, the electrostatic parameter vector (𝑒, 𝑠) 𝑇 representing electronegativity and hardness, respectively, is perturbed by projecting ̂𝐡𝑖 onto two parameterized vectors 𝜃𝑒 and 𝜃𝑠.
5. Charge equilibration: the new atomic partial charges ̂𝑞∗𝑖 for each atom are recomputed using charge equilibration (QEq).

Figure from https://doi.org/10.1101/2025.01.06.631610
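The five steps in the caption can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the preprint's implementation: the function names are hypothetical, random numbers stand in for real espaloma embeddings, the embeddings are projected without centering, and charge equilibration is taken to be the closed-form minimizer of the second-order energy Σ(e_i q_i + ½ s_i q_i²) at fixed total charge.

```python
import numpy as np

def pca_basis(H, r):
    """Steps 1-2: first r principal components of pooled embeddings H (n_atoms x m)."""
    Hc = H - H.mean(axis=0)                  # center each embedding dimension
    cov = Hc.T @ Hc / (H.shape[0] - 1)       # m x m covariance matrix
    evals, evecs = np.linalg.eigh(cov)       # eigenvalues returned in ascending order
    order = np.argsort(evals)[::-1][:r]      # indices of the r largest eigenvalues
    return evecs[:, order]                   # m x r matrix with orthonormal columns

def perturb_parameters(H, Q, e, s, theta_e, theta_s):
    """Steps 3-4: project embeddings onto Q, then shift electronegativity e and hardness s."""
    H_hat = H @ Q                            # n_atoms x r reduced embeddings
    return e + H_hat @ theta_e, s + H_hat @ theta_s

def equilibrate_charges(e, s, total_charge=0.0):
    """Step 5 (assumed form): minimize sum(e*q + 0.5*s*q**2) subject to sum(q) = total_charge."""
    lam = (total_charge + np.sum(e / s)) / np.sum(1.0 / s)  # Lagrange multiplier
    return (lam - e) / s

# Stand-in data: 12 atoms with 8-dimensional embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
H = rng.normal(size=(12, 8))
e = rng.normal(size=12)
s = rng.uniform(1.0, 2.0, size=12)           # hardness must stay positive

Q = pca_basis(H, r=3)
theta_e = np.zeros(3)                        # zero initialization leaves the foundation model unchanged
theta_s = np.zeros(3)
e_new, s_new = perturb_parameters(H, Q, e, s, theta_e, theta_s)
q = equilibrate_charges(e_new, s_new, total_charge=0.0)
print(q.sum())
```

Only the 2r entries of `theta_e` and `theta_s` are trainable here, which is what makes this kind of fine-tuning cheap compared with re-optimizing the whole graph model.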

February 19, 2025 at 7:30 PM

John Chodera

@jchodera.bsky.social

For example, hydration free energies can be computed with high precision using alchemical free energy methods with Ken's espaloma-0.3 foundation model, but are relatively inaccurate, with an RMSE ~ 1.72 kcal/mol and a disappointing error cumulative distribution function (CDF).

Figure 2. The espaloma-0.3.2 foundation model gives high correlation with experiment on the FreeSolv small molecule hydration free energy benchmark, but leaves room for systematic improvement. (A) Absolute hydration free energy calculations transport small molecules from vacuum to water by alchemically coupling intermolecular nonbonded interactions between small molecules and the water solvent. (B) Six representative molecules in the FreeSolv dataset, which consists of 642 neutral small organic molecules. (C) The espaloma-0.3.2 foundation model’s absolute hydration free energy calculations correlate well with experimental hydration free energies. (D) Empirical cumulative distribution function of absolute residuals between calculated and experimental hydration free energies.

From https://doi.org/10.1101/2025.01.06.631610
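The two accuracy summaries used throughout this thread, RMSE and the empirical CDF of absolute residuals, are simple to compute. A small sketch with made-up numbers (the values below are placeholders, not FreeSolv data, and the function names are hypothetical):

```python
import math

def rmse(calculated, experimental):
    """Root mean square error between paired calculated and experimental values."""
    n = len(calculated)
    return math.sqrt(sum((c - x) ** 2 for c, x in zip(calculated, experimental)) / n)

def abs_residual_ecdf(calculated, experimental):
    """Empirical CDF of |calculated - experimental| as (residual, fraction <= residual) pairs."""
    residuals = sorted(abs(c - x) for c, x in zip(calculated, experimental))
    n = len(residuals)
    return [(r, (i + 1) / n) for i, r in enumerate(residuals)]

# Placeholder hydration free energies in kcal/mol, for illustration only.
calc = [-3.1, -5.4, -0.8, -7.9]
expt = [-2.5, -6.0, -1.0, -9.5]
print(rmse(calc, expt))
for residual, fraction in abs_residual_ecdf(calc, expt):
    print(f"{residual:.2f} kcal/mol -> {fraction:.2f}")
```

A curve that rises faster, that is, shifted to the left, means a larger fraction of molecules falls under any given error level.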

February 19, 2025 at 7:30 PM

John Chodera

@jchodera.bsky.social

What can you do with a foundation simulation model? It's good at many tasks, but may not achieve the performance you want for every application.

Former PhD student Dominic Rufa wondered whether we could fine-tune these models to deliver better performance.

www.linkedin.com/in/dominic-r...

February 19, 2025 at 7:30 PM

John Chodera

@jchodera.bsky.social

In a tour de force, visiting scientist Ken Takaba [https://kntkb.github.io/] built the first true molecular mechanics foundation simulation model for proteins, nucleic acids, and small molecules that excels at protein:ligand binding free energy benchmarks.

pubs.rsc.org/en/content/a...

Table 1 Espaloma-0.3 can directly fit quantum chemical potential energies and forces more accurately than baseline force fields. Espaloma was fit to quantum chemical (QC) potential energies and forces from various gas-phase QC datasets sourced from QCArchive,70 covering a broad chemical space that includes small molecules, peptides, and RNA molecules (see ESI Section B). The entire dataset consists of 17 427 unique molecules and 1 188 317 conformations. These datasets were extracted from three different QCArchive workflows: BasicDataset, OptimizationDataset, and TorsionDriveDataset. The datasets were partitioned into train, validate, and test sets in an 80 : 10 : 10 ratio split by molecules, except for the RNA-Trinucleotide and RNA-Nucleoside datasets. Since RNA nucleosides and trinucleosides lack chemical diversity, the RNA-Nucleoside dataset was used for training, whereas the RNA-Trinucleotide dataset, which covers the same molecules as the RNA-Diverse dataset but with much more diverse conformers, was used as a test set. The number of molecules and total conformations for each dataset is annotated in the table. We report the root mean square error (RMSE) on the training and test sets, along with the performance of other force fields as baselines on the test set. The baseline force fields used were gaff-2.11,71 openff-2.0.0,72 and openff-2.1.0 (ref. 73) for small molecules, Amber ff14SB22 for peptides, and Amber RNA.OL3 (ref. 25) for RNA molecules. All statistics are computed with predicted and reference energies centered to have a zero mean for each molecule similar to the previous work.49 The 95% confidence intervals, as annotated in the results, were calculated by bootstrapping molecule replacement using 1000 replicates.

From https://arxiv.org/abs/2307.07085

Fig. 3 espaloma-0.3 reproduces experimental NMR scalar couplings of unstructured peptides better than the well-established biomolecular force field ff14SB. (a) χ² values (lower is better) quantifying deviations of simulated NMR scalar couplings computed from 500 ns trajectories from experimental NMR measurements.91,92 Error bars represent a 95% confidence interval constructed from the critical values of a Student's t distribution and the standard error of the mean across the NMR observables. (b) Comparison of the error in computed estimates of NMR scalar couplings versus experiment. Colors represent the identity of the amino acid associated with each scalar coupling. Horizontal error bars represent the estimate of the systematic error in the experimental scalar coupling, and vertical error bars represent the uncertainty due to the computed estimate (standard error of the mean across 3 replicates) and the uncertainty due to the experimental value (systematic error) added in quadrature.

From https://arxiv.org/abs/2307.07085

Fig. 5 espaloma-0.3 can be used for accurate protein–ligand alchemical free energy calculations. (a) Protein–ligand (PL) alchemical free energy calculations were calculated for Tyk2 (10 ns/replica), Cdk2 (10 ns/replica), Mcl1 (15 ns/replica), P38 (20 ns/replica) using a curated PL-benchmark dataset (see ESI Section F†) which comprises 76 ligands in total. The PL structures used to set up the alchemical free energy calculations for each target system are shown. Here, we used Perses 0.10.1 relative free energy calculation infrastructure,110 based on OpenMM 8.0.0,111 to assess the accuracy of espaloma-0.3 and openff-2.1.0 (ref. 73) combined with Amber ff14SB force field22 for comparison. (b) Schematic illustration of the alchemical ligand transformation network for Tyk2. The methyl R-group in the center is alchemically transformed into various R-groups. The binding free energy for each R-group is annotated alongside the respective R-groups. (c) The openff-2.1.0 (ref. 73) with protein parametrized with Amber ff14SB force field (ff14SB + openff-2.1.0) achieves an absolute free energy (ΔG) RMSE of 1.01 [95% CI: 0.73, 1.33] kcal mol−1. The espaloma-0.3 for predicting valence parameters and partial charges of small molecules combined with Amber ff14SB force field for proteins (ff14SB + espaloma-0.3) achieves an absolute free energy (ΔG) RMSE of 1.13 [95% CI: 0.86, 1.47] kcal mol−1. Parametrizing small molecule and protein self-consistently with espaloma-0.3 (espaloma-0.3) achieves absolute free energy (ΔG) RMSE of 1.02 [95% CI: 0.74, 1.37] kcal mol−1 which is comparable to those obtained by (ff14SB + openff-2.1.0) and (ff14SB + espaloma-0.3). All systems were solvated with TIP3P water26 and neutralized with 300 mM NaCl salt using Joung and Cheatham monovalent counterions.29 The light and dark gray regions depict the confidence bounds of 0.5 kcal mol−1 and 1.0 kcal mol−1, respectively.

From https://arxiv.org/abs/2307.07085

February 19, 2025 at 7:30 PM

John Chodera

@jchodera.bsky.social

Aside: Yuanqing Wang is tremendously talented, and is on the faculty job market right now!

Check out his website: www.wangyq.net
and papers: scholar.google.com/citations?us...

February 19, 2025 at 7:30 PM

John Chodera

@jchodera.bsky.social

Recently, Yuanqing Wang [https://www.wangyq.net/] (now a Simons and Schmidt Fellow at NYU) demonstrated how molecular mechanics force field construction can be cast as an end-to-end differentiable machine learning problem, presenting espaloma: espaloma.wangyq.net

Figure 1. Espaloma is an end-to-end differentiable molecular mechanics parameter assignment scheme for arbitrary organic molecules. espaloma (extendable surrogate potential optimized by message-passing) is a modular approach for directly computing molecular mechanics force field parameters ΦFF from a chemical graph 𝒢 such as a small molecule or biopolymer via a process that is fully differentiable in the model parameters ΦNN. In Stage 1, a graph neural network is used to generate continuous latent atom embeddings describing local chemical environments from the chemical graph. In Stage 2, these atom embeddings are transformed into feature vectors that preserve appropriate symmetries for atom, bond, angle, and proper/improper torsion inference via Janossy pooling. In Stage 3, molecular mechanics parameters are directly predicted from these feature vectors using feed-forward neural nets. This parameter assignment process is performed once per molecular species, allowing the potential energy to be rapidly computed using standard molecular mechanics or molecular dynamics frameworks thereafter. The collection of parameters ΦNN describing the espaloma model can be considered as the equivalent complete specification of a traditional molecular mechanics force field such as GAFF [26, 27]/AM1-BCC [28, 29] in that it encodes the equivalent of traditional typing rules, parameter assignment tables, and even partial charge models. This final stage is modular, and can be easily extended to incorporate additional molecular mechanics parameter classes, such as parameters for a charge-equilibration model (Section 4), point polarizabilities, or valence-coupling terms for Class II molecular mechanics force fields [30, 31].

Figure from Wang Y, Fass J, and Chodera JD, “End-to-End Differentiable Construction of Molecular Mechanics Force Fields.” https://arxiv.org/abs/2010.01196

February 19, 2025 at 7:30 PM

John Chodera

@jchodera.bsky.social

Everything is chaos, but I wanted to share some awesome recent science from the lab that hints at where the future of biomolecular simulation is headed:

Foundation simulation models that can be fine-tuned to experimental free energy data to produce systematically more accurate predictions.

Figure 1 from bioRxiv preprint https://doi.org/10.1101/2025.01.06.631610

Fig. 1 Espaloma is an end-to-end differentiable molecular mechanics parameter assignment scheme for arbitrary organic molecules. Espaloma (extensible surrogate potential optimized by message-passing) is a modular approach for directly computing molecular mechanics force field parameters ΦFF from a chemical graph G such as a small molecule or biopolymer via a process that is fully differentiable in the model parameters ΦNN. In Stage 1, a graph neural network is used to generate continuous latent atom embeddings describing local chemical environments from the chemical graph. In Stage 2, these atom embeddings are transformed into feature vectors that preserve appropriate symmetries for atom, bond, angle, and proper/improper torsion inference via Janossy pooling.54 In Stage 3, molecular mechanics parameters are directly predicted from these feature vectors using feed-forward neural networks. This parameter assignment process is performed once per molecular species, allowing the potential energy to be rapidly computed using standard molecular mechanics or molecular dynamics frameworks thereafter. The collection of parameters ΦNN describing the espaloma model can be considered as the equivalent complete specification of a traditional molecular mechanics force field such as GAFF38,39/AM1-BCC55,56 in that it encodes the equivalent of traditional typing rules, parameter assignment tables, and even partial charge models. Reproduced from ref. 49 with permission from the Royal Society of Chemistry.

February 19, 2025 at 7:30 PM

John Chodera

@jchodera.bsky.social

In case you want to do something about it:
resist.bot/petitions/PM...

Text SIGN PMCCED to 50409
Urgent opposition to federal grant and loan payment suspension

January 28, 2025 at 1:20 PM

John Chodera

@jchodera.bsky.social

Happy Caturday to all who celebrate.

Junko T. Cat snoozing in a circular shape on a heated blanket situated on a gray patterned comforter. Junko is a white short-haired cat with grey-black spots, a grey-and-black ringed tail, and a permanently bent left ear. Her left forepaw is gently holding her legs and tail so she can form a circle.

December 22, 2024 at 4:57 AM

John Chodera

@jchodera.bsky.social

Daniel Probst (@skepteis.bsky.social) is up next with "Less is More" at the ML4Molecules Workshop in Berlin, taking his own advice by presenting without the need for corporeal form.