Tom McCoy
@rtommccoy.bsky.social
Assistant professor at Yale Linguistics. Studying computational linguistics, cognitive science, and AI. He/him.
🤖 🧠 NEW BLOG POST 🧠 🤖
What skills do you need to be a successful researcher?
The list seems long: collaborating, writing, presenting, reviewing, etc.
But I argue that many of these skills can be unified under a single overarching ability: theory of mind
rtmccoy.com/posts/theory...
September 30, 2025 at 3:14 PM
A linguistic note about David Copperfield and Demon Copperhead 🧵
[very minor spoilers for both]
1/n
August 24, 2025 at 4:57 PM
🤖 🧠 NEW PAPER ON COGSCI & AI 🧠 🤖
Recent neural networks capture properties long thought to require symbols: compositionality, productivity, rapid learning
So what role should symbols play in theories of the mind? For our answer...read on!
Paper: arxiv.org/abs/2508.05776
1/n
August 15, 2025 at 4:27 PM
According to Jane Austen, linguists are extraordinarily cold-hearted.
(Though at least we're not as bad as mathematicians!)
August 3, 2025 at 4:10 PM
More dramatically, it substantially outperforms the standard neural network at learning recursion (left) and priming (right; a lower value on the y-axis shows a greater degree of priming).
13/n
May 20, 2025 at 7:16 PM
Here, its perplexity is slightly better (i.e., lower) than that of a standard neural network.
12/n
May 20, 2025 at 7:15 PM
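[Editor's note: perplexity is the exponentiated average negative log-probability a model assigns to held-out tokens, so lower means the model found the text less surprising. A minimal sketch with hypothetical token probabilities:]

import math

# Probabilities a language model assigned to each held-out token
# (hypothetical values, purely for illustration).
token_probs = [0.25, 0.10, 0.60, 0.05]

# Perplexity = exp of the average negative log-probability per token.
avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
print(math.exp(avg_neg_log_prob))  # lower is better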
Even though it is a neural network, the prior-trained model can learn formal languages from small numbers of examples - far outperforming a standard neural network, and matching a Bayesian model at a fraction of the computational cost.
10/n
May 20, 2025 at 7:14 PM
Inspired by a model from Yang & @spiantado.bsky.social, the prior that we use is a distribution over formal languages (a formal language = a set of strings defined by an abstract rule). We have a neural network meta-learn by observing many formal languages sampled from this prior.
8/n
May 20, 2025 at 7:13 PM
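[Editor's note: to make "a distribution over formal languages" concrete, here is a toy stand-in of my own (not the Yang & Piantadosi prior used in the paper): sampling from the prior means randomly choosing a string-generating rule, and each chosen rule then yields example strings for meta-learning.]

import random

# Toy prior over formal languages: each draw picks a simple rule that
# defines a set of strings (a stand-in for the paper's actual prior).
def sample_language(rng):
    kind = rng.choice(["a^n b^n", "repeat one symbol", "alternate ab"])
    if kind == "a^n b^n":
        return kind, lambda n: "a" * n + "b" * n
    if kind == "repeat one symbol":
        sym = rng.choice("abc")
        return kind, lambda n: sym * n
    return kind, lambda n: "ab" * n

rng = random.Random(0)
# Meta-learning data: many languages, each contributing a few example strings.
for _ in range(3):
    name, generate = sample_language(rng)
    print(name, [generate(n) for n in range(1, 4)])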
In MAML, a model is exposed to many tasks. After each task, the model's weights are adjusted so that, if it were taught the same task again, it would perform better. As MAML proceeds, the model converges to a state from which it can learn any task in the distribution.
7/n
May 20, 2025 at 7:12 PM
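[Editor's note: a minimal first-order sketch of the MAML loop described above, in a deliberately tiny setting (one scalar weight, toy regression tasks; not the paper's actual model or tasks):]

import random

# Each task: fit y = a * x for a task-specific slope a; the model is a
# single weight w. MAML adjusts the shared initialization w so that one
# inner-loop gradient step on a new task already works well.
def grad(w, a, xs):
    # derivative of the mean squared error between w*x and a*x, with respect to w
    return sum(2 * (w * x - a * x) * x for x in xs) / len(xs)

w = 0.0                      # shared initialization (what MAML learns)
inner_lr, outer_lr = 0.1, 0.01
rng = random.Random(0)

for step in range(2000):
    a = rng.uniform(-2.0, 2.0)                 # sample a task
    xs = [rng.uniform(-1.0, 1.0) for _ in range(10)]
    w_task = w - inner_lr * grad(w, a, xs)     # adapt to this task (inner loop)
    # Outer loop: nudge the initialization so the adapted weights do better
    # (first-order approximation of the MAML update).
    w -= outer_lr * grad(w_task, a, xs)

print(w)  # settles near the center of the task distribution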
Our approach (inductive bias distillation) has 3 steps:
1. Use a Bayesian model to define an inductive bias (a prior)
2. Sample learning tasks from the Bayesian model
3. Have a neural network meta-learn from these sampled tasks, to give it the Bayesian model's prior
5/n
May 20, 2025 at 7:07 PM
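[Editor's note: the three steps in skeleton form, with toy stand-ins throughout: the "Bayesian model" here is just a Beta prior over a coin's bias, and the "meta-learner" is a running average rather than a neural network.]

import random

rng = random.Random(0)

# Step 1: a Bayesian model defines the prior (here, a Beta(2, 2) prior
# over a coin's bias, standing in for a prior over languages).
def sample_from_prior():
    return rng.betavariate(2, 2)

# Step 2: sample learning tasks from the Bayesian model; each task is a
# small dataset generated by one draw from the prior.
def sample_task(n_examples=20):
    bias = sample_from_prior()
    return [1 if rng.random() < bias else 0 for _ in range(n_examples)]

# Step 3: meta-learn from the sampled tasks (placeholder update; the real
# method runs MAML on a neural network so the network absorbs the prior).
estimate = 0.0
for _ in range(2000):
    task = sample_task()
    estimate += 0.01 * (sum(task) / len(task) - estimate)

print(estimate)  # the learner has internalized the prior's mean (about 0.5)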
Neural networks have flexible representations that allow them to handle noisy natural data - as evidenced by the success of large language models. However, they notoriously require huge numbers of examples.
3/n
May 20, 2025 at 7:06 PM
Bayesian models can learn from few examples because they have strong inductive biases - factors that guide generalization. But the costs of inference and the difficulty of specifying generative models can make naturalistic data a challenge.
2/n
May 20, 2025 at 7:05 PM
🤖🧠 Paper out in Nature Communications! 🧠🤖
Bayesian models can learn rapidly. Neural networks can handle messy, naturalistic data. How can we combine these strengths?
Our answer: Use meta-learning to distill Bayesian priors into a neural network!
www.nature.com/articles/s41...
1/n
May 20, 2025 at 7:04 PM
I constructed today's NYT crossword!
This one has some personal connections, described in the WordPlay article by @samcorbin.bsky.social (contains spoilers): www.nytimes.com/2025/05/06/c...
I hope you enjoy!
May 7, 2025 at 5:32 PM
Made a new assignment for a class on Computational Psycholinguistics:
- I trained a Transformer language model on sentences sampled from a PCFG
- The students' task: Given the Transformer, try to infer the PCFG (w/ a leaderboard for who got closest)
Would recommend!
1/n
May 2, 2025 at 3:30 PM
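[Editor's note: for anyone who wants to recreate the setup, here is the data-generation side in miniature, using an example grammar of my own rather than the one from the assignment: sample sentences from a small PCFG, then train a Transformer language model on the samples.]

import random

# A small probabilistic context-free grammar: nonterminal -> list of
# (right-hand side, probability) pairs; symbols not in the table are terminals.
PCFG = {
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("the", "N"), 0.7), (("the", "A", "N"), 0.3)],
    "VP": [(("V", "NP"), 0.6), (("V",), 0.4)],
    "N":  [(("cat",), 0.5), (("dog",), 0.5)],
    "A":  [(("big",), 1.0)],
    "V":  [(("saw",), 0.5), (("chased",), 0.5)],
}

def expand(symbol, rng):
    if symbol not in PCFG:                      # terminal symbol
        return [symbol]
    rules, weights = zip(*PCFG[symbol])
    rhs = rng.choices(rules, weights=weights)[0]
    return [word for child in rhs for word in expand(child, rng)]

rng = random.Random(0)
for _ in range(5):
    print(" ".join(expand("S", rng)))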
I added a new assignment to my Computational Linguistics class last semester:
- Choose a linguistic phenomenon in a language other than English
- Give a 3-minute presentation about that phenomenon & how it would pose a challenge for computational models
Would recommend!
1/n
January 28, 2025 at 8:28 PM
Like previous models, o1-preview shows clear effects of output probability (performing better when the correct answer is a high-probability string).
Interestingly, the effects don’t just show up in accuracy (top) but also in how many tokens o1 consumes to perform the task (bottom)!
3/4
January 20, 2025 at 7:31 PM
o1 shows especially big improvements over previous models when performing rare versions of tasks (left plot). But, when the tasks are hard enough, it still does better on common task variants than rare ones (right two plots)
2/4
January 20, 2025 at 7:29 PM
🔥While LLM reasoning is on people's minds...
Here's a shameless plug for our work comparing o1 to previous LLMs (extending "Embers of Autoregression"): arxiv.org/abs/2410.01792
- o1 shows big improvements over GPT-4
- But qualitatively it is still sensitive to probability
1/4
January 20, 2025 at 7:28 PM
Excited to be visiting the Simons Institute tomorrow for a debate with Sébastien Bubeck - billed as the Sparks/Embers debate! 🔥🤖🧠
Topic: Will scaling current LLMs be sufficient to resolve major open math conjectures?
December 4, 2024 at 9:19 PM
🤖🧠 I'll be considering applications for postdocs & PhD students to start at Yale in Fall 2025!
If you are interested in the intersection of linguistics, cognitive science, and AI, I encourage you to apply!
Postdoc link: rtmccoy.com/prospective_...
PhD link: rtmccoy.com/prospective_...
November 14, 2024 at 9:39 PM
"Worldpay" - anagram of "Wordplay"!
November 10, 2024 at 1:51 AM
"Worldpay" - anagram of "Wordplay"!