Danny Wood
@echostatements.bsky.social
AI Research Scientist || PhD in machine learning || Ensembles, probabilistic machine learning, recurrent neural networks || https://echostatements.github.io
Answering my own question, the example from the "Infinite-dimensional vector function" Wikipedia page makes it feel clear that a curve with these properties will exist

Even so, without more thought, this feels more like it side-steps the issue in my intuition than directly tackles it
September 28, 2025 at 10:30 AM
Looking at the code in Hugging Face's library for the new GPT models, I was a bit disappointed by how similar they are to Llama & Mistral models, but there is one cool trick I hadn't seen before: attention sinks. These are a mechanism by which attention heads can say "I don't have anything to add"
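Roughly, the idea as I understand it (a minimal PyTorch sketch of the mechanism, not the Hugging Face code; `sink_logits` is just my name for the extra per-head parameter): each head gets a learnable logit that joins the softmax normalisation but carries no value, so the attention weights over the actual tokens can sum to less than one.

```python
import torch
import torch.nn.functional as F

def attention_with_sinks(q, k, v, sink_logits):
    """Scaled dot-product attention with a learnable per-head sink logit.

    q, k, v: (batch, heads, seq, dim); sink_logits: (heads,).
    The sink takes part in the softmax normalisation but contributes no value,
    so a head can dump weight onto it and effectively say "nothing to add".
    """
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5           # (b, h, s, s)
    sink = sink_logits.view(1, -1, 1, 1).expand(*scores.shape[:-1], 1)
    scores = torch.cat([scores, sink], dim=-1)                      # append sink column
    weights = F.softmax(scores, dim=-1)[..., :-1]                   # drop the sink weight
    return weights @ v                                              # rows may sum to < 1
```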
August 7, 2025 at 4:29 PM
A little etymological fact that I like is that "lemma" and "dilemma" are from the same ancient Greek origin

If you translate "lemma" as meaning a proposition, a dilemma is literally having two propositions to consider
July 22, 2025 at 12:59 PM
And even better, these tricks are not mutually exclusive: by doing both simultaneously, you get a 2.5x speed-up (depending on batch size)
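Putting the two together in one rough PyTorch sketch (names like `encoder` and `margin`, and the contrastive-style loss, are my illustrative choices rather than the exact code from this thread):

```python
import torch
import torch.nn.functional as F

def training_step(encoder, anchors, positives, margin=1.0):
    """One step using both tricks: a single stacked forward pass through the
    shared subnetwork, plus non-matching pairs built by rolling the outputs."""
    stacked = torch.cat([anchors, positives], dim=0)   # trick 1: one call, not two
    emb_a, emb_p = encoder(stacked).chunk(2, dim=0)
    emb_n = emb_p.roll(shifts=1, dims=0)               # trick 2: mismatched pairs for free
    pos_dist = F.pairwise_distance(emb_a, emb_p)
    neg_dist = F.pairwise_distance(emb_a, emb_n)
    # contrastive-style loss: pull matching pairs together, push shifted pairs apart
    return pos_dist.pow(2).mean() + F.relu(margin - neg_dist).pow(2).mean()
```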

8/9 🧵
July 18, 2025 at 12:17 PM
Again, this gives a pretty significant speed-up

7/9 🧵
July 18, 2025 at 12:17 PM
Optimisation 2:

When training Siamese networks, people tend to generate matching/non-matching pairs in equal ratio. However, you can train more efficiently if you generate only matching pairs, then create the non-matching ones by shifting the subnetwork outputs.
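A rough sketch of how that shifting might look (my own naming and labelling scheme, not code from the thread):

```python
import torch

def add_shifted_negatives(emb_a, emb_b):
    """Given embeddings of matching pairs (emb_a[i] matches emb_b[i]), build
    non-matching pairs by rolling emb_b one position, so each emb_a[i] is also
    paired with a mismatched embedding at no extra forward-pass cost."""
    shifted_b = torch.roll(emb_b, shifts=1, dims=0)
    pairs = (torch.cat([emb_a, emb_a]), torch.cat([emb_b, shifted_b]))
    labels = torch.cat([torch.ones(len(emb_a)), torch.zeros(len(emb_a))])
    return pairs, labels  # matching pairs labelled 1, shifted pairs labelled 0
```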

6/9 🧵
July 18, 2025 at 12:17 PM
Optimisation 1:

In practice, when implementing these networks, there is only one subnetwork, which is called twice: once for each input

But by stacking the inputs, we actually only need to make one call to the network:
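Something like this sketch (not the exact code from the post):

```python
import torch

def stacked_forward(subnetwork, x1, x2):
    """Concatenate the two input batches, run the shared subnetwork once,
    then split the result back into the two sets of embeddings."""
    stacked = torch.cat([x1, x2], dim=0)   # (2 * batch, ...)
    embeddings = subnetwork(stacked)       # one forward pass instead of two
    return embeddings.chunk(2, dim=0)      # embeddings for x1 and for x2
```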

Depending on batch size, the effect can be significant

5/9 🧵
July 18, 2025 at 12:17 PM
Firstly, what is a Siamese neural network?

A Siamese neural network consists of two identical subnetworks and a comparison layer

The two subnetworks generate representations for two different inputs, then the comparison layer measures the similarity between these representations
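In code, the structure looks roughly like this (the shared `subnetwork`, the absolute-difference comparison, and the sigmoid output are common choices I'm using for illustration, not necessarily the exact setup later in this thread):

```python
import torch
import torch.nn as nn

class SiameseNetwork(nn.Module):
    """Two 'copies' of the same subnetwork (i.e. shared weights) feeding a
    comparison layer that scores how similar two inputs are."""

    def __init__(self, subnetwork, embedding_dim):
        super().__init__()
        self.subnetwork = subnetwork
        self.comparison = nn.Linear(embedding_dim, 1)

    def forward(self, x1, x2):
        emb1 = self.subnetwork(x1)  # representation of the first input
        emb2 = self.subnetwork(x2)  # representation of the second input
        # comparison layer on the element-wise difference of the two representations
        return torch.sigmoid(self.comparison(torch.abs(emb1 - emb2)))
```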

3/9 🧵
July 18, 2025 at 12:17 PM
Siamese neural networks are a neat way of doing one-shot/few-shot classification, but there's something about the way they're usually implemented that felt a bit inefficient to me

Here are a couple of relatively simple tricks you can do to speed up training Siamese networks by 2.5x! 1/9 🧵
July 18, 2025 at 12:17 PM
If you've not read it, I highly recommend "Admissible probability measurement procedures" by Shuford et al. (1966) too

It's really well written, and if you replaced "students" with "models" in the text, it would read like a fairly modern machine learning paper

link.springer.com/article/10.1...
May 2, 2025 at 10:33 AM