Danny Wood
@echostatements.bsky.social
AI Research Scientist || PhD in machine learning || Ensembles, probabilistic machine learning, recurrent neural networks || https://echostatements.github.io
Answering my own question, the example from the "Infinite-dimensional vector function" Wikipedia page makes it feel clear that a curve with these properties will exist

Even so, without more thought, this feels more like it side-steps the issue in my intuition than directly tackles it
September 28, 2025 at 10:30 AM
Looking at the code in Hugging Face's library for the new GPT models, I was a bit disappointed by how similar they are to Llama & Mistral models, but there is one cool trick I hadn't seen before: attention sinks. These are a mechanism by which attention heads can say "I don't have anything to add"
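Roughly, the idea as I understand it (a minimal PyTorch sketch of the mechanism, not the Hugging Face code; `sink_logits` is just my name for the extra per-head parameter): each head gets a learnable logit that joins the softmax normalisation but carries no value, so the attention weights over the actual tokens can sum to less than one.

```python
import torch
import torch.nn.functional as F

def attention_with_sinks(q, k, v, sink_logits):
    """Scaled dot-product attention with a learnable per-head sink logit.

    q, k, v: (batch, heads, seq, dim); sink_logits: (heads,).
    The sink takes part in the softmax normalisation but contributes no value,
    so a head can dump weight onto it and effectively say "nothing to add".
    """
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5           # (b, h, s, s)
    sink = sink_logits.view(1, -1, 1, 1).expand(*scores.shape[:-1], 1)
    scores = torch.cat([scores, sink], dim=-1)                      # append sink column
    weights = F.softmax(scores, dim=-1)[..., :-1]                   # drop the sink weight
    return weights @ v                                              # rows may sum to < 1
```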
August 7, 2025 at 4:29 PM
A little etymological fact that I like is that "lemma" and "dilemma" are from the same ancient Greek origin

If you translate "lemma" as meaning a proposition, a dilemma is literally having two propositions to consider
July 22, 2025 at 12:59 PM
And even better, these tricks are not mutually exclusive: by doing both simultaneously, you get a 2.5x speed-up (depending on batch size)
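Putting the two together in one rough PyTorch sketch (names like `encoder` and `margin`, and the contrastive-style loss, are my illustrative choices rather than the exact code from this thread):

```python
import torch
import torch.nn.functional as F

def training_step(encoder, anchors, positives, margin=1.0):
    """One step using both tricks: a single stacked forward pass through the
    shared subnetwork, plus non-matching pairs built by rolling the outputs."""
    stacked = torch.cat([anchors, positives], dim=0)   # trick 1: one call, not two
    emb_a, emb_p = encoder(stacked).chunk(2, dim=0)
    emb_n = emb_p.roll(shifts=1, dims=0)               # trick 2: mismatched pairs for free
    pos_dist = F.pairwise_distance(emb_a, emb_p)
    neg_dist = F.pairwise_distance(emb_a, emb_n)
    # contrastive-style loss: pull matching pairs together, push shifted pairs apart
    return pos_dist.pow(2).mean() + F.relu(margin - neg_dist).pow(2).mean()
```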

8/9 🧵
July 18, 2025 at 12:17 PM
Again, this gives a pretty significant speed-up

7/9 🧵
July 18, 2025 at 12:17 PM
Optimisation 2:

When training Siamese networks, people tend to generate matching/non-matching pairs in equal ratio. However, you can train more efficiently if you generate only matching pairs, then create the non-matching ones by shifting the subnetwork outputs.
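A rough sketch of how that shifting might look (my own naming and labelling scheme, not code from the thread):

```python
import torch

def add_shifted_negatives(emb_a, emb_b):
    """Given embeddings of matching pairs (emb_a[i] matches emb_b[i]), build
    non-matching pairs by rolling emb_b one position, so each emb_a[i] is also
    paired with a mismatched embedding at no extra forward-pass cost."""
    shifted_b = torch.roll(emb_b, shifts=1, dims=0)
    pairs = (torch.cat([emb_a, emb_a]), torch.cat([emb_b, shifted_b]))
    labels = torch.cat([torch.ones(len(emb_a)), torch.zeros(len(emb_a))])
    return pairs, labels  # matching pairs labelled 1, shifted pairs labelled 0
```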

6/9 🧵
July 18, 2025 at 12:17 PM
Optimisation 1:

In practice, when implementing these networks, there is only one subnetwork, which is called twice: once for each input

But by stacking the inputs, we actually only need to make one call to the network:
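Something like this sketch (not the exact code from the post):

```python
import torch

def stacked_forward(subnetwork, x1, x2):
    """Concatenate the two input batches, run the shared subnetwork once,
    then split the result back into the two sets of embeddings."""
    stacked = torch.cat([x1, x2], dim=0)   # (2 * batch, ...)
    embeddings = subnetwork(stacked)       # one forward pass instead of two
    return embeddings.chunk(2, dim=0)      # embeddings for x1 and for x2
```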

Depending on batch size, the effect can be significant

5/9 🧵
July 18, 2025 at 12:17 PM
Firstly, what is a Siamese neural network?

A Siamese neural network consists of two identical subnetworks and a comparison layer

The two subnetworks generate representations for two different inputs, then the comparison layer measures the similarity between these representations
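In code, the structure looks roughly like this (the shared `subnetwork`, the absolute-difference comparison, and the sigmoid output are common choices I'm using for illustration, not necessarily the exact setup later in this thread):

```python
import torch
import torch.nn as nn

class SiameseNetwork(nn.Module):
    """Two 'copies' of the same subnetwork (i.e. shared weights) feeding a
    comparison layer that scores how similar two inputs are."""

    def __init__(self, subnetwork, embedding_dim):
        super().__init__()
        self.subnetwork = subnetwork
        self.comparison = nn.Linear(embedding_dim, 1)

    def forward(self, x1, x2):
        emb1 = self.subnetwork(x1)  # representation of the first input
        emb2 = self.subnetwork(x2)  # representation of the second input
        # comparison layer on the element-wise difference of the two representations
        return torch.sigmoid(self.comparison(torch.abs(emb1 - emb2)))
```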

3/9 🧵
July 18, 2025 at 12:17 PM
Siamese neural networks are a neat way of doing one-shot/few-shot classification, but there's something about the way they're usually implemented that felt a bit inefficient to me

Here are a couple of relatively simple tricks you can do to speed up training Siamese networks by 2.5x! 1/9 🧵
July 18, 2025 at 12:17 PM
If you've not read it, I highly recommend "Admissible probability measurement procedures" by Shuford et al. (1966) too

It's really well written, and if you replaced "students" with "models" in the text, it would read like a fairly modern machine learning paper

link.springer.com/article/10.1...
May 2, 2025 at 10:33 AM