David Espejo
@davidmirror.bsky.social
Avid learner. OSS maintainer. Books, running, DistSys, and dogs. Proud father and husband.
A tokenizer-free transformer architecture proposed by Meta makes a lot of sense: more efficient compute allocation and increased flexibility for complex use cases.
ai.meta.com/research/pub...
Byte Latent Transformer: Patches Scale Better Than Tokens | Research - AI at Meta
We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at...
ai.meta.com
December 18, 2024 at 12:47 PM
TIL: weight tying improves performance by sharing (tying) the embedding and softmax matrices, since the vectors they learn tend to end up with low cosine distance from each other.
A comparison:
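For a comparison in code, here's a minimal PyTorch sketch (the TinyLM class and its sizes are made up for illustration): with tying, the output head and the token embedding are literally the same tensor.

import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy model illustrating weight tying (hypothetical, no transformer blocks)."""
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Weight tying: the output (softmax) projection reuses the embedding matrix,
        # saving vocab_size * d_model parameters.
        self.lm_head.weight = self.tok_emb.weight

    def forward(self, token_ids):
        hidden = self.tok_emb(token_ids)   # (batch, seq, d_model)
        return self.lm_head(hidden)        # (batch, seq, vocab_size) logits

model = TinyLM()
assert model.lm_head.weight is model.tok_emb.weight  # one shared tensor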
December 5, 2024 at 11:00 AM
TIL that this site exists: paperswithcode.com/method/layer...
Concise explanations with both the paper and the code available in a single place for deeper understanding. Super useful!
Papers with Code - Layer Normalization Explained
Unlike batch normalization, Layer Normalization directly estimates the normalization statistics from the summed inputs to the neurons within a hidden layer so the normalization does not introduce any ...
paperswithcode.com
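Here's a minimal NumPy sketch of what layer normalization computes (the learnable scale and bias are omitted for brevity; an illustration, not the library implementation): the statistics come from each sample's own hidden activations, not from the batch.

import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row (one sample's hidden activations) to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)   # per-sample statistics, unlike batch norm
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.randn(2, 4)                   # (batch, hidden)
print(layer_norm(x).mean(axis=-1))          # approximately 0 for every sample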
November 29, 2024 at 10:30 AM
TIL: the GELU activation function is usually preferred over the simpler ReLU because it's smoother, allowing small yet non-zero outputs for negative values in the input tensor. This means that neurons with negative inputs still contribute to the learning process, albeit to a lesser extent than positive ones.
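A quick NumPy sketch using the widely used tanh approximation of GELU (illustrative values only):

import numpy as np

def gelu(x):
    # Tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(gelu(x))  # negative inputs yield small negative outputs, e.g. gelu(-0.5) is roughly -0.15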
November 28, 2024 at 10:15 AM
TIL: ReLU is a simple activation function that replaces every negative value in the input tensor with zero. Activation functions like this introduce non-linearity into neural networks, enabling them to learn more complex patterns.
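In code it's a one-liner (NumPy here, just for illustration):

import numpy as np

def relu(x):
    # Element-wise max(0, x): negatives become zero, positives pass through unchanged
    return np.maximum(0.0, x)

print(relu(np.array([-1.5, -0.2, 0.0, 0.3, 2.0])))  # [0.  0.  0.  0.3 2. ]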
November 22, 2024 at 10:40 AM
TIL: Multi-head attention (MHA) is a technique that lets the model run several self-attention operations over the same context in parallel, each in its own representation subspace, and then combine the results. You can either 1) stack multiple single-head attention layers or ->
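A toy NumPy sketch of the parallel-heads idea (random, untrained weights and no masking; purely illustrative): each head attends over the same tokens in its own lower-dimensional subspace, and the head outputs are concatenated and projected back.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, num_heads, rng):
    # x: (seq_len, d_model) token representations
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Each head projects the same input into its own query/key/value subspace
        wq, wk, wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
        q, k, v = x @ wq, x @ wk, x @ wv
        weights = softmax(q @ k.T / np.sqrt(d_head))   # (seq_len, seq_len) attention weights
        heads.append(weights @ v)                       # (seq_len, d_head)
    # Combine: concatenate head outputs and project back to d_model
    w_out = rng.standard_normal((num_heads * d_head, d_model))
    return np.concatenate(heads, axis=-1) @ w_out

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 16))                            # 5 tokens, d_model = 16
print(multi_head_attention(x, num_heads=4, rng=rng).shape)  # (5, 16)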
November 21, 2024 at 10:39 AM
What if an LLM receives a word out of its vocabulary?
GPT-like models use a byte-pair encoder that breaks words into subwords and even single characters to handle unknown words without introducing special tokens.
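A toy sketch of that fallback behavior (made-up vocabulary and greedy longest-match segmentation, not GPT's actual learned merge rules): an out-of-vocabulary word still gets encoded, just as several smaller pieces.

# Made-up subword vocabulary; real BPE vocabularies are learned from corpus statistics
vocab = {"trans", "form", "er", "un", "believ", "able", "t", "r", "a", "n", "s", "f", "o", "m", "e"}

def segment(word):
    # Greedy longest-match segmentation with single-character fallback
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest candidate first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:                               # character never seen: keep it as its own piece
            pieces.append(word[i])
            i += 1
    return pieces

print(segment("transformer"))   # ['trans', 'form', 'er']
print(segment("unbelievable"))  # ['un', 'believ', 'able']
print(segment("qzt"))           # ['q', 'z', 't'] -- no special <UNK> token required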
November 5, 2024 at 10:26 AM
TIL: padding is a technique (a trick?) to align text lengths when training LLMs with batch sizes > 1. Shorter texts are extended ("padded") up to the length of the longest text in the batch.
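A minimal sketch with made-up token ids: every sequence is right-padded with a reserved pad id up to the longest one in the batch, and a mask records which positions are real so the model can ignore the padding.

batch = [[5, 12, 9], [7, 3], [4, 4, 8, 2, 6]]   # toy tokenized texts of different lengths
pad_id = 0                                       # id reserved for the padding token

max_len = max(len(seq) for seq in batch)
padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]

print(padded)  # [[5, 12, 9, 0, 0], [7, 3, 0, 0, 0], [4, 4, 8, 2, 6]]
print(mask)    # [[1, 1, 1, 0, 0], [1, 1, 0, 0, 0], [1, 1, 1, 1, 1]]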
November 5, 2024 at 10:08 AM