David Espejo
@davidmirror.bsky.social
Avid learner. OSS maintainer. Books, running, DistSys, and dogs. Proud father and husband.
A tokenizer-free transformer architecture proposed by Meta makes a lot of sense: more efficient compute allocation and increased flexibility for complex use cases.
ai.meta.com/research/pub...
Byte Latent Transformer: Patches Scale Better Than Tokens | Research - AI at Meta
We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at...
ai.meta.com
December 18, 2024 at 12:47 PM
TIL: weight tying improves performance by sharing (tying) the embedding and softmax matrices, since the vectors they learn tend to end up with low cosine distance from each other.
A comparison:
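For a comparison in code, here's a minimal PyTorch sketch (the TinyLM class and its sizes are made up for illustration): with tying, the output head and the token embedding are literally the same tensor.

import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy model illustrating weight tying (hypothetical, no transformer blocks)."""
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Weight tying: the output (softmax) projection reuses the embedding matrix,
        # saving vocab_size * d_model parameters.
        self.lm_head.weight = self.tok_emb.weight

    def forward(self, token_ids):
        hidden = self.tok_emb(token_ids)   # (batch, seq, d_model)
        return self.lm_head(hidden)        # (batch, seq, vocab_size) logits

model = TinyLM()
assert model.lm_head.weight is model.tok_emb.weight  # one shared tensor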
December 5, 2024 at 11:00 AM
TIL that this site exists: paperswithcode.com/method/layer...
Concise explanations with both the paper and the code available in a single place for deeper understanding. Super useful!
Papers with Code - Layer Normalization Explained
Unlike batch normalization, Layer Normalization directly estimates the normalization statistics from the summed inputs to the neurons within a hidden layer so the normalization does not introduce any ...
paperswithcode.com
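Here's a minimal NumPy sketch of what layer normalization computes (the learnable scale and bias are omitted for brevity; an illustration, not the library implementation): the statistics come from each sample's own hidden activations, not from the batch.

import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row (one sample's hidden activations) to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)   # per-sample statistics, unlike batch norm
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.randn(2, 4)                   # (batch, hidden)
print(layer_norm(x).mean(axis=-1))          # approximately 0 for every sample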
November 29, 2024 at 10:30 AM
TIL: the GELU activation function is usually preferred over the simpler ReLU because it's smoother, allowing small yet non-zero outputs for negative values in the input tensor. This means that neurons with negative inputs still contribute to the learning process, albeit to a lesser extent than positive ones.
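A quick NumPy sketch using the widely used tanh approximation of GELU (illustrative values only):

import numpy as np

def gelu(x):
    # Tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(gelu(x))  # negative inputs yield small negative outputs, e.g. gelu(-0.5) is roughly -0.15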
November 28, 2024 at 10:15 AM
TIL: ReLU is a simple activation function that replaces every negative value in the input tensor with zero. Activation functions like this introduce non-linearity into neural networks, enabling them to learn more complex patterns.
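In code it's a one-liner (NumPy here, just for illustration):

import numpy as np

def relu(x):
    # Element-wise max(0, x): negatives become zero, positives pass through unchanged
    return np.maximum(0.0, x)

print(relu(np.array([-1.5, -0.2, 0.0, 0.3, 2.0])))  # [0.  0.  0.  0.3 2. ]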
November 22, 2024 at 10:40 AM
TIL: Multi-head attention (MHA) is a technique that lets the model run several self-attention operations over the same context in parallel, each in its own representation subspace, and then combine the results. You can either 1) stack multiple single-head attention layers or ->
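A toy NumPy sketch of the parallel-heads idea (random, untrained weights and no masking; purely illustrative): each head attends over the same tokens in its own lower-dimensional subspace, and the head outputs are concatenated and projected back.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, num_heads, rng):
    # x: (seq_len, d_model) token representations
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Each head projects the same input into its own query/key/value subspace
        wq, wk, wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
        q, k, v = x @ wq, x @ wk, x @ wv
        weights = softmax(q @ k.T / np.sqrt(d_head))   # (seq_len, seq_len) attention weights
        heads.append(weights @ v)                       # (seq_len, d_head)
    # Combine: concatenate head outputs and project back to d_model
    w_out = rng.standard_normal((num_heads * d_head, d_model))
    return np.concatenate(heads, axis=-1) @ w_out

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 16))                            # 5 tokens, d_model = 16
print(multi_head_attention(x, num_heads=4, rng=rng).shape)  # (5, 16)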
November 21, 2024 at 10:39 AM
What if an LLM receives a word out of its vocabulary?
GPT-like models use a byte-pair encoder that breaks words into subwords and even single characters to handle unknown words without introducing special tokens.
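A toy sketch of that fallback behavior (made-up vocabulary and greedy longest-match segmentation, not GPT's actual learned merge rules): an out-of-vocabulary word still gets encoded, just as several smaller pieces.

# Made-up subword vocabulary; real BPE vocabularies are learned from corpus statistics
vocab = {"trans", "form", "er", "un", "believ", "able", "t", "r", "a", "n", "s", "f", "o", "m", "e"}

def segment(word):
    # Greedy longest-match segmentation with single-character fallback
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest candidate first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:                               # character never seen: keep it as its own piece
            pieces.append(word[i])
            i += 1
    return pieces

print(segment("transformer"))   # ['trans', 'form', 'er']
print(segment("unbelievable"))  # ['un', 'believ', 'able']
print(segment("qzt"))           # ['q', 'z', 't'] -- no special <UNK> token required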
November 5, 2024 at 10:26 AM
TIL: padding is a technique (a trick?) to align text lengths when training LLMs with batch sizes > 1. Shorter texts are extended ("padded") up to the length of the longest text in the batch.
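A minimal sketch with made-up token ids: every sequence is right-padded with a reserved pad id up to the longest one in the batch, and a mask records which positions are real so the model can ignore the padding.

batch = [[5, 12, 9], [7, 3], [4, 4, 8, 2, 6]]   # toy tokenized texts of different lengths
pad_id = 0                                       # id reserved for the padding token

max_len = max(len(seq) for seq in batch)
padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]

print(padded)  # [[5, 12, 9, 0, 0], [7, 3, 0, 0, 0], [4, 4, 8, 2, 6]]
print(mask)    # [[1, 1, 1, 0, 0], [1, 1, 0, 0, 0], [1, 1, 1, 1, 1]]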
November 5, 2024 at 10:08 AM