François Fleuret
@francois.fleuret.org
Research Scientist Meta/FAIR, Prof. University of Geneva, co-founder Neural Concept SA. I like reality.
https://fleuret.org
The voc corresponding to the logits
September 10, 2025 at 12:00 PM
- Ring Attention: takes advantage of multi-node hardware to scale the computation according to the sequence length
- Speculative decoding: a cheaper model generates tokens, and a rejection process corrects this generation to match the full-model distribution.
April 28, 2025 at 6:50 AM
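A minimal sketch of the accept/reject step of the speculative decoding described above, with toy categorical distributions standing in for the draft and full models (distributions, sizes, and the function name are illustrative assumptions; the "bonus" token drawn from the full model when every draft token is accepted is omitted for brevity).

```python
# Sketch of speculative decoding's rejection step, which keeps the output
# distributed exactly as the full (target) model would generate it.
import torch

def speculative_step(target_probs, draft_probs, draft_tokens):
    # target_probs, draft_probs: (K, V) next-token distributions at each draft position
    # draft_tokens: (K,) tokens sampled from draft_probs
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i, tok], draft_probs[i, tok]
        if torch.rand(()) < torch.clamp(p / q, max=1.0):
            out.append(tok)                      # accepted: token already ~ target
        else:
            # rejected: resample from the residual max(0, p_target - p_draft)
            residual = torch.clamp(target_probs[i] - draft_probs[i], min=0.0)
            out.append(torch.multinomial(residual / residual.sum(), 1).squeeze())
            return torch.stack(out)              # stop at the first rejection
    return torch.stack(out)

V, K = 16, 4                                     # toy vocabulary size and draft length
draft_probs = torch.softmax(torch.randn(K, V), dim=-1)
target_probs = torch.softmax(torch.randn(K, V), dim=-1)
draft_tokens = torch.multinomial(draft_probs, 1).squeeze(-1)
print(speculative_step(target_probs, draft_probs, draft_tokens))
```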
- Multi-token prediction: sums the training loss over multiple future tokens, possibly with additional readout heads.
- FlashAttention: computes the attention on the fly, avoiding an O(T^2) memory footprint (+ optimizes very carefully for the GPU!)
April 28, 2025 at 6:49 AM
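A minimal sketch of the multi-token prediction loss mentioned above: one extra linear readout head per future offset, cross-entropy losses summed (module and parameter names are illustrative assumptions, not taken from a specific codebase).

```python
import torch, torch.nn as nn, torch.nn.functional as F

class MultiTokenHead(nn.Module):
    def __init__(self, dim, vocab_size, n_future=2):
        super().__init__()
        # one readout head per predicted offset t+1, ..., t+n_future
        self.heads = nn.ModuleList([nn.Linear(dim, vocab_size) for _ in range(n_future)])

    def loss(self, hidden, tokens):
        # hidden: (B, T, dim) backbone activations, tokens: (B, T) token ids
        total = 0.0
        for k, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-k])             # predict the token at position t+k
            target = tokens[:, k:]
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), target.reshape(-1)
            )
        return total

B, T, dim, vocab = 2, 16, 32, 100
mtp = MultiTokenHead(dim, vocab)
print(mtp.loss(torch.randn(B, T, dim), torch.randint(vocab, (B, T))))
```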
- Warmup: very short ramping-up of the learning rate, starting from 0
- Cosine schedule: the learning rate varies less at the beginning and end of the schedule
- AdamW: decouples the weight decay from the Adam update
April 28, 2025 at 6:49 AM
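A minimal sketch of the recipe above with PyTorch's AdamW and a LambdaLR schedule: a short linear warmup from 0, then a cosine decay (the horizon, warmup length, learning rate, and the toy model are illustrative).

```python
import math
import torch

model = torch.nn.Linear(10, 10)                       # stand-in model
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

warmup_steps, total_steps = 100, 10_000

def lr_factor(step):
    if step < warmup_steps:                           # short linear ramp starting from 0
        return step / warmup_steps
    # cosine from 1 down to 0: the rate varies least near the start and end of the decay
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_factor)

for step in range(total_steps):
    # forward pass and loss.backward() would go here
    opt.step()
    sched.step()
```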
- RoPE (Rotary Positional Embedding): makes the attention depend only on the relative Q/K positions
- MoE (Mixture of Experts): The FFN block is implemented with multiple MLPs and a gating mechanism selects which ones process each token.
April 28, 2025 at 6:49 AM
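A minimal sketch of RoPE as described above: each (even, odd) feature pair of q and k is rotated by an angle proportional to the position, so the q/k dot product depends only on the relative offset (dimensions and the interleaved pairing convention are illustrative; implementations differ on the pairing).

```python
import torch

def rope(x, pos, base=10000.0):
    # x: (..., T, D) with D even; pos: (T,) integer positions
    D = x.size(-1)
    inv_freq = base ** (-torch.arange(0, D, 2, dtype=torch.float32) / D)   # (D/2,)
    angles = pos[:, None].float() * inv_freq[None, :]                      # (T, D/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin      # rotate each (x1, x2) pair by the angle
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q, k = torch.randn(1, 64), torch.randn(1, 64)
# same relative offset (j - i = 3) at two absolute positions -> same attention score
s1 = (rope(q, torch.tensor([5])) * rope(k, torch.tensor([8]))).sum()
s2 = (rope(q, torch.tensor([100])) * rope(k, torch.tensor([103]))).sum()
print(s1.item(), s2.item())                   # nearly identical
```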
- RMSNorm instead of LayerNorm: normalizes only the scaling, with no mean subtraction
- MLA (Multi-head Latent Attention): stores a low-rank projection of the attention block input and computes the KV from it
- SwiGLU: non-linearity for the FFN block with per-component gating
April 28, 2025 at 6:48 AM
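Minimal sketches of two of the blocks above: RMSNorm (scale-only normalization, no mean subtraction or bias) and a SwiGLU feed-forward (a SiLU-activated gate multiplies the up-projection component-wise). Sizes and names are illustrative.

```python
import torch, torch.nn as nn, torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # divide by the root-mean-square of the features; no centering, no bias
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # SiLU(gate) multiplies the "up" projection component-wise
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 16, 64)
print(SwiGLU(64, 256)(RMSNorm(64)(x)).shape)
```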
- Prenorm: normalization in the residual blocks before the attention operation and the FFN respectively
- GQA (Grouped-Query Attention): more Q heads than (K, V) heads
April 28, 2025 at 6:47 AM
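A minimal sketch of grouped-query attention as listed above: 8 query heads share 2 key/value heads, here simply by repeating K and V across each group before a standard attention call (all sizes are illustrative).

```python
import torch
import torch.nn.functional as F

B, T, n_q, n_kv, d = 2, 10, 8, 2, 16          # 8 query heads, 2 shared KV heads

q = torch.randn(B, n_q, T, d)
k = torch.randn(B, n_kv, T, d)
v = torch.randn(B, n_kv, T, d)

# each group of n_q // n_kv query heads attends to the same K/V head
k = k.repeat_interleave(n_q // n_kv, dim=1)   # (B, n_q, T, d)
v = v.repeat_interleave(n_q // n_kv, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)                              # (B, n_q, T, d)
```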
Yes, it's awesome. The kind of work that opens up a whole new and important field.
March 12, 2025 at 6:09 AM
If your task is not resolution-agnostic, do not use normalized p-e.
All this being said, putting both normalized and non-normalized cannot hurt methinks.
February 28, 2025 at 7:41 AM
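A hedged sketch of the two options in this exchange, under the assumption that "normalized p-e" means encoding the relative position t/T rather than the absolute index t: the normalized variant is resolution-agnostic (the code for "halfway through" is the same for any length), while the absolute one keeps a fixed notion of "position 10" for any length. The sinusoidal forms and frequencies are illustrative.

```python
import torch

def absolute_pe(T, dim):
    t = torch.arange(T, dtype=torch.float32)[:, None]               # 0, 1, ..., T-1
    freq = 10000.0 ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    return torch.cat([(t * freq).sin(), (t * freq).cos()], dim=-1)  # (T, dim)

def normalized_pe(T, dim):
    t = torch.linspace(0.0, 1.0, T)[:, None]                        # position / length
    freq = torch.pi * torch.arange(1, dim // 2 + 1, dtype=torch.float32)
    return torch.cat([(t * freq).sin(), (t * freq).cos()], dim=-1)  # (T, dim)

# normalized: same code for the same fraction of the sequence, whatever T is
# absolute:   same code for the same index, whatever T is
print(absolute_pe(32, 8).shape, normalized_pe(32, 8).shape)
```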
You cannot be better off without p-e.
February 28, 2025 at 7:38 AM
Why not a normalized positional encoding?
February 28, 2025 at 6:57 AM
After a long lecture, I recommend a coffee, a pain au chocolat, and leave-me-the-fuck-alone time.
February 28, 2025 at 6:55 AM
Why is it spooky?
February 27, 2025 at 2:17 PM
I asked this because, even though I am interested in the topic, I have so far not come across any "foundational" theory regarding the future of society with AI.
Someone linked this paper which is exactly the sort of thing I was looking for:
arxiv.org/abs/2502.12102
Relational Norms for Human-AI Cooperation (arxiv.org)
February 21, 2025 at 7:52 PM
We can't complain, can we?
February 11, 2025 at 11:10 PM
To do so, you concatenate all the sequences to make a batch of a single sequence, and carve the attention matrix into a block-diagonal one (possibly with causal structure in each block) so that sequences cannot look at each other.
Magic!
3/3
February 6, 2025 at 12:23 AM
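A minimal sketch of the packing trick above, written with an explicit boolean mask and scaled_dot_product_attention so the block-diagonal-causal structure is visible; FlexAttention expresses the same mask as a small predicate and generates a fused kernel for it (lengths and sizes are illustrative).

```python
import torch
import torch.nn.functional as F

lengths = [3, 5, 2]                       # three sequences packed into one, no padding
T = sum(lengths)
seq_id = torch.repeat_interleave(torch.arange(len(lengths)), torch.tensor(lengths))

q_idx = torch.arange(T)[:, None]
kv_idx = torch.arange(T)[None, :]
# allowed iff both positions belong to the same sequence and the key is not in the future
mask = (seq_id[:, None] == seq_id[None, :]) & (kv_idx <= q_idx)    # (T, T) bool

B, H, d = 1, 4, 16
q, k, v = (torch.randn(B, H, T, d) for _ in range(3))
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
print(out.shape)                          # (1, 4, 10, 16)
```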
It does this by generating an optimized CUDA kernel on the fly.
So it's cool for causal masks, but it also allows an amazing trick to deal with batches of sequences of various lengths *without padding*!
2/3
February 6, 2025 at 12:23 AM
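For reference, a hedged sketch of what this looks like with FlexAttention for a plain causal mask (torch.nn.attention.flex_attention, available in recent PyTorch; exact signatures, and the need for torch.compile to actually get the optimized kernel, may vary across versions).

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

def causal(b, h, q_idx, kv_idx):
    # predicate deciding which (query, key) pairs may attend to each other
    return q_idx >= kv_idx

B, H, T, d = 1, 4, 128, 16
q, k, v = (torch.randn(B, H, T, d) for _ in range(3))

# the mask is built once from the predicate; B=None / H=None broadcast over batch and heads
block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=T, KV_LEN=T)
out = flex_attention(q, k, v, block_mask=block_mask)   # usually wrapped in torch.compile
print(out.shape)                                       # (1, 4, 128, 16)
```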
I have to admit I am more on the other platform.
February 5, 2025 at 6:32 PM