Cyrus Rashtchian
cyroid.bsky.social
Researcher at Google. Improving LLM factuality, RAG and multimodal alignment and evaluation. San Diego. he/him ☀️🌱🧗🏻🏐 Prev UCSD, MSR, UW, UIUC.
[6/6] The other idea is to do the weighted combination at an instance level. We look at intermediate layers for *each token* and slightly modify the overall distribution. This leads to consistent accuracy improvements for many models and datasets!

Would love to see some theory on why this works!
December 13, 2024 at 6:43 PM
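A minimal sketch of the per-token idea described above: at each decoding step, mix the intermediate-layer logits into the final distribution before picking the next token. The uniform layer weighting and the `alpha` mixing strength here are placeholders, not the paper's actual estimator.

```python
import math

def softmax(zs):
    # Numerically stable softmax over a list of logits.
    m = max(zs)
    es = [math.exp(z - m) for z in zs]
    s = sum(es)
    return [e / s for e in es]

def decode_step(layer_logits, alpha=0.1):
    """Pick the next token after mixing intermediate-layer logits
    into the final layer's logits for one token position.

    layer_logits: list of per-layer logit lists; last entry is the
                  model's usual output layer.
    alpha: mixing strength (an assumed hyperparameter, not SLED's).
    """
    final = layer_logits[-1]
    n = len(layer_logits) - 1
    # Uniform average over intermediate layers (placeholder weighting).
    intermediate = [sum(layer[v] for layer in layer_logits[:-1]) / n
                    for v in range(len(final))]
    mixed = [(1 - alpha) * f + alpha * i
             for f, i in zip(final, intermediate)]
    probs = softmax(mixed)
    return max(range(len(probs)), key=probs.__getitem__)  # greedy pick
```

With `alpha=0` this reduces to ordinary greedy decoding on the final logits; the per-token modification only kicks in for `alpha > 0`.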
[5/6] Here's a nice example. We want to do some math. Greedy decoding leads to 5 x $10 = $50 for the overtime pay. This is because A x B = C is a common pattern, but we really need A x B x C = D to get the answer. SLED can help with this because the internal layers happen to predict 'x' instead of '='.
December 13, 2024 at 6:43 PM
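The arithmetic behind that example, with an assumed time-and-a-half overtime multiplier (the thread doesn't state the value of C):

```python
# Hypothetical numbers matching the example; the 1.5x overtime
# multiplier is an assumption, not stated in the thread.
hours, rate, multiplier = 5, 10.0, 1.5

wrong = hours * rate               # A x B = C pattern: 5 x $10 = $50
right = hours * rate * multiplier  # A x B x C = D: 5 x $10 x 1.5 = $75
```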
[4/6] Our main decoding trick is to use a weighted combination of *all of the layers*. More precisely, we project every layer into the same output distribution (over vocab tokens). Then we combine the intermediate "logits" with the final output logits based on our estimate of the LLM's internal knowledge.
December 13, 2024 at 6:43 PM
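A toy sketch of that projection-and-combination step, assuming we can read every layer's hidden state and reuse the model's unembedding matrix. All names, the uniform intermediate weighting, and `alpha` are illustrative, not the paper's exact procedure.

```python
def project_to_vocab(hidden, unembed):
    # hidden: length-H hidden-state vector for one layer.
    # unembed: H x V unembedding matrix (rows = hidden dims, cols = vocab).
    vocab = len(unembed[0])
    return [sum(h * unembed[d][v] for d, h in enumerate(hidden))
            for v in range(vocab)]

def sled_like_logits(hidden_states, unembed, alpha=0.1):
    """Project every layer's hidden state into vocab space, then mix
    the intermediate layers' logits with the final layer's logits.

    hidden_states: one hidden vector per layer; last entry is the
                   final layer used for standard decoding.
    alpha: mixing strength (assumed placeholder hyperparameter).
    """
    layer_logits = [project_to_vocab(h, unembed) for h in hidden_states]
    final = layer_logits[-1]
    n = len(layer_logits) - 1
    # Uniform average stands in for the paper's estimated weights.
    intermediate = [sum(l[v] for l in layer_logits[:-1]) / n
                    for v in range(len(final))]
    return [(1 - alpha) * f + alpha * i
            for f, i in zip(final, intermediate)]
```

In a real transformer the same unembedding (the "logit lens" trick) maps each intermediate hidden state to a vocabulary distribution, which is what makes the layers comparable in the first place.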
[3/6] The key observation is that LLMs "know" a lot more than they "tell" -- basically the training process can favor more popular tokens (in the dataset) rather than more accurate predictions for the query at hand.

So we can utilize this during decoding time...
December 13, 2024 at 6:43 PM
[2/6] Joint work with Jianyi Zhang · Da-Cheng Juan · Chun-Sung Ferng · Heinrich Jiang · Yiran Chen

ArXiv paper: arxiv.org/abs/2411.02433
Project page: jayzhang42.github.io/sled_page/
GitHub: github.com/JayZhang42/S...

But how does it work, you ask?
December 13, 2024 at 6:43 PM