Sergey Feldman
@sergeyf.bsky.social
Nope, it was our Israel team.
November 6, 2025 at 5:55 PM
(3) They also studied multiple rounds of the above: iterative self-improvement. Saturation happens after 2 or 3 rounds. I'm surprised it's not 1!

(4) Ensemble Heuristic: simple verification ensemble heuristics can improve performance (see the sketch below).

6/6
December 13, 2024 at 3:35 AM
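To make (3) and (4) concrete, here is a minimal Python sketch of iterative self-improvement with a majority-vote verification ensemble. Everything here is a hypothetical stand-in (the `generate`, `verify`, and `finetune` stubs and the vote count k are my assumptions, not the paper's code):

```python
import random

def generate(model, problem, n=128):
    # Stand-in: sample n candidate responses for a problem.
    return [f"response-{i}" for i in range(n)]

def verify(model, problem, response):
    # Stand-in: one sampled self-verification; True = judged correct.
    return random.random() < 0.6

def finetune(model, data):
    # Stand-in: fine-tune the model on self-verified (problem, response) pairs.
    return model

def ensemble_keep(model, problem, response, k=5):
    # (4) Ensemble heuristic: sample k verifications and take a majority vote.
    votes = sum(verify(model, problem, response) for _ in range(k))
    return votes > k // 2

def self_improve(model, problems, rounds=3):
    # (3) Iterative self-improvement; reportedly saturates after 2-3 rounds.
    for _ in range(rounds):
        kept = [(p, r) for p in problems
                for r in generate(model, p) if ensemble_keep(model, p, r)]
        model = finetune(model, kept)
    return model
```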

(2) CoT Verification is More Stable than MC: "Some MC verification incurs non-positive gap even for medium-sized models such as Qwen-1.5 14/32B, while CoT verification always has a positive gap for medium/large-sized models"

5/n
December 13, 2024 at 3:35 AM
Results
(1) Small Models Cannot Self-Improve. For models such as Qwen-1.5 0.5B, Qwen-2 0.5B, and Llama-2 7B, gap(f) is non-positive for nearly all verification methods, even though the models have non-trivial generation accuracy.

4/n
December 13, 2024 at 3:35 AM
(3) Then they compute the gap: the average accuracy difference between the filtered generations (those judged correct in step 2 by self-verification) and the original 128 responses (toy sketch below).

3/n
December 13, 2024 at 3:35 AM
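As a toy illustration of that gap computation for a single problem (the paper averages over many problems; the correctness and filter masks here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 128
correct = rng.random(n) < 0.55   # is each of the 128 responses actually correct?
keep = rng.random(n) < 0.5       # did the sampled self-verification accept it?

acc_all = correct.mean()                                    # accuracy of the original 128
acc_filtered = correct[keep].mean() if keep.any() else 0.0  # accuracy after filtering
gap = acc_filtered - acc_all     # gap(f) > 0 means self-verification helped
print(f"gap(f) = {gap:+.3f}")
```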
(2) For each of the 128 responses, they sample one verification in one of 3 styles: (a) correct vs. incorrect, (b) CoT + a score from 1 to 10, or (c) "Tournament" style, which you can find in the paper. (Parsing sketch below.)

2/n
December 13, 2024 at 3:35 AM
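For flavor, here is how one might turn a sampled verification into a keep/discard decision for styles (a) and (b). The output formats and the score threshold are my guesses, not the paper's:

```python
import re

def keep_binary(verifier_output: str) -> bool:
    # Style (a): the verifier answers "correct" or "incorrect".
    return "incorrect" not in verifier_output.lower()

def keep_cot_score(verifier_output: str, threshold: int = 6) -> bool:
    # Style (b): CoT reasoning ending in something like "Score: 7";
    # keep the response if the score clears the (assumed) threshold.
    m = re.search(r"score\s*[:=]?\s*(\d+)", verifier_output, re.IGNORECASE)
    return bool(m) and int(m.group(1)) >= threshold

print(keep_binary("The solution is correct."))        # True
print(keep_cot_score("...step by step... Score: 8"))  # True
```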
Thanks!
November 26, 2024 at 2:31 AM
If you know papers or blog posts that address these, I'd be happy to have the links. Thanks!
November 22, 2024 at 6:00 PM
(7) Others found a good recipe for distilling: first fine-tune the biggest model on a small gold dataset, then use that fine-tuned model to make silver data (sketch below). Does that work for IR distillation? If we fine-tune a 405B model before using it as the silver-data source, what should we use as gold? How much do I need?
November 22, 2024 at 6:00 PM
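That recipe as pseudocode-level Python; every class and function here is a hypothetical stand-in, not a real training API:

```python
class Model:
    def label(self, query):
        return 0.0  # stand-in relevance judgment

def finetune(model, data):
    return model    # stand-in fine-tuning step

def distill_via_silver(gold, unlabeled_queries, big_model, small_model):
    teacher = finetune(big_model, gold)                          # 1. tune the teacher on the small gold set
    silver = [(q, teacher.label(q)) for q in unlabeled_queries]  # 2. teacher generates silver labels
    return finetune(small_model, silver)                         # 3. distill the student on silver
```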
(6) You can get better LLM labels if you do all-pairs comparisons on the passage set (citation needed, but I've read a few papers showing this). Obviously much more expensive. Should I spend my fixed compute/money budget on all-pairs O(few_queries * passages^2) or pointwise O(more_queries * passages)? (Back-of-the-envelope below.)
November 22, 2024 at 6:00 PM
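To put rough numbers on that trade-off (all values invented for illustration):

```python
budget = 10_000_000   # total LLM judgment calls we can afford (made up)
passages = 100        # passages per query

pointwise_per_query = passages                        # O(passages)
all_pairs_per_query = passages * (passages - 1) // 2  # O(passages^2) = 4,950 here

print("pointwise queries:", budget // pointwise_per_query)  # 100,000 queries
print("all-pairs queries:", budget // all_pairs_per_query)  # ~2,020 queries
```

Under these invented numbers, all-pairs costs roughly 50x in query coverage, so the question is whether the better per-query labels beat having 50x more queries.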
(5) Does the choice of base model to distill into matter much? Should I distill into roberta-large or some modern 0.5B LM?
November 22, 2024 at 6:00 PM
(4) In our experience at AI2, LLM-generated search queries are weirdly out of distribution and non-human in various ways. Does this matter? Do we have to get human queries?
November 22, 2024 at 6:00 PM
(3) Can we do better than human-labeled data, since we'd have no gaps in the labels and could get more data at will?
November 22, 2024 at 6:00 PM
(2) How do we distill well? Do we use the same loss functions we used when training on gold data from human labelers? (Two candidates sketched below.)
November 22, 2024 at 6:00 PM
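Two candidate distillation losses one might try, sketched in PyTorch. The shapes, the temperature, and the choice of these two particular losses are my assumptions:

```python
import torch
import torch.nn.functional as F

student_scores = torch.randn(8, 100)   # (queries, passages) from the student
teacher_scores = torch.randn(8, 100)   # soft labels from the LLM teacher

# Pointwise: regress directly onto the teacher's scores.
pointwise_loss = F.mse_loss(student_scores, teacher_scores)

# Listwise: match the teacher's per-query ranking distribution via KL,
# which uses the full ranking signal rather than absolute score values.
tau = 1.0
listwise_loss = F.kl_div(
    F.log_softmax(student_scores / tau, dim=-1),
    F.softmax(teacher_scores / tau, dim=-1),
    reduction="batchmean",
)
```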
(1) Say I have 10,000 queries and 100 passages/docs per query, labeled or ranked by the best LLM (with an optimized prompt or fine-tuning). How close can a distilled model get to the LLM's performance? The result would be a plot with the number of distilled-model parameters on the x-axis and NDCG relative to the LLM on the y-axis (setup sketched below).
November 22, 2024 at 6:00 PM
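A skeleton of that evaluation, treating the LLM's labels as ground truth. The data is random and the student sizes and noise levels are placeholders for real distilled models:

```python
import numpy as np
from sklearn.metrics import ndcg_score

rng = np.random.default_rng(0)
llm_labels = rng.random((10_000, 100))  # LLM relevance labels: 100 passages per query

for size, noise in [("30M", 0.5), ("110M", 0.4), ("340M", 0.3), ("0.5B", 0.2)]:
    # Stand-in for real student predictions: teacher labels plus noise.
    student_scores = llm_labels + rng.normal(0, noise, llm_labels.shape)
    print(size, "NDCG vs LLM:", round(ndcg_score(llm_labels, student_scores), 3))
```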