Sergey Feldman
@sergeyf.bsky.social
Nope, it was our Israel team.
November 6, 2025 at 5:55 PM
(3) They also studied multiple rounds of the above: iterative self-improvement. Saturation happens after 2 or 3 rounds. I'm surprised it's not 1!

(4) Ensemble Heuristic: simple verification ensemble heuristics can improve performance (see the sketch below).

6/6
December 13, 2024 at 3:35 AM
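To make (3) and (4) concrete, here is a minimal Python sketch of iterative self-improvement with a majority-vote verification ensemble. Everything here is a hypothetical stand-in (the `generate`, `verify`, and `finetune` stubs and the vote count k are my assumptions, not the paper's code):

```python
import random

def generate(model, problem, n=128):
    # Stand-in: sample n candidate responses for a problem.
    return [f"response-{i}" for i in range(n)]

def verify(model, problem, response):
    # Stand-in: one sampled self-verification; True = judged correct.
    return random.random() < 0.6

def finetune(model, data):
    # Stand-in: fine-tune the model on self-verified (problem, response) pairs.
    return model

def ensemble_keep(model, problem, response, k=5):
    # (4) Ensemble heuristic: sample k verifications and take a majority vote.
    votes = sum(verify(model, problem, response) for _ in range(k))
    return votes > k // 2

def self_improve(model, problems, rounds=3):
    # (3) Iterative self-improvement; reportedly saturates after 2-3 rounds.
    for _ in range(rounds):
        kept = [(p, r) for p in problems
                for r in generate(model, p) if ensemble_keep(model, p, r)]
        model = finetune(model, kept)
    return model
```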

(2) CoT Verification is More Stable than MC: "Some MC verification incurs non-positive gap even for medium-sized models such as Qwen-1.5 14/32B, while CoT verification always has a positive gap for medium/large-sized models"

5/n
December 13, 2024 at 3:35 AM
Results
(1) Small Models Cannot Self-Improve. For models such as Qwen-1.5 0.5B, Qwen-2 0.5B, and Llama-2 7B, gap(f) is non-positive for nearly all verification methods, even though the models have non-trivial generation accuracy.

4/n
December 13, 2024 at 3:35 AM
(3) Then they compute the gap: the average accuracy difference between the filtered generations (those judged correct in step 2 by self-verification) and the original 128 responses (toy sketch below).

3/n
December 13, 2024 at 3:35 AM
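As a toy illustration of that gap computation for a single problem (the paper averages over many problems; the correctness and filter masks here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 128
correct = rng.random(n) < 0.55   # is each of the 128 responses actually correct?
keep = rng.random(n) < 0.5       # did the sampled self-verification accept it?

acc_all = correct.mean()                                    # accuracy of the original 128
acc_filtered = correct[keep].mean() if keep.any() else 0.0  # accuracy after filtering
gap = acc_filtered - acc_all     # gap(f) > 0 means self-verification helped
print(f"gap(f) = {gap:+.3f}")
```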
(2) For each of the 128 responses, they sample one verification in one of 3 styles: (a) correct vs. incorrect, (b) CoT + a score from 1 to 10, or (c) "Tournament" style, which you can find in the paper. (Parsing sketch below.)

2/n
December 13, 2024 at 3:35 AM
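For flavor, here is how one might turn a sampled verification into a keep/discard decision for styles (a) and (b). The output formats and the score threshold are my guesses, not the paper's:

```python
import re

def keep_binary(verifier_output: str) -> bool:
    # Style (a): the verifier answers "correct" or "incorrect".
    return "incorrect" not in verifier_output.lower()

def keep_cot_score(verifier_output: str, threshold: int = 6) -> bool:
    # Style (b): CoT reasoning ending in something like "Score: 7";
    # keep the response if the score clears the (assumed) threshold.
    m = re.search(r"score\s*[:=]?\s*(\d+)", verifier_output, re.IGNORECASE)
    return bool(m) and int(m.group(1)) >= threshold

print(keep_binary("The solution is correct."))        # True
print(keep_cot_score("...step by step... Score: 8"))  # True
```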
Thanks!
November 26, 2024 at 2:31 AM
If you know papers or blog posts that address these, I'd be happy to have the links. Thanks!
November 22, 2024 at 6:00 PM
(7) Others found a good recipe for distilling: first fine-tune the biggest model on a small gold dataset, then use that fine-tuned model to make silver data (sketch below). Does that work for IR distillation? If we fine-tune a 405B model before using it as the silver-data source, what should we use as gold? How much do I need?
November 22, 2024 at 6:00 PM
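That recipe as pseudocode-level Python; every class and function here is a hypothetical stand-in, not a real training API:

```python
class Model:
    def label(self, query):
        return 0.0  # stand-in relevance judgment

def finetune(model, data):
    return model    # stand-in fine-tuning step

def distill_via_silver(gold, unlabeled_queries, big_model, small_model):
    teacher = finetune(big_model, gold)                          # 1. tune the teacher on the small gold set
    silver = [(q, teacher.label(q)) for q in unlabeled_queries]  # 2. teacher generates silver labels
    return finetune(small_model, silver)                         # 3. distill the student on silver
```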
(6) You can get better LLM labels if you do all-pairs comparisons on the passage set (citation needed, but I've read a few papers showing this). Obviously much more expensive. Should I spend my fixed compute/money budget on all-pairs O(few_queries * passages^2) or pointwise O(more_queries * passages)? (Back-of-the-envelope below.)
November 22, 2024 at 6:00 PM
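To put rough numbers on that trade-off (all values invented for illustration):

```python
budget = 10_000_000   # total LLM judgment calls we can afford (made up)
passages = 100        # passages per query

pointwise_per_query = passages                        # O(passages)
all_pairs_per_query = passages * (passages - 1) // 2  # O(passages^2) = 4,950 here

print("pointwise queries:", budget // pointwise_per_query)  # 100,000 queries
print("all-pairs queries:", budget // all_pairs_per_query)  # ~2,020 queries
```

Under these invented numbers, all-pairs costs roughly 50x in query coverage, so the question is whether the better per-query labels beat having 50x more queries.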
(5) Does the choice of base model to distill into matter much? Should I distill into roberta-large or some modern 0.5B LM?
November 22, 2024 at 6:00 PM
(4) In our experience at AI2, LLM-generated search queries are weirdly out of distribution and non-human in various ways. Does this matter? Do we have to get human queries?
November 22, 2024 at 6:00 PM
(3) Can we do better than human-labeled data, since we'd have no gaps in the labels and could get more data at will?
November 22, 2024 at 6:00 PM
(2) How do we distill well? Do we use the same loss functions we used when training on gold data from human labelers? (Two candidates sketched below.)
November 22, 2024 at 6:00 PM
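Two candidate distillation losses one might try, sketched in PyTorch. The shapes, the temperature, and the choice of these two particular losses are my assumptions:

```python
import torch
import torch.nn.functional as F

student_scores = torch.randn(8, 100)   # (queries, passages) from the student
teacher_scores = torch.randn(8, 100)   # soft labels from the LLM teacher

# Pointwise: regress directly onto the teacher's scores.
pointwise_loss = F.mse_loss(student_scores, teacher_scores)

# Listwise: match the teacher's per-query ranking distribution via KL,
# which uses the full ranking signal rather than absolute score values.
tau = 1.0
listwise_loss = F.kl_div(
    F.log_softmax(student_scores / tau, dim=-1),
    F.softmax(teacher_scores / tau, dim=-1),
    reduction="batchmean",
)
```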
(1) Say I have 10,000 queries and 100 passages/docs per query, labeled or ranked by the best LLM (with an optimized prompt or fine-tuning). How close can a distilled model get to the LLM's performance? The result would be a plot with the number of distilled-model parameters on the x-axis and NDCG relative to the LLM on the y-axis (setup sketched below).
November 22, 2024 at 6:00 PM
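A skeleton of that evaluation, treating the LLM's labels as ground truth. The data is random and the student sizes and noise levels are placeholders for real distilled models:

```python
import numpy as np
from sklearn.metrics import ndcg_score

rng = np.random.default_rng(0)
llm_labels = rng.random((10_000, 100))  # LLM relevance labels: 100 passages per query

for size, noise in [("30M", 0.5), ("110M", 0.4), ("340M", 0.3), ("0.5B", 0.2)]:
    # Stand-in for real student predictions: teacher labels plus noise.
    student_scores = llm_labels + rng.normal(0, noise, llm_labels.shape)
    print(size, "NDCG vs LLM:", round(ndcg_score(llm_labels, student_scores), 3))
```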