Amanda Bertsch
@abertsch.bsky.social
PhD student @ CMU LTI. working on text generation + long context

https://www.cs.cmu.edu/~abertsch/
ooh, interesting! would the best xLSTM model to try be the xLSTM Large 7B?
November 10, 2025 at 3:57 PM
Thank you so much!
November 8, 2025 at 12:10 PM
We’re excited about Oolong as a challenging benchmark for information aggregation! Let us know which models we should benchmark next 👀

Paper: arxiv.org/abs/2511.02817
Dataset: huggingface.co/oolongbench
Code: github.com/abertsch72/o...
Leaderboard: oolongbench.github.io
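If you want to poke at the data yourself, here's a minimal sketch of loading it with the `datasets` library; the exact repo id and split name below are assumptions, so check the Hugging Face org page for the released names.

```python
# Illustrative sketch only: the repo id "oolongbench/oolong-synth" and the split
# name are assumptions -- see huggingface.co/oolongbench for the released paths.
from datasets import load_dataset

ds = load_dataset("oolongbench/oolong-synth", split="test")

# Peek at the first few examples to see the field layout
for example in ds.select(range(3)):
    print(example.keys())
```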
Oolong: Evaluating Long Context Reasoning and Aggregation Capabilities
As model context lengths continue to grow, concerns about whether models effectively use the full context length have persisted. While several carefully designed long-context evaluations have recently...
arxiv.org
November 7, 2025 at 5:07 PM
While long-context models can do many retrieval tasks impressively well, they have a long way to go to solve realistic information synthesis problems!

Oolong is joint work with Adithya Pratapa, Teruko Mitamura, @gneubig.bsky.social, and Matt Gormley.
November 7, 2025 at 5:07 PM
Models show varying error patterns. Claude and some GPT-family models underperform on tasks that require outputting dates; Gemini and DeepSeek-R1 frequently over-reason and fail to return an answer at all on Oolong-synth, although Gemini is the best model on Oolong-real.
November 7, 2025 at 5:07 PM
Why is this so hard? Models must identify relevant sections of input, label or categorize these sections, and then accumulate information to make distributional-level decisions. Adding labels in-context or specifying more reasoning effort has limited benefit.
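As a toy illustration of that identify → label → aggregate loop (not Oolong's actual evaluation code), here is the shape of a counting-style question, with a placeholder classify_chunk standing in for the model call:

```python
from collections import Counter

def classify_chunk(chunk: str) -> str:
    """Placeholder for a model call that assigns a label to one section of the input."""
    return "positive" if "great" in chunk.lower() else "negative"

def answer_distributional_question(long_input: str) -> str:
    # 1. Identify relevant sections of the input (here: a naive paragraph split)
    chunks = [c for c in long_input.split("\n\n") if c.strip()]
    # 2. Label or categorize each section
    labels = [classify_chunk(c) for c in chunks]
    # 3. Accumulate the labels to make a distribution-level decision
    counts = Counter(labels)
    label, _ = counts.most_common(1)[0]
    return f"most common label: {label} ({dict(counts)})"

print(answer_distributional_question("This was great.\n\nTerrible service.\n\nGreat value."))
```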
November 7, 2025 at 5:07 PM
Oolong has a synthetic setting that poses distributional questions over sets of classification examples and their metadata, and a realistic setting using conversational data from game transcripts. Both splits require counting, temporal reasoning, and multi-step entity resolution.
November 7, 2025 at 5:07 PM
We'll be posting course content for anyone who would like to follow along!

The first four lecture videos are available now: youtube.com/playlist?lis...
September 12, 2025 at 5:14 PM
we also have a follow-up work, and @emilyxiao.bsky.social will be around the conference to discuss! bsky.app/profile/emil...
Many-shot ICL (thousands of examples or more) can match fine-tuning on many tasks, but its high inference cost makes deployment impractical.

We introduce DBSA, a training-free framework that achieves the best efficiency even under high request volumes, while maintaining strong accuracy 🧵
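For reference, "many-shot ICL" here just means packing a very large set of labeled demonstrations into the prompt for every request; a minimal sketch of that baseline is below (the demonstration format and sizes are made up, and this is not DBSA itself):

```python
def build_many_shot_prompt(demos: list[tuple[str, str]], query: str) -> str:
    """Concatenate many (input, label) demonstrations ahead of the test query."""
    shots = "\n\n".join(f"Input: {x}\nLabel: {y}" for x, y in demos)
    return f"{shots}\n\nInput: {query}\nLabel:"

# With long-context models, the demonstration set can reach the thousands, so every
# request re-pays the cost of encoding it -- the inference-cost problem DBSA targets.
prompt = build_many_shot_prompt(
    [("the movie was fun", "positive"), ("waste of time", "negative")] * 1000,
    "surprisingly good",
)
print(len(prompt))
```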
April 30, 2025 at 12:06 AM
our paper (arxiv.org/abs/2405.00200) studies properties + tradeoffs of using long-context models for ICL, and we're very excited that it won the Language Modeling SAC award this year!
In-Context Learning with Long-Context Models: An In-Depth Exploration
As model context lengths continue to increase, the number of demonstrations that can be provided in-context approaches the size of entire training datasets. We study the behavior of in-context learnin...
arxiv.org
April 30, 2025 at 12:05 AM
I think @siree.sh was also looking at this! No marker of arxiv category in the url, unfortunately :/
November 25, 2024 at 2:18 PM
and just realized this post is a full two weeks old but! bsky showed it to me now 🥲
November 25, 2024 at 7:17 AM