Lightnews — Scholar-powered news

Ian Berlot-Attwell

@ianberlot.bsky.social

8 followers 2 following 10 posts

ML/NLP PhD Student. Interested in compositional generalization!
https://www.cs.toronto.edu/~ianberlot/


Ian Berlot-Attwell

@ianberlot.bsky.social

Running an ablation on a subset of miniF2F, we find that a model ablated to prevent the sharing of lemmas across tasks also exhibits strong performance.

December 11, 2024 at 3:55 PM

Ian Berlot-Attwell

@ianberlot.bsky.social

Studying the LEGO-Prover (a system for formalizing natural language proofs by learning reusable lemmas), we find that lemma reuse is very uncommon, and no lemma is reused twice.

Table counting occurrences of lemma reuse. No lemma is reused such that it is reproduced exactly (or even has its name appear in the solution) more than once. Only one lemma is reused verbatim out of the >400 proofs found.

December 11, 2024 at 3:55 PM

Ian Berlot-Attwell

@ianberlot.bsky.social

Studying TroVE (a system that learns reusable python functions), we find only 3 instances of a learned function being reused correctly, out of 3,201 test questions in the MATH dataset. Furthermore, our libraryless ablation outperforms the original on 3 of 4 MATH splits tested.

Table of TroVE performance on MATH for the ablation and the baseline. The models were tested on four MATH splits. On three of these splits the ablation has stronger performance, in two of these cases at a statistically significant level.

December 11, 2024 at 3:55 PM

Ian Berlot-Attwell

@ianberlot.bsky.social

LLM-powered library learning systems achieve SoTA performance on several tasks, but is this driven by the reuse of learned tools? We study two library learning systems for mathematics and find that the reuse of learned tools is extremely infrequent and can harm performance 🧵

Library Learning Doesn't: The Curious Case of the Single-Use Library.
Ian Berlot-Attwell, Frank Rudzicz, Xujie Si

December 11, 2024 at 3:55 PM

Ian Berlot-Attwell

@ianberlot.bsky.social

Same findings hold on a neurosymbolic NMN model, even though these models are specifically designed to be compositional!

Systematicity gap on the complex splits (top corner) and minimal splits (bottom corner) for all models trained on 560k training examples. The systematicity gap is averaged according to the attribute types of the HOPs: all 29 HOPs for LXMERT, HOPs 0-5 for Tensor-NMN. Attributes are sorted by increasing diversity on the axes (e.g., SHAPE has 2 possible values, COLOR has 8 possible values). As expected, we see a worse systematicity gap (i.e., lighter colors) in the top left (low-diversity combinations), and a better systematicity gap (i.e., darker colors) in the bottom right (high-diversity combinations).

November 15, 2023 at 11:05 PM

Ian Berlot-Attwell

@ianberlot.bsky.social

We stratify value pairs (e.g., blue + sphere) by attribute diversity, i.e., the number of possible train-time alternative values for each attribute. Low diversity combinations have a larger systematicity gap (difference in accuracy between seen and unseen combinations)!

Systematicity gap (difference between OOD and IID model accuracy), averaged by held-out pair (HOP) diversity over 29 HOPs, each with 3 runs. Subplot a) is for complex questions, Subplot b) is for minimal questions.

November 15, 2023 at 11:05 PM
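
The averaging described in the post above can be sketched in a few lines of Python. This is only an illustration, not the authors' evaluation code: the function names and the input layout (one `(diversity, ood_accuracy, iid_accuracy)` tuple per run) are hypothetical, and the sign convention (OOD accuracy minus IID accuracy, so a more negative value means a worse gap) is an assumption based on the figure description.

```python
from statistics import mean


def systematicity_gap(ood_accuracy, iid_accuracy):
    # Assumed sign convention: OOD minus IID, so negative values
    # mean the model does worse on unseen combinations.
    return ood_accuracy - iid_accuracy


def mean_gap_by_diversity(runs):
    """Average the gap over all runs that share a diversity level.

    `runs` is a hypothetical layout: an iterable of
    (diversity, ood_accuracy, iid_accuracy) tuples, one per run.
    """
    grouped = {}
    for diversity, ood_acc, iid_acc in runs:
        gap = systematicity_gap(ood_acc, iid_acc)
        grouped.setdefault(diversity, []).append(gap)
    # Sort by diversity so the result reads from low to high diversity.
    return {d: mean(gaps) for d, gaps in sorted(grouped.items())}


if __name__ == "__main__":
    # Made-up numbers for illustration only; these are not results from the paper.
    example_runs = [
        (2, 0.50, 0.75),
        (2, 0.25, 0.75),
        (8, 0.75, 0.75),
    ]
    print(mean_gap_by_diversity(example_runs))
```

Grouping by diversity before averaging mirrors the stratification the post describes: each held-out pair contributes its runs to the bucket for its diversity level.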

Ian Berlot-Attwell

@ianberlot.bsky.social

For 29 different pairs of held-out object attributes (e.g., rubber cylinders), we create separate train and test splits in a modified CLEVR setting. Combinations of certain values for this attribute pair will be present at test time, but not train.

Example image-question pairs for the sub-dataset of CLEVR-HOPE corresponding to rubber cylinder. The test sets are in gray; rubber cylinder is omitted visually and textually in the train split and the IID test splits; rubber cylinder only occurs in the OOD splits; occurrences are emphasized in this figure. The train and complex sets are of comparable visual and textual complexity to CLEVR. The minimal sets consist only of existence questions, checking whether a single object matches a given pair of attribute values.

November 15, 2023 at 11:04 PM

Ian Berlot-Attwell

@ianberlot.bsky.social

Will multimodal models systematically generalize if trained on enough data? In a controlled VQA setting, we find it’s not data quantity, but data DIVERSITY that matters! 🧵

Joint w/ @ab-carrell.bsky.social @kumarkagrawal.bsky.social Yash Sharma @nsaphra.bsky.social
www.cs.toronto.edu/~ianberlot/d...

Authors of the paper "Attribute Diversity Determines the Systematicity Gap in VQA"

November 15, 2023 at 11:02 PM
