Mete
@mismayil.bsky.social
PhD in AI @EPFL
🙏 Huge thanks to my collaborators Antonio Laverghetta, Simone Luchini, Reet Patel, @abosselut.bsky.social Lonneke van der Plas and @roger-beaty.bsky.social
for making this possible! We also publicly release the code, models, and data!
www.mete.is/creative-pre...
Creative Preference Optimization
Novel preference optimization method and a multi-task creativity dataset promoting LLM output creativity
www.mete.is
September 22, 2025 at 1:43 PM
📊 Results
We fine-tuned small LLMs like Llama-3.1-8B-Instruct on MuCE using CrPO and achieved significant improvements in output creativity across all dimensions while maintaining high output quality.
CrPO models beat SFT, vanilla DPO, and larger models such as GPT-4o 🔥
📃Multi-task Creativity Evaluation (MuCE)
To apply CrPO, we also collect a large-scale preference dataset of more than 200K human responses and ratings across more than 30 creativity assessments, and use a subset of it to train and evaluate our models.
🧠 How do we compute creativity scores?
Instead of treating creativity as a single concept, we break it down into its major dimensions and employ a metric for each that provides a measurable signal aligned with key cognitive theories and enables practical optimization within LLMs.
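As a rough illustration of per-dimension signals (these toy proxies are my assumptions, not the paper's actual metrics), novelty can be proxied as lexical distance from reference responses and diversity as the distinct-bigram ratio within a response:

```python
# Toy proxies for two creativity dimensions (illustrative only, not the
# paper's implementation): novelty as lexical distance from references,
# diversity as the distinct-2 ratio within a single response.

def novelty(response, references):
    """1 minus the max Jaccard token overlap with any reference response."""
    r = set(response.lower().split())
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0
    return 1.0 - max(
        (jaccard(r, set(ref.lower().split())) for ref in references),
        default=0.0,
    )

def diversity(response):
    """Fraction of distinct bigrams among all bigrams (distinct-2)."""
    tokens = response.lower().split()
    grams = list(zip(tokens, tokens[1:]))
    return len(set(grams)) / len(grams) if grams else 0.0

refs = ["a brick can build a wall", "use a brick to build a house"]
print(novelty("a brick as a tiny bookshelf for ants", refs))
print(diversity("a brick as a tiny bookshelf for ants"))
```

A real pipeline would use embedding-based or model-based scorers, but the shape is the same: each dimension yields a scalar signal per response that can later be combined.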
🔧 How does it work?
CrPO = Direct Preference Optimization (DPO) × a weighted mix of creativity scores (novelty, surprise, diversity, quality).
This modular objective enables us to optimize LLMs for different dimensions of creativity tailored to a given domain.
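A minimal sketch of how such a modular objective could be assembled (the weights, helper names, and data layout below are assumptions for illustration, not the paper's implementation): aggregate per-dimension scores with tunable weights, use the aggregate to pick chosen/rejected pairs, and feed those pairs into the standard DPO loss.

```python
import math

# Illustrative dimension weights (assumed); tuning these tailors the
# objective to a given domain, which is what makes the mix modular.
WEIGHTS = {"novelty": 0.3, "surprise": 0.3, "diversity": 0.2, "quality": 0.2}

def creativity_score(scores):
    """Weighted mix of per-dimension scores, each assumed to be in [0, 1]."""
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)

def build_pair(responses):
    """Rank candidates by aggregate score; best is 'chosen', worst 'rejected'."""
    ranked = sorted(responses, key=lambda r: creativity_score(r["scores"]),
                    reverse=True)
    return ranked[0], ranked[-1]

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (policy margin - ref margin))."""
    margin = (logp_chosen - logp_rejected) - (ref_logp_chosen - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

candidates = [
    {"text": "a plain answer",
     "scores": {"novelty": 0.2, "surprise": 0.1, "diversity": 0.5, "quality": 0.9}},
    {"text": "an inventive answer",
     "scores": {"novelty": 0.9, "surprise": 0.8, "diversity": 0.7, "quality": 0.8}},
]
chosen, rejected = build_pair(candidates)
print(chosen["text"])
```

The design point is that preference construction (the creativity mix) is decoupled from the optimizer (DPO), so swapping weights changes what "better" means without touching the training loop.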
Check out our paper [https://arxiv.org/abs/2410.12656] for more details. Huge shoutouts to my amazing collaborators
@defnecirci.bsky.social, @jonnesaleva.bsky.social, Hale Sirin, Abdullatif Koksal, Bhuwan Dhingra, @abosselut.bsky.social, Duygu Ataman, Lonneke van der Plas.
February 20, 2025 at 5:28 PM
Our further analysis shows that tokenization is unlikely to be the issue, adding context does not help, models are sensitive to the order of morphemes in the prompt, and removing language-specific shortcuts from the data lowers their performance even further.
Our analysis shows that model performance is negatively correlated with the morphological complexity of words (i.e. the number of morphemes), while human performance is not systematically affected (results shown for GPT-4 below).
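The kind of correlation analysis described here can be sketched with a plain Pearson correlation over (morpheme count, accuracy) pairs. The numbers below are made up for illustration; they are not the paper's data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-bin accuracies: model accuracy drops as morpheme count
# grows, human accuracy stays roughly flat (illustrative numbers only).
morphemes = [1, 2, 3, 4, 5]
model_acc = [0.95, 0.85, 0.70, 0.55, 0.40]
human_acc = [0.97, 0.98, 0.96, 0.98, 0.97]
print(pearson(morphemes, model_acc))  # strongly negative
print(pearson(morphemes, human_acc))  # near zero
```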
We find that all models struggle to compose new words and fail to consistently recognize the validity of all compositions, especially when applied to novel (i.e. out-of-distribution) word roots. Humans, on the other hand, ace both tasks and easily generalize to novel words.
...and evaluate several multilingual LLMs (GPT-4, Gemini-1.5, Aya-23, Qwen2.5) on these tasks in two typologically similar (both agglutinative) but genetically unrelated languages: Turkish and Finnish. We also evaluate native human speakers of these languages.
We design two novel compositional probing tasks to measure morphological productivity (i.e. ability to produce novel well-formed combinations of morphemes) and systematicity (i.e. ability to systematically understand novel combinations)...
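To make the two probe types concrete, here is a hypothetical prompt-template sketch (the paper's actual task format and wording may differ; the Turkish example is a standard one, not taken from the dataset):

```python
# Hypothetical prompt templates for the two probes (illustrative only).

def productivity_prompt(root, suffixes):
    """Productivity probe: ask the model to compose a word from parts."""
    return (f"Combine the root '{root}' with the suffixes "
            f"{' + '.join(suffixes)} into a single well-formed word.")

def systematicity_prompt(word, root, suffixes):
    """Systematicity probe: ask the model to judge a composed word."""
    return (f"Is '{word}' a valid word formed from the root '{root}' "
            f"and the suffixes {' + '.join(suffixes)}? Answer yes or no.")

# Turkish: ev (house) + -ler (plural) + -de (locative) -> evlerde ("in the houses")
print(productivity_prompt("ev", ["-ler", "-de"]))
print(systematicity_prompt("evlerde", "ev", ["-ler", "-de"]))
```

Swapping in nonce roots in place of real ones is what turns these templates into an out-of-distribution generalization test.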