Every discovery project had a beautiful aha moment, such as the structure of antibiotics emerging in the latent space of a model or a GFlowNet proposing new carbon capture materials.
Here are some of the threads I've written on this topic.
Goals are mapped to programs, which are embedded in a latent space.
A fitness metric is assigned to the programs, and program search is used to synthesise new human-like goals.
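Roughly, the loop I have in mind looks like this (a sketch only; embed, fitness and mutate are hypothetical user-supplied helpers, not from any particular library):

import math
import random

def synthesise_goals(seed_programs, embed, fitness, mutate, steps=100, novelty=0.1):
    # Goals are represented as small programs. embed() maps a program to a point
    # in the latent space, fitness() scores how human-like the goal is, and
    # mutate() proposes a nearby program.
    def latent_dist(a, b):
        return math.dist(embed(a), embed(b))

    population = list(seed_programs)
    for _ in range(steps):
        # tournament selection: keep pressure towards fitter goal-programs
        parent = max(random.sample(population, k=min(3, len(population))), key=fitness)
        child = mutate(parent)
        # only accept children that land in a novel region of the latent space
        if all(latent_dist(child, p) > novelty for p in population):
            population.append(child)
    return sorted(population, key=fitness, reverse=True)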
YAML parsing in python is weird.
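One example of the weirdness, assuming the usual PyYAML yaml module (which implements YAML 1.1):

import yaml

print(yaml.safe_load("country: no"))    # {'country': False}  -- the "Norway problem"
print(yaml.safe_load("version: 1.10"))  # {'version': 1.1}    -- the trailing zero is lost
print(yaml.safe_load("country: 'no'"))  # {'country': 'no'}   -- quoting keeps the string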
I think focusing on the raw number of parameters is a less useful frame than thinking about inference speed, cost, and location of inference (on-device vs cloud).
It's great to see the energy that got unleashed in the community after the release of open models that generate chains of thought!
github.com/open-thought...
Great work by @stemartiniani.bsky.social and team on curating the most diverse materials database in the world!
Models generalise to slightly harder versions of a problem, and the correct answers are used to bootstrap the next model and the next one and so on.
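A rough sketch of that bootstrapping loop (generate, is_correct and finetune are hypothetical stand-ins for the real inference and training stack):

def bootstrap(model, problems_by_difficulty, generate, is_correct, finetune):
    # Each round: attempt slightly harder problems, keep only the verified
    # solutions, and train the next model on them.
    for problems in problems_by_difficulty:          # ordered easy -> hard
        solved = []
        for problem in problems:
            solution = generate(model, problem)
            if is_correct(problem, solution):
                solved.append((problem, solution))
        model = finetune(model, solved)              # the next model in the chain
    return model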
Join us every second Thursday of the month to explore AI-driven materials discovery.
📅 Feb 13 | 6PM Paris | 9AM LA
📍 Join https://meet.google.com/mwy-uydd-kvf
Dive into LeMaterial & shape the future!
👉 Check the comments to join the LeMaterial Slack
Curiosity then leads you down a maze of existing answers and new questions.
Eventually, you get to one that has no answer and then you start pushing at the frontier.
Adds more weight to the hypothesis that correct reasoning chains and SFT can lead to strong reasoning performance.
GPT-4 was at Level 1, conversational AI: a model competent at 0.1-1 second tasks, like holding a conversation.
o1 / R1 reached Level 2, reasoners: models that solve 1-10 minute tasks, such as basic coding and math.
We now just need to amplify each other and keep going.
What I can say is that this recent wave (from Sept 2024 to now) has absolutely been the most successful.
I think we have the nucleus. Now, we persist. Don't give up and keep contributing.
We have to play the long game.
response = response.replace("</think>", "Wait")
This is a simple small-scale replication of inference-time scaling
It was cheap: 16xH100 for 26 minutes (so what, ~$6?)
It replicates inference-time scaling using SFT only (no RL)
Extremely data frugal: 1000 samples
arxiv.org/abs/2501.19393
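For the curious, the trick looks roughly like this (a sketch; generate_until is a hypothetical helper that returns the text continued by the model, stopping at the given tag):

def budget_forced(prompt, generate_until, extra_rounds=2):
    # Let the model think, then keep swapping its end-of-thinking tag for "Wait"
    # so it reasons for longer, up to a fixed budget.
    text = generate_until(prompt, stop="</think>")
    for _ in range(extra_rounds):
        text = text.replace("</think>", "Wait", 1)
        text = generate_until(text, stop="</think>")
    # once the thinking budget is spent, let the model write the final answer
    return generate_until(text, stop=None)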
It allows people to compare the code generated by LMs based on runs inside a sandbox.
Seems now like a tool that Just Works, reducing the complexity of the Python ecosystem.
Installed cuda+torch+git packages and it all felt basically instant.
How would we achieve this?
It may require many individuals and groups to do RL on their own models, using their own verifiers.
This may look like grading exams - not of students, but of ML models (a toy sketch of such a grader is below).
But just to look at the other pan of the scales: we could plausibly justify outsourcing CMS and email. But if we fully outsource reasoning ... that's it, game over, everyone can go home.
So it *should* be easier to get faculty to care about this.
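Here is that toy example: a verifier that checks a model's final numeric answer against a known solution and turns it into a reward an RL loop could consume (everything here is hypothetical and deliberately simple):

import re

def grade_answer(model_output: str, reference: float) -> float:
    # Pull the last number out of the model's response and compare it to the
    # known solution; return a binary reward.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return 0.0
    return 1.0 if abs(float(numbers[-1]) - reference) < 1e-6 else 0.0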
One of the most powerful parts was We Pray.
The video and music match so well, hit hard, and resonate strongly with the times.
Starting to think
gibberish gibberish gibberish
Focus again. Calm up.
🤣
It's fun to see these aha moments and it'd be interesting to understand whether their presence helps.
Could not reproduce with the API tho.
Nice to see the technical details and MIT license for something that looks at o1 level 🥳