Every discovery project had a beautiful aha moment, such as the structure of antibiotics emerging in the latent space of a model or a GFlowNet proposing new carbon capture materials.
Here are some of the threads I've written on this topic.
Goals are mapped to programs, which are embedded in a latent space.
A fitness metric is assigned to the programs, and program search is used to synthesise new human-like goals.
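Roughly, the loop I have in mind looks like this (a sketch only; embed, fitness and mutate are hypothetical user-supplied helpers, not from any particular library):

import math
import random

def synthesise_goals(seed_programs, embed, fitness, mutate, steps=100, novelty=0.1):
    # Goals are represented as small programs. embed() maps a program to a point
    # in the latent space, fitness() scores how human-like the goal is, and
    # mutate() proposes a nearby program.
    def latent_dist(a, b):
        return math.dist(embed(a), embed(b))

    population = list(seed_programs)
    for _ in range(steps):
        # tournament selection: keep pressure towards fitter goal-programs
        parent = max(random.sample(population, k=min(3, len(population))), key=fitness)
        child = mutate(parent)
        # only accept children that land in a novel region of the latent space
        if all(latent_dist(child, p) > novelty for p in population):
            population.append(child)
    return sorted(population, key=fitness, reverse=True)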
YAML parsing in python is weird.
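One example of the weirdness, assuming the usual PyYAML yaml module (which implements YAML 1.1):

import yaml

print(yaml.safe_load("country: no"))    # {'country': False}  -- the "Norway problem"
print(yaml.safe_load("version: 1.10"))  # {'version': 1.1}    -- the trailing zero is lost
print(yaml.safe_load("country: 'no'"))  # {'country': 'no'}   -- quoting keeps the string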
I think focusing on the raw number of parameters is a less useful frame than thinking about inference speed, cost, and location of inference (on-device vs cloud).
It's great to see the energy that got unleashed in the community after the release of open models that generate chains of thought!
github.com/open-thought...
Great work by @stemartiniani.bsky.social and team on curating the most diverse materials database in the world!
Models generalise to slightly harder versions of a problem, and the correct answers are used to bootstrap the next model and the next one and so on.
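A rough sketch of that bootstrapping loop (generate, is_correct and finetune are hypothetical stand-ins for the real inference and training stack):

def bootstrap(model, problems_by_difficulty, generate, is_correct, finetune):
    # Each round: attempt slightly harder problems, keep only the verified
    # solutions, and train the next model on them.
    for problems in problems_by_difficulty:          # ordered easy -> hard
        solved = []
        for problem in problems:
            solution = generate(model, problem)
            if is_correct(problem, solution):
                solved.append((problem, solution))
        model = finetune(model, solved)              # the next model in the chain
    return model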
Join us every second Thursday of the month to explore AI-driven materials discovery.
📅 Feb 13 | 6PM Paris | 9AM LA
📍 Join https://meet.google.com/mwy-uydd-kvf
Dive into LeMaterial & shape the future!
👉 Check the comments to join the LeMaterial Slack
Curiosity then leads you down a maze of existing answers and new questions.
Eventually, you get to one that has no answer and then you start pushing at the frontier.
Adds more weight to the hypothesis that correct reasoning chains and SFT can lead to strong reasoning performance.
GPT-4 was at Level 1, conversational AI: a model competent at 0.1-1 second tasks, like holding a conversation.
o1 / R1 reached Level 2, reasoners: models that solve 1-10 minute tasks, such as basic coding and math.
We now just need to amplify each other and keep going.
What I can say is that this recent wave (from Sept 2024 to now) has absolutely been the most successful.
I think we have the nucleus. Now, we persist. Don't give up and keep contributing.
We have to play the long game.
response = response.replace("</think>", "Wait")
This is a simple small-scale replication of inference-time scaling
It was cheap: 16xH100 for 26 minutes (so what, ~$6?)
It replicates inference-time scaling using SFT only (no RL)
Extremely data frugal: 1000 samples
arxiv.org/abs/2501.19393
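For the curious, the trick looks roughly like this (a sketch; generate_until is a hypothetical helper that returns the text continued by the model, stopping at the given tag):

def budget_forced(prompt, generate_until, extra_rounds=2):
    # Let the model think, then keep swapping its end-of-thinking tag for "Wait"
    # so it reasons for longer, up to a fixed budget.
    text = generate_until(prompt, stop="</think>")
    for _ in range(extra_rounds):
        text = text.replace("</think>", "Wait", 1)
        text = generate_until(text, stop="</think>")
    # once the thinking budget is spent, let the model write the final answer
    return generate_until(text, stop=None)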
It allows people to compare the code generated by LMs based on runs inside a sandbox.
Seems now like a tool that Just Works, reducing the complexity of the Python ecosystem.
Installed cuda+torch+git packages and it all felt basically instant.
How would we achieve this?
It may require many individuals and groups to do RL on their own models, using their own verifiers.
This may look like grading exams - not of students, but of ML models (a toy sketch of such a grader is below).
But just to look at the other pan of the scales: we could plausibly justify outsourcing CMS and email. But if we fully outsource reasoning ... that's it, game over, everyone can go home.
So it *should* be easier to get faculty to care about this.
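Here is that toy example: a verifier that checks a model's final numeric answer against a known solution and turns it into a reward an RL loop could consume (everything here is hypothetical and deliberately simple):

import re

def grade_answer(model_output: str, reference: float) -> float:
    # Pull the last number out of the model's response and compare it to the
    # known solution; return a binary reward.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return 0.0
    return 1.0 if abs(float(numbers[-1]) - reference) < 1e-6 else 0.0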
One of the most powerful parts was We Pray.
The video and music match so well, hit hard, and resonate strongly with the times.
Starting to think
gibberish gibberish gibberish
Focus again. Calm up.
🤣
It's fun to see these aha moments and it'd be interesting to understand whether their presence helps.
Could not reproduce with the API tho.
Nice to see the technical details and MIT license for something that looks at o1 level 🥳