Ankit Maloo
@ankit.bsky.social
AI/ML Research at clioapp.ai
What we need before that is benchmarks: new benchmarks that aren't saturated, and new ways to test how good a model is at specific tasks.
December 21, 2024 at 10:23 PM
Perhaps the progress to the next stage will be slow. Large private companies like Meta and Google hold a huge advantage here, but don't discount the incentives and the power of deals. I think there is more data to be unlocked, and we will see better models soon.
December 21, 2024 at 10:23 PM
8/ Private data locked in silos across multiple private companies, like BCG, GE, etc. These are high-value datasets, and soon there will be a race to secure exclusive deals with these companies.
December 21, 2024 at 10:23 PM
7/ CCTV cameras offer a new source of insights, especially in the era of multimodal training data.
December 21, 2024 at 10:23 PM
5/ Print-era newspapers are in the same category. The NYT and other long-running media houses might have a huge resource on their hands.

6/ Digitizing other work like research papers, journals, and articles.
December 21, 2024 at 10:23 PM
4/ Digitizing books from the pre-internet era. Many books are yet to be digitized. Google has digitized quite a few and the Internet Archive has many, but they aren't easily accessible, and it's a huge copyright issue we have no clarity on today.
December 21, 2024 at 10:23 PM
3/ Another data source is video. YouTube has so many videos that, if transcribed well, they could unlock a truly useful source of new insights and take models to the next level. Of course, it's behind copyright, and there are privacy concerns still to be solved.
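As a rough sketch of what "transcribed well" could look like in practice (not any lab's actual pipeline), the open-source Whisper model turns extracted audio into plain text; the file path and model size below are placeholders.

```python
# Minimal transcription sketch using the open-source Whisper package
# (pip install openai-whisper). The audio path is a placeholder.
import whisper

model = whisper.load_model("base")            # larger checkpoints transcribe more accurately
result = model.transcribe("talk_audio.mp3")   # audio previously extracted from a video
print(result["text"])                         # plain-text transcript for a training corpus
```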
December 21, 2024 at 10:23 PM
2/ Large swathes of data - tweets, conversations on Discord, Facebook groups, private forums, email threads - contain the kind of original knowledge that has not been put into the public domain yet. With the private bits removed, using these could yield different results.
December 21, 2024 at 10:23 PM
1/n Facebook trained Llama 3.1 on 15T tokens, comfortably double that of any other model out there. They used a decent amount of synthetic data, especially for code. How? Given their size, they had automated systems that could classify which generated code was high quality, and they used that.
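To make the idea concrete, here is a minimal sketch of that kind of filtering loop: generated snippets go through a quality classifier and only confident positives are kept. The checkpoint name, label, and threshold are hypothetical stand-ins, not the actual system behind Llama 3.1.

```python
# Sketch: keep only synthetic code samples that a quality classifier rates highly.
# "my-org/code-quality-classifier", the HIGH_QUALITY label, and the 0.8 threshold
# are illustrative assumptions, not Meta's real pipeline.
from transformers import pipeline

classifier = pipeline("text-classification", model="my-org/code-quality-classifier")

def filter_synthetic_code(samples, threshold=0.8):
    """Return the subset of generated code snippets judged high quality."""
    kept = []
    for snippet in samples:
        prediction = classifier(snippet)[0]
        if prediction["label"] == "HIGH_QUALITY" and prediction["score"] >= threshold:
            kept.append(snippet)
    return kept

generated = ["def add(a, b):\n    return a + b", "def add(a, b): pass  # TODO"]
print(filter_synthetic_code(generated))
```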
December 21, 2024 at 10:23 PM
Perhaps. I was thinking this one: arxiv.org/abs/2405.00451

The paper you cited is an older one that builds the base, but things have rapidly evolved since.
Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process inspired by the successful strategy employed by ...
arxiv.org
December 21, 2024 at 6:09 AM
So not very different from journalists?
December 3, 2024 at 4:10 PM
1. Do you train the model on the combined dataset?
2. Do you use two separate models with a classifier? How do you generalize that?
3. Do you train one model and then fine-tune it on the second dataset?

Would love to know how you think about it.
November 28, 2024 at 10:20 PM
The desired outcome you want is:
- Generate stories when given a prompt about generating stories, using knowledge from Dataset A.
- Generate recipes when given a prompt about generating recipes, using knowledge from Dataset B.
- Produce coherent and relevant output as per the prompt.
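One common way to get that behavior (a sketch of option 1 above, not the only answer) is to fine-tune a single model on the combined data, with a short task tag prefixed to each example so the prompt can route between the two skills. Dataset contents, tags, and column names below are illustrative assumptions.

```python
# Sketch of option 1: one model, one combined corpus, task tags as the routing signal.
# Dataset contents and tag strings are illustrative placeholders.
from datasets import Dataset, concatenate_datasets

stories = Dataset.from_dict({"text": ["Once upon a time, a fox..."]})    # stands in for Dataset A
recipes = Dataset.from_dict({"text": ["Whisk two eggs with sugar..."]})  # stands in for Dataset B

def tag(example, task):
    # Prefix a lightweight instruction so the model learns prompt-conditioned behavior.
    return {"text": f"### Task: {task}\n{example['text']}"}

combined = concatenate_datasets([
    stories.map(lambda ex: tag(ex, "write a story")),
    recipes.map(lambda ex: tag(ex, "write a recipe")),
]).shuffle(seed=42)

# `combined` can now be fed to any standard causal-LM fine-tuning loop
# (e.g. a Hugging Face Trainer) as a single training set.
```

At inference time the same tag in the prompt ("### Task: write a recipe") steers the model toward the right dataset's knowledge, which is usually simpler to maintain than two separate models plus a classifier.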
November 28, 2024 at 10:20 PM
🙋‍♂️
May 8, 2023 at 1:29 PM
These discussion derailers and gotcha accounts are in full force, ready to misinterpret every word to get a sound bite they can report in some publication.
April 29, 2023 at 6:51 AM
121
March 1, 2023 at 12:39 PM