— ERNIE 4.5: beats GPT 4.5 at 1% of the price.
— Reasoning model X1: beats DeepSeek R1 at 50% of the price.
China continues to build intelligence too cheap to meter. The AI price war is on.
Google Gemini really cooked with this one.
This is next gen photo editing.
"Make the steak vegetarian"
"Make the bridge go away"
"Make the keyboard more colorful"
And my favorite
"Give the OpenAI logo more personality"
"Make the steak vegetarian"
"Make the bridge go away"
"Make the keyboard more colorful"
And my favorite
"Give the OpenAI logo more personality"
Nature reported that reasoning LLMs found errors in 1% of the 10,000 research papers they analyzed, with a 35% false-positive rate, at $0.15-$1 per paper.
The Anthropic founder’s vision of “a country of geniuses in a data center” is happening.
LADDER:
— Generate variants of the problem
— Solve them, verify the answers, and use GRPO (DeepSeek's RL algorithm) to learn from what checks out
TTRL:
— Run that same generate/solve/verify/learn loop whenever you hit a new problem, at test time
A new form of test-time compute scaling!
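Roughly, the loop looks like this. This is a minimal sketch, not the paper's code: the callables (generate_variants, solve, verify, grpo_update) are placeholders you would supply, and the GRPO update is abstracted into a single call.

```python
# Hypothetical sketch of the LADDER / TTRL loop described above.
# generate_variants, solve, verify and grpo_update are placeholders,
# not the paper's actual API.

def ladder_step(model, problem, generate_variants, solve, verify, grpo_update,
                n_variants=16):
    """One LADDER iteration: generate variants of the problem, solve and
    verify them, then reinforce with a GRPO-style policy update."""
    rollouts = []
    for variant in generate_variants(model, problem, n_variants):
        answer = solve(model, variant)                    # sample a solution
        reward = 1.0 if verify(variant, answer) else 0.0  # programmatic check
        rollouts.append((variant, answer, reward))
    return grpo_update(model, rollouts)                   # returns the updated model


def ttrl_answer(model, new_problem, generate_variants, solve, verify, grpo_update,
                steps=3):
    """TTRL: run the same generate/solve/verify/learn loop at test time,
    on the problem you were just handed, before producing the final answer."""
    for _ in range(steps):
        model = ladder_step(model, new_problem,
                            generate_variants, solve, verify, grpo_update)
    return solve(model, new_problem)
```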
SortBenchmark measures how fast, cheaply, and efficiently distributed systems can sort large datasets.
— How fast? 134s
— How cheap? $97
— How many in 1 minute? 370B numbers
— How much energy? ~59 kJ, about the energy of a 15-minute walk
Every software engineer should know this.
Revenue (/day): $562k
Cost (/day): $87k
Revenue (/yr): ~$205M
All this while charging $2.19/M tokens for R1, ~25x less than OpenAI o1.
If this were in the US, it would be a >$10B company.
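The yearly figure is just the daily number annualized. A quick sanity check of the arithmetic, assuming the disclosed daily figures hold steady (the margin interpretation is mine):

```python
# Back-of-the-envelope check of the figures above (illustrative only;
# assumes the disclosed daily numbers hold steady all year).
revenue_per_day = 562_000   # USD
cost_per_day = 87_000       # USD

revenue_per_year = revenue_per_day * 365
margin = (revenue_per_day - cost_per_day) / revenue_per_day

print(f"annualized revenue ~ ${revenue_per_year / 1e6:.0f}M")  # ~ $205M
print(f"implied margin ~ {margin:.0%}")                        # ~ 85%
```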
Fork a repo.
Select a folder.
Ask it anything.
It even shows you what percentage of the context window each folder takes up.
Here it visualizes yt-dlp's (the YouTube downloader) flow.
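As for those context-window percentages: I don't know how the tool computes them, but conceptually it's just token counting per folder. A rough sketch, where the encoding choice, the 1M-token window, and the file filter are all my assumptions rather than the tool's implementation:

```python
# Rough sketch of "what % of the context window each folder takes".
# Not the tool's actual implementation; encoding, window size and the
# file extensions are assumptions for illustration.
from collections import defaultdict
from pathlib import Path

import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")
CONTEXT_WINDOW = 1_000_000  # tokens

def folder_usage(repo_root: str, extensions=(".py", ".md")) -> dict[str, float]:
    """Return, per folder, the share of the context window its files consume."""
    root = Path(repo_root)
    tokens_per_folder: dict[str, int] = defaultdict(int)
    for path in root.rglob("*"):
        if path.suffix not in extensions or not path.is_file():
            continue
        text = path.read_text(errors="ignore")
        folder = str(path.parent.relative_to(root))
        tokens_per_folder[folder] += len(ENC.encode(text, disallowed_special=()))
    return {f: 100 * n / CONTEXT_WINDOW for f, n in tokens_per_folder.items()}

for folder, pct in sorted(folder_usage(".").items(), key=lambda kv: -kv[1]):
    print(f"{pct:5.1f}%  {folder}")
```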
The winner was OpenAI.
It had the most detailed, highest-quality, and most accurate answer, but you do pay $200/mo for it.
Excellence is boring. It's making the same boring "correct" choice over and over again. You win by being consistent for longer.
Our short attention spans tend to forget that.
The model was NOT contaminated with this data, and the 50-submission limit was used.
We will likely see superhuman coding models this year.
I'm surprised more people don't know about it. Brendan Bycroft made this beautiful interactive visualization that walks through exactly how every weight inside an LLM is used.
Here's a link:
Perfect needle-in-the-haystack scores are easy: the attention mechanism can match the exact words. Require even one hop of reasoning and performance degrades quickly.
This is why guaranteeing correctness for agents is hard.
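A toy illustration of the difference (the wording and numbers are entirely made up): the retrieval probe shares words with the needle, so finding it is essentially string matching, while the one-hop probe forces the model to chain two facts that never co-occur with the question's phrasing.

```python
# Toy needle-in-a-haystack probe vs. a 1-hop variant (entirely made up).
filler = "The sky was a pleasant shade of blue that afternoon. " * 2000

# Retrieval case: the question lexically matches the needle.
needle = "The secret passcode is 7421."
retrieval_prompt = filler + needle + filler + "\nQuestion: What is the secret passcode?"

# 1-hop case: the answer requires chaining two facts; no single sentence
# in the context matches the question's wording.
fact_a = "Alice's badge number is the same as the secret passcode."
fact_b = "The secret passcode is 7421."
one_hop_prompt = (filler + fact_a + filler + fact_b + filler
                  + "\nQuestion: What is Alice's badge number?")
```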
Gemini 2 Flash's $0.40/M tokens and 1M-token context mean you can now parse ~6,000 pages of PDFs at near-perfect quality for $1.
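The arithmetic behind that, assuming a parsed page comes out to roughly 400 output tokens (my assumption, not a published figure):

```python
# Rough arithmetic for the pages-per-dollar claim above.
price_per_m_output_tokens = 0.40   # USD, Gemini 2 Flash price quoted above
tokens_per_page = 400              # assumption: output tokens per parsed PDF page

tokens_per_dollar = 1_000_000 / price_per_m_output_tokens   # 2.5M tokens
pages_per_dollar = tokens_per_dollar / tokens_per_page
print(f"~{pages_per_dollar:,.0f} pages per $1")              # ~6,250
```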
Deep research generates ~10-page reports in ~15 mins by scouring hundreds of websites. This could replace a lot of human work. I tried both so you don't have to.
The verdict: OpenAI is faster and higher quality, despite costing more.
Price per million tokens (cached input, input, output):
Gemini 2 Flash Lite: $0.01875, $0.075, $0.30
Gemini 2 Flash: $0.025, $0.10, $0.40
GPT 4o-mini: $0.075, $0.15, $0.60
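To make those numbers concrete, here is what a single hypothetical call with 10k input tokens and 1k output tokens would cost under each price list (no cache hits assumed):

```python
# Cost of one example call under the prices listed above (no cached input).
prices = {  # model: (input $/M tokens, output $/M tokens)
    "Gemini 2 Flash Lite": (0.075, 0.30),
    "Gemini 2 Flash":      (0.10, 0.40),
    "GPT 4o-mini":         (0.15, 0.60),
}
input_tokens, output_tokens = 10_000, 1_000
for model, (p_in, p_out) in prices.items():
    cost = input_tokens / 1e6 * p_in + output_tokens / 1e6 * p_out
    print(f"{model}: ${cost:.5f}")   # a fraction of a cent per call
```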
"write a table of the 25 most important chip architectures over time, and come up with 12 columns to compare them on"
"write a table of the 25 most important chip architectures over time, and come up with 12 columns to compare them on"
– Readers: free 200+ page book covering pre-training, generative models, prompting and alignment
– Programmers: Karpathy’s neural networks zero to hero playlist including implementing GPT-2 from scratch
If India wants to build its own foundation model, it should digitize all its records, hope that adds up to 1T+ tokens, keep the data locked to its own models, and then have the best models in Indic languages.
Misses some of the awesome analysis in the system card, but pretty nicely covers where we are.
Cheaper, better models.
— one of the highest-quality non-reasoning LLMs
— super fast (150+ tok/s)
— 1M-token context window
The API price isn't out yet, but it was previously $0.075/$0.30 per M input/output tokens. Big move from Google.