burkov.bsky.social
@burkov.bsky.social
AI experts in 2025: 1) Asked an LLM to solve the business problem, 2) Said that it's an agent, so the solution should be OK, 3) Vibe coded it to production.
March 22, 2025 at 3:57 AM
Will the billions poured into LLM companies over the last two years let them unlock access to proprietary data at a scale that justifies continuing the hype? I bet not. And you?
December 8, 2024 at 6:38 AM
Limited access to data is what killed the big data/Hadoop hype and what kept machine/deep learning as a niche skill.
December 8, 2024 at 6:38 AM
This will require proprietary data, and that's the problem: we are back to traditional data science/AI, where models were empty shells looking for data, and the data was hard to find or adapt for ML.
December 8, 2024 at 6:38 AM
The next stage will be companies trying to specialize LLMs to do something they cannot do well enough from pretraining.
December 8, 2024 at 6:38 AM
If not, the word is split into individual characters, and those characters are merged using the learned merge rules, in the same order those rules were added to the merges collection during BPE training.

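The two-step procedure described in this thread can be sketched as follows. This is a minimal illustration, not any specific library's implementation; the vocabulary and merge rules here are made-up toy examples.

```python
# Toy vocabulary and learned merges (hypothetical, for illustration only).
vocab = {"l", "o", "w", "e", "r", "lo", "low", "er"}
# Merge rules in the order they were learned during BPE training.
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
merge_rank = {pair: i for i, pair in enumerate(merges)}

def tokenize_word(word):
    # Step 1: if the whole word is already a token, return it directly.
    if word in vocab:
        return [word]
    # Step 2: otherwise split into characters and repeatedly apply the
    # earliest-learned merge rule that matches an adjacent pair.
    symbols = list(word)
    while len(symbols) > 1:
        ranked = [(merge_rank.get(pair, float("inf")), i)
                  for i, pair in enumerate(zip(symbols, symbols[1:]))]
        rank, i = min(ranked)
        if rank == float("inf"):
            break  # no learned merge applies anymore
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

print(tokenize_word("low"))    # whole word is in the vocabulary: ['low']
print(tokenize_word("lower"))  # split and merged: ['low', 'er']
```

Note that this is not greedy longest-prefix matching: "lower" is split into characters and rebuilt by replaying the merges in their training order.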
Don't trust online information. Trust the source code and good books.
December 1, 2024 at 1:48 AM
This is not how it works, and doing so would not result in correct tokenization. The real algorithm takes a word, checks if the word is also a token, and if it is, it returns the token.
December 1, 2024 at 1:48 AM
Once the BPE model is trained, most explanations describe tokenizing a new sequence as scanning it from left to right and looking for the longest token in the vocabulary that matches the upcoming characters.
December 1, 2024 at 1:48 AM