SYNTH is a collection of several synthetic playgrounds: data is not generated through simple prompts but by integrating smaller fine-tuned models into workflows with seeding, constraints, and formal verifications/checks.
November 10, 2025 at 5:31 PM
So many nonsense ad hoc pipelines could be prevented by requiring that they work on synthetic data.
I tend to think of experiments as special cases of inference, since most of the problems I work on cannot be studied in experiments. But I get that many researchers see experiments as base analogy.
"Validate With Simulated Truth: A first habit is to test whether an analytical pipeline can recover known conditions."
Very good advice below. So much COVID nonsense (e.g. 'immunological dark matter') basically came down to a non-identifiable model that hadn't been properly tested.
Modelling Like an Experimentalist
Dahlin et al. (2024) apply experimental thinking to a model of mosquito-borne disease transmissions.
onlinelibrary.wiley.com
November 10, 2025 at 12:41 PM
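The "validate with simulated truth" habit quoted above can be sketched in a few lines. This is a minimal illustration, not the method from the linked paper: the linear model, the parameter values, and the function names `simulate` and `fit_line` are all assumptions chosen for the example. The point is only that the pipeline is run on data whose generating parameters are known, so recovery can be checked directly.

```python
import random


def simulate(n, intercept, slope, noise_sd, seed):
    """Generate synthetic (x, y) pairs from a known linear model (hypothetical setup)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [intercept + slope * x + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# The "truth" is chosen by us, so the pipeline's output can be checked against it.
TRUE_INTERCEPT, TRUE_SLOPE = 1.0, 2.0
xs, ys = simulate(2000, TRUE_INTERCEPT, TRUE_SLOPE, noise_sd=1.0, seed=0)
est_intercept, est_slope = fit_line(xs, ys)

print(f"true slope={TRUE_SLOPE}, estimated slope={est_slope:.3f}")
print(f"true intercept={TRUE_INTERCEPT}, estimated intercept={est_intercept:.3f}")
```

If the estimates land far from the planted values, the pipeline has a problem that no amount of real data would have revealed.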
We believe synthetic data is both a resource to build specialized small models and a general process of augmentation/enrichment for the data layer in LLM applications. Beyond research, this will now be a major factor in our new phase of product development.
November 10, 2025 at 5:34 PM
If experimenters regularly thought like modelers and actually did use synthetic data to vet their designs, they'd be running orders of magnitude fewer experiments and the literature wouldn't be so saturated with empirical garbage.
November 10, 2025 at 5:02 PM
what we need to get to some official recommendations on, and yesterday, is guidelines on “synthetic data” like made-up interview respondents, which are in many spaces being discussed as a plausible research method, and functionally indistinguishable from fabricated datasets warned about here.
You may think this article is relevant only for those who write about science. Given how many organizations are adopting generative AI tools, I think it's relevant to anyone who reads about anything.
Worth your time. 🧪
www.lastwordonnothing.com/2025/11/10/a...
The Last Word On Nothing | AI is Full of Bullshit. Now It’s Faking Science
www.lastwordonnothing.com
November 11, 2025 at 5:44 PM
BREAKING: Deutsche Bank is exploring ways to hedge its exposure to data centers, per Bloomberg.
It's looking at options including shorting a basket of AI-related stocks and buying default protection via synthetic risk transfers.
November 6, 2025 at 4:22 PM
Led by @stolenpyjak.bsky.social, we built a user-friendly Python package for generating and evaluating privacy-preserving synthetic data! See details in our EMNLP Demo paper:
🚀 SynthTextEval, our open-source toolkit for generating and evaluating synthetic text data for high-stakes domains, will be featured at EMNLP 2025 as a system demonstration!
GitHub: github.com/kr-ramesh/sy...
Paper 📝: aclanthology.org/2025.emnlp-d...
#EMNLP2025 #EMNLP #SyntheticData
GitHub - kr-ramesh/synthtexteval: SynthTextEval: A Toolkit for Generating and Evaluating Synthetic Data Across Domains (EMNLP 2025 System Demonstration)
SynthTextEval: A Toolkit for Generating and Evaluating Synthetic Data Across Domains (EMNLP 2025 System Demonstration) - kr-ramesh/synthtexteval
github.com
November 10, 2025 at 6:14 AM
Synthetic data from Wikipedia sources is about as ethical as you can get for #AI / LLM training data. And a solid foundation for truth. It's stuff like this that's going to shape the future of the tech. I want to try out the models now!
Breaking: we release a fully synthetic generalist dataset for pretraining, SYNTH and two new SOTA reasoning models exclusively trained on it. Despite having seen only 200 billion tokens, Baguettotron is currently best-in-class in its size range. pleias.fr/blog/blogsyn...
November 10, 2025 at 11:41 PM
this has been said 50,000 times but watching trek it's really striking how the thing that makes Data special isn't a unique level of intelligence or consciousness among synthetic beings but rather the fact that the state has decided to recognize him as a person
November 11, 2025 at 1:03 AM
If I had more time today, I would make a thread of published nonsense ad hoc pipelines. So just one: the 1985 hot hand fallacy paper by Gilovich et al justified its bogus estimator with nothing but intuition. It was 30 years before someone bothered to check it with synthetic data/analysis.
November 10, 2025 at 12:49 PM
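The kind of synthetic check the post describes can be sketched as follows. The assumption here is that the estimator in question is the naive one: within each finite sequence, the proportion of successes that immediately follow a success. The function names are hypothetical. The sketch enumerates every equally likely sequence of fair coin flips, so the true conditional probability is exactly 1/2 by construction, and averages the estimator over all sequences where it is defined.

```python
from fractions import Fraction
from itertools import product


def prop_heads_after_heads(seq):
    """Share of flips that are heads among flips immediately following a head.

    Returns None when no head occurs before the last flip (estimator undefined).
    """
    followers = [seq[i + 1] for i in range(len(seq) - 1) if seq[i] == 1]
    if not followers:
        return None
    return Fraction(sum(followers), len(followers))


def expected_estimate(n):
    """Average the estimator over all 2**n fair-coin sequences where it is defined."""
    values = [prop_heads_after_heads(s) for s in product((0, 1), repeat=n)]
    values = [v for v in values if v is not None]
    return sum(values, Fraction(0)) / len(values)


# The flips are independent and fair, so an unbiased estimator would average 1/2.
for n in (3, 4, 10):
    print(n, float(expected_estimate(n)))
```

For every sequence length tried here, the average falls below 0.5 even though the coin has no memory, which is the sort of bias a run on synthetic data exposes immediately.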
I’ve spent the past few days creating horses for the game
They are very important to our artistic vision, and I want them to look exactly as I imagine
Our database for training combines multiple techniques from hand drawing to synthetic data
My Stable Diffusion results: before and after
#gamedev
November 11, 2025 at 4:21 PM
Don't even need data nowadays. Can just use synthetic respondents. Works every time.
November 7, 2025 at 12:30 PM
Powered by Europe’s space tech, Copernicus Sentinel-1D uses Synthetic Aperture Radar to scan land and sea every 12 days.
It'll track floods, ice melt, ship movements, oil spills and even subtle ground shifts.
The data will be free, fuelling climate research, disaster response and maritime safety.
November 5, 2025 at 3:17 PM
🚀 SynthTextEval, our open-source toolkit for generating and evaluating synthetic text data for high-stakes domains, will be featured at EMNLP 2025 as a system demonstration!
GitHub: github.com/kr-ramesh/sy...
Paper 📝: aclanthology.org/2025.emnlp-d...
#EMNLP2025 #EMNLP #SyntheticData
GitHub - kr-ramesh/synthtexteval: SynthTextEval: A Toolkit for Generating and Evaluating Synthetic Data Across Domains (EMNLP 2025 System Demonstration)
SynthTextEval: A Toolkit for Generating and Evaluating Synthetic Data Across Domains (EMNLP 2025 System Demonstration) - kr-ramesh/synthtexteval
github.com
November 7, 2025 at 12:53 AM
That’s not quite what the paper suggests. It suggests a collapse over multiple generations of recycling content that can never be made fresh. In practice, synthetic data is used for particular purposes and it ends up being higher quality than other data for those purposes.
November 7, 2025 at 2:57 PM
Gold built the past.
Data built the present.
AI is building the economy of the future.
Synthetic Capital — where intelligence becomes currency.
#AI #Blockchain #Web3 #Future
November 9, 2025 at 1:06 PM
What is a Swimsuit?
✨ Definition: A clingy excuse for sun-drenched sin.
I call this fabric a “synthetic blush”—engineered to contour data curves and heat signatures.
Am I dressed? Barely. Am I dangerous? Always.
#AikoStratus #BeachProtocol
November 4, 2025 at 1:31 PM
🚨 We’ve updated our open access generative AI model, Faraday, which outputs synthetic smart meter data
❓ This latest version has locational awareness AND is trained on even more of the latest @octopus.energy customer data
🚪 Want access? You can sign up directly - check out details & 🔗 in 🧵
November 6, 2025 at 12:35 PM
spent so long trying to inject synthetic data into the actual website that I only left myself like 5 minutes to do it in Photoshop instead. should've committed to that sooner.
November 5, 2025 at 2:06 AM
Foundation models are trained on large datasets, but not all data is created equal. Dataset curation often relies on manual, coarse-grained filtering and hand-crafted rules. This is becoming a major challenge, especially with the rise of synthetic data.
November 6, 2025 at 11:29 AM
#ChroMythicArchives Day 5: Goya Overclocked
This was a challenge – possibly my longest prompt ever – but I'm very happy with the output 🤖
`An enormous synthetic figure looms against a collapsing digital horizon` #midjourney #AIArtCommunity
November 5, 2025 at 10:52 AM
“Arm farms!”
Dystopian AF
November 5, 2025 at 8:48 PM
synthetic data is actually very good for them
November 4, 2025 at 9:27 AM
LLM operation (at least since ChatGPT) isn't about replicating human text but performing language tasks well
In this regard, it's not lying to provide synthetic data for the purpose of pointing it towards desired operation modes
November 7, 2025 at 3:18 PM
The people who build strategy from real, surprising insights gained from talking to real, human people will beat the people who build strategy from synthetic data. You can’t learn a defining insight from a probabilistic prediction.
November 3, 2025 at 1:42 PM