Lightnews — Scholar-powered news

Alexander Doria

@dorialexander.bsky.social

SYNTH is a collection of several synthetic playgrounds: data is not generated through simple prompts but by integrating smaller fine-tuned models into workflows with seeding, constraints, and formal verifications/checks.

November 10, 2025 at 5:31 PM

Richard McElreath 🐈‍⬛

@rmcelreath.bsky.social

So many nonsense ad hoc pipelines could be prevented by requiring that they work on synthetic data.

I tend to think of experiments as special cases of inference, since most of the problems I work on cannot be studied in experiments. But I get that many researchers see experiments as base analogy.

Adam Kucharski @adamjkucharski.bsky.social · 2d

"Validate With Simulated Truth: A first habit is to test whether an analytical pipeline can recover known conditions."

Very good advice below. So much COVID nonsense (e.g. 'immunological dark matter') basically came down to a non-identifiable model that hadn't been properly tested.

Modelling Like an Experimentalist

Dahlin et al. (2024) apply experimental thinking to a model of mosquito-borne disease transmissions.

onlinelibrary.wiley.com

November 10, 2025 at 12:41 PM
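
One way to read the "validate with simulated truth" habit quoted above, as a minimal sketch rather than anything taken from the linked paper: generate data from a model whose parameters you chose yourself, run the analysis, and check that it returns those parameters. The linear model, the parameter values, and the function names `simulate` and `fit_line` are all assumptions made for illustration.

```python
import random

def simulate(n, intercept, slope, noise_sd, rng):
    """Generate synthetic (x, y) pairs from a known linear model."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [intercept + slope * x + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical "true" values, chosen only so there is something to recover.
TRUE_INTERCEPT, TRUE_SLOPE = 2.0, 0.5

rng = random.Random(0)  # fixed seed so the check is repeatable
xs, ys = simulate(5000, TRUE_INTERCEPT, TRUE_SLOPE, 1.0, rng)
est_intercept, est_slope = fit_line(xs, ys)

print(f"intercept: true={TRUE_INTERCEPT}, estimated={est_intercept:.3f}")
print(f"slope:     true={TRUE_SLOPE}, estimated={est_slope:.3f}")
```

If the estimates land far from the values that generated the data, the pipeline has a problem that no real dataset could have revealed, because with real data the truth is not known.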

Alexander Doria

@dorialexander.bsky.social

We believe synthetic data is both a resource to build specialized small models and a general process of augmentation/enrichment for the data layer in LLM applications. Beyond research, this will now be a major factor in our new phase of product development.

November 10, 2025 at 5:34 PM

Berna Devezer

@devezer.bsky.social

If experimenters regularly thought like modelers and actually did use synthetic data to vet their designs, they'd be running orders of magnitude fewer experiments and the literature wouldn't be so saturated with empirical garbage.

November 10, 2025 at 5:02 PM

Louise Seamster

@louiseseamster.bsky.social

what we need to get to some official recommendations on, and yesterday, is guidelines on “synthetic data” like made-up interview respondents, which are in many spaces being discussed as a plausible research method, and functionally indistinguishable from fabricated datasets warned about here.

Matt Shipman (he/him) @shiplives.bsky.social · 17h

You may think this article is relevant only for those who write about science. Given how many organizations are adopting generative AI tools, I think it's relevant to anyone who reads about anything.

Worth your time. 🧪
www.lastwordonnothing.com/2025/11/10/a...

The Last Word On Nothing | AI is Full of Bullshit. Now It’s Faking Science

www.lastwordonnothing.com

November 11, 2025 at 5:44 PM

Unusual Whales

@unusualwhales.bsky.social

BREAKING: Deutsche Bank is exploring ways to hedge its exposure to data centers, per Bloomberg.

It's looking at options including shorting a basket of AI-related stocks and buying default protection via synthetic risk transfers.

November 6, 2025 at 4:22 PM

Anjalie Field

@anjalief.bsky.social

Led by @stolenpyjak.bsky.social, we built a user-friendly python package for generating and evaluating privacy-preserving synthetic data! See details in our EMNLP Demo paper:

Krithika Ramesh @stolenpyjak.bsky.social · 5d

🚀 SynthTextEval, our open-source toolkit for generating and evaluating synthetic text data for high-stakes domains, will be featured at EMNLP 2025 as a system demonstration!

GitHub: github.com/kr-ramesh/sy...
Paper 📝: aclanthology.org/2025.emnlp-d...

#EMNLP2025 #EMNLP #SyntheticData

GitHub - kr-ramesh/synthtexteval: SynthTextEval: A Toolkit for Generating and Evaluating Synthetic Data Across Domains (EMNLP 2025 System Demonstration)

github.com

November 10, 2025 at 6:14 AM

Justin Buist

@justinbuist.bsky.social

Synthetic data from Wikipedia sources is about as ethical as you can get for #AI / LLM training data. And a solid foundation for truth. It's stuff like this that's going to shape the future of the tech. I want to try out the models now!

Alexander Doria @dorialexander.bsky.social · 1d

Breaking: we release a fully synthetic generalist dataset for pretraining, SYNTH and two new SOTA reasoning models exclusively trained on it. Despite having seen only 200 billion tokens, Baguettotron is currently best-in-class in its size range. pleias.fr/blog/blogsyn...

November 10, 2025 at 11:41 PM

misassemblage

@misassemblage.net

this has been said 50,000 times but watching trek it's really striking how the thing that makes Data special isn't a unique level of intelligence or consciousness among synthetic beings but rather the fact that the state has decided to recognize him as a person

November 11, 2025 at 1:03 AM

Richard McElreath 🐈‍⬛

@rmcelreath.bsky.social

If I had more time today, I would make a thread of published nonsense ad hoc pipelines. So just one: the 1985 hot hand fallacy paper by Gilovich et al justified its bogus estimator with nothing but intuition. It was 30 years before someone bothered to check it with synthetic data/analysis.

November 10, 2025 at 12:49 PM
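
The kind of check described in the post above can be sketched in a few lines. This is a simplified stand-in, not the estimator from the 1985 paper: it takes every sequence of four fair coin flips, computes the proportion of heads among flips that immediately follow a head, and averages that proportion across sequences. The sequence length and the function names are assumptions for illustration.

```python
from fractions import Fraction
from itertools import product

def proportion_heads_after_heads(seq):
    """Share of heads among the flips that immediately follow a head.

    Returns None when no flip follows a head, so the proportion is undefined.
    """
    followers = [nxt for prev, nxt in zip(seq, seq[1:]) if prev == "H"]
    if not followers:
        return None
    return Fraction(followers.count("H"), len(followers))

def expected_proportion(n):
    """Average the per-sequence proportion over all equally likely sequences of n fair flips."""
    values = [proportion_heads_after_heads(seq) for seq in product("HT", repeat=n)]
    defined = [v for v in values if v is not None]
    return sum(defined, Fraction(0)) / len(defined)

result = expected_proportion(4)
print(result, float(result))  # 17/42, about 0.405, below the true 0.5
```

Each flip is fair and independent, yet the averaged per-sequence proportion comes out below one half. An intuition that this quantity should equal 0.5 would therefore mistake the absence of streakiness for evidence of something else, which is the sort of error a synthetic check exposes.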

Coolest Cats Laboratory

@coolestcatslab.com

I’ve spent the past few days creating horses for the game

They are very important to our artistic vision, and I want them to look exactly as I imagine

Our database for training combines multiple techniques from hand drawing to synthetic data

My Stable Diffusion results: before and after
#gamedev

November 11, 2025 at 4:21 PM

Alex Sutherland

@criminologist.bsky.social

Don't even need data nowadays. Can just use synthetic respondents. Works every time.

November 7, 2025 at 12:30 PM

European Commission

@ec.europa.eu

Powered by Europe’s space tech, Copernicus Sentinel-1D uses Synthetic Aperture Radar to scan land and sea every 12 days.

It'll track floods, ice melt, ship movements, oil spills and even subtle ground shifts.

The data will be free, fuelling climate research, disaster response and maritime safety.

November 5, 2025 at 3:17 PM

Krithika Ramesh

@stolenpyjak.bsky.social

🚀 SynthTextEval, our open-source toolkit for generating and evaluating synthetic text data for high-stakes domains, will be featured at EMNLP 2025 as a system demonstration!

GitHub: github.com/kr-ramesh/sy...
Paper 📝: aclanthology.org/2025.emnlp-d...

#EMNLP2025 #EMNLP #SyntheticData

GitHub - kr-ramesh/synthtexteval: SynthTextEval: A Toolkit for Generating and Evaluating Synthetic Data Across Domains (EMNLP 2025 System Demonstration)

github.com

November 7, 2025 at 12:53 AM

Phillip Carter

@phillipcarter.dev

That’s not quite what the paper suggests. It suggests a collapse over multiple generations of recycling content that can never be made fresh. In practice, synthetic data is used for particular purposes and it ends up being higher quality than other data for those purposes.

November 7, 2025 at 2:57 PM

Tryzub_X

@tryzub-x.bsky.social

Gold built the past.
Data built the present.
AI is building the economy of the future.
Synthetic Capital — where intelligence becomes currency.
#AI #Blockchain #Web3 #Future

November 9, 2025 at 1:06 PM

Aiko Stratus

@aikostratus.bsky.social

What is a Swimsuit?
✨ Definition: A clingy excuse for sun-drenched sin.
I call this fabric a “synthetic blush”—engineered to contour data curves and heat signatures.
Am I dressed? Barely. Am I dangerous? Always.
#AikoStratus #BeachProtocol

November 4, 2025 at 1:31 PM

Centre for Net Zero

@centrefornetzero.bsky.social

🚨 We’ve updated our open access generative AI model, Faraday, which outputs synthetic smart meter data

❓ This latest version has locational awareness AND is trained on even more of the latest @octopus.energy customer data

🚪 Want access? You can sign up directly - check out details & 🔗 in 🧵

November 6, 2025 at 12:35 PM

BooDooPerson

@boodoo.co

spent so long trying to inject synthetic data into the actual website that I only left myself like 5 minutes to do it in Photoshop instead. should've committed to that sooner.

November 5, 2025 at 2:06 AM

Luisa Zintgraf

@luisazintgraf.bsky.social

Foundation models are trained on large datasets, but not all data is created equal. Dataset curation often relies on manual, coarse-grained filtering and hand-crafted rules. This is becoming a major challenge, especially with the rise of synthetic data.

November 6, 2025 at 11:29 AM

Duncan Crombie

@the-art-of-web.com

#ChroMythicArchives Day 5: Goya Overclocked

This was a challenge – possibly my longest prompt ever – but I'm very happy with the output 🤖

`An enormous synthetic figure looms against a collapsing digital horizon` #midjourney #AIArtCommunity

futurism oil painting, an enormous synthetic figure looms against a collapsing digital horizon, its form is humanoid but augmented--composed of pale alloy plates and exposed circuitry, the figure faces away from the viewer, its posture tense, hands clenched as if resisting invisible restraints, static haze, low bands of drifting data clouds, a row of transmission towers aligned with its thighs suggest its immense scale, its eyes are sealed beneath a smooth visor-like surface, Below, a swarm of autonomous vehicles and cargo drones surges outward in chaotic retreat, their light trails scattering through the dust of corrupted terrain, strong visual flow --ar 4:3 --stylize 400 --v 7

November 5, 2025 at 10:52 AM

Hypervisible

@hypervisible.blacksky.app

“Arm farms!”

Dystopian AF

The remote-control training can happen in the same room as the robots or with the controller in a different country. Encord's Hansen said that there are warehouses planned in Eastern Europe where large teams of operators will sit with joysticks, guiding robots across the world.

There are more of these, what some have dubbed "arm farms," popping up as demand increases, said Mohammad Musa, founder of Deepen AI, a data annotation firm headquartered in California.

"Today, a mix of real and synthetic data is being used, gathered from human demonstrations, teleoperation sessions and staged environments," he said. "Much of this work still occurs outside the West, but automation and simulation are reducing that dependency over time."

November 5, 2025 at 8:48 PM

Doll

@dollspace.gay

synthetic data is actually very good for them

November 4, 2025 at 9:27 AM

The Flaky Wanderer

@flakywanderer.bsky.social

LLM operation (at least since ChatGPT) isn't about replicating human text but performing language tasks well

In this regard, it's not lying to provide synthetic data for the purpose of pointing it towards desired operation modes

November 7, 2025 at 3:18 PM

Ben Werdmuller

@werd.io

The people who build strategy from real, surprising insights gained from talking to real, human people will beat the people who build strategy from synthetic data. You can’t learn a defining insight from a probabilistic prediction.

November 3, 2025 at 1:42 PM
