Stefan Baack
@sbaack.com
Senior researcher studying data governance and AI training data. Mastodon: @tootbaack@infosec.exchange he/him
Pinned
Most #generativeAI models were trained on Common Crawl, a massive archive of web crawl data. Yet most people have never heard of it. My new research studies Common Crawl in depth and highlights its influence on LLM research and development foundation.mozilla.org/en/research/... (1/10)
Training Data for the Price of a Sandwich: Common Crawl’s Impact on Generative AI
Mozilla research finds that Common Crawl's outsized role in the generative AI boom has improved transparency and competition, but is also contributing to biased and opaque generative AI models.
foundation.mozilla.org
Reposted by Stefan Baack
A little-known nonprofit has been lying to news publishers while funneling millions of paywalled articles to tech companies for AI training. Read my investigation in The Atlantic. www.theatlantic.com/technology/2...
The Nonprofit Doing the AI Industry’s Dirty Work
The web archive Common Crawl has been quietly funneling paywalled articles to AI companies—and lying to publishers about it.
www.theatlantic.com
November 4, 2025 at 3:59 PM
Check in if you're interested in my thoughts about what open source AI should aspire to be in relation to proprietary AI
What should open source AI aspire to be? Watch Stefan Baack and Kasia Odrozek's keynote at OSI Deep Dive: Data Governance conference taking place October 1-3. Register for free at: https://opensource.org/datagovernanceconf
October 2, 2025 at 11:03 AM
"The update is yet another signal that payment processors...are currently the ultimate arbiter of what kind of content can be made easily available online, or not."
July 16, 2025 at 8:08 PM
The key questions we always should ask when people talk about AI: What is being automated and why? @alexhanna.bsky.social @weizenbauminstitut.bsky.social
June 30, 2025 at 4:47 PM
"AI is a labor disciplining device" @alexhanna.bsky.social
June 30, 2025 at 4:25 PM
Reposted by Stefan Baack
“The reporter is a man of critical value. No amount of money or effort spent in fitting the right men for this work could possibly be wasted, for the health of society depends upon the quality of the information it receives.” — Walter Lippmann [a century later, I’d swap “man” for “person” though]
May 11, 2025 at 2:38 PM
Reposted by Stefan Baack
New Release! Most AI deepfakes aren't political. 90% of deepfakes are non-consensual intimate imagery. 99% of victims are women. Max Hoppenstedt, @rechercheur.bsky.social, @romanhoefner.bsky.social, and I uncover a deepfake community and the business behind undress apps www.spiegel.de/netzwelt/web...
(S+) Deepfake porn: The perfidious business of faked sex videos
Thousands of women become victims of faked porn that shows their faces. Those affected include underage girls, celebrities, and politicians. Behind it are unscrupulous businesspeople. The ...
www.spiegel.de
December 9, 2024 at 1:56 PM
"brainstorming and iteration is...a crucial everyday part of game development...and is not a problem to be solved...I have had many discussions with other game developers who interact with AI engineers and savants who believe our industry pipelines need 'fixing' by them and them alone"
‘An overwhelmingly negative and demoralizing force’: what it’s like working for a company that’s forcing AI on its developers

aftermath.site/ai-video-game-...
April 8, 2025 at 3:28 PM
Reposted by Stefan Baack
The Union (CDU/CSU) wants to abolish Germany's Freedom of Information Act (Informationsfreiheitsgesetz).
@arnesemsrott.bsky.social: “Public oversight and transparency are apparently a thorn in the Union's side. It wants to govern unimpeded. The public's rights apparently get in the way of that.”
Press release: fragdenstaat.de/newsletter/a...
Union wants to abolish the Freedom of Information Act: a frontal attack on transparency and democracy - FragDenStaat
The freedom of information portal for citizens, initiatives, and associations. Submit an IFG request for government documents that matter to you and your work! Find out about ...
fragdenstaat.de
March 26, 2025 at 5:49 PM
Reposted by Stefan Baack
«By moving fast and breaking things, DOGE forces a collapse of the system where unanswered questions are met with technological solutions. Shifting the conversation to the technical is a way of locking policymakers and the public out of decisions and shifting that power to the code they write.»
My latest for @techpolicypress.bsky.social, “Anatomy of an AI Coup.” With the takeover of the US government by tech elites underway, we must examine its goals and next steps — and how we will know if it has succeeded. www.techpolicy.press/anatomy-of-a...
Anatomy of an AI Coup | TechPolicy.Press
DOGE is gutting federal agencies to install AI across the government. Democracy is on the line, writes Tech Policy Press fellow Eryk Salvaggio.
www.techpolicy.press
February 9, 2025 at 7:08 AM
Reposted by Stefan Baack
You can’t post your way out of fascism

Authoritarians and tech CEOs now share the same goal: to keep us locked in an eternal doomscroll instead of organizing against them

🔗 www.404media.co/you-cant-pos...
You Can’t Post Your Way Out of Fascism
Authoritarians and tech CEOs now share the same goal: to keep us locked in an eternal doomscroll instead of organizing against them, Janus Rose writes.
www.404media.co
February 5, 2025 at 5:03 PM
Reposted by Stefan Baack
Auschwitz was at the end of a long process. It did not start from gas chambers.

This hatred was gradually developed by humans. From ideas, words, stereotypes & prejudice through legal exclusion, dehumanization & escalating violence... to systematic and industrial murder.

Auschwitz took time.
January 27, 2025 at 10:00 AM
Reposted by Stefan Baack
“AI is fake and sucks” vs “AI is real and dangerous” is a Twitter argument. In reality I think the debate also has a lot of “AI is real but not for how you’re using it,” to “AI is fake and that is dangerous,” to “things are happening to real people because of AI hype and that should stop.”
December 6, 2024 at 7:29 AM
My reading for this week, delivered to me by the great @aschrock.bsky.social themself! Thank you, looking forward to reading :-)
December 3, 2024 at 3:53 PM
Reposted by Stefan Baack
Labelers training AI say they're overworked, underpaid and exploited by big American tech companies
Digital workers in Kenya had to sift through horrific online content to train AI, but say they were underpaid, overworked, and got inadequate mental health support. So they're fighting back.
www.cbsnews.com
December 3, 2024 at 10:50 AM
Reposted by Stefan Baack
This report gives hope!

More and more new, ambitious media outlets have been founded in Germany and Europe. Outlets that aim to provide the public with high-quality information.

@netzwerkrecherche.org surveyed 174 media outlets in 31 countries for the “Journalism Value Report” and can show:
December 3, 2024 at 11:12 AM
Reposted by Stefan Baack
I have a new piece out with @aisvarya17.bsky.social in @columjournreview.bsky.social in which we test how OpenAI's new search feature surfaces and attributes news content. Our findings were not promising for news publishers (1/9) www.cjr.org/tow_center/h...
How ChatGPT (Mis)represents Publisher Content
ChatGPT search — which is positioned as a competitor to search engines like Google and Bing — launched with a press release from OpenAI touting claims that the company had “collaborated extensively wi...
www.cjr.org
November 27, 2024 at 7:31 PM
Reposted by Stefan Baack
“Without facts, you can’t have truth, and without truth, you can’t have trust”. - Maria Ressa, 2021 Nobel Peace Prize
November 20, 2024 at 11:43 AM
Reposted by Stefan Baack
The Onion should buy Elsevier next
November 14, 2024 at 8:28 PM
Most #generativeAI models were trained on Common Crawl, a massive archive of web crawl data. Yet most people have never heard of it. My new research studies Common Crawl in depth and highlights its influence on LLM research and development foundation.mozilla.org/en/research/... (1/10)
Training Data for the Price of a Sandwich: Common Crawl’s Impact on Generative AI
Mozilla research finds that Common Crawl's outsized role in the generative AI boom has improved transparency and competition, but is also contributing to biased and opaque generative AI models.
foundation.mozilla.org
February 6, 2024 at 4:01 PM
Reposted by Stefan Baack
www.elgaronline.com
November 21, 2023 at 7:47 PM
Generative AI is shaped by the values, practices, and objectives of people. I wrote an explainer to help demystify the tech and focus on questions of accountability https://foundation.mozilla.org/en/blog/the-human-decisions-that-shape-generative-ai-who-is-accountable-for-what/
August 10, 2023 at 1:33 PM