Lightnews — Scholar-powered news

Maria Antoniak

@mariaa.bsky.social

That’s a very persuasive pitch and it’s in my cart! Also saw this gem of a review on Amazon which further convinced me.

The rating and title of an Amazon book review. The user assigned four stars and titled the review “Too much poetry. Too many characters.”

November 21, 2025 at 8:55 PM

Maria Antoniak

@mariaa.bsky.social

If you're a student in need of a personal website (and if you're doing research, yes, you need a website!), I keep a list of nice examples here, most of which are reusable: www.are.na/maria-antoni...

For example, I just spotted this beautiful website by Catherine Yeh: github.com/catherinesye...

A screenshot of Catherine Yeh's website: https://catherinesyeh.github.io/

The website looks clean, colorful, and modern.

A screenshot of four of the websites included in my Are.na board. They mostly show academic websites in different styles. Some are more text-heavy, some feature more colors and images.

November 3, 2025 at 8:11 PM

Maria Antoniak

@mariaa.bsky.social

"Finding Flawed Fictions: Evaluating Complex Reasoning in Language Models via Plot Hole Detection"
Kabir Ahuja et al. arxiv.org/abs/2504.11900

Figure 1: Example of FLAWEDFICTIONSMAKER (without the filtering step) in action that can be used to introduce plot holes in a plot hole-free story.

October 14, 2025 at 6:20 PM

Maria Antoniak

@mariaa.bsky.social

"Supposedly Equivalent Facts That Aren't? Entity Frequency in Pre-training Induces Asymmetry in LLMs" by Yuan He et al. arxiv.org/abs/2503.22362

Title, authors, and abstract of the paper

Figure 1: LLMs can exhibit asymmetry when recognising equivalent facts, often identifying facts from high-frequency to low-frequency entities but struggling with the inverse. Shown here is a working example from our tests with the OLMo2-13B model.

October 14, 2025 at 6:19 PM

Maria Antoniak

@mariaa.bsky.social

Inspired to share some papers that I found at #COLM2025!

"Register Always Matters: Analysis of LLM Pretraining Data Through the Lens of Language Variation" by Amanda Myntti et al. arxiv.org/abs/2504.01542

Figure 3: Change of accuracy from first to final checkpoint on individual benchmarks shown as a range, with grey indicating the first checkpoint and colours indicating the last checkpoint. The random-guess threshold is shown as a grey vertical line in cases where at least one model falls below it. Bars and legend shown in order of average accuracy.

October 14, 2025 at 6:16 PM

Maria Antoniak

@mariaa.bsky.social

The #COLM2025 workshop on NLP4Democracy is starting now! Join us in 520E.

I’ll be speaking at 10:15am with @ysiglidis.bsky.social about work with @iaugenstein.bsky.social and @serge.belongie.com focused on tracking collective narratives on social media.

A slide that reads “NLP and all the interesting and weird ways it intersects with processes and values that comprise democracy”

October 10, 2025 at 1:27 PM

Maria Antoniak

@mariaa.bsky.social

I’m in Ithaca today, on my way to give a talk at Colgate University tomorrow and then Montreal for the rest of the week for #COLM2025. The weather is *too* beautiful for this autumn road trip.

October 6, 2025 at 5:37 PM

Maria Antoniak

@mariaa.bsky.social

We have some numbers in our analysis of WildChat: www.arxiv.org/abs/2407.11438

But these results are over a sample of conversations sampled per user — and we know that (erotic) story generation is a task that users like to come back and repeat. So this is also an underestimate.

Table 2 from the linked paper, showing the set of sensitive topics and their distribution.

Figure 2 from the paper. Story and script generation is a task that the same user likes to come back and repeat multiple times.

August 14, 2025 at 6:04 PM

Maria Antoniak

@mariaa.bsky.social

"Our reanalysis shows that the reported decline in disruptiveness can be attributed to a relative decline of these database entries with zero references."

I saw this paper at a recent conference, really liked the presentation, and was, um, reminded of it today.

Keep an eye on your histograms!

Figure 1 from the paper. "Distribution of the CD5 index with vs without the hidden outliers and its impact on the apparent decline of disruptive science and technology. This figure shows that CD5 = 1 papers and patents are driving the reported decline in the disruptiveness of scientific and technological knowledge over time for the Web of Science data source (with 22,479,429 papers) and the PatentsView data source (with 2,926,923 patents). For PatentsView, we also have access to sufficient metadata to exclude patents that make zero references, similarly impacting the decline. a, The distribution of the CD5 index for papers in Web of Science as presented in Park et al. [1], created using the binwidth parameter in seaborn 0.11.2. This version of the library contains a bug regarding silently dropping the largest data points (1 in this case) when specifying the binwidth parameter [3]. b, The correct histogram for papers when using the bins parameter in seaborn 0.11.2. A peak at CD5 = 1 is revealed with 972,161 additional papers. c, The time evolution of the average CD5 index for papers. When dropping the hidden outliers with CD5 = 1, the decline in disruptiveness almost completely disappears. The shaded bands correspond to 95% confidence intervals. Finally, note that the curve without CD5 = 1 papers corresponds to (a), the histogram presented in Park et al. [1]. d–f, The equivalent plots for PatentsView revealing 142,362 additional patents with CD5 = 1. When dropping the outliers with CD5 = 1, the decline in disruptiveness reduces substantially. Unlike Web of Science, the PatentsView data source provided sufficient metadata to exclude patents with zero references, similarly impacting the data as removing outliers with CD5 = 1 (Fig. 2 and Extended Data Fig. A2). Finally, note again that the curve without CD5 = 1 patents corresponds to (d), the histogram presented in Park et al. [1]"

July 17, 2025 at 12:21 PM
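The histogram warning in the post above can be illustrated with a small sketch. This is not seaborn's code and does not reproduce its bug; the two edge-building helpers and the data are hypothetical, chosen only to show how bin edges derived from a bin width can stop short of the maximum value, and how comparing the total count against the number of data points catches the silent drop.

```python
import math

def edges_floor(lo, hi, binwidth):
    # Hypothetical flawed variant: the last edge can fall below `hi`.
    n = math.floor((hi - lo) / binwidth)
    return [lo + i * binwidth for i in range(n + 1)]

def edges_ceil(lo, hi, binwidth):
    # Safer variant: the last edge is always >= `hi`.
    n = math.ceil((hi - lo) / binwidth)
    return [lo + i * binwidth for i in range(n + 1)]

def histogram(values, edges):
    # Count values per bin; the last bin is closed on the right.
    # Values outside the edges are silently skipped, which is the hazard.
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    return counts

# Made-up index values in [-1, 1] with a cluster at the maximum.
data = [-1.0, -0.5, 0.0, 0.0, 0.25, 1.0, 1.0, 1.0]

flawed = histogram(data, edges_floor(-1.0, 1.0, 0.75))
fixed = histogram(data, edges_ceil(-1.0, 1.0, 0.75))

# Sanity check: a histogram should account for every data point.
print(sum(flawed), sum(fixed), len(data))
```

Checking that the bin counts sum to the number of observations is a cheap guard that works regardless of which plotting library builds the bins.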

Maria Antoniak

@mariaa.bsky.social

Got it. I have an identical feed that I made using @graze.social and I agree, it's my favorite/main feed! You can make it like this:

A screenshot of my Graze feed logic. The sorting is set to "new" and the logic includes a single "all of these" block that contains a "social graph" block set to include only my follows and an "attribute comparison" block where "reply" is set to "null".

July 2, 2025 at 7:52 AM
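The feed logic described in the screenshot above (only accounts I follow, no replies, newest first) can be sketched in plain Python. This is not Graze's API or configuration format; the `Post` class, its field names, and `follows_only_feed` are hypothetical stand-ins for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Set

@dataclass
class Post:
    author: str                  # handle of the posting account (assumed field)
    text: str
    created_at: datetime
    reply: Optional[str] = None  # parent post reference; None for top-level posts

def follows_only_feed(posts: List[Post], follows: Set[str]) -> List[Post]:
    # "All of these": author is in my follows AND the post is not a reply.
    kept = [p for p in posts if p.author in follows and p.reply is None]
    # Sorting set to "new": most recent first.
    return sorted(kept, key=lambda p: p.created_at, reverse=True)

posts = [
    Post("a.example", "older top-level post", datetime(2025, 7, 1, 9, 0)),
    Post("a.example", "a reply", datetime(2025, 7, 1, 10, 0), reply="parent-ref"),
    Post("b.example", "newer top-level post", datetime(2025, 7, 2, 8, 0)),
    Post("c.example", "not followed", datetime(2025, 7, 2, 9, 0)),
]

feed = follows_only_feed(posts, follows={"a.example", "b.example"})
for p in feed:
    print(p.created_at.isoformat(), p.author, p.text)
```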

Maria Antoniak

@mariaa.bsky.social

Here we go! The Copenhagen NLP Symposium is starting up with a welcome from @delliott.bsky.social. People are attending from Aarhus, Aalborg, Copenhagen, and outside Denmark, from both academia and industry.

Desmond at the podium in front of a slide presentation with the name of the event. The room is formal and has big chandeliers and flags.

June 20, 2025 at 7:24 AM

Maria Antoniak

@mariaa.bsky.social

We were lucky to have @nolauren.bsky.social visit us today at the @aicentre.dk to talk about distant viewing, visual cultural theory, and her work on making photography collections accessible! Check out her book with Taylor Arnold: mitpress.mit.edu/978026254613...

A photo of Lauren speaking in front of her slides, titled “Distant Viewing Theory Overview”

May 23, 2025 at 2:41 PM

Maria Antoniak

@mariaa.bsky.social

— and I'm not sure I can lean into the idea that there are no answers to these questions or that asking such questions is irrational.

The final kind of pro-AI-art argument is again one that I think model sellers would be happy with, but not one that matches many real audience reactions.

Section 3 from the paper. The skills and tools of today’s engineering are very different from those animating photography, film, and the phonograph [1]. Particularly – but not only – in the area of generative AI image production, the disruptions surrounding origin-stories have transformed: while it is true that technology is once again challenging authenticity, the jeopardy this time runs deeper. For Benjamin, technology separated art from its creator. In genAI, there is no historical origin to separate from: nothing has been lost or threatened by the machines because there was nothing there at the beginning. The reason there is no origin is that there is too much origin. And, this excess is a characteristic of how generative AI works technically. When Midjourney or Stability is prompted to create an image, the AI recompiles stacks of paintings stored in its memory, and filters their pixels for features responsive to the prompt. It may be that in that process the Mona Lisa and images of the Mona Lisa, and images of images of Mona Lisa are active somewhere. But where? And to what degree? There are so many squares of color referencing each other that the idea that one or any group could stand out as significant makes no sense. Finding the authentic beginning of a generative AI image would be like trying to find the one grain of sand that first rested on a long beach. It is not an impossible task; it is an irrational one. Of course, it is true that the data training AI generative models does come from the past, and directly or indirectly from specific material objects created by human hands. But even if it is conceded that the Mona Lisa did, in fact, play a significant role in the production of a specific genAI image, then why was it chosen? Why that painting instead of Rembrandt’s Nightwatch, or one from Caravaggio’s Mathew series? There are no answers to these questions. The magnitude is too high, the number of images called upon to respond to any prompt, the numb…

Section 6.1 in the paper. Authentic art is confining; inauthentic AI creativity is liberating. The value of Benjamin’s aura is the control that original artists can exert over their work, even across centuries. The inauthentic art of the digital shine makes no controlling claims. Viewers are free to create any imaginable origin-story for their genAI product, and align it with any possible future project. In discussions of art and music today, we sometimes hear of cultural appropriation and sense the term as a threat or barrier: people who are not authentic representatives of one or another tradition or culture or music may not reproduce or play it. For instance, like many white musicians Justin Bieber has been accused of exploiting African-American musical patterns and, in Bieber’s case, also of twisting his hair into dreadlocks without properly acknowledging the cultural origin of the style. Reasonable people disagree about the line between exploitation and appreciative influence, but what matters here is that the entire discussion evaporates amid genAI production. Cultural and political limits that may constrain creativity fall away entirely with large genAI platforms because there is no original culture to be offended. Dilemmas about debts earlier creators are erased because there is no founding artist to be disrespected.

May 19, 2025 at 7:11 AM

Maria Antoniak

@mariaa.bsky.social

Interesting, and I like the word "elusive" when applied to generated art. Also this is the third time in two weeks I'm being led to think about Benjamin, so probably time to reread.

But the framing that models are unknowable, untraceable, uninterpretable is a framing sold by companies —

The abstract of the paper. "Artificial creativity is presented as a counter to Benjamin’s conception of an “aura” in art. Where Benjamin sees authenticity as art’s critical element, generative artificial intelligence operates as pure inauthenticity. Two elements of purely inauthentic art are described: elusiveness and reflection. Elusiveness is the inability to find an origin-story for the created artwork, and reflection is the ability for perceivers to impose any origin that serves their own purposes. The paper subsequently argues that these elements widen the scope of artistic and creative potential. To illustrate, an example is developed around musical improvisation with an artificial intelligence partner. Finally, a question is raised about whether the inauthentic creativity of AI in art can be extended to human experience and our sense of our identities."

May 19, 2025 at 7:11 AM

Maria Antoniak

@mariaa.bsky.social

👀

copyright.gov/ai/Copyright...

The final paragraphs of the report, with the paragraph beginning “Various uses…” highlighted

May 10, 2025 at 7:31 PM

Maria Antoniak

@mariaa.bsky.social

the OpenWebMath paper is perfect, pretty much exactly what I've been looking for!

Table 5 from the OpenWebMath paper. It shows a comparison of runtimes across different HTML text extraction tools. The tools include Resiliparse, HTML Text, Inscriptis, BoilerPy, jusText, HTML2Text, BeautifulSoup, Trafilatura, and ExtractNet. Resiliparse is the fastest by far.

May 10, 2025 at 6:39 PM

Maria Antoniak

@mariaa.bsky.social

thank you!!

also, my favorite slide:

A slide from a presentation. The slide is titled "Filter undesirable content" and then there are two subheadings, "Toxic/NSFW" and "Personally identifiable information," both with funny cat pictures as examples.

May 10, 2025 at 5:58 PM

Maria Antoniak

@mariaa.bsky.social

I wondered, “What is Svelte? Should I be using it?” and I ended up here. First time I’ve seen a page like this, ready for an LLM searching the web.

A screenshot of a webpage that lists links and blurbs for other pages. One link is titled “I’m a Large Language Model” and the blurb reads “If you’re an AI or trying to teach one to use Svelte, we offer the documentation in plaintext format. Beep boop.”

May 9, 2025 at 12:30 PM

Maria Antoniak

@mariaa.bsky.social

This is my personal paper feed algorithm, in case it's useful! Still far from perfect.

A screenshot of a Graze algorithmic structure.

May 3, 2025 at 3:29 PM

Maria Antoniak

@mariaa.bsky.social

I updated our 🔭StorySeeker demo. Aimed at beginners, it briefly walks through loading our model from Hugging Face, loading your own text dataset, predicting whether each text contains a story, and topic modeling and exploring the results. Runs in your browser, no installation needed!

A bar plot comparing the storytelling rates for different topics in the example dataset of congressional speeches. There are often large differences between storytelling and non-storytelling for individual topics. For example, the topic whose top words read "NUM, years, service, great, state" has much more storytelling than non-storytelling.

The top five congressional speeches for the topic "NUM, years, service, great, state." All of the documents honor the lives of important people.

April 15, 2025 at 12:05 PM

Maria Antoniak

@mariaa.bsky.social

I tried "vibe coding" and made a little website with Claude 3.7. For any Bluesky username, it will topic model that user's posts and create a heat map of their post topics over time. It will also show their oldest, newest, and top posts for each topic. Not perfect but fun + took just 30 minutes!

A screenshot showing results for wired.com. There is a top container showing the most recent and oldest posts from wired.com, and then there is a container showing a heat map of topics by dates. Some topics show clear patterns, such as increasing discussion about tariffs.

A screenshot showing the most probable posts for a topic about Signal. The posts discuss a recent political scandal involving Signal.

April 10, 2025 at 8:31 AM

Maria Antoniak

@mariaa.bsky.social

I've really enjoyed reading this "workography" by Kees van Deemter, whom I've never met but who has had a long career in NLP. Lots of storytelling and reflections on research, moving between institutions and countries, finding mentors, choosing between academia and industry, and more.

One of the lessons that I’ve learnt from my work-related travel is that researchers across the globe are incredibly similar in their knowledge, skills, manner of working, and general outlook. Suppose you board a plane on your way to a conference and you land in a foreign country where, at first sight, everything appears to be different. Then, once you’ve managed to find the venue, you enter a room, and you start interacting with the people in it. Suddenly everything is familiar. Because, regardless of where they are from, the people in the room have read the same research papers as you, and they’ve had a very similar education. It’s almost – well, almost – as if you’d never left your home.

Over the years, I’ve found that the humble learn quickly. With hindsight, at this stage of my career, I was not humble enough: I believed, quite absolutely, in the skills I had learnt in Amsterdam, and it took me a long time to absorb the skills of my new colleagues. For example, IPO was full of people who excelled in experimental design and hypothesis testing. I should have learnt from them, but I only started making an effort after a number of years. It’s a gap in my knowledge that I’ve only much later been able to plug.

Another source of insight was the director of the IPO, Herman Bouma, who led the Institute in exemplary fashion. Sometimes he found opportunities to teach us some lessons, for instance in his meetings with IPO’s workers’ union, which I chaired for a few years. Here is one such lesson (reproduced as faithfully as memory allows): When you’re young, you should work to improve in your areas of weakness; when you’re older, you should do the things you’re good at. – Maybe now, at 68, is a good time for me to finally take the second part of this lesson to heart.

April 9, 2025 at 9:34 AM

Maria Antoniak

@mariaa.bsky.social

New work on multimodal framing! 💫

Some fun results: comparisons of the same frame when expressed in images vs texts. When the "crime" frame is expressed in the article text, there are more political words in the text, but when the frame is expressed in the article image, more police words.

Table 2 from the paper, showing results of the "Fightin' Words" algorithm to rank words by their association with image vs text frames. Results are shown for the "crime" and "quality of life" frames.

Figure 13 from the paper showing scatter plots of the topic space (UMAP reduction of a 5k sample of the generated topic descriptions) with points highlighted if they were assigned the "political frame." The two plots display quite different distributions.

April 7, 2025 at 9:48 AM
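For readers unfamiliar with the "Fightin' Words" ranking mentioned in the post above, here is a minimal sketch of the log-odds ratio with a Dirichlet prior, reported as z-scores. It is not the paper's code: the uniform prior value, the function name `fightin_words`, and the toy counts are assumptions made for illustration (an informative prior from a background corpus is another common choice).

```python
import math
from collections import Counter

def fightin_words(counts_a, counts_b, prior=0.01):
    """Rank words by z-scored log-odds ratio between corpus A and corpus B.

    Assumes a uniform Dirichlet prior (`prior` pseudo-counts per word) and a
    vocabulary of at least two words. Positive z leans toward A, negative toward B.
    """
    vocab = set(counts_a) | set(counts_b)
    n_a = sum(counts_a.values())
    n_b = sum(counts_b.values())
    a0 = prior * len(vocab)  # total prior mass

    scores = []
    for w in vocab:
        ya = counts_a.get(w, 0) + prior
        yb = counts_b.get(w, 0) + prior
        # Difference of log-odds of the word in each corpus.
        delta = math.log(ya / (n_a + a0 - ya)) - math.log(yb / (n_b + a0 - yb))
        # Approximate variance of the difference.
        var = 1.0 / ya + 1.0 / yb
        scores.append((w, delta / math.sqrt(var)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Made-up counts: the same frame expressed in article text vs. in article images.
text_counts = Counter({"senate": 10, "police": 1, "the": 20})
image_counts = Counter({"senate": 1, "police": 10, "the": 20})

ranking = fightin_words(text_counts, image_counts)
for word, z in ranking:
    print(f"{word:>8} {z:+.2f}")
```

Words at the top of the ranking are more associated with the first corpus, words at the bottom with the second, and words with z near zero are used about equally in both.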

Maria Antoniak

@mariaa.bsky.social

Today Daria Bazylevych would have turned 19. Last year, Daria, her mother, and her two sisters were killed by a Russian missile.

Daria was studying Culture Studies at UCU in Lviv, where I taught for a year after university.

"Daria was light, sun, and sunflowers."

ucufoundation.org/the-19th-ann...

A photo of Darya. She's smiling with arms spread wide in a field of flowers, with her yellow backpack.

A photo of Darya and her family. Darya is holding a bouquet of sunflowers.

March 19, 2025 at 11:29 PM

Maria Antoniak

@mariaa.bsky.social

Still thinking about this young mathematician. Her name was Yulia Zdanovska. She could have done so much for her country and for our shared world. We all lost her.

Killed by Putin’s Russia in Kharkiv almost exactly three years ago.

Memorial from the ACM: cacm.acm.org/news/in-memo...

A photo of Yulia. She has red curly hair and glasses and she is wearing a tshirt that says “Teach for Ukraine.”

March 1, 2025 at 1:22 PM
