Lightnews — Scholar-powered news

Reposted by Antonis Anastasopoulos

Antonios Dimakis

@antoniosdimakis.bsky.social

Proud to work with John Pavlopoulos and @antonisa.bsky.social on this publication!

Check out the data and code here: github.com/andhmak/rule...

4/4

GitHub - andhmak/rule_dialnorm: Code and datasets associated with the paper titled "Dialect Normalization using Large Language Models and Morphological Rules"

github.com

July 25, 2025 at 5:52 PM

Antonis Anastasopoulos

@antonisa.bsky.social

Usually on an ipad these days...
I think I just hate having to write notes in the middle of a line in the 1-col papers, so it's probably about how close the space is to the text as opposed to how abundant it is

February 20, 2025 at 4:07 PM

Antonis Anastasopoulos

@antonisa.bsky.social

I find the 2-col format easier for reviewing/note-taking/suggesting edits, because the info is spread out vertically and I have more margin space for notes closer to the actual text.

But for just reading, agreed, we should just produce dynamic pubs that people can customize to their preferences.

February 19, 2025 at 5:18 PM

Antonis Anastasopoulos

@antonisa.bsky.social

Another point that supports this argument: more than 50% of the "facts" available in Wikipedia/Wikidata are available or retrievable in only a _single_ language. The observation is hidden somewhere in this paper:
aclanthology.org/2020.emnlp-m...

aclanthology.org

December 5, 2024 at 3:34 PM

Antonis Anastasopoulos

@antonisa.bsky.social

Hm, I think I have a couple of examples.
The first is the linked PNAS paper, titled "Language extinction triggers the loss of unique medicinal knowledge" www.pnas.org/doi/epdf/10....

December 5, 2024 at 3:34 PM

Antonis Anastasopoulos

@antonisa.bsky.social

In the above examples, some people have a problem, a computer scientist steps in to help produce a solution, and they write a paper about it so that if anyone else has a similar problem in the future, there's a guide to solving it. How's that not enough of a contribution?

December 5, 2024 at 4:09 AM

Antonis Anastasopoulos

@antonisa.bsky.social

next thing you know, you realize that you need, and you start building, a simplification dataset for the contact language (let's pick Rioplatense Spanish for this example), or building NER tools that can handle the specific regional orthographic variations of Cypriot Greek.

December 5, 2024 at 4:09 AM

Antonis Anastasopoulos

@antonisa.bsky.social

Or they might result from the specific needs of a scientific (or not) team. Hypothetical example: a sociologist teams up with a meteorologist and a computer scientist to figure out how to best convey changing climate threats to an indigenous community, and ...

December 5, 2024 at 4:09 AM

Antonis Anastasopoulos

@antonisa.bsky.social

Or the leaders of a different community might actually want an LLM to ensure their language has the same perceived prestige and tool access as a more dominant language that might be threatening theirs.

December 5, 2024 at 4:09 AM

Antonis Anastasopoulos

@antonisa.bsky.social

A lot of the "narrow"-focus datasets on otherwise underserved languages might be the result of the specific needs of the community: a community might not need an LLM, but they might need a morphosyntactic analyser that they can deploy in a classroom to teach their language.

December 5, 2024 at 4:09 AM

Antonis Anastasopoulos

@antonisa.bsky.social

Sure, if your space of scientific questions is only "how can I train a model to do X?", then the slight variation of "how can I train a model to do X in language Y?" is not too interesting in and of itself (although there might be arguments just for that, see above).

December 5, 2024 at 4:09 AM

Antonis Anastasopoulos

@antonisa.bsky.social

The extent to which we understand them, in our current setting, is measured by the datasets that you complain about.
Yes, we might _believe_ that model X will be able to perform task Y in some language Z and context W, but we don't _know_, not until we actually try (and often find "...Not quite").

December 5, 2024 at 4:09 AM

Antonis Anastasopoulos

@antonisa.bsky.social

Human beings have come up with 7000+ ways to communicate. And each of these modes encodes unique sociocultural, historical, (and potentially more types of *al) information. So being able to understand (or have machines understand) them ensures that we don't lose part of our collective knowledge.

December 5, 2024 at 4:09 AM

Antonis Anastasopoulos

@antonisa.bsky.social

Late to the party, but after reading most subthreads, I'll bite.
I think your whole premise suggests a very narrow view of what counts as science or a scientific contribution.

December 5, 2024 at 4:09 AM