Krithika Ramesh
@stolenpyjak.bsky.social
(she/her)
¯⁠\⁠_⁠(⁠ツ⁠)⁠_⁠/⁠¯
PhD student @jhuclsp | Prev @IndiaMSR
Catch @zihaozhao.bsky.social at today’s poster session (10:30–12) where he'll be presenting SynthTextEval! Stop by if you're interested in synthetic text for high-stakes domains. Zihao also has another EMNLP paper on private text generation, for people interested in this space!
@jhuclsp.bsky.social
November 7, 2025 at 12:55 AM
🚀 SynthTextEval, our open-source toolkit for generating and evaluating synthetic text data for high-stakes domains, will be featured at EMNLP 2025 as a system demonstration!

GitHub: github.com/kr-ramesh/sy...
Paper 📝: aclanthology.org/2025.emnlp-d...

#EMNLP2025 #EMNLP #SyntheticData
GitHub - kr-ramesh/synthtexteval: SynthTextEval: A Toolkit for Generating and Evaluating Synthetic Data Across Domains (EMNLP 2025 System Demonstration)
November 7, 2025 at 12:53 AM
Reposted by Krithika Ramesh
Take a look at this EMNLP 2025 paper by @zihaozhao.bsky.social, which proposes novel methods for generating high-utility, privacy-preserving synthetic text!
🚀 Text anonymization is hard; DP often hurts utility.
We use entity-aware control codes + either ICL (with bad-token blocking) or prefix-tuning w/ masking to get strong privacy–utility tradeoffs on legal & clinical data, outperforming DP-SGD in practice (EMNLP 2025).
www.arxiv.org/abs/2509.25729
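The bad-token blocking mentioned above can be sketched generically as logit masking at decode time: before choosing each token, force the scores of disallowed tokens (e.g. ones that could reproduce private entities) to negative infinity so they can never be emitted. A minimal, library-free illustration — the vocabulary, scores, and blocked set here are invented for the example, and the paper's actual implementation may differ:

```python
import math

def blocked_greedy_decode(logits_steps, blocked_ids):
    """Greedy decoding with bad-token blocking: blocked token ids
    can never be emitted because their scores are forced to -inf."""
    output = []
    for logits in logits_steps:
        masked = [(-math.inf if i in blocked_ids else score)
                  for i, score in enumerate(logits)]
        # pick the highest-scoring *allowed* token at this step
        output.append(max(range(len(masked)), key=masked.__getitem__))
    return output

# Toy example: 3 decode steps over a 4-token vocabulary.
# Token 2 (say, a private name) is blocked even when it scores highest.
steps = [
    [0.1, 0.5, 2.0, 0.3],   # token 2 would win unmasked
    [1.2, 0.4, 0.9, 0.1],
    [0.2, 0.1, 3.0, 1.5],   # token 2 would win unmasked
]
print(blocked_greedy_decode(steps, blocked_ids={2}))  # [1, 0, 3]
```

The same masking idea applies unchanged under sampling instead of greedy decoding, since a -inf score gets zero probability after softmax.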
October 16, 2025 at 2:39 AM
‼️‼️
🔈When LLMs solve tasks with a mid-to-low resource input or target language, their output quality is poor. We know that. But can we put our finger on what breaks inside the LLM? We introduce the 💥 translation barrier hypothesis 💥 for failed multilingual generation with LLMs. arxiv.org/abs/2506.22724
July 8, 2025 at 4:04 PM
Reposted by Krithika Ramesh
This hypothesis says that 1) Multilingual generation uses a model-internal task-solving→translation cascade. 2) Failure of the translation stage *despite task-solving success* is a large part of the problem. That is, the model often solves the task but fails to articulate the answer.
July 4, 2025 at 5:05 PM
⁉️
What happens when an LLM is asked to use information that contradicts its knowledge? We explore knowledge conflict in a new preprint📑
TLDR: Performance drops, and this could affect the overall performance of LLMs in model-based evaluation.📑🧵⬇️ 1/8
#NLProc #LLM #AIResearch
What Is Seen Cannot Be Unseen: The Disruptive Effect of Knowledge Conflict on Large Language Models
Large language models frequently rely on both contextual input and parametric knowledge to perform tasks. However, these sources can come into conflict, especially when retrieved documents contradict…
June 18, 2025 at 2:09 AM
Reposted by Krithika Ramesh
We know that speech LID systems flunk on accented speech. But why? And what can we do about it? 🤔
Our work arxiv.org/abs/2506.00628 (Interspeech '25) finds that *accent-language confusion* is an important culprit, ties it to the length of the features the model relies on, and proposes a fix.
June 7, 2025 at 5:27 PM
Reposted by Krithika Ramesh
Go find new linguistic changes, compare corpora, and invent!
huggingface.co/Hplm
arxiv.org/abs/2504.05523
Hplm (Historical Perspectival LM)
Org profile for Historical Perspectival LM on Hugging Face, the AI community building the future.
April 15, 2025 at 12:45 PM
Reposted by Krithika Ramesh
Historical analysis is a good example, as historical periods can get lost in blended information from different eras. Finetuning large models isn't enough: they "leak" future/modern concepts, making historical analysis impossible. Did you know cars existed in the 1800s? 🤦
April 15, 2025 at 12:45 PM
Reposted by Krithika Ramesh
arxiv.org/abs/2504.05523

Typical Large Language Models (LLMs) are trained on massive, mixed datasets, so the model's behaviour can't be linked to a specific subset of the pretraining data. Or in our case, to time eras.
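One generic way to make model behaviour attributable to a time era is to pretrain a separate model per era, each seeing only text from its own period. A minimal sketch of the corpus-partitioning step — the field names, cutoff years, and example documents are hypothetical, and the paper's actual pipeline may differ:

```python
from collections import defaultdict

def partition_by_era(docs, era_bounds):
    """Bucket dated documents into disjoint eras so each model is
    pretrained only on text from its own period.
    era_bounds: list of (start_year, end_year) tuples, inclusive."""
    eras = defaultdict(list)
    for doc in docs:
        for start, end in era_bounds:
            if start <= doc["year"] <= end:
                eras[(start, end)].append(doc["text"])
                break  # each document belongs to exactly one era
    return dict(eras)

# Hypothetical corpus with publication years.
docs = [
    {"year": 1810, "text": "a carriage ride"},
    {"year": 1885, "text": "the telegraph office"},
    {"year": 1905, "text": "a motor car"},
]
eras = partition_by_era(docs, [(1800, 1849), (1850, 1899), (1900, 1949)])
print(sorted(eras))  # [(1800, 1849), (1850, 1899), (1900, 1949)]
```

Because the partitions are disjoint, a model trained on the 1800–1849 bucket cannot "leak" later concepts like motor cars into its generations.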
Pretraining Language Models for Diachronic Linguistic Change Discovery
Large language models (LLMs) have shown potential as tools for scientific discovery. This has engendered growing interest in their use in humanistic disciplines, such as historical linguistics and lit...
April 15, 2025 at 12:45 PM
Reposted by Krithika Ramesh
How should the humanities leverage LLMs?
▶️Domain-specific pretraining!

Pretraining a model can itself be a research tool: it's cheaper than LoRA and allows studying
💠grammatical change
💠emergent word senses
💠who knows what more…

Train on your data with our pipeline or use ours!
#AI #LLM 🤖📈
April 15, 2025 at 12:45 PM
Reposted by Krithika Ramesh
Dialects lie on continua of (structured) linguistic variation, right? And we can’t collect data for every point on the continuum...🤔
📢 Check out DialUp, a technique to make your MT model robust to the dialect continua of its training languages, including unseen dialects.
arxiv.org/abs/2501.16581
February 27, 2025 at 2:44 AM
Reposted by Krithika Ramesh
📢 Want to host MASC 2025?

The 12th Mid-Atlantic Student Colloquium is a one-day event bringing together students, faculty, and researchers from universities and industry in the Mid-Atlantic.

Please submit this very short form if you are interested in hosting! Deadline January 6th. #MASC2025
December 16, 2024 at 9:19 PM
Reposted by Krithika Ramesh
📢 It's PhD admissions season! 🎓

The PhD admissions process is stressful! 😅

Want a behind-the-scenes look at the process? 👀✨ You have questions, we have answers. 📝🤝

Watch my Admissions AMA for @jhuclsp.

https://youtu.be/YlwpIPFNXjo?si=O7n5QwGT5sQdpg7u
December 1, 2024 at 11:02 PM
Reposted by Krithika Ramesh
I'm super excited about this program and happy to connect if you're interested in working with me through it!
Postdoc opportunities! The Johns Hopkins Data Science and AI Institute has a new postdoc program!

We’re looking for candidates across data science and AI, including science, health, medicine, the humanities, engineering, policy, and ethics.

Spread the word and apply!

ai.jhu.edu/postdoctoral...
Postdoctoral Fellowship Program - Johns Hopkins Data Science and AI Institute
Data Science and AI Institute Postdoctoral Fellowship Program The Johns Hopkins Data Science and AI Institute welcomes applications for its postdoctoral fellowship program, seeking scholars to advance...
November 20, 2024 at 7:28 PM
Reposted by Krithika Ramesh
Putting together a JHU Center for Language and Speech Processing starter pack!

Please reply or DM me if you're doing research at CLSP and would like to be added - I'm still trying to find out which of us are on here so far.

go.bsky.app/JtWKca2
CLSP
November 19, 2024 at 3:37 PM