Rachel Wicks
@rewicks.bsky.social
PhD student @jhuclsp

I work on multilingual data for training and evaluation.

rewicks.github.io
Could you give an example of the input/output you're looking for, and for which function call (encode, tokenize, etc.)? And maybe which tokenizer it's inheriting from 😅 (it looks like the OPT models may inherit from GPT2Tokenizer?)
November 26, 2024 at 7:32 PM
Happy to talk about any of these topics and more!

I will also likely end up talking a lot about my pride and joy (my dog).
November 20, 2024 at 12:04 AM
And if you think sentence-level machine translation is good enough, I encourage you to run your systems on our evaluation data (ctxpro, an extension to ContraPro and other similar evaluation datasets).

github.com/rewicks/ctxpro
GitHub - rewicks/ctxpro: Data and annotation toolkit for finding translation ambiguities in bitext
November 20, 2024 at 12:04 AM
Most recently, I've released the ParaDocs dataset, which reconstructs document annotations on large, parallel machine translation datasets. Contextual information is integral to machine translation, but often overlooked!

Data: huggingface.co/datasets/jhu...
jhu-clsp/paradocs · Datasets at Hugging Face
November 20, 2024 at 12:04 AM