Ben Kirwin
@bkirwi.bsky.social
just setting up my twttr
congrats! last time we caught up you were i think just acquiring a much smaller electric boat... cool to hear you've been Scaling Up. is the cat in the water already?
May 3, 2025 at 9:52 PM
afaict you either need to argue that i've infringed by producing a copy of an article that i've never seen; or that the model creator infringed, and the model does "contain" a copy of the article in some sense, even though the model is definitely not "just" a copy of those inputs...
April 7, 2025 at 3:32 AM
suppose: the nyt example was for an open-weights model like llama; i get the model and recover an nyt article from it, like they demonstrated in court. i now have an illegal copy; where's the copy from?
April 7, 2025 at 3:30 AM
sure, happy to leave it here, and ultimately this is something a judge will decide as you say! but i will drop a last thought at the end here anyways since i already typed it up...
April 7, 2025 at 3:27 AM
sorry if i'm being pedantic! but this kind of hair-split is the sort of thing the law cares about and i think the article is a little fuzzy on... 😅
April 7, 2025 at 2:59 AM
is it? in a section with a summary like "it’s still critical that training not involve copying", it seems relevant that quite a bit of copying happens in practice, and that it's hard to prevent.
April 7, 2025 at 2:54 AM
for sure, but "my system only copies a small percentage of (the ~entire internet)" and "i wish my system did not copy data so often" are not arguments that copying is not happening...
April 6, 2025 at 12:57 AM
good news! they shared the prompts: nytco-assets.nytimes.com/2023/12/Laws...
April 5, 2025 at 8:33 PM
and if they do it in public it can be copyright infringement!
April 5, 2025 at 5:12 PM
and i don't find the article's treatment of this super convincing: it agrees that all models do this, says it's bad, and then ignores it in the conclusions...
April 5, 2025 at 5:11 PM
i was happy to see you share this article; i think it's more right than most things written on this topic! but eg. when the nyt can get a model to spit out its articles nearly word for word, i think there's a pretty clear argument that a copy has been made and distributed...
April 5, 2025 at 4:46 PM
for ~all common models, it's quite easy to get an llm to spit out portions of its training data verbatim... hard to argue that distributing those models is not distributing that data in a legal sense!
April 5, 2025 at 4:36 PM
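(Not part of the thread, but a minimal sketch of the kind of verbatim check described in the post above, assuming the Hugging Face transformers library and an open-weights model; the model name and article prefix are placeholders, not details from the posts or the NYT filing.)

# Sketch: probe whether an open-weights model continues a known text verbatim.
# Placeholder model and prefix; the NYT exhibits used long article prefixes as prompts.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # hypothetical choice of open-weights model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prefix = "The opening sentences of a known article go here..."  # placeholder text
inputs = tokenizer(prefix, return_tensors="pt")

# Greedy decoding: memorized passages tend to surface without sampling noise.
output_ids = model.generate(**inputs, max_new_tokens=200, do_sample=False)
continuation = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:])

# Compare the generated continuation against the real article to measure verbatim overlap.
print(continuation)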
thanks for sharing! read the vulnerability report from citizenlab... looks like the issue was in the keyboard, and citizenlab still recommend using signal. (with all the security settings turned on!)
March 8, 2025 at 5:07 PM
oh hey congrats! i remember you were taking another swing at this - glad to see it over the line
November 18, 2024 at 3:30 AM