🌟Answer: We found compute-optimal crossover points for every model size.
Rough rule of thumb: finetune if your compute budget C < 10^10 × N^1.54; otherwise pretrain.
8/
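As a minimal sketch of the rule of thumb above (constants are from the thread; treating the budget as FLOPs and the exact parameter count `n_params` are my assumptions):

```python
def should_pretrain(compute_budget, n_params):
    """Rule of thumb from the thread: finetune from a checkpoint
    while C < 10^10 * N^1.54; pretrain from scratch otherwise."""
    crossover = 1e10 * n_params ** 1.54
    return compute_budget >= crossover

# For a 1B-parameter model the crossover sits around 7e23 FLOPs,
# so a 1e22 budget falls on the finetuning side.
```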
🌟Answer: We derived closed-form equations! To go from K to 4K languages while maintaining performance: scale data by 2.74×, model size by 1.4×.
6/
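The closed-form rule above can be sketched as follows; compounding the per-4× factors for larger jumps is my assumption, not something the thread claims:

```python
import math

def scale_for_languages(data_tokens, n_params, lang_factor=4):
    """Per the thread: each 4x increase in languages needs ~2.74x
    data and ~1.4x model size to maintain performance."""
    steps = math.log(lang_factor, 4)  # number of 4x jumps
    return data_tokens * 2.74 ** steps, n_params * 1.4 ** steps
```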
Languages sharing writing systems (e.g., Latin) show dramatically better transfer (mean: -0.23) vs different scripts (mean: -0.39).
Also important: transfer is often asymmetric—A helping B ≠ B helping A.
5/
🌟Answer: We measure this empirically. We built a 38×38 transfer matrix, or 1,444 language pairs—the largest such resource to date.
We highlight the top 5 most beneficial source languages for each target language.
4/
Without modeling transfer, existing laws fail on multilingual settings.
3/
🌟Answer: Yes! ATLAS outperforms prior work with R²(N)=0.88 vs 0.68, and R²(M)=0.82 vs 0.69 for mixture generalization.
2/
🌍 Is scaling diff by lang?
🧙‍♂️ Can we model the curse of multilinguality?
⚖️ Pretrain vs finetune from checkpoint?
🔀 X-lingual transfer scores across langs?
1/🧵
BigGen Bench introduces fine-grained, scalable, & human-aligned evaluations:
📈 77 hard, diverse tasks
🛠️ 765 exs w/ ex-specific rubrics
📋 More human-aligned than previous rubrics
🌍 10 languages, by native speakers
1/
Empirically, it shows:
1️⃣ Soaring synthetic text data: ~10M tokens (pre-2018) to 100B+ (2024).
2️⃣ YouTube is now 70%+ of speech/video data but could block third-party collection.
3️⃣ <0.2% of data from Africa/South America.
1/
For transferable AI flaws (e.g., jailbreaks affecting multiple systems), we need to inform all relevant developers and stakeholders who must act to mitigate these issues.
See the Figure to understand the before and after of flaw disclosure.
6/
Security-only or invite-only bug bounties from OpenAI and Anthropic are a great start.
But eventually we need disclosure programs to cover the full range of AI issues, and protect independent researchers.
5/
1️⃣ adoption of standardized AI flaw reports, to improve flaw reproducibility, triaging, coordination across stakeholders, and ultimately AI safety.
4/
Today, GPAI serves 300M+ users globally, w/ diverse & unforeseen uses across modalities and languages.
➡️ We need third-party evaluation for its broad expertise, participation, and independence, including from real users, academic researchers, white-hat hackers, and journalists.
2/
Our new paper, “In House Evaluation is Not Enough,” makes 3 calls to action to empower evaluators:
1️⃣ Standardized AI flaw reports
2️⃣ AI flaw disclosure programs + safe harbors.
3️⃣ A coordination center for transferable AI flaws.
1/🧵
We’re presenting the state of transparency, tooling, and policy, from the Foundation Model Transparency Index, Factsheets, and the EU AI Act to new frameworks like @MLCommons’ Croissant.
1/
➡️ why is copyright an issue for AI?
➡️ what is fair use?
➡️ why are memorization and generation important?
➡️ how does it impact the AI data supply / web crawling?
🧵
Increasingly, they block or charge all non-human traffic, not just AI crawlers.
3/
It covers:
- data sourcing
- documentation
- environmental impact
- risk eval
- model release & licensing
- & more
This isn’t the first time OpenAI has accused a Chinese company of breaking its Terms and training on ChatGPT outputs.
Dec 2023: They suspended ByteDance’s accounts.
1/
➡️ Open releases (BLOOMZ, Llama 2, SD2) are more transparent than closed releases
➡️ Significant room for improvement in Downstream Usage Policy, Feedback, and Impact
2/
1️⃣ All 10 score poorly, particularly Data, Labor, Compute
2️⃣ Transparency is possible! 82/100 indicators are satisfied by at least 1 developer
3️⃣ Transparency is a precondition for informed & responsible AI policy.
1/
-> Join us on Dec 15 in New Orleans
-> Submit by Oct 1
-> See speaker lineup https://an-instructive-workshop.github.io
Stay tuned for updates!