We built one that's hard & trustworthy:
👉 AstaBench tests agents w/ *standardized tools* on 2400+ scientific research problems
👉 SOTA results across 22 agent *classes*
👉 AgentBaselines agents suite
🆕 arxiv.org/abs/2510.21652
🧵👇
Searching for relevant work is a multi-step process that requires iteration. Paper Finder mimics this workflow — and helps researchers find more papers than ever 🔍
🗂️ You can now sign in via Google to save your query history across devices and browsers.
📚 We added 108M+ paper abstracts to our corpus - expect to get even better responses!
More below…
Meet Ai2 ScholarQA, an experimental solution that allows you to ask questions that require multiple scientific papers to answer. It gives more in-depth and contextual answers with table comparisons and expandable sections 💡
Try it now: scholarqa.allen.ai
Their experiments:
(1) They sample 128 responses from an LLM for a given prompt: top-p = 0.9, temperature = 0.7, max length of 512, and 4-shot in-context examples
1/n
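A rough sketch of what that decoding setup means: temperature-scaled softmax, then nucleus (top-p) truncation before sampling. The logits here are made up for illustration; a real run would take them from the model at each step:

```python
import math
import random

def top_p_sample(logits, p=0.9, temperature=0.7, rng=random):
    # Temperature-scale the logits, then softmax (subtracting the max for stability).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of top tokens whose cumulative mass reaches p
    # (the "nucleus"), then sample from it with renormalized weights.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cum = [], 0.0
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum >= p:
            break
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]
```

With a peaked distribution the nucleus can collapse to a single token, so low temperature plus top-p can make the 128 samples much less diverse than the raw counts suggest.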
neurips.cc/virtual/2024...
Person: fuck this I'm going to Linux
Narrator: and they quickly learned to hate two operating systems.
I really like this paper. They study whether LLMs do reasonable things like ask follow-up questions and acknowledge what the users are saying. The answer is "not really".
overparameterized deep networks -> WD enhances the implicit regularization of SGD
underparameterized models trained with nearly online SGD -> WD balances bias/variance and lowers training loss.
#mlsky
arxiv.org/abs/2310.04415
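The weight-decay-pulls-weights-down effect shows up even in a one-parameter toy: with decoupled WD in the SGD update (shrink the weight, then take the gradient step), the fit lands below the plain least-squares slope. A minimal sketch; the data and hyperparameters are invented for illustration:

```python
def train(weight_decay, steps=200, lr=0.1):
    # One-parameter least squares: loss = mean((w*x - y)^2) over toy data.
    data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]
    w = 0.0
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        # Decoupled weight decay: shrink w toward 0, then apply the gradient step.
        w = (1 - lr * weight_decay) * w - lr * grad
    return w
```

Without WD this converges to the least-squares slope sum(xy)/sum(x^2); with WD it converges to a strictly smaller weight. The toy obviously can't show the paper's two regimes, only the basic shrinkage mechanism they build on.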
sinking way too much time into defining what qualifies as NLP HCI accounts.
realizing I’m the sole annotator for my own annotation task 😵
Turning all languages off fixes the issue.
Interested in language models of science, evaluating AI-generated text, challenging retrieval settings, and human-AI collaborative reading/writing?
Come work with meeee! 😸
Learn more: kyleclo.github.io/mentorship
(eg adding trees to an RF has marginal impact on accuracy, feature selection gets worse past the optimal number of features, &c)
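The diminishing-returns point about adding trees is basically averaging of independent noise: ensemble RMSE falls like 1/sqrt(n), so doubling an already-large forest barely moves accuracy. A toy simulation with pure-noise "trees" (all numbers invented, not a real RF):

```python
import random

def ensemble_error(n_trees, trials=2000, seed=0):
    # Toy model: each "tree" predicts the true value 0.0 plus i.i.d. Gaussian
    # noise; the ensemble averages the trees. RMSE of the average falls like
    # 1/sqrt(n_trees), so going from 100 to 200 trees barely helps.
    rng = random.Random(seed)
    sq = 0.0
    for _ in range(trials):
        pred = sum(rng.gauss(0.0, 1.0) for _ in range(n_trees)) / n_trees
        sq += pred * pred
    return (sq / trials) ** 0.5
```

Real trees are correlated, so actual forests plateau even sooner than this i.i.d. picture suggests.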
Do GBDTs work well in high dimensions or for text data?
bsky.app/profile/did:...
Tag your posts with #mlsky
I'll probably also add some ML jargon keywords later.