tksiia
tksiia.bsky.social
machine learning and LLMs 🤖🗼 https://takashiishida.github.io
We're excited to announce the launch of Google Developer Group AI for Science Japan!🎉

If you're interested, we’d love to have you join our community.

GDG AI for Science Japan
gdg.community.dev/gdg-ai-for-s...
November 28, 2025 at 6:21 AM
Reposted by tksiia
Reward models do not have the capacity to fully capture human preferences.
If they can't represent human preferences, how can we hope to use them to align a language model?

In our #COLM2025 "Off-Policy Corrected Reward Modeling for RLHF", we investigate this issue 🧵
July 29, 2025 at 10:22 AM
Released bibfixer 🎉 A tiny AI tool that cleans & standardizes your BibTeX files using LLMs + web search.

No more tedious edits like fixing capitalization (ai -> AI), swapping arXiv for the conference version, or expanding "and others" into full author lists. Let bibfixer do the grunt work for you!
GitHub - takashiishida/bibfixer: A Python tool that automatically cleans, completes, and standardizes BibTeX entries using LLMs and web search.
September 29, 2025 at 12:54 PM
Reposted by tksiia
EDINET-Bench: Evaluating LLMs on Complex Financial Tasks using Japanese Financial Statements

Paper: pub.sakana.ai/edinet-bench/

We just released a Japanese financial benchmark designed to evaluate the performance of AI Agents on challenging financial tasks like accounting fraud detection.
June 9, 2025 at 2:02 AM
Excited to announce EDINET-Bench, a financial LLM benchmark built from 40k annual reports in Japan!

It features accounting fraud detection, earnings forecasting, industry classification, and includes our tool edinet2dataset as a foundation for designing new tasks.

Hope researchers find it useful!
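Releasing the Japanese financial benchmark "EDINET-Bench"

Blog: sakana.ai/edinet-bench/
Paper: pub.sakana.ai/edinet-bench/

Using annual securities reports from EDINET, the Financial Services Agency's electronic disclosure system, we built a Japanese financial benchmark to measure how well AI can handle advanced financial tasks.

Our evaluation on EDINET-Bench confirmed that simply applying current LLMs does not reach practical performance on tasks such as accounting fraud detection, while also suggesting that better-designed input information could improve performance.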
June 9, 2025 at 10:12 AM
Reposted by tksiia
Our discussion period just started. Authors, please read our instructions carefully. We require responses by June 2.

But, what you really want to hear about is stats .... right? -> 🧵
May 27, 2025 at 5:41 PM