(a friend sent me this screenshot, I'm married 😅)
The task that most exemplifies our ability to automate knowledge work is “doing your taxes”.
At Column Tax we’re now within line of sight to fully automating taxes. We started the company at the perfect moment, with LLMs just on the horizon.
Glad to hear the founder whisper networks are still sharing this knowledge around.
especially because its knowledge cutoff is still September 2024
but it's not the leader in tax calculation today
(even with maximal test-time compute)
we proved it via our tax calculation benchmark:
Here's what it taught me about B2B2C tax software:
Just kidding :) but I do really recommend getting married to the love of your life with all your friends & family around!
using pass^k (a measure of a model's reliability across multiple runs on the same task), performance degrades with additional runs, meaning models mess up in new & surprising ways when calculating tax returns.
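For anyone who hasn't seen pass^k before, here's a minimal sketch of how it can be computed. The thread doesn't say which estimator the benchmark uses, so this assumes the common combinatorial one: with n runs of a task and c correct runs, pass^k is C(c, k) / C(n, k), averaged over tasks. The function names and the sample data are made up for illustration.

```python
from math import comb


def pass_hat_k(num_runs: int, num_correct: int, k: int) -> float:
    """Chance that k runs drawn without replacement from num_runs are ALL correct.

    Assumed estimator: C(num_correct, k) / C(num_runs, k).
    """
    if not 1 <= k <= num_runs:
        raise ValueError("k must be between 1 and num_runs")
    # comb(c, k) is 0 when k > c, so a single failure drags pass^k down fast.
    return comb(num_correct, k) / comb(num_runs, k)


def benchmark_pass_hat_k(runs_per_task: list[list[bool]], k: int) -> float:
    """Average pass^k over tasks; each inner list holds one task's run outcomes."""
    scores = [pass_hat_k(len(runs), sum(runs), k) for runs in runs_per_task]
    return sum(scores) / len(scores)


# Hypothetical outcomes for two tax returns, four runs each (not real data).
sample = [
    [True, True, False, True],   # correct on 3 of 4 runs
    [True, False, False, True],  # correct on 2 of 4 runs
]
for k in (1, 2, 3, 4):
    print(k, benchmark_pass_hat_k(sample, k))
```

Because every one of the k sampled runs has to be correct, the score can only stay flat or fall as k grows, which is why it reads as a reliability measure rather than a best-of-k measure.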
but not for the best model (Gemini 2.5 Pro), suggesting alternative techniques/scaffolding/orchestration are required to get AI to do this tax calculation task.
1. Misuse tax tables
2. Make calculation errors
For example, models will hallucinate line numbers on Forms or use incorrect eligibility limits.
Even on this simplified data set and allowing the models to output to a simplified format, the best model only calculates 32.35% of returns correctly.
We made the task easy for the models. We provide:
- all of the data (e.g. W-2s) needed to file a return
- the expected output in IRS XML format
75k pages of English text define the transformations required to do this.
Companies like @ColumnTax use deterministic tax engines to do these calculations.
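To make "deterministic" concrete, here's a toy sketch of one such calculation: a progressive bracket lookup done with exact decimal arithmetic. This is not Column Tax's engine. The brackets, rates, rounding rule, and names below are all hypothetical and are not real IRS figures.

```python
from decimal import Decimal, ROUND_HALF_UP

# HYPOTHETICAL brackets (lower bound, marginal rate); NOT real IRS figures.
BRACKETS = [
    (Decimal("0"), Decimal("0.10")),
    (Decimal("10000"), Decimal("0.20")),
    (Decimal("50000"), Decimal("0.30")),
]


def tax_owed(taxable_income: Decimal) -> Decimal:
    """Apply each marginal rate to the slice of income inside its bracket."""
    if taxable_income < 0:
        raise ValueError("taxable income cannot be negative")
    tax = Decimal("0")
    for i, (lower, rate) in enumerate(BRACKETS):
        if taxable_income <= lower:
            break  # no income reaches this bracket or any above it
        # The top bracket has no upper bound.
        upper = BRACKETS[i + 1][0] if i + 1 < len(BRACKETS) else None
        top = taxable_income if upper is None else min(taxable_income, upper)
        tax += (top - lower) * rate
    # Assumed rounding rule for this sketch: nearest whole dollar, half up.
    return tax.quantize(Decimal("1"), rounding=ROUND_HALF_UP)


print(tax_owed(Decimal("60000")))  # 1000 + 8000 + 3000 = 12000
```

The same input always yields the same output, and each rule sits in code that can be tested against the written law, which is the property a sampled model answer doesn't give you.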
We tested the latest frontier models and the results were full of catastrophic errors.
Letting AI do your taxes would mean IRS rejections, audits, and penalties:
So we’ve had to build lots of custom infrastructure to deploy AI agents.
It’s working. We’re now able to build our tax engine much faster than traditional methods: