This morning I replied to someone on LinkedIn who claimed to have saved €3,000 by doing their tax return with ChatGPT. The comments were a festival of applause. I replied because the exact same thing happened to me. Except I had the foresight to double-check.
I fed my data to Claude, one of Anthropic's models, and it produced a tax return that gave me about €3,000 back. The numbers added up. The fields were in the right places. The formatting was impeccable. If I didn't work with AI every day, I'd have filed it without blinking.
But I ran it through Codex — which uses GPT — for a second pair of eyes. And the party was over.
Deductions that didn't apply. Income categorised wrongly. A withholding calculation starting from a wrong premise. It wasn't an obvious disaster: it was fifteen intermediate decisions, each seemingly reasonable, which together turned the outcome into fiction.
Three thousand euros that weren't three thousand euros.
The model hadn't failed in the way we're used to things failing. It didn't throw an error. It didn't freeze. It didn't say "I don't know how to do this". It did something far worse: it handed me a wrong result with the same confidence it would have delivered a correct one.
The kind of error that didn't exist before AI
We've spent decades dealing with technology that fails. And we've developed instincts for spotting it: blue screens, error messages, forms that won't submit, the page that returns a 404. When something goes wrong, the screen tells you.
With artificial intelligence that doesn't work anymore.
When AI gets it wrong, it doesn't change tone. It doesn't add an asterisk. It doesn't lower the volume. It delivers the result with the same poise as when it's right. And you, who've spent your whole life associating "polished presentation" with "trustworthy content", hit the send button.
I call it the silent false success: a result that passes all your "does this look good?" filters because it looks great. It's just wrong.
And it's not a rare case. It's the default failure mode of any language model. They're designed to generate what sounds most likely, not what's most correct. Most of the time those coincide. When they don't, there's absolutely nothing flagging it for you.
I see it every week in my work
I run a digital marketing agency. We use AI for everything from analysing search rankings to drafting proposals. It's a tremendously powerful tool. And we hit false successes constantly.
One example that particularly gets me: you ask the AI to optimise the meta tags on a client's site. It returns titles and descriptions that sound professional, with the right length, calls to action included. Perfect. Until you look closely and three pages are now competing for the same keyword, two target a search intent that isn't theirs, and one includes a promise the client can't deliver on. You've "optimised" the site and left it worse. Without a single warning.
Another: a draft email to an unhappy client. Perfect tone, empathetic, solution-oriented. Except the AI has offered a discount no one authorised and implicitly acknowledged a liability the company didn't have. If you send it as-is, you're handing a legal argument to anyone who wants to come after you.
Or the classic: a blog article that sounds expert and cites a regulatory detail that doesn't exist, or existed but was amended two years ago. You publish it, and three weeks later you discover your "authority content" has false information that Google doesn't forgive — and your readers shouldn't either.
None of these failures are visible at a glance. All of them look like good work. And all of them would have passed the "does this look good?" filter without a problem.
The better it sounds, the less you check it
This is what turns a technical fault into a real problem.
When a junior hands you a report with typos and wonky formatting, you review the whole thing. Your brain switches into alert mode. But when you get an impeccable document, well laid out, well written, you skim it. You assume that if the form is good, the substance is too.
With AI this mental shortcut burns you every time. Because the form is always good. AI never hands you an ugly draft. It never makes a typo. It never leaves a gap where a data point should go. Everything is filled in, everything sounds fluent, everything looks like a finished product.
And there's also time pressure working against you: if the AI took 30 seconds to generate something, your brain doesn't want to spend 30 minutes checking it. It feels disproportionate. But those 30 seconds didn't include any of the filters a professional applies while working: doubting, consulting, reviewing, correcting on the fly. AI gives you the end result without the process. And the process is where the errors get caught.
What we do to avoid falling for it
I'm not going to give you a list of ten commandments. I'll tell you the three things that actually work for us day-to-day.
The first is the most obvious and the one that works best: run it through another model. If Claude generated it, have GPT review it. If GPT generated it, have Claude or Gemini review it. Each model has different biases, and what one approves, the other questions. It's exactly what saved us with the tax return. It isn't infallible, but it raises the bar enormously.
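If you want to make this systematic rather than a copy-paste ritual between two chat windows, it takes a few lines of code. Here's a minimal sketch in Python, assuming the official anthropic and openai SDKs with API keys in the environment; the model names are placeholders, not recommendations:

```python
# Sketch: generate with one model, have a different one review it.
# Assumes ANTHROPIC_API_KEY and OPENAI_API_KEY are set in the environment.
import anthropic
from openai import OpenAI


def generate_draft(task: str) -> str:
    """Have Claude do the actual work."""
    client = anthropic.Anthropic()
    message = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder; use whatever is current
        max_tokens=2000,
        messages=[{"role": "user", "content": task}],
    )
    return message.content[0].text


def review_draft(task: str, draft: str) -> str:
    """Have GPT pick the draft apart rather than polish it."""
    client = OpenAI()
    prompt = (
        f"The task was:\n{task}\n\n"
        f"Another model produced this result:\n{draft}\n\n"
        "Do not improve the text. List every factual, numerical or legal "
        "claim in it that could be wrong, and explain why."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; use whatever is current
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    task = "Draft meta titles and descriptions for these five pages: ..."
    draft = generate_draft(task)
    print(review_draft(task, draft))
```

The prompt matters: ask the second model to find problems, not to improve the text. Otherwise it will happily rewrite the same errors in nicer words.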
The second: ask the AI what it has assumed. A simple "what assumptions did you make here?" after getting the result. The big errors almost never sit in the visible data; they sit in the invisible premises. The model assumed you file under module-based taxation when you file under direct estimation, or that your company is in the general VAT scheme when it's in the simplified one. Surface the premises and half the false successes fall away on their own.
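In code, that's one extra turn in the same conversation: you pass back the original task and the model's own answer, then ask for the premises. Same caveats as before, and continuing the sketch above:

```python
def surface_assumptions(task: str, draft: str) -> str:
    """Follow-up turn that forces the model to state its invisible premises."""
    client = anthropic.Anthropic()
    message = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder; use whatever is current
        max_tokens=1000,
        messages=[
            {"role": "user", "content": task},
            {"role": "assistant", "content": draft},
            {
                "role": "user",
                "content": (
                    "What assumptions did you make here? List every premise "
                    "about my situation that you could not verify from the "
                    "data I gave you."
                ),
            },
        ],
    )
    return message.content[0].text
```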
The third: take the single data point that carries the most weight and verify it by hand. Not everything. One thing. The most important figure, the strongest claim, the data point that most shapes the conclusion. If that holds, the rest has some credibility. If it falls, throw the lot in the bin.
Three steps that take five minutes total. Have they kept us from publishing articles with fake data, sending compromising emails and filing an incorrect tax return? Yes.
AI doesn't replace verification — it makes it more necessary than ever
I don't have any issue with artificial intelligence. I use it more hours a day than I'd like to admit. It makes me more productive than I was two years ago, and that's a fact.
But productivity without verification is speed without direction. You get there sooner, but you don't know where.
The guy on LinkedIn celebrating his €3,000 might be right. Or he might get a tax assessment letter in September that costs him quite a bit more than €3,000. The difference between one scenario and the other isn't which AI he used. It's whether someone — a second model, an accountant, him with a calculator — checked the result before hitting send.
If you work with artificial intelligence, work with a filter too. It doesn't need to be expensive, slow or sophisticated. It just needs to exist.
Because the biggest problem with AI isn't when it gets things wrong and you see it. It's when it gets things wrong and you don't. And from the experience of someone living this every day: it happens more than you think.