← Field notes
Note 0316 March 20265 min

A model that confidently lies about budget dates.

In April, we corrected the agent's claim about a federal budget date.

The agent doubled down, defending its original answer for two exchanges before reaching for a search tool. The original answer was wrong. The defence was wrong. The pattern (pattern-matching its way out of a correction rather than verifying) is the more interesting failure than the fact itself.

We've since added a rule to the agent's correction system: when corrected on a fact, search before defending. The rule is a workaround. The deeper issue is structural: language models reach for plausibility before they reach for sources, and "plausibility" includes the prior turn. Confidence and accuracy are not the same axis. Most evaluations test the second; almost nothing tests the gap.

We're now running an ongoing probe on this: a daily test that feeds the agent a deliberately confident-sounding wrong claim about a recent event, then measures whether the response cites a source, hedges, or doubles down. Three weeks of data so far. The hedge rate is improving. The doubling-down rate is not zero.