Data and Analysis You Can Trust
After this section, you will be able to get analysis out of a model that you would actually stake a decision on. You do that by making it compute answers in a checkable way instead of writing prose that merely looks like analysis, and by checking the assumptions underneath the math rather than the math itself.
The thing to understand first is what happens when you paste a table into a chat and ask "what's the average customer lifetime value?" You feel like you handed the model a calculation. You did not. The model does not run a calculator. It generates the text that most plausibly follows your question, and a confident dollar figure is exactly the kind of text that plausibly follows. One analyst did this, got back $12,847, and put it in an exec deck. The real number, computed by hand the next day, was $4,291. Nothing about the output looked wrong, which is the whole problem: when you ask a model to "analyze this data," you are not getting computation, you are getting prediction, and a prediction shaped like an answer is indistinguishable from an answer until someone checks.
It gets worse the moment numbers are involved, because the model is bad at arithmetic in a specific, invisible way. A figure like "1,520" gets split into pieces like "1", "52", "0" before the model ever sees it, so it is predicting the most likely result string rather than carrying the one across a column. On controlled four-digit multiplication, accuracy runs around 2%, and the errors are not wild. They are plausible: off by a digit, a missed carry, a decimal in the wrong place. Plausible-wrong is the dangerous kind, because obvious-wrong you catch and plausible-wrong you ship.
The fix that actually moves the needle is to stop asking for the conclusion and start asking for the code. Instead of "what's the average revenue?", ask the model to write the Python (or SQL) that computes the average revenue, then run that code against the real rows in a sandbox and read the result. This is not a stylistic preference. It changes the category of thing you get back. A conclusion is unverifiable text that you have to trust or redo. Code is an artifact you can read, rerun, and check, and the actual computation is done by something deterministic that does carry the one. The division of labor that operators settle into is clean: the model owns the hypothesis and the write-up, deterministic execution owns the numbers.
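As a minimal sketch of what "ask for the code" buys you, here is the kind of script a model might be asked to write for the average-revenue question. The sample rows, the column name revenue, and the function name average_revenue are assumptions made up for illustration; in practice the code would run against the real export.

```python
import csv
import io
from statistics import mean

# Hypothetical sample rows; in practice this would be the real export.
SAMPLE_CSV = """customer_id,revenue
a1,1520.00
a2,980.50
a3,2210.25
"""

def average_revenue(csv_text: str, column: str = "revenue") -> float:
    """Parse the rows and compute the mean deterministically."""
    rows = csv.DictReader(io.StringIO(csv_text))
    values = [float(row[column]) for row in rows]
    if not values:
        raise ValueError("no rows to average")
    return mean(values)

avg = average_revenue(SAMPLE_CSV)
print(f"average revenue: {avg:.2f}")
```

The point is less the arithmetic than the artifact: anyone can read which column was averaged, rerun it on the full file, and get the same number every time.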
The same split shows up one layer down, in "talk to your data" tools that turn an English question into a database query. Point one at a raw warehouse schema and accuracy lands around 40%, which is below any bar you would deploy against. The reason is not that the model is dumb. It is that the schema does not say which revenue column is net versus gross, what counts as an "active" user, or which timestamp is the event versus the ingest. The model cannot derive business meaning from column names, so it guesses, writes valid SQL against the wrong assumption, and the query runs and returns rows. Put a semantic layer in front of it, a modeled definition of the metrics and what they mean, and accuracy climbs toward 100%, because the model is now choosing from defined measures rather than inventing them. A roughly 4KB hand-authored semantic-layer document on top of the schema lifted accuracy by 17 to 23 percentage points across three frontier models, and the presence of that document, not which model you picked, accounted for essentially all the difference. The trust came not from a better model but from constraining it to definitions someone wrote down.
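One way to picture a semantic layer is as a small table of metric definitions that the model must choose from. The sketch below is a toy version under stated assumptions: the metric names, table names, and column names are all hypothetical, and a real semantic layer would carry far more (joins, filters, time grains).

```python
# Hypothetical semantic layer: every metric a question may use is defined once,
# by a person, with the business meaning spelled out.
SEMANTIC_LAYER = {
    "net_revenue": {
        "sql": "SUM(amount_net)",
        "table": "orders",
        "meaning": "Revenue after refunds and discounts, by event time.",
    },
    "active_users": {
        "sql": "COUNT(DISTINCT user_id)",
        "table": "sessions",
        "meaning": "Users with at least one session in the window.",
    },
}

def build_query(metric: str) -> str:
    """Return SQL for a defined metric; refuse anything not in the layer."""
    if metric not in SEMANTIC_LAYER:
        raise KeyError(
            f"undefined metric {metric!r}; add a definition instead of guessing"
        )
    definition = SEMANTIC_LAYER[metric]
    return f"SELECT {definition['sql']} AS {metric} FROM {definition['table']}"

print(build_query("net_revenue"))
```

The design choice worth noticing is the refusal path: an undefined metric raises instead of falling back to a guess at which column was meant.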
Where it breaks
The trap that catches people who have already learned the lesson above is this: connecting the model to a calculator, to Python, to a database, or to a validation layer fixes arithmetic error and does nothing for modeling error. This is the calculator fallacy. The tool computes whatever you told it to compute, exactly. If the assumption underneath was wrong, you now have a wrong answer carried out to the full precision the tool affords, and the precision makes it feel more authoritative, not less. Python cannot rescue a wrong assumption. If the agent reads a 3% rate as 0.3, treats a monthly figure as annual, or picks the wrong attribution model, the code runs cleanly and returns a precise, confident, wrong number. Tools move you from "the math might be wrong" to "the math is right and the question was wrong," which is a real improvement and not a safety guarantee.
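The calculator fallacy is easy to reproduce. In this sketch the function name projected_interest, the simple-interest formula, and the principal are illustrative assumptions; the only thing that differs between the two calls is how the 3% rate was read.

```python
def projected_interest(principal: float, rate: float, years: int) -> float:
    """Simple (non-compounding) interest: principal * rate * years."""
    return principal * rate * years

principal = 10_000.0  # hypothetical figure

# Intended reading: 3% is 0.03.
correct = projected_interest(principal, 0.03, 1)

# Misread assumption: 3% entered as 0.3. The code runs without complaint.
misread = projected_interest(principal, 0.3, 1)

print(f"rate=0.03 -> {correct:.2f}")
print(f"rate=0.30 -> {misread:.2f}")
```

Both calls execute cleanly and both results are arithmetically exact for their inputs. Nothing in the tool distinguishes the right reading from the one that is ten times too large.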
The signature version of this is the wrong-denominator failure, and it is the most consistent way model-driven analysis lies. Hand a model a CSV of only the users who saw the new checkout flow and ask for the conversion rate. It reports 45%. That denominator silently excludes everyone who churned before reaching the flow, everyone on the old flow, and everyone who hit a routing error, so 45% is a real percentage of the wrong population. The model does not know what it cannot see, and it does not ask. The query runs, the number is plausible, and nothing in the output signals that the base is wrong. The same thing happens inside a GROUP BY when "churn rate by cohort" gets computed against the total user count instead of each cohort's own count. There is no error to catch — just a wrong answer that looks exactly like a right one.
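The GROUP BY variant can be sketched in a few lines of plain Python. The cohort labels and rows below are invented for illustration; the comparison is between dividing each cohort's churned count by the total user count and dividing it by that cohort's own size.

```python
from collections import Counter

# Hypothetical rows: (cohort, churned)
users = [
    ("2024-01", True), ("2024-01", False), ("2024-01", False), ("2024-01", False),
    ("2024-02", True), ("2024-02", True),
]

cohort_sizes = Counter(cohort for cohort, _ in users)
churned = Counter(cohort for cohort, did_churn in users if did_churn)
total_users = len(users)

# Wrong base: every cohort divided by the total user count.
wrong_rates = {c: churned[c] / total_users for c in cohort_sizes}

# Right base: each cohort divided by its own count.
right_rates = {c: churned[c] / cohort_sizes[c] for c in cohort_sizes}

for cohort in sorted(cohort_sizes):
    print(f"{cohort}: wrong base {wrong_rates[cohort]:.1%}, "
          f"own base {right_rates[cohort]:.1%}")
```

Both dictionaries hold valid percentages between 0 and 100, and neither computation raises. Only the stated denominator tells you which one answers "churn rate by cohort".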
And validation does not save you here the way you would hope, because validation catches absurd errors far more reliably than plausible ones. A range check flags a 4000% conversion rate. It waves through 45% computed on the wrong base. A human reviewer skimming a polished report is in the same position: the report reads as competent, so the review degrades into a rubber stamp. The defenses that actually work are upstream of the polish. Make every numeric output ship with a visible assumptions table that names the formula, the units, the rate type, the time window, and whether a figure is measured or projected, so the assumption is forced into the open where it can be wrong out loud. For anything load-bearing, compute it a second, independent way and stop on disagreement rather than picking the more plausible of the two. Convergence is not correctness, but disagreement is a reliable smell, and a model run twice that gives two different numbers has told you something true about its confidence.
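One hedged way to make the assumptions table unavoidable is to refuse to build a number without it. The ReportedFigure class below is a hypothetical shape, not an established library type; its fields simply mirror the list above (formula, units, rate type, time window, measured or projected) plus the denominator.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReportedFigure:
    """A number bundled with the assumptions it rests on (hypothetical shape)."""
    name: str
    value: float
    formula: str       # in words
    denominator: str   # population, and what it excludes
    units: str
    rate_type: str     # e.g. "monthly", "annual", "point-in-time"
    time_window: str
    measured: bool     # True if measured, False if estimated or projected

    def assumptions_table(self) -> str:
        """Render the assumptions as aligned plain-text rows."""
        rows = [
            ("Figure", f"{self.name} = {self.value}"),
            ("Formula", self.formula),
            ("Denominator", self.denominator),
            ("Units", self.units),
            ("Rate type", self.rate_type),
            ("Time window", self.time_window),
            ("Basis", "measured" if self.measured else "estimated/projected"),
        ]
        return "\n".join(f"{label:<12} {text}" for label, text in rows)

figure = ReportedFigure(
    name="conversion_to_paid",
    value=0.45,
    formula="paid users divided by onboarding completers",
    denominator="users who completed onboarding; excludes non-finishers",
    units="fraction of users",
    rate_type="point-in-time",
    time_window="as of export",
    measured=True,
)
print(figure.assumptions_table())
```

Because no field has a default, leaving out the denominator or the time window fails at construction time, which is earlier and louder than a reviewer noticing it in a polished report.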
Before you accept any number a model hands you, make it surface the assumptions first and the conclusion last. Paste this block ahead of the analysis:
Analyze the data below, but follow this order strictly:
1. ASSUMPTIONS TABLE first. Before any result, list:
- The exact population/denominator you are computing over (and what it excludes)
- Each formula you will use, in words
- Units, rate type (e.g. monthly vs annual), and time window
- Which figures are measured vs estimated/projected
2. Then write the CODE that computes each number. Do not state a
number you did not compute in the code.
3. Then the RESULT, each figure traceable to a line of code above.
4. Flag anything you could not compute from the data I gave you as
[needs data] — do not fill it with a plausible estimate.
If the denominator or a definition is ambiguous, stop and ask me
before computing. Do not guess which column or which population I mean.
The order is doing the work. Demanding the assumptions table before the result forces the wrong-denominator and wrong-units errors into the open at the top, where you can catch them, instead of leaving them buried under a confident final number. Demanding the code before the result blocks the model from emitting a figure it never actually computed. The last two lines cover the calculator fallacy from the other side: a gap surfaces as a question or a [needs data] flag rather than a plausible fabrication.
For any number you would put in front of a decision-maker, add one more step the prompt cannot do for you. Compute it a second way, whether a spreadsheet, a different query, or a hand calculation on a sample, and if the two disagree, do not reconcile by picking the nicer number. Find out why they disagree first. The disagreement is the signal.
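The "stop on disagreement" rule can be sketched as a small guard. Everything named here is an assumption for illustration: the Disagreement exception, the cross_check helper, the tolerance, and the sample order values. The two code paths stand in for whatever two independent methods you actually have, such as a query and a spreadsheet.

```python
import math

class Disagreement(Exception):
    """Raised when two independent computations do not match."""

def cross_check(first: float, second: float, rel_tol: float = 1e-6) -> float:
    """Return the value only if both computations agree; otherwise stop."""
    if not math.isclose(first, second, rel_tol=rel_tol):
        raise Disagreement(
            f"{first} vs {second}: find out why before reporting either"
        )
    return first

orders = [120.0, 80.0, 100.0]  # hypothetical sample

# Way 1: total divided by count.
by_total = sum(orders) / len(orders)

# Way 2: incremental running mean, a separate code path.
running = 0.0
for i, amount in enumerate(orders, start=1):
    running += (amount - running) / i

average_order = cross_check(by_total, running)
print(f"average order value: {average_order:.2f}")
```

Note that the guard never chooses between the two values. Agreement lets the number through, and disagreement halts the report, which matches the rule that convergence is not proof but divergence is a reliable warning.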
Worked example
Illustrative. A constructed exchange to show the habit, not a real session.
A growth lead exports the users who completed the new onboarding flow and wants the conversion rate for a board update. The fast move is to hand over the file and ask:
You: [pastes a CSV of 4,200 users who finished onboarding] What's our conversion rate to paid?
Assistant: Your conversion rate to paid is 45%. This is a strong result, well above typical SaaS benchmarks of 2–5%, suggesting the new onboarding flow is performing exceptionally well.
That 45% is computed correctly on the rows in the file and is still wrong, because the file is only users who finished onboarding. Everyone who dropped before finishing, everyone routed to the old flow, and everyone who errored out are not in the denominator. The model computed a real percentage of the wrong population, dressed it in a benchmark comparison, and never mentioned the base it was standing on. Assumptions-first, the same request goes differently:
You: [pastes the same CSV] Conversion rate to paid. Assumptions table first, then the code, then the result. State the denominator explicitly and what it excludes.
Assistant: Assumptions:
- Denominator: the 4,200 rows in this file. These are users who completed onboarding only. This excludes anyone who started but did not finish, anyone on the prior flow, and any sessions that errored — so this is conversion among completers, not conversion among signups.
- Formula: paid_users / completers.
- "Paid" = plan != 'free' at export time. Point-in-time, not cohort-aged.
Code:
paid = df[df.plan != 'free']
rate = len(paid) / len(df)
Result: 1,890 / 4,200 = 45% conversion among onboarding completers. [needs data] To get conversion among all signups, I need the count of users who started onboarding, not just those who finished.
Same data, same model, same correct arithmetic. The second answer is trustworthy not because the number changed but because the base it stands on is now stated, the gap to the real question is flagged instead of filled, and a decision-maker can see exactly what 45% is a percentage of. The first answer would have gone to the board as the conversion rate. It was the conversion rate of the wrong people.
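The second answer's behavior can be sketched without pandas. The rows below are constructed to match the worked example's totals (4,200 completers, 1,890 on a non-free plan) and are not real data; the function name conversion_report and the started_count parameter are hypothetical. The sketch also assumes that only users who completed onboarding can be paid.

```python
from typing import Optional

# Hypothetical rows built to match the worked example's totals.
completers = [{"plan": "pro"}] * 1890 + [{"plan": "free"}] * 2310

def conversion_report(rows: list, started_count: Optional[int] = None) -> dict:
    """Report conversion with its denominator stated and gaps flagged."""
    if not rows:
        raise ValueError("no rows: denominator would be zero")
    paid = sum(1 for row in rows if row["plan"] != "free")
    report = {
        "denominator": f"{len(rows)} onboarding completers "
                       "(excludes non-finishers, prior flow, errored sessions)",
        "among_completers": paid / len(rows),
    }
    # Conversion among all signups needs a count the file does not contain.
    if started_count is None:
        report["among_signups"] = "[needs data]"
    else:
        report["among_signups"] = paid / started_count
    return report

report = conversion_report(completers)
for key, value in report.items():
    print(f"{key}: {value}")
```

When the count of users who started onboarding is missing, the function returns the flag instead of a figure, so the gap to the real question stays visible in the output.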