
ChatGPT vs Real Data Analysis: Why 'It Appears' Isn't Evidence


We ran a test. Same dataset. Same questions. Two tools: ChatGPT and Anna (the AI analyst inside heyanna).

The results weren't close. Not because ChatGPT is bad — it's genuinely impressive at many things. But data analysis isn't one of them. And the gap between "sounds right" and "is right" is exactly where bad decisions get made.

Here's everything that happened.

The problem with "it appears"

ChatGPT was trained to be helpful, harmless, and honest. The "honest" part means it hedges. A lot. When you upload a CSV and ask a question, you get responses full of qualifiers: "it appears," "this might suggest," "there seems to be."

Those phrases feel like intellectual humility. They're actually a red flag.

When ChatGPT says "it appears that email has a higher conversion rate," it's telling you it looked at two numbers and one was bigger. That's not analysis. That's subtraction. The question isn't whether 4.2% is bigger than 3.1% — you can see that yourself. The question is whether the difference is real or just noise in a small sample.

ChatGPT doesn't answer that question. It can't, because answering it requires running a statistical test, checking assumptions, and interpreting the result. ChatGPT's training optimized it for plausible-sounding text, not for statistical inference.

The danger isn't that ChatGPT gives you wrong numbers. It usually gets the arithmetic right. The danger is that it gives you unvalidated conclusions that sound authoritative. You act on them. You shift budget. You change strategy. And the whole decision was built on "it appears."

If an AI response uses "appears," "seems," or "might suggest" where you need a definitive answer, that's a signal: no statistical test was run. The tool is describing what the numbers look like, not what they mean.

The test setup

We wanted to make this fair. No cherry-picking. No coaching one tool and not the other.

The dataset

Marketing campaign data — 5,000 rows across four channels (email, paid search, paid social, organic) over 12 months. Each row contains a date, channel, spend, impressions, clicks, conversions, and revenue. The kind of dataset a marketing manager exports every month.

The questions

We asked five questions, each progressively harder. They move from descriptive ("what happened?") to inferential ("is this real?") to causal ("what's actually driving this?"):

  1. What's the overall conversion rate by channel?
  2. Is there a statistically significant difference in conversion rates between channels?
  3. Has the conversion rate changed over time? Is the trend significant?
  4. Which channel has the best ROI when you control for spend level?
  5. Are there any anomalies in the data that need investigation?

The rules

Same prompt. No follow-ups. No "try again." We took the first response from each tool. That's what matters in practice — most people don't have time to iterate five times to coax out a better answer.

Head-to-head results

Question 1: Descriptive stats (conversion rate by channel)

ChatGPT produced a clean table. Email: 4.2%. Paid search: 3.1%. Paid social: 2.8%. Organic: 3.6%. Correct numbers. Well-formatted.

Anna produced the same table — plus 95% confidence intervals for each rate. Email: 4.2% (3.8–4.6%). Paid search: 3.1% (2.7–3.5%). And so on.

Verdict: Both adequate. The confidence intervals from Anna are a nice addition — they tell you how precise the estimate is — but for a descriptive question, ChatGPT does fine. If all your questions are this simple, either tool works.
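
If you're curious what producing an interval like that involves, here's a minimal sketch in Python using statsmodels. The counts are illustrative placeholders chosen to land near 4.2%, not figures from the test dataset, and this isn't Anna's implementation.

```python
# Minimal sketch: 95% confidence interval for a conversion rate.
# Counts are illustrative, not the article's actual dataset.
from statsmodels.stats.proportion import proportion_confint

conversions = 504   # hypothetical conversions for one channel
clicks = 12_000     # hypothetical clicks (the denominator)

rate = conversions / clicks
low, high = proportion_confint(conversions, clicks, alpha=0.05, method="wilson")
print(f"Conversion rate: {rate:.1%} (95% CI {low:.1%}-{high:.1%})")
```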

The problem is that most real questions aren't this simple.

Question 2: Statistical significance

This is where the gap opens.

We asked: Is there a statistically significant difference in conversion rates between channels?

ChatGPT: "Email appears to have the highest conversion rate at 4.2%, followed by organic at 3.6%. There does seem to be a meaningful difference between channels, particularly between email and paid social."

No test. No p-value. No effect size. Just "appears" and "seems." It described the numbers we already had in the table from Question 1 and added hedging language. If you weren't paying close attention, you'd think it answered the question. It didn't.

Anna: Ran a chi-squared test of independence across all four channels (chi-squared = 47.3, df = 3, p < 0.001). Then ran pairwise comparisons with Bonferroni correction to identify which specific pairs differ. Results:

  • Email vs paid social: significant (p < 0.001), effect size (Cramér's V) = 0.08
  • Email vs paid search: significant (p = 0.003), effect size = 0.05
  • Email vs organic: not significant (p = 0.14)
  • Paid search vs organic: not significant (p = 0.22)

Anna's summary: "Conversion rates differ significantly across channels. Email outperforms both paid channels, but the difference between email and organic is not statistically significant — you can't conclude email is better than organic based on this data. The effect sizes are small, meaning the practical difference is modest even where it's statistically real."

That last part matters. The difference between email and paid social is real — but it's small. A chi-squared test tells you the difference exists. The effect size tells you whether it's worth acting on. ChatGPT gave you neither.
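
For readers who want to see what that workflow involves, here's a rough sketch in Python with scipy: an overall chi-squared test, Cramér's V as the effect size, and Bonferroni-adjusted pairwise comparisons. The contingency counts are made up to mirror the shape of the problem; they're not the article's dataset, and this isn't Anna's implementation.

```python
# Minimal sketch: chi-squared test of independence across channels,
# effect size, and Bonferroni-corrected pairwise comparisons.
from itertools import combinations
import numpy as np
from scipy.stats import chi2_contingency

# rows: [conversions, non-conversions] per channel (illustrative counts)
table = {
    "email":       [504, 11496],
    "paid_search": [930, 29070],
    "paid_social": [700, 24300],
    "organic":     [612, 16388],
}

observed = np.array(list(table.values()))
chi2, p, dof, _ = chi2_contingency(observed)
n = observed.sum()
cramers_v = np.sqrt(chi2 / (n * (min(observed.shape) - 1)))
print(f"Overall: chi2={chi2:.1f}, df={dof}, p={p:.3g}, Cramér's V={cramers_v:.2f}")

# Pairwise 2x2 tests with a Bonferroni correction for the six pairs
pairs = list(combinations(table, 2))
for a, b in pairs:
    _, p_pair, _, _ = chi2_contingency(np.array([table[a], table[b]]))
    p_adj = min(p_pair * len(pairs), 1.0)  # Bonferroni-adjusted p-value
    print(f"{a} vs {b}: adjusted p = {p_adj:.3f}")
```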

Question 3: Trend analysis

We asked: Has the conversion rate changed over time? Is the trend significant?

ChatGPT: Generated a line chart showing conversion rate by month across all channels. Added commentary: "There seems to be a slight upward trend in conversion rates over the 12-month period, particularly for email and organic channels."

A line chart is not a trend analysis. Your eyes can deceive you — small fluctuations in a noisy series look like trends when you're primed to find one. And "slight upward trend" tells you nothing about whether it's real or how fast it's moving.

Anna: Ran a Mann-Kendall trend test on each channel's monthly conversion rate. Results:

  • Email: significant upward trend (tau = 0.42, p = 0.02), estimated slope = +0.12 percentage points per month
  • Organic: no significant trend (tau = 0.18, p = 0.31)
  • Paid search: no significant trend (tau = 0.09, p = 0.67)
  • Paid social: significant upward trend (tau = 0.38, p = 0.04), estimated slope = +0.09 percentage points per month

Anna also tested for seasonality using STL decomposition, found a seasonal component in paid social (Q4 spike), and flagged a changepoint in email conversion rate in month 7 — a step-change rather than a gradual trend, possibly corresponding to a campaign or landing page change.

"Email conversion rate isn't gradually improving — it jumped in July and has been stable at the new level since. Worth investigating what changed that month. Paid social has a genuine upward trend, but part of the Q4 improvement is seasonal, not structural."

That's the difference between "seems like it's going up" and knowing exactly what's happening, when it started, and what might be causing it.
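
If you want a feel for how the trend test works: the core of Mann-Kendall is Kendall's tau computed against the time index, and the slope is a Theil-Sen style robust estimate. Here's a rough sketch with scipy on made-up monthly values; the seasonality and changepoint checks would come on top of this, using tools such as statsmodels' STL decomposition.

```python
# Minimal sketch: monotonic trend check on one channel's monthly rates.
# Values are synthetic. Kendall's tau vs. the time index is the core of
# the Mann-Kendall test; Theil-Sen gives a robust slope estimate.
import numpy as np
from scipy.stats import kendalltau, theilslopes

months = np.arange(12)
conv_rate = np.array([3.8, 3.9, 3.7, 4.0, 4.1, 3.9,
                      4.4, 4.5, 4.3, 4.6, 4.5, 4.7])  # hypothetical, in %

tau, p = kendalltau(months, conv_rate)
slope, intercept, lo, hi = theilslopes(conv_rate, months)
print(f"tau = {tau:.2f}, p = {p:.3f}, "
      f"slope = {slope:+.2f} pp/month ({lo:+.2f} to {hi:+.2f})")
```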

Question 4: Controlled comparison (ROI by channel)

This is the question that separates description from analysis.

We asked: Which channel has the best ROI when you control for spend level?

ChatGPT: Calculated simple ROI per channel: email = 4.8x, organic = 3.2x, paid search = 2.1x, paid social = 1.7x. Concluded: "Email has the highest ROI, delivering $4.80 for every dollar spent."

That's not wrong. It's misleading. Email had the lowest spend. Paid search had the highest. Comparing ROI without controlling for spend level is like comparing fuel efficiency between a car doing city driving and one doing highway driving. The conditions are different.

This is Simpson's paradox in practice: a conclusion drawn from aggregate numbers that reverses once you control for a confounding variable. High-spend channels often show lower ROI because of diminishing returns at scale. That doesn't mean they're worse investments.

Anna: Ran a multiple regression with revenue as the dependent variable, channel as a categorical predictor, and spend as a continuous control variable. This isolates the channel effect from the spend-level effect.

Results told a different story. After controlling for spend:

  • Paid search had the highest marginal ROI at equivalent spend levels
  • Email's apparent advantage disappeared — it looked best only because it operated at low spend where returns are naturally higher
  • Paid social showed the steepest diminishing returns curve

Anna's summary: "At equivalent spend levels, paid search delivers the best return. Email's 4.8x ROI reflects its low absolute spend, not superior efficiency. If you shifted $10K from email to paid search, the regression model predicts a net revenue increase of approximately $3,200/month — but this assumes linear scaling, which should be validated with a controlled test."

Same data. Opposite conclusion. The simple ROI calculation would have told you to pour money into email. The controlled analysis tells you paid search is the better investment at scale. That's the cost of skipping the statistics.

Whenever you're comparing performance across groups that differ in size, spend, or exposure, control for the difference. Uncontrolled comparisons are among the most common sources of misleading data analysis.
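
Here's roughly what that kind of controlled comparison looks like in Python with statsmodels. The data is synthetic and the column names (revenue, spend, channel) are assumptions about a typical export, not the article's exact schema; treat it as a sketch of the technique, not Anna's implementation.

```python
# Minimal sketch: regress revenue on channel (categorical) while
# controlling for spend. Synthetic data for illustration only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
channels = ["email", "paid_search", "paid_social", "organic"]
df = pd.DataFrame({
    "channel": rng.choice(channels, size=400),
    "spend": rng.uniform(100, 5000, size=400),
})
# Synthetic revenue: a channel baseline plus diminishing returns on spend
base = {"email": 300, "paid_search": 500, "paid_social": 200, "organic": 350}
df["revenue"] = (df["channel"].map(base)
                 + 40 * np.sqrt(df["spend"])
                 + rng.normal(0, 50, size=400))

# The channel coefficients estimate the channel effect at a fixed spend
# level, which is the comparison a simple revenue/spend ratio skips.
model = smf.ols("revenue ~ C(channel) + spend", data=df).fit()
print(model.params)
```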

Question 5: Anomaly detection

We asked: Are there any anomalies in the data that need investigation?

ChatGPT: "The data looks relatively consistent. I don't see any obvious outliers or anomalies. The channels perform within expected ranges throughout the period."

This is the most dangerous kind of response — a false negative delivered with confidence. It looked at the data and said "all clear." That's reassuring. It's also wrong.

Anna: Flagged three anomalies using IQR-based outlier detection and z-score analysis:

  1. Paid social, October 14: Spend was 340% above the channel's daily average with zero conversions recorded. Likely a tracking failure or a data pipeline issue — not an actual campaign result.
  2. Email, July 2–4: Three consecutive days with conversion rates 2.8 standard deviations above the channel mean. This aligns with the changepoint detected in Question 3 — something changed in July.
  3. Paid search, March 22: Revenue recorded as negative (-$1,240). Likely a refund or data error that should be excluded from ROI calculations.

Anna's summary: "Three data quality issues need attention before trusting any analysis of this dataset. The negative revenue value in paid search is artificially depressing that channel's ROI. The zero-conversion day in paid social is inflating its cost-per-acquisition. Both should be investigated and potentially excluded."

ChatGPT said everything was fine. Anna found data quality problems that, if left uncorrected, would distort every other analysis. Anomaly detection isn't glamorous. It's essential.
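
For reference, here's a rough sketch of those two checks in Python with pandas, run on a toy series of daily spend values. The numbers are synthetic and the per-channel, per-metric looping is omitted; it's an illustration of the technique, not Anna's implementation.

```python
# Minimal sketch: IQR and z-score outlier checks on one channel's daily spend.
# Values are synthetic; in practice this runs per channel and per metric.
import pandas as pd

spend = pd.Series([820, 790, 845, 810, 3600, 805, 798, -1240, 812, 801])

# IQR rule: flag anything beyond 1.5 * IQR outside the quartiles
q1, q3 = spend.quantile(0.25), spend.quantile(0.75)
iqr = q3 - q1
iqr_outliers = spend[(spend < q1 - 1.5 * iqr) | (spend > q3 + 1.5 * iqr)]

# z-score rule: flag anything more than 3 standard deviations from the mean.
# Note: on small samples, extreme points inflate the standard deviation,
# so the IQR rule often catches outliers that the z-score rule misses.
z = (spend - spend.mean()) / spend.std()
z_outliers = spend[z.abs() > 3]

print("IQR outliers:", iqr_outliers.tolist())
print("z-score outliers:", z_outliers.tolist())
```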

Why this matters

The cost of false confidence

A plausible-sounding wrong answer is worse than no answer. No answer means you investigate further. A confident wrong answer means you act.

Consider the Question 4 result. ChatGPT's simple ROI calculation says email is your best channel. You shift budget to email. Three months later, total revenue is down because you starved your highest-performing channel at scale. The analysis that looked right cost you real money.

This isn't hypothetical. Teams make budget decisions, strategy pivots, and hiring choices based on data analysis every day. The quality of the analysis directly determines the quality of the decision.

When ChatGPT is fine

Let's be fair. ChatGPT is genuinely useful for data-adjacent tasks:

  • Formatting and cleaning: "Convert these dates from MM/DD/YYYY to YYYY-MM-DD" — great at this.
  • Brainstorming questions: "What are interesting questions I could ask about this marketing dataset?" — solid starting point.
  • Explaining concepts: "What's a confidence interval?" — clear, accessible explanations.
  • Quick summaries: "What columns are in this CSV and what do they contain?" — perfectly adequate.

These are legitimate uses. ChatGPT is a writing and reasoning tool, and it does those things well. The problem is when people use it as an analysis tool — when they need evidence, not explanations.

When you need real analysis

Any time you're making a decision based on data, you need statistical rigor. Specifically:

  • Budget allocation: Which channels deserve more investment? (You need controlled comparisons, not simple averages.)
  • Strategy changes: Should we double down on email? (You need trend analysis with significance testing, not eyeballing a line chart.)
  • Reporting to stakeholders: Is this result real? (You need confidence intervals and effect sizes, not "it appears.")
  • Performance evaluation: Did the campaign actually work? (You need before/after testing with appropriate controls.)

If someone could challenge your conclusion by asking "but is that statistically significant?" and you don't have an answer — the analysis isn't done.

What "real analysis" actually means

The phrase "statistical rigor" sounds intimidating. It shouldn't. It means five things:

1. The right test for the question. Comparing two groups? T-test or Mann-Whitney, depending on the distribution. Comparing proportions? Chi-squared. Looking for a trend? Mann-Kendall or regression. The test choice matters because each has different assumptions. Anna selects the test automatically based on your data's characteristics.

2. Assumption checking. Every statistical test assumes something about your data — normality, independence, equal variance, sufficient sample size. If the assumptions don't hold, the results can't be trusted. Anna checks these before running the test and switches to a non-parametric alternative when needed.

3. Effect sizes, not just p-values. A p-value tells you whether a difference exists. The effect size tells you whether it matters. A statistically significant 0.1% conversion rate difference is real but useless. A 2% difference might be worth millions. Both can have the same p-value.

4. Confidence intervals, not point estimates. "Conversion rate: 4.2%" is a point estimate. "Conversion rate: 4.2% (3.8–4.6%)" is a confidence interval. The first creates false precision. The second tells you how much to trust the number.

5. Reproducible methodology. Can you explain how you reached your conclusion? Can someone else verify it? If your analysis is "I asked ChatGPT and it said so," that's not reproducible. If your analysis includes the test used, the sample size, the p-value, and the effect size — that's something a stakeholder can evaluate and trust.

This is what Anna does automatically. You ask a question in plain English. She selects the appropriate test, checks the assumptions, runs the analysis, calculates effect sizes and confidence intervals, and explains the result in language you can put in a report. The statistical rigor is there. You don't have to think about it.
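
If you did want to run those checks by hand, a minimal version of points 1 through 3 for a simple two-group comparison might look like the sketch below. The data is synthetic and the workflow is deliberately simplified; it illustrates the general pattern, not Anna's internals.

```python
# Minimal sketch: check the normality assumption, pick the test
# accordingly, and report an effect size alongside the p-value.
import numpy as np
from scipy.stats import shapiro, ttest_ind, mannwhitneyu

rng = np.random.default_rng(1)
group_a = rng.normal(4.2, 0.8, 60)   # e.g. daily conversion rates, channel A
group_b = rng.normal(3.6, 0.9, 60)   # e.g. daily conversion rates, channel B

# Assumption check: only use the t-test if both groups look roughly normal
normal = shapiro(group_a).pvalue > 0.05 and shapiro(group_b).pvalue > 0.05
if normal:
    stat, p = ttest_ind(group_a, group_b, equal_var=False)  # Welch's t-test
    test = "Welch's t-test"
else:
    stat, p = mannwhitneyu(group_a, group_b)
    test = "Mann-Whitney U"

# Cohen's d: how big the difference is, not just whether it exists
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
d = (group_a.mean() - group_b.mean()) / pooled_sd
print(f"{test}: p = {p:.4f}, Cohen's d = {d:.2f}")
```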

The real comparison

This isn't about ChatGPT being bad. It's about using the right tool for the job.

ChatGPT is a language model. It generates plausible text. When you ask it about data, it generates plausible-sounding analysis — and plausible-sounding analysis is exactly what makes it dangerous for decisions. It's hard to tell the difference between "sounds right" and "is right" without running the actual statistics.

Anna is a data analyst. She runs real statistical tests on your actual data. When she tells you something is significant, she means it in the statistical sense — not the "this looks important" sense. When she says a trend exists, she's tested for it. When she says two groups differ, she's quantified how much and how confident she is.

Same question. Same data. Different tools. Different quality of answer.

The choice depends on what you need. If you need a quick summary or a brainstorm, ChatGPT is fine. If you're making a decision — allocating budget, changing strategy, reporting to your board — you need evidence, not appearances.

Upload a CSV. Ask a question. See what real analysis looks like.

Try it with your own data.

Stop making decisions based on "it appears." Upload your data and get real statistical analysis.
