Why You Can't Trust GA4 Data for A/B Testing

Donald Ng
May 10, 2026

Quick answer

GA4 user counts are probabilistic estimates, not exact counts. Google applies a HyperLogLog++ cardinality estimation algorithm to every user metric in GA4 — and unlike Universal Analytics, there is no toggle to disable it. This means every A/B test analysis based on GA4 user data is working with approximated numbers. At low traffic the error is minor, but at scale it can flip experiment outcomes: one analysis of 20 real-world tests found 6 had completely reversed conclusions when exact counts replaced HLL++ estimates. The only path to raw, unsampled user data is the GA4 → BigQuery export. This post explains why the problem is worse than most teams realize, and walks through the BigQuery setup step by step.

Key takeaways

  • GA4 reports estimated user and session counts via HyperLogLog++. The error can reach ±2% per metric — small in isolation, catastrophic when applied to statistical significance calculations at scale.
  • Unlike Universal Analytics, GA4 provides no way to switch off this estimation. Every number you see in the GA4 interface or API is approximate.
  • GA4 also layers "modeled data" on top — machine-learning imputations for cookieless users — compounding the accuracy problem.
  • Session-based metrics from GA4 violate the independence assumption that A/B testing statistics require, introducing additional bias.
  • The fix: use the GA4 BigQuery export and query COUNT(DISTINCT user_pseudo_id) directly from raw event data.

Every A/B test is only as trustworthy as the data behind it. Most growth teams know to worry about sample size, test duration, and statistical significance. Far fewer realize that the data source itself — Google Analytics 4 — introduces a layer of approximation that can silently corrupt experiment results.

This is not a minor rounding issue. At sufficient scale, using GA4 user metrics for A/B testing analysis can cause you to implement changes that hurt conversions, reject changes that help them, or waste months of experimentation budget chasing signals that aren't real.

Let's walk through every layer of the problem.

Problem 1: GA4 and A/B Testing Tools Count Users Differently

The most fundamental mismatch is definitional. GA4 is built around sessions. Your A/B testing tool is built around unique visitors. These two things are not the same, and treating them as interchangeable quietly breaks your statistics.

What is a session in GA4?

A session in GA4 is a group of interactions a user takes on your site within a given time frame. By default, a session expires after 30 minutes of inactivity. The same visitor can generate multiple sessions in a single day — if they arrive, leave, come back an hour later, and browse again, that's two sessions from one person.

GA4 does report "users" (specifically "active users"), but this metric is calculated differently from the unique visitor count your A/B testing platform uses. Your testing tool assigns a persistent identifier — usually a first-party cookie — to each browser. One cookie equals one unique visitor, regardless of how many times they return or how many sessions they start.

Why this matters for A/B test statistics

Statistical significance tests for proportions (conversion rate tests) assume that each observation in your sample is independent. If the same person shows up in your dataset five times because they generated five sessions, you've inflated your sample size while violating the independence assumption. The p-value your calculator produces is no longer valid.

The practical consequence: session-based samples from GA4 will appear larger than they are. Your test will look like it has enough statistical power before it actually does. You'll call winners earlier and more often than the data justifies.

This is why serious A/B testing platforms track experiments at the visitor level, not the session level. If you supplement or replace your testing tool's built-in analysis with GA4 session data, you're reintroducing exactly the problem the visitor-level design was meant to prevent.
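The inflation is easy to see with a toy calculation. In this sketch (all numbers hypothetical), 10,000 unique visitors generate 18,000 sessions; analyzing at the session level makes the standard error look much tighter than it really is:

```python
from math import sqrt

# Hypothetical experiment arm: 10,000 unique visitors who together
# generate 18,000 sessions, with 400 converting visitors
visitors, sessions, conversions = 10_000, 18_000, 400

# Correct unit of analysis: unique visitors
p_visitor = conversions / visitors
se_visitor = sqrt(p_visitor * (1 - p_visitor) / visitors)

# Session-based "sample": the same people counted multiple times
p_session = conversions / sessions
se_session = sqrt(p_session * (1 - p_session) / sessions)

print(f"visitor-level SE: {se_visitor:.4f}")  # 0.0020
print(f"session-level SE: {se_session:.4f}")  # 0.0011 -- looks tighter, but isn't real
```

The session-based standard error comes out roughly half the honest one, which is exactly how tests come to look significant before they are.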

Problem 2: GA4 User Counts Are Estimates, Not Exact Counts

Here's the part most people don't know, and the part that is most dangerous for A/B testing.

The "users" number you see anywhere in GA4 — in standard reports, in exploration reports, via the API — is not an exact count. It is a probabilistic estimate computed using the HyperLogLog++ (HLL++) algorithm, a technique from computer science designed to estimate the cardinality of very large datasets efficiently.

This was first extensively documented by Georgi Georgiev at Analytics-Toolkit.com, and later confirmed by Google's own documentation. The official GA4 developer blog post on HLL++ explicitly states that HLL is used for both user and session counts, with a precision parameter of 14 for user counts and 12 for session counts.

"In early 2017, Google Analytics began updating the calculation for the Users and Active Users metrics to more efficiently count users with high accuracy and low error rate (typically less than 2%)." — Google Analytics Help documentation

A 2% error rate sounds fine. The problem is what that 2% does to statistical significance calculations when you have a large sample.

How HyperLogLog++ works

HyperLogLog++ is a space-efficient probabilistic data structure. Instead of storing every user ID and counting them exactly, it uses a mathematical sketch that can estimate "there are approximately N distinct values here" using a tiny fraction of the memory. For GA4, this allows Google to report user counts across billions of events without storing full sets of IDs.

At precision parameter 14, the HLL++ algorithm achieves:

  • ±0.41% error for 68% of estimates
  • ±0.81% error for 95% of estimates
  • ±1.22% error for 99% of estimates

In absolute terms, this means a reported user count of 100,000 could represent anywhere from roughly 98,780 to 101,220 actual users, and you have no way of knowing which direction the estimate skewed.
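Those bounds line up with the textbook HyperLogLog error formula: a sketch with m = 2^precision registers has a relative error on the order of 1.04/√m. A quick check in Python (the 1.22% figure below is GA4's documented 99% bound from the list above):

```python
from math import sqrt

def hll_relative_error(precision: int) -> float:
    """Classic HyperLogLog relative error, 1.04 / sqrt(m),
    for a sketch with m = 2**precision registers."""
    return 1.04 / sqrt(2 ** precision)

print(f"precision 14 (GA4 users):    +/-{hll_relative_error(14):.2%}")  # +/-0.81%
print(f"precision 12 (GA4 sessions): +/-{hll_relative_error(12):.2%}")  # twice the user-count error

# What a reported 100,000 users can mean at the documented 99% bound
reported, err_99 = 100_000, 0.0122
print(f"{reported:,} reported -> {reported * (1 - err_99):,.0f} "
      f"to {reported * (1 + err_99):,.0f} actual users")
```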

What makes this particularly insidious is that the error is random and uncorrelated between your control and test groups. In an A/B test, if your control group happens to be underestimated by 1.5% and your test group happens to be overestimated by 1.5%, your conversion rate calculation is now wrong in both the numerator and the denominator — and it can swing in the direction that produces a false positive or a false negative.
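Here's a concrete sketch of that failure mode, using hypothetical counts and a plain two-proportion z-test. The conversion totals are identical in both runs; the only change is user counts shifted by ±1% in opposite directions, comfortably within HLL++'s documented range:

```python
from math import sqrt, erf

def two_prop_pvalue(x1, n1, x2, n2):
    """Two-sided two-proportion z-test with pooled variance."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = abs(p1 - p2) / se
    # two-sided p-value via the standard normal CDF
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

# Exact counts: 5.00% vs 5.15% conversion on 100k users per arm
p_exact = two_prop_pvalue(5000, 100_000, 5150, 100_000)

# Same conversions, but user counts mis-estimated by +1% and -1%
p_est = two_prop_pvalue(5000, 101_000, 5150, 99_000)

print(f"exact counts:     p = {p_exact:.3f}")  # ~0.13, not significant
print(f"estimated counts: p = {p_est:.3f}")    # ~0.01, a false "winner"
```

A 1% wobble in the denominators is enough to turn a clearly inconclusive test into an apparent winner at the 95% threshold.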

Free A/B Testing Tool

Run your next A/B test the right way

Visual editor, 15 KB script, GA4-native — and free forever up to 100,000 monthly visitors. No developer required.

✓ Visual editor ✓ 15 KB script ✓ GA4 integration ✓ Free up to 100k visitors
Try Mida free →

Problem 3: In GA4, You Can't Turn This Off

Here's where GA4 is fundamentally different from its predecessor, Universal Analytics.

In Universal Analytics, there was a property-level setting called "Enable User Metrics in Reporting." If you turned this toggle off, GA reported exact user counts — real, raw counts of unique client IDs from the session logs — instead of HLL++ estimates. It wasn't well-known, it wasn't on by default, but it existed.

In GA4, this toggle does not exist. There is no setting, no API parameter, no query option that forces GA4 to return exact user counts. Every user count you retrieve from GA4 — through the web interface, through the Data API, through Looker Studio — is an approximation.

"In the new Google Analytics 4 no option exists to turn off user metrics so every user count you can get through the interface is an estimate instead of an exact count." — Georgi Georgiev, Analytics-Toolkit.com

This architectural decision by Google means that, for GA4 users, exact counts versus estimates is not a choice you get to make. You get estimates, and there is no escape route through the standard GA4 interface or API.

The only escape route is BigQuery — and we'll get to that.

Problem 4: GA4 Layers Modeled Data on Top

The HLL++ estimation is a structural issue with GA4's architecture. There's a second, more visible layer of approximation that compounds it: modeled data.

As privacy regulations (GDPR, CCPA) have tightened and browser-based cookie consent requirements have grown, a significant portion of your visitors — often 30–50% depending on region and industry — decline tracking consent. GA4 receives no data at all for these users.

To fill this gap, GA4 uses machine learning models to impute what these unconsented users would have done. It estimates their behavior based on patterns from users who did consent. This modeled data is then blended into your standard GA4 reports without any explicit label indicating which rows are real and which are imputed.

As Louis Loh documented on Medium, most marketers using GA4 have no idea how much of their data is modeled versus observed. The modeled percentage can be surprisingly high, and the modeling algorithms are proprietary — you cannot audit or validate them.

For A/B testing purposes, this creates a severe problem. If 40% of your "users" in an experiment are GA4 model imputations, your conversion rates are partly real and partly fictional. The fictional part is generated by a model that has no knowledge of which experiment variant a user saw.

In other words, modeled data in GA4 effectively assigns random experiment outcomes to a large fraction of your experimental population. This adds noise that is indistinguishable from signal in the GA4 interface.

Problem 5: Sampling and Data Freshness

Beyond estimation and modeling, GA4 applies traditional data sampling to some report types. Standard reports are unsampled, but Exploration reports sample data once a query exceeds the property's quota (10 million events per query for standard properties), adding yet another layer of approximation on top of everything above.

Additionally, GA4 data is not real-time for most reports. Standard processing takes 24–48 hours for complete data. Intraday data is available but is explicitly preliminary and subject to revision. If you're pulling conversion data from GA4 mid-experiment, you may be working with incomplete data.

The Hard Numbers: How Wrong Can It Get?

At small sample sizes the HLL++ error is tolerable. Below about 12,000 users per test arm, the algorithm achieves high enough accuracy that the statistical distortion is minimal. Most low-traffic teams using GA4 for informal analysis won't notice any problem.

The situation changes dramatically at scale. Georgiev's simulation study ran 10,000 A/A tests at various cardinality levels and measured how often the p-value fell below common significance thresholds. In a valid system with no true effect, this should happen at exactly the stated rate (5% for p < 0.05). Here's what HLL++ estimates actually produced:

Users per test arm | p < 0.05 rate (should be 5%) | α inflation | p < 0.01 rate (should be 1%) | α inflation
20,000 | 5.79% | +16% | 1.50% | +50%
50,000 | 8.75% | +75% | 2.57% | +157%
100,000 | 11.67% | +133% | 4.39% | +339%
200,000 | 18.43% | +269% | 8.41% | +741%
500,000 | 32.11% | +542% | 19.29% | +1829%
1,000,000 | 43.86% | +777% | 31.97% | +3097%

Source: Analytics-Toolkit.com — The Perils of Using Google Analytics User Counts in A/B Testing

At 100,000 users per arm — a sample size many mid-market teams would consider reasonably large — your actual false positive rate is 11.67%, not the 5% you think you're testing at. Your 95% confidence threshold now provides only ~88% confidence in practice.

At 500,000 users per arm, the false positive rate for a "95% confidence" test is 32%. You would make the wrong decision on roughly one in three A/B tests.
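The direction of these numbers is easy to reproduce with a rough Monte Carlo sketch. This is a simplified noise model, not Georgiev's methodology: each distinct count in a simulated A/A test is perturbed with Gaussian relative noise at an assumed 0.41% sigma, standing in for HLL++ estimation error:

```python
import random
from math import sqrt, erf

def pvalue(x1, n1, x2, n2):
    """Two-sided two-proportion z-test with pooled variance."""
    p = (x1 + x2) / (n1 + n2)
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = abs(x1 / n1 - x2 / n2) / se
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

random.seed(7)
N, CR, SIMS = 200_000, 0.05, 5_000
HLL_SIGMA = 0.0041  # assumed 1-sigma relative error of each estimated count

flagged = 0
for _ in range(SIMS):
    # True A/A test: identical behavior, normal approximation to the binomial
    x1 = random.gauss(N * CR, sqrt(N * CR * (1 - CR)))
    x2 = random.gauss(N * CR, sqrt(N * CR * (1 - CR)))
    # Perturb user and converter counts independently, mimicking estimation
    n1 = N * (1 + random.gauss(0, HLL_SIGMA))
    n2 = N * (1 + random.gauss(0, HLL_SIGMA))
    e1 = x1 * (1 + random.gauss(0, HLL_SIGMA))
    e2 = x2 * (1 + random.gauss(0, HLL_SIGMA))
    if pvalue(e1, n1, e2, n2) < 0.05:
        flagged += 1

print(f"false positive rate: {flagged / SIMS:.1%} (nominal: 5.0%)")
```

Even this crude model pushes the false positive rate to roughly twice the nominal 5% at 200,000 users per arm; Georgiev's analysis, which models the estimator and the metric pipeline more faithfully, finds larger inflation still.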

Real-world test outcome reversals

This isn't theoretical. Georgiev analyzed 20 real experiments and compared outcomes using HLL++ user counts (as reported in GA) vs. exact counts (from BigQuery raw export). The results:

  • 6 out of 20 tests had completely reversed conclusions — tests that appeared statistically significant with GA estimates had no real effect with exact counts, or vice versa.
  • The remaining 14 were unaltered in direction but had meaningfully different p-values and confidence intervals.

The reversals weren't small noise. Several showed p-values of 0.004–0.018 in GA (clear "winners") that came in at 0.12–0.82 with exact data (clearly not significant). Three more showed the opposite: rejected by GA estimates, but statistically significant with exact counts.

These are not edge cases — they are the direct consequence of using estimated counts for statistical analysis.


The Solution: Use GA4's BigQuery Export

The only way to get exact, unsampled, unmodeled user counts from GA4 is through the native BigQuery export. GA4 exports every raw event hit — including the underlying user identifiers — to BigQuery in near real-time. From BigQuery, you can count distinct users exactly, without HLL++ approximation and without blended modeled data.

This is the approach recommended by data engineers who work with GA4 at scale, and it's the one Georgiev's analysis uses when comparing "exact counts" to "HLL++ estimates."

How to Set Up the GA4 → BigQuery Export

The setup is free for the GA4 side. You need a Google Cloud project (BigQuery charges for storage and query compute — typically a few dollars per month for moderate-volume sites).

Step 1: Create a Google Cloud project

  1. Go to console.cloud.google.com and sign in with your Google account.
  2. Click Select a project → New Project. Give it a name (e.g., my-ga4-export).
  3. Enable the BigQuery API for this project: navigate to APIs & Services → Enable APIs → search BigQuery → Enable.
  4. Make sure billing is enabled on the project (BigQuery requires a billing account, though free-tier usage covers small queries).

Step 2: Link GA4 to BigQuery

  1. In GA4, go to Admin → Property → BigQuery Links (under Product Links).
  2. Click Link.
  3. Select your Google Cloud project. GA4 will verify that you have the necessary permissions (BigQuery Admin role on the project).
  4. Choose your export location (region). Select the region closest to your users or your data team.
  5. Choose Daily export (cheaper, batched once a day) or Streaming export (continuous, higher cost). For A/B test analysis, daily is almost always sufficient.
  6. Select which events to export. "All events" is recommended — you can filter later in SQL.
  7. Click Submit.

After linking, GA4 will start exporting to BigQuery within 24 hours. A dataset called analytics_[PROPERTY_ID] will appear in your BigQuery project. Daily exports appear as tables named events_YYYYMMDD; streaming creates an events_intraday_YYYYMMDD table that is replaced each day by the final daily table.

Step 3: Identify your experiment exposure events

For this to work, your A/B testing platform needs to log an event to GA4 when a user is exposed to an experiment. If you're using a dedicated A/B testing tool like Mida, this happens automatically via the GA4 integration — Mida fires an experiment_impression event with the test ID and variant as parameters.

In BigQuery, your exposure events will appear as rows in the events table with event_name = 'experiment_impression' (or whatever your testing tool names it).
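If your testing tool doesn't integrate with GA4 directly, exposure events can also be sent server-side through the GA4 Measurement Protocol. A minimal sketch, with the same event and parameter names used elsewhere in this article; the measurement_id, api_secret, and client ID values are placeholders you must supply, and the payload shape follows Google's Measurement Protocol for GA4:

```python
import json
import urllib.request

def build_exposure_event(client_id: str, experiment_id: str, variant_id: str) -> dict:
    """Payload for one experiment-exposure event, in Measurement Protocol shape."""
    return {
        "client_id": client_id,  # must match the browser's GA4 client ID
        "events": [{
            "name": "experiment_impression",
            "params": {"experiment_id": experiment_id, "variant_id": variant_id},
        }],
    }

def send_exposure(measurement_id: str, api_secret: str, payload: dict) -> None:
    """POST the payload to GA4's Measurement Protocol collect endpoint."""
    url = ("https://www.google-analytics.com/mp/collect"
           f"?measurement_id={measurement_id}&api_secret={api_secret}")
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)

payload = build_exposure_event("555.1234567890", "your-experiment-id", "variant_b")
print(json.dumps(payload, indent=2))
```

Events sent through the Measurement Protocol flow into the BigQuery export alongside client-side events, so the query in Step 4 works unchanged.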

Step 4: Query exact user counts per variant

The following query gives you exact unique user counts and conversions for each variant of an experiment, using the raw event data:

-- Exact experiment user counts from GA4 BigQuery export
-- Replace 'YOUR_DATASET', date range, experiment_id, and goal event name

WITH exposures AS (
  SELECT
    user_pseudo_id,
    -- Extract the variant parameter from the experiment event
    (SELECT value.string_value FROM UNNEST(event_params)
     WHERE key = 'variant_id') AS variant
  FROM `YOUR_DATASET.events_*`
  WHERE _TABLE_SUFFIX BETWEEN '20260401' AND '20260430'
    AND event_name = 'experiment_impression'
    AND (SELECT value.string_value FROM UNNEST(event_params)
         WHERE key = 'experiment_id') = 'your-experiment-id'
),

conversions AS (
  -- Note: this counts any goal event in the date range; for stricter
  -- attribution, also require event_timestamp after the user's first exposure
  SELECT DISTINCT user_pseudo_id
  FROM `YOUR_DATASET.events_*`
  WHERE _TABLE_SUFFIX BETWEEN '20260401' AND '20260430'
    AND event_name = 'purchase'  -- replace with your goal event
),

-- Deduplicate: one row per user. MIN(variant) is an arbitrary but
-- deterministic tie-breaker; to keep the chronologically first assignment,
-- select event_timestamp in the exposures CTE and use
-- ARRAY_AGG(variant ORDER BY event_timestamp LIMIT 1)[OFFSET(0)]
user_variants AS (
  SELECT
    user_pseudo_id,
    MIN(variant) AS variant
  FROM exposures
  GROUP BY user_pseudo_id
)

SELECT
  uv.variant,
  COUNT(DISTINCT uv.user_pseudo_id) AS users,
  COUNT(DISTINCT c.user_pseudo_id) AS conversions,
  SAFE_DIVIDE(
    COUNT(DISTINCT c.user_pseudo_id),
    COUNT(DISTINCT uv.user_pseudo_id)
  ) AS conversion_rate
FROM user_variants uv
LEFT JOIN conversions c USING (user_pseudo_id)
GROUP BY uv.variant
ORDER BY uv.variant;

This query uses COUNT(DISTINCT user_pseudo_id) — an exact computation — instead of any HLL++ estimate. The output gives you the exact numbers to plug into your significance calculator.

Step 5: Run statistical analysis on the exact counts

Take the user counts and conversion counts from the BigQuery query and run them through a standard two-proportion z-test or chi-squared test. You can use:

  • Mida's built-in A/B testing significance calculator
  • A Python script using statsmodels.stats.proportion.proportions_ztest (the function lives in statsmodels, not scipy.stats), or an R script using prop.test
  • Any spreadsheet-based calculator that accepts raw user counts and conversion counts

The critical difference from using GA4's interface: you are now working with exact numbers, so the statistical thresholds you set actually mean what they say.
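For instance, plugging hypothetical per-variant numbers from the Step 4 query into SciPy's chi-squared test (on a 2×2 table without continuity correction, this is equivalent to a two-sided two-proportion z-test):

```python
from scipy.stats import chi2_contingency

# Hypothetical per-variant output of the Step 4 BigQuery query
results = {
    "control":   {"users": 48_211, "conversions": 2_305},
    "variant_b": {"users": 48_377, "conversions": 2_511},
}

# 2x2 contingency table: [converted, did not convert] per variant
table = [
    [v["conversions"], v["users"] - v["conversions"]]
    for v in results.values()
]

chi2, p_value, dof, _ = chi2_contingency(table, correction=False)
for name, v in results.items():
    print(f"{name}: {v['conversions'] / v['users']:.2%} conversion rate")
print(f"p-value: {p_value:.4f}")
```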

What If BigQuery Setup Is Too Complex?

The BigQuery export requires a Google Cloud account and basic SQL comfort. For some teams, that's a barrier. Here's the practical alternative hierarchy:

  1. Use your A/B testing tool's native statistics. A purpose-built A/B testing platform like Mida tracks unique visitors independently from GA4, computes its own exact counts, and runs significance calculations on clean data. This is the most reliable approach for most teams and requires no BigQuery setup at all.
  2. Use GA4 as supplemental context only. You can still use GA4 to understand segment breakdowns, funnel drop-off, and behavioral patterns around your experiment. Just don't use GA4 user counts as the input to your significance calculations.
  3. If using GA4 for analysis, limit to very low traffic. Below roughly 12,000 users per arm, the HLL++ distortion is small enough to be tolerable. Above that threshold, the error starts compounding into test outcome territory.

Common Questions

Q: Does this affect GA4's built-in A/B test analysis (if using Google Optimize or similar)?
A: Google Optimize shut down in September 2023. If you were using it and analyzing results in the GA4 interface, yes — your analyses were subject to HLL++ distortion. The Google Optimize data pipeline was affected by the same property-level estimation.

Q: Does this apply to session-level conversion rates too?
A: Session-based metrics are less affected by HLL++ (the algorithm was primarily applied to user-level metrics), but are subject to sampling in Explore reports. More importantly, session-based conversion rates are conceptually wrong for A/B testing because of the independence violation discussed in Problem 1.

Q: If I use GA4 just to track one goal event, not users, am I safe?
A: Event counts (not user-distinct counts) are exact in GA4 — they don't go through HLL++. If your conversion metric is a raw event count (e.g., total purchase events, not unique purchasing users), this issue doesn't apply. However, raw event count metrics violate user-level independence in A/B tests, which is a different problem.

Q: Is this unique to GA4 or do other analytics tools have the same issue?
A: Adobe Analytics uses its own HyperLogLog implementation with up to 5% inaccuracy for 95% of estimates — worse than GA4's. Other analytics platforms vary. Tools that export raw events to warehouses (like Amplitude, Mixpanel, or Segment) allow exact counts if queried correctly at the source.


Final Thoughts

GA4 is an excellent analytics tool for understanding traffic, attribution, and user behavior at a high level. It is not designed to be the statistical engine for controlled experiments, and using it as one introduces errors that compound with scale in ways that are hard to detect without going back to raw data.

The three-sentence summary:

  • GA4 user counts are estimates, not exact counts, and you can't change that.
  • At scale (>20K users per arm), those estimates produce false positive rates far above what your confidence threshold implies.
  • The fix — BigQuery export with COUNT(DISTINCT user_pseudo_id) — is straightforward once set up, and it gives you the clean data your experiment analysis requires.

If you're running experiments on a dedicated A/B testing platform like Mida, your experiment statistics are already computed from exact visitor counts that Mida tracks independently — not from GA4. GA4 integration in Mida sends exposure and conversion events to GA4 for cohort analysis and funnel visualization, but the significance calculations in Mida's reporting use the platform's own first-party data. That's the right architecture for reliable experimentation.

Before you make your next product decision based on an A/B test, make sure you know what numbers you're actually looking at.

