About 60 percent of the "winning" Google Ads changes that advertisers roll out in 2026 are never actually proven. They are judged by comparing this month to last month, when seasonality, competitors and a dozen other edits all move at once. An experiment removes that confusion by running the old and new versions side by side, in the same auctions, at the same time, so the only thing that differs is the one change you are testing.
This guide walks through drafts and experiments end to end — what to test, how to split traffic, how much data you need, and how to read the result without fooling yourself — so your next "win" is a real one. To see which parts of your account are most worth testing first, run our free 5-axis Google Ads audit.
Updated 2026-05-17 with current drafts and experiments behavior, Smart Bidding learning windows and significance practice observed across US, UK and European accounts.
1. Draft first, then experiment: a draft is a safe sandbox; the experiment splits live traffic.
2. One variable per test: change the bid strategy or the landing page, never both.
3. Run arms concurrently: both face the same seasonality, so time is no longer a confound.
4. Size the sample before you peek: aim for 100+ conversions per arm and 2 to 4 full weeks.
5. 95 percent is a gate, not a finish line: a non-significant result is unproven, not a tie.
What are drafts and experiments in Google Ads?
Drafts and experiments are two halves of the same workflow, and understanding the split is the foundation for every test below. A draft is a staging copy; an experiment is the live comparison that copy makes possible.
Drafts — A draft is a sandbox duplicate of a live campaign where you make your proposed change without touching the original. Nothing in a draft spends money or serves ads; it is simply a safe place to stage one edit — a new bid strategy, a different landing page, a rewritten RSA — and review it before any traffic sees it.
Experiments — Promoting a draft into an experiment is what makes the comparison real. Google splits the campaign's eligible auctions between the original (the control) and the draft (the variant), so both run at the same time against the same competition and seasonality. This concurrency is the whole point: it removes time as a confounding variable.
Why this beats a before-and-after — When you change a live campaign and compare last week to this week, every other moving part — competitors, demand, your other edits — is baked into the result. Because an experiment runs both arms together, the difference you measure is far closer to the true effect of your one change. For the causal logic behind this, see our incrementality testing guide.
What should you actually test first?
Not every change deserves an experiment, and the ones that do should be ranked by how much they can move CPA. Spend your limited traffic on the few tests with real leverage, not on cosmetic tweaks.
Bidding strategy — This is usually the highest-leverage test because the bid algorithm decides what you pay for every click. Comparing Maximize Conversions against Target CPA, or one Target CPA against a tighter one, can swing cost per conversion materially. Our Maximize versus Target CPA guide breaks down when each wins.
Landing pages — Sending the variant arm to a different URL is one of the cleanest tests in the platform, because the page change is fully isolated from the ad. A faster page, a tighter headline, or a shorter form often moves conversion rate more than any bid tweak. See our landing page conversion guide.
Ad copy and RSAs — Testing a new RSA or a different asset mix tells you what messaging the auction actually rewards. The method matters here: our RSA writing method shows how to build variants worth testing.
One variable at a time — Whatever you choose, change exactly one thing. Bundle a new bid strategy with a new landing page and a winning result tells you nothing reusable, because you cannot attribute the lift to either change.
How do you set up a valid 50/50 experiment?
A valid experiment is mostly about discipline at setup. Get the split, the timing and the isolation right, and the read at the end is trustworthy; get them wrong and no amount of analysis saves the result.
The 50/50 split — Start with an even traffic split so both arms accumulate data at the same rate and reach significance together. An uneven split — say 10/90 — protects the original but starves the variant of data, so it takes far longer to prove anything.
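To see how much an uneven split slows a test down, here is a minimal sketch. It assumes conversions scale linearly with traffic share, and it uses a hypothetical campaign producing 80 conversions a week and the 100-conversions-per-arm floor from this guide. The function name and figures are illustrative only.

```python
import math

def weeks_to_target(weekly_conversions: float, arm_share: float,
                    target_per_arm: int = 100) -> int:
    """Full weeks until the arm receiving `arm_share` of traffic hits the target.

    Assumes conversions are proportional to traffic share (a simplification).
    """
    if not 0 < arm_share < 1:
        raise ValueError("arm_share must be between 0 and 1")
    conversions_per_week = weekly_conversions * arm_share
    return math.ceil(target_per_arm / conversions_per_week)

# Hypothetical campaign: 80 conversions a week in total.
even_split = weeks_to_target(80, 0.50)    # variant receives half the traffic
uneven_split = weeks_to_target(80, 0.10)  # variant receives a tenth of it
print(f"50/50: {even_split} weeks, 10/90: {uneven_split} weeks")
```

With these made-up numbers the 10/90 variant needs several times as many weeks as the 50/50 variant to collect the same data, which is the cost of protecting the original.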
Cookie-based assignment — Use a cookie-based split rather than a search-based one so a returning user always sees the same arm. Otherwise the same person can land in both the control and the variant, blurring the comparison and inflating noise.
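The idea behind user-level assignment can be illustrated with a small sketch. This is not how Google Ads implements its split, which this guide does not describe. It is a generic hash-based bucketing pattern, and `assign_arm`, `user_id` and `experiment_id` are hypothetical names. The point is only that a stable identifier always maps to the same arm.

```python
import hashlib

def assign_arm(user_id: str, experiment_id: str, variant_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'variant'.

    Illustrative only: hashes the identifiers into a number in [0, 1)
    and compares it with the variant's traffic share.
    """
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "variant" if bucket < variant_share else "control"

# The same (hypothetical) user lands in the same arm on every visit.
first_visit = assign_arm("user-123", "bid-test")
second_visit = assign_arm("user-123", "bid-test")
print(first_visit, second_visit)
```

A per-search split behaves like drawing a fresh random number on every search instead, which is why one person can end up seeing both arms.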
Equal everything else — The draft must match the original on budget, targeting, schedule and structure. The single permitted difference is your test variable. If the variant also has a higher budget or a different geo, you are no longer measuring what you think you are measuring.
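A quick pre-launch check can catch accidental extra differences. The sketch below compares two settings dictionaries and lists the keys that differ. The setting names and values are hypothetical placeholders, not fields from the Google Ads API.

```python
_MISSING = object()  # marks a setting that exists in only one arm

def differing_settings(control: dict, variant: dict) -> list:
    """Return the sorted names of settings that differ between the two arms."""
    keys = control.keys() | variant.keys()
    return sorted(
        key for key in keys
        if control.get(key, _MISSING) != variant.get(key, _MISSING)
    )

# Hypothetical settings; only the bid strategy is meant to change.
control = {
    "daily_budget": 100,
    "geo": "US",
    "schedule": "all_day",
    "bid_strategy": "maximize_conversions",
    "final_url": "https://example.com/offer",
}
variant = {**control, "bid_strategy": "target_cpa"}

diff = differing_settings(control, variant)
print(diff)
if len(diff) != 1:
    print("Warning: the test changes more than one variable")
```

If the list contains anything beyond your intended test variable, fix the draft before promoting it.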
Time the start — Launch at the beginning of a week and plan to run full weeks. Starting mid-week loads one arm with more weekend traffic than the other early on, which adds avoidable noise to the first read.
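Planning the window is simple date arithmetic. This sketch assumes the week starts on Monday, which is an assumption on our part, and returns a start and end date covering a whole number of weeks.

```python
from datetime import date, timedelta

def plan_window(today: date, weeks: int) -> tuple:
    """Return (start, end): start is the next Monday on or after `today`,
    end is the Sunday that closes the last full week."""
    days_until_monday = (7 - today.weekday()) % 7  # Monday is weekday 0
    start = today + timedelta(days=days_until_monday)
    end = start + timedelta(weeks=weeks) - timedelta(days=1)
    return start, end

# Example: planning a 4-week test on a Wednesday.
start, end = plan_window(date(2026, 5, 20), weeks=4)
print(start.isoformat(), end.isoformat())
```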
How much traffic and time do you need for significance?
This is where most experiments go wrong: they are stopped too early, on too little data, because the dashboard looked exciting. Significance is a function of conversions and effect size, not of how many days have passed.
Conversions, not clicks — Significance is driven by conversions per arm, not impressions or clicks. A rough working floor is 100 conversions per arm; fewer than 30 per arm is almost never conclusive. Clicks accumulate fast and tempt you to read early, but the conversion count is what actually decides the test.
Effect size sets the cost — The smaller the true difference, the more data you need to see it. Detecting a 30 percent swing might take a few hundred conversions per arm; detecting a 5 percent swing can take thousands. Decide upfront how big an effect is worth detecting and size the test for it.
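The relationship between effect size and data can be sketched with a standard two-proportion sample-size approximation. Everything here is an assumption for illustration: a two-sided test at 95 percent confidence, 80 percent power, a hypothetical 5 percent baseline conversion rate, and conversion rate per click as the metric. It is not the method Google uses for its confidence indicator.

```python
import math
from statistics import NormalDist

def conversions_needed_per_arm(baseline_rate: float, relative_lift: float,
                               alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate control-arm conversions needed to detect a relative lift
    in conversion rate with a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    clicks_per_arm = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(clicks_per_arm * p1)  # expected conversions in the control arm

big_swing = conversions_needed_per_arm(0.05, 0.30)    # 30 percent lift
small_swing = conversions_needed_per_arm(0.05, 0.05)  # 5 percent lift
print(big_swing, small_swing)
```

Under these assumptions the 30 percent swing needs roughly 190 conversions per arm and the 5 percent swing around 6,100, in line with the "few hundred versus thousands" rule of thumb above.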
Most tests need 2 to 4 weeks — In practice, accumulating enough conversions across full weeks lands most experiments in a 2 to 4 week window. If your account produces only 20 to 40 conversions a week, accept that you can only reliably detect large effects, and design bold tests accordingly.
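You can also turn the question around and ask what size of effect a given volume can realistically detect. The sketch below is a rough approximation that treats both arms as having the baseline variance, so it is least accurate for very large lifts. The 5 percent baseline rate, 80 percent power and 50/50 split are assumptions.

```python
import math
from statistics import NormalDist

def rough_detectable_lift(weekly_conversions: float, weeks: int,
                          baseline_rate: float = 0.05, arm_share: float = 0.5,
                          alpha: float = 0.05, power: float = 0.80) -> float:
    """Rough smallest relative lift in conversion rate the test can detect."""
    conversions_per_arm = weekly_conversions * weeks * arm_share
    z_total = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z_total * math.sqrt(2 * (1 - baseline_rate) / conversions_per_arm)

thin_account = rough_detectable_lift(weekly_conversions=30, weeks=4)
busy_account = rough_detectable_lift(weekly_conversions=200, weeks=4)
print(f"thin: {thin_account:.0%}, busy: {busy_account:.0%}")
```

For the hypothetical 30-conversions-a-week account, four weeks only supports detecting a lift of roughly 50 percent, while the 200-a-week account gets down to roughly 19 percent. That is why thin accounts should test bold changes.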
Do not lower the bar for speed — When volume is thin, extend the window rather than declaring a winner early. A fast read on a small sample is usually a false read, and acting on it costs more than the wait.
How do you read results without fooling yourself?
The hardest part of testing is not setup — it is resisting the stories your own brain tells about early data. Most false wins are self-inflicted, created by reading too soon and stopping too eagerly.
Peeking creates false wins — Early on, each arm has so few conversions that one lucky day can put the variant 40 percent ahead. If you check daily and stop the moment it looks good, you will lock in noise as if it were signal. Decide the sample size first, then ignore the dashboard until you hit it.
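A small simulation makes the cost of peeking concrete. It runs A/A tests, where both arms share the same true conversion rate, so every "significant" result is a false win. It then compares stopping at the first daily peek that crosses 95 percent with reading once at the end. The traffic figures, the 5 percent rate and the z-test are all assumptions chosen for illustration.

```python
import math
import random

def z_stat(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-proportion z statistic with a pooled standard error."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    if pooled in (0.0, 1.0):
        return 0.0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (conv_b / n_b - conv_a / n_a) / se

def simulate(runs: int = 300, days: int = 28, clicks_per_day: int = 100,
             rate: float = 0.05, seed: int = 7) -> tuple:
    """Return (false-win rate with daily peeking, false-win rate at the end)."""
    rng = random.Random(seed)
    peeked_wins = 0
    end_wins = 0
    for _run in range(runs):
        conv_a = conv_b = clicks = 0
        crossed = False
        for _day in range(days):
            clicks += clicks_per_day
            conv_a += sum(rng.random() < rate for _ in range(clicks_per_day))
            conv_b += sum(rng.random() < rate for _ in range(clicks_per_day))
            if abs(z_stat(conv_a, clicks, conv_b, clicks)) > 1.96:
                crossed = True  # a daily peeker would have stopped here
        peeked_wins += crossed
        end_wins += abs(z_stat(conv_a, clicks, conv_b, clicks)) > 1.96
    return peeked_wins / runs, end_wins / runs

peek_rate, end_rate = simulate()
print(f"false wins with daily peeking: {peek_rate:.0%}, single read: {end_rate:.0%}")
```

The single read should stay near the nominal 5 percent, while the daily peeker declares a winner far more often even though no real difference exists.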
Regression to the mean — An arm that surges early almost always drifts back toward the true value as data accumulates. The dramatic early gap is the least reliable number in the whole test, yet it is the one that tempts people to stop. Wait for the gap to stabilize.
95 percent is a gate, not a goal — Treat the 95 percent confidence indicator as the minimum bar to clear, not a target to celebrate. Clearing it means the difference is probably real; not clearing it means the result is unproven, which is not the same as a tie.
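If you want to sanity-check a result yourself, a two-proportion z-test is one common way to do it. This sketch reports one minus the two-sided p-value as a "confidence" figure, which is a simplification, and the guide does not specify how Google computes its own indicator. The click and conversion counts are hypothetical.

```python
import math
from statistics import NormalDist

def confidence_of_difference(conv_a: int, clicks_a: int,
                             conv_b: int, clicks_b: int) -> float:
    """Return 1 minus the two-sided p-value for a difference in conversion rate."""
    rate_a = conv_a / clicks_a
    rate_b = conv_b / clicks_b
    pooled = (conv_a + conv_b) / (clicks_a + clicks_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / clicks_a + 1 / clicks_b))
    z = (rate_b - rate_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return 1 - p_value

# Hypothetical arms with 2,000 clicks each.
clear_gap = confidence_of_difference(100, 2000, 130, 2000)
unproven_gap = confidence_of_difference(100, 2000, 120, 2000)
print(f"{clear_gap:.1%} vs {unproven_gap:.1%}")
```

In this example a 30 percent gap on 100 control conversions clears the 95 percent bar, while a 20 percent gap on the same volume does not. The second case is unproven, and the right move is more data, not a verdict.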
Judge on the right metric — Compare arms on cost per conversion and conversion value, not on clicks or CTR. A variant can win on engagement and still lose on the money metric that actually matters, so always anchor the decision to outcomes.
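A variant that wins on engagement can still lose on money, as this short sketch shows. All figures are invented for illustration, and `ArmStats` is a hypothetical helper, not a Google Ads object.

```python
from dataclasses import dataclass

@dataclass
class ArmStats:
    impressions: int
    clicks: int
    cost: float
    conversions: int
    conversion_value: float

    @property
    def ctr(self) -> float:
        return self.clicks / self.impressions

    @property
    def cpa(self) -> float:
        return self.cost / self.conversions

    @property
    def value_per_cost(self) -> float:
        return self.conversion_value / self.cost

# Hypothetical results: the variant earns more clicks but converts them worse.
control = ArmStats(50_000, 2_000, 4_000.0, 100, 12_000.0)
variant = ArmStats(50_000, 2_400, 4_800.0, 104, 12_480.0)

for name, arm in (("control", control), ("variant", variant)):
    print(f"{name}: CTR {arm.ctr:.1%}, CPA {arm.cpa:.2f}, "
          f"value per cost {arm.value_per_cost:.2f}")
```

Here the variant has the higher CTR but also the higher cost per conversion and the lower value per unit of spend, so it loses on the metrics that matter.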
How do you roll out or roll back a winning experiment?
A clean result is only useful if you apply it cleanly. The rollout step is where teams quietly reintroduce noise, either by reverting too fast or by resetting the learning they just paid for.
Apply, do not rebuild — When the variant wins, apply the experiment to update the original campaign rather than recreating it from scratch. Applying preserves history and signal where possible; rebuilding throws away learning and forces a fresh, costly ramp.
Expect a short re-learning dip — Applying a change, especially a bidding change, can trigger a brief learning period as the algorithm re-stabilizes. Plan for a few quiet days before the win fully shows in steady-state numbers, and do not panic-edit during them.
Discard cleanly on a loss: If the variant loses or the result is inconclusive, end the experiment and keep the control untouched. An inconclusive result is still a useful outcome: it tells you the change has not been shown to help, which saves you from rolling out an unproven change to your whole account.
Document every result — Record what you tested, the sample size and the outcome, win or lose. This stops your team re-running the same inconclusive test in three months and builds a library of what your account actually responds to. To convert raw rate differences into expected revenue impact before you commit, use our free 5-axis audit alongside the conversion rate calculator.
The experiment-design decision table
Use this table to choose the right test, the right split and the right read for the situation in front of you. It is ordered roughly from setup decisions to result-reading discipline.
Early in a test each arm has only a handful of conversions, so a single lucky day can put the variant 30 to 40 percent ahead before regression to the mean drags it back. Stopping there locks in noise as if it were a result and ships a change that does not actually help. Decide your sample size and minimum duration before launch, then ignore the dashboard until you hit them. A result that has not cleared 95 percent confidence is unproven, not a win.
How to put it all together
The discipline of testing compounds: each clean experiment makes the next decision cheaper and more confident. The accounts that improve fastest are not the ones that change the most, but the ones that prove the most.
Test the big levers — Spend your limited traffic on bidding and landing-page experiments where the leverage is real, and skip the cosmetic tweaks that cannot move CPA enough to reach significance anyway. Bold tests on thin volume beat subtle tests you can never prove.
Protect the read — Size the sample before you start, run full weeks, let Smart Bidding exit learning, and hold to the 95 percent bar even when an early gap tempts you. The whole value of an experiment is destroyed the moment you peek and stop early.
Build a habit — Document every test, win or lose, so your account accumulates a library of proven changes instead of a pile of unproven hunches. Over a year, a team that runs one clean experiment a fortnight learns more than one that ships ten blind edits a week.
To find the highest-leverage tests in your own account before you spend a week proving them, run the SteerAds free 5-axis audit, then size the expected impact of any rate change with our conversion rate calculator.
Sources
Official sources consulted for this guide:
- support.google.com: about campaign experiments
- support.google.com: about drafts
- blog.google: ads and commerce updates
- ads.google.com: Google Ads
FAQ
How do Google Ads experiments actually work?
An experiment splits a single campaign's traffic into two arms that run at the same time. You first create a draft, which is a sandbox copy of the original campaign where you make one change, then you promote that draft into an experiment with a traffic split, usually 50/50. From that point Google randomly assigns traffic to either the control or the variant (per search with a search-based split, per user with a cookie-based one), so both arms face the same seasonality, competition and audience. Because the two arms run concurrently rather than before-and-after, you isolate the effect of your change from time-based noise. The experiment dashboard then reports each arm's metrics side by side with confidence indicators.
How long should a Google Ads experiment run?
Run it until you reach the sample size you planned upfront, not until an arbitrary date, and never stop on the first good-looking day. In practice most experiments need 2 to 4 weeks because you need enough conversions per arm, not just enough clicks. A rough floor is around 100 conversions per arm before you trust a result, and fewer than 30 per arm is almost never conclusive. Always run across full weeks so both arms see the same weekday and weekend patterns. If volume is very low, extend the window rather than lowering your bar, because a fast read on thin data is usually a false read.
What can I actually A/B test in Google Ads?
The cleanest tests change exactly one variable so the result is interpretable. The four highest-value tests are bidding strategy, such as Maximize Conversions versus Target CPA; landing page, sending the variant arm to a different URL; ad copy and RSA assets; and audience or targeting changes. Bidding and landing-page tests usually move CPA the most, which is why they are worth the wait. Avoid bundling several changes into one experiment — if you change the bid strategy and the landing page together and CPA improves, you cannot tell which one did it, so you learn nothing reusable.
How many conversions do I need for a valid result?
It depends on the size of the effect you want to detect, but a practical rule is at least 100 conversions per arm for a moderate effect and far more to catch a small one. Detecting a 5 percent change reliably can take thousands of conversions per arm, while a 30 percent change shows up with a few hundred. The smaller the true difference, the more data you need to separate it from random noise. If your account only produces 20 to 40 conversions a week total, accept that you can only detect large effects, and design bold tests rather than subtle tweaks.
Why do experiments often show a false winner early?
Early in an experiment each arm has very few conversions, so random variation swings the numbers wildly — one lucky day can put the variant 40 percent ahead before regression to the mean pulls it back. This is why peeking at results daily and stopping the moment one arm looks good produces false wins so often. The fix is to decide your sample size and duration before you start, then ignore the dashboard until you hit it. Treat the 95 percent confidence indicator as a minimum gate, not a finish line, and remember that a result that is not significant is not a tie — it is simply unproven.
Experiments versus just changing the campaign directly — which is better?
An experiment is a controlled comparison; a direct change is a blind bet. If you simply edit the live campaign and CPA improves the next week, you cannot prove the edit caused it, because the weather, competitors, seasonality and your own other changes all moved at the same time. The experiment holds those constant by running both versions concurrently. The trade-off is that experiments split your volume, so each arm gets half the data and significance takes longer. Use experiments for any change big enough to matter and reversible enough to test — and just ship tiny, obviously-correct fixes directly.
Can I run a Google Ads experiment on Smart Bidding?
Yes, and bidding experiments are among the most valuable because the bid strategy drives so much of your CPA. You can compare two strategies — for example Maximize Conversions against Target CPA — or the same strategy at two different targets. The one caution is the learning period: each arm needs time to exit learning before its numbers mean anything, so add roughly one to two weeks on top of your normal significance window. Do not judge a bidding experiment while either arm is still learning, and avoid large mid-flight edits that reset that learning and contaminate the comparison.