A/B Testing for E-Commerce

Despite the common belief that online testing is "simple," meaningful A/B testing with statistically valid results is actually quite challenging. Many e-commerce professionals struggle with several key issues:

- Insufficient traffic for statistically valid tests
- Prematurely ending tests (the peeking problem)
- Tests running indefinitely without reaching significance
- Testing only conversion rate instead of ARPU (Average Revenue Per User)
- Not fully understanding what ARPU actually measures

In this guide, I'll share my key learnings about effective A/B testing for e-commerce, including why ARPU should be your primary metric, why 50/50 splits usually work best, and why you should start with big changes rather than minor tweaks. I'll also provide practical advice on test planning and when to seek expert help.

Whether you're new to A/B testing or looking to improve your current approach, these golden rules will help you avoid common pitfalls and develop a more effective testing strategy for your e-commerce business.

Let's dive into the specific rules for successful A/B testing:

Golden Rules

Use ARPU
For most tests you should look at ARPU (Average Revenue per User). Rule of thumb: if the change can affect average order value (AOV) and conversion rate (CVR) independently, use ARPU. Why ARPU and not the Average Revenue per Session (long read)
- You don’t need special statistical tests for non-binomial metrics (long read)
Split Evenly
Split your variants evenly (e.g. 50:50). Most of the time, an uneven split just means the test takes longer to reach a result, and it does not actually limit potential “negative” results. (long read)
Start with big changes
Sometimes you’ll read crazy stories about a button color change that brought in tens of thousands in incremental revenue. In reality, such small changes need an extremely large amount of traffic before their effect reaches statistical significance. So, if you don’t have millions of sessions, start with big changes (long read)
Plan ahead
Don’t start your test without some timing assumptions and a definition of the minimal detectable effect (MDE) that you want to be able to detect. You want to be able to say: “I want to detect an uplift of at least 2% in ARPU; for that, I’ll probably have to run my test for 2 weeks (based on my historical ARPU data).” A tool like Analytics Toolkit can help you there.
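
To make the planning rule a bit more tangible, here is a rough Python sketch of how the required traffic follows from the MDE. It assumes a two-sided z-test on the difference in ARPU (normal approximation, same standard deviation in both groups). The baseline ARPU, the standard deviation of revenue per user and the daily traffic are made-up placeholder numbers, and `required_total_users` is a hypothetical helper, not a function from Analytics Toolkit or any other tool.

```python
from math import ceil
from statistics import NormalDist


def required_total_users(baseline_arpu, sd_revenue, relative_mde,
                         variant_share=0.5, alpha=0.05, power=0.8):
    """Approximate total users (all variants) needed to detect the MDE.

    Assumes a two-sided z-test on the difference of two means with the
    same standard deviation in both groups (a simplification).
    """
    delta = baseline_arpu * relative_mde           # absolute uplift to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # power requirement
    split_factor = variant_share * (1 - variant_share)
    n_total = (sd_revenue ** 2 * (z_alpha + z_beta) ** 2
               / (delta ** 2 * split_factor))
    return ceil(n_total)


# Hypothetical inputs: replace them with your own historical data.
BASELINE_ARPU = 2.00   # average revenue per user
SD_REVENUE = 15.00     # standard deviation of revenue per user
DAILY_USERS = 20_000   # users entering the experiment per day

for mde in (0.02, 0.05, 0.10):
    users = required_total_users(BASELINE_ARPU, SD_REVENUE, mde)
    days = ceil(users / DAILY_USERS)
    print(f"MDE {mde:.0%}: ~{users:,} users, ~{days} days")

# An uneven split needs more total traffic for the same MDE.
even = required_total_users(BASELINE_ARPU, SD_REVENUE, 0.05, variant_share=0.5)
uneven = required_total_users(BASELINE_ARPU, SD_REVENUE, 0.05, variant_share=0.2)
print(f"50:50 needs {even:,} users, 80:20 needs {uneven:,} users")
```

Under these assumptions the required sample grows with 1/MDE², so halving the MDE roughly quadruples the traffic you need. That is the reason small tweaks are so expensive to test and an even split is usually the fastest way to a result.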

Who to follow

Samuel Hess with his ecommerce optimization agency Drip
Thomas McKinlay shares great testing results that are backed by scientific research
Also look at this post by Ton Wesseling; you’ll find plenty of people in the space who share insights.

Tech Tools

A/B-Testing Tools: Almost no free option anymore
ablyft.com / kameleoon.com / optimizely.com / abtasty.com / vwo.com (OMR Reviews) all provide testing capabilities as known from Google Optimize. With Google Optimize’s shutdown there is unfortunately no free or cheap option left for A/B testing with an easy-to-use editor.
- One free option is to make the changes yourself with your tag management solution, such as Google Tag Manager. It’s quite technical, but it could be a solution if you don’t have the budget for one of the tools above and do have the technical know-how (see tutorial)
A/B-Testing: Planning / Analytics
https://www.analytics-toolkit.com/ is a pretty good tool with lots of insights. Its creator Georgi Georgiev also wrote a book about A/B testing.

How to get ARPU data and analyze it yourself

Let’s say you’ve run an A/B test with one of the tools above or with a custom-built solution. Typically you would have flagged each user with the randomly chosen variant and sent that information to GA4 (or your alternative analytics tool).

The dataset that you’ll want to analyze would look like this:

experiment_variant	value
variant_1	34.95
control	0
control	14.99
variant_1	0
variant_1	40.23
control	45.99
control	0
control	0
variant_1	0

The total count of all rows should represent the total count of users that have been exposed to your experiment.
A row with a record above 0 represents a user with a transaction (e.g. a purchase).
A row with a record of 0 represents a user that has not converted but has been exposed to your experiment.
The average over all records represents the ARPU (average revenue per user) for all users that have been exposed to the experiment.
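
As a small illustration, the following Python sketch computes ARPU per variant from the example rows in the table above, together with conversion rate (CVR) and average order value (AOV). The function name `summarize` and the output layout are my own choices for this sketch; only the rows come from the example.

```python
from collections import defaultdict

# The example rows from the table above: (experiment_variant, value).
rows = [
    ("variant_1", 34.95),
    ("control", 0.0),
    ("control", 14.99),
    ("variant_1", 0.0),
    ("variant_1", 40.23),
    ("control", 45.99),
    ("control", 0.0),
    ("control", 0.0),
    ("variant_1", 0.0),
]


def summarize(rows):
    """Return users, CVR, AOV and ARPU per variant."""
    stats = defaultdict(lambda: {"users": 0, "buyers": 0, "revenue": 0.0})
    for variant, value in rows:
        stats[variant]["users"] += 1          # every row is one exposed user
        stats[variant]["revenue"] += value
        if value > 0:                         # a value above 0 is a purchase
            stats[variant]["buyers"] += 1

    summary = {}
    for variant, s in stats.items():
        summary[variant] = {
            "users": s["users"],
            "cvr": s["buyers"] / s["users"],
            "aov": s["revenue"] / s["buyers"] if s["buyers"] else 0.0,
            "arpu": s["revenue"] / s["users"],
        }
    return summary


summary = summarize(rows)
for variant, s in sorted(summary.items()):
    print(f"{variant}: users={s['users']} cvr={s['cvr']:.2f} "
          f"aov={s['aov']:.2f} arpu={s['arpu']:.3f}")
```

Note that ARPU equals CVR multiplied by AOV, which is why it captures changes in either of the two.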

How to get the data from GA4 (or any other analytics platform)

Typically it’s not that easy to get user-level data directly from GA4 (or any other analytics platform), but there’s an easy workaround.

Build a segment for the experiment that you’ve run. This would most likely be done using a custom dimension that you’ve set for each user you’ve exposed to the change.
Export all transactions for that segment split by variant:

transaction_id	transaction_value	experiment_variant
12342134	34.95	variant_1
23452345	14.99	control
21345453	40.23	variant_1
12345544	45.99	control

Count the total transactions per variant, e.g.:
variant_1: 1230
control: 1100

Now get the total number of users for each variant from the segment you’ve built before:

experiment_variant	total_users
variant_1	23453
control	23444

Now you have everything to build your dataset as shown above.

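Put the data from the transaction export in a table.
Subtract the number of transactions from the total users per variant and add that many rows with a value of 0 to your dataset. With the numbers above, that is 23453 - 1230 = 22223 zero rows for variant_1 and 23444 - 1100 = 22344 zero rows for control.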

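The simple assumption behind that approach is that there is only one transaction per user, and that all remaining users have not converted and therefore have to be represented by a 0 in your dataset. If some users bought more than once, the average (ARPU) is still correct, but the rows no longer map one-to-one to users, which distorts the variance that significance tests rely on.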
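
The steps above can be sketched in Python as follows. `build_dataset` and `compare_arpu` are hypothetical helper names for this sketch. The transactions are the four rows from the export example, and the user totals (4 and 5) are chosen only so that the result matches the small example dataset shown earlier; with real data you would plug in your own export and totals. The comparison uses a z-test on the difference in means with unequal variances, which is one common choice and an assumption on my part, and it is only meaningful with large samples, not with a toy dataset like this one.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

# Transaction export: (transaction_id, transaction_value, experiment_variant).
transactions = [
    ("12342134", 34.95, "variant_1"),
    ("23452345", 14.99, "control"),
    ("21345453", 40.23, "variant_1"),
    ("12345544", 45.99, "control"),
]

# Total users per variant (toy numbers, see the note above).
total_users = {"variant_1": 4, "control": 5}


def build_dataset(transactions, total_users):
    """Return {variant: [revenue per user]}, padded with 0 for non-buyers.

    Assumes at most one transaction per user.
    """
    dataset = {variant: [] for variant in total_users}
    for _transaction_id, value, variant in transactions:
        dataset[variant].append(value)
    for variant, users in total_users.items():
        non_buyers = users - len(dataset[variant])
        if non_buyers < 0:
            raise ValueError(f"more transactions than users for {variant}")
        dataset[variant].extend([0.0] * non_buyers)
    return dataset


def compare_arpu(control, variant):
    """Difference in ARPU with a two-sided z-test (large-sample approximation)."""
    diff = mean(variant) - mean(control)
    std_error = sqrt(variance(control) / len(control)
                     + variance(variant) / len(variant))
    z = diff / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return diff, z, p_value


dataset = build_dataset(transactions, total_users)
diff, z, p_value = compare_arpu(dataset["control"], dataset["variant_1"])
print(f"ARPU control:   {mean(dataset['control']):.3f}")
print(f"ARPU variant_1: {mean(dataset['variant_1']):.3f}")
print(f"difference: {diff:.3f}, z: {z:.2f}, p-value: {p_value:.3f}")
```

The number of rows per variant equals the number of exposed users, so the mean of each list is the ARPU for that variant.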

Example Calculations in Google Sheets

Grab a copy and put in your data; it should be pretty self-explanatory.

This is a free resource for subscribers to my newsletter – you'll receive this file and updates whenever I publish new content. I don't have the time to spam you – I just want to share my knowledge and build a community of like-minded people. And you can obviously unsubscribe anytime if I happen to not deliver on my promise.
