Split-Testing: Are yours statistically valid?

Split-testing of campaign emailings (and web pages) is growing as organisations' e-campaigning becomes more sophisticated. Yet ensuring each split-test is statistically valid is critical.

What is split-testing?

Split-testing is simply 'splitting' a group into two or more parts, each exposed to an 'ask' that differs by a single variable. The results of each 'ask' are then compared to see which one performed best. For emailings, this involves splitting the recipient list so that the emailing sent to each split differs by a single variable such as subject line, message tone, time/date of sending, etc.

The purpose of split-testing is to learn what will get a given recipient group to give the best response. Of course, what constitutes 'best' depends on your specific objectives and your test versions. Furthermore, results vary significantly between different group profiles, over time and often with seemingly small variations in the messaging. Thus others' results, while useful for planning your test, should not be assumed to apply to your audience and objectives.

Split-testing is a well-established technique: it has been used in direct marketing for decades and in science for centuries.

Email split-testing principles

  1. The people in each of the 'splits' have roughly the same composition as each other and as the overall group. A few sub-groups may have very different profiles, but as long as they are equally represented in all splits this is fine. A random split usually suffices for this (see the short sketch after this list).
  2. The messaging differs by only one variable. This is essential unless you venture into more complicated multi-variate testing. The single variable could be the subject line, with all other aspects (e.g. send date and time, sender, message body) kept the same. If more than one variable differs, the test results are compromised.
  3. Identify one of your versions as the 'control' version. This is normally the one that represents the current way of doing things against which your 'test' versions will be compared.
  4. Ensure each split (sample size) is large enough for the test to be statistically valid. Most emailing split-tests fail on this criterion.
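
For principle 1, a random split can be as simple as shuffling the recipient list and dealing it out. The short sketch below is one way to do it in Python; it assumes the list is just a sequence of email addresses, so adapt the data handling to your own tools:

    import random

    def random_splits(recipients, n_splits, seed=None):
        """Shuffle the recipient list and deal it into n roughly equal splits."""
        shuffled = list(recipients)
        random.Random(seed).shuffle(shuffled)   # fix the seed to make the split reproducible
        return [shuffled[i::n_splits] for i in range(n_splits)]

    # e.g. split_a, split_b = random_splits(email_list, 2, seed=42)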

Planning an email split-test

  1. Propose a split-test if not everyone agrees about what works best (this also helps avoid ego-bruising conflicts!)
  2. Find out what evidence already exists, internally or externally. This can help you shape the split-test.
  3. Run your split-test and share the results with peers across the sector
  4. Run more split-tests based on what you learn from the results
  5. Plan to re-run the split-tests in 12 months, effectively building up a yearly schedule of new and repeat split-tests

Being statistically valid

To be statistically valid, you should calculate the right sample size (for email, the number who open/click/participate, depending on what you are testing) for your population (total email list size) at a given confidence level (and/or percentage difference between results), which gives you the confidence interval (the percentage range above which differences between results are statistically valid at your confidence level).

If all that sounds confusing, use one of the many online sample size calculators.

I consider the sample size to be not the number of people sent or receiving the email, but the number of people who do the action you want to measure (open/click/participate). This is the same logic as public opinion surveys: pollsters don't call 1,000 people, they call as many as it takes to get 1,000 responses!

So for anyone doing an email split-test, this means you need to:

  1. know your 'normal' range of open/click/participation rates
  2. calculate the sample size needed for a given confidence interval and population
  3. calculate from your open/click/participation rates what size of split would get you this sample size
  4. do the split (or delay it) - see the sizing sketch below
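
If you would rather script this than use an online calculator, the Python sketch below implements the standard sample-size formula with a finite-population correction, using the usual worst-case 50% proportion most calculators assume, and then scales the result up by your expected open/click/participation rate:

    import math

    Z_SCORES = {90: 1.645, 95: 1.96, 99: 2.576}   # z-values for common confidence levels

    def sample_size(population, confidence_level=95, interval=0.01, p=0.5):
        """Responders needed for a finite population.

        interval is the confidence interval as a proportion (0.01 = +/-1%);
        p=0.5 is the most conservative assumption about the response split.
        """
        z = Z_SCORES[confidence_level]
        n0 = (z ** 2) * p * (1 - p) / interval ** 2          # infinite-population sample size
        return math.ceil(n0 / (1 + (n0 - 1) / population))   # finite-population correction

    def split_size(population, expected_rate, confidence_level=95, interval=0.01):
        """Recipients per split so that enough of them do the measured action."""
        needed = sample_size(population, confidence_level, interval)
        return needed, math.ceil(needed / expected_rate)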

You can, of course, do split-tests less scientifically, but then their results are more open to dispute - and that is often one of the two key reasons we run email split-tests: to settle a difference of opinion about what works best and to improve engagement.

Thus, if you:

  1. have 20,000 people on your email list (population) and you
  2. want a confidence level of 95% with a
  3. confidence interval of 1% then
  4. you need a sample size of 6,489.

If you are comparing split-tests on open rates and your norm is 40%, each 'split' needs to be about 16,222 people (6,489/0.4) - and thus you don't have enough people for a split-test at this confidence level and interval.

By increasing the confidence interval to 2%, you only need a sample size of 2,144, and thus each split needs to be 5,360 people (2,144/0.4). You can then either have two splits and email the rest with the best-performing version, or three splits (almost four) and learn the results for next time.
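
Running the sketch from above on this example reproduces the same figures (give or take rounding):

    print(split_size(20_000, expected_rate=0.4, interval=0.01))
    # -> (6489, 16223): the ~16,222-person splits above, more than the whole list
    print(split_size(20_000, expected_rate=0.4, interval=0.02))
    # -> (2144, 5360): so two or three splits of 5,360 fit within 20,000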

It also means that with a confidence interval of 2%, any difference of less than 2% isn't statistically valid - i.e. it is an insignificant difference!

You can start to see that having a large email list (e.g. > 50,000 subscribers) and high open/click/participation rates are almost pre-conditions for doing email split-testing :-)

Text vs HTML Emails

A debate about mass emailings that frequently re-surfaces (see this from 2000 and this from 2004) is the effectiveness of HTML vs. plain text. This makes it (per 'Planning' point 1 above) a perfect candidate for a split-test.

Technology-wise, multi-part emails make it possible to include both a plain-text version and an HTML version in a single email, so that people can choose with their email reader (e.g. Outlook, Gmail) which version they want to read. But this is more than a technology issue.
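
As a rough illustration of the technology side, here is how a multipart/alternative message carrying both versions can be assembled with Python's standard email library (the addresses and bodies are placeholders):

    from email.mime.multipart import MIMEMultipart
    from email.mime.text import MIMEText

    def build_multipart_email(subject, sender, recipient, text_body, html_body):
        """Bundle plain-text and HTML versions of the same message into one email."""
        msg = MIMEMultipart("alternative")           # readers pick the part they prefer
        msg["Subject"] = subject
        msg["From"] = sender
        msg["To"] = recipient
        msg.attach(MIMEText(text_body, "plain"))     # fallback part goes first
        msg.attach(MIMEText(html_body, "html"))      # preferred part goes last
        return msg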

The existing evidence is not consistent: some tests say HTML gets higher click rates (it won't affect open rates for this test), while others say text does. The published results illustrate this:

  1. The 2 August 2007 test finds:
    • A "Lite HTML" email outperformed Plain Text by 55% in click-through rate.
    • A "Heavy HTML" (Ad style) email vs. Plain Text email. Plain Text outperformed Heavy HTML by 34%.
  2. The MailerMailer reports find:
    • June 2009: "There is virtually no difference in click rates between HTML (3.05%) and text email messages (2.95%)"
    • June 2006: "recipients are more likely to click on links in html emails (3.31%), than plain text (2.71%)"

Despite this, we can learn a few things from them for our own tests:

  1. Having a 'heavy' html (e.g. highly designed email newsletter) email may reduce click-throughs
  2. The same test separated by a few years may have different results.

So why does light vs. heavy html make a difference? The tests don't tell us that, but we can speculate:

  • People probably like being communicated with but not marketed to. Think of it like snail-mail: a flier through the mailbox barely gets a glance; a personally addressed, hand-written (or at least signed) letter gets opened, read and saved.
  • In email terms, 'light html' shows you've taken some care to communicate clearly, heavy html says you are marketing, and plain text is like a quick note on a post-it put through the door.
  • In terms of the differences between years, people are becoming more experienced with email and are learning and adapting to what they receive. Thus their behaviour changes too.

Email and Mobile Phones

The rapid rise of the use of mobile phones for fetching email and browsing the web means plain text still needs to be a consideration. Specifically:

  • HTML versions that can be read without images loading (both for phones and for modern email readers)
  • Multi-part emails so that a device can use the plain text version if it is configured properly
  • Avoiding fixed-width emails so that they 'flow' properly into different sized devices

Perhaps the real question in the "html or plain text email" debate should be "html or plain text email for what" as this 13 Dec 2007 article starts to explore.

by Duane Raymond, published Oct 19, 2009
Clementine

How do you compute the normal range for your open rates?

  • May 15, 2010 09:32 pm