<aside> Welcome to this page, folks! This document is essentially my experimentation diary, where I've written down the insights, best practices, and lessons I've picked up along the way. I won't claim to be the ultimate expert in this field, but if my notes and experiences can help others, I thought: why not share them?
</aside>
A/B testing is a type of controlled experiment used to find out how changes to a website, product, or service affect users and key metrics. It is the most reliable way to verify that a change has the desired effect, such as increasing user engagement and conversions or improving the overall user experience.
A good introduction is the one used by Ron Kohavi in his book "Trustworthy Online Controlled Experiments":
In 2017, a Bing employee suggested changing how headlines were displayed on the search engine. The idea was to make search result titles longer by combining them with elements of the description.
Nobody thought this would be the change leading to the largest revenue increase in Bing's history. After six months in the backlog, one developer decided to implement it quickly and evaluate how it performed on users: randomly showing some of them the old title layout and others the new one. User interactions with the website were recorded, along with metrics like clicks and revenue. This is an example of an A/B test, a simple type of experiment comparing two variants (control and treatment).
A few hours later, an alert was raised at Bing for "revenue too high", flagging that the number was too big to be true and that an error must have occurred in the experiment: the new layout was generating too much ad revenue. But it wasn't a mistake. Revenue increased by 12% with this simple modification, which nobody expected to have such an impact.
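Mechanically, the randomization behind such a test is simple: each user is deterministically assigned to a variant, typically by hashing their identifier. Here is a minimal sketch of hash-based assignment with a 50/50 split; the function and experiment names are illustrative, not Bing's actual implementation:

```python
import hashlib

def assign_variant(user_id: str, experiment: str) -> str:
    """Deterministically bucket a user into control or treatment.

    Hashing (experiment, user_id) keeps the assignment stable across
    sessions while remaining effectively random across users.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # pseudo-random value in [0, 100)
    return "treatment" if bucket < 50 else "control"  # 50/50 split

print(assign_variant("user_42", "long_titles"))  # same answer on every call
```

Because the assignment depends only on the user and experiment identifiers, a returning user always sees the same variant, which is essential for a clean comparison.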
The Bing story illustrates several things:
<aside> In addition, A/B testing is the best scientific way to establish causality with high probability (unlike a purely observational study of two groups or qualitative feedback, for instance).
</aside>
*Figure: hierarchy of evidence for assessing the quality of trial designs.*
Let's define the main terms so that we can all speak the same language.
All these terms describe the same thing: a quantitative measurement of the experiment's objective. It should be observable in the short term but drive long-term impact (this article from Meta explains why).
NB: you should always use the same metric over time, company-wide or at least team-wide. Everyone works toward the same goal, and the A/B test results need to reflect the impact of the modification on that goal. I also recommend normalizing key metrics by the actual sample size: for instance, don't use an EOC like "number of clicks" but rather $\text{nb clicks} / \text{nb pageviews}$. This avoids obvious biases related to volume.
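As a minimal illustration of that normalization (the counts and the `results` layout are invented for this sketch), compare two variants on clicks per pageview rather than raw clicks:

```python
# Hypothetical per-variant totals; the numbers are made up for illustration.
results = {
    "control":   {"clicks": 1_200, "pageviews": 48_000},
    "treatment": {"clicks": 1_450, "pageviews": 47_500},
}

for variant, counts in results.items():
    # Normalizing by exposure removes the bias a raw click count would
    # carry when the two groups receive different amounts of traffic.
    ctr = counts["clicks"] / counts["pageviews"]
    print(f"{variant}: clicks per pageview = {ctr:.4f}")
```

Here the treatment group has fewer pageviews but more clicks; the raw counts alone would understate its advantage, while the normalized metric makes the two groups directly comparable.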