A/B testing is a type of controlled experiment used to measure how changes to a website, product, or service affect users and key metrics. It's the most reliable way to verify that your changes have the desired effect, such as increasing user engagement, improving conversions, or enhancing the overall user experience.
A good introduction is the one used by Ron Kohavi in his book “Trustworthy Online Controlled Experiments”:
In 2017, a Bing employee suggested changing how headlines were displayed on the search engine. The idea was to make the title of each search result longer by combining it with elements of the description.
Nobody thought this would be the change leading to the largest revenue increase in Bing's history. After six months in the backlog, one developer decided to implement it quickly and evaluate how it performed on users: randomly showing some of them the old title layout, and others the new one. User interactions with the website were recorded, along with metrics like clicks and revenue generated. This is an example of an A/B test, a simple type of experiment comparing two variants (control and treatment).
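The random split described above is usually implemented with deterministic bucketing, so a returning user always sees the same variant. Here is a minimal sketch (the function name, experiment label, and 50/50 split are illustrative assumptions, not from the book):

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "title-layout") -> str:
    """Deterministically bucket a user into control or treatment (50/50 split).

    Hashing (experiment, user_id) keeps assignment stable across sessions
    and independent across experiments.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable value in [0, 100)
    return "treatment" if bucket < 50 else "control"
```

Because the bucket depends only on the hash, the same user always lands in the same group, which is essential for measuring behavior over time.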

A few hours later, an alert was raised at Bing for “revenue too high”, flagging that the increase was too big to be true and that an error must have occurred in the experiment: the new layout was generating too much ad revenue. But it wasn't a mistake; revenue increased by 12% with this simple modification that no one expected to have such an impact.
This example illustrates several things:
<aside> 📖 In addition, A/B testing is the best scientific way to establish causality with high probability (unlike a purely observational study of two groups or qualitative feedback, for instance).
</aside>

Hierarchy of evidence for assessing the quality of trial designs
Let’s define the main terms so that we can all speak the same language.
All these terms describe the same thing: a quantitative measurement of the experiment's objective. It should be observable in the short term but drive long-term impact (this article from Meta explains why).
NB: You should always use a metric that stays the same over time for the company, or at least for a team. Indeed, you all work toward the same goal, and the A/B test results need to reflect the impact of the modification on that goal. I recommend normalizing key metrics by the actual sample size. For instance, don't use an OEC like “number of clicks” but rather $\text{nb clicks}/\text{nb pageviews}$. This avoids obvious biases related to volume.
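A small sketch of why normalization matters, with hypothetical numbers (the figures below are invented for illustration): a larger control group can win on raw clicks while losing on the per-pageview rate.

```python
def normalized_metric(clicks: int, pageviews: int) -> float:
    """Normalize a volume metric by exposure so unevenly sized variants are comparable."""
    if pageviews == 0:
        return 0.0
    return clicks / pageviews

# Hypothetical traffic: control got twice the pageviews of treatment.
control_rate = normalized_metric(clicks=4_800, pageviews=100_000)   # 0.048
treatment_rate = normalized_metric(clicks=2_600, pageviews=50_000)  # 0.052

# Raw clicks favor control (4 800 > 2 600), but per pageview
# treatment actually performs better (0.052 > 0.048).
```

Comparing `4_800` vs `2_600` raw clicks would wrongly crown control; the normalized rate removes that volume bias.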