I've dealt with this enough that at this point I'm convinced all companies that do this fail to see the users through the metrics. A/B testing is overvalued.
We stopped doing A/B tests after I insisted that they all be done as A/A/B tests. Suddenly the "clear winners" weren't so clear after all. It confused and frustrated the marketing department so much that it was decided to just stop doing them altogether.
The reason I wanted this type of test was that it was a waste of time testing shades of blue or two headlines that only differed by two words. The test variants were never radical enough to show any kind of significant uplift. Then, after 5-10 tests, the design starts to suffer, wandering down some weird path that nobody would consciously design from the outset; the series of test "winners" sends things off in wild directions.
I still think there is some value in A/B testing (A/A/B only, if I'm honest). But in a small team, it's a waste of time.
For an A/A/B test are you taking three samples (instead of two), and two of the three get shown the same thing (A)? Then you only consider the results for the B group if the two A groups show the same behavior?
Not the person you're responding to, but yes, that's the idea. It's a control not for the B but for the unknown unknowns that may or may not be there.
If A' and B both statistically differ from A, then you have a problem because you're not testing what you think you are testing, regardless of what your naive A/B test's p-value would have indicated.
You take three samples and two of the three get shown the same thing. What happens is that the two A groups will show different results until your sample grows enough that the confidence interval becomes narrow. This helps show the effect of a small sample from a non-uniform distribution.
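A minimal simulation of the idea, assuming a simple conversion-rate metric and a two-proportion z-test (the rates and sample size are made up for illustration):

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(42)
n = 500                     # visitors per arm; try 50_000 to watch the intervals tighten
p_a, p_b = 0.10, 0.11       # true conversion rates; B carries a small real uplift

conv_a1 = rng.binomial(n, p_a)   # first A arm
conv_a2 = rng.binomial(n, p_a)   # second A arm, identical treatment
conv_b  = rng.binomial(n, p_b)   # B arm

def compare(label, c1, c2):
    _, p = proportions_ztest([c1, c2], [n, n])
    print(f"{label}: {c1/n:.3f} vs {c2/n:.3f}, p = {p:.3f}")

compare("A vs A'", conv_a1, conv_a2)   # "significant" here means bad luck or a broken setup
compare("A vs B ", conv_a1, conv_b)    # only worth reading once the A/A comparison is quiet
```

With small n, the observed rates in the two A arms can look quite different from run to run, which gives a concrete feel for how much of any A vs B gap is just noise.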
A lot of people (me included) think they know statistics, but they don't.
The blog post in the OP also tries to explain the same thing - you shouldn't do statistics without understanding what it is you're doing.
Don't you then need to run the experiment for a long, long time to reach significance? Plus the site needs enough users viewing the page to run it in any reasonable amount of time. I would think most sites don't get enough traffic to run A/A, let alone A/B?
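For a sense of scale, the standard normal-approximation formula for a two-proportion test gives something like this (the baseline rate and uplift are assumptions, not numbers from the thread):

```python
from scipy.stats import norm

def n_per_arm(p1, p2, alpha=0.05, power=0.8):
    """Approximate sample size per arm to detect a p1 -> p2 shift in a two-proportion z-test."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2

# Detecting a 10% -> 11% conversion lift at alpha = 0.05 with 80% power:
print(round(n_per_arm(0.10, 0.11)))   # roughly 14,700 visitors per arm
```

A 10% to 11% lift needs on the order of 15,000 visitors per arm, so an A/A/B split is around 45,000 visitors before the test even has decent power.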
My experience is similar. Even if and when the metrics are calculated properly, there's often some design or business reason put forth as an excuse to ignore them.
Before any effort goes into something like this I always raise my hand and ask, "If the data shows us something we don't want to see, will we change our strategy? If not, I'd rather put time/effort into other projects." It works about 80% of the time.
Metrics are only useful if the organization is actually willing to learn lessons from them.
Oh yeah, defining the thresholds and their associated courses of action ahead of time is important to make a good decision.
Any time someone wants to measure something, the top two questions should be: what lower bound, and what upper bound, does this value have to cross for us to do something different?
Very often, it turns out these thresholds for change are so astronomical that nobody thinks we have even the slightest chance of exceeding them. That means the measurement is completely useless. Whatever result we plausibly get, it won't change anything.
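One way to force that conversation is to write the decision rule down before the experiment runs. A hypothetical sketch (the metric name and thresholds are invented for illustration):

```python
# Pre-registered decision rule: thresholds and actions are agreed on before any
# data comes in, so the result can't be argued away afterwards.
DECISION_RULE = {
    "metric": "signup_conversion_lift",   # invented metric name
    "ship_if_above": 0.02,                # +2% lift or better: roll out B
    "revert_if_below": -0.01,             # worse than -1%: kill B
    # anything in between: inconclusive, keep A and move on
}

def decide(observed_lift: float) -> str:
    if observed_lift >= DECISION_RULE["ship_if_above"]:
        return "ship B"
    if observed_lift <= DECISION_RULE["revert_if_below"]:
        return "revert to A"
    return "inconclusive: keep A, spend the effort elsewhere"
```

If nobody can name thresholds they actually believe are reachable, that's the signal the measurement won't change anything.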
Not to mention the decision paralysis and change aversion it often introduces into company culture, where every change, however trivial or however obviously beneficial, has to first go through a two-week A/B test that often turns out to be inconclusive anyway, and sometimes takes more eng resources to set up and run than the change itself.
Previous testing should give the company at least some baseline understanding of what is trivial and what isn't. The correct way to experiment is certainly not "let's run an experiment on every idea!"
> however obviously beneficial
If you've been around long enough, you've almost certainly run into dozens of "obviously beneficial" changes that led to poorer performance.
Most of what you're describing is issues with poor prioritization, a lack of understanding about your audience, and a culture that has a difficult time making decisions.
I've had a similar experience. Some companies will A/B test things like fonts and button colors, yet ignore bigger things like the content itself. It's absurd.