In this post I would like to share all of my knowledge about experiments in product management. I would like to point out that I’m not experienced in conducting experiments. However, I would like to share my approach and knowledge about experiments here, so that You, as my future employer, will know in detail what I bring to the table.
It’s just a tool.
In my opinion, experiments should be perceived as just one of many research tools. Experiments are not a good fit in every situation, nor in every organisation. I see experiments as a method which can help with decision making during the discovery phase, with day-to-day support of the Development Team, and in any situation where only data can tell us whether a new idea will bring value or not.
So when is a good time to do a test?
It’s crucial to make sure that experiments are worth it. In theory, experiments should provide a way to validate ideas and ultimately maximise added value (more about the value later). If your organisation doesn’t have proper tools in place, or experiments are not perceived as a valid way of prioritisation, trying to introduce experiments might be an effort with very low ROI.
But if the above is not an issue, experiments can help in situations like these:
- Support of the discovery phase
If there is any major assumption on which the new feature is built, I would like to test that assumption and iterate based on the outcomes.
- UX/UI tweaks
If the UX or UI Team suggests an improvement - let’s test it.
- PM’s initiatives
If you would like to build a case for a new initiative to present to your decision makers, the outcomes of an experiment might make a powerful statement.
- Word against word
When you are surrounded by talented professionals, situations may arise in which someone suggests an improvement with very high confidence. Being able to test the idea, instead of having lengthy conversations about pros and cons, is simply a much better solution.
Who should do the test?
According to the opinion I agree with the most, experiments should ideally be conducted by the Data Analytics Team. The Product Manager should initiate the process by asking the right questions, but all technicalities and the interpretation of the outcomes should be carried out by professionals specialised in data analysis. If that’s possible in your organisation then great, but it’s rarely the case. I believe that whenever possible you should ask people specialised in data analysis for help, in order to minimise the risk of misinterpreting the outcomes, but if their help is unavailable, you should be able to fill in.
The framework
And in order to do that, you need to know how. Here is how I would go about it.
1) The hypothesis
An experiment should resolve an issue, and a clear hypothesis will make the experiment relevant. To build a hypothesis, three elements are needed:
- Insight
- Change
- Expected uplift of a metric
By insight I mean any piece of data, heuristic, benchmark, best practice, suggestion from customer service or conclusion made during an interview with a User. In other words, any solid and documented clue pointing to a problem, a space for improvement or a User’s desire for a new feature.
By change I mean a product solution that should address that clue: new positioning of elements on the page, new copywriting, a change in the customer journey, a change of the currency or the infamous colour of the button.
By expected uplift of the metric I mean the reason why we are conducting the experiment in the first place. An example hypothesis:
We know that in the last three months only 21% of freshly onboarded Users in the US funded their accounts. Based on GA data and surveys, we know that after creating an account the User needs to wait 1-2 days for the account to be verified, and that’s when they drop off.
We believe that changing the “Go to trading platform” call to action button for US Users will result in an uplift of the conversion from “account created” to “account funded”. As a result of the change, Users will be more willing to visit and familiarise themselves with the trading platform while their account is being verified, understand the value proposition and ultimately become more loyal Customers.
We’ll know by testing the “Enforced Trading Call to Action” UI design for all US Customers who see the MyAccount page for the first time and observing 1) the conversion rate “Account created” -> “Account funded” as the main metric and 2) the % of Users who clicked the new CTA button as an assisting metric. We hope to see a minimum uplift of 5.2 percentage points in CR.
A good hypothesis will show why we think there is an issue, how exactly we would like to fix it and which metrics we would like to influence.
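Purely as an illustration, here is a minimal sketch of how such a hypothesis could be written down in a structured way (the field names below are my own, not any standard):

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    insight: str                    # the documented clue behind the test
    change: str                     # the product change addressing the clue
    primary_metric: str             # the metric we expect to move
    assisting_metrics: list = field(default_factory=list)
    expected_uplift: str = ""       # e.g. "+5.2 pp on CR created -> funded"

example = Hypothesis(
    insight="Only 21% of freshly onboarded US Users fund their accounts; "
            "they drop off while waiting 1-2 days for verification.",
    change="Show the 'Enforced Trading Call to Action' UI on the first visit to MyAccount.",
    primary_metric="CR 'Account created' -> 'Account funded'",
    assisting_metrics=["% of Users who clicked the new CTA"],
    expected_uplift="+5.2 percentage points",
)
```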
I would like to spend some time on the numbers which you can see in the hypothesis. In order to get these numbers you need to run pre-calculations. Again, if there is a Data Analytics Team in your organisation that can help you with the pre-calculations - go to them. But as strongly as I believe that “Data” should be done by “Data People”, I also believe that a PM should know what the numbers mean.
Pre-calculations will help us understand whether we even stand a chance of running the experiment successfully. Our biggest enemy is low traffic. Into the pre-calculations we enter the currently observed conversion rate, the traffic on our site and the desired uplift. In return we will see the minimum sample size, or the minimum detectable change given our webpage traffic data.
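For those curious what such a calculator does under the hood, here is a minimal sketch of the standard sample-size approximation for a two-proportion test, using only the Python standard library. Real calculators may make different assumptions, so expect the numbers to differ from tool to tool:

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline_cr: float,
                            absolute_uplift: float,
                            alpha: float = 0.05,
                            power: float = 0.80) -> int:
    """Approximate number of Users needed per variant for a two-proportion z-test."""
    p1 = baseline_cr
    p2 = baseline_cr + absolute_uplift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / absolute_uplift ** 2)

# Example: 21% baseline CR, desired uplift of 5.2 percentage points
print(sample_size_per_variant(0.21, 0.052))   # on the order of 1,000+ Users per variant
```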
To follow up on the example hypothesis, let’s say we have 2k new accounts created every month and the current conversion rate from “account created” to “account funded” equals 21%.
After entering the numbers into a pre-test calculator, we can see that the required minimum sample size equals 1723 Users per variant, which means that the test should last around 7 weeks. In order to achieve the required levels of statistical power and significance, we need to observe roughly a 25% relative uplift of the conversion rate, which is around 5.2 percentage points.
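As a rough illustration of where the 7 weeks come from (assuming the 1723 Users are needed per variant and traffic is split 50/50 between the two versions):

```python
# Hypothetical duration estimate based on the example numbers above.
required_per_variant = 1723                           # from the pre-test calculator
monthly_new_accounts = 2000
weekly_new_accounts = monthly_new_accounts * 12 / 52  # ~461 eligible Users per week

weeks_needed = required_per_variant * 2 / weekly_new_accounts  # two variants
print(f"Estimated duration: {weeks_needed:.1f} weeks")         # ~7.5 weeks
```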
2) Let’s run the test
Here, I would focus on segments and the duration of the test.
In the example hypothesis, I mentioned US Users as the targeted segment. That’s a broad segment. In order to gain as much control over the outcomes of the test as possible, it is best to test the smallest possible change on the smallest possible segment. That way we increase the certainty about which action resulted in which change of the metrics. The reason is that Users in the US can exhibit many different behaviours, and without targeting our experiment at a smaller group of Users who share similar characteristics (for example financial status, lifestyle or the goal of using our product), we can’t be sure whether the assumptions in the hypothesis were actually tested. The same goes for the change on our website: the more changes we implement in the tested version, the more difficult it becomes to determine which change impacted the metric. Ideally, we would like to have a smaller, properly categorised group of Users and only one element changed.
The duration of the test depends on the traffic. A pre-test calculator can suggest a realistic timeframe for the experiment. While the experiment is live, we should monitor the sample size of Users who saw the test version of our website, as well as the p-value and the statistical power of the experiment. Without more advanced statistical knowledge, it is good to wait until the sample size hits the number from the pre-test calculator.
Monitoring the p-value and the statistical power of the test, and making sure they are at good levels, will protect us from type I and type II errors. A type I error (false positive) is when we observe an uplift in the metric but in reality there is no uplift. Keeping the p-value in check guards against type I errors.
A type II error (false negative) is when we don’t see any uplift but in reality there is an improvement in the observed metric. Sufficient statistical power guards against type II errors.
The p-value and power will be calculated by the A/B testing tool of your choice, and as a rule of thumb your experiment should achieve a p-value < 0.05 and power of at least 80%.
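If you ever need to sanity-check these numbers yourself, here is a minimal sketch of a two-proportion z-test for the p-value, plus an approximate power calculation; the conversion counts below are hypothetical placeholders:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (observed uplift, two-sided p-value) for variant B vs variant A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, p_value

def approximate_power(p_a: float, p_b: float, n_a: int, n_b: int, alpha: float = 0.05):
    """Approximate power to detect the planned difference p_b - p_a at these sample sizes."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return NormalDist().cdf(abs(p_b - p_a) / se - z_alpha)

# Hypothetical results once the test reached its planned sample size of 1723 per variant:
uplift, p_value = two_proportion_test(conv_a=362, n_a=1723, conv_b=452, n_b=1723)
power = approximate_power(0.21, 0.21 + 0.052, 1723, 1723)
print(f"uplift={uplift:.3f}, p-value={p_value:.4f}, power~{power:.2f}")
```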
It is very important that if you don’t have access to the p-value or power, you should not consider the outcomes of the test valid.
3) Documentation
It is very important to document the tests. The outcomes of an experiment can be used in the future as an insight and enrich the knowledge about your Users. By retaining that knowledge we make sure that we don’t duplicate work and, over time, our methodology of conducting experiments will improve as we learn from past mistakes and successes.
In the documentation of a test I would include the hypothesis, the insight which was used as the reason for the test, the monitored metrics, the UX/UI designs and technical details, even pieces of code if any were written to conduct the experiment. I would also conclude the test with the results: which version was successful, why we think that is, and whether there were any interruptions during the test. Every piece of information is important, because in 2 weeks you won’t remember it.
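As an illustration, a hypothetical shape of such a record (the field names are mine; adjust them to whatever your team’s wiki or test repository uses):

```python
# A made-up experiment record, just to show the kind of information worth keeping.
experiment_record = {
    "hypothesis": "Enforced Trading CTA lifts CR 'created' -> 'funded' by 5.2 pp",
    "insight": "US Users drop off while waiting 1-2 days for account verification",
    "metrics": {"primary": "CR created -> funded", "assisting": "% clicked new CTA"},
    "designs": ["link to UX/UI mockups"],
    "technical_details": ["feature flag name", "links to code, if any was written"],
    "results": {
        "winner": "test variant",
        "why_we_think_so": "p-value < 0.05 at the planned sample size",
        "interruptions": "none",
    },
}
```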
Value of the test
In the given example of an experiment, we were expecting an uplift of 5.2 percentage points in CR. By using LTV, an uplift in CR can be converted into the financial benefit that the company might enjoy. However it is achieved, I believe that experiments should be treated as an investment. That means that conducting an experiment should be treated as a cost, and the outcome of the test should bring financial gains to the company. It is very important to keep that mindset throughout the whole process and to be able to prove the real value of an experiment whenever you are asked about it.
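As a rough, back-of-the-envelope illustration (the LTV figure below is a made-up placeholder, not taken from the example):

```python
# Converting the expected CR uplift into a yearly financial benefit, with hypothetical LTV.
monthly_new_accounts = 2000        # from the example above
uplift = 0.052                     # expected absolute uplift in CR
assumed_ltv = 400                  # hypothetical average LTV of a funded account, in USD

extra_funded_per_month = monthly_new_accounts * uplift           # ~104 extra funded accounts
added_value_per_year = extra_funded_per_month * 12 * assumed_ltv  # ~499,200 USD
print(f"~{extra_funded_per_month:.0f} extra funded accounts/month, "
      f"~{added_value_per_year:,.0f} USD of added value per year")
```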