Statistical Methods

Statistical Methods Used

Convert supports two distinct statistical approaches for your reports: Frequentist and Bayesian.

Frequentist

Important Update: As of October 13th, 2023, all new experiences utilize t-tests instead of z-tests. Experiences initiated prior to this date employed z-tests for statistical significance.

Frequentist reports use t-tests. You can configure the following parameters:

The confidence level. Typically, a 95% confidence level is used for most experiments. However, if your experiment is of critical importance, selecting a 99% confidence level would be prudent.

The test type, either a one-tailed or two-tailed t-test. We generally recommend a one-tailed test for standard experiments: it tests for a difference in one direction only, so it tends to reach significance sooner. For high-stakes, mission-critical experiments, opt for a two-tailed test, which detects differences in either direction. It may take longer to reach significance but provides a more conservative result.

Concerning multiple comparison correction techniques, you can choose from Bonferroni, Sidak, or None. Both corrections control the family-wise error rate when an experience compares several variations. Sidak is slightly less conservative than Bonferroni, so it gives up a little less statistical power, which makes it our recommended choice, especially for mission-critical experiments. Nevertheless, the choice is yours.
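To make the difference between the corrections concrete, here is a minimal sketch of the per-comparison significance threshold each option produces. It uses the textbook Bonferroni and Sidak formulas; the function name `per_comparison_alpha` is hypothetical and is not part of Convert's product or API.

```python
def per_comparison_alpha(alpha: float, comparisons: int, method: str = "sidak") -> float:
    """Significance threshold applied to each individual comparison.

    alpha is the overall (family-wise) error rate, e.g. 0.05 for 95% confidence.
    """
    if comparisons < 1:
        raise ValueError("comparisons must be at least 1")
    if method == "none":
        return alpha                      # no correction
    if method == "bonferroni":
        return alpha / comparisons        # simple, slightly conservative
    if method == "sidak":
        # Exact when the comparisons are independent.
        return 1.0 - (1.0 - alpha) ** (1.0 / comparisons)
    raise ValueError(f"unknown method: {method}")


# Three variations compared against the original at 95% confidence.
for method in ("none", "bonferroni", "sidak"):
    print(method, per_comparison_alpha(0.05, 3, method))
```

With more than one comparison, the Sidak threshold is always slightly larger than the Bonferroni threshold, which is why it costs a little less power.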

Additionally, a Sensible Defaults menu is available. It allows you to quickly set "preferred" parameter values based on the criticality of your test: "standard" or "mission-critical."
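The sketch below illustrates how the confidence level and test type interact. It is an illustration under stated assumptions, not Convert's implementation: it assumes Welch's unequal-variance t-test (the article only says "t-tests"), the `PRESETS` values are hypothetical and simply mirror the recommendations above, and the sample data is made up. The p-value is computed by numerically integrating the Student's t density so the example needs only the standard library.

```python
import math
from statistics import mean, variance

# Hypothetical presets mirroring the recommendations above (not the product's actual values).
PRESETS = {
    "standard": {"confidence": 0.95, "tails": 1},
    "mission_critical": {"confidence": 0.99, "tails": 2},
}


def t_sf(t, df, steps=2000):
    """Upper-tail probability P(T > t) for Student's t, via Simpson's rule."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    a = abs(t)
    h = a / steps                      # steps must be even for Simpson's rule
    area = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    area *= h / 3                      # integral of the density from 0 to |t|
    upper = max(0.0, 0.5 - area)
    return upper if t >= 0 else 1.0 - upper


def welch_t_test(control, variant, tails=2):
    """Return (t statistic, degrees of freedom, p-value).

    The one-tailed version tests whether the variant mean exceeds the control mean.
    """
    n1, n2 = len(control), len(variant)
    v1, v2 = variance(control) / n1, variance(variant) / n2
    t = (mean(variant) - mean(control)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    p = t_sf(t, df) if tails == 1 else 2 * t_sf(abs(t), df)
    return t, df, p


def is_significant(control, variant, preset="standard"):
    cfg = PRESETS[preset]
    _, _, p = welch_t_test(control, variant, tails=cfg["tails"])
    return p < 1.0 - cfg["confidence"]


# Made-up per-visitor revenue samples.
control = [10.1, 9.8, 10.4, 9.9, 10.0, 10.2]
variant = [10.6, 10.3, 10.9, 10.4, 10.7, 10.5]
for name in PRESETS:
    print(name, is_significant(control, variant, name))
```

When the variant is ahead, the one-tailed p-value is half the two-tailed one, which is why the one-tailed setting tends to reach significance sooner.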

Power Calculations

We support two power calculation modes that directly influence test completion automation:

  1. Dynamic: In this mode, we use the observed lift as the Minimum Detectable Effect (MDE) to compute the estimated test progress. This is the default setting and may cause the estimated progress to vary based on fluctuations in the lift, especially at the start of the test.

  2. Fixed: This mode is equivalent to standard fixed-horizon test planning. Here, you can set your target MDE, and the test progress will be calculated accordingly.
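The two modes can be sketched as follows. This is a rough illustration, not the product's calculation: it assumes a binary conversion metric, the standard normal-approximation sample-size formula for comparing two proportions, 80% power, and a relative MDE. The function names are hypothetical.

```python
from statistics import NormalDist


def required_visitors_per_variant(baseline_rate, relative_mde,
                                  alpha=0.05, power=0.80, tails=2):
    """Approximate visitors needed per variant to detect a relative lift of relative_mde."""
    p1 = baseline_rate
    p2 = baseline_rate * (1.0 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / tails)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    return (z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2


def estimated_progress(visitors_per_variant, baseline_rate, mde):
    """Fraction of the required sample collected so far, capped at 1.0."""
    if mde == 0:
        return 0.0  # a zero lift would require an infinite sample
    # Simplification: a negative lift is treated like a positive one of the same size.
    needed = required_visitors_per_variant(baseline_rate, abs(mde))
    return min(1.0, visitors_per_variant / needed)


# Fixed mode: the target MDE is chosen up front and never changes.
print("fixed:", estimated_progress(4000, baseline_rate=0.10, mde=0.10))

# Dynamic mode: the observed lift stands in for the MDE, so progress moves with it.
for observed_lift in (0.25, 0.08):
    print("dynamic:", observed_lift, estimated_progress(4000, 0.10, observed_lift))
```

Because the required sample grows roughly with the inverse square of the MDE, a large early lift that later shrinks can make Dynamic progress drop even as visitors accumulate.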

Sequential Testing

Our sequential testing employs Asymptotic Confidence Sequences for analysis, a method grounded in the advancements by Waudby-Smith et al. (2023). This approach shares similarities with the Generalized Anytime Valid Inference confidence sequences introduced by Howard et al. (2022). Both construct confidence intervals that remain valid at every point during data collection, rather than only at a single pre-planned sample size, which is their main advantage over traditional fixed-sample testing.

Key Features:

  1. Flexibility with Continuous Monitoring: One of the principal benefits of using confidence sequences is their ability to accommodate continuous monitoring of the experiment's data. This feature allows for evaluating the results of A/B tests at any point in time, enabling decisions to be made as soon as sufficient evidence is gathered, all while maintaining control over error rates.

  2. Designed for a Variety of Settings: The method is versatile and suitable for a wide range of experimentation scenarios, catering to different needs and use cases. It provides experimenters with the tools necessary to tailor the testing process to their specific requirements.

  3. Adjustability with Tuning Parameter: The tuning parameter is pivotal in configuring the tightness of the confidence sequences. It can be adjusted according to the anticipated decision-making point regarding sample size, thus balancing the benefits of early decision-making against the risk of premature conclusions.

  4. Control Over False Positive Rates: Confidence sequences effectively address the "peeking problem" commonly associated with interim analysis in A/B testing. This method controls the false positive rate despite frequent data checks, an achievement not possible with standard fixed-sample testing without specific adjustments.

Comparison with Other Methods:

  1. Against General Fixed-Sample Testing: Unlike fixed-sample testing, which requires waiting until a predetermined sample size or duration has been reached, confidence sequences facilitate a more dynamic and responsive decision-making process. This approach aligns well with environments that demand rapid iteration and real-time data analysis.

Implementing Sequential Testing in Our Tools:

  1. Sequential Tuning Parameter: The minimum number of visitors required for the test is utilized as a crucial tuning parameter in sequential testing. This setting is vital for controlling the statistical thresholds, ensuring that decisions made at any point are as reliable as those at the conclusion of the experiment.

  2. Adjusting the Tuning Parameter: Modifying this parameter affects the speed at which significant results can be detected. Increasing the parameter enhances the confidence in early results by demanding more data for declaring significance. This adjustment is especially valuable for critical decisions. Conversely, reducing the parameter can accelerate decision-making, which is beneficial in fast-paced environments where quick actions are essential.

  3. Optimal Settings: The default setting for the sequential tuning parameter is 5,000, chosen from historical data and average traffic to balance responsiveness and rigor. However, it can be adjusted to better match the specifics of an individual experiment, depending on expected data variability and the critical nature of the test outcomes.
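The effect of the tuning parameter can be sketched numerically. The code below assumes the two-sided asymptotic confidence sequence boundary and the sample-size-based choice of the tuning constant as we read them in Waudby-Smith et al.; the exact computation used in the product is not documented here, so treat the formulas and the function names as assumptions. The metric is taken to have a standard deviation of 1.

```python
import math
from statistics import NormalDist


def tuned_rho(tuning_visitors, alpha=0.05):
    """Tuning constant that makes the sequence approximately tightest at tuning_visitors."""
    a = -2.0 * math.log(alpha)
    return math.sqrt((a + math.log(a + 1.0)) / tuning_visitors)


def acs_half_width(t, sigma, rho, alpha=0.05):
    """Half-width of the asymptotic confidence sequence after t observations."""
    scale = 2.0 * (t * rho ** 2 + 1.0) / (t ** 2 * rho ** 2)
    return sigma * math.sqrt(scale * math.log(math.sqrt(t * rho ** 2 + 1.0) / alpha))


def fixed_half_width(t, sigma, alpha=0.05):
    """Half-width of a classical fixed-sample interval, valid at one look only."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0) * sigma / math.sqrt(t)


for t in (500, 5000, 50000):
    print(
        t,
        round(fixed_half_width(t, 1.0), 4),
        round(acs_half_width(t, 1.0, tuned_rho(500)), 4),
        round(acs_half_width(t, 1.0, tuned_rho(5000)), 4),
    )
```

At every sample size the sequence is wider than the fixed-sample interval, which is the price of being able to check results at any time. A larger tuning value widens the interval early in the test and tightens it around the tuned sample size, matching the trade-off described in item 2.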

By integrating sequential testing into our suite of analytical tools, we provide a method that supports flexible and timely decision-making while upholding the stringent standards necessary for sound statistical analysis.

Bayesian

Within the Bayesian framework, you can set a decision threshold. This is the minimum "chance to win" probability you'd find acceptable for making decisions. The default is set at 95%, but this can be adjusted according to your risk tolerance. If you want greater certainty before acting, a 99% threshold is recommended.

Regarding priors, we employ uninformative priors, meaning that, a priori, each variant has an equal likelihood of either outperforming or underperforming the others. As data accumulate during the test, these priors are updated, resulting in posterior distributions that inform your decisions.
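As a rough illustration of how a "chance to win" figure arises from uninformative priors, the sketch below assumes a binary conversion metric and a uniform Beta(1, 1) prior for each variant, then estimates the probability by sampling from the two posteriors. The prior family, the Monte Carlo approach, and the function name `chance_to_win` are assumptions made for the example; the article does not specify how the product computes this value.

```python
import random


def chance_to_win(control, variant, draws=20000, seed=0):
    """Estimate P(variant rate > control rate).

    control and variant are (conversions, visitors) tuples.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    wins = 0
    for _ in range(draws):
        # Posterior under a Beta(1, 1) prior: Beta(1 + conversions, 1 + non-conversions).
        c = rng.betavariate(1 + control[0], 1 + control[1] - control[0])
        v = rng.betavariate(1 + variant[0], 1 + variant[1] - variant[0])
        wins += v > c
    return wins / draws


threshold = 0.95  # the default decision threshold
probability = chance_to_win(control=(100, 1000), variant=(150, 1000))
print(probability, probability >= threshold)
```

With identical data in both arms the estimate sits near 50%, reflecting the equal a priori footing of the variants; it moves toward 100% only as the data separate them.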