Statistical Methods Used

Convert supports two distinct statistical approaches for your reports: Frequentist and Bayesian.

Frequentist

Important Update: As of October 13th, 2023, all new experiences utilize t-tests instead of z-tests. Experiences initiated prior to this date employed z-tests for statistical significance.

For our updated Frequentist stats, we utilize t-tests. You have the flexibility to configure the following parameters:

The confidence level. Typically, a 95% confidence level is used for most experiments. However, if your experiment is of critical importance, selecting a 99% confidence level would be prudent.

The test type, either a one-tailed or two-tailed t-test. We generally recommend a one-tailed test for standard experiments, as it tends to reach significance more rapidly. For high-stakes, mission-critical experiments, opt for a two-tailed test. It may take longer to reach significance but provides a more conservative result.

Concerning multiple comparison correction techniques, you can choose from Bonferroni, Sidak, or None. From a statistical robustness perspective, Sidak is generally the better choice, especially for mission-critical experiments: it controls the family-wise error rate while giving up slightly less power than Bonferroni. Nevertheless, the choice is yours.
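To make these options concrete, here is a minimal sketch of a comparable analysis using SciPy. The data, variable names, and the use of Welch's t-test are illustrative assumptions, not Convert's internal implementation.

```python
# Illustrative sketch only; not Convert's internal code.
from scipy import stats

control = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0]   # per-visitor conversions (example data)
variant = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]

alpha = 0.05          # 95% confidence level
num_comparisons = 3   # e.g. three variants each compared against the control

# Welch's t-test; alternative="greater" gives a one-tailed test,
# alternative="two-sided" a two-tailed one.
t_stat, p_value = stats.ttest_ind(variant, control,
                                  equal_var=False, alternative="greater")

# Multiple-comparison corrections applied to the significance threshold.
alpha_sidak = 1 - (1 - alpha) ** (1 / num_comparisons)
alpha_bonferroni = alpha / num_comparisons

print(f"p = {p_value:.4f}, significant after Sidak correction: {p_value < alpha_sidak}")
```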

Sensible Defaults

The "Sensible Defaults" feature in our statistical settings panel serves as a powerful tool for users who need to rapidly configure their A/B testing experiments with predefined statistical parameters that are best suited to the specific criticality of their testing scenario.

When you access the "Sensible Defaults" dropdown menu, you can choose between two key configurations:

  1. Standard: This setting is tailored for routine tests where the stakes are moderate, and its defaults balance thoroughness with resource efficiency. The confidence level for the t-test is set at 95%, and the test defaults to a one-tailed t-test with a Sidak correction, which adjusts for multiple comparisons to keep the Type I error rate from inflating. Power calculations are dynamically adjusted, so the test is sufficiently powered without overcommitting traffic.

  2. Mission-Critical: Intended for high-stakes testing scenarios where precision is paramount, this configuration adjusts the statistical parameters to maximize reliability and minimize risk. The confidence level is raised to 99%, reflecting the need for greater certainty in the outcomes. The test is two-tailed, offering a conservative approach by testing for a difference in both directions, and a Sidak correction is applied for stringent control over error rates.

Additionally, each setting subtly modifies other underlying statistical assumptions and computations to best suit the test's importance. This feature not only saves time but also introduces a layer of best-practice statistical methodology that users can leverage without deep expertise in statistics.
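Conceptually, the two presets bundle the parameters described above roughly as follows. The field names below are placeholders for illustration, not Convert's actual configuration schema.

```python
# Rough summary of the two presets described above; field names are
# placeholders, not Convert's configuration format.
SENSIBLE_DEFAULTS = {
    "standard": {
        "confidence_level": 0.95,
        "test_type": "one-tailed",
        "multiple_comparison_correction": "sidak",
        "power_calculation_mode": "dynamic",
    },
    "mission_critical": {
        "confidence_level": 0.99,
        "test_type": "two-tailed",
        "multiple_comparison_correction": "sidak",
    },
}
```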

Power Calculations

We support two power calculation modes that directly influence test completion automation:

  1. Dynamic: In this mode, we use the observed lift as the Minimum Detectable Effect (MDE) to compute the estimated test progress. This is the default setting and may cause the estimated progress to vary based on fluctuations in the lift, especially at the start of the test.

  2. Fixed: This mode is equivalent to standard fixed-horizon test planning. Here, you can set your target MDE, and the test progress will be calculated accordingly (see the sketch after this list).
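As a rough illustration of the Fixed mode, the sketch below computes the required visitors per variation from a baseline conversion rate and a target MDE using the standard two-proportion sample-size formula, then derives a progress estimate. In Dynamic mode, the observed lift would simply be substituted for the target MDE. The numbers and function names are assumptions for illustration, not Convert's exact computation.

```python
# Fixed-horizon progress sketch for a conversion-rate test, assuming the usual
# two-proportion sample-size formula.
from scipy.stats import norm

def required_visitors_per_arm(baseline_rate, mde_relative,
                              alpha=0.05, power=0.8, two_tailed=True):
    p1 = baseline_rate
    p2 = baseline_rate * (1 + mde_relative)
    z_alpha = norm.ppf(1 - alpha / (2 if two_tailed else 1))
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2

n_target = required_visitors_per_arm(baseline_rate=0.05, mde_relative=0.10)
visitors_so_far = 12_000
print(f"Visitors needed per arm: {n_target:,.0f}; "
      f"estimated progress: {min(1.0, visitors_so_far / n_target):.0%}")
```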

Sequential Testing

Our sequential testing employs Asymptotic Confidence Sequences for analysis, a method grounded in the work of Waudby-Smith et al. (2023). This approach is closely related to the Generalized Anytime Valid Inference confidence sequences introduced by Howard et al. (2022). Both methods allow experiment data to be analyzed continuously as it arrives, rather than only at a predetermined stopping point, which is a significant advantage over traditional fixed-horizon testing.

Key Features:

  1. Flexibility with Continuous Monitoring: One of the principal benefits of using confidence sequences is their ability to accommodate continuous monitoring of the experiment's data. This feature allows for evaluating the results of A/B tests at any point in time, enabling decisions to be made as soon as sufficient evidence is gathered, all while maintaining control over error rates.

  2. Designed for a Variety of Settings: The method is versatile and suitable for a wide range of experimentation scenarios, catering to different needs and user cases. It provides experimenters with the tools necessary to tailor the testing process to their specific requirements.

  3. Adjustability with Tuning Parameter: The tuning parameter is pivotal in configuring the tightness of the confidence sequences. It can be adjusted according to the anticipated decision-making point regarding sample size, thus balancing the benefits of early decision-making against the risk of premature conclusions.

  4. Control Over False Positive Rates: Confidence sequences effectively address the "peeking problem" commonly associated with interim analysis in A/B testing. This method controls the false positive rate despite frequent data checks, something standard fixed-sample testing cannot do without specific adjustments (illustrated in the sketch after this list).
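As a rough illustration, the sketch below computes a running confidence sequence for a mean using the normal-mixture width that appears in asymptotic confidence sequences of the kind described by Waudby-Smith et al. The exact boundary Convert uses, and how its tuning parameter rho is set, are not specified here, so treat this as a sketch under those assumptions.

```python
# Sketch of an asymptotic confidence sequence for a running mean; the exact
# boundary used by Convert may differ.
import numpy as np

def asymptotic_cs(values, alpha=0.05, rho=0.1):
    """Lower/upper confidence sequence for the mean of `values` over time."""
    values = np.asarray(values, dtype=float)
    t = np.arange(1, len(values) + 1)
    mean = np.cumsum(values) / t
    # Running standard deviation, floored to avoid zero width on early samples.
    sq_mean = np.cumsum(values ** 2) / t
    std = np.sqrt(np.maximum(sq_mean - mean ** 2, 1e-12))
    # Normal-mixture width; rho tunes where the sequence is tightest.
    width = std * np.sqrt(
        2 * (t * rho ** 2 + 1) / (t ** 2 * rho ** 2)
        * np.log(np.sqrt(t * rho ** 2 + 1) / alpha)
    )
    return mean - width, mean + width

# Example: a 95% confidence sequence over a stream of per-visitor conversions.
rng = np.random.default_rng(0)
lower, upper = asymptotic_cs(rng.binomial(1, 0.06, size=10_000))
print(f"Interval after 10,000 visitors: [{lower[-1]:.4f}, {upper[-1]:.4f}]")
```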

Comparison with Other Methods:

  1. Against General Fixed-Sample Testing: Unlike fixed-sample testing, which requires waiting until a predetermined sample size or duration has been reached, confidence sequences facilitate a more dynamic and responsive decision-making process. This approach aligns well with environments that demand rapid iteration and real-time data analysis.

Implementing Sequential Testing in Our Tools:

  1. Sequential Tuning Parameter: The minimum number of visitors required for the test is utilized as a crucial tuning parameter in sequential testing. This setting is vital for controlling the statistical thresholds, ensuring that decisions made at any point are as reliable as those at the conclusion of the experiment.

  2. Adjusting the Tuning Parameter: Modifying this parameter affects the speed at which significant results can be detected. Increasing the parameter enhances the confidence in early results by demanding more data for declaring significance. This adjustment is especially valuable for critical decisions. Conversely, reducing the parameter can accelerate decision-making, which is beneficial in fast-paced environments where quick actions are essential.

  3. Optimal Settings: The default value of the sequential tuning parameter is 5,000, chosen based on historical data and average traffic to balance responsiveness and rigor. However, it can be adjusted to better match an individual experiment, depending on the expected data variability and how critical the test outcomes are (the trade-off is illustrated below).
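To picture this trade-off, the snippet below reuses the asymptotic_cs() sketch from earlier and compares the interval width early in a test for three different minimum-visitor settings. The mapping from the minimum-visitor setting to the tuning parameter rho is a heuristic assumed here for illustration, not Convert's actual rule.

```python
# Illustrative only: larger minimum-visitor settings demand more data before
# the interval tightens, so early results are judged more conservatively.
import numpy as np

alpha = 0.05
# Heuristic: pick rho so the sequence is tightest near n_min visitors.
u_star = -2 * np.log(alpha) + np.log(-2 * np.log(alpha) + 1)

rng = np.random.default_rng(1)
data = rng.binomial(1, 0.06, size=50_000)

for n_min in (1_000, 5_000, 20_000):
    rho = np.sqrt(u_star / n_min)
    lower, upper = asymptotic_cs(data, alpha=alpha, rho=rho)
    print(f"n_min={n_min:>6}: interval width at 2,000 visitors = "
          f"{upper[1999] - lower[1999]:.4f}")
```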

By integrating sequential testing into our suite of analytical tools, we provide a method that supports flexible and timely decision-making while upholding the stringent standards necessary for sound statistical analysis.

Bayesian

Our Bayesian framework offers several key advantages over frequentist approaches:

Intuitive Results

Bayesian analysis provides probabilities and distributions of likely outcomes rather than p-values and confidence intervals. This allows you to make statements like, "There’s a 95% chance this new button is better and a 5% chance it’s worse," which is more intuitive for decision-making.

Flexibility with Experiment Duration

Bayesian results remain interpretable even if you stop an experiment early: the reported probabilities and posterior distributions are not invalidated by an early stop. That said, Bayesian methods are not immune to "peeking," and stopping an experiment as soon as the results look favorable can still inflate the false positive rate.

Priors and Posteriors

We employ uninformative priors, meaning that, a priori, each variant has an equal likelihood of either outperforming or underperforming the others. As data accumulate during the test, these priors are updated, resulting in posterior distributions that inform your decisions.
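For a conversion-rate metric, this updating process can be pictured with a Beta-Binomial model. The sketch below assumes a flat Beta(1, 1) prior and made-up counts purely for illustration; the exact priors and likelihoods Convert uses are not spelled out here.

```python
# Illustrative Beta-Binomial update with a flat (uninformative) Beta(1, 1) prior.
from scipy import stats

prior_a, prior_b = 1, 1   # uninformative prior: every conversion rate equally likely

# Placeholder observations collected so far.
control_conversions, control_visitors = 120, 2_400
variant_conversions, variant_visitors = 150, 2_380

control_posterior = stats.beta(prior_a + control_conversions,
                               prior_b + control_visitors - control_conversions)
variant_posterior = stats.beta(prior_a + variant_conversions,
                               prior_b + variant_visitors - variant_conversions)

print(f"Posterior mean conversion rates: "
      f"control {control_posterior.mean():.3%}, variant {variant_posterior.mean():.3%}")
```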

Decision Thresholds

  1. The "Chance to Win" setting in the Bayesian analysis represents the probability threshold that a variant must meet or exceed to be considered as potentially outperforming the control group. This setting helps in making decisions based on the likelihood of one variant being better than another. It is a crucial factor in determining whether a variant's performance is statistically significant enough to warrant consideration over others.

  2. The "Risk" setting corresponds to the upper limit of the confidence interval for the risk metric associated with a variant. This setting is used to assess the statistical uncertainty associated with a variant's performance metrics. A lower risk threshold means that you require a tighter confidence interval (i.e., less uncertainty and more precision) to consider a variant as a potential winner. This setting helps in managing the downside risk when making decisions based on the Bayesian statistical analysis.