# Sample Size Guide

Reference for calculating sample sizes and test duration.

## Sample Size Fundamentals

### Required Inputs

1. **Baseline conversion rate**: Your current rate
2. **Minimum detectable effect (MDE)**: Smallest change worth detecting
3. **Statistical significance level**: Usually 95% (α = 0.05)
4. **Statistical power**: Usually 80% (β = 0.20)

### What These Mean

**Baseline conversion rate**: If your page converts at 5%, that's your baseline.

**MDE (Minimum Detectable Effect)**: The smallest improvement you care about detecting. Set this based on:
- Business impact (is a 5% lift meaningful?)
- Implementation cost (worth the effort?)
- Realistic expectations (what have past tests shown?)

**Statistical significance (95%)**: Means that if there were no real difference, you would see a difference this large less than 5% of the time by chance alone.

**Statistical power (80%)**: Means if there's a real effect of size MDE, you have an 80% chance of detecting it.

---

## Sample Size Quick Reference Tables

### Conversion Rate: 1%

| Lift to Detect | Sample per Variant | Total Sample |
|----------------|-------------------|--------------|
| 5% (1% → 1.05%) | 1,500,000 | 3,000,000 |
| 10% (1% → 1.1%) | 380,000 | 760,000 |
| 20% (1% → 1.2%) | 97,000 | 194,000 |
| 50% (1% → 1.5%) | 16,000 | 32,000 |
| 100% (1% → 2%) | 4,200 | 8,400 |

### Conversion Rate: 3%

| Lift to Detect | Sample per Variant | Total Sample |
|----------------|-------------------|--------------|
| 5% (3% → 3.15%) | 480,000 | 960,000 |
| 10% (3% → 3.3%) | 120,000 | 240,000 |
| 20% (3% → 3.6%) | 31,000 | 62,000 |
| 50% (3% → 4.5%) | 5,200 | 10,400 |
| 100% (3% → 6%) | 1,400 | 2,800 |

### Conversion Rate: 5%

| Lift to Detect | Sample per Variant | Total Sample |
|----------------|-------------------|--------------|
| 5% (5% → 5.25%) | 280,000 | 560,000 |
| 10% (5% → 5.5%) | 72,000 | 144,000 |
| 20% (5% → 6%) | 18,000 | 36,000 |
| 50% (5% → 7.5%) | 3,100 | 6,200 |
| 100% (5% → 10%) | 810 | 1,620 |

### Conversion Rate: 10%

| Lift to Detect | Sample per Variant | Total Sample |
|----------------|-------------------|--------------|
| 5% (10% → 10.5%) | 130,000 | 260,000 |
| 10% (10% → 11%) | 34,000 | 68,000 |
| 20% (10% → 12%) | 8,700 | 17,400 |
| 50% (10% → 15%) | 1,500 | 3,000 |
| 100% (10% → 20%) | 400 | 800 |

### Conversion Rate: 20%

| Lift to Detect | Sample per Variant | Total Sample |
|----------------|-------------------|--------------|
| 5% (20% → 21%) | 60,000 | 120,000 |
| 10% (20% → 22%) | 16,000 | 32,000 |
| 20% (20% → 24%) | 4,000 | 8,000 |
| 50% (20% → 30%) | 700 | 1,400 |
| 100% (20% → 40%) | 200 | 400 |

---

## Duration Calculator

### Formula

```
Duration (days) = (Sample per variant × Number of variants) / (Daily traffic × % exposed)
```

### Examples

**Scenario 1: High-traffic page**
- Need: 10,000 per variant (2 variants = 20,000 total)
- Daily traffic: 5,000 visitors
- 100% exposed to test
- Duration: 20,000 / 5,000 = **4 days**

**Scenario 2: Medium-traffic page**
- Need: 30,000 per variant (60,000 total)
- Daily traffic: 2,000 visitors
- 100% exposed
- Duration: 60,000 / 2,000 = **30 days**

**Scenario 3: Low-traffic with partial exposure**
- Need: 15,000 per variant (30,000 total)
- Daily traffic: 500 visitors
- 50% exposed to test
- Effective daily: 250
- Duration: 30,000 / 250 = **120 days** (too long!)
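For rates and lifts not covered in the tables, the same math can be scripted. The sketch below uses the standard two-proportion z-test formula (two-sided significance, unpooled variance); the function names are illustrative, and dedicated calculators make different assumptions, so their outputs and the reference tables above may differ from what this formula returns.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline: float, relative_lift: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Visitors needed per variant to detect `relative_lift` over `baseline`."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)            # e.g. 0.05 * 1.10 = 0.055
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for 95% significance
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def duration_days(sample_per_variant: int, variants: int,
                  daily_traffic: int, pct_exposed: float = 1.0) -> float:
    """Duration formula from above: total sample / effective daily traffic."""
    return (sample_per_variant * variants) / (daily_traffic * pct_exposed)


if __name__ == "__main__":
    n = sample_size_per_variant(baseline=0.05, relative_lift=0.10)
    print(f"Per variant: {n:,}")  # ~31,000 per variant under this formula
    print(f"Days at 5,000 visitors/day, 2 variants: {duration_days(n, 2, 5_000):.0f}")
```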
### Minimum Duration Rules

Even with sufficient sample size, run tests for at least:
- **1 full week**: To capture day-of-week variation
- **2 business cycles**: If B2B (weekday vs. weekend patterns)
- **Through paydays**: If e-commerce (beginning/end of month)

### Maximum Duration Guidelines

Avoid running tests longer than 4-8 weeks:
- Novelty effects wear off
- External factors intervene
- Opportunity cost of other tests

---

## Online Calculators

### Recommended Tools

**Evan Miller's Calculator**
https://www.evanmiller.org/ab-testing/sample-size.html
- Simple interface
- Bookmark-worthy

**Optimizely's Calculator**
https://www.optimizely.com/sample-size-calculator/
- Business-friendly language
- Duration estimates

**AB Test Guide Calculator**
https://www.abtestguide.com/calc/
- Includes Bayesian option
- Multiple test types

**VWO Duration Calculator**
https://vwo.com/tools/ab-test-duration-calculator/
- Duration-focused
- Good for planning

---

## Adjusting for Multiple Variants

With more than 2 variants (A/B/n tests), you need more sample:

| Variants | Multiplier |
|----------|------------|
| 2 (A/B) | 1x |
| 3 (A/B/C) | ~1.5x |
| 4 (A/B/C/D) | ~2x |
| 5+ | Consider reducing variants |

**Why?** More comparisons increase the chance of false positives. You're comparing:
- A vs B
- A vs C
- B vs C (sometimes)

Apply a Bonferroni correction or use tools that handle this automatically.

---

## Common Sample Size Mistakes

### 1. Underpowered tests
**Problem**: Not enough sample to detect realistic effects
**Fix**: Be realistic about MDE, get more traffic, or don't test

### 2. Overpowered tests
**Problem**: Waiting for the full sample size when you already have significance
**Fix**: This is actually fine: you committed to a sample size, so honor it

### 3. Wrong baseline rate
**Problem**: Using the wrong conversion rate for the calculation
**Fix**: Use the specific metric and page, not site-wide averages

### 4. Ignoring segments
**Problem**: Calculating for full traffic, then analyzing segments
**Fix**: If you plan segment analysis, calculate sample for the smallest segment

### 5. Testing too many things
**Problem**: Dividing traffic too many ways
**Fix**: Prioritize ruthlessly, run fewer concurrent tests

---

## When Sample Size Requirements Are Too High

Options when you can't get enough traffic:

1. **Increase MDE**: Accept only detecting larger effects (20%+ lift)
2. **Lower confidence**: Use 90% instead of 95% (risky, document it)
3. **Reduce variants**: Test only the most promising variant
4. **Combine traffic**: Test across multiple similar pages
5. **Test upstream**: Test earlier in the funnel where traffic is higher
6. **Don't test**: Make the decision based on qualitative data instead
7. **Longer test**: Accept a longer duration (weeks/months)

---

## Sequential Testing

If you must check results before reaching sample size:

### What is it?
A statistical method that adjusts for multiple looks at the data.

### When to use
- High-risk changes
- Need to stop bad variants early
- Time-sensitive decisions

### Tools that support it
- Optimizely (Stats Accelerator)
- VWO (SmartStats)
- PostHog (Bayesian approach)

### Tradeoffs
- More flexibility to stop early
- Slightly larger sample size requirement
- More complex analysis

---

## Quick Decision Framework

### Can I run this test?

```
Daily traffic to page: _____
Baseline conversion rate: _____
MDE I care about: _____
Sample needed per variant: _____ (from tables above)

Days to run: (Sample per variant × Number of variants) / Daily traffic = _____

If days > 60: Consider alternatives
If days > 30: Acceptable for high-impact tests
If days < 14: Likely feasible
If days < 7: Easy to run, consider running longer anyway
```
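If you want this checklist as something you can drop into a script or notebook, here is an illustrative helper that applies the same thresholds; the function name and the verdict for the 14-30 day range (which the framework above leaves unlabeled) are assumptions of this sketch.

```python
from math import ceil


def feasibility(sample_per_variant: int, variants: int,
                daily_traffic: int, pct_exposed: float = 1.0) -> str:
    """Estimate test duration and apply the guide's decision thresholds."""
    days = ceil((sample_per_variant * variants) / (daily_traffic * pct_exposed))
    if days > 60:
        verdict = "consider alternatives (raise the MDE, test upstream, or skip)"
    elif days > 30:
        verdict = "acceptable for high-impact tests"
    elif days < 7:
        verdict = "easy to run; consider running at least a full week anyway"
    elif days < 14:
        verdict = "likely feasible"
    else:
        verdict = "workable"  # 14-30 days: not labeled by the framework above
    return f"{days} days: {verdict}"


# Scenario 3 from the duration examples: 15,000 per variant, 2 variants,
# 500 visitors/day, 50% exposed -> 120 days.
print(feasibility(15_000, variants=2, daily_traffic=500, pct_exposed=0.5))
```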