# Sample Size Guide
Reference for calculating sample sizes and test duration.
## Sample Size Fundamentals
### Required Inputs
- Baseline conversion rate: Your current rate
- Minimum detectable effect (MDE): Smallest change worth detecting
- Statistical significance: Usually 95% confidence (α = 0.05)
- Statistical power: Usually 80% (β = 0.20)
### What These Mean
Baseline conversion rate: If your page converts at 5%, that's your baseline.
MDE (Minimum Detectable Effect): The smallest improvement you care about detecting. Set this based on:
- Business impact (is a 5% lift meaningful?)
- Implementation cost (worth the effort?)
- Realistic expectations (what have past tests shown?)
Statistical significance (95%): If there were truly no difference between variants, you would see a result at least this extreme less than 5% of the time.
Statistical power (80%): If there is a real effect at least as large as your MDE, you have an 80% chance of detecting it.
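These four inputs are everything a sample-size formula needs. As a rough illustration only (the calculators listed later in this guide are the practical route), here is a sketch of the standard two-sided, two-proportion normal-approximation calculation. The `sample_size_per_variant` name and its arguments are invented for this example, SciPy is assumed to be available, and its output will be smaller than the rounded, more conservative figures in the tables below.

```python
from math import ceil, sqrt
from scipy.stats import norm

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Per-variant sample for a two-sided two-proportion z-test (normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)   # e.g. 5% baseline with a 20% lift -> 6%
    z_alpha = norm.ppf(1 - alpha / 2)    # 1.96 at 95% significance
    z_power = norm.ppf(power)            # 0.84 at 80% power
    pooled = (p1 + p2) / 2
    n = ((z_alpha * sqrt(2 * pooled * (1 - pooled))
          + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
         / (p2 - p1) ** 2)
    return ceil(n)

print(sample_size_per_variant(baseline=0.05, relative_mde=0.20))  # about 8,200
```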
## Sample Size Quick Reference Tables
### Conversion Rate: 1%
| Lift to Detect | Sample per Variant | Total Sample |
|---|---|---|
| 5% (1% → 1.05%) | 1,500,000 | 3,000,000 |
| 10% (1% → 1.1%) | 380,000 | 760,000 |
| 20% (1% → 1.2%) | 97,000 | 194,000 |
| 50% (1% → 1.5%) | 16,000 | 32,000 |
| 100% (1% → 2%) | 4,200 | 8,400 |
### Conversion Rate: 3%
| Lift to Detect | Sample per Variant | Total Sample |
|---|---|---|
| 5% (3% → 3.15%) | 480,000 | 960,000 |
| 10% (3% → 3.3%) | 120,000 | 240,000 |
| 20% (3% → 3.6%) | 31,000 | 62,000 |
| 50% (3% → 4.5%) | 5,200 | 10,400 |
| 100% (3% → 6%) | 1,400 | 2,800 |
### Conversion Rate: 5%
| Lift to Detect | Sample per Variant | Total Sample |
|---|---|---|
| 5% (5% → 5.25%) | 280,000 | 560,000 |
| 10% (5% → 5.5%) | 72,000 | 144,000 |
| 20% (5% → 6%) | 18,000 | 36,000 |
| 50% (5% → 7.5%) | 3,100 | 6,200 |
| 100% (5% → 10%) | 810 | 1,620 |
### Conversion Rate: 10%
| Lift to Detect | Sample per Variant | Total Sample |
|---|---|---|
| 5% (10% → 10.5%) | 130,000 | 260,000 |
| 10% (10% → 11%) | 34,000 | 68,000 |
| 20% (10% → 12%) | 8,700 | 17,400 |
| 50% (10% → 15%) | 1,500 | 3,000 |
| 100% (10% → 20%) | 400 | 800 |
### Conversion Rate: 20%
| Lift to Detect | Sample per Variant | Total Sample |
|---|---|---|
| 5% (20% → 21%) | 60,000 | 120,000 |
| 10% (20% → 22%) | 16,000 | 32,000 |
| 20% (20% → 24%) | 4,000 | 8,000 |
| 50% (20% → 30%) | 700 | 1,400 |
| 100% (20% → 40%) | 200 | 400 |
## Duration Calculator
### Formula
Duration (days) = (Sample per variant × Number of variants) / (Daily traffic × % exposed)
### Examples
**Scenario 1: High-traffic page**
- Need: 10,000 per variant (2 variants = 20,000 total)
- Daily traffic: 5,000 visitors
- 100% exposed to test
- Duration: 20,000 / 5,000 = 4 days
**Scenario 2: Medium-traffic page**
- Need: 30,000 per variant (60,000 total)
- Daily traffic: 2,000 visitors
- 100% exposed
- Duration: 60,000 / 2,000 = 30 days
**Scenario 3: Low-traffic with partial exposure**
- Need: 15,000 per variant (30,000 total)
- Daily traffic: 500 visitors
- 50% exposed to test
- Effective daily: 250
- Duration: 30,000 / 250 = 120 days (too long!)
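The same arithmetic as a small helper; the function and argument names are illustrative, not from any particular testing tool.

```python
def test_duration_days(sample_per_variant, variants, daily_traffic, pct_exposed=1.0):
    """Days needed to collect the full sample, given daily traffic and test exposure."""
    total_sample = sample_per_variant * variants
    effective_daily_traffic = daily_traffic * pct_exposed
    return total_sample / effective_daily_traffic

# The three scenarios above:
print(test_duration_days(10_000, 2, 5_000))       # 4.0 days
print(test_duration_days(30_000, 2, 2_000))       # 30.0 days
print(test_duration_days(15_000, 2, 500, 0.5))    # 120.0 days -- too long
```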
## Minimum Duration Rules
Even with sufficient sample size, run tests for at least:
- 1 full week: To capture day-of-week variation
- 2 business cycles: If B2B (weekday vs. weekend patterns)
- Through paydays: If e-commerce (beginning/end of month)
## Maximum Duration Guidelines
Avoid running tests longer than 4-8 weeks:
- Novelty effects wear off
- External factors intervene
- Opportunity cost: the traffic could be powering other tests
## Online Calculators
### Recommended Tools
**Evan Miller's Calculator**: https://www.evanmiller.org/ab-testing/sample-size.html
- Simple interface
- Bookmark-worthy
**Optimizely's Calculator**: https://www.optimizely.com/sample-size-calculator/
- Business-friendly language
- Duration estimates
**AB Test Guide Calculator**: https://www.abtestguide.com/calc/
- Includes Bayesian option
- Multiple test types
**VWO Duration Calculator**: https://vwo.com/tools/ab-test-duration-calculator/
- Duration-focused
- Good for planning
## Adjusting for Multiple Variants
With more than 2 variants (A/B/n tests), you need more sample:
| Variants | Multiplier |
|---|---|
| 2 (A/B) | 1x |
| 3 (A/B/C) | ~1.5x |
| 4 (A/B/C/D) | ~2x |
| 5+ | Consider reducing variants |
Why? More comparisons increase the chance of false positives. You're comparing:
- A vs B
- A vs C
- B vs C (sometimes)
Apply a Bonferroni correction (sketched below) or use tools that handle this automatically.
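A minimal sketch of what the correction does, assuming each added variant is compared only against the control; the `bonferroni_alpha` helper is a name invented here, and SciPy is assumed.

```python
from scipy.stats import norm

def bonferroni_alpha(alpha, variants):
    """Split the overall alpha across the control-vs-variant comparisons."""
    comparisons = variants - 1
    return alpha / comparisons

for variants in (2, 3, 4):
    adj_alpha = bonferroni_alpha(0.05, variants)
    z = norm.ppf(1 - adj_alpha / 2)
    # Required sample scales with (z_alpha + z_power)^2, so a larger z means
    # more visitors per variant on top of the traffic split itself.
    print(f"{variants} variants: per-comparison alpha = {adj_alpha:.4f}, z = {z:.2f}")
```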
## Common Sample Size Mistakes
### 1. Underpowered tests
- **Problem:** Not enough sample to detect realistic effects.
- **Fix:** Be realistic about MDE, get more traffic, or don't test.
### 2. Overpowered tests
- **Problem:** Waiting for sample size when you already have significance.
- **Fix:** This is actually fine: you committed to a sample size, so honor it.
### 3. Wrong baseline rate
- **Problem:** Using the wrong conversion rate for the calculation.
- **Fix:** Use the specific metric and page, not site-wide averages.
### 4. Ignoring segments
- **Problem:** Calculating for full traffic, then analyzing segments.
- **Fix:** If you plan segment analysis, calculate the sample for the smallest segment.
### 5. Testing too many things
- **Problem:** Dividing traffic too many ways.
- **Fix:** Prioritize ruthlessly and run fewer concurrent tests.
## When Sample Size Requirements Are Too High
Options when you can't get enough traffic:
- Increase MDE: Accept only detecting larger effects (20%+ lift)
- Lower confidence: Use 90% instead of 95% (risky, document it)
- Reduce variants: Test only the most promising variant
- Combine traffic: Test across multiple similar pages
- Test upstream: Test earlier in funnel where traffic is higher
- Don't test: Make decision based on qualitative data instead
- Run longer: Accept a longer duration (weeks or months)
## Sequential Testing
If you must check results before reaching the planned sample size:
### What is it?
A statistical method that adjusts significance thresholds for repeated looks at the data.
### When to use
- High-risk changes
- Need to stop bad variants early
- Time-sensitive decisions
### Tools that support it
- Optimizely (Stats Accelerator)
- VWO (SmartStats)
- PostHog (Bayesian approach)
### Tradeoffs
- More flexibility to stop early
- Slightly larger sample size requirement
- More complex analysis
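As an illustration of the Bayesian flavor mentioned above, here is a generic Beta-Binomial check of "how sure are we that B beats A?" at an interim look. This is not any specific vendor's implementation; the flat Beta(1, 1) priors and the function name are assumptions.

```python
import numpy as np

def prob_b_beats_a(conversions_a, visitors_a, conversions_b, visitors_b,
                   draws=200_000, seed=0):
    """Monte Carlo estimate of P(rate B > rate A) under flat Beta(1, 1) priors."""
    rng = np.random.default_rng(seed)
    rate_a = rng.beta(1 + conversions_a, 1 + visitors_a - conversions_a, draws)
    rate_b = rng.beta(1 + conversions_b, 1 + visitors_b - conversions_b, draws)
    return float((rate_b > rate_a).mean())

# Peeking mid-test: B is ahead, but is the evidence strong enough to stop early?
print(prob_b_beats_a(500, 10_000, 560, 10_000))  # roughly 0.97
```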
## Quick Decision Framework
### Can I run this test?
- Daily traffic to page: _____
- Baseline conversion rate: _____
- MDE I care about: _____
- Sample needed per variant: _____ (from the tables above)
- Days to run: (Sample per variant × number of variants) / Daily traffic = _____
- If days > 60: Consider alternatives
- If days > 30: Acceptable for high-impact tests
- If days < 14: Likely feasible
- If days < 7: Easy to run, but consider running at least a full week anyway
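A scripted version of this framework; it uses Lehr's rule of thumb (n per variant ≈ 16 × p(1 - p) / Δ²) as a quick stand-in for the tables, so the estimate is rougher and smaller than the conservative table values, and the function and argument names are illustrative.

```python
def days_to_run(daily_traffic, baseline, relative_mde, variants=2, exposure=1.0):
    """Rough feasibility check: Lehr's rule for sample size, then the duration formula."""
    delta = baseline * relative_mde                        # absolute lift to detect
    n_per_variant = 16 * baseline * (1 - baseline) / delta ** 2
    return (n_per_variant * variants) / (daily_traffic * exposure)

days = days_to_run(daily_traffic=2_000, baseline=0.03, relative_mde=0.20)
if days > 60:
    verdict = "consider alternatives"
elif days > 30:
    verdict = "acceptable for high-impact tests"
elif days < 7:
    verdict = "easy to run; consider a full week anyway"
else:
    verdict = "likely feasible"
print(f"{days:.0f} days -> {verdict}")   # about 13 days -> likely feasible
```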