Refactor remaining skills for progressive disclosure

Phase 2 refactoring of skills >500 lines and medium-sized skills: - paid-ads: 553 → 297 lines - Extract ad-copy-templates.md, audience-targeting.md, platform-setup-checklists.md - analytics-tracking: 541 → 292 lines - Extract ga4-implementation.md, gtm-implementation.md, event-library.md - ab-test-setup: 510 → 264 lines - Extract test-templates.md, sample-size-guide.md - copywriting: 458 → 248 lines - Extract copy-frameworks.md (headline formulas, section types) - page-cro: 336 → 180 lines - Extract experiments.md (experiment ideas by page type) - onboarding-cro: 435 → 218 lines - Extract experiments.md (onboarding experiment ideas) All skills now use progressive disclosure with references/ folders, keeping SKILL.md files focused on core workflow while detailed content is available when needed. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-26 16:59:23 -08:00
parent c29ee7e6db
commit 98e74b79d7
17 changed files with 3356 additions and 1721 deletions
--- a/skills/ab-test-setup/references/sample-size-guide.md
+++ b/skills/ab-test-setup/references/sample-size-guide.md
@@ -0,0 +1,252 @@
+# Sample Size Guide
+
+Reference for calculating sample sizes and test duration.
+
+## Sample Size Fundamentals
+
+### Required Inputs
+
+1. **Baseline conversion rate**: Your current rate
+2. **Minimum detectable effect (MDE)**: Smallest change worth detecting
+3. **Statistical significance level**: Usually 95% (α = 0.05)
+4. **Statistical power**: Usually 80% (β = 0.20)
+
+### What These Mean
+
+**Baseline conversion rate**: If your page converts at 5%, that's your baseline.
+
+**MDE (Minimum Detectable Effect)**: The smallest improvement you care about detecting. Set this based on:
+- Business impact (is a 5% lift meaningful?)
+- Implementation cost (worth the effort?)
+- Realistic expectations (what have past tests shown?)
+
+**Statistical significance (95%)**: Means there's less than 5% chance the observed difference is due to random chance.
+
+**Statistical power (80%)**: Means if there's a real effect of size MDE, you have 80% chance of detecting it.
+
+---
+
+## Sample Size Quick Reference Tables
+
+### Conversion Rate: 1%
+
+| Lift to Detect | Sample per Variant | Total Sample |
+|----------------|-------------------|--------------|
+| 5% (1% → 1.05%) | 1,500,000 | 3,000,000 |
+| 10% (1% → 1.1%) | 380,000 | 760,000 |
+| 20% (1% → 1.2%) | 97,000 | 194,000 |
+| 50% (1% → 1.5%) | 16,000 | 32,000 |
+| 100% (1% → 2%) | 4,200 | 8,400 |
+
+### Conversion Rate: 3%
+
+| Lift to Detect | Sample per Variant | Total Sample |
+|----------------|-------------------|--------------|
+| 5% (3% → 3.15%) | 480,000 | 960,000 |
+| 10% (3% → 3.3%) | 120,000 | 240,000 |
+| 20% (3% → 3.6%) | 31,000 | 62,000 |
+| 50% (3% → 4.5%) | 5,200 | 10,400 |
+| 100% (3% → 6%) | 1,400 | 2,800 |
+
+### Conversion Rate: 5%
+
+| Lift to Detect | Sample per Variant | Total Sample |
+|----------------|-------------------|--------------|
+| 5% (5% → 5.25%) | 280,000 | 560,000 |
+| 10% (5% → 5.5%) | 72,000 | 144,000 |
+| 20% (5% → 6%) | 18,000 | 36,000 |
+| 50% (5% → 7.5%) | 3,100 | 6,200 |
+| 100% (5% → 10%) | 810 | 1,620 |
+
+### Conversion Rate: 10%
+
+| Lift to Detect | Sample per Variant | Total Sample |
+|----------------|-------------------|--------------|
+| 5% (10% → 10.5%) | 130,000 | 260,000 |
+| 10% (10% → 11%) | 34,000 | 68,000 |
+| 20% (10% → 12%) | 8,700 | 17,400 |
+| 50% (10% → 15%) | 1,500 | 3,000 |
+| 100% (10% → 20%) | 400 | 800 |
+
+### Conversion Rate: 20%
+
+| Lift to Detect | Sample per Variant | Total Sample |
+|----------------|-------------------|--------------|
+| 5% (20% → 21%) | 60,000 | 120,000 |
+| 10% (20% → 22%) | 16,000 | 32,000 |
+| 20% (20% → 24%) | 4,000 | 8,000 |
+| 50% (20% → 30%) | 700 | 1,400 |
+| 100% (20% → 40%) | 200 | 400 |
+
+---
+
+## Duration Calculator
+
+### Formula
+
+```
+Duration (days) = (Sample per variant × Number of variants) / (Daily traffic × % exposed)
+```
+
+### Examples
+
+**Scenario 1: High-traffic page**
+- Need: 10,000 per variant (2 variants = 20,000 total)
+- Daily traffic: 5,000 visitors
+- 100% exposed to test
+- Duration: 20,000 / 5,000 = **4 days**
+
+**Scenario 2: Medium-traffic page**
+- Need: 30,000 per variant (60,000 total)
+- Daily traffic: 2,000 visitors
+- 100% exposed
+- Duration: 60,000 / 2,000 = **30 days**
+
+**Scenario 3: Low-traffic with partial exposure**
+- Need: 15,000 per variant (30,000 total)
+- Daily traffic: 500 visitors
+- 50% exposed to test
+- Effective daily: 250
+- Duration: 30,000 / 250 = **120 days** (too long!)
+
+### Minimum Duration Rules
+
+Even with sufficient sample size, run tests for at least:
+- **1 full week**: To capture day-of-week variation
+- **2 business cycles**: If B2B (weekday vs. weekend patterns)
+- **Through paydays**: If e-commerce (beginning/end of month)
+
+### Maximum Duration Guidelines
+
+Avoid running tests longer than 4-8 weeks:
+- Novelty effects wear off
+- External factors intervene
+- Opportunity cost of other tests
+
+---
+
+## Online Calculators
+
+### Recommended Tools
+
+**Evan Miller's Calculator**
+https://www.evanmiller.org/ab-testing/sample-size.html
+- Simple interface
+- Bookmark-worthy
+
+**Optimizely's Calculator**
+https://www.optimizely.com/sample-size-calculator/
+- Business-friendly language
+- Duration estimates
+
+**AB Test Guide Calculator**
+https://www.abtestguide.com/calc/
+- Includes Bayesian option
+- Multiple test types
+
+**VWO Duration Calculator**
+https://vwo.com/tools/ab-test-duration-calculator/
+- Duration-focused
+- Good for planning
+
+---
+
+## Adjusting for Multiple Variants
+
+With more than 2 variants (A/B/n tests), you need more sample:
+
+| Variants | Multiplier |
+|----------|------------|
+| 2 (A/B) | 1x |
+| 3 (A/B/C) | ~1.5x |
+| 4 (A/B/C/D) | ~2x |
+| 5+ | Consider reducing variants |
+
+**Why?** More comparisons increase chance of false positives. You're comparing:
+- A vs B
+- A vs C
+- B vs C (sometimes)
+
+Apply Bonferroni correction or use tools that handle this automatically.
+
+---
+
+## Common Sample Size Mistakes
+
+### 1. Underpowered tests
+**Problem**: Not enough sample to detect realistic effects
+**Fix**: Be realistic about MDE, get more traffic, or don't test
+
+### 2. Overpowered tests
+**Problem**: Waiting for sample size when you already have significance
+**Fix**: This is actually fine—you committed to sample size, honor it
+
+### 3. Wrong baseline rate
+**Problem**: Using wrong conversion rate for calculation
+**Fix**: Use the specific metric and page, not site-wide averages
+
+### 4. Ignoring segments
+**Problem**: Calculating for full traffic, then analyzing segments
+**Fix**: If you plan segment analysis, calculate sample for smallest segment
+
+### 5. Testing too many things
+**Problem**: Dividing traffic too many ways
+**Fix**: Prioritize ruthlessly, run fewer concurrent tests
+
+---
+
+## When Sample Size Requirements Are Too High
+
+Options when you can't get enough traffic:
+
+1. **Increase MDE**: Accept only detecting larger effects (20%+ lift)
+2. **Lower confidence**: Use 90% instead of 95% (risky, document it)
+3. **Reduce variants**: Test only the most promising variant
+4. **Combine traffic**: Test across multiple similar pages
+5. **Test upstream**: Test earlier in funnel where traffic is higher
+6. **Don't test**: Make decision based on qualitative data instead
+7. **Longer test**: Accept longer duration (weeks/months)
+
+---
+
+## Sequential Testing
+
+If you must check results before reaching sample size:
+
+### What is it?
+Statistical method that adjusts for multiple looks at data.
+
+### When to use
+- High-risk changes
+- Need to stop bad variants early
+- Time-sensitive decisions
+
+### Tools that support it
+- Optimizely (Stats Accelerator)
+- VWO (SmartStats)
+- PostHog (Bayesian approach)
+
+### Tradeoff
+- More flexibility to stop early
+- Slightly larger sample size requirement
+- More complex analysis
+
+---
+
+## Quick Decision Framework
+
+### Can I run this test?
+
+```
+Daily traffic to page: _____
+Baseline conversion rate: _____
+MDE I care about: _____
+
+Sample needed per variant: _____ (from tables above)
+Days to run: Sample / Daily traffic = _____
+
+If days > 60: Consider alternatives
+If days > 30: Acceptable for high-impact tests
+If days < 14: Likely feasible
+If days < 7: Easy to run, consider running longer anyway
+```