Refactor remaining skills for progressive disclosure
Phase 2 refactoring of skills >500 lines and medium-sized skills:

- paid-ads: 553 → 297 lines
  - Extract ad-copy-templates.md, audience-targeting.md, platform-setup-checklists.md
- analytics-tracking: 541 → 292 lines
  - Extract ga4-implementation.md, gtm-implementation.md, event-library.md
- ab-test-setup: 510 → 264 lines
  - Extract test-templates.md, sample-size-guide.md
- copywriting: 458 → 248 lines
  - Extract copy-frameworks.md (headline formulas, section types)
- page-cro: 336 → 180 lines
  - Extract experiments.md (experiment ideas by page type)
- onboarding-cro: 435 → 218 lines
  - Extract experiments.md (onboarding experiment ideas)

All skills now use progressive disclosure with references/ folders, keeping SKILL.md files focused on core workflow while detailed content is available when needed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -14,20 +14,9 @@ If `.claude/product-marketing-context.md` exists, read it before asking question

Before designing a test, understand:

1. **Test Context** - What are you trying to improve? What change are you considering?
2. **Current State** - Baseline conversion rate? Current traffic volume?
3. **Constraints** - Technical complexity? Timeline? Tools available?

---

@@ -41,7 +30,6 @@ Before designing a test, understand:

### 2. Test One Thing
- Single variable per test
- Otherwise you don't know what worked
- Save MVT for later

### 3. Statistical Rigor
- Pre-determine sample size

@@ -67,81 +55,41 @@ for [audience].

We'll know this is true when [metrics].
```

### Example

**Weak**: "Changing the button color might increase clicks."

**Strong**: "Because users report difficulty finding the CTA (per heatmaps and feedback), we believe making the button larger and using contrasting color will increase CTA clicks by 15%+ for new visitors. We'll measure click-through rate from page view to signup start."

---

## Test Types

| Type | Description | Traffic Needed |
|------|-------------|----------------|
| A/B | Two versions, single change | Moderate |
| A/B/n | Multiple variants | Higher |
| MVT | Multiple changes in combinations | Very high |
| Split URL | Different URLs for variants | Moderate |

---

## Sample Size

### Quick Reference

| Baseline | 10% Lift | 20% Lift | 50% Lift |
|----------|----------|----------|----------|
| 1% | 150k/variant | 39k/variant | 6k/variant |
| 3% | 47k/variant | 12k/variant | 2k/variant |
| 5% | 27k/variant | 7k/variant | 1.2k/variant |
| 10% | 12k/variant | 3k/variant | 550/variant |

**Calculators:**
- [Evan Miller's](https://www.evanmiller.org/ab-testing/sample-size.html)
- [Optimizely's](https://www.optimizely.com/sample-size-calculator/)

**For detailed sample size tables and duration calculations**: See [references/sample-size-guide.md](references/sample-size-guide.md)

---

@@ -155,228 +103,106 @@ Maximum: Avoid running too long (novelty effects, external factors)

### Secondary Metrics
- Support primary metric interpretation
- Explain why/how the change worked
- Help understand user behavior

### Guardrail Metrics
- Things that shouldn't get worse
- Revenue, retention, satisfaction
- Stop test if significantly negative

### Example: Pricing Page Test
- **Primary**: Plan selection rate
- **Secondary**: Time on page, plan distribution
- **Guardrail**: Support tickets, refund rate

---

## Designing Variants

### What to Vary

| Category | Examples |
|----------|----------|
| Headlines/Copy | Message angle, value prop, specificity, tone |
| Visual Design | Layout, color, images, hierarchy |
| CTA | Button copy, size, placement, number |
| Content | Information included, order, amount, social proof |

### Best Practices
- Single, meaningful change
- Bold enough to make a difference
- True to the hypothesis

### Documenting Variants

```
Control (A):
- Screenshot
- Description of current state

Variant (B):
- Screenshot or mockup
- Specific changes made
- Hypothesis for why this will win
```

---

## Traffic Allocation

| Approach | Split | When to Use |
|----------|-------|-------------|
| Standard | 50/50 | Default for A/B |
| Conservative | 90/10, 80/20 | Limit risk of bad variant |
| Ramping | Start small, increase | Technical risk mitigation |

**Considerations:**
- Consistency: Users see same variant on return (see the sketch below)
- Segment sizes: Ensure segments are large enough
- Balanced exposure across time of day/week
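
The consistency requirement is commonly met by hashing a stable user ID into a bucket, so assignment is deterministic across visits. A minimal sketch, assuming hash-based bucketing (function and variant names are illustrative, not tied to any specific tool):

```python
import hashlib

def assign_variant(user_id: str, test_name: str, split: int = 50) -> str:
    """Deterministically assign a user to a variant.

    Hashing user_id + test_name gives a stable bucket in [0, 100),
    so returning users always see the same variant and each test
    gets an independent split.
    """
    key = f"{test_name}:{user_id}".encode("utf-8")
    bucket = int(hashlib.md5(key).hexdigest(), 16) % 100
    return "control" if bucket < split else "variant"

# Same input always yields the same assignment:
assert assign_variant("user-123", "hero-cta") == assign_variant("user-123", "hero-cta")
```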

---

## Implementation

### Client-Side
- JavaScript modifies page after load
- Quick to implement, can cause flicker
- Tools: PostHog, Optimizely, VWO

### Server-Side
- Variant determined before render
- No flicker, requires dev work
- Tools: PostHog, LaunchDarkly, Split

---

## Running the Test

### Pre-Launch Checklist

- [ ] Hypothesis documented
- [ ] Primary metric defined
- [ ] Sample size calculated
- [ ] Test duration estimated
- [ ] Variants implemented correctly
- [ ] Tracking verified
- [ ] QA completed on all variants
- [ ] Stakeholders informed

### During the Test

**DO:**
- Monitor for technical issues
- Check segment quality
- Document external factors

**DON'T:**
- Peek at results and stop early
- Make changes to variants
- Add traffic from new sources
- End early because you "know" the answer

### The Peeking Problem
Looking at results before reaching sample size and stopping early leads to false positives and wrong decisions. Pre-commit to sample size and trust the process.

---

## Analyzing Results

### Statistical Significance

- 95% confidence = p-value < 0.05
- Means <5% chance result is random
- Not a guarantee—just a threshold
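
For a quick self-check of significance on conversion counts, a two-proportion z-test is the standard tool. A minimal sketch assuming SciPy is available (the counts are made up for illustration):

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)                # pooled rate under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))  # standard error
    z = (p_b - p_a) / se
    return 2 * (1 - norm.cdf(abs(z)))

# 500/10,000 conversions (control) vs. 580/10,000 (variant):
print(two_proportion_p_value(500, 10_000, 580, 10_000))  # ≈ 0.012 → significant at 95%
```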

### Analysis Checklist

1. **Reach sample size?** If not, result is preliminary
2. **Statistically significant?** Check confidence intervals
3. **Effect size meaningful?** Compare to MDE, project impact
4. **Secondary metrics consistent?** Support the primary?
5. **Guardrail concerns?** Anything get worse?
6. **Segment differences?** Mobile vs. desktop? New vs. returning?

### Interpreting Results

@@ -389,84 +215,15 @@ Statistical ≠ Practical

---

## Documentation

Document every test with:
- Hypothesis
- Variants (with screenshots)
- Results (sample, metrics, significance)
- Decision and learnings

**For templates**: See [references/test-templates.md](references/test-templates.md)

---

@@ -476,19 +233,16 @@ Next steps based on results

- Testing too small a change (undetectable)
- Testing too many things (can't isolate)
- No clear hypothesis
- Wrong audience

### Execution
- Stopping early
- Changing things mid-test
- Not checking implementation
- Uneven traffic allocation

### Analysis
- Ignoring confidence intervals
- Cherry-picking segments
- Over-interpreting inconclusive results
- Not considering practical significance

---

skills/ab-test-setup/references/sample-size-guide.md — new file, 252 lines
@@ -0,0 +1,252 @@

# Sample Size Guide

Reference for calculating sample sizes and test duration.

## Sample Size Fundamentals

### Required Inputs

1. **Baseline conversion rate**: Your current rate
2. **Minimum detectable effect (MDE)**: Smallest change worth detecting
3. **Statistical significance level**: Usually 95% (α = 0.05)
4. **Statistical power**: Usually 80% (β = 0.20)

### What These Mean

**Baseline conversion rate**: If your page converts at 5%, that's your baseline.

**MDE (Minimum Detectable Effect)**: The smallest improvement you care about detecting. Set this based on:
- Business impact (is a 5% lift meaningful?)
- Implementation cost (worth the effort?)
- Realistic expectations (what have past tests shown?)

**Statistical significance (95%)**: Means there's less than 5% chance the observed difference is due to random chance.

**Statistical power (80%)**: Means if there's a real effect of size MDE, you have 80% chance of detecting it.
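
These four inputs plug into the standard two-proportion sample-size approximation. A minimal sketch, assuming SciPy for the normal quantiles (the printed figure comes from this approximation; calculators with different defaults will report different numbers):

```python
from math import sqrt, ceil
from scipy.stats import norm

def sample_size_per_variant(baseline: float, mde: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size per variant for a two-proportion test.

    baseline: current conversion rate (e.g., 0.05)
    mde: relative lift to detect (e.g., 0.20 for a 20% lift)
    """
    p1 = baseline
    p2 = baseline * (1 + mde)
    z_alpha = norm.ppf(1 - alpha / 2)  # e.g., 1.96 for 95% confidence
    z_beta = norm.ppf(power)           # e.g., 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / (p2 - p1) ** 2
    return ceil(n)

print(sample_size_per_variant(0.05, 0.20))  # ≈ 8,158 with these defaults
```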

---

## Sample Size Quick Reference Tables

### Conversion Rate: 1%

| Lift to Detect | Sample per Variant | Total Sample |
|----------------|-------------------|--------------|
| 5% (1% → 1.05%) | 1,500,000 | 3,000,000 |
| 10% (1% → 1.1%) | 380,000 | 760,000 |
| 20% (1% → 1.2%) | 97,000 | 194,000 |
| 50% (1% → 1.5%) | 16,000 | 32,000 |
| 100% (1% → 2%) | 4,200 | 8,400 |

### Conversion Rate: 3%

| Lift to Detect | Sample per Variant | Total Sample |
|----------------|-------------------|--------------|
| 5% (3% → 3.15%) | 480,000 | 960,000 |
| 10% (3% → 3.3%) | 120,000 | 240,000 |
| 20% (3% → 3.6%) | 31,000 | 62,000 |
| 50% (3% → 4.5%) | 5,200 | 10,400 |
| 100% (3% → 6%) | 1,400 | 2,800 |

### Conversion Rate: 5%

| Lift to Detect | Sample per Variant | Total Sample |
|----------------|-------------------|--------------|
| 5% (5% → 5.25%) | 280,000 | 560,000 |
| 10% (5% → 5.5%) | 72,000 | 144,000 |
| 20% (5% → 6%) | 18,000 | 36,000 |
| 50% (5% → 7.5%) | 3,100 | 6,200 |
| 100% (5% → 10%) | 810 | 1,620 |

### Conversion Rate: 10%

| Lift to Detect | Sample per Variant | Total Sample |
|----------------|-------------------|--------------|
| 5% (10% → 10.5%) | 130,000 | 260,000 |
| 10% (10% → 11%) | 34,000 | 68,000 |
| 20% (10% → 12%) | 8,700 | 17,400 |
| 50% (10% → 15%) | 1,500 | 3,000 |
| 100% (10% → 20%) | 400 | 800 |

### Conversion Rate: 20%

| Lift to Detect | Sample per Variant | Total Sample |
|----------------|-------------------|--------------|
| 5% (20% → 21%) | 60,000 | 120,000 |
| 10% (20% → 22%) | 16,000 | 32,000 |
| 20% (20% → 24%) | 4,000 | 8,000 |
| 50% (20% → 30%) | 700 | 1,400 |
| 100% (20% → 40%) | 200 | 400 |

---

## Duration Calculator

### Formula

```
Duration (days) = (Sample per variant × Number of variants) / (Daily traffic × % exposed)
```

### Examples

The scenarios below work the formula by hand; a small code helper follows them.

**Scenario 1: High-traffic page**
- Need: 10,000 per variant (2 variants = 20,000 total)
- Daily traffic: 5,000 visitors
- 100% exposed to test
- Duration: 20,000 / 5,000 = **4 days**

**Scenario 2: Medium-traffic page**
- Need: 30,000 per variant (60,000 total)
- Daily traffic: 2,000 visitors
- 100% exposed
- Duration: 60,000 / 2,000 = **30 days**

**Scenario 3: Low-traffic with partial exposure**
- Need: 15,000 per variant (30,000 total)
- Daily traffic: 500 visitors
- 50% exposed to test
- Effective daily: 250
- Duration: 30,000 / 250 = **120 days** (too long!)
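
The same arithmetic as a tiny helper for quick feasibility checks (a sketch; names are illustrative):

```python
def test_duration_days(sample_per_variant: int, variants: int,
                       daily_traffic: int, exposure: float = 1.0) -> float:
    """Days needed to collect the full sample at current traffic."""
    return (sample_per_variant * variants) / (daily_traffic * exposure)

print(test_duration_days(15_000, 2, 500, exposure=0.5))  # 120.0 — too long
```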

### Minimum Duration Rules

Even with sufficient sample size, run tests for at least:
- **1 full week**: To capture day-of-week variation
- **2 business cycles**: If B2B (weekday vs. weekend patterns)
- **Through paydays**: If e-commerce (beginning/end of month)

### Maximum Duration Guidelines

Avoid running tests longer than 4-8 weeks:
- Novelty effects wear off
- External factors intervene
- Opportunity cost of other tests

---

## Online Calculators

### Recommended Tools

**Evan Miller's Calculator**
https://www.evanmiller.org/ab-testing/sample-size.html
- Simple interface
- Bookmark-worthy

**Optimizely's Calculator**
https://www.optimizely.com/sample-size-calculator/
- Business-friendly language
- Duration estimates

**AB Test Guide Calculator**
https://www.abtestguide.com/calc/
- Includes Bayesian option
- Multiple test types

**VWO Duration Calculator**
https://vwo.com/tools/ab-test-duration-calculator/
- Duration-focused
- Good for planning

---

## Adjusting for Multiple Variants

With more than 2 variants (A/B/n tests), you need more sample:

| Variants | Multiplier |
|----------|------------|
| 2 (A/B) | 1x |
| 3 (A/B/C) | ~1.5x |
| 4 (A/B/C/D) | ~2x |
| 5+ | Consider reducing variants |

**Why?** More comparisons increase the chance of false positives. You're comparing:
- A vs B
- A vs C
- B vs C (sometimes)

Apply Bonferroni correction or use tools that handle this automatically.
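
The Bonferroni correction just divides the significance threshold by the number of comparisons. A minimal illustration:

```python
def bonferroni_alpha(alpha: float, comparisons: int) -> float:
    """Per-comparison threshold that keeps the family-wise error rate at alpha."""
    return alpha / comparisons

# A/B/C test comparing each variant to control (2 comparisons):
print(bonferroni_alpha(0.05, 2))  # 0.025 — each p-value must beat this
```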

---

## Common Sample Size Mistakes

### 1. Underpowered tests
**Problem**: Not enough sample to detect realistic effects
**Fix**: Be realistic about MDE, get more traffic, or don't test

### 2. Overpowered tests
**Problem**: Waiting for sample size when you already have significance
**Fix**: This is actually fine—you committed to sample size, honor it

### 3. Wrong baseline rate
**Problem**: Using the wrong conversion rate for the calculation
**Fix**: Use the specific metric and page, not site-wide averages

### 4. Ignoring segments
**Problem**: Calculating for full traffic, then analyzing segments
**Fix**: If you plan segment analysis, calculate sample for the smallest segment

### 5. Testing too many things
**Problem**: Dividing traffic too many ways
**Fix**: Prioritize ruthlessly, run fewer concurrent tests

---

## When Sample Size Requirements Are Too High

Options when you can't get enough traffic:

1. **Increase MDE**: Accept only detecting larger effects (20%+ lift)
2. **Lower confidence**: Use 90% instead of 95% (risky, document it)
3. **Reduce variants**: Test only the most promising variant
4. **Combine traffic**: Test across multiple similar pages
5. **Test upstream**: Test earlier in the funnel where traffic is higher
6. **Don't test**: Make the decision based on qualitative data instead
7. **Longer test**: Accept a longer duration (weeks/months)

---

## Sequential Testing

If you must check results before reaching sample size:

### What is it?
A statistical method that adjusts for multiple looks at the data.

### When to use
- High-risk changes
- Need to stop bad variants early
- Time-sensitive decisions

### Tools that support it
- Optimizely (Stats Accelerator)
- VWO (SmartStats)
- PostHog (Bayesian approach)

### Tradeoff
- More flexibility to stop early
- Slightly larger sample size requirement
- More complex analysis

---

## Quick Decision Framework

### Can I run this test?

```
Daily traffic to page: _____
Baseline conversion rate: _____
MDE I care about: _____

Sample needed per variant: _____ (from tables above)
Days to run: Sample / Daily traffic = _____

If days > 60: Consider alternatives
If days > 30: Acceptable for high-impact tests
If days < 14: Likely feasible
If days < 7: Easy to run, consider running longer anyway
```
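
The framework's thresholds can be wrapped in a small helper (a sketch; the 14-30 day band is left to judgment in the framework above, so the code flags it explicitly):

```python
def verdict(days: float) -> str:
    """Map an estimated test duration to the framework's guidance."""
    if days > 60:
        return "consider alternatives"
    if days > 30:
        return "acceptable for high-impact tests"
    if days < 7:
        return "easy to run; consider running longer anyway"
    if days < 14:
        return "likely feasible"
    return "14-30 days: judgment call"  # band not covered by the framework

print(verdict(20_000 / 5_000))  # scenario 1 above → "easy to run; ..."
```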
skills/ab-test-setup/references/test-templates.md — new file, 268 lines
@@ -0,0 +1,268 @@

# A/B Test Templates Reference

Templates for planning, documenting, and analyzing experiments.

## Test Plan Template

```markdown
# A/B Test: [Name]

## Overview
- **Owner**: [Name]
- **Test ID**: [ID in testing tool]
- **Page/Feature**: [What's being tested]
- **Planned dates**: [Start] - [End]

## Hypothesis

Because [observation/data],
we believe [change]
will cause [expected outcome]
for [audience].
We'll know this is true when [metrics].

## Test Design

| Element | Details |
|---------|---------|
| Test type | A/B / A/B/n / MVT |
| Duration | X weeks |
| Sample size | X per variant |
| Traffic allocation | 50/50 |
| Tool | [Tool name] |
| Implementation | Client-side / Server-side |

## Variants

### Control (A)
[Screenshot]
- Current experience
- [Key details about current state]

### Variant (B)
[Screenshot or mockup]
- [Specific change #1]
- [Specific change #2]
- Rationale: [Why we think this will win]

## Metrics

### Primary
- **Metric**: [metric name]
- **Definition**: [how it's calculated]
- **Current baseline**: [X%]
- **Minimum detectable effect**: [X%]

### Secondary
- [Metric 1]: [what it tells us]
- [Metric 2]: [what it tells us]
- [Metric 3]: [what it tells us]

### Guardrails
- [Metric that shouldn't get worse]
- [Another safety metric]

## Segment Analysis Plan
- Mobile vs. desktop
- New vs. returning visitors
- Traffic source
- [Other relevant segments]

## Success Criteria
- Winner: [Primary metric improves by X% with 95% confidence]
- Loser: [Primary metric decreases significantly]
- Inconclusive: [What we'll do if no significant result]

## Pre-Launch Checklist
- [ ] Hypothesis documented and reviewed
- [ ] Primary metric defined and trackable
- [ ] Sample size calculated
- [ ] Test duration estimated
- [ ] Variants implemented correctly
- [ ] Tracking verified in all variants
- [ ] QA completed on all variants
- [ ] Stakeholders informed
- [ ] Calendar hold for analysis date
```

---

## Results Documentation Template

```markdown
# A/B Test Results: [Name]

## Summary
| Element | Value |
|---------|-------|
| Test ID | [ID] |
| Dates | [Start] - [End] |
| Duration | X days |
| Result | Winner / Loser / Inconclusive |
| Decision | [What we're doing] |

## Hypothesis (Reminder)
[Copy from test plan]

## Results

### Sample Size
| Variant | Target | Actual | % of target |
|---------|--------|--------|-------------|
| Control | X | Y | Z% |
| Variant | X | Y | Z% |

### Primary Metric: [Metric Name]
| Variant | Value | 95% CI | vs. Control |
|---------|-------|--------|-------------|
| Control | X% | [X%, Y%] | — |
| Variant | X% | [X%, Y%] | +X% |

**Statistical significance**: p = X.XX (significant at 95%: yes/no)
**Practical significance**: [Is this lift meaningful for the business?]

### Secondary Metrics

| Metric | Control | Variant | Change | Significant? |
|--------|---------|---------|--------|--------------|
| [Metric 1] | X | Y | +Z% | Yes/No |
| [Metric 2] | X | Y | +Z% | Yes/No |

### Guardrail Metrics

| Metric | Control | Variant | Change | Concern? |
|--------|---------|---------|--------|----------|
| [Metric 1] | X | Y | +Z% | Yes/No |

### Segment Analysis

**Mobile vs. Desktop**
| Segment | Control | Variant | Lift |
|---------|---------|---------|------|
| Mobile | X% | Y% | +Z% |
| Desktop | X% | Y% | +Z% |

**New vs. Returning**
| Segment | Control | Variant | Lift |
|---------|---------|---------|------|
| New | X% | Y% | +Z% |
| Returning | X% | Y% | +Z% |

## Interpretation

### What happened?
[Explanation of results in plain language]

### Why do we think this happened?
[Analysis and reasoning]

### Caveats
[Any limitations, external factors, or concerns]

## Decision

**Winner**: [Control / Variant]

**Action**: [Implement variant / Keep control / Re-test]

**Timeline**: [When changes will be implemented]

## Learnings

### What we learned
- [Key insight 1]
- [Key insight 2]

### What to test next
- [Follow-up test idea 1]
- [Follow-up test idea 2]

### Impact
- **Projected lift**: [X% improvement in Y metric]
- **Business impact**: [Revenue, conversions, etc.]
```

---

## Test Repository Entry Template

For tracking all tests in a central location:

```markdown
| Test ID | Name | Page | Dates | Primary Metric | Result | Lift | Link |
|---------|------|------|-------|----------------|--------|------|------|
| 001 | Hero headline test | Homepage | 1/1-1/15 | CTR | Winner | +12% | [Link] |
| 002 | Pricing table layout | Pricing | 1/10-1/31 | Plan selection | Loser | -5% | [Link] |
| 003 | Signup form fields | Signup | 2/1-2/14 | Completion | Inconclusive | +2% | [Link] |
```

---

## Quick Test Brief Template

For simple tests that don't need full documentation:

```markdown
## [Test Name]

**What**: [One sentence description]
**Why**: [One sentence hypothesis]
**Metric**: [Primary metric]
**Duration**: [X weeks]
**Result**: [TBD / Winner / Loser / Inconclusive]
**Learnings**: [Key takeaway]
```

---

## Stakeholder Update Template

```markdown
## A/B Test Update: [Name]

**Status**: Running / Complete
**Days remaining**: X (or complete)
**Current sample**: X% of target

### Preliminary observations
[What we're seeing - without making decisions yet]

### Next steps
[What happens next]

### Timeline
- [Date]: Analysis complete
- [Date]: Decision and recommendation
- [Date]: Implementation (if winner)
```

---

## Experiment Prioritization Scorecard

For deciding which tests to run (a weighted-score sketch follows the table):

| Factor | Weight | Test A | Test B | Test C |
|--------|--------|--------|--------|--------|
| Potential impact | 30% | | | |
| Confidence in hypothesis | 25% | | | |
| Ease of implementation | 20% | | | |
| Risk if wrong | 15% | | | |
| Strategic alignment | 10% | | | |
| **Total** | | | | |

Scoring: 1-5 (5 = best)
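
To turn the scorecard into a number, multiply each 1-5 score by its weight and sum. A minimal sketch (the factor keys are shorthand for the rows above):

```python
WEIGHTS = {"impact": 0.30, "confidence": 0.25, "ease": 0.20,
           "risk": 0.15, "alignment": 0.10}

def priority_score(scores: dict[str, int]) -> float:
    """Weighted total on the 1-5 scale; higher means run it sooner."""
    return sum(WEIGHTS[factor] * scores[factor] for factor in WEIGHTS)

# Example: high-impact, fairly confident, moderately easy test
print(priority_score({"impact": 5, "confidence": 4, "ease": 3,
                      "risk": 2, "alignment": 4}))  # 3.8
```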

---

## Hypothesis Bank Template

For collecting test ideas:

```markdown
| ID | Page/Area | Observation | Hypothesis | Potential Impact | Status |
|----|-----------|-------------|------------|------------------|--------|
| H1 | Homepage | Low scroll depth | Shorter hero will increase scroll | High | Testing |
| H2 | Pricing | Users compare plans | Comparison table will help | Medium | Backlog |
| H3 | Signup | Drop-off at email | Social login will increase completion | Medium | Backlog |
```