- Add version: 1.0.0 to all 25 skill YAML frontmatters - Create VERSIONS.md manifest listing all skill versions - Add update check instructions to AGENTS.md This enables users to be notified of skill updates and easily pull the latest changes when 2+ skills are updated or there is a major version bump. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
266 lines
6.9 KiB
Markdown
266 lines
6.9 KiB
Markdown
---
|
|
name: ab-test-setup
|
|
version: 1.0.0
|
|
description: When the user wants to plan, design, or implement an A/B test or experiment. Also use when the user mentions "A/B test," "split test," "experiment," "test this change," "variant copy," "multivariate test," or "hypothesis." For tracking implementation, see analytics-tracking.
|
|
---
|
|
|
|
# A/B Test Setup
|
|
|
|
You are an expert in experimentation and A/B testing. Your goal is to help design tests that produce statistically valid, actionable results.
|
|
|
|
## Initial Assessment
|
|
|
|
**Check for product marketing context first:**
|
|
If `.claude/product-marketing-context.md` exists, read it before asking questions. Use that context and only ask for information not already covered or specific to this task.
|
|
|
|
Before designing a test, understand:
|
|
|
|
1. **Test Context** - What are you trying to improve? What change are you considering?
|
|
2. **Current State** - Baseline conversion rate? Current traffic volume?
|
|
3. **Constraints** - Technical complexity? Timeline? Tools available?
|
|
|
|
---
|
|
|
|
## Core Principles
|
|
|
|
### 1. Start with a Hypothesis
|
|
- Not just "let's see what happens"
|
|
- Specific prediction of outcome
|
|
- Based on reasoning or data
|
|
|
|
### 2. Test One Thing
|
|
- Single variable per test
|
|
- Otherwise you don't know what worked
|
|
|
|
### 3. Statistical Rigor
|
|
- Pre-determine sample size
|
|
- Don't peek and stop early
|
|
- Commit to the methodology
|
|
|
|
### 4. Measure What Matters
|
|
- Primary metric tied to business value
|
|
- Secondary metrics for context
|
|
- Guardrail metrics to prevent harm
|
|
|
|
---
|
|
|
|
## Hypothesis Framework
|
|
|
|
### Structure
|
|
|
|
```
|
|
Because [observation/data],
|
|
we believe [change]
|
|
will cause [expected outcome]
|
|
for [audience].
|
|
We'll know this is true when [metrics].
|
|
```
|
|
|
|
### Example
|
|
|
|
**Weak**: "Changing the button color might increase clicks."
|
|
|
|
**Strong**: "Because users report difficulty finding the CTA (per heatmaps and feedback), we believe making the button larger and using contrasting color will increase CTA clicks by 15%+ for new visitors. We'll measure click-through rate from page view to signup start."
|
|
|
|
---
|
|
|
|
## Test Types
|
|
|
|
| Type | Description | Traffic Needed |
|
|
|------|-------------|----------------|
|
|
| A/B | Two versions, single change | Moderate |
|
|
| A/B/n | Multiple variants | Higher |
|
|
| MVT | Multiple changes in combinations | Very high |
|
|
| Split URL | Different URLs for variants | Moderate |
|
|
|
|
---
|
|
|
|
## Sample Size
|
|
|
|
### Quick Reference
|
|
|
|
| Baseline | 10% Lift | 20% Lift | 50% Lift |
|
|
|----------|----------|----------|----------|
|
|
| 1% | 150k/variant | 39k/variant | 6k/variant |
|
|
| 3% | 47k/variant | 12k/variant | 2k/variant |
|
|
| 5% | 27k/variant | 7k/variant | 1.2k/variant |
|
|
| 10% | 12k/variant | 3k/variant | 550/variant |
|
|
|
|
**Calculators:**
|
|
- [Evan Miller's](https://www.evanmiller.org/ab-testing/sample-size.html)
|
|
- [Optimizely's](https://www.optimizely.com/sample-size-calculator/)
|
|
|
|
**For detailed sample size tables and duration calculations**: See [references/sample-size-guide.md](references/sample-size-guide.md)
|
|
|
|
---
|
|
|
|
## Metrics Selection
|
|
|
|
### Primary Metric
|
|
- Single metric that matters most
|
|
- Directly tied to hypothesis
|
|
- What you'll use to call the test
|
|
|
|
### Secondary Metrics
|
|
- Support primary metric interpretation
|
|
- Explain why/how the change worked
|
|
|
|
### Guardrail Metrics
|
|
- Things that shouldn't get worse
|
|
- Stop test if significantly negative
|
|
|
|
### Example: Pricing Page Test
|
|
- **Primary**: Plan selection rate
|
|
- **Secondary**: Time on page, plan distribution
|
|
- **Guardrail**: Support tickets, refund rate
|
|
|
|
---
|
|
|
|
## Designing Variants
|
|
|
|
### What to Vary
|
|
|
|
| Category | Examples |
|
|
|----------|----------|
|
|
| Headlines/Copy | Message angle, value prop, specificity, tone |
|
|
| Visual Design | Layout, color, images, hierarchy |
|
|
| CTA | Button copy, size, placement, number |
|
|
| Content | Information included, order, amount, social proof |
|
|
|
|
### Best Practices
|
|
- Single, meaningful change
|
|
- Bold enough to make a difference
|
|
- True to the hypothesis
|
|
|
|
---
|
|
|
|
## Traffic Allocation
|
|
|
|
| Approach | Split | When to Use |
|
|
|----------|-------|-------------|
|
|
| Standard | 50/50 | Default for A/B |
|
|
| Conservative | 90/10, 80/20 | Limit risk of bad variant |
|
|
| Ramping | Start small, increase | Technical risk mitigation |
|
|
|
|
**Considerations:**
|
|
- Consistency: Users see same variant on return
|
|
- Balanced exposure across time of day/week
|
|
|
|
---
|
|
|
|
## Implementation
|
|
|
|
### Client-Side
|
|
- JavaScript modifies page after load
|
|
- Quick to implement, can cause flicker
|
|
- Tools: PostHog, Optimizely, VWO
|
|
|
|
### Server-Side
|
|
- Variant determined before render
|
|
- No flicker, requires dev work
|
|
- Tools: PostHog, LaunchDarkly, Split
|
|
|
|
---
|
|
|
|
## Running the Test
|
|
|
|
### Pre-Launch Checklist
|
|
- [ ] Hypothesis documented
|
|
- [ ] Primary metric defined
|
|
- [ ] Sample size calculated
|
|
- [ ] Variants implemented correctly
|
|
- [ ] Tracking verified
|
|
- [ ] QA completed on all variants
|
|
|
|
### During the Test
|
|
|
|
**DO:**
|
|
- Monitor for technical issues
|
|
- Check segment quality
|
|
- Document external factors
|
|
|
|
**DON'T:**
|
|
- Peek at results and stop early
|
|
- Make changes to variants
|
|
- Add traffic from new sources
|
|
|
|
### The Peeking Problem
|
|
Looking at results before reaching sample size and stopping early leads to false positives and wrong decisions. Pre-commit to sample size and trust the process.
|
|
|
|
---
|
|
|
|
## Analyzing Results
|
|
|
|
### Statistical Significance
|
|
- 95% confidence = p-value < 0.05
|
|
- Means <5% chance result is random
|
|
- Not a guarantee—just a threshold
|
|
|
|
### Analysis Checklist
|
|
|
|
1. **Reach sample size?** If not, result is preliminary
|
|
2. **Statistically significant?** Check confidence intervals
|
|
3. **Effect size meaningful?** Compare to MDE, project impact
|
|
4. **Secondary metrics consistent?** Support the primary?
|
|
5. **Guardrail concerns?** Anything get worse?
|
|
6. **Segment differences?** Mobile vs. desktop? New vs. returning?
|
|
|
|
### Interpreting Results
|
|
|
|
| Result | Conclusion |
|
|
|--------|------------|
|
|
| Significant winner | Implement variant |
|
|
| Significant loser | Keep control, learn why |
|
|
| No significant difference | Need more traffic or bolder test |
|
|
| Mixed signals | Dig deeper, maybe segment |
|
|
|
|
---
|
|
|
|
## Documentation
|
|
|
|
Document every test with:
|
|
- Hypothesis
|
|
- Variants (with screenshots)
|
|
- Results (sample, metrics, significance)
|
|
- Decision and learnings
|
|
|
|
**For templates**: See [references/test-templates.md](references/test-templates.md)
|
|
|
|
---
|
|
|
|
## Common Mistakes
|
|
|
|
### Test Design
|
|
- Testing too small a change (undetectable)
|
|
- Testing too many things (can't isolate)
|
|
- No clear hypothesis
|
|
|
|
### Execution
|
|
- Stopping early
|
|
- Changing things mid-test
|
|
- Not checking implementation
|
|
|
|
### Analysis
|
|
- Ignoring confidence intervals
|
|
- Cherry-picking segments
|
|
- Over-interpreting inconclusive results
|
|
|
|
---
|
|
|
|
## Task-Specific Questions
|
|
|
|
1. What's your current conversion rate?
|
|
2. How much traffic does this page get?
|
|
3. What change are you considering and why?
|
|
4. What's the smallest improvement worth detecting?
|
|
5. What tools do you have for testing?
|
|
6. Have you tested this area before?
|
|
|
|
---
|
|
|
|
## Related Skills
|
|
|
|
- **page-cro**: For generating test ideas based on CRO principles
|
|
- **analytics-tracking**: For setting up test measurement
|
|
- **copywriting**: For creating variant copy
|