Refactor remaining skills for progressive disclosure

Phase 2 refactoring of skills >500 lines and medium-sized skills:

- paid-ads: 553 → 297 lines
  - Extract ad-copy-templates.md, audience-targeting.md, platform-setup-checklists.md

- analytics-tracking: 541 → 292 lines
  - Extract ga4-implementation.md, gtm-implementation.md, event-library.md

- ab-test-setup: 510 → 264 lines
  - Extract test-templates.md, sample-size-guide.md

- copywriting: 458 → 248 lines
  - Extract copy-frameworks.md (headline formulas, section types)

- page-cro: 336 → 180 lines
  - Extract experiments.md (experiment ideas by page type)

- onboarding-cro: 435 → 218 lines
  - Extract experiments.md (onboarding experiment ideas)

All skills now use progressive disclosure with references/ folders,
keeping SKILL.md files focused on core workflow while detailed
content is available when needed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Corey Haines
2026-01-26 16:59:23 -08:00
parent c29ee7e6db
commit 98e74b79d7
17 changed files with 3356 additions and 1721 deletions


@@ -14,20 +14,9 @@ If `.claude/product-marketing-context.md` exists, read it before asking question
Before designing a test, understand:
1. **Test Context**
- What are you trying to improve?
- What change are you considering?
- What made you want to test this?
2. **Current State**
- Baseline conversion rate?
- Current traffic volume?
- Any historical test data?
3. **Constraints**
- Technical implementation complexity?
- Timeline requirements?
- Tools available?
1. **Test Context** - What are you trying to improve? What change are you considering?
2. **Current State** - Baseline conversion rate? Current traffic volume?
3. **Constraints** - Technical complexity? Timeline? Tools available?
---
@@ -41,7 +30,6 @@ Before designing a test, understand:
### 2. Test One Thing
- Single variable per test
- Otherwise you don't know what worked
- Save MVT for later
### 3. Statistical Rigor
- Pre-determine sample size
@@ -67,81 +55,41 @@ for [audience].
We'll know this is true when [metrics].
```
### Examples
### Example
**Weak hypothesis:**
"Changing the button color might increase clicks."
**Weak**: "Changing the button color might increase clicks."
**Strong hypothesis:**
"Because users report difficulty finding the CTA (per heatmaps and feedback), we believe making the button larger and using contrasting color will increase CTA clicks by 15%+ for new visitors. We'll measure click-through rate from page view to signup start."
### Good Hypotheses Include
- **Observation**: What prompted this idea
- **Change**: Specific modification
- **Effect**: Expected outcome and direction
- **Audience**: Who this applies to
- **Metric**: How you'll measure success
**Strong**: "Because users report difficulty finding the CTA (per heatmaps and feedback), we believe making the button larger and using contrasting color will increase CTA clicks by 15%+ for new visitors. We'll measure click-through rate from page view to signup start."
---
## Test Types
### A/B Test (Split Test)
- Two versions: Control (A) vs. Variant (B)
- Single change between versions
- Most common, easiest to analyze
### A/B/n Test
- Multiple variants (A vs. B vs. C...)
- Requires more traffic
- Good for testing several options
### Multivariate Test (MVT)
- Multiple changes in combinations
- Tests interactions between changes
- Requires significantly more traffic
- Complex analysis
### Split URL Test
- Different URLs for variants
- Good for major page changes
- Easier implementation sometimes
| Type | Description | Traffic Needed |
|------|-------------|----------------|
| A/B | Two versions, single change | Moderate |
| A/B/n | Multiple variants | Higher |
| MVT | Multiple changes in combinations | Very high |
| Split URL | Different URLs for variants | Moderate |
---
## Sample Size Calculation
### Inputs Needed
1. **Baseline conversion rate**: Your current rate
2. **Minimum detectable effect (MDE)**: Smallest change worth detecting
3. **Statistical significance level**: Usually 95%
4. **Statistical power**: Usually 80%
## Sample Size
### Quick Reference
| Baseline Rate | 10% Lift | 20% Lift | 50% Lift |
|---------------|----------|----------|----------|
| Baseline | 10% Lift | 20% Lift | 50% Lift |
|----------|----------|----------|----------|
| 1% | 150k/variant | 39k/variant | 6k/variant |
| 3% | 47k/variant | 12k/variant | 2k/variant |
| 5% | 27k/variant | 7k/variant | 1.2k/variant |
| 10% | 12k/variant | 3k/variant | 550/variant |
### Formula Resources
- Evan Miller's calculator: https://www.evanmiller.org/ab-testing/sample-size.html
- Optimizely's calculator: https://www.optimizely.com/sample-size-calculator/
**Calculators:**
- [Evan Miller's](https://www.evanmiller.org/ab-testing/sample-size.html)
- [Optimizely's](https://www.optimizely.com/sample-size-calculator/)
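For intuition about where the table's numbers come from, here is a minimal sketch of the standard two-proportion sample-size approximation (the calculators above use their own variants of this math, so results will differ slightly):

```python
# Minimal sketch: visitors per variant needed to detect a relative lift over a
# baseline conversion rate, using the two-proportion z-test approximation.
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    p1 = baseline
    p2 = baseline * (1 + relative_lift)            # minimum detectable effect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 95% significance (two-sided)
    z_power = NormalDist().inv_cdf(power)          # 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

# 3% baseline, 20% relative lift -> roughly 12-14k visitors per variant
print(sample_size_per_variant(0.03, 0.20))
```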
### Test Duration
```
Duration (days) = (Sample size needed per variant × Number of variants) ÷ Daily traffic to test page
```
Minimum: 1-2 business cycles (usually 1-2 weeks)
Maximum: Avoid running too long (novelty effects, external factors)
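A quick worked example of the duration formula, with hypothetical traffic numbers:

```python
# Hypothetical inputs: 12,000 visitors needed per variant, 2 variants,
# 2,000 daily visitors to the test page.
sample_per_variant = 12_000
variants = 2
daily_traffic = 2_000

duration_days = sample_per_variant * variants / daily_traffic  # = 12 days
print(f"Run for about {duration_days:.0f} days; round up to full weeks")
```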
**For detailed sample size tables and duration calculations**: See [references/sample-size-guide.md](references/sample-size-guide.md)
---
@@ -155,228 +103,106 @@ Maximum: Avoid running too long (novelty effects, external factors)
### Secondary Metrics
- Support primary metric interpretation
- Explain why/how the change worked
- Help understand user behavior
### Guardrail Metrics
- Things that shouldn't get worse
- Revenue, retention, satisfaction
- Stop test if significantly negative
### Metric Examples by Test Type
**Homepage CTA test:**
- Primary: CTA click-through rate
- Secondary: Time to click, scroll depth
- Guardrail: Bounce rate, downstream conversion
**Pricing page test:**
- Primary: Plan selection rate
- Secondary: Time on page, plan distribution
- Guardrail: Support tickets, refund rate
**Signup flow test:**
- Primary: Signup completion rate
- Secondary: Field-level completion, time to complete
- Guardrail: User activation rate (post-signup quality)
### Example: Pricing Page Test
- **Primary**: Plan selection rate
- **Secondary**: Time on page, plan distribution
- **Guardrail**: Support tickets, refund rate
---
## Designing Variants
### Control (A)
- Current experience, unchanged
- Don't modify during test
### What to Vary
### Variant (B+)
| Category | Examples |
|----------|----------|
| Headlines/Copy | Message angle, value prop, specificity, tone |
| Visual Design | Layout, color, images, hierarchy |
| CTA | Button copy, size, placement, number |
| Content | Information included, order, amount, social proof |
**Best practices:**
### Best Practices
- Single, meaningful change
- Bold enough to make a difference
- True to the hypothesis
**What to vary:**
Headlines/Copy:
- Message angle
- Value proposition
- Specificity level
- Tone/voice
Visual Design:
- Layout structure
- Color and contrast
- Image selection
- Visual hierarchy
CTA:
- Button copy
- Size/prominence
- Placement
- Number of CTAs
Content:
- Information included
- Order of information
- Amount of content
- Social proof type
### Documenting Variants
```
Control (A):
- Screenshot
- Description of current state
Variant (B):
- Screenshot or mockup
- Specific changes made
- Hypothesis for why this will win
```
---
## Traffic Allocation
### Standard Split
- 50/50 for A/B test
- Equal split for multiple variants
| Approach | Split | When to Use |
|----------|-------|-------------|
| Standard | 50/50 | Default for A/B |
| Conservative | 90/10, 80/20 | Limit risk of bad variant |
| Ramping | Start small, increase | Technical risk mitigation |
### Conservative Rollout
- 90/10 or 80/20 initially
- Limits risk of bad variant
- Longer to reach significance
### Ramping
- Start small, increase over time
- Good for technical risk mitigation
- Most tools support this
### Considerations
**Considerations:**
- Consistency: Users see same variant on return (see the sketch below)
- Segment sizes: Ensure segments are large enough
- Time of day/week: Balanced exposure
- Balanced exposure across time of day/week
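A minimal sketch of deterministic bucketing, which is roughly how testing tools keep returning users in the same variant; the helper below is illustrative, not any specific tool's API:

```python
import hashlib

def assign_variant(user_id: str, experiment: str, control_share: float = 0.5) -> str:
    """Hash user + experiment into a stable value in [0, 1) and bucket it."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000
    return "control" if bucket < control_share else "variant"

# Same user, same experiment -> same bucket on every visit (consistency).
# Set control_share=0.9 for a conservative 90/10 rollout.
print(assign_variant("user-123", "homepage-cta-test"))
```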
---
## Implementation Approaches
## Implementation
### Client-Side Testing
**Tools**: PostHog, Optimizely, VWO, custom
**How it works**:
### Client-Side
- JavaScript modifies page after load
- Quick to implement
- Can cause flicker
- Quick to implement, can cause flicker
- Tools: PostHog, Optimizely, VWO
**Best for**:
- Marketing pages
- Copy/visual changes
- Quick iteration
### Server-Side Testing
**Tools**: PostHog, LaunchDarkly, Split, custom
**How it works**:
- Variant determined before page renders
- No flicker
- Requires development work
**Best for**:
- Product features
- Complex changes
- Performance-sensitive pages
### Feature Flags
- Binary on/off (not true A/B)
- Good for rollouts
- Can convert to A/B with percentage split
### Server-Side
- Variant determined before render
- No flicker, requires dev work
- Tools: PostHog, LaunchDarkly, Split
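As an illustrative sketch of the server-side flow (placeholder names, not PostHog/LaunchDarkly/Split APIs), the variant is decided before the page is rendered, so the user never sees the control flash:

```python
import hashlib

def server_side_variant(user_id: str, experiment: str) -> str:
    # Same hash-bucket idea as the traffic allocation sketch above.
    bucket = int(hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()[:8], 16) / 0x100000000
    return "control" if bucket < 0.5 else "variant"

def choose_template(user_id: str) -> str:
    # Variant is known before rendering -> serve the right template directly (no flicker).
    variant = server_side_variant(user_id, "pricing-layout-test")
    return "pricing-control.html" if variant == "control" else "pricing-variant.html"

print(choose_template("user-123"))
```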
---
## Running the Test
### Pre-Launch Checklist
- [ ] Hypothesis documented
- [ ] Primary metric defined
- [ ] Sample size calculated
- [ ] Test duration estimated
- [ ] Variants implemented correctly
- [ ] Tracking verified
- [ ] QA completed on all variants
- [ ] Stakeholders informed
### During the Test
**DO:**
- Monitor for technical issues
- Check segment quality
- Document any external factors
- Document external factors
**DON'T:**
- Peek at results and stop early
- Make changes to variants
- Add traffic from new sources
- End early because you "know" the answer
### Peeking Problem
Looking at results before reaching sample size and stopping when you see significance leads to:
- False positives
- Inflated effect sizes
- Wrong decisions
**Solutions:**
- Pre-commit to sample size and stick to it
- Use sequential testing if you must peek
- Trust the process
### The Peeking Problem
Looking at results before reaching sample size and stopping early leads to false positives and wrong decisions. Pre-commit to sample size and trust the process.
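A small simulation makes the point concrete: both variants below are identical (an A/A test), yet stopping at the first "significant" peek pushes the false-positive rate well above the nominal 5%. This is an illustrative sketch with made-up traffic numbers:

```python
import math
import random
from statistics import NormalDist

def is_significant(conv_a, conv_b, n_per_variant, alpha=0.05):
    p_a, p_b = conv_a / n_per_variant, conv_b / n_per_variant
    pooled = (conv_a + conv_b) / (2 * n_per_variant)
    if pooled in (0.0, 1.0):
        return False
    se = math.sqrt(pooled * (1 - pooled) * 2 / n_per_variant)
    return 2 * (1 - NormalDist().cdf(abs(p_b - p_a) / se)) < alpha

random.seed(1)
runs, peeks, batch, true_rate = 500, 20, 500, 0.05
false_positives = 0
for _ in range(runs):
    conv_a = conv_b = n = 0
    for _ in range(peeks):                      # peek after every batch of visitors
        conv_a += sum(random.random() < true_rate for _ in range(batch))
        conv_b += sum(random.random() < true_rate for _ in range(batch))
        n += batch
        if is_significant(conv_a, conv_b, n):   # stop at the first "win"
            false_positives += 1
            break
print(f"False-positive rate with peeking: {false_positives / runs:.0%}")
```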
---
## Analyzing Results
### Statistical Significance
- 95% confidence = p-value < 0.05
- Means: <5% chance result is random
- Means <5% chance result is random
- Not a guarantee—just a threshold
### Practical Significance
### Analysis Checklist
Statistical ≠ Practical
- Is the effect size meaningful for business?
- Is it worth the implementation cost?
- Is it sustainable over time?
### What to Look At
1. **Did you reach sample size?**
- If not, result is preliminary
2. **Is it statistically significant?**
- Check confidence intervals
- Check p-value
3. **Is the effect size meaningful?**
- Compare to your MDE
- Project business impact
4. **Are secondary metrics consistent?**
- Do they support the primary?
- Any unexpected effects?
5. **Any guardrail concerns?**
- Did anything get worse?
- Long-term risks?
6. **Segment differences?**
- Mobile vs. desktop?
- New vs. returning?
- Traffic source?
1. **Reach sample size?** If not, result is preliminary
2. **Statistically significant?** Check confidence intervals (see the sketch below)
3. **Effect size meaningful?** Compare to MDE, project impact
4. **Secondary metrics consistent?** Support the primary?
5. **Guardrail concerns?** Anything get worse?
6. **Segment differences?** Mobile vs. desktop? New vs. returning?
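For illustration, a minimal sketch of the significance check in item 2: a two-proportion z-test plus a confidence interval for the absolute lift. Your testing tool's stats engine is the source of truth; the numbers below are hypothetical.

```python
import math
from statistics import NormalDist

def ab_result(conv_a, n_a, conv_b, n_b, alpha=0.05):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    z = (p_b - p_a) / math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))               # two-sided
    se_diff = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se_diff
    return p_value, (p_b - p_a - margin, p_b - p_a + margin)

# Hypothetical results: 4.0% vs 4.67% conversion on 12k visitors per variant
p, (lo, hi) = ab_result(conv_a=480, n_a=12_000, conv_b=560, n_b=12_000)
print(f"p-value = {p:.3f}; 95% CI for absolute lift: {lo:.4f} to {hi:.4f}")
```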
### Interpreting Results
@@ -389,84 +215,15 @@ Statistical ≠ Practical
---
## Documenting and Learning
## Documentation
### Test Documentation
Document every test with:
- Hypothesis
- Variants (with screenshots)
- Results (sample, metrics, significance)
- Decision and learnings
```
Test Name: [Name]
Test ID: [ID in testing tool]
Dates: [Start] - [End]
Owner: [Name]
Hypothesis:
[Full hypothesis statement]
Variants:
- Control: [Description + screenshot]
- Variant: [Description + screenshot]
Results:
- Sample size: [achieved vs. target]
- Primary metric: [control] vs. [variant] ([% change], [confidence])
- Secondary metrics: [summary]
- Segment insights: [notable differences]
Decision: [Winner/Loser/Inconclusive]
Action: [What we're doing]
Learnings:
[What we learned, what to test next]
```
### Building a Learning Repository
- Central location for all tests
- Searchable by page, element, outcome
- Prevents re-running failed tests
- Builds institutional knowledge
---
## Output Format
### Test Plan Document
```
# A/B Test: [Name]
## Hypothesis
[Full hypothesis using framework]
## Test Design
- Type: A/B / A/B/n / MVT
- Duration: X weeks
- Sample size: X per variant
- Traffic allocation: 50/50
## Variants
[Control and variant descriptions with visuals]
## Metrics
- Primary: [metric and definition]
- Secondary: [list]
- Guardrails: [list]
## Implementation
- Method: Client-side / Server-side
- Tool: [Tool name]
- Dev requirements: [If any]
## Analysis Plan
- Success criteria: [What constitutes a win]
- Segment analysis: [Planned segments]
```
### Results Summary
When the test is complete
### Recommendations
Next steps based on results
**For templates**: See [references/test-templates.md](references/test-templates.md)
---
@@ -476,19 +233,16 @@ Next steps based on results
- Testing too small a change (undetectable)
- Testing too many things (can't isolate)
- No clear hypothesis
- Wrong audience
### Execution
- Stopping early
- Changing things mid-test
- Not checking implementation
- Uneven traffic allocation
### Analysis
- Ignoring confidence intervals
- Cherry-picking segments
- Over-interpreting inconclusive results
- Not considering practical significance
---