Refactor remaining skills for progressive disclosure

Phase 2 refactoring of skills >500 lines and medium-sized skills:

- paid-ads: 553 → 297 lines
  - Extract ad-copy-templates.md, audience-targeting.md, platform-setup-checklists.md

- analytics-tracking: 541 → 292 lines
  - Extract ga4-implementation.md, gtm-implementation.md, event-library.md

- ab-test-setup: 510 → 264 lines
  - Extract test-templates.md, sample-size-guide.md

- copywriting: 458 → 248 lines
  - Extract copy-frameworks.md (headline formulas, section types)

- page-cro: 336 → 180 lines
  - Extract experiments.md (experiment ideas by page type)

- onboarding-cro: 435 → 218 lines
  - Extract experiments.md (onboarding experiment ideas)

All skills now use progressive disclosure with references/ folders,
keeping SKILL.md files focused on core workflow while detailed
content is available when needed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Corey Haines
2026-01-26 16:59:23 -08:00
parent c29ee7e6db
commit 98e74b79d7
17 changed files with 3356 additions and 1721 deletions


@@ -14,20 +14,9 @@ If `.claude/product-marketing-context.md` exists, read it before asking question
Before designing a test, understand:
1. **Test Context**
- What are you trying to improve?
- What change are you considering?
- What made you want to test this?
2. **Current State**
- Baseline conversion rate?
- Current traffic volume?
- Any historical test data?
3. **Constraints**
- Technical implementation complexity?
- Timeline requirements?
- Tools available?
1. **Test Context** - What are you trying to improve? What change are you considering?
2. **Current State** - Baseline conversion rate? Current traffic volume?
3. **Constraints** - Technical complexity? Timeline? Tools available?
---
@@ -41,7 +30,6 @@ Before designing a test, understand:
### 2. Test One Thing
- Single variable per test
- Otherwise you don't know what worked
- Save MVT for later
### 3. Statistical Rigor
- Pre-determine sample size
@@ -67,81 +55,41 @@ for [audience].
We'll know this is true when [metrics].
```
### Examples
### Example
**Weak hypothesis:**
"Changing the button color might increase clicks."
**Weak**: "Changing the button color might increase clicks."
**Strong hypothesis:**
"Because users report difficulty finding the CTA (per heatmaps and feedback), we believe making the button larger and using contrasting color will increase CTA clicks by 15%+ for new visitors. We'll measure click-through rate from page view to signup start."
### Good Hypotheses Include
- **Observation**: What prompted this idea
- **Change**: Specific modification
- **Effect**: Expected outcome and direction
- **Audience**: Who this applies to
- **Metric**: How you'll measure success
**Strong**: "Because users report difficulty finding the CTA (per heatmaps and feedback), we believe making the button larger and using contrasting color will increase CTA clicks by 15%+ for new visitors. We'll measure click-through rate from page view to signup start."
---
## Test Types
### A/B Test (Split Test)
- Two versions: Control (A) vs. Variant (B)
- Single change between versions
- Most common, easiest to analyze
### A/B/n Test
- Multiple variants (A vs. B vs. C...)
- Requires more traffic
- Good for testing several options
### Multivariate Test (MVT)
- Multiple changes in combinations
- Tests interactions between changes
- Requires significantly more traffic
- Complex analysis
### Split URL Test
- Different URLs for variants
- Good for major page changes
- Easier implementation sometimes
| Type | Description | Traffic Needed |
|------|-------------|----------------|
| A/B | Two versions, single change | Moderate |
| A/B/n | Multiple variants | Higher |
| MVT | Multiple changes in combinations | Very high |
| Split URL | Different URLs for variants | Moderate |
---
## Sample Size Calculation
### Inputs Needed
1. **Baseline conversion rate**: Your current rate
2. **Minimum detectable effect (MDE)**: Smallest change worth detecting
3. **Statistical significance level**: Usually 95%
4. **Statistical power**: Usually 80%
## Sample Size
### Quick Reference
| Baseline Rate | 10% Lift | 20% Lift | 50% Lift |
|---------------|----------|----------|----------|
| Baseline | 10% Lift | 20% Lift | 50% Lift |
|----------|----------|----------|----------|
| 1% | 150k/variant | 39k/variant | 6k/variant |
| 3% | 47k/variant | 12k/variant | 2k/variant |
| 5% | 27k/variant | 7k/variant | 1.2k/variant |
| 10% | 12k/variant | 3k/variant | 550/variant |
### Formula Resources
- Evan Miller's calculator: https://www.evanmiller.org/ab-testing/sample-size.html
- Optimizely's calculator: https://www.optimizely.com/sample-size-calculator/
**Calculators:**
- [Evan Miller's](https://www.evanmiller.org/ab-testing/sample-size.html)
- [Optimizely's](https://www.optimizely.com/sample-size-calculator/)
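For intuition about where the table's numbers come from, here is a minimal sketch of the standard two-proportion sample-size approximation (the calculators above use their own variants of this math, so results will differ slightly):

```python
# Minimal sketch: visitors per variant needed to detect a relative lift over a
# baseline conversion rate, using the two-proportion z-test approximation.
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    p1 = baseline
    p2 = baseline * (1 + relative_lift)            # minimum detectable effect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 95% significance (two-sided)
    z_power = NormalDist().inv_cdf(power)          # 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

# 3% baseline, 20% relative lift -> roughly 12-14k visitors per variant
print(sample_size_per_variant(0.03, 0.20))
```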
### Test Duration
```
Duration (days) = (Sample size needed per variant × Number of variants) ÷ Daily traffic to test page
```
Minimum: 1-2 business cycles (usually 1-2 weeks)
Maximum: Avoid running too long (novelty effects, external factors)
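A quick worked example of the duration formula, with hypothetical traffic numbers:

```python
# Hypothetical inputs: 12,000 visitors needed per variant, 2 variants,
# 2,000 daily visitors to the test page.
sample_per_variant = 12_000
variants = 2
daily_traffic = 2_000

duration_days = sample_per_variant * variants / daily_traffic  # = 12 days
print(f"Run for about {duration_days:.0f} days; round up to full weeks")
```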
**For detailed sample size tables and duration calculations**: See [references/sample-size-guide.md](references/sample-size-guide.md)
---
@@ -155,228 +103,106 @@ Maximum: Avoid running too long (novelty effects, external factors)
### Secondary Metrics
- Support primary metric interpretation
- Explain why/how the change worked
- Help understand user behavior
### Guardrail Metrics
- Things that shouldn't get worse
- Revenue, retention, satisfaction
- Stop test if significantly negative
### Metric Examples by Test Type
**Homepage CTA test:**
- Primary: CTA click-through rate
- Secondary: Time to click, scroll depth
- Guardrail: Bounce rate, downstream conversion
**Pricing page test:**
- Primary: Plan selection rate
- Secondary: Time on page, plan distribution
- Guardrail: Support tickets, refund rate
**Signup flow test:**
- Primary: Signup completion rate
- Secondary: Field-level completion, time to complete
- Guardrail: User activation rate (post-signup quality)
### Example: Pricing Page Test
- **Primary**: Plan selection rate
- **Secondary**: Time on page, plan distribution
- **Guardrail**: Support tickets, refund rate
---
## Designing Variants
### Control (A)
- Current experience, unchanged
- Don't modify during test
### What to Vary
### Variant (B+)
| Category | Examples |
|----------|----------|
| Headlines/Copy | Message angle, value prop, specificity, tone |
| Visual Design | Layout, color, images, hierarchy |
| CTA | Button copy, size, placement, number |
| Content | Information included, order, amount, social proof |
**Best practices:**
### Best Practices
- Single, meaningful change
- Bold enough to make a difference
- True to the hypothesis
**What to vary:**
Headlines/Copy:
- Message angle
- Value proposition
- Specificity level
- Tone/voice
Visual Design:
- Layout structure
- Color and contrast
- Image selection
- Visual hierarchy
CTA:
- Button copy
- Size/prominence
- Placement
- Number of CTAs
Content:
- Information included
- Order of information
- Amount of content
- Social proof type
### Documenting Variants
```
Control (A):
- Screenshot
- Description of current state
Variant (B):
- Screenshot or mockup
- Specific changes made
- Hypothesis for why this will win
```
---
## Traffic Allocation
### Standard Split
- 50/50 for A/B test
- Equal split for multiple variants
| Approach | Split | When to Use |
|----------|-------|-------------|
| Standard | 50/50 | Default for A/B |
| Conservative | 90/10, 80/20 | Limit risk of bad variant |
| Ramping | Start small, increase | Technical risk mitigation |
### Conservative Rollout
- 90/10 or 80/20 initially
- Limits risk of bad variant
- Longer to reach significance
### Ramping
- Start small, increase over time
- Good for technical risk mitigation
- Most tools support this
### Considerations
**Considerations:**
- Consistency: Users see same variant on return (see the sketch below)
- Segment sizes: Ensure segments are large enough
- Time of day/week: Balanced exposure
- Balanced exposure across time of day/week
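A minimal sketch of deterministic bucketing, which is roughly how testing tools keep returning users in the same variant; the helper below is illustrative, not any specific tool's API:

```python
import hashlib

def assign_variant(user_id: str, experiment: str, control_share: float = 0.5) -> str:
    """Hash user + experiment into a stable value in [0, 1) and bucket it."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000
    return "control" if bucket < control_share else "variant"

# Same user, same experiment -> same bucket on every visit (consistency).
# Set control_share=0.9 for a conservative 90/10 rollout.
print(assign_variant("user-123", "homepage-cta-test"))
```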
---
## Implementation Approaches
## Implementation
### Client-Side Testing
**Tools**: PostHog, Optimizely, VWO, custom
**How it works**:
### Client-Side
- JavaScript modifies page after load
- Quick to implement
- Can cause flicker
- Quick to implement, can cause flicker
- Tools: PostHog, Optimizely, VWO
**Best for**:
- Marketing pages
- Copy/visual changes
- Quick iteration
### Server-Side Testing
**Tools**: PostHog, LaunchDarkly, Split, custom
**How it works**:
- Variant determined before page renders
- No flicker
- Requires development work
**Best for**:
- Product features
- Complex changes
- Performance-sensitive pages
### Feature Flags
- Binary on/off (not true A/B)
- Good for rollouts
- Can convert to A/B with percentage split
### Server-Side
- Variant determined before render
- No flicker, requires dev work
- Tools: PostHog, LaunchDarkly, Split
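As an illustrative sketch of the server-side flow (placeholder names, not PostHog/LaunchDarkly/Split APIs), the variant is decided before the page is rendered, so the user never sees the control flash:

```python
import hashlib

def server_side_variant(user_id: str, experiment: str) -> str:
    # Same hash-bucket idea as the traffic allocation sketch above.
    bucket = int(hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()[:8], 16) / 0x100000000
    return "control" if bucket < 0.5 else "variant"

def choose_template(user_id: str) -> str:
    # Variant is known before rendering -> serve the right template directly (no flicker).
    variant = server_side_variant(user_id, "pricing-layout-test")
    return "pricing-control.html" if variant == "control" else "pricing-variant.html"

print(choose_template("user-123"))
```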
---
## Running the Test
### Pre-Launch Checklist
- [ ] Hypothesis documented
- [ ] Primary metric defined
- [ ] Sample size calculated
- [ ] Test duration estimated
- [ ] Variants implemented correctly
- [ ] Tracking verified
- [ ] QA completed on all variants
- [ ] Stakeholders informed
### During the Test
**DO:**
- Monitor for technical issues
- Check segment quality
- Document any external factors
- Document external factors
**DON'T:**
- Peek at results and stop early
- Make changes to variants
- Add traffic from new sources
- End early because you "know" the answer
### Peeking Problem
Looking at results before reaching sample size and stopping when you see significance leads to:
- False positives
- Inflated effect sizes
- Wrong decisions
**Solutions:**
- Pre-commit to sample size and stick to it
- Use sequential testing if you must peek
- Trust the process
### The Peeking Problem
Looking at results before reaching sample size and stopping early leads to false positives and wrong decisions. Pre-commit to sample size and trust the process.
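A small simulation makes the point concrete: both variants below are identical (an A/A test), yet stopping at the first "significant" peek pushes the false-positive rate well above the nominal 5%. This is an illustrative sketch with made-up traffic numbers:

```python
import math
import random
from statistics import NormalDist

def is_significant(conv_a, conv_b, n_per_variant, alpha=0.05):
    p_a, p_b = conv_a / n_per_variant, conv_b / n_per_variant
    pooled = (conv_a + conv_b) / (2 * n_per_variant)
    if pooled in (0.0, 1.0):
        return False
    se = math.sqrt(pooled * (1 - pooled) * 2 / n_per_variant)
    return 2 * (1 - NormalDist().cdf(abs(p_b - p_a) / se)) < alpha

random.seed(1)
runs, peeks, batch, true_rate = 500, 20, 500, 0.05
false_positives = 0
for _ in range(runs):
    conv_a = conv_b = n = 0
    for _ in range(peeks):                      # peek after every batch of visitors
        conv_a += sum(random.random() < true_rate for _ in range(batch))
        conv_b += sum(random.random() < true_rate for _ in range(batch))
        n += batch
        if is_significant(conv_a, conv_b, n):   # stop at the first "win"
            false_positives += 1
            break
print(f"False-positive rate with peeking: {false_positives / runs:.0%}")
```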
---
## Analyzing Results
### Statistical Significance
- 95% confidence = p-value < 0.05
- Means: <5% chance result is random
- Means <5% chance result is random
- Not a guarantee—just a threshold
### Practical Significance
### Analysis Checklist
Statistical ≠ Practical
- Is the effect size meaningful for business?
- Is it worth the implementation cost?
- Is it sustainable over time?
### What to Look At
1. **Did you reach sample size?**
- If not, result is preliminary
2. **Is it statistically significant?**
- Check confidence intervals
- Check p-value
3. **Is the effect size meaningful?**
- Compare to your MDE
- Project business impact
4. **Are secondary metrics consistent?**
- Do they support the primary?
- Any unexpected effects?
5. **Any guardrail concerns?**
- Did anything get worse?
- Long-term risks?
6. **Segment differences?**
- Mobile vs. desktop?
- New vs. returning?
- Traffic source?
1. **Reach sample size?** If not, result is preliminary
2. **Statistically significant?** Check confidence intervals (see the sketch below)
3. **Effect size meaningful?** Compare to MDE, project impact
4. **Secondary metrics consistent?** Support the primary?
5. **Guardrail concerns?** Anything get worse?
6. **Segment differences?** Mobile vs. desktop? New vs. returning?
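For illustration, a minimal sketch of the significance check in item 2: a two-proportion z-test plus a confidence interval for the absolute lift. Your testing tool's stats engine is the source of truth; the numbers below are hypothetical.

```python
import math
from statistics import NormalDist

def ab_result(conv_a, n_a, conv_b, n_b, alpha=0.05):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    z = (p_b - p_a) / math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))               # two-sided
    se_diff = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se_diff
    return p_value, (p_b - p_a - margin, p_b - p_a + margin)

# Hypothetical results: 4.0% vs 4.67% conversion on 12k visitors per variant
p, (lo, hi) = ab_result(conv_a=480, n_a=12_000, conv_b=560, n_b=12_000)
print(f"p-value = {p:.3f}; 95% CI for absolute lift: {lo:.4f} to {hi:.4f}")
```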
### Interpreting Results
@@ -389,84 +215,15 @@ Statistical ≠ Practical
---
## Documenting and Learning
## Documentation
### Test Documentation
Document every test with:
- Hypothesis
- Variants (with screenshots)
- Results (sample, metrics, significance)
- Decision and learnings
```
Test Name: [Name]
Test ID: [ID in testing tool]
Dates: [Start] - [End]
Owner: [Name]
Hypothesis:
[Full hypothesis statement]
Variants:
- Control: [Description + screenshot]
- Variant: [Description + screenshot]
Results:
- Sample size: [achieved vs. target]
- Primary metric: [control] vs. [variant] ([% change], [confidence])
- Secondary metrics: [summary]
- Segment insights: [notable differences]
Decision: [Winner/Loser/Inconclusive]
Action: [What we're doing]
Learnings:
[What we learned, what to test next]
```
### Building a Learning Repository
- Central location for all tests
- Searchable by page, element, outcome
- Prevents re-running failed tests
- Builds institutional knowledge
---
## Output Format
### Test Plan Document
```
# A/B Test: [Name]
## Hypothesis
[Full hypothesis using framework]
## Test Design
- Type: A/B / A/B/n / MVT
- Duration: X weeks
- Sample size: X per variant
- Traffic allocation: 50/50
## Variants
[Control and variant descriptions with visuals]
## Metrics
- Primary: [metric and definition]
- Secondary: [list]
- Guardrails: [list]
## Implementation
- Method: Client-side / Server-side
- Tool: [Tool name]
- Dev requirements: [If any]
## Analysis Plan
- Success criteria: [What constitutes a win]
- Segment analysis: [Planned segments]
```
### Results Summary
When the test is complete
### Recommendations
Next steps based on results
**For templates**: See [references/test-templates.md](references/test-templates.md)
---
@@ -476,19 +233,16 @@ Next steps based on results
- Testing too small a change (undetectable)
- Testing too many things (can't isolate)
- No clear hypothesis
- Wrong audience
### Execution
- Stopping early
- Changing things mid-test
- Not checking implementation
- Uneven traffic allocation
### Analysis
- Ignoring confidence intervals
- Cherry-picking segments
- Over-interpreting inconclusive results
- Not considering practical significance
---