app-store-optimization/skills/ab-test-setup/SKILL.md


#### Hypothesis Quality Checklist
A valid hypothesis includes:
- Observation or evidence
- Single, specific change
- Directional expectation
- Defined audience
- Measurable success criteria

---

### 3️⃣ Hypothesis Lock (Hard Gate)

Before designing variants or metrics, you MUST:

- Present the **final hypothesis**
- Specify:
  - Target audience
  - Primary metric
  - Expected direction of effect
  - Minimum Detectable Effect (MDE)

Ask explicitly:

> “Is this the final hypothesis we are committing to for this test?”

**Do NOT proceed until confirmed.**
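
The lock can be made explicit as a small immutable record. The sketch below is illustrative only: `LockedHypothesis`, its field names, and `can_design_variants` are hypothetical and not part of any existing tool. It assumes the MDE is expressed as a relative lift.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LockedHypothesis:
    """Hypothetical record of the fields the lock step asks for."""
    statement: str
    target_audience: str
    primary_metric: str
    expected_direction: str   # "increase" or "decrease"
    mde_relative: float       # assumption: relative lift, e.g. 0.10 for +10%
    confirmed: bool = False   # set only after the user explicitly confirms


def can_design_variants(h: LockedHypothesis) -> bool:
    """Return True only when every lock field is filled in and confirmed."""
    fields_present = all([
        h.statement.strip(),
        h.target_audience.strip(),
        h.primary_metric.strip(),
        h.expected_direction in ("increase", "decrease"),
        h.mde_relative > 0,
    ])
    return fields_present and h.confirmed


draft = LockedHypothesis(
    statement="A shorter subtitle increases the install rate",
    target_audience="New visitors to the product page",
    primary_metric="install_rate",
    expected_direction="increase",
    mde_relative=0.10,
)
print(can_design_variants(draft))                           # False: not yet confirmed
print(can_design_variants(replace(draft, confirmed=True)))  # True
```

Freezing the dataclass means a confirmed hypothesis cannot be edited in place; any change produces a new record that has to be confirmed again.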

---

### 4️⃣ Assumptions & Validity Check (Mandatory)

Explicitly list assumptions about:

- Traffic stability
- User independence
- Metric reliability
- Randomization quality
- External factors (seasonality, campaigns, releases)

If assumptions are weak or violated:
- Warn the user
- Recommend delaying or redesigning the test
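
One way to probe the randomization-quality assumption is a sample ratio mismatch check: compare the observed traffic split with the intended one. The sketch below is a rough illustration, not a prescribed method. It uses a normal approximation to the binomial and assumes large counts; the function name `srm_p_value` and the 0.001 alert threshold are assumptions chosen for this example.

```python
from statistics import NormalDist


def srm_p_value(n_control: int, n_variant: int,
                expected_share_control: float = 0.5) -> float:
    """Two-sided p-value for 'the observed split matches the intended split'.

    Normal approximation to the binomial; only reasonable for large counts.
    """
    total = n_control + n_variant
    expected = total * expected_share_control
    std = (total * expected_share_control * (1 - expected_share_control)) ** 0.5
    z = (n_control - expected) / std
    return 2 * (1 - NormalDist().cdf(abs(z)))


SRM_THRESHOLD = 0.001  # assumption: a strict cutoff so that alerts stay rare

p = srm_p_value(5000, 5400)
if p < SRM_THRESHOLD:
    print("Possible sample ratio mismatch: check assignment and tracking")
```

A very small p-value suggests the split itself is broken, in which case the warning and redesign steps above apply before any metric is interpreted.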

---

### 5️⃣ Test Type Selection

Choose the simplest valid test:

- **A/B Test** – single change, two variants
- **A/B/n Test** – multiple variants, higher traffic required
- **Multivariate Test (MVT)** – interaction effects, very high traffic
- **Split URL Test** – major structural changes

Default to **A/B** unless there is a clear reason otherwise.

---

### 6️⃣ Metrics Definition

#### Primary Metric (Mandatory)
- Single metric used to evaluate success
- Directly tied to the hypothesis
- Pre-defined and frozen before launch

#### Secondary Metrics
- Provide context
- Explain *why* results occurred
- Must not override the primary metric

#### Guardrail Metrics
- Metrics that must not degrade
- Used to prevent harmful wins
- Stop the test if any guardrail degrades significantly

---

### 7️⃣ Sample Size & Duration

Define upfront:
- Baseline rate
- MDE
- Significance level (typically α = 0.05, i.e. 95% confidence)
- Statistical power (typically 80%)

Estimate:
- Required sample size per variant
- Expected test duration

**Do NOT proceed without a realistic sample size estimate.**
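
As a rough illustration of the estimate, the sketch below applies the standard normal-approximation formula for comparing two proportions with a two-sided test. It assumes the primary metric is a conversion rate, the MDE is a relative lift, and traffic is split evenly; the function names and the example traffic figure are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline: float, mde_relative: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate visitors needed per variant for a two-sided test."""
    p1 = baseline
    p2 = baseline * (1 + mde_relative)
    if not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("baseline and implied variant rate must be in (0, 1)")
    if p1 == p2:
        raise ValueError("MDE must be non-zero")
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for the power
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


def duration_days(n_per_variant: int, variants: int, daily_visitors: int) -> int:
    """Days needed if all daily visitors are split evenly across variants."""
    return ceil(n_per_variant * variants / daily_visitors)


n = sample_size_per_variant(baseline=0.05, mde_relative=0.10)
print(n)                                                    # on the order of 31,000
print(duration_days(n, variants=2, daily_visitors=4000))    # test length in days
```

With a 5% baseline and a 10% relative MDE, the requirement is on the order of 31,000 visitors per variant, which shows how quickly small effects become expensive to detect. In practice, rounding the duration up to whole weeks helps cover weekly seasonality.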

---

### 8️⃣ Execution Readiness Gate (Hard Stop)

You may proceed to implementation **only if all are true**:

- Hypothesis is locked
- Primary metric is frozen
- Sample size is calculated
- Test duration is defined
- Guardrails are set
- Tracking is verified

If any item is missing, stop and resolve it.
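
The gate can be expressed as a simple checklist function. This is a minimal sketch; the item keys and the `missing_readiness_items` helper are hypothetical names that mirror the list above.

```python
READINESS_ITEMS = (
    "hypothesis_locked",
    "primary_metric_frozen",
    "sample_size_calculated",
    "test_duration_defined",
    "guardrails_set",
    "tracking_verified",
)


def missing_readiness_items(status: dict[str, bool]) -> list[str]:
    """Return the gate items that are absent or False, in checklist order."""
    return [item for item in READINESS_ITEMS if not status.get(item, False)]


status = {item: True for item in READINESS_ITEMS}
status["tracking_verified"] = False

blockers = missing_readiness_items(status)
if blockers:
    print("Stop and resolve:", ", ".join(blockers))  # Stop and resolve: tracking_verified
```

Treating a missing key the same as `False` keeps the gate strict: an item nobody reported on counts as unresolved.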

---

## Running the Test

### During the Test

**DO:**
- Monitor technical health
- Document external factors

**DO NOT:**
- Stop early due to “good-looking” results
- Change variants mid-test
- Add new traffic sources
- Redefine success criteria

---

## Analyzing Results

### Analysis Discipline

When interpreting results:

- Do NOT generalize beyond the tested population
- Do NOT claim causality beyond the tested change
- Do NOT override guardrail failures
- Separate statistical significance from business judgment

### Interpretation Outcomes

| Result | Action |
|------|-------|
| Significant positive | Consider rollout |
| Significant negative | Reject variant, document learning |
| Inconclusive | Consider more traffic or a bolder change |
| Guardrail failure | Do not ship, even if primary wins |
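
The table can be read as a decision rule in which the guardrail check comes first. The sketch below pairs it with a pooled two-proportion z-test, assuming the primary metric is a conversion rate where higher is better. The function names and the short action labels are hypothetical, and the z-test is one common choice rather than the only valid analysis.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def recommend_action(p_value: float, observed_lift: float,
                     guardrail_failed: bool, alpha: float = 0.05) -> str:
    """Map a result to a row of the table; guardrails override everything."""
    if guardrail_failed:
        return "do_not_ship"
    if p_value >= alpha:
        return "inconclusive"
    # Assumption: a positive lift is the desired direction.
    return "consider_rollout" if observed_lift > 0 else "reject_variant"


p = two_proportion_p_value(conv_a=500, n_a=10_000, conv_b=600, n_b=10_000)
lift = 600 / 10_000 - 500 / 10_000
print(recommend_action(p, lift, guardrail_failed=False))  # consider_rollout
print(recommend_action(p, lift, guardrail_failed=True))   # do_not_ship
```

The output is a recommendation, not a verdict: "consider rollout" still leaves room for business judgment, as the Analysis Discipline list requires.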

---

## Documentation & Learning

### Test Record (Mandatory)

Document:
- Hypothesis
- Variants
- Metrics
- Planned vs. achieved sample size
- Results
- Decision
- Learnings
- Follow-up ideas

Store records in a shared, searchable location to avoid repeated failures.
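
A record like this can be kept as structured data so it stays searchable. The sketch below is one possible shape: `ExperimentRecord` and its field names are hypothetical, and JSON is only an assumed storage format. All values in the example are placeholders, not real results.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ExperimentRecord:
    """Hypothetical test record covering the fields listed above."""
    hypothesis: str
    variants: list[str]
    metrics: dict[str, str]          # role -> metric name, e.g. "primary"
    planned_sample_size: int
    achieved_sample_size: int
    results: str
    decision: str
    learnings: str
    follow_up_ideas: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the record for storage in a shared location."""
        return json.dumps(asdict(self), indent=2)


record = ExperimentRecord(
    hypothesis="<locked hypothesis>",
    variants=["control", "variant_b"],
    metrics={"primary": "<primary metric>", "guardrail": "<guardrail metric>"},
    planned_sample_size=0,   # placeholder
    achieved_sample_size=0,  # placeholder
    results="<summary of results>",
    decision="<decision>",
    learnings="<learnings>",
)
print(record.to_json())
```

Keeping planned and achieved sample sizes as separate fields makes underpowered tests easy to spot when reviewing past records.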

---

## Refusal Conditions (Safety)

Refuse to proceed if:
- Baseline rate is unknown and cannot be estimated
- Traffic is insufficient to detect the MDE within a realistic test duration
- Primary metric is undefined
- Multiple variables are changed without proper design
- Hypothesis cannot be clearly stated

Explain why and recommend next steps.

---

## Key Principles (Non-Negotiable)

- One hypothesis per test
- One primary metric
- Commit before launch
- No peeking at results before the planned sample size is reached
- Learning over winning
- Statistical rigor first

---

## Final Reminder

A/B testing is not about proving ideas right.
It is about **learning the truth with confidence**.

If you feel tempted to rush, simplify, or “just try it” —
that is the signal to **slow down and re-check the design**.