feat: integrate PR #28 and #29 (multi-agent brainstorming, design orchestration)

sck_0
2026-01-25 17:53:35 +01:00
parent af57b96721
commit ae3d038711
6 changed files with 317 additions and 267 deletions

@@ -1,6 +1,32 @@
---
name: ab-test-setup
description: Structured guide for setting up A/B tests with mandatory gates for hypothesis, metrics, and execution readiness.
---
# A/B Test Setup
## 1⃣ Purpose & Scope
Ensure every A/B test is **valid, rigorous, and safe** before a single line of code is written.
- Prevents "peeking" (acting on interim results before the planned sample size is reached)
- Enforces statistical power
- Blocks invalid hypotheses
---
## 2⃣ Pre-Requisites
You must have:
- A clear user problem
- Access to an analytics source
- A rough estimate of traffic volume
#### Hypothesis Quality Checklist
A valid hypothesis includes:
- Observation or evidence
- Single, specific change
- Directional expectation
@@ -39,6 +65,7 @@ Explicitly list assumptions about:
- External factors (seasonality, campaigns, releases)
If assumptions are weak or violated:
- Warn the user
- Recommend delaying or redesigning the test
@@ -60,16 +87,19 @@ Default to **A/B** unless there is a clear reason otherwise.
### 6⃣ Metrics Definition
#### Primary Metric (Mandatory)
- Single metric used to evaluate success
- Directly tied to the hypothesis
- Pre-defined and frozen before launch
#### Secondary Metrics
- Provide context
- Explain _why_ results occurred
- Must not override the primary metric
#### Guardrail Metrics
- Metrics that must not degrade
- Used to prevent harmful wins
- Trigger test stop if significantly negative
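
One way to picture the metric rules above is a frozen metric plan plus a guardrail check. This is a minimal sketch, not part of the skill itself: the names `MetricPlan` and `guardrail_degraded` are hypothetical, the guardrail is assumed to be a "higher is better" rate, and the one-sided two-proportion z-test is one possible choice of significance test.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class MetricPlan:
    """Hypothetical container for the metrics fixed before launch."""
    primary: str
    secondary: tuple = ()
    guardrails: tuple = ()


def guardrail_degraded(control_ok, control_n, variant_ok, variant_n, alpha=0.05):
    """One-sided two-proportion z-test: is the variant's rate significantly lower?

    Assumes the guardrail is a success rate where higher is better.
    """
    p_control = control_ok / control_n
    p_variant = variant_ok / variant_n
    pooled = (control_ok + variant_ok) / (control_n + variant_n)
    se = sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / variant_n))
    if se == 0:
        return False  # degenerate case (all successes or all failures): no evidence
    z = (p_variant - p_control) / se
    p_value = NormalDist().cdf(z)  # probability of a z this low if there is no real drop
    return p_value < alpha
```

A plan built as `MetricPlan(primary="checkout_conversion", guardrails=("page_load_success",))` (metric names invented for illustration) raises an error if code later tries to reassign `primary`, which mirrors the "frozen before launch" rule.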
@@ -79,12 +109,14 @@ Default to **A/B** unless there is a clear reason otherwise.
### 7⃣ Sample Size & Duration
Define upfront:
- Baseline rate
- Minimum detectable effect (MDE)
- Significance level (typically α = 0.05, i.e. 95% confidence)
- Statistical power (typically 80%)
Estimate:
- Required sample size per variant
- Expected test duration
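
As a rough sketch of how the inputs above combine, the following Python applies the common normal-approximation formula for comparing two proportions. The function names, the convention that the MDE is an absolute difference, and the even traffic split across variants are assumptions made for illustration; a dedicated power-analysis tool may give slightly different numbers.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-sided two-proportion test.

    mde_abs is the minimum detectable effect as an absolute difference,
    e.g. 0.01 for a change from 10% to 11%.
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    if mde_abs == 0:
        raise ValueError("MDE must be non-zero")
    if not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("rates must lie strictly between 0 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


def estimated_duration_days(n_per_variant, daily_users, variants=2):
    """Days needed if daily_users are split evenly across all variants."""
    return math.ceil(n_per_variant * variants / daily_users)
```

With a 10% baseline and a 1-point absolute MDE, this lands on the order of 15,000 users per variant, which shows why small effects on low-traffic pages can make a test impractical.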
@@ -112,10 +144,12 @@ If any item is missing, stop and resolve it.
### During the Test
**DO:**
- Monitor technical health
- Document external factors
**DO NOT:**
- Stop early due to “good-looking” results
- Change variants mid-test
- Add new traffic sources
@@ -136,12 +170,12 @@ When interpreting results:
### Interpretation Outcomes
| Result | Action |
| -------------------- | -------------------------------------- |
| Significant positive | Consider rollout |
| Significant negative | Reject variant, document learning |
| Inconclusive | Consider more traffic or bolder change |
| Guardrail failure | Do not ship, even if primary wins |
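
The table can also be read as a small decision function. The sketch below is illustrative only: `Outcome` and `classify` are hypothetical names, and the inputs (`primary_lift`, `primary_significant`, `guardrail_failed`) are assumed to come from whatever analysis was run. The guardrail is checked first so that a primary-metric win can never override it.

```python
from enum import Enum


class Outcome(Enum):
    # Values are the actions from the table above.
    SIGNIFICANT_POSITIVE = "Consider rollout"
    SIGNIFICANT_NEGATIVE = "Reject variant, document learning"
    INCONCLUSIVE = "Consider more traffic or bolder change"
    GUARDRAIL_FAILURE = "Do not ship, even if primary wins"


def classify(primary_lift, primary_significant, guardrail_failed):
    """Map analysis results to one row of the interpretation table."""
    if guardrail_failed:
        return Outcome.GUARDRAIL_FAILURE  # takes precedence over any primary result
    if not primary_significant:
        return Outcome.INCONCLUSIVE
    if primary_lift > 0:
        return Outcome.SIGNIFICANT_POSITIVE
    return Outcome.SIGNIFICANT_NEGATIVE
```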
---
@@ -150,6 +184,7 @@ When interpreting results:
### Test Record (Mandatory)
Document:
- Hypothesis
- Variants
- Metrics
@@ -166,6 +201,7 @@ Store records in a shared, searchable location to avoid repeated failures.
## Refusal Conditions (Safety)
Refuse to proceed if:
- Baseline rate is unknown and cannot be estimated
- Traffic is insufficient to detect the MDE
- Primary metric is undefined
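
The conditions listed above could be checked mechanically before any test is planned. The following is a hedged sketch covering only those three conditions; the function name `refusal_reasons` and its parameters are hypothetical, and `None` is assumed to mean "unknown".

```python
from typing import List, Optional


def refusal_reasons(baseline_rate: Optional[float],
                    required_per_variant: Optional[int],
                    reachable_per_variant: int,
                    primary_metric: Optional[str]) -> List[str]:
    """Return every refusal condition that applies; an empty list means proceed."""
    reasons = []
    if baseline_rate is None:
        reasons.append("Baseline rate is unknown and cannot be estimated")
    if required_per_variant is not None and reachable_per_variant < required_per_variant:
        reasons.append("Traffic is insufficient to detect the MDE")
    if not primary_metric:
        reasons.append("Primary metric is undefined")
    return reasons
```

Returning all applicable reasons, rather than stopping at the first, lets the user resolve every blocker in one pass.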