Payments Experimentation (Operator Field Manual)
Most merchants flip fraud rules live without testing, then scramble when good orders die. Shadow first, then enforce. Treat payment rules like code deploys: stage, monitor, promote.
Last verified: Dec 2025. Experimentation frameworks evolve; adapt to your stack.
What Matters (5 bullets)
- Shadow mode first. Log decisions, do not block. Measure false positives before going live.
- Pick a single success metric per test. Auth lift, fraud rate, CX impact. Not all three at once.
- Run by cohort. Method, BIN, country, device, CP vs CNP. Never test on all traffic.
- Set stop rules before launch. Max false positive %, max revenue at risk. Honor them.
- Feedback loops lag. Use dispute alerts and issuer fraud reports (Visa TC40, Mastercard SAFE) to shorten the chargeback feedback delay.
Shadow Mode: The Foundation
Shadow mode runs your new rule in parallel without enforcing it. Every transaction gets two decisions: actual (what happened) and shadow (what would have happened). A minimal logging sketch follows the checklist below.
How to Implement Shadow Mode
- Log both decisions - Actual outcome + shadow rule outcome
- Tag transactions - Mark "would-have-blocked" for tracking
- Don't affect the customer - Shadow decisions are invisible
- Track over time - 7-14 days minimum
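A minimal sketch of the logging side, assuming hypothetical `live_rule` and `shadow_rule` callables and a line-oriented `log` sink; none of these names come from a specific vendor API:

```python
import json
import time

def decide(txn, live_rule, shadow_rule, log):
    """Run the live rule and the candidate rule side by side.

    Only the live rule's decision reaches the customer; the shadow
    decision is logged for later false-positive analysis.
    """
    actual = live_rule(txn)    # enforced: "allow" or "block"
    shadow = shadow_rule(txn)  # candidate: evaluated, never enforced

    log.write(json.dumps({
        "txn_id": txn["id"],
        "ts": time.time(),
        "actual": actual,
        "shadow": shadow,
        "would_have_blocked": actual == "allow" and shadow == "block",
    }) + "\n")

    return actual  # the customer only ever sees the enforced decision
```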
What to Measure in Shadow
| Metric | What It Tells You |
|---|---|
| Would-block rate | How aggressive is the new rule |
| False positive rate | Good orders that would have been blocked |
| True positive rate | Bad orders correctly caught |
| Coverage | What % of fraud would this catch |
Shadow Decision Matrix
| Actual Outcome | Shadow Decision | Interpretation |
|---|---|---|
| Approved, no dispute | Would block | False positive (bad) |
| Approved, disputed | Would block | True positive (good) |
| Approved, no dispute | Would allow | Correct allow |
| Approved, disputed | Would allow | Missed fraud |

The matrix covers approved transactions only: orders the live rule blocked never produce an outcome label, so the shadow rule can't be scored against them.
Experiment Design
Test Types
| Test Type | What You're Testing | Key Metric |
|---|---|---|
| Fraud rule tightening | New velocity limit | Block rate vs fraud rate |
| Fraud rule loosening | Relaxing a rule | Auth rate vs fraud increase |
| 3DS threshold | When to challenge | Conversion vs liability shift |
| Auth retry | Decline handling | Recovery rate vs cost |
| Checkout flow | Payment form changes | Conversion rate |
Cohort Selection
Never test on all traffic. Pick cohorts that:
- Are large enough for statistical significance (500+ decisions)
- Represent meaningful segments
- Can be isolated (a deterministic bucketing sketch follows below)
Good cohorts:
- Geographic (US vs EU vs APAC)
- Payment method (card vs wallet)
- Transaction type (CP vs CNP)
- BIN range (specific issuers)
- Device type (mobile vs desktop)
- Customer type (new vs returning)
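One way to keep cohort assignment deterministic is to hash a stable key, so the same customer always lands in the same bucket. A sketch, assuming customer IDs are stable strings; the experiment name and ID are illustrative:

```python
import hashlib

def bucket(customer_id: str, experiment: str, num_buckets: int = 100) -> int:
    """Deterministically map a customer to a bucket in [0, num_buckets).

    Salting the hash with the experiment name keeps assignments
    independent across concurrent experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{customer_id}".encode()).hexdigest()
    return int(digest, 16) % num_buckets

# Buckets 0-24 (25% of customers) go to the test arm; the rest are control.
print(bucket("cust_8431", "cnp-velocity-v2") < 25)
```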
Sample Size Guidelines
| Decision Volume | Minimum Test Duration | Notes |
|---|---|---|
| Under 100/day | 2-4 weeks | May be inconclusive |
| 100-500/day | 1-2 weeks | Standard test period |
| 500-2000/day | 3-7 days | Faster feedback |
| Over 2000/day | 1-3 days | Can iterate quickly |
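The durations above assume your cohort can actually detect the effect you care about. A rough per-arm check using the standard two-proportion, normal-approximation formula; scipy is assumed available, and the baseline/target rates are illustrative:

```python
from scipy.stats import norm

def sample_size_per_arm(p_base: float, p_new: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Decisions needed per arm to detect a shift from p_base to p_new."""
    z_a = norm.ppf(1 - alpha / 2)  # two-sided significance threshold
    z_b = norm.ppf(power)          # power threshold
    p_bar = (p_base + p_new) / 2
    numerator = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_b * (p_base * (1 - p_base) + p_new * (1 - p_new)) ** 0.5) ** 2
    return int(numerator / (p_base - p_new) ** 2) + 1

# Detecting an auth-rate lift from 85% to 87% needs roughly 4,700 decisions per arm.
print(sample_size_per_arm(0.85, 0.87))
```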
Test to Run (2-4 weeks)
Week 1: Shadow Phase
- Choose one rule change - Example: tighten velocity on CNP high-risk BINs
- Implement shadow logging - Log would-block decisions
- Tag approved transactions - Mark those that would have been blocked
- Monitor daily - Check false positive rate
Week 2: Analysis
- Calculate false positives - Good orders that would have been blocked
- Calculate true positives - Fraud/disputes that would have been caught
- Assess impact - Revenue at risk vs fraud prevented
- Decide: proceed, modify, or abandon
Week 3: Ramp (if proceeding)
- Enable on 10-25% of traffic - Real enforcement, limited scope (see the ramp-gate sketch below)
- Monitor hourly - Watch for unexpected blocks
- Check customer support - Any complaints about declines?
- Compare to control - Does reality match shadow?
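A ramp gate can reuse the same salted-hash bucketing as the cohort sketch above, so a customer stays in or out of the enforced slice for the ramp's entire duration. A sketch with illustrative names:

```python
import hashlib

def in_ramp(customer_id: str, experiment: str, ramp_pct: int) -> bool:
    """True if this customer falls inside the enforced slice (buckets 0-99)."""
    h = int(hashlib.sha256(f"{experiment}:{customer_id}".encode()).hexdigest(), 16)
    return h % 100 < ramp_pct

# Week 3: enforce for 10% of customers; everyone else stays on the old rule.
print(in_ramp("cust_8431", "cnp-velocity-v2", ramp_pct=10))
```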
Week 4: Full Rollout (if successful)
- Roll to 100% - Only if Week 3 metrics are stable
- Document baseline - New normal for this rule
- Set ongoing alerts - Detect drift from baseline
- Plan next experiment - Continuous improvement
Metrics to Track
Primary Metrics (choose one per test)
| Metric | Definition | Target Direction |
|---|---|---|
| Auth rate | Approved / Attempted | Higher is better |
| Block rate | Blocked / Attempted | Lower is usually better |
| Fraud rate | Disputes / Approved | Lower is better |
| Conversion | Completed / Started | Higher is better |
Secondary Metrics (monitor, don't optimize)
| Metric | Why Track It |
|---|---|
| False positive rate | Catch good-order blocking |
| Support tickets | Detect customer friction |
| Revenue per attempt | Net effect on business |
| Soft vs hard decline mix | Understand decline sources |
Analyst Calculations
Block rate = Blocked transactions / Total attempts
False positive rate = (Would-block AND no dispute) / Would-block
True positive rate = (Would-block AND disputed) / Total disputed
Lift = (New auth rate - Baseline auth rate) / Baseline auth rate
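The same formulas applied to the shadow log, as a minimal sketch; the row shape is assumed, not a specific schema, and rows cover approved transactions only (blocked ones have no outcome label):

```python
def shadow_metrics(rows):
    """Compute the rates above from rows shaped like
    {"shadow": "block" | "allow", "disputed": bool}."""
    would_block = [r for r in rows if r["shadow"] == "block"]
    disputed = [r for r in rows if r["disputed"]]
    return {
        "block_rate": len(would_block) / len(rows),
        "false_positive_rate": sum(not r["disputed"] for r in would_block) / max(len(would_block), 1),
        "true_positive_rate": sum(r["shadow"] == "block" for r in disputed) / max(len(disputed), 1),
    }

sample = [
    {"shadow": "block", "disputed": False},  # would-block, good order: false positive
    {"shadow": "block", "disputed": True},   # would-block, fraud: true positive
    {"shadow": "allow", "disputed": True},   # allowed fraud: missed
    {"shadow": "allow", "disputed": False},  # correct allow
]
print(shadow_metrics(sample))  # block_rate 0.5, FP rate 0.5, TP rate 0.5
```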
Stop Rules
Define before launch. Honor when triggered.
Example Stop Rules
| Condition | Action |
|---|---|
| False positive rate > 2% | Pause experiment |
| Auth rate drops > 1% vs control | Investigate |
| Support tickets spike 2x | Pause and review |
| Revenue at risk > $X | Rollback |
| Any P0 incident | Immediate rollback |
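One way to make stop rules executable rather than aspirational is to encode them as data and check them on every metrics refresh. A sketch; the thresholds mirror the example table and are not prescriptions:

```python
from dataclasses import dataclass

@dataclass
class StopRules:
    """Thresholds from the table above; tune per experiment."""
    max_false_positive_rate: float = 0.02
    max_auth_rate_drop: float = 0.01
    max_ticket_multiplier: float = 2.0

    def check(self, fp_rate, auth_delta, ticket_ratio):
        """Return the action to take, or None if all rules pass."""
        if fp_rate > self.max_false_positive_rate:
            return "pause"
        if ticket_ratio > self.max_ticket_multiplier:
            return "pause"
        if auth_delta < -self.max_auth_rate_drop:
            return "investigate"
        return None

# Evaluate on every metrics refresh; wire the result to alerts/rollback.
print(StopRules().check(fp_rate=0.025, auth_delta=-0.002, ticket_ratio=1.1))  # "pause"
```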
Rollback Requirements
Before launching any experiment:
- Confirm rollback is one-click or automated (see the kill-switch sketch after this list)
- Test rollback in staging
- Document rollback procedure
- Assign rollback authority
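The simplest one-click rollback is a single flag the rule reads on every decision. A sketch using an environment variable; the flag name is hypothetical, and in production you would use your feature-flag system instead:

```python
import os

def rule_mode(experiment: str) -> str:
    """One-variable kill switch: read the rule's mode from a flag.

    "enforce" = live blocking, "shadow" = log only, "off" = disabled.
    Flipping the flag is the rollback; "shadow" is the safe default
    if the flag is missing.
    """
    return os.environ.get(f"RULE_MODE_{experiment.upper()}", "shadow")

print(rule_mode("cnp_velocity_v2"))  # "shadow" unless the env var is set
```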
Scale Callout
| Volume | Approach |
|---|---|
| Under $100k/mo | Shadow only; avoid live blocks. Use alerts for spikes. Small tests rarely reach statistical significance. |
| $100k-$1M/mo | Shadow → 25% ramp → full if false positives under 1%. Document everything. |
| Over $1M/mo | Require rollback switch, alerting, daily review during ramp. Dedicated owner per experiment. |
Where This Breaks
- No labeled outcomes. If you can't tell good from bad orders, fix tagging first. No experimentation without truth labels.
- Chargeback feedback lags. 30-90 day delay on dispute data. Use alerts, SAFE, TC40 to shorten the loop.
- Testing during peak periods. Black Friday, promotions, holidays skew results. Avoid or heavily caveat.
- Multiple simultaneous changes. Can't attribute results. Isolate one variable per test.
- No operator-dev handshake. Engineers deploy, operators don't know. Add "show me the shadow logs" checkpoint.
Common Experimentation Mistakes
| Mistake | Consequence | Prevention |
|---|---|---|
| No shadow period | Blocked good orders immediately | Always shadow first |
| Test run too short | Inconclusive results | Minimum sample sizes |
| No stop rules | Runaway false positives | Define before launch |
| Multiple changes | Can't attribute results | One variable at a time |
| No rollback plan | Stuck with bad rule | Test rollback first |
| Ignoring support signals | Customer friction unnoticed | Monitor tickets |
Experimentation Infrastructure
Minimum Requirements
- Shadow logging - Record shadow decisions separately
- Outcome tagging - Link transactions to disputes/refunds (see the labeling sketch below)
- Cohort assignment - Deterministic customer/transaction bucketing
- Metrics dashboard - Real-time visibility
- Alert system - Trigger on stop rule conditions
- Rollback mechanism - Quick revert capability
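Outcome tagging reduces to a join between transactions and dispute records. A sketch with an illustrative schema; adapt the ID fields and dispute sources to your reporting pipeline:

```python
def label_outcomes(transactions, disputed_ids):
    """Attach truth labels by joining transactions to dispute records.

    disputed_ids: transaction IDs with a chargeback, alert, SAFE,
    or TC40 record.
    """
    disputed = set(disputed_ids)
    return [
        {**txn, "label": "bad" if txn["id"] in disputed else "good"}
        for txn in transactions
    ]

print(label_outcomes([{"id": "t1"}, {"id": "t2"}], disputed_ids=["t2"]))
```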
Nice to Have
- Statistical significance calculator - Built into dashboard
- Automatic ramping - Gradual traffic increase
- Experiment registry - Track all active/past tests
- Cross-experiment interference detection - Catch conflicts
Next Steps
Setting up your first experiment?
- Implement shadow mode - Log decisions without blocking
- Design the test - Test type, cohort, sample size
- Define stop rules - Before you launch
Running a test now?
- Follow the 2-4 week timeline - Shadow, analyze, ramp, rollout
- Track key metrics - Primary and secondary
- Know when to stop - Honor the rules
Building experimentation infrastructure?
- Meet minimum requirements - Shadow logging, tagging, cohorts
- Avoid common mistakes - No shadow, too short, no rollback
- Scale appropriately - By transaction volume
Related
- Processor Rules Configuration - Native fraud tools
- Velocity Rules - Rate-based detection
- Auth Optimization - Improving approval rates
- Processor Reporting Checklist - Data requirements
- Alerts Configuration - Monitoring setup
- Risk Scoring - Score thresholds
- 3D Secure - Authentication testing
- Rules vs. ML - Detection approaches
- Checkout Conversion - Friction impact
- Fraud Metrics - Measuring performance
- Chargeback Metrics - Dispute tracking
- Benchmarks - Performance targets