Risk Scoring
Before implementing risk scoring, understand:
- Fraud types you're detecting
- Risk appetite and thresholds
- Rules vs ML approaches
- Fraud metrics for measurement
- Risk score = 0-100 number indicating fraud likelihood; use for auto-approve / review / auto-decline tiers
- Rules: Email domain, velocity, geo mismatch; transparent, fast to deploy, catch known patterns
- ML: Complex patterns, novel fraud—needs data science expertise and labeled training data
- Best approach: Combine both—rules for known fraud + ML for subtle patterns
- Threshold tuning: Run A/B tests; calculate cost of false positives vs. cost of fraud; adjust quarterly
A risk score is just a number. The question is: does it help you make better decisions?
Your thresholds are bets. You're trading false positives (blocking good customers) for true positives (blocking fraud). The "right" threshold depends on your margins, your fraud rate, and your tolerance for customer complaints.
The threshold sweep experiment: run three cutoff sets in parallel on small slices of traffic:
- Segment A: Score under 30 auto-approve, 30-60 review, over 60 auto-block
- Segment B: Score under 40 auto-approve, 40-70 review, over 70 auto-block
- Segment C: Score under 50 auto-approve, 50-80 review, over 80 auto-block
Metrics: Fraud loss + review cost + estimated false positive cost (average order value × number of blocked orders × estimated share that were good customers)
Run length: 4 weeks (you need time for chargebacks to materialize)
Decision: Pick the cutoff with lowest total cost. Probably not the tightest one.
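A minimal sketch of that comparison in Python, assuming you've logged per-segment fraud losses, review counts, and blocked-order counts over the test window; all cost constants and segment numbers below are illustrative placeholders, not recommendations.

```python
# Compare total cost per threshold segment from the sweep.
# Cost constants and segment numbers are illustrative placeholders.
REVIEW_COST = 3.00          # assumed fully loaded cost per manual review
AVG_ORDER_VALUE = 100.00    # assumed average order value
GOOD_CUSTOMER_SHARE = 0.50  # assumed share of blocked orders that were legitimate

segments = {
    "A (30/60)": {"fraud_loss": 4200.0, "reviews": 900, "blocked": 120},
    "B (40/70)": {"fraud_loss": 5100.0, "reviews": 600, "blocked": 80},
    "C (50/80)": {"fraud_loss": 6800.0, "reviews": 350, "blocked": 40},
}

def total_cost(seg: dict) -> float:
    # Fraud loss + review cost + estimated false positive cost
    fp_cost = seg["blocked"] * GOOD_CUSTOMER_SHARE * AVG_ORDER_VALUE
    return seg["fraud_loss"] + seg["reviews"] * REVIEW_COST + fp_cost

for name, seg in sorted(segments.items(), key=lambda kv: total_cost(kv[1])):
    print(f"Segment {name}: total cost ${total_cost(seg):,.2f}")
```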
What Is a Risk Score?
A risk score is a number assigned to each transaction indicating the likelihood that it's fraudulent. Higher scores mean higher risk.
Common scales:
- 0-100 (higher = riskier)
- 0-1000 (more granular)
- 0-1 probability (true probability)
How it's used:
Score 0-30: Auto-approve
Score 31-70: Manual review
Score 71-100: Auto-decline
The thresholds depend on your risk tolerance, margins, and operational capacity. These aren't magic numbers - they represent a trade-off between catching fraud (true positives) and wrongly declining good customers (false positives).
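A minimal sketch of the tier mapping in Python; the default cutoffs simply mirror the example tiers above and should be tuned to your own margins, fraud rate, and review capacity.

```python
def decide(score: float, approve_below: int = 31, decline_from: int = 71) -> str:
    """Map a 0-100 risk score to a decision tier.

    The default cutoffs mirror the example tiers above; tune them to your
    own margins, fraud rate, and review capacity.
    """
    if score < approve_below:
        return "auto_approve"
    if score < decline_from:
        return "manual_review"
    return "auto_decline"

print(decide(12))  # auto_approve
print(decide(55))  # manual_review
print(decide(88))  # auto_decline
```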
Rules-Based Scoring
Rules are explicit conditions that add or subtract from a transaction's risk score.
How Rules Work
Each rule evaluates a condition and applies a score adjustment:
IF email_domain = "tempmail.com" THEN +30
IF shipping_country != billing_country THEN +15
IF customer_has_previous_orders > 5 THEN -10
IF device_seen_on_fraud_before = true THEN +50
IF amount > $500 THEN +10
Final score = base score + sum of all triggered rules.
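A sketch of how a rules evaluator might look in Python, assuming each transaction arrives as a dict; the field names and point values mirror the illustrative rules above and are not a prescribed schema. Returning the triggered rule names alongside the score keeps every decision explainable.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    condition: Callable[[dict], bool]  # True if the rule fires for this transaction
    adjustment: int                    # points added (negative = trust signal)

# Illustrative rules mirroring the examples above; transaction field names are assumptions.
RULES = [
    Rule("disposable_email", lambda t: t["email_domain"] == "tempmail.com", +30),
    Rule("country_mismatch", lambda t: t["shipping_country"] != t["billing_country"], +15),
    Rule("repeat_customer", lambda t: t["previous_orders"] > 5, -10),
    Rule("device_on_fraud_list", lambda t: t["device_seen_on_fraud"], +50),
    Rule("high_value_order", lambda t: t["amount"] > 500, +10),
]

def score_transaction(txn: dict, base_score: int = 0) -> tuple[int, list[str]]:
    """Return (final score, names of triggered rules) so decisions stay explainable."""
    fired = [r for r in RULES if r.condition(txn)]
    return base_score + sum(r.adjustment for r in fired), [r.name for r in fired]
```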
Types of Rules
Identity rules:
- Email validity (deliverable, disposable domain, recently created)
- Phone number validation
- Name consistency across data points
Transaction rules:
- Order amount (high value = higher risk)
- Product category (high-fraud categories)
- Shipping method (expedited = higher risk)
- Billing/shipping mismatch
Behavioral rules:
- Time to complete checkout (too fast = bot)
- Session behavior (copy-paste vs. typing)
- Multiple failed attempts before success
Velocity rules:
- Orders per IP per hour
- Cards per email per day
- Shipping addresses per card per week
Device and location rules:
- Proxy/VPN detected
- Device fingerprint seen on fraud
- Geolocation vs. billing country mismatch (see AVS)
Rules: Pros and Cons
Pros:
- Transparent - you know exactly why a transaction was flagged
- Controllable - adjust instantly for new patterns
- Explainable - easy to justify decisions to customers, banks, auditors
- No training data required
Cons:
- Reactive - must manually add rules for new fraud patterns
- Brittle - fraudsters learn and adapt to your rules
- Maintenance burden - rule sets grow unwieldy over time
- Limited pattern recognition - can't catch subtle correlations
Machine Learning Scoring
ML models analyze historical transaction data to identify patterns that predict fraud, including patterns too complex for humans to define as rules.
How ML Scoring Works
- Training: Model is fed historical transactions labeled as fraud/legitimate
- Learning: Model identifies features and patterns correlated with fraud
- Scoring: For new transactions, model outputs fraud probability
- Feedback loop: New fraud outcomes are fed back to improve the model
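A sketch of the supervised-learning path using scikit-learn, assuming you already have an engineered feature matrix X and confirmed fraud labels y; the model choice, split parameters, and function names are illustrative, not a recommendation.

```python
# Assumes an engineered feature matrix X and a label vector y (1 = confirmed fraud).
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def train_fraud_model(X, y):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )
    model = GradientBoostingClassifier()
    model.fit(X_train, y_train)
    # Hold-out AUC: how well the model ranks fraud above legitimate orders.
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return model, auc

def fraud_score(model, features) -> float:
    """Fraud probability in [0, 1]; multiply by 100 for a 0-100 score."""
    return float(model.predict_proba([features])[0, 1])
```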
Types of ML Models
Supervised learning:
- Learns from labeled examples (this was fraud, this wasn't)
- Most common approach for fraud scoring
- Requires clean, labeled historical data
Unsupervised learning:
- Identifies anomalies without labels
- Useful for catching new fraud types
- Higher false positive rate
Neural networks:
- Can find complex, non-linear patterns
- "Black box" - harder to explain why a score was assigned
- Requires large amounts of data
ML: Pros and Cons
Pros:
- Adaptive - learns from new fraud patterns automatically
- Scalable - handles millions of transactions without manual rule updates
- Pattern recognition - catches subtle correlations humans miss
- Continuous improvement - gets better with more data
Cons:
- Black box - hard to explain individual decisions
- Data requirements - needs substantial labeled data to train
- Cold start problem - poor performance until enough data is collected
- Can learn biases from historical data
Combining Rules and ML
The best fraud prevention systems use both approaches:
Transaction arrives
↓
Rules evaluate (known patterns)
↓
ML model evaluates (complex patterns)
↓
Scores combined
↓
Decision + explanation
Why both?
- Rules catch known, obvious fraud patterns instantly
- ML catches emerging patterns and subtle signals
- Rules provide explainability when ML triggers
- ML reduces rule maintenance burden
Cold Start Strategy
When launching or with limited data:
- Use rules more heavily at launch - they work immediately without training data
- Slowly let ML carry more weight as you accumulate labeled outcomes
- Don't turn off rules just because you add ML - use your rule hits as labeled inputs to the model
- Feed chargeback/fraud outcomes back to improve ML over time
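One way to implement that weight shift, as a sketch: blend 0-100 rule and ML scores, letting the ML weight grow with the number of labeled outcomes. The ramp length and weight ceiling below are assumptions to tune against your own volume.

```python
def blended_score(rule_score: float, ml_score: float, labeled_outcomes: int,
                  full_trust_at: int = 50_000, max_ml_weight: float = 0.7) -> float:
    """Blend 0-100 rule and ML scores, shifting weight toward ML as labels accumulate.

    `full_trust_at` and `max_ml_weight` are assumptions: the labeled-outcome
    count at which ML reaches its ceiling, and that ceiling itself.
    """
    ml_weight = min(labeled_outcomes / full_trust_at, 1.0) * max_ml_weight
    return (1 - ml_weight) * rule_score + ml_weight * ml_score

print(blended_score(rule_score=60, ml_score=20, labeled_outcomes=5_000))   # still mostly rules
print(blended_score(rule_score=60, ml_score=20, labeled_outcomes=60_000))  # mostly ML
```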
Example Combined System
Rule: Shipping to known fraud address → +70 points
Rule: Email domain is disposable → +20 points
Rule: Customer has 3+ successful orders → -15 points
ML score: 0.35 (35% fraud probability) → +35 points
___________
Final score: 110 points → DECLINE
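The same worked example expressed as code, with the ML probability mapped onto the 0-100 point scale; the rule names, weights, and the 70-point decline cutoff are illustrative.

```python
# The worked example above in code; names, weights, and cutoffs are illustrative.
rule_adjustments = {
    "known_fraud_shipping_address": +70,
    "disposable_email_domain": +20,
    "trusted_repeat_customer": -15,
}
ml_fraud_probability = 0.35  # model output in [0, 1]

final_score = sum(rule_adjustments.values()) + round(ml_fraud_probability * 100)
decision = "DECLINE" if final_score > 70 else "REVIEW" if final_score > 30 else "APPROVE"
print(final_score, decision)  # 110 DECLINE
```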
Setting Thresholds
Your threshold strategy depends on:
| Factor | Favors lower thresholds (stricter) | Favors higher thresholds (looser) |
|---|---|---|
| Margin | Low margin (can't absorb fraud) | High margin (can absorb some fraud) |
| Product | Physical goods (lost forever) | Digital (can revoke access) |
| Chargeback ratio | Near network thresholds | Comfortable buffer |
| Customer experience | Less important | Critical to business |
| Review capacity | Large review team | Limited/no review team |
The Trade-Off Curve
Your thresholds represent a choice along the ROC (Receiver Operating Characteristic) curve:
- Lower threshold = catch more fraud, but also decline more good customers
- Higher threshold = approve more good customers, but also let through more fraud
There's no "correct" threshold - it depends on what your business can tolerate. Optimizing metrics like AUC, precision, recall, or F1 score helps you find the right balance, but ultimately it's a business decision.
Three-Tier Strategy
Tier 1: Auto-approve (low scores)
- Fast customer experience
- No manual intervention
- Accept some fraud slippage
Tier 2: Manual review (middle scores)
- Human evaluates ambiguous cases
- Can request additional verification
- Higher operational cost
Tier 3: Auto-decline (high scores)
- Block obvious fraud
- Some false positives (lost good customers)
- Can offer alternative payment methods
Finding Your Thresholds
Those "approve below 40, decline above 70" recommendations are someone else's guess. Here's how to find yours:
1. Calculate your cost of false positive:
Average order value × Gross margin × Probability customer never returns
If your AOV is $100, margin is 30%, and 50% of blocked customers never return: $100 × 0.3 × 0.5 = $15 per false positive
2. Calculate your cost of fraud:
Average fraud amount + Chargeback fee + Operational cost
If average fraud is $150, CB fee is $25, ops cost is $10: $185 per fraud
3. Find the break-even: At what threshold does the cost of false positives equal the cost of fraud prevented?
4. Test your hypothesis: Set thresholds based on your calculation. Run for 30 days. Measure actual costs. Adjust.
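A worked version of steps 1-3 in code, using the same illustrative numbers; the break-even condition compares the expected loss of approving a transaction against the expected cost of declining a good customer.

```python
# Steps 1-3 in code, with the same illustrative numbers.
AOV, MARGIN, NEVER_RETURN = 100.0, 0.30, 0.50
cost_false_positive = AOV * MARGIN * NEVER_RETURN            # $15 per blocked good customer

AVG_FRAUD, CHARGEBACK_FEE, OPS_COST = 150.0, 25.0, 10.0
cost_fraud = AVG_FRAUD + CHARGEBACK_FEE + OPS_COST           # $185 per fraudulent order

# Decline when the expected loss of approving (p × cost_fraud) exceeds the
# expected cost of declining a good customer ((1 - p) × cost_false_positive).
break_even_p = cost_false_positive / (cost_false_positive + cost_fraud)
print(f"Decline when estimated fraud probability exceeds {break_even_p:.1%}")  # 7.5%
```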
As you tune, watch for:
- Score calibration: A score of 80 should mean roughly 80% of those transactions are fraud. Check whether yours does. Many vendor scores aren't well-calibrated.
- Score drift: Model performance degrades over time. Re-test quarterly.
- Feedback loops: If you never tell the model what was actually fraud, it gets stale. Make sure chargeback outcomes flow back.
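A quick calibration check is straightforward if you have historical scores and confirmed outcomes; this sketch buckets 0-100 scores and compares each bucket's observed fraud rate to what a calibrated score would imply (the bucket midpoint).

```python
import numpy as np

def calibration_table(scores: np.ndarray, is_fraud: np.ndarray, bin_width: int = 10) -> None:
    """Bucket historical 0-100 scores and compare each bucket's observed fraud
    rate to what a calibrated score would imply (the bucket midpoint)."""
    for low in range(0, 100, bin_width):
        in_bucket = (scores >= low) & (scores < low + bin_width)
        if not in_bucket.any():
            continue
        observed = is_fraud[in_bucket].mean()
        expected = (low + bin_width / 2) / 100
        print(f"scores {low:2d}-{low + bin_width - 1:2d}: "
              f"expected ~{expected:.0%} fraud, observed {observed:.0%}")
```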
Key Metrics
Fraud detection rate (True Positive Rate / Recall): What percentage of actual fraud did you catch?
Fraud detected / Total fraud × 100
False positive rate: What percentage of good transactions were wrongly declined?
Good transactions declined / Total good transactions × 100
Precision: Of transactions you flagged as fraud, how many actually were?
True fraud flagged / All transactions flagged × 100
Review rate: What percentage of transactions go to manual review?
Transactions in review / Total transactions × 100
Ideal: High detection rate, low false positive rate, manageable review rate.
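These rates fall directly out of confusion-matrix counts; a small helper like the sketch below (the counts in the example call are illustrative) keeps the definitions consistent across reports.

```python
def fraud_metrics(tp: int, fp: int, tn: int, fn: int, reviewed: int, total: int) -> dict:
    """Key rates from confusion-matrix counts.
    tp = fraud caught, fp = good orders declined, tn = good orders approved,
    fn = fraud missed, reviewed = orders sent to manual review."""
    return {
        "detection_rate": tp / (tp + fn) if (tp + fn) else 0.0,        # recall
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "review_rate": reviewed / total if total else 0.0,
    }

print(fraud_metrics(tp=180, fp=95, tn=9_600, fn=20, reviewed=400, total=10_000))
```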
Building vs. Buying
Build your own:
- Full control over rules and models
- Can optimize for your specific fraud patterns
- Requires data science expertise
- Ongoing maintenance burden
Buy a solution:
- Faster to implement
- Vendor has consortium data (sees fraud across many merchants)
- Less control over scoring logic
- Per-transaction costs
Hybrid:
- Use vendor for ML/consortium data
- Layer your own rules on top
- Best of both worlds for many merchants
Vendor Landscape
Note: This space evolves constantly. Evaluate vendors based on your specific stack, geography, and risk profile.
| Category | Examples |
|---|---|
| Standalone fraud platforms | Forter, Riskified, Signifyd, SEON |
| Processor-integrated | Stripe Radar, Adyen Risk, Checkout.com FDP |
| Identity/device | Kount, ThreatMetrix, BioCatch |
| Rules engines | Splunk, Datadog (DIY) |
Next Steps
Just getting started with scoring?
- Use your processor's built-in scoring → Stripe Radar, Adyen Risk, etc.
- Define three buckets → Auto-approve, review, auto-decline
- Track your false positive rate → Customer complaints are the signal
Tuning your thresholds?
- Run the threshold sweep experiment (see top of page) → Data beats intuition
- Segment by transaction type → Different thresholds for different products
- Track fraud rate AND false positive rate → Optimize the tradeoff, not just one metric
Building custom scoring?
- Review rules vs. ML tradeoffs → Know when to use which
- Start with rules on known patterns → ML for novel detection
- Invest in feature engineering → Good features beat complex models
See Also
- Rules vs ML - Choosing detection approaches
- Velocity Rules - Detecting abuse patterns
- Device Fingerprinting - Tracking devices across sessions
- Behavioral Analytics - How users interact
- AVS & CVV - Address and card verification
- 3D Secure - Authentication and liability shift
- Manual Review - Human investigation process
- Fraud Metrics - Measuring detection performance
- Chargeback Metrics - Tracking dispute rates
- Processor Rules Configuration - Processor-level rules
- Fraud Vendors - Third-party scoring tools
- Experimentation - Testing score thresholds