Risk Scoring
Before implementing risk scoring, understand:
- Fraud types you're detecting
- Risk appetite and thresholds
- Rules vs ML approaches
- Fraud metrics for measurement
- Risk score = 0-100 number indicating fraud likelihood; use for auto-approve / review / auto-decline tiers
- Rules: Email domain, velocity, geo mismatch; transparent, fast to deploy, catch known patterns
- ML: Complex patterns, novel fraud—needs data science expertise and labeled training data
- Best approach: Combine both—rules for known fraud + ML for subtle patterns
- Threshold tuning: Run A/B tests; calculate cost of false positives vs. cost of fraud; adjust quarterly
A risk score is just a number. The question is: does it help you make better decisions?
Your thresholds are bets. You're trading false positives (blocking good customers) for true positives (blocking fraud). The "right" threshold depends on your margins, your fraud rate, and your tolerance for customer complaints.
The threshold sweep experiment: run three cutoff sets in parallel on small slices of traffic:
- Segment A: Score under 30 auto-approve, 30-60 review, over 60 auto-block
- Segment B: Score under 40 auto-approve, 40-70 review, over 70 auto-block
- Segment C: Score under 50 auto-approve, 50-80 review, over 80 auto-block
Metrics: Fraud loss + review cost + estimated false positive cost (average order value × number of blocked orders × estimated share that were good customers)
Run length: 4 weeks (you need time for chargebacks to materialize)
Decision: Pick the cutoff with lowest total cost. Probably not the tightest one.
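A minimal sketch of that comparison in Python, assuming you've logged per-segment fraud losses, review counts, and blocked-order counts over the test window; all cost constants and segment numbers below are illustrative placeholders, not recommendations.

```python
# Compare total cost per threshold segment from the sweep.
# Cost constants and segment numbers are illustrative placeholders.
REVIEW_COST = 3.00          # assumed fully loaded cost per manual review
AVG_ORDER_VALUE = 100.00    # assumed average order value
GOOD_CUSTOMER_SHARE = 0.50  # assumed share of blocked orders that were legitimate

segments = {
    "A (30/60)": {"fraud_loss": 4200.0, "reviews": 900, "blocked": 120},
    "B (40/70)": {"fraud_loss": 5100.0, "reviews": 600, "blocked": 80},
    "C (50/80)": {"fraud_loss": 6800.0, "reviews": 350, "blocked": 40},
}

def total_cost(seg: dict) -> float:
    # Fraud loss + review cost + estimated false positive cost
    fp_cost = seg["blocked"] * GOOD_CUSTOMER_SHARE * AVG_ORDER_VALUE
    return seg["fraud_loss"] + seg["reviews"] * REVIEW_COST + fp_cost

for name, seg in sorted(segments.items(), key=lambda kv: total_cost(kv[1])):
    print(f"Segment {name}: total cost ${total_cost(seg):,.2f}")
```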
What Is a Risk Score?
A risk score is a number assigned to each transaction indicating the likelihood that it's fraudulent. Higher scores mean higher risk.
Common scales:
- 0-100 (higher = riskier)
- 0-1000 (more granular)
- 0-1 probability (true probability)
How it's used:
Score 0-30: Auto-approve
Score 31-70: Manual review
Score 71-100: Auto-decline
The thresholds depend on your risk tolerance, margins, and operational capacity. These aren't magic numbers - they represent a trade-off between catching fraud (true positives) and wrongly declining good customers (false positives).
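A minimal sketch of the tier mapping in Python; the default cutoffs simply mirror the example tiers above and should be tuned to your own margins, fraud rate, and review capacity.

```python
def decide(score: float, approve_below: int = 31, decline_from: int = 71) -> str:
    """Map a 0-100 risk score to a decision tier.

    The default cutoffs mirror the example tiers above; tune them to your
    own margins, fraud rate, and review capacity.
    """
    if score < approve_below:
        return "auto_approve"
    if score < decline_from:
        return "manual_review"
    return "auto_decline"

print(decide(12))  # auto_approve
print(decide(55))  # manual_review
print(decide(88))  # auto_decline
```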
Rules-Based Scoring
Rules are explicit conditions that add or subtract from a transaction's risk score.
How Rules Work
Each rule evaluates a condition and applies a score adjustment:
IF email_domain = "tempmail.com" THEN +30
IF shipping_country != billing_country THEN +15
IF customer_has_previous_orders > 5 THEN -10
IF device_seen_on_fraud_before = true THEN +50
IF amount > $500 THEN +10
Final score = base score + sum of all triggered rules.
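A sketch of how a rules evaluator might look in Python, assuming each transaction arrives as a dict; the field names and point values mirror the illustrative rules above and are not a prescribed schema. Returning the triggered rule names alongside the score keeps every decision explainable.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    condition: Callable[[dict], bool]  # True if the rule fires for this transaction
    adjustment: int                    # points added (negative = trust signal)

# Illustrative rules mirroring the examples above; transaction field names are assumptions.
RULES = [
    Rule("disposable_email", lambda t: t["email_domain"] == "tempmail.com", +30),
    Rule("country_mismatch", lambda t: t["shipping_country"] != t["billing_country"], +15),
    Rule("repeat_customer", lambda t: t["previous_orders"] > 5, -10),
    Rule("device_on_fraud_list", lambda t: t["device_seen_on_fraud"], +50),
    Rule("high_value_order", lambda t: t["amount"] > 500, +10),
]

def score_transaction(txn: dict, base_score: int = 0) -> tuple[int, list[str]]:
    """Return (final score, names of triggered rules) so decisions stay explainable."""
    fired = [r for r in RULES if r.condition(txn)]
    return base_score + sum(r.adjustment for r in fired), [r.name for r in fired]
```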
Types of Rules
Identity rules:
- Email validity (deliverable, disposable domain, recently created)
- Phone number validation
- Name consistency across data points
Transaction rules:
- Order amount (high value = higher risk)
- Product category (high-fraud categories)
- Shipping method (expedited = higher risk)
- Billing/shipping mismatch
Behavioral rules:
- Time to complete checkout (too fast = bot)
- Session behavior (copy-paste vs. typing)
- Multiple failed attempts before success
Velocity rules:
- Orders per IP per hour
- Cards per email per day
- Shipping addresses per card per week
Device and location rules:
- Proxy/VPN detected
- Device fingerprint seen on fraud
- Geolocation vs. billing country mismatch (see AVS)
Rules: Pros and Cons
Pros:
- Transparent - you know exactly why a transaction was flagged
- Controllable - adjust instantly for new patterns
- Explainable - easy to justify decisions to customers, banks, auditors
- No training data required
Cons:
- Reactive - must manually add rules for new fraud patterns
- Brittle - fraudsters learn and adapt to your rules
- Maintenance burden - rule sets grow unwieldy over time
- Limited pattern recognition - can't catch subtle correlations
Machine Learning Scoring
ML models analyze historical transaction data to identify patterns that predict fraud, including patterns too complex for humans to define as rules.
How ML Scoring Works
- Training: Model is fed historical transactions labeled as fraud/legitimate
- Learning: Model identifies features and patterns correlated with fraud
- Scoring: For new transactions, model outputs fraud probability
- Feedback loop: New fraud outcomes are fed back to improve the model
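A sketch of the supervised-learning path using scikit-learn, assuming you already have an engineered feature matrix X and confirmed fraud labels y; the model choice, split parameters, and function names are illustrative, not a recommendation.

```python
# Assumes an engineered feature matrix X and a label vector y (1 = confirmed fraud).
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def train_fraud_model(X, y):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )
    model = GradientBoostingClassifier()
    model.fit(X_train, y_train)
    # Hold-out AUC: how well the model ranks fraud above legitimate orders.
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return model, auc

def fraud_score(model, features) -> float:
    """Fraud probability in [0, 1]; multiply by 100 for a 0-100 score."""
    return float(model.predict_proba([features])[0, 1])
```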
Types of ML Models
Supervised learning:
- Learns from labeled examples (this was fraud, this wasn't)
- Most common approach for fraud scoring
- Requires clean, labeled historical data
Unsupervised learning:
- Identifies anomalies without labels
- Useful for catching new fraud types
- Higher false positive rate
Neural networks:
- Can find complex, non-linear patterns
- "Black box" - harder to explain why a score was assigned
- Requires large amounts of data
ML: Pros and Cons
Pros:
- Adaptive - learns from new fraud patterns automatically
- Scalable - handles millions of transactions without manual rule updates
- Pattern recognition - catches subtle correlations humans miss
- Continuous improvement - gets better with more data
Cons:
- Black box - hard to explain individual decisions
- Data requirements - needs substantial labeled data to train
- Cold start problem - poor performance until enough data is collected
- Can learn biases from historical data
Combining Rules and ML
The best fraud prevention systems use both approaches:
Transaction arrives
↓
Rules evaluate (known patterns)
↓
ML model evaluates (complex patterns)
↓
Scores combined
↓
Decision + explanation
Why both?
- Rules catch known, obvious fraud patterns instantly
- ML catches emerging patterns and subtle signals
- Rules provide explainability when ML triggers
- ML reduces rule maintenance burden
Cold Start Strategy
When launching or with limited data:
- Use rules more heavily at launch - they work immediately without training data
- Slowly let ML carry more weight as you accumulate labeled outcomes
- Don't turn off rules just because you add ML - use your rule hits as labeled inputs to the model
- Feed chargeback/fraud outcomes back to improve ML over time
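One way to implement that weight shift, as a sketch: blend 0-100 rule and ML scores, letting the ML weight grow with the number of labeled outcomes. The ramp length and weight ceiling below are assumptions to tune against your own volume.

```python
def blended_score(rule_score: float, ml_score: float, labeled_outcomes: int,
                  full_trust_at: int = 50_000, max_ml_weight: float = 0.7) -> float:
    """Blend 0-100 rule and ML scores, shifting weight toward ML as labels accumulate.

    `full_trust_at` and `max_ml_weight` are assumptions: the labeled-outcome
    count at which ML reaches its ceiling, and that ceiling itself.
    """
    ml_weight = min(labeled_outcomes / full_trust_at, 1.0) * max_ml_weight
    return (1 - ml_weight) * rule_score + ml_weight * ml_score

print(blended_score(rule_score=60, ml_score=20, labeled_outcomes=5_000))   # still mostly rules
print(blended_score(rule_score=60, ml_score=20, labeled_outcomes=60_000))  # mostly ML
```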
Example Combined System
Rule: Shipping to known fraud address → +70 points
Rule: Email domain is disposable → +20 points
Rule: Customer has 3+ successful orders → -15 points
ML score: 0.35 (35% fraud probability) → +35 points
___________
Final score: 110 points → DECLINE
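The same worked example expressed as code, with the ML probability mapped onto the 0-100 point scale; the rule names, weights, and the 70-point decline cutoff are illustrative.

```python
# The worked example above in code; names, weights, and cutoffs are illustrative.
rule_adjustments = {
    "known_fraud_shipping_address": +70,
    "disposable_email_domain": +20,
    "trusted_repeat_customer": -15,
}
ml_fraud_probability = 0.35  # model output in [0, 1]

final_score = sum(rule_adjustments.values()) + round(ml_fraud_probability * 100)
decision = "DECLINE" if final_score > 70 else "REVIEW" if final_score > 30 else "APPROVE"
print(final_score, decision)  # 110 DECLINE
```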
Setting Thresholds
Your threshold strategy depends on:
| Factor | Favors lower thresholds (stricter) | Favors higher thresholds (looser) |
|---|---|---|
| Margin | Low margin (can't absorb fraud) | High margin (can absorb some fraud) |
| Product | Physical goods (lost forever) | Digital (can revoke access) |
| Chargeback ratio | Near network thresholds | Comfortable buffer |
| Customer experience | Less important | Critical to business |
| Review capacity | Large review team | Limited/no review team |
The Trade-Off Curve
Your thresholds represent a choice along the ROC (Receiver Operating Characteristic) curve:
- Lower threshold = catch more fraud, but also decline more good customers
- Higher threshold = approve more good customers, but also let through more fraud
There's no "correct" threshold - it depends on what your business can tolerate. Optimizing metrics like AUC, precision, recall, or F1 score helps you find the right balance, but ultimately it's a business decision.
Three-Tier Strategy
Tier 1: Auto-approve (low scores)
- Fast customer experience
- No manual intervention
- Accept some fraud slippage
Tier 2: Manual review (middle scores)
- Human evaluates ambiguous cases
- Can request additional verification
- Higher operational cost
Tier 3: Auto-decline (high scores)
- Block obvious fraud
- Some false positives (lost good customers)
- Can offer alternative payment methods
Finding Your Thresholds
Those "approve below 40, decline above 70" recommendations are someone else's guess. Here's how to find yours:
1. Calculate your cost of false positive:
Average order value × Gross margin × Probability customer never returns
If your AOV is $100, margin is 30%, and 50% of blocked customers never return: $100 × 0.3 × 0.5 = $15 per false positive
2. Calculate your cost of fraud:
Average fraud amount + Chargeback fee + Operational cost
If average fraud is $150, CB fee is $25, ops cost is $10: $185 per fraud
3. Find the break-even: At what threshold does the cost of false positives equal the cost of fraud prevented?
4. Test your hypothesis: Set thresholds based on your calculation. Run for 30 days. Measure actual costs. Adjust.
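A worked version of steps 1-3 in code, using the same illustrative numbers; the break-even condition compares the expected loss of approving a transaction against the expected cost of declining a good customer.

```python
# Steps 1-3 in code, with the same illustrative numbers.
AOV, MARGIN, NEVER_RETURN = 100.0, 0.30, 0.50
cost_false_positive = AOV * MARGIN * NEVER_RETURN            # $15 per blocked good customer

AVG_FRAUD, CHARGEBACK_FEE, OPS_COST = 150.0, 25.0, 10.0
cost_fraud = AVG_FRAUD + CHARGEBACK_FEE + OPS_COST           # $185 per fraudulent order

# Decline when the expected loss of approving (p × cost_fraud) exceeds the
# expected cost of declining a good customer ((1 - p) × cost_false_positive).
break_even_p = cost_false_positive / (cost_false_positive + cost_fraud)
print(f"Decline when estimated fraud probability exceeds {break_even_p:.1%}")  # 7.5%
```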
As you tune, watch for:
- Score calibration: A score of 80 should mean roughly 80% of those transactions are fraud. Check whether yours does. Many vendor scores aren't well-calibrated.
- Score drift: Model performance degrades over time. Re-test quarterly.
- Feedback loops: If you never tell the model what was actually fraud, it gets stale. Make sure chargeback outcomes flow back.
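A quick calibration check is straightforward if you have historical scores and confirmed outcomes; this sketch buckets 0-100 scores and compares each bucket's observed fraud rate to what a calibrated score would imply (the bucket midpoint).

```python
import numpy as np

def calibration_table(scores: np.ndarray, is_fraud: np.ndarray, bin_width: int = 10) -> None:
    """Bucket historical 0-100 scores and compare each bucket's observed fraud
    rate to what a calibrated score would imply (the bucket midpoint)."""
    for low in range(0, 100, bin_width):
        in_bucket = (scores >= low) & (scores < low + bin_width)
        if not in_bucket.any():
            continue
        observed = is_fraud[in_bucket].mean()
        expected = (low + bin_width / 2) / 100
        print(f"scores {low:2d}-{low + bin_width - 1:2d}: "
              f"expected ~{expected:.0%} fraud, observed {observed:.0%}")
```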
Key Metrics
Fraud detection rate (True Positive Rate / Recall): What percentage of actual fraud did you catch?
Fraud detected / Total fraud × 100
False positive rate: What percentage of good transactions were wrongly declined?
Good transactions declined / Total good transactions × 100
Precision: Of transactions you flagged as fraud, how many actually were?
True fraud flagged / All transactions flagged × 100
Review rate: What percentage of transactions go to manual review?
Transactions in review / Total transactions × 100
Ideal: High detection rate, low false positive rate, manageable review rate.
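These rates fall directly out of confusion-matrix counts; a small helper like the sketch below (the counts in the example call are illustrative) keeps the definitions consistent across reports.

```python
def fraud_metrics(tp: int, fp: int, tn: int, fn: int, reviewed: int, total: int) -> dict:
    """Key rates from confusion-matrix counts.
    tp = fraud caught, fp = good orders declined, tn = good orders approved,
    fn = fraud missed, reviewed = orders sent to manual review."""
    return {
        "detection_rate": tp / (tp + fn) if (tp + fn) else 0.0,        # recall
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "review_rate": reviewed / total if total else 0.0,
    }

print(fraud_metrics(tp=180, fp=95, tn=9_600, fn=20, reviewed=400, total=10_000))
```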
Building vs. Buying
Build your own:
- Full control over rules and models
- Can optimize for your specific fraud patterns
- Requires data science expertise
- Ongoing maintenance burden
Buy a solution:
- Faster to implement
- Vendor has consortium data (sees fraud across many merchants)
- Less control over scoring logic
- Per-transaction costs
Hybrid:
- Use vendor for ML/consortium data
- Layer your own rules on top
- Best of both worlds for many merchants
Vendor Landscape
Note: This space evolves constantly. Evaluate vendors based on your specific stack, geography, and risk profile.
| Category | Examples |
|---|---|
| Standalone fraud platforms | Forter, Riskified, Signifyd, SEON |
| Processor-integrated | Stripe Radar, Adyen Risk, Checkout.com FDP |
| Identity/device | Kount, ThreatMetrix, BioCatch |
| Rules engines | Splunk, Datadog (DIY) |
Next Steps
Just getting started with scoring?
- Use your processor's built-in scoring → Stripe Radar, Adyen Risk, etc.
- Define three buckets → Auto-approve, review, auto-decline
- Track your false positive rate → Customer complaints are the signal
Tuning your thresholds?
- Run the threshold sweep experiment (see top of page) → Data beats intuition
- Segment by transaction type → Different thresholds for different products
- Track fraud rate AND false positive rate → Optimize the tradeoff, not just one metric
Building custom scoring?
- Review rules vs. ML tradeoffs → Know when to use which
- Start with rules on known patterns → ML for novel detection
- Invest in feature engineering → Good features beat complex models
See Also
- Rules vs ML - Choosing detection approaches
- Velocity Rules - Detecting abuse patterns
- Device Fingerprinting - Tracking devices across sessions
- Behavioral Analytics - How users interact
- AVS & CVV - Address and card verification
- 3D Secure - Authentication and liability shift
- Manual Review - Human investigation process
- Fraud Metrics - Measuring detection performance
- Chargeback Metrics - Tracking dispute rates
- Processor Rules Configuration - Processor-level rules
- Fraud Vendors - Third-party scoring tools
- Experimentation - Testing score thresholds