The Model That Looked Perfect (Until It Didn't)
I still remember the first backtesting disaster we had. Our model showed 12% ROI over two years of historical data. We were celebrating.
Then we deployed it. First month: -8%. Second month: -6%. What happened?
Leakage. We'd accidentally used closing odds to train a model that was supposed to predict at opening. Of course it looked amazing in backtests—it was seeing the future.
That experience taught me more about proper validation than any textbook ever could.
Leakage: The Silent Model Killer
Data leakage happens when your model accidentally sees information it shouldn't have at prediction time. It's surprisingly easy to do.
Common leakage sources we've caught:
1. Closing odds in training data when you predict at opening
2. Final lineup data when your prediction timestamp is before lineups are announced
3. Post-match statistics sneaking into feature calculations
4. Season-end information leaking into mid-season predictions
The fix is simple but requires discipline: timestamp-lock everything. Every feature must be tied to a specific moment in time, and you can only use data that was available *before* that moment.
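A minimal sketch of what that lock can look like in pandas. The schema here (an `events` frame with `observed_at`, `opening_odds`, and `shots` columns) is hypothetical, not a description of any real pipeline:

```python
import pandas as pd

def build_features_at(events: pd.DataFrame, prediction_time: pd.Timestamp) -> pd.DataFrame:
    """Build features using only data observed strictly before the prediction time."""
    # Hypothetical schema: one row per observed event, stamped with when we saw it.
    visible = events[events["observed_at"] < prediction_time]
    # Aggregate however you like; the point is that nothing recorded after the
    # cutoff (closing odds, final lineups, post-match stats) can enter the features.
    return visible.groupby("match_id").agg(
        avg_opening_odds=("opening_odds", "mean"),
        shots_observed=("shots", "sum"),
    )

# Example: features for a prediction made 24 hours before kickoff.
# cutoff = kickoff_time - pd.Timedelta(hours=24)
# X = build_features_at(events, cutoff)
```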
Cherry-Picking: How We Lie to Ourselves
This one is subtle because it often happens unconsciously.
"Let's just test on the top 5 leagues—that's where the data is cleanest."
"We'll drop the COVID seasons—those were weird anyway."
"Only matches with complete data—otherwise it's not fair."
Each of these sounds reasonable. But together, they create a dataset that doesn't represent reality. Your model learns to perform well on carefully selected conditions, then fails in the real world.
Our rule now: define inclusion criteria *before* you run any experiments, and stick to them no matter what.
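One way to make that rule stick is to write the criteria down as code before any experiment runs, and route every dataset through the same filter. The sketch below is illustrative only; the field names, league codes, and season cutoffs are placeholders:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InclusionCriteria:
    """Decided before any experiment runs, then applied to every dataset."""
    leagues: tuple = ("ENG1", "ESP1", "ITA1", "GER1", "FRA1")  # placeholder codes
    min_season: int = 2018
    include_covid_seasons: bool = True   # excluding them is a decision, not a default
    require_complete_data: bool = False

def passes(match: dict, criteria: InclusionCriteria) -> bool:
    """The same filter runs on training, test, and live data."""
    if match["league"] not in criteria.leagues:
        return False
    if match["season"] < criteria.min_season:
        return False
    if not criteria.include_covid_seasons and match["season"] in (2020, 2021):
        return False
    if criteria.require_complete_data and not match.get("data_complete", False):
        return False
    return True
```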
The Time-Based Split Problem
Standard machine learning practice is to randomly split data into train/test sets. For sports prediction, this is wrong.
Why? Because matches from the same season share context. If you randomly mix 2023 and 2024 matches, your training set includes games played after the ones you're testing on, so the model gets information about how the 2023 season unfolded that it could never have had at prediction time.
The right approach: train on earlier time periods, test on later ones. We use walk-forward windows with an expanding training set:
- Train on months 1-12
- Test on months 13-18
- Then train on months 1-18, test on 19-24
- And so on
This mimics how the model will actually be used.
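A rough sketch of that loop, assuming a `matches` DataFrame with a `kickoff` timestamp column (both names are illustrative):

```python
import pandas as pd

def walk_forward_splits(matches: pd.DataFrame, train_months: int = 12, test_months: int = 6):
    """Yield (train, test) pairs: the training window expands, the test window slides forward."""
    matches = matches.sort_values("kickoff")
    start = matches["kickoff"].min()
    train_end = start + pd.DateOffset(months=train_months)
    while True:
        test_end = train_end + pd.DateOffset(months=test_months)
        train = matches[matches["kickoff"] < train_end]
        test = matches[(matches["kickoff"] >= train_end) & (matches["kickoff"] < test_end)]
        if test.empty:
            break
        yield train, test
        train_end = test_end  # next round trains on everything seen so far

# for train, test in walk_forward_splits(matches):
#     model.fit(train[FEATURES], train["outcome"])
#     evaluate(model, test)
```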
Football Changes. Your Model Might Not Notice.
Here's something that took us a while to learn: a model trained on 2020 data might not work in 2024.
Football evolves. Tactics change. Teams get new coaches. The relationship between features and outcomes shifts over time.
We now evaluate performance across multiple time windows and check for drift. If accuracy drops significantly in recent periods, that's a red flag—even if overall numbers look good.
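A simple way to run that check is to score held-out predictions per calendar window and watch the trend. The sketch below assumes a pandas frame of predictions with a `kickoff` timestamp, an integer `outcome` (0 = home, 1 = draw, 2 = away), and probability columns `p_home`, `p_draw`, `p_away`; all of those names are assumptions for the example:

```python
import pandas as pd
from sklearn.metrics import log_loss

def log_loss_by_window(preds: pd.DataFrame, freq: str = "6MS") -> pd.Series:
    """Multiclass log loss per calendar window, as a simple drift check."""
    windows = preds.groupby(pd.Grouper(key="kickoff", freq=freq))
    return windows.apply(
        lambda g: log_loss(g["outcome"], g[["p_home", "p_draw", "p_away"]], labels=[0, 1, 2])
        if len(g) else float("nan")
    )

# A score that worsens in recent windows is the red flag described above,
# even when the pooled number over the whole backtest still looks fine.
```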
What We Check Before Trusting Any Backtest
Our internal checklist:
1. Timestamp audit: Is every feature locked to prediction time?
2. Inclusion review: Are we using consistent criteria across train and test?
3. Time-based splits: No random mixing of periods
4. Multi-window evaluation: Does performance hold across different time periods?
5. Baseline comparison: Are we actually beating the market consensus? (See the sketch below.)
If any of these fail, the backtest results are meaningless.
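As an example of that last check, here is a rough sketch of a market-consensus baseline built from decimal odds. The column layout (home/draw/away) and the 0/1/2 outcome encoding are assumptions for the example, not a fixed convention:

```python
import numpy as np
from sklearn.metrics import log_loss

def market_probs(decimal_odds) -> np.ndarray:
    """Implied probabilities from decimal odds, normalised to strip the bookmaker margin."""
    implied = 1.0 / np.asarray(decimal_odds, dtype=float)   # shape (n_matches, 3)
    return implied / implied.sum(axis=1, keepdims=True)

def beats_market(y_true, model_probs, decimal_odds) -> bool:
    """A backtest only counts if the model scores better than the market consensus."""
    baseline = market_probs(decimal_odds)
    return (log_loss(y_true, model_probs, labels=[0, 1, 2])
            < log_loss(y_true, baseline, labels=[0, 1, 2]))
```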
Key Takeaways
1. Leakage can make any model look amazing (until deployment)
2. Cherry-picking happens subtly; define criteria upfront
3. Time-based splits are mandatory for sports data
4. Football changes; evaluate across multiple time windows
5. Always compare against baselines, not just against random
*OddsFlow provides AI-powered sports analysis for educational and informational purposes.*

