The Model That Looked Perfect (Until It Didn't)
I still remember the first backtesting disaster we had. Our model showed 12% ROI over two years of historical data. We were celebrating.
Then we deployed it. First month: -8%. Second month: -6%. What happened?
Leakage. We'd accidentally used closing odds to train a model that was supposed to predict at opening. Of course it looked amazing in backtests—it was seeing the future.
That experience taught me more about proper validation than any textbook ever could.
Leakage: The Silent Model Killer
Data leakage happens when your model accidentally sees information it shouldn't have at prediction time. It's surprisingly easy to do.
Common leakage sources we've caught:
1. Closing odds in training data when you predict at opening
2. Final lineup data when your prediction timestamp is before lineups are announced
3. Post-match statistics sneaking into feature calculations
4. Season-end information leaking into mid-season predictions
The fix is simple but requires discipline: timestamp-lock everything. Every feature must be tied to a specific moment in time, and you can only use data that was available *before* that moment.
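A minimal sketch of what that lock can look like in pandas. The schema here (an `events` frame with `observed_at`, `opening_odds`, and `shots` columns) is hypothetical, not a description of any real pipeline:

```python
import pandas as pd

def build_features_at(events: pd.DataFrame, prediction_time: pd.Timestamp) -> pd.DataFrame:
    """Build features using only data observed strictly before the prediction time."""
    # Hypothetical schema: one row per observed event, stamped with when we saw it.
    visible = events[events["observed_at"] < prediction_time]
    # Aggregate however you like; the point is that nothing recorded after the
    # cutoff (closing odds, final lineups, post-match stats) can enter the features.
    return visible.groupby("match_id").agg(
        avg_opening_odds=("opening_odds", "mean"),
        shots_observed=("shots", "sum"),
    )

# Example: features for a prediction made 24 hours before kickoff.
# cutoff = kickoff_time - pd.Timedelta(hours=24)
# X = build_features_at(events, cutoff)
```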
Cherry-Picking: How We Lie to Ourselves
This one is subtle because it often happens unconsciously.
"Let's just test on the top 5 leagues—that's where the data is cleanest."
"We'll drop the COVID seasons—those were weird anyway."
"Only matches with complete data—otherwise it's not fair."
Each of these sounds reasonable. But together, they create a dataset that doesn't represent reality. Your model learns to perform well on carefully selected conditions, then fails in the real world.
Our rule now: define inclusion criteria *before* you run any experiments, and stick to them no matter what.
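One way to make that rule stick is to write the criteria down as code before any experiment runs, and route every dataset through the same filter. The sketch below is illustrative only; the field names, league codes, and season cutoffs are placeholders:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InclusionCriteria:
    """Decided before any experiment runs, then applied to every dataset."""
    leagues: tuple = ("ENG1", "ESP1", "ITA1", "GER1", "FRA1")  # placeholder codes
    min_season: int = 2018
    include_covid_seasons: bool = True   # excluding them is a decision, not a default
    require_complete_data: bool = False

def passes(match: dict, criteria: InclusionCriteria) -> bool:
    """The same filter runs on training, test, and live data."""
    if match["league"] not in criteria.leagues:
        return False
    if match["season"] < criteria.min_season:
        return False
    if not criteria.include_covid_seasons and match["season"] in (2020, 2021):
        return False
    if criteria.require_complete_data and not match.get("data_complete", False):
        return False
    return True
```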
The Time-Based Split Problem
Standard machine learning practice is to randomly split data into train/test sets. For sports prediction, this is wrong.
Why? Because matches from the same season share context. If you randomly mix 2023 and 2024 matches, your training set includes games played after the ones you're testing on, so the model gets information about how the 2023 season unfolded that it could never have had at prediction time.
The right approach: train on earlier time periods, test on later ones. We use walk-forward windows with an expanding training set:
- Train on months 1-12
- Test on months 13-18
- Then train on months 1-18, test on 19-24
- And so on
This mimics how the model will actually be used.
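A rough sketch of that loop, assuming a `matches` DataFrame with a `kickoff` timestamp column (both names are illustrative):

```python
import pandas as pd

def walk_forward_splits(matches: pd.DataFrame, train_months: int = 12, test_months: int = 6):
    """Yield (train, test) pairs: the training window expands, the test window slides forward."""
    matches = matches.sort_values("kickoff")
    start = matches["kickoff"].min()
    train_end = start + pd.DateOffset(months=train_months)
    while True:
        test_end = train_end + pd.DateOffset(months=test_months)
        train = matches[matches["kickoff"] < train_end]
        test = matches[(matches["kickoff"] >= train_end) & (matches["kickoff"] < test_end)]
        if test.empty:
            break
        yield train, test
        train_end = test_end  # next round trains on everything seen so far

# for train, test in walk_forward_splits(matches):
#     model.fit(train[FEATURES], train["outcome"])
#     evaluate(model, test)
```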
Football Changes. Your Model Might Not Notice.
Here's something that took us a while to learn: a model trained on 2020 data might not work in 2024.
Football evolves. Tactics change. Teams get new coaches. The relationship between features and outcomes shifts over time.
We now evaluate performance across multiple time windows and check for drift. If accuracy drops significantly in recent periods, that's a red flag—even if overall numbers look good.
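A simple way to run that check is to score held-out predictions per calendar window and watch the trend. The sketch below assumes a pandas frame of predictions with a `kickoff` timestamp, an integer `outcome` (0 = home, 1 = draw, 2 = away), and probability columns `p_home`, `p_draw`, `p_away`; all of those names are assumptions for the example:

```python
import pandas as pd
from sklearn.metrics import log_loss

def log_loss_by_window(preds: pd.DataFrame, freq: str = "6MS") -> pd.Series:
    """Multiclass log loss per calendar window, as a simple drift check."""
    windows = preds.groupby(pd.Grouper(key="kickoff", freq=freq))
    return windows.apply(
        lambda g: log_loss(g["outcome"], g[["p_home", "p_draw", "p_away"]], labels=[0, 1, 2])
        if len(g) else float("nan")
    )

# A score that worsens in recent windows is the red flag described above,
# even when the pooled number over the whole backtest still looks fine.
```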
What We Check Before Trusting Any Backtest
Our internal checklist:
1. Timestamp audit: Is every feature locked to prediction time?
2. Inclusion review: Are we using consistent criteria across train and test?
3. Time-based splits: No random mixing of periods
4. Multi-window evaluation: Does performance hold across different time periods?
5. Baseline comparison: Are we actually beating the market consensus? (See the sketch below.)
If any of these fail, the backtest results are meaningless.
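As an example of that last check, here is a rough sketch of a market-consensus baseline built from decimal odds. The column layout (home/draw/away) and the 0/1/2 outcome encoding are assumptions for the example, not a fixed convention:

```python
import numpy as np
from sklearn.metrics import log_loss

def market_probs(decimal_odds) -> np.ndarray:
    """Implied probabilities from decimal odds, normalised to strip the bookmaker margin."""
    implied = 1.0 / np.asarray(decimal_odds, dtype=float)   # shape (n_matches, 3)
    return implied / implied.sum(axis=1, keepdims=True)

def beats_market(y_true, model_probs, decimal_odds) -> bool:
    """A backtest only counts if the model scores better than the market consensus."""
    baseline = market_probs(decimal_odds)
    return (log_loss(y_true, model_probs, labels=[0, 1, 2])
            < log_loss(y_true, baseline, labels=[0, 1, 2]))
```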
Key Takeaways
1. Leakage can make any model look amazing (until deployment)
2. Cherry-picking happens subtly; define criteria upfront
3. Time-based splits are mandatory for sports data
4. Football changes; evaluate across multiple time windows
5. Always compare against baselines, not just against random
*OddsFlow provides AI-powered sports analysis for educational and informational purposes.*

