Training & Evaluation Audit Report — SoccerPredictAI
Date: 2026-04-28
Auditor: GitHub Copilot (Claude Opus 4.7) — /skill-ml-system-audit full (audit 03/12)
Scope: Training, evaluation, CV, model selection, calibration
Baseline: docs/validation/20260424/03_training_evaluation_audit.md
Delta vs baseline
No files under src/models/, and none of src/pipelines/{classification,tune,final_train,ablation}.py, src/data/splitting.py, or params.yaml, have been modified since 2026-04-26. The baseline findings remain in force.
Confirmed configuration
| Aspect | Value | Source |
|---|---|---|
| Train/test split | time-based, test_start=2024-01-01 | params.yaml: temporal.test_start |
| CV | walk-forward year folds 2022, 2023 | params.yaml: temporal.folds_*_year |
| Models | baseline / logreg / sgd_logloss / HGBT / XGBoost | src/pipelines/classification.py |
| fracs_for_train | [0.001, 0.002] | params.yaml: classification.fracs_for_train |
| Tuning | Optuna n_trials=2, frac=0.1 | params.yaml: tuning |
| Calibration | isotonic, calib_frac=0.15, min_calib_samples=100, temporal split | src/pipelines/final_train.py |
| Metrics | logloss (CV/holdout), ECE raw + calibrated, accuracy, segment metrics | src/models/metrics.py |
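
The walk-forward scheme in the table can be illustrated with a minimal sketch. It assumes an expanding training window (every match before 1 January of the validation year) and uses a hypothetical helper, `walk_forward_year_folds`, that is not taken from src/data/splitting.py. The fold years and `test_start` mirror the audited values; the match dates are made up for illustration.

```python
from datetime import date


def walk_forward_year_folds(dates, fold_years, test_start):
    """Yield (year, train_idx, val_idx) for walk-forward yearly folds.

    Assumption: expanding window. Training rows are strictly earlier than
    1 January of the validation year, and rows on or after test_start are
    never used in any fold.
    """
    for year in fold_years:
        fold_start = date(year, 1, 1)
        # Cap the validation window so it can never reach into the holdout.
        fold_end = min(date(year + 1, 1, 1), test_start)
        train_idx = [i for i, d in enumerate(dates) if d < fold_start]
        val_idx = [i for i, d in enumerate(dates) if fold_start <= d < fold_end]
        yield year, train_idx, val_idx


# Illustrative match dates only (not project data).
match_dates = [
    date(2020, 5, 1),
    date(2021, 8, 14),
    date(2022, 3, 5),
    date(2022, 11, 20),
    date(2023, 4, 2),
    date(2024, 2, 10),  # falls in the holdout period
]

folds = list(walk_forward_year_folds(match_dates, [2022, 2023], date(2024, 1, 1)))
for year, train_idx, val_idx in folds:
    print(year, train_idx, val_idx)
```

With these dates, the 2024 match appears in neither fold, which is the property the "no leakage" check relies on.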
Risk register (re-confirmed)
| ID | Severity | Description | Status |
|---|---|---|---|
| TR-01 | P0 | fracs_for_train=[0.001, 0.002]: smoke values, model trained on 0.1–0.2% of data | Open |
| TR-02 | P0 | tuning.n_trials=2: Optuna tuning meaningless | Open |
| TR-03 | P1 | Holdout used for model selection in classification_models (not blind) | Open |
| TR-04 | P1 | ablation_study stage isolated from tune_xgb / final_train DAG path | Open |
| TR-05 | P2 | No explicit seed for HGBT/XGB CV runs in classification pipeline | Open |
| TR-06 | P2 | If best_model_name != xgb, xgb_best_params.json may be applied to a different model in final_train | Open |
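
One way to keep TR-01 and TR-02 from recurring is a pre-training guard on the loaded parameters. The sketch below is an assumption, not existing project code: `find_smoke_params` is a hypothetical helper, the thresholds `min_frac=1.0` and `min_trials=50` are illustrative defaults to be agreed by the team, and the dict stands in for params.yaml after parsing, using only the keys named in this report.

```python
def find_smoke_params(params, min_frac=1.0, min_trials=50):
    """Return a list of human-readable problems for smoke-level settings.

    Thresholds are illustrative assumptions, not project policy.
    """
    problems = []

    fracs = params["classification"]["fracs_for_train"]
    if max(fracs) < min_frac:
        problems.append(
            f"TR-01: max(fracs_for_train)={max(fracs)} is below {min_frac}"
        )

    n_trials = params["tuning"]["n_trials"]
    if n_trials < min_trials:
        problems.append(f"TR-02: tuning.n_trials={n_trials} is below {min_trials}")

    return problems


# Values as recorded in this audit.
audited = {
    "classification": {"fracs_for_train": [0.001, 0.002]},
    "tuning": {"n_trials": 2, "frac": 0.1},
}

issues = find_smoke_params(audited)
for issue in issues:
    print(issue)
```

A CI job or the first stage of the training DAG could fail when the returned list is non-empty, so smoke values cannot reach a release build unnoticed.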
Summary
| Aspect | Status |
|---|---|
| Time-based split, no leakage | ✅ |
| Walk-forward CV (2 folds) | ✅ |
| Calibration design (temporal split, isotonic) | ✅ |
| Metrics coverage (logloss, ECE, segments) | ✅ |
| Production-grade training params | ❌ smoke (TR-01, TR-02) |
| Holdout truly blind | ❌ (TR-03) |
| Ablation feeds selection | ❌ (TR-04) |
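
For readers unfamiliar with the ECE figure listed under metrics coverage, the following is a minimal sketch of one common definition: top-label ECE with equal-width confidence bins. Whether src/models/metrics.py uses this exact variant or bin count is not confirmed by this audit; the function name and `n_bins=10` are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Top-label ECE with equal-width bins over [0, 1].

    confidences: predicted probability of the chosen class, per sample.
    correct:     1 if the chosen class was right, else 0, per sample.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# Four predictions at 90% confidence, three of them correct.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
print(round(ece, 4))
```

Computing the same quantity on raw and on isotonic-calibrated probabilities gives the "raw + calibrated" pair reported by the pipeline.
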
Recommendation: TR-01 and TR-02 are production blockers. Restore fracs_for_train=[1.0] (or an equivalent full-data setting) and raise tuning.n_trials before deploying any newly trained model. Then address TR-03 by introducing a held-out validation set that is distinct from the final holdout used by final_train.
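
The TR-03 remediation can be sketched as a three-way temporal split, so that model selection reads only the validation window and the holdout stays blind. This is a minimal illustration under stated assumptions: `three_way_temporal_split` and the `val_start` boundary are hypothetical and do not exist in params.yaml today; only `test_start=2024-01-01` comes from the audited configuration. The match dates are made up.

```python
from datetime import date


def three_way_temporal_split(dates, val_start, test_start):
    """Split row indices into train / validation / holdout by date.

    train:      d < val_start
    validation: val_start <= d < test_start  (used for model selection)
    holdout:    d >= test_start              (touched only for final reporting)
    """
    if not val_start < test_start:
        raise ValueError("val_start must be earlier than test_start")
    train = [i for i, d in enumerate(dates) if d < val_start]
    val = [i for i, d in enumerate(dates) if val_start <= d < test_start]
    holdout = [i for i, d in enumerate(dates) if d >= test_start]
    return train, val, holdout


# Illustrative match dates only (not project data).
split_dates = [
    date(2022, 9, 1),
    date(2023, 3, 1),
    date(2023, 8, 15),
    date(2023, 12, 30),
    date(2024, 1, 1),
    date(2024, 5, 5),
]

train_idx, val_idx, holdout_idx = three_way_temporal_split(
    split_dates,
    val_start=date(2023, 7, 1),   # hypothetical boundary
    test_start=date(2024, 1, 1),  # from params.yaml: temporal.test_start
)
print(train_idx, val_idx, holdout_idx)
```

Under this layout, classification_models would compare candidates on the validation indices, and final_train would report on the holdout indices once, after the model choice is fixed.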
See baseline §1–§5 for code-level detail.