Training & Evaluation Audit Report — SoccerPredictAI

Date: 2026-04-28
Auditor: GitHub Copilot (Claude Opus 4.7), /skill-ml-system-audit full (audit 03/12)
Scope: Training, evaluation, CV, model selection, calibration
Baseline: docs/validation/20260424/03_training_evaluation_audit.md


Delta vs baseline

Neither params.yaml nor any source file under src/models/, src/pipelines/{classification,tune,final_train,ablation}.py, or src/data/splitting.py has been modified since 2026-04-26, so all baseline findings remain in force.


Confirmed configuration

| Aspect | Value | Source |
| --- | --- | --- |
| Train/test split | time-based, test_start=2024-01-01 | params.yaml: temporal.test_start |
| CV | walk-forward year folds 2022, 2023 | params.yaml: temporal.folds_*_year |
| Models | baseline / logreg / sgd_logloss / HGBT / XGBoost | src/pipelines/classification.py |
| fracs_for_train | [0.001, 0.002] | params.yaml: classification.fracs_for_train |
| Tuning | Optuna n_trials=2, frac=0.1 | params.yaml: tuning |
| Calibration | isotonic, calib_frac=0.15, min_calib_samples=100, temporal split | src/pipelines/final_train.py |
| Metrics | logloss (CV/holdout), ECE raw + calibrated, accuracy, segment metrics | src/models/metrics.py |
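The walk-forward scheme in the table can be illustrated with a minimal sketch. It assumes an expanding training window (every row dated before the validation year) and that rows on or after test_start are kept out of every fold; the function name `walk_forward_year_folds` and the toy dates are hypothetical and are not taken from src/data/splitting.py.

```python
from datetime import date


def walk_forward_year_folds(dates, fold_years, test_start):
    """Yield (year, train_idx, valid_idx) for each validation year.

    Each fold trains on every row dated before the validation year and
    validates on rows inside that year. Rows on or after test_start are
    never returned, so the holdout stays out of cross-validation.
    """
    for year in fold_years:
        fold_start = date(year, 1, 1)
        fold_end = min(date(year + 1, 1, 1), test_start)
        train_idx = [i for i, d in enumerate(dates) if d < fold_start]
        valid_idx = [i for i, d in enumerate(dates) if fold_start <= d < fold_end]
        yield year, train_idx, valid_idx


# Toy match dates for illustration only.
match_dates = [
    date(2021, 5, 1),
    date(2022, 3, 1),
    date(2022, 9, 1),
    date(2023, 4, 1),
    date(2024, 2, 1),  # on/after test_start, so excluded from every fold
]
folds = list(walk_forward_year_folds(match_dates, [2022, 2023], date(2024, 1, 1)))
for year, train_idx, valid_idx in folds:
    print(year, train_idx, valid_idx)
```

With the toy dates, the 2022 fold trains on the single 2021 row and validates on the two 2022 rows; the 2023 fold trains on everything before 2023. The 2024 row appears in neither fold.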

Risk register (re-confirmed)

| ID | Severity | Description | Status |
| --- | --- | --- | --- |
| TR-01 | P0 | fracs_for_train=[0.001, 0.002]: smoke values, model trained on 0.1–0.2% of data | Open |
| TR-02 | P0 | tuning.n_trials=2: Optuna tuning meaningless | Open |
| TR-03 | P1 | Holdout used for model selection in classification_models (not blind) | Open |
| TR-04 | P1 | ablation_study stage isolated from tune_xgb / final_train DAG path | Open |
| TR-05 | P2 | No explicit seed for HGBT/XGB CV runs in classification pipeline | Open |
| TR-06 | P2 | If best_model_name != xgb, xgb_best_params.json may be applied to a different model in final_train | Open |
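One possible guard against TR-06 is sketched below. It assumes final training can see which model the tuned parameters were produced for; the function `resolve_final_params`, the `"hgbt"` key, and all parameter values are hypothetical illustrations, not the behaviour of final_train.

```python
def resolve_final_params(best_model_name, tuned_model_name, tuned_params, default_params):
    """Return hyperparameters for final training.

    Tuned values are merged in only when they were tuned for the selected
    model; otherwise the defaults are returned unchanged.
    """
    params = dict(default_params)
    if best_model_name == tuned_model_name:
        params.update(tuned_params)
    return params


# Illustrative values only; not read from xgb_best_params.json.
xgb_tuned = {"max_depth": 4, "learning_rate": 0.05}

print(resolve_final_params("xgb", "xgb", xgb_tuned, {"max_depth": 6}))
print(resolve_final_params("hgbt", "xgb", xgb_tuned, {"max_iter": 200}))
```

In the second call the selected model differs from the tuned one, so the XGBoost parameters are ignored rather than silently passed to another estimator.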

Summary

| Aspect | Status |
| --- | --- |
| Time-based split, no leakage | ✅ |
| Walk-forward CV (2 folds) | ✅ |
| Calibration design (temporal split, isotonic) | ✅ |
| Metrics coverage (logloss, ECE, segments) | ✅ |
| Production-grade training params | ❌ smoke (TR-01, TR-02) |
| Holdout truly blind | ❌ (TR-03) |
| Ablation feeds selection | ❌ (TR-04) |

Recommendation: TR-01 and TR-02 block production. Restore fracs_for_train=[1.0] (or an equivalent full-data setting) and raise tuning.n_trials before deploying any newly trained model. Next, address TR-03 by introducing a held-out validation set that is distinct from the final holdout used by final_train.
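A pre-deployment check for the two blockers could look roughly like the sketch below. The key paths follow the params.yaml sources listed in this report; the function name `find_smoke_settings` and the `min_trials=50` threshold are assumptions chosen for illustration, since the report does not specify a target trial count.

```python
def find_smoke_settings(params, min_frac=1.0, min_trials=50):
    """Return a list of human-readable problems for smoke-level settings.

    min_frac and min_trials are illustrative thresholds, not project policy.
    """
    problems = []

    fracs = params["classification"]["fracs_for_train"]
    if max(fracs) < min_frac:
        problems.append(
            f"TR-01: classification.fracs_for_train={fracs}, "
            f"largest fraction is below {min_frac}"
        )

    n_trials = params["tuning"]["n_trials"]
    if n_trials < min_trials:
        problems.append(
            f"TR-02: tuning.n_trials={n_trials}, below the assumed minimum {min_trials}"
        )

    return problems


# Values as recorded in this audit; in practice they would be loaded from params.yaml.
current = {
    "classification": {"fracs_for_train": [0.001, 0.002]},
    "tuning": {"n_trials": 2, "frac": 0.1},
}
problems = find_smoke_settings(current)
for problem in problems:
    print(problem)
```

Run as a gate before final training or deployment, a non-empty result would stop the pipeline while the smoke values are still in place.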

See baseline §1–§5 for code-level detail.