Orchestration Audit Report: SoccerPredictAI
Date: 2026-04-28
Auditor: GitHub Copilot (Claude Opus 4.7) — /skill-ml-system-audit full (audit 08/12)
Scope: Airflow DAGs — scheduling, dependency graph, retrain loop, fail handling
Baseline: docs/validation/20260424/08_orchestration_audit.md
Delta vs baseline
airflow/dags/*.py unchanged since 2026-04-26. Baseline findings remain in force.
Confirmed inventory (5 DAGs)
| DAG | Schedule | Purpose |
|---|---|---|
| soccer_etl_livescores_01 | @hourly | 3-day rolling livescores scraping |
| soccer_etl_livescores_02_next_matches | @daily | 15-day forward upcoming matches |
| soccer_etl_livescores_backfill_monthly | None (manual) | historical backfill (1998 → 2026-02) |
| soccer_etl_livescores_manual_trigger | None (manual) | 90-day window manual trigger |
| soccer_etl_export_matches_to_source | None (manual) | PostgreSQL → MinIO export (match, match_raw) |
All DAGs: retries=3, retry_delay=5min. All HttpSensors: poke_interval=60, timeout=3600, mode="reschedule". Auth via X-Token from Airflow Variable SOCCER_FASTAPI_HEADER_TOKEN.
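The settings above can be written down as plain Python dictionaries, which is roughly how they would be passed to `DAG(default_args=...)` and `HttpSensor(**kwargs)`. This is a minimal sketch, not the project's actual DAG code: the connection id and endpoint are hypothetical, and reading the token through the `{{ var.value.... }}` template assumes the sensor's `headers` argument is rendered as a templated field (the audited DAGs may call `Variable.get` instead).

```python
from datetime import timedelta

# Retry policy shared by all DAGs, as reported in the audit.
DEFAULT_ARGS = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}

# Keyword arguments for an HttpSensor. poke_interval, timeout and mode come
# from the audit; http_conn_id and endpoint are hypothetical placeholders.
HTTP_SENSOR_KWARGS = {
    "http_conn_id": "soccer_fastapi",  # hypothetical connection id
    "endpoint": "status",              # hypothetical endpoint
    # Assumption: the token is injected via Jinja from the Airflow Variable.
    "headers": {"X-Token": "{{ var.value.SOCCER_FASTAPI_HEADER_TOKEN }}"},
    "poke_interval": 60,   # seconds between pokes
    "timeout": 3600,       # seconds before the sensor gives up
    "mode": "reschedule",  # release the worker slot between pokes
}


def max_pokes(sensor_kwargs):
    """Approximate number of pokes that fit inside the sensor timeout."""
    return sensor_kwargs["timeout"] // sensor_kwargs["poke_interval"]
```

With these values a sensor pokes about 60 times before timing out, and `mode="reschedule"` means it does not hold a worker slot while waiting.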
Missing DAGs: no DAG exists for dvc repro, batch_inference, model retraining, alias promotion, or worker reload.
Risk register (re-confirmed)
| ID | Severity | Description | Status |
|---|---|---|---|
| OR-01 | P0 | No DAG for dvc repro; retrain loop is fully manual | Open (= R2) |
| OR-02 | P0 | No DAG for batch_inference; serving features go stale | Open (= R5) |
| OR-03 | P0 | soccer_etl_export_matches_to_source is manual; no automatic raw-data refresh | Open (= D-03) |
| OR-04 | P1 | No DAG-level alerting (email/Slack) on failures | Open |
| OR-05 | P1 | No automatic retrain trigger (drift/metric/volume) | Open |
| OR-06 | P2 | soccer_etl_livescores_backfill_monthly uses hardcoded date bounds (1998 → 2026-02) | Open |
| OR-07 | P2 | No DAG to monitor freshness across pipeline stages | Open |
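One possible way to close OR-04 is a failure callback attached through `default_args["on_failure_callback"]`, which Airflow calls with the task context when a task fails. The sketch below is an assumption about how that could look, not existing project code: `send` is a hypothetical delivery hook standing in for a Slack or email client, and the message format is illustrative.

```python
def format_failure_alert(context):
    """Build a one-line alert message from an Airflow task context dict."""
    ti = context["task_instance"]
    exc = context.get("exception")
    return (
        f"[airflow] task failed: dag={ti.dag_id} task={ti.task_id} "
        f"try={ti.try_number} error={exc!r}"
    )


def notify_on_failure(context, send=print):
    """Callback usable as on_failure_callback.

    `send` is a hypothetical hook; replace print with a Slack or email sender.
    """
    message = format_failure_alert(context)
    send(message)
    return message
```

Because the callback only reads `dag_id`, `task_id`, and `try_number` from the task instance, it can be unit-tested with a simple stand-in object and no running scheduler.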
Pipeline → Registry → Serving gap
[Airflow] scraping → PostgreSQL ✅ automatic (@hourly, @daily)
[Manual] PostgreSQL → MinIO export ⚠
[Manual] MinIO → DVC repro ⚠
[Manual] DVC → MLflow Registry ⚠
[Manual] Registry → Celery worker restart ⚠
Four manual hops separate fresh data from updated predictions.
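Until the manual hops are automated, a freshness check (the gap noted as OR-07) would at least make staleness visible. The following sketch compares each stage's last-update timestamp against a maximum age. Stage names and thresholds are hypothetical, and how the timestamps are collected (for example from PostgreSQL, MinIO object metadata, or the MLflow Registry) is left open.

```python
from datetime import datetime, timedelta, timezone

# Maximum tolerated age per stage, in pipeline order.
# Stage names and thresholds are hypothetical.
MAX_AGE = {
    "postgres_scrape": timedelta(hours=2),
    "minio_export": timedelta(days=1),
    "dvc_repro": timedelta(days=7),
    "registry_model": timedelta(days=7),
}


def stale_stages(last_updated, now, max_age=MAX_AGE):
    """Return stages whose last update is older than allowed.

    A stage with no recorded timestamp is treated as stale.
    """
    stale = []
    for stage, limit in max_age.items():
        ts = last_updated.get(stage)
        if ts is None or now - ts > limit:
            stale.append(stage)
    return stale
```

A scheduled task could call `stale_stages(timestamps, datetime.now(timezone.utc))` and fail, or raise an alert, whenever the returned list is non-empty.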
Summary
| Aspect | Status |
|---|---|
| DAG inventory + scheduling | ✅ |
| Retry / timeout / reschedule semantics | ✅ |
| Sequential dependencies in export DAG | ✅ |
| Automated retrain loop | ❌ (OR-01, OR-02, OR-03) |
| Failure alerting | ❌ (OR-04) |
Recommendation: OR-01, OR-02, and OR-03 together form the most operationally painful gap. Address them by introducing a soccer_retrain_pipeline DAG (export → repro → batch_inference → registry validation → rolling worker restart), and address OR-04 (failure alerting) in parallel.
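The step order of the proposed soccer_retrain_pipeline can be sketched outside Airflow as a fail-fast runner. This is only an illustration of the chain: apart from `dvc repro`, every command below is a hypothetical placeholder, and in a real DAG each step would be a task chained in this order, where Airflow's default `all_success` trigger rule gives the same stop-on-first-failure behavior.

```python
import subprocess

# Ordered steps of the proposed pipeline. Only `dvc repro` is a known
# command; the module names are hypothetical placeholders.
RETRAIN_STEPS = [
    ("export", ["python", "-m", "pipeline.export_matches"]),
    ("repro", ["dvc", "repro"]),
    ("batch_inference", ["python", "-m", "pipeline.batch_inference"]),
    ("registry_validation", ["python", "-m", "pipeline.validate_registry"]),
    ("worker_rolling_restart", ["python", "-m", "pipeline.restart_workers"]),
]


def run_command(command):
    """Run one command and return its exit code."""
    return subprocess.run(command, check=False).returncode


def run_pipeline(steps=RETRAIN_STEPS, run=run_command):
    """Run steps in order and stop at the first non-zero exit code.

    Returns (completed_step_names, failed_step_name_or_None).
    """
    completed = []
    for name, command in steps:
        if run(command) != 0:
            return completed, name
        completed.append(name)
    return completed, None
```

Stopping before registry validation and the worker restart whenever an earlier step fails keeps a broken export or repro from reaching the serving workers.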
See baseline §1–§5 for code-level detail.