
System Audit Report — SoccerPredictAI

Date: 2026-04-28
Auditor: GitHub Copilot (Claude Opus 4.7) — /skill-ml-system-audit full
Scope: High-level audit — architecture, flows, contracts, risks
Method: Code analysis (dvc.yaml, params.yaml, src/, airflow/, docker/, k8s/) + diff vs prior cycles
Baseline cycle: 2026-04-24 (00_system_audit_v2.md + audits 01–11)
Last weekly check: 2026-04-26 (00_system_audit.md)


⚡ Delta since 2026-04-26

| Check | Result |
|---|---|
| New commits since 2026-04-26 | None (HEAD = c64561d, dated 2026-04-22) |
| Source files modified since 2026-04-26 (via find -newer) | None in src/, airflow/, dvc.yaml, params.yaml, docker/ |
| Working-tree status | Same uncommitted modifications as observed in 20260426/00 |

Conclusion: production code, DVC pipeline, FastAPI/Celery, and Airflow DAGs are unchanged. All findings, risks, and contracts from the 20260424 baseline remain in force. This full cycle (audits 00–12) re-confirms the baseline; only audit 12_docs_validation is new.
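The "modified since last check" scan in the delta table can be reproduced with `find -newer`. A minimal sketch against a throwaway directory (in the repo the scanned paths would be src/, airflow/, dvc.yaml, params.yaml, docker/; the exact invocation used by the audit is an assumption):

```shell
# Sketch of the find-newer delta check, run against a temporary directory
# with hypothetical files; in the real repo the scan targets the audit scope.
workdir=$(mktemp -d)
touch -d '2026-04-25' "$workdir/old_file"          # predates the last check
touch -d '2026-04-27' "$workdir/new_file"          # postdates the last check
touch -d '2026-04-26 00:00' "$workdir/.last_check" # marker for 2026-04-26
# Anything newer than the marker counts as "modified since last check":
find "$workdir" -newer "$workdir/.last_check" -type f
```

An empty result (as in this cycle) means no source files changed after the marker date.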


1. Architecture (current)

[WhoScored.com]
       │ (Selenoid browser automation)
[Celery worker-api] ──► [PostgreSQL]
       ▲                      │
[Airflow DAGs ×5]        [export task]
 @hourly / manual             │
 X-Token via Variable    [MinIO: data-raw/]
                         [DVC Pipeline: 15 stages]
                         params.yaml / Hydra conf/
                         [MLflow Tracking + Registry]
                         soccer_clf@champion
                    ┌─────────┴──────────┐
            [Celery worker-ml]    [batch_inference DVC]
            PredictionService      match_features.parquet
            (load on init)         → MinIO predictions/
                    │                      │
                    └──────────┬───────────┘
                          [FastAPI]
                     /predict /livescores /monitoring /healthcheck
                          [Nginx / K8s Ingress]
                          [Streamlit UI on external VPS]

No structural change vs 20260424.


2. Layers — status

| Layer | Implementation | Status |
|---|---|---|
| Product / Problem | 1×2 classification, target=outcome_1x2 | |
| Data | scraping, PostgreSQL, MinIO, DVC ingestion | |
| Feature | rolling stats (5 windows), ELO per tournament, side=diff | |
| Model | baseline / LogReg / HGBT / XGBoost + Optuna + isotonic | ✅ ⚠️ smoke params |
| Experimentation | MLflow matches_clf_smoke, nested runs | |
| Pipeline (DVC/Hydra) | 15 stages, GE gates ×3 | |
| Serving | FastAPI sync/async, FeatureLookupService, Prometheus | |
| UI | Streamlit + nginx | |
| Orchestration | Airflow 5 DAGs (livescores ×4, export ×1) | |
| Ops / Infra | Docker ×10, K8s Helm (single-node), pydantic-settings | |
| Testing | unit, property, service, contract, load (Locust), GE | ✅ 🚧 live integration |
| Documentation | MkDocs, ADRs, runbooks, status, validation audits | |

3. Key flows (unchanged from baseline)

  • Data: WhoScored → Selenoid → worker-api → PostgreSQL → export → MinIO data-raw/ → DVC load_data_from_sources → preprocessing → features → splits.
  • Features: finished.parquet → stats_matches.py (rolling) + elo.py (per tournamentId, k=32, init=1500, home_adv=50) → features.parquet + features_meta.parquet (single contract).
  • Model: dataset.parquet + folds → classification_models (screening) → tune_xgb → final_train (isotonic, calib_frac=0.15) → register_model → soccer_clf@champion.
  • Execution (training): manual dvc repro + manual worker-ml restart. No auto-trigger.
  • Execution (serving): UI → FastAPI → Celery ml (sync, 30s timeout) / async via Redis polling; batch lookup via FeatureLookupService.

Details and code references in 20260424 baseline 00_system_audit_v2.md §3.
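The per-tournament ELO update in the feature flow can be sketched as follows. A minimal sketch using the documented parameters (k=32, init=1500, home_adv=50); function and variable names are illustrative, not taken from elo.py:

```python
# Minimal per-tournament ELO sketch; parameters match the audit
# (k=32, init=1500, home_adv=50). Names are illustrative only.
from collections import defaultdict

K, INIT, HOME_ADV = 32.0, 1500.0, 50.0

def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the logistic ELO model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def update(ratings: dict, tournament: int, home: str, away: str, s_home: float) -> None:
    """Update ratings in place; s_home is 1 (home win), 0.5 (draw), 0 (away win).

    Ratings are keyed per (tournamentId, team), matching the per-tournament scheme.
    """
    r_h = ratings[(tournament, home)]
    r_a = ratings[(tournament, away)]
    e_h = expected(r_h + HOME_ADV, r_a)  # home advantage shifts the home rating
    ratings[(tournament, home)] = r_h + K * (s_home - e_h)
    ratings[(tournament, away)] = r_a + K * ((1.0 - s_home) - (1.0 - e_h))

ratings = defaultdict(lambda: INIT)
update(ratings, 2, "TeamA", "TeamB", 1.0)  # home win in a hypothetical tournamentId=2
```

The symmetric update conserves the rating sum within a tournament, which makes ratings comparable only inside their own tournament pool.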


4. Contracts (unchanged)

| Type | Contract | Source of truth |
|---|---|---|
| Feature | column names + dtypes | features_meta.parquet |
| Model | sklearn Pipeline + XGBoost via mlflow.sklearn, output proba[0,1,2] | MLflow Registry soccer_clf@champion |
| API | Pydantic schemas + auto-OpenAPI | src/app/schemas/predict.py |
| Data | .parquet at all boundaries, .minio.json versioning | DVC outs + MinIO metadata |
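A feature contract of this shape can be enforced at load time. A pure-Python sketch, where the expected dict stands in for the names/dtypes that the real check would read from features_meta.parquet, and the column names shown are hypothetical:

```python
# Sketch of a schema gate: compare a frame's actual column dtypes against
# the expected contract. The `expected` dict stands in for features_meta.parquet;
# the column names here are hypothetical.
def check_feature_contract(actual: dict[str, str], expected: dict[str, str]) -> list[str]:
    """Return a list of human-readable contract violations (empty list = OK)."""
    problems = []
    for col, dtype in expected.items():
        if col not in actual:
            problems.append(f"missing column: {col}")
        elif actual[col] != dtype:
            problems.append(f"dtype mismatch for {col}: {actual[col]} != {dtype}")
    for col in actual:
        if col not in expected:
            problems.append(f"unexpected column: {col}")
    return problems

expected = {"elo_diff": "float64", "goals_rolling_5": "float64", "outcome_1x2": "int64"}
actual = {"elo_diff": "float64", "goals_rolling_5": "float32"}  # drifted dtype, missing target
violations = check_feature_contract(actual, expected)
```

Failing fast on a non-empty violations list keeps training and serving pinned to the same single contract.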

5. Risk register (delta: no changes)

| ID | Risk | Severity | Status |
|---|---|---|---|
| R1 | params.yaml in smoke mode (n_trials=2, fracs=[0.001, 0.002]) | 🔴 HIGH | Open |
| R2 | DVC pipeline manual; no auto-trigger after ingestion | 🔴 HIGH | Open |
| R3 | No model hot-reload in serving | 🔴 HIGH | Open |
| R4 | stats.py router not registered in main.py (dead endpoint) | 🟡 MED | Open |
| R5 | batch_inference staleness — serving may use stale features silently | 🔴 HIGH | Open |
| R6 | No metric gate before champion promotion | 🔴 HIGH | Open |
| R7 | No drift detection (Evidently not integrated) | 🟡 MED | Open |
| R8 | Single-node K8s; no HA for PostgreSQL / MinIO | 🟡 MED | Open (known) |

6. Component map (unchanged)

See 20260424 00_system_audit_v2.md §6 — same set of components in same layers.


7. Summary

System maturity: MEDIUM  (unchanged)

Strengths:
- Full end-to-end MLOps stack (DVC, MLflow, Celery, Airflow, Prometheus, K8s)
- features_meta.parquet as single feature contract
- GE validation gates at 3 boundaries (raw, finished, features)
- Independent batch_inference DVC branch
- Prometheus metrics on inference latency, confidence, model info
- Detailed documentation and audit trail (3 cycles in docs/validation/)

Top open risks:
- R1 smoke params in params.yaml
- R2 no automated retrain trigger
- R3 no hot-reload of model
- R6 no metric gate before champion promotion

Unknowns (require deeper audits):
- Current champion in MLflow registry and its metrics → see 05
- Real cadence of batch_inference re-runs → see 08

8. References

  • Baseline detailed audits 01–11: docs/validation/20260424/*.md
  • Last weekly check: docs/validation/20260426/00_system_audit.md
  • Detailed audits this cycle: docs/validation/20260428/01..12_*.md

Recommendation: since there has been no production change since 2026-04-22, audits 01–11 in this cycle are confirmation reports against the 20260424 baseline. The next material full cycle should be triggered by an actual commit to src/, dvc.yaml, airflow/, or docker/.