
System Audit Report — SoccerPredictAI

Date: 2026-04-28
Auditor: GitHub Copilot (Claude Opus 4.7) — /skill-ml-system-audit full
Scope: High-level audit — architecture, flows, contracts, risks
Method: Code analysis (dvc.yaml, params.yaml, src/, airflow/, docker/, k8s/) + diff vs prior cycles
Baseline cycle: 2026-04-24 (00_system_audit_v2.md + audits 01–11)
Last weekly check: 2026-04-26 (00_system_audit.md)


⚡ Delta since 2026-04-26

| Check | Result |
|---|---|
| New commits since 2026-04-26 | None (HEAD = c64561d, dated 2026-04-22) |
| Source files modified since 2026-04-26 (via find -newer) | None in src/, airflow/, dvc.yaml, params.yaml, docker/ |
| Working-tree status | Same uncommitted modifications as observed in 20260426/00 |

Conclusion: production code, DVC pipeline, FastAPI/Celery, and Airflow DAGs are unchanged. All findings, risks, and contracts from the 20260424 baseline remain in force. This full cycle (audits 00–12) re-confirms the baseline; only audit 12_docs_validation is new.
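The "modified since last check" scan in the delta table can be reproduced with `find -newer`. A minimal sketch against a throwaway directory (in the repo the scanned paths would be src/, airflow/, dvc.yaml, params.yaml, docker/; the exact invocation used by the audit is an assumption):

```shell
# Sketch of the find-newer delta check, run against a temporary directory
# with hypothetical files; in the real repo the scan targets the audit scope.
workdir=$(mktemp -d)
touch -d '2026-04-25' "$workdir/old_file"          # predates the last check
touch -d '2026-04-27' "$workdir/new_file"          # postdates the last check
touch -d '2026-04-26 00:00' "$workdir/.last_check" # marker for 2026-04-26
# Anything newer than the marker counts as "modified since last check":
find "$workdir" -newer "$workdir/.last_check" -type f
```

An empty result (as in this cycle) means no source files changed after the marker date.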


1. Architecture (current)

[WhoScored.com]
       │ (Selenoid browser automation)
[Celery worker-api] ──► [PostgreSQL]
       ▲                      │
[Airflow DAGs ×5]        [export task]
 @hourly / manual             │
 X-Token via Variable    [MinIO: data-raw/]
                         [DVC Pipeline: 15 stages]
                         params.yaml / Hydra conf/
                         [MLflow Tracking + Registry]
                         soccer_clf@champion
                    ┌─────────┴──────────┐
            [Celery worker-ml]    [batch_inference DVC]
            PredictionService      match_features.parquet
            (load on init)         → MinIO predictions/
                    │                      │
                    └──────────┬───────────┘
                          [FastAPI]
                     /predict /livescores /monitoring /healthcheck
                          [Nginx / K8s Ingress]
                          [Streamlit UI on external VPS]

No structural change vs 20260424.


2. Layers — status

| Layer | Implementation | Status |
|---|---|---|
| Product / Problem | 1×2 classification, target=outcome_1x2 | |
| Data | scraping, PostgreSQL, MinIO, DVC ingestion | |
| Feature | rolling stats (5 windows), ELO per tournament, side=diff | |
| Model | baseline / LogReg / HGBT / XGBoost + Optuna + isotonic | ✅ ⚠️ smoke params |
| Experimentation | MLflow matches_clf_smoke, nested runs | |
| Pipeline (DVC/Hydra) | 15 stages, GE gates ×3 | |
| Serving | FastAPI sync/async, FeatureLookupService, Prometheus | |
| UI | Streamlit + nginx | |
| Orchestration | Airflow 5 DAGs (livescores ×4, export ×1) | |
| Ops / Infra | Docker ×10, K8s Helm (single-node), pydantic-settings | |
| Testing | unit, property, service, contract, load (Locust), GE | ✅ 🚧 live integration |
| Documentation | MkDocs, ADRs, runbooks, status, validation audits | |

3. Key flows (unchanged from baseline)

  • Data: WhoScored → Selenoid → worker-api → PostgreSQL → export → MinIO data-raw/ → DVC load_data_from_sources → preprocessing → features → splits.
  • Features: finished.parquet → stats_matches.py (rolling) + elo.py (per tournamentId, k=32, init=1500, home_adv=50) → features.parquet + features_meta.parquet (single contract).
  • Model: dataset.parquet + folds → classification_models (screening) → tune_xgb → final_train (isotonic, calib_frac=0.15) → register_model → soccer_clf@champion.
  • Execution (training): manual dvc repro + manual worker-ml restart. No auto-trigger.
  • Execution (serving): UI → FastAPI → Celery ml (sync, 30s timeout) / async via Redis polling; batch lookup via FeatureLookupService.

Details and code references in 20260424 baseline 00_system_audit_v2.md §3.
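The per-tournament ELO update in the feature flow can be sketched as follows. A minimal sketch using the documented parameters (k=32, init=1500, home_adv=50); function and variable names are illustrative, not taken from elo.py:

```python
# Minimal per-tournament ELO sketch; parameters match the audit
# (k=32, init=1500, home_adv=50). Names are illustrative only.
from collections import defaultdict

K, INIT, HOME_ADV = 32.0, 1500.0, 50.0

def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the logistic ELO model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def update(ratings: dict, tournament: int, home: str, away: str, s_home: float) -> None:
    """Update ratings in place; s_home is 1 (home win), 0.5 (draw), 0 (away win).

    Ratings are keyed per (tournamentId, team), matching the per-tournament scheme.
    """
    r_h = ratings[(tournament, home)]
    r_a = ratings[(tournament, away)]
    e_h = expected(r_h + HOME_ADV, r_a)  # home advantage shifts the home rating
    ratings[(tournament, home)] = r_h + K * (s_home - e_h)
    ratings[(tournament, away)] = r_a + K * ((1.0 - s_home) - (1.0 - e_h))

ratings = defaultdict(lambda: INIT)
update(ratings, 2, "TeamA", "TeamB", 1.0)  # home win in a hypothetical tournamentId=2
```

The symmetric update conserves the rating sum within a tournament, which makes ratings comparable only inside their own tournament pool.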


4. Contracts (unchanged)

| Type | Contract | Source of truth |
|---|---|---|
| Feature | column names + dtypes | features_meta.parquet |
| Model | sklearn Pipeline + XGBoost via mlflow.sklearn, output proba[0,1,2] | MLflow Registry soccer_clf@champion |
| API | Pydantic schemas + auto-OpenAPI | src/app/schemas/predict.py |
| Data | .parquet at all boundaries, .minio.json versioning | DVC outs + MinIO metadata |
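A feature contract of this shape can be enforced at load time. A pure-Python sketch, where the expected dict stands in for the names/dtypes that the real check would read from features_meta.parquet, and the column names shown are hypothetical:

```python
# Sketch of a schema gate: compare a frame's actual column dtypes against
# the expected contract. The `expected` dict stands in for features_meta.parquet;
# the column names here are hypothetical.
def check_feature_contract(actual: dict[str, str], expected: dict[str, str]) -> list[str]:
    """Return a list of human-readable contract violations (empty list = OK)."""
    problems = []
    for col, dtype in expected.items():
        if col not in actual:
            problems.append(f"missing column: {col}")
        elif actual[col] != dtype:
            problems.append(f"dtype mismatch for {col}: {actual[col]} != {dtype}")
    for col in actual:
        if col not in expected:
            problems.append(f"unexpected column: {col}")
    return problems

expected = {"elo_diff": "float64", "goals_rolling_5": "float64", "outcome_1x2": "int64"}
actual = {"elo_diff": "float64", "goals_rolling_5": "float32"}  # drifted dtype, missing target
violations = check_feature_contract(actual, expected)
```

Failing fast on a non-empty violations list keeps training and serving pinned to the same single contract.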

5. Risk register (delta: no changes)

| ID | Risk | Severity | Status |
|---|---|---|---|
| R1 | params.yaml in smoke mode (n_trials=2, fracs=[0.001, 0.002]) | 🔴 HIGH | Open |
| R2 | DVC pipeline manual; no auto-trigger after ingestion | 🔴 HIGH | Open |
| R3 | No model hot-reload in serving | 🔴 HIGH | Open |
| R4 | stats.py router not registered in main.py (dead endpoint) | 🟡 MED | Open |
| R5 | batch_inference staleness — serving may use stale features silently | 🔴 HIGH | Open |
| R6 | No metric gate before champion promotion | 🔴 HIGH | Open |
| R7 | No drift detection (Evidently not integrated) | 🟡 MED | Open |
| R8 | Single-node K8s; no HA for PostgreSQL / MinIO | 🟡 MED | Open (known) |

6. Component map (unchanged)

See 20260424 00_system_audit_v2.md §6 — same set of components in same layers.


7. Summary

System maturity: MEDIUM  (unchanged)

Strengths:
- Full end-to-end MLOps stack (DVC, MLflow, Celery, Airflow, Prometheus, K8s)
- features_meta.parquet as single feature contract
- GE validation gates at 3 boundaries (raw, finished, features)
- Independent batch_inference DVC branch
- Prometheus metrics on inference latency, confidence, model info
- Detailed documentation and audit trail (3 cycles in docs/validation/)

Top open risks:
- R1 smoke params in params.yaml
- R2 no automated retrain trigger
- R3 no hot-reload of model
- R6 no metric gate before champion promotion

Unknowns (require deeper audits):
- Current champion in MLflow registry and its metrics → see 05
- Real cadence of batch_inference re-runs → see 08

8. References

  • Baseline detailed audits 01–11: docs/validation/20260424/*.md
  • Last weekly check: docs/validation/20260426/00_system_audit.md
  • Detailed audits this cycle: docs/validation/20260428/01..12_*.md

Recommendation: since there has been no production change since 2026-04-22, audits 01–11 in this cycle are confirmation reports against the 20260424 baseline. The next material full cycle should be triggered by an actual commit to src/, dvc.yaml, airflow/, or docker/.