Pipeline DVC + Hydra Audit Report: SoccerPredictAI
Date: 2026-04-28
Auditor: GitHub Copilot (Claude Opus 4.7) — /skill-ml-system-audit full (audit 04/12)
Scope: DVC DAG correctness, params coverage, reproducibility, Hydra integration
Baseline: docs/validation/20260424/04_pipeline_dvc_hydra_audit.md
Delta vs baseline
dvc.yaml, params.yaml, conf/, and src/pipelines/** have not been modified since 2026-04-26, so the baseline findings remain in force.
Confirmed structure
- 15 DVC stages (load_data_from_sources → validate_raw → export_metadata → preprocessing → validate_finished/future → feature_engineering → validate_features → split_data → batch_inference (independent) → classification_models → ablation_study (parallel, isolated outs) → tune_xgb → final_train → register_model).
- load_data_from_sources: always_changed: true (intentional MinIO polling).
- dvc.lock present; all declared params reachable in code; no dead params.
- Hydra conf/ exists but is not wired; pipeline entrypoints call load_params() directly on params.yaml.
Risk register (re-confirmed)
| ID | Severity | Description | Status |
|---|---|---|---|
| P-01 | P1 | split_data dep references src/pipelines/validation.py, not src/data/splitting.py (possible misname); splitting logic may not trigger DVC invalidation | Open |
| P-02 | P1 | ablation_study lacks dep on test_ids.parquet despite consuming it via make_classification_runs | Open |
| P-03 | P1 | validate_* stages lack deps on src/data_quality/*.py; GE expectation edits do not trigger re-run | Open |
| P-04 | P1 | Hydra conf/ shipped but unused; parallel config sources cause confusion | Open |
| P-05 | P2 | match.parquet declared as DVC out but unused downstream | Open |
| P-06 | P2 | No explicit random seed for HGBT and Optuna study; reproducibility partial | Open |
| P-07 | P2 | batch_inference deps don't include load_data_from_sources explicitly (only via preprocessing); parallel-run ordering not guaranteed | Open |
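
P-01 through P-03 share one pattern: a stage's deps list omits a file whose edits should invalidate that stage. The sketch below is one way such gaps could be checked mechanically. It is an illustration, not part of the audited repository: the helper names (dep_paths, missing_deps), the `required` rule list, and every path in the sample dvc_config except those named in the register are hypothetical. The sample dict only imitates the shape a YAML parser would return for dvc.yaml.

```python
from fnmatch import fnmatchcase


def dep_paths(stage):
    """Return a stage's dep paths; entries may be strings or one-key mappings."""
    paths = []
    for entry in stage.get("deps", []):
        paths.append(entry if isinstance(entry, str) else next(iter(entry)))
    return paths


def missing_deps(dvc_config, required):
    """List (stage, pattern) pairs where a matching stage has no dep matching pattern."""
    gaps = []
    for name, stage in dvc_config.get("stages", {}).items():
        for stage_glob, dep_glob in required:
            if not fnmatchcase(name, stage_glob):
                continue
            if not any(fnmatchcase(p, dep_glob) for p in dep_paths(stage)):
                gaps.append((name, dep_glob))
    return gaps


# Hypothetical excerpt shaped like a parsed dvc.yaml (paths are assumptions).
dvc_config = {
    "stages": {
        "split_data": {"deps": ["src/pipelines/validation.py", "data/features.parquet"]},
        "validate_raw": {"deps": ["src/pipelines/validate_raw.py", "data/raw"]},
        "ablation_study": {"deps": ["src/pipelines/ablation.py", "data/train_ids.parquet"]},
    }
}

# One rule per finding: (stage name glob, dep path glob that must be present).
required = [
    ("split_data", "src/data/splitting.py"),   # P-01
    ("ablation_study", "*test_ids.parquet"),   # P-02
    ("validate_*", "src/data_quality/*.py"),   # P-03
]

gaps = missing_deps(dvc_config, required)
for stage, pattern in gaps:
    print(f"{stage}: no dep matching {pattern}")
```

Run against the real dvc.yaml in CI, a check of this kind would fail the build whenever a rule is violated, so that a silently skipped re-run becomes a visible error.
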
Reproducibility scorecard
| Aspect | Status |
|---|---|
| params.yaml covers all ML params | ✅ |
| DVC tracks deps/outs (with caveats above) | ✅ |
| dvc.lock present | ✅ |
| Random seeds explicit | ⚠️ partial (logreg/sgd=42; HGBT/Dummy/Optuna unset) |
| MLflow tracking via env | ✅ |
| MinIO artifacts via env | ✅ |
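
The partial-seed finding (P-06) could be kept from regressing with a small check over the parsed parameters. The sketch below assumes a layout in which each model has its own section in params.yaml and stores its seed under random_state or seed. The section names, hyperparameter values, and the helper unseeded_sections are all hypothetical; only the seeded/unseeded split mirrors the scorecard row above.

```python
SEED_KEYS = ("random_state", "seed")


def unseeded_sections(params, sections):
    """Return the names in `sections` whose parameters carry no explicit seed."""
    missing = []
    for name in sections:
        section = params.get(name) or {}
        # A seed of 0 is still explicit, so test against None rather than truthiness.
        if not any(section.get(key) is not None for key in SEED_KEYS):
            missing.append(name)
    return missing


# Hypothetical stand-in for the parsed params.yaml (values are assumptions).
params = {
    "logreg": {"C": 1.0, "random_state": 42},
    "sgd": {"alpha": 1e-4, "random_state": 42},
    "hgbt": {"max_iter": 200},
    "dummy": {"strategy": "prior"},
    "optuna": {"n_trials": 50},
}

unseeded = unseeded_sections(params, ["logreg", "sgd", "hgbt", "dummy", "optuna"])
print(unseeded)
```

Closing the gap itself would mean passing the configured seed to each consumer, for example through the random_state argument of scikit-learn estimators and the seed argument of an Optuna sampler such as TPESampler.
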
Summary
| Aspect | Status |
|---|---|
| 15-stage DAG | ✅ |
| GE gates × 3 (raw, finished, features) | ✅ |
| Param coverage in code | ✅ |
| Independent batch_inference branch | ✅ |
| Hydra integration | ❌ scaffold only (P-04) |
| Validation-code deps wired into DVC | ❌ (P-03) |
Recommendation: P-01 and P-03 are correctness risks (silent skipping of necessary re-runs); P-04 should be resolved by either wiring Hydra into entrypoints or removing conf/ to eliminate dead config.
See baseline §1–§4 for code-level detail.