Pipeline DVC + Hydra Audit Report — SoccerPredictAI

Date: 2026-04-28
Auditor: GitHub Copilot (Claude Opus 4.7), /skill-ml-system-audit full (audit 04/12)
Scope: DVC DAG correctness, params coverage, reproducibility, Hydra integration
Baseline: docs/validation/20260424/04_pipeline_dvc_hydra_audit.md


Delta vs baseline

dvc.yaml, params.yaml, conf/, and src/pipelines/** not modified since 2026-04-26. Baseline findings remain in force.


Confirmed structure

  • 15 DVC stages (load_data_from_sources → validate_raw → export_metadata → preprocessing → validate_finished/future → feature_engineering → validate_features → split_data → batch_inference (independent) → classification_models → ablation_study (parallel, isolated outs) → tune_xgb → final_train → register_model).
  • load_data_from_sources: always_changed: true (intentional MinIO polling).
  • dvc.lock present; all declared params reachable in code; no dead params.
  • Hydra conf/ exists but is not wired — pipeline entrypoints call load_params() directly on params.yaml.

Risk register (re-confirmed)

| ID | Severity | Description | Status |
| P-01 | P1 | split_data dep references src/pipelines/validation.py, not src/data/splitting.py (possibly a misnamed path); edits to the splitting logic may not trigger DVC invalidation | Open |
| P-02 | P1 | ablation_study lacks a dep on test_ids.parquet despite consuming it via make_classification_runs | Open |
| P-03 | P1 | validate_* stages lack deps on src/data_quality/*.py; GE expectation edits do not trigger a re-run | Open |
| P-04 | P1 | Hydra conf/ shipped but unused; parallel config sources cause confusion | Open |
| P-05 | P2 | match.parquet declared as a DVC out but unused downstream | Open |
| P-06 | P2 | No explicit random seed for HGBT and the Optuna study; reproducibility is partial | Open |
| P-07 | P2 | batch_inference deps do not include load_data_from_sources explicitly (only via preprocessing); parallel-run ordering is not guaranteed | Open |
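
P-01 and P-03 are both cases of a stage not declaring a source file it actually relies on. A minimal sketch of how such gaps could be flagged automatically is shown here. The stage names come from this report, but the dep lists, the REQUIRED_DEPS mapping, and the helper names are hypothetical and would need to be replaced with the real contents of dvc.yaml (for example, loaded with a YAML parser).

```python
from fnmatch import fnmatch

# Hypothetical excerpt of a parsed dvc.yaml. Stage names are taken from the
# report; the dep lists are illustrative only, not the project's actual deps.
DVC_STAGES = {
    "split_data": {"deps": ["src/pipelines/validation.py", "data/features.parquet"]},
    "validate_raw": {"deps": ["src/pipelines/validate_raw.py", "data/raw"]},
}

# Assumed mapping of stage -> source paths the stage should depend on,
# derived from findings P-01 and P-03.
REQUIRED_DEPS = {
    "split_data": ["src/data/splitting.py"],
    "validate_raw": ["src/data_quality/*.py"],
}


def covers(dep, pattern):
    """True if a declared dep satisfies a required path pattern.

    A dep covers the pattern if it matches it directly, or if the dep is a
    directory that contains the required path.
    """
    dep = dep.rstrip("/")
    return fnmatch(dep, pattern) or pattern.startswith(dep + "/")


def missing_deps(stages, required):
    """Return {stage: [required patterns that no declared dep covers]}."""
    report = {}
    for stage, patterns in required.items():
        declared = stages.get(stage, {}).get("deps", [])
        gaps = [p for p in patterns if not any(covers(d, p) for d in declared)]
        if gaps:
            report[stage] = gaps
    return report


if __name__ == "__main__":
    for stage, gaps in missing_deps(DVC_STAGES, REQUIRED_DEPS).items():
        print(f"{stage}: missing deps {gaps}")
```

A check of this kind could run in CI so that a missing dep fails the build instead of silently skipping a re-run. Declaring the whole directory (for example src/data_quality) as a dep would also satisfy the pattern in this sketch.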

Reproducibility scorecard

| Aspect | Status |
| params.yaml covers all ML params | |
| DVC tracks deps/outs (with caveats above) | |
| dvc.lock present | |
| Random seeds explicit | ⚠️ partial (logreg/sgd=42; HGBT/Dummy/Optuna unset) |
| MLflow tracking via env | |
| MinIO artifacts via env | |
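
The partial seed coverage (P-06) could be closed by routing one seed from params.yaml into every stochastic component. The sketch below assumes that HGBT refers to scikit-learn's HistGradientBoostingClassifier, that the seed lives under a hypothetical base.random_seed key, and that the Optuna study maximizes its objective. None of these are confirmed by the audited code, and the helper names are illustrative.

```python
def resolve_seed(params, default=42):
    """Read one global seed from a params mapping.

    The key path base.random_seed is an assumption, not the project's
    actual params.yaml layout.
    """
    return int(params.get("base", {}).get("random_seed", default))


def make_hgbt(seed):
    # Imported lazily so the seed helper stays usable without scikit-learn.
    from sklearn.ensemble import HistGradientBoostingClassifier

    return HistGradientBoostingClassifier(random_state=seed)


def make_dummy(seed):
    from sklearn.dummy import DummyClassifier

    # random_state only affects the "stratified" and "uniform" strategies;
    # the other strategies are deterministic regardless of the seed.
    return DummyClassifier(strategy="stratified", random_state=seed)


def make_study(seed):
    import optuna

    # Seeding the sampler makes trial suggestions repeatable for
    # sequential (single-worker) optimization.
    sampler = optuna.samplers.TPESampler(seed=seed)
    return optuna.create_study(direction="maximize", sampler=sampler)


if __name__ == "__main__":
    seed = resolve_seed({"base": {"random_seed": 42}})
    print(f"seed={seed}")
```

Keeping the seed in params.yaml means a change to it is tracked by DVC like any other parameter, provided the stage lists that key under its params.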

Summary

| Aspect | Status |
| 15-stage DAG | |
| GE gates × 3 (raw, finished, features) | |
| Param coverage in code | |
| Independent batch_inference branch | |
| Hydra integration | ❌ scaffold only (P-04) |
| Validation-code deps wired into DVC | ❌ (P-03) |

Recommendation: P-01 and P-03 are correctness risks (silent skipping of necessary re-runs); P-04 should be resolved by either wiring Hydra into entrypoints or removing conf/ to eliminate dead config.

See baseline §1–§4 for code-level detail.