Data Audit Report — SoccerPredictAI
Date: 2026-04-28
Auditor: GitHub Copilot (Claude Opus 4.7) — /skill-ml-system-audit full (audit 01/12)
Scope: Data layer — ingestion, storage, versioning, freshness, validation boundary
Baseline: docs/validation/20260424/01_data_audit.md
Delta vs baseline
No source files under src/data/, src/app/tasks/export.py, src/pipelines/source.py, dvc.yaml, or airflow/dags/etl_* have been modified since 2026-04-26 (verified via find -newer and git log). The working-tree status is identical to that observed in 20260426/00, so the baseline findings remain in force as-is.
Confirmed structure
- Ingestion: WhoScored → Selenoid → Celery worker-api (queue api) → PostgreSQL (match, match_raw) → manual Airflow etl_export_01 → MINIO_BUCKET_DATA_RAW → DVC load_data_from_sources → data/raw/match{,_raw}.parquet.
- Versioning: .minio.json sidecar per object (etag, size, last_modified, bucket, key, ingested_at); tracked by DVC outs.
- Smart-skip: ETag + size comparison in export_data_raw(); idempotent re-runs.
- Validation gates: validate_raw, validate_finished, validate_future, validate_features (GE outputs to data/evaluation/).
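
The smart-skip rule above (ETag + size comparison against the sidecar) can be sketched roughly as follows. This is a minimal illustration, not the code in export_data_raw(): the names ObjectMeta and should_skip_download are hypothetical, and the sidecar is assumed to be the parsed .minio.json content as a dict with etag and size keys.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ObjectMeta:
    """Remote object attributes used for the skip decision (hypothetical type)."""
    etag: str
    size: int


def should_skip_download(remote: ObjectMeta, sidecar: Optional[dict]) -> bool:
    """Return True when the recorded sidecar matches the remote object.

    Hypothetical helper; the project's real logic lives in export_data_raw().
    """
    if sidecar is None:
        return False  # no previous download recorded, so fetch the object

    # Some S3-compatible clients return the ETag wrapped in double quotes,
    # so both sides are normalized before comparison.
    local_etag = str(sidecar.get("etag", "")).strip('"')
    remote_etag = remote.etag.strip('"')
    if not remote_etag:
        return False  # never skip on a missing ETag

    return local_etag == remote_etag and sidecar.get("size") == remote.size
```

Requiring both fields to match keeps re-runs idempotent while still re-downloading when either attribute changes. It does not protect against the multipart ETag caveat recorded as D-04.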
Risk register (re-confirmed)
| ID | Severity | Description | Status |
|---|---|---|---|
| D-01 | P0 | validate_interim expected in contract tests but absent from dvc.yaml (tests/contract/test_pipeline_contracts.py::test_expected_stage_exists should fail) | Open |
| D-02 | P1 | match.parquet declared as DVC out but no downstream stage consumes it as dep — dead artifact | Open |
| D-03 | P1 | etl_export_01 PostgreSQL→MinIO is schedule=None (manual); no staleness alerting | Open |
| D-04 | P1 | MinIO ETag for multipart objects is not a content hash; silent false-positive skip theoretically possible | Open |
| D-05 | P2 | No schema validation at MinIO download time; bad parquet only caught at validate_raw | Open |
| D-06 | P2 | No transactional guarantee across match.parquet + match_raw.parquet downloads | Open |
| D-07 | P2 | No content hash (MD5/SHA) on raw artifacts | Open |
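
One possible mitigation for D-04 and D-07 is sketched below: flag ETags that look like multipart ETags, and record a SHA-256 of each downloaded file. The function names are hypothetical and nothing here reflects existing project code. The pattern assumes the common S3-style multipart form of 32 hex characters followed by a dash and a part count, which should be confirmed against the MinIO deployment in use.

```python
import hashlib
import re
from pathlib import Path

# Assumption: multipart ETags follow the S3-style "<32 hex chars>-<part count>"
# form. A value in that form is not the MD5 of the object content.
_MULTIPART_ETAG = re.compile(r"^[0-9a-fA-F]{32}-\d+$")


def is_multipart_etag(etag: str) -> bool:
    """Return True when the ETag looks like a multipart-upload ETag."""
    return bool(_MULTIPART_ETAG.match(etag.strip('"')))


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Storing the digest as an extra sidecar field would give the raw artifacts a content hash that does not depend on how the object was uploaded.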
Summary
| Aspect | Status |
|---|---|
| Ingestion flow documented | ✅ |
| Smart-skip ETag + size | ✅ |
| Sidecar metadata | ✅ |
| GE validation in pipeline | ✅ |
| Export automation | ❌ manual |
| validate_interim in pipeline | ❌ contract mismatch |
| match.parquet downstream usage | ❌ dead artifact |
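
The contract mismatch behind D-01 comes down to a set difference between the stages the tests expect and the stages defined in dvc.yaml. A minimal sketch is shown below. The missing_stages helper is hypothetical and is not the body of test_expected_stage_exists. The config dict is illustrative input standing in for a parsed dvc.yaml, with the stage names taken from the validation gates listed in this report.

```python
def missing_stages(dvc_config: dict, expected: set) -> list:
    """Return the expected stage names that the parsed dvc.yaml does not define."""
    # dvc.yaml keeps its stage definitions under the top-level "stages" key.
    defined = set((dvc_config.get("stages") or {}).keys())
    return sorted(expected - defined)


# Illustrative input only: in practice this mapping would come from loading
# dvc.yaml with a YAML parser.
config = {
    "stages": {
        "validate_raw": {},
        "validate_finished": {},
        "validate_future": {},
        "validate_features": {},
    }
}

print(missing_stages(config, {"validate_raw", "validate_interim"}))
# prints ['validate_interim']
```

Resolving D-01 then means either adding the stage to dvc.yaml or removing it from the expected set, depending on which side reflects the intended pipeline.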
Recommendation: prioritize D-01 (contract drift — likely already a red CI signal), then D-03 (replace manual export with scheduled DAG + freshness SLO).
See baseline §1–§5 for code-level detail.
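
As a starting point for the freshness SLO suggested for D-03, staleness could be derived from the ingested_at field already stored in each sidecar. The sketch below is hypothetical: it assumes ingested_at is an ISO 8601 timestamp, treats timestamps without an offset as UTC, and leaves the threshold to the caller, since this report does not define an SLO value.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(
    ingested_at: str,
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> bool:
    """Return True when the artifact is older than the allowed maximum age."""
    timestamp = datetime.fromisoformat(ingested_at)
    if timestamp.tzinfo is None:
        # Assumption: timestamps without an offset were written in UTC.
        timestamp = timestamp.replace(tzinfo=timezone.utc)
    current = now if now is not None else datetime.now(timezone.utc)
    return current - timestamp > max_age
```

A scheduled check could call is_stale for each raw artifact and raise an alert on True. The now parameter exists so the check can be tested deterministically.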