Data Audit Report — SoccerPredictAI

Date: 2026-04-28
Auditor: GitHub Copilot (Claude Opus 4.7), /skill-ml-system-audit full (audit 01/12)
Scope: Data layer (ingestion, storage, versioning, freshness, validation boundary)
Baseline: docs/validation/20260424/01_data_audit.md


Delta vs baseline

No source files under src/data/, src/app/tasks/export.py, src/pipelines/source.py, dvc.yaml, or airflow/dags/etl_* have been modified since 2026-04-26 (verified via find -newer and git log). The working-tree status is identical to that observed in 20260426/00, so the baseline findings remain in force as-is.


Confirmed structure

  • Ingestion: WhoScored → Selenoid → Celery worker-api (queue api) → PostgreSQL (match, match_raw) → manual Airflow etl_export_01 → MINIO_BUCKET_DATA_RAW → DVC load_data_from_sources → data/raw/match{,_raw}.parquet.
  • Versioning: .minio.json sidecar per object (etag, size, last_modified, bucket, key, ingested_at); tracked by DVC outs.
  • Smart-skip: ETag + size comparison in export_data_raw(); idempotent re-runs.
  • Validation gates: validate_raw, validate_finished, validate_future, validate_features (GE outputs to data/evaluation/).
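
The smart-skip rule listed above can be sketched as a pure comparison between the stored sidecar and the remote object's current metadata. This is a minimal illustration, not the code in export_data_raw(): the helper name should_skip and the dict shapes are assumptions, and only the field names etag and size come from the sidecar description.

```python
from typing import Mapping, Optional


def should_skip(sidecar: Optional[Mapping[str, object]],
                remote: Mapping[str, object]) -> bool:
    """Return True when the local copy can be reused (hypothetical helper).

    `sidecar` is the parsed .minio.json content, or None if no sidecar exists.
    `remote` holds the object's current metadata as reported by the bucket.
    """
    if sidecar is None:
        return False  # nothing downloaded yet, so a download is required
    # ETags are often returned wrapped in double quotes; normalise before comparing.
    local_etag = str(sidecar.get("etag", "")).strip('"')
    remote_etag = str(remote.get("etag", "")).strip('"')
    if not local_etag or not remote_etag:
        return False  # missing ETag: be conservative and re-download
    return local_etag == remote_etag and sidecar.get("size") == remote.get("size")
```

Keeping the decision free of I/O makes the idempotency claim easy to cover with unit tests: the same sidecar and remote metadata always produce the same answer.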

Risk register (re-confirmed)

| ID | Severity | Description | Status |
|---|---|---|---|
| D-01 | P0 | validate_interim expected in contract tests but absent from dvc.yaml (tests/contract/test_pipeline_contracts.py::test_expected_stage_exists should fail) | Open |
| D-02 | P1 | match.parquet declared as DVC out but no downstream stage consumes it as dep (dead artifact) | Open |
| D-03 | P1 | etl_export_01 PostgreSQL→MinIO is schedule=None (manual); no staleness alerting | Open |
| D-04 | P1 | MinIO ETag for multipart objects is not a content hash; silent false-positive skip theoretically possible | Open |
| D-05 | P2 | No schema validation at MinIO download time; bad parquet only caught at validate_raw | Open |
| D-06 | P2 | No transactional guarantee across match.parquet + match_raw.parquet downloads | Open |
| D-07 | P2 | No content hash (MD5/SHA) on raw artifacts | Open |
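
For D-04 and D-07, one possible mitigation is to record a real content hash next to the ETag and to flag ETags that come from multipart uploads. The sketch below is illustrative only: the helper names are hypothetical, and it assumes the S3-style convention in which a multipart ETag has the form of 32 hex characters, a hyphen, and the part count.

```python
import hashlib
import re
from pathlib import Path

# Assumed S3-style convention: multipart ETags look like "<32 hex chars>-<part count>".
_MULTIPART_ETAG = re.compile(r"^[0-9a-fA-F]{32}-\d+$")


def is_multipart_etag(etag: str) -> bool:
    """True when the ETag has the multipart shape and is therefore not a plain MD5."""
    return bool(_MULTIPART_ETAG.match(etag.strip('"')))


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large parquet files are never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Storing the digest in the sidecar (for example under a new, hypothetical sha256 field) would let a later run compare content directly instead of relying on ETag + size alone.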

Summary

| Aspect | Status |
|---|---|
| Ingestion flow documented | ✅ |
| Smart-skip ETag + size | ✅ |
| Sidecar metadata | ✅ |
| GE validation in pipeline | ✅ |
| Export automation | ❌ manual |
| validate_interim in pipeline | ❌ contract mismatch |
| match.parquet downstream usage | ❌ dead artifact |
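
The two pipeline-shape findings (D-01 and D-02) lend themselves to a mechanical check over the parsed dvc.yaml. The sketch below assumes the file has already been loaded into a dict (for example with a YAML parser) and relies only on the stages, deps, and outs keys; the function names are hypothetical and this is not the existing contract test.

```python
from typing import Any, Iterable, Iterator, List, Mapping


def _paths(entries: Any) -> Iterator[str]:
    # deps/outs entries are either plain strings or single-key mappings {path: options}.
    for entry in entries or []:
        yield entry if isinstance(entry, str) else next(iter(entry))


def missing_stages(dvc_config: Mapping[str, Any], expected: Iterable[str]) -> List[str]:
    """Stage names the contract expects but dvc.yaml does not define."""
    stages = dvc_config.get("stages") or {}
    return sorted(set(expected) - set(stages))


def unconsumed_outs(dvc_config: Mapping[str, Any]) -> List[str]:
    """Outputs that no stage lists as a dependency (exact path match only)."""
    stages = dvc_config.get("stages") or {}
    deps = {p for s in stages.values() for p in _paths(s.get("deps"))}
    outs = {p for s in stages.values() for p in _paths(s.get("outs"))}
    return sorted(outs - deps)
```

Note that unconsumed_outs also reports legitimate terminal outputs such as reports or models, and it does not resolve directory-level deps, so its result is a candidate list for review rather than a verdict.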

Recommendation: prioritize D-01 (contract drift — likely already a red CI signal), then D-03 (replace manual export with scheduled DAG + freshness SLO).
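
A freshness SLO for D-03 could be evaluated from the sidecar's ingested_at field. The following is a hedged sketch: the helper name is_stale is hypothetical, the timestamp is assumed to be ISO 8601, naive timestamps are assumed to be UTC, and any threshold passed in is illustrative rather than a value taken from this audit.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(ingested_at: str, max_age: timedelta,
             now: Optional[datetime] = None) -> bool:
    """Return True when the artifact is older than the allowed age."""
    ts = datetime.fromisoformat(ingested_at)
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)  # assumption: naive timestamps are UTC
    current = now or datetime.now(timezone.utc)
    return current - ts > max_age
```

A scheduled DAG task or a DVC pre-stage check could call this and raise an alert, which would cover the "no staleness alerting" gap even before the export itself is automated.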

See baseline §1–§5 for code-level detail.