System Audit Report: SoccerPredictAI

Date: 2026-04-26
Auditor: GitHub Copilot (Claude Sonnet 4.6), /skill-ml-system-audit full
Scope: High-level system audit covering architecture, flows, contracts, and risks
Method: Code analysis (dvc.yaml, params.yaml, src/, airflow/, docker/, k8s/) plus a diff against the previous cycle
Previous full cycle: 2026-04-24 (00_system_audit_20260424_v2.md plus audits 01–11)

⚡ Changes since 2026-04-24

| Change | Files | System impact |
| --- | --- | --- |
| Copilot configuration added | `.github/copilot-instructions.md` (section 19), `.github/instructions/.instructions.md` (5 files), `.github/prompts/.prompt.md` (2 files), `.github/hooks/`, `.github/skills/` | No production impact. Dev tooling only. |

Conclusion: Production code, the DVC pipeline, FastAPI, Celery, and the Airflow DAGs are unchanged. The risks and contracts recorded on 2026-04-24 remain in force.

1. Architecture diagram (current)

[WhoScored.com]
       │ (Selenoid browser automation)
       ▼
[Celery worker-api] ──► [PostgreSQL]
       ▲                      │
[Airflow DAGs ×5]        [export task]
 @hourly / manual             │
 X-Token via Variable    [MinIO: data-raw/]
                              │
                         [DVC Pipeline: 15 stages]
                         params.yaml / Hydra conf/
                              │
                         [MLflow Tracking]
                              │
                         [MLflow Registry]
                         model: soccer_clf@champion
                              │
                    ┌─────────┴──────────┐
            [Celery worker-ml]    [batch_inference]
            PredictionService      match_features.parquet
            (loaded on init)       → MinIO predictions/
                    │                      │
                    └──────────┬───────────┘
                               ▼
                          [FastAPI]
                     /predict /livescores
                     /monitoring /healthcheck
                               │
                          [Nginx / K8s Ingress]
                               │
                          [Streamlit UI]
                          (external VPS)

2. System layers

| Layer | What is implemented | Code location | Status |
| --- | --- | --- | --- |
| Product / Problem | 1×2 classification, target=`outcome_1x2` | `params.yaml`, `src/models/` | ✅ |
| Data Layer | Scraping, PostgreSQL, MinIO, DVC ingestion | `src/data/`, `src/app/tasks/export.py` | ✅ |
| Feature Layer | rolling stats (5 windows), ELO per tournament, side=diff | `src/features/` | ✅ |
| Model Layer | Baseline, LogReg, HGBT, XGBoost + Optuna + isotonic calibration | `src/models/` | ✅ ⚠️ smoke params |
| Experimentation Layer | MLflow `matches_clf_smoke`, nested runs, metrics, artifacts | `src/pipelines/classification.py` | ✅ |
| Pipeline Layer | DVC 15 stages, GE validation gates, Hydra conf/ | `dvc.yaml`, `src/pipelines/` | ✅ |
| Serving Layer | FastAPI sync/async predict, FeatureLookupService, Prometheus | `src/app/` | ✅ |
| UI Layer | Streamlit, nginx | `src/ui/` | ✅ |
| Orchestration Layer | Airflow 5 DAGs (livescores ×4, export ×1) | `airflow/dags/` | ✅ |
| Ops / Infra | Docker ×10, K8s Helm (single-node), pydantic-settings | `docker/`, `k8s/` | ✅ |
| Testing / Validation | unit, property, service, contract, load (Locust), GE | `tests/` | ✅ 🚧 live integration |
| Documentation | MkDocs, ADRs, runbooks, status.md, validation audits | `docs/` | ✅ |

3. Key flows

3.1 Data flow

WhoScored → Selenoid → Celery worker-api → PostgreSQL
                                         → Celery export → MinIO data-raw/
                                                               │
DVC load_data_from_sources (always_changed: true)  ←──────────┘
  → validate_raw (GE) → preprocessing → validate_finished/future
  → feature_engineering → validate_features → split_data

Formats: .parquet throughout, .json for metadata and GE reports.

3.2 Feature flow

finished.parquet
  → stats_matches.py  (win/draw/loss/goals_for/goals_against × windows [1,2,3,5,10])
  → elo.py            (per tournamentId, k=32, initial=1500, home_adv=50)
  → features.parquet + features_meta.parquet  ← single contract

batch_inference (independent branch):
  future.parquet + finished.parquet + features_meta
  → inference.py → match_features.parquet → MinIO → FeatureLookupService
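
The ELO step above can be illustrated with a minimal sketch. It uses the parameters listed in the flow (k=32, initial=1500, home_adv=50) and keeps a separate rating per tournament, but the update rule is the standard logistic Elo formula, which is an assumption here: the report does not show what `elo.py` actually computes. The function names and the input tuple layout are hypothetical.

```python
from collections import defaultdict

K = 32.0          # k from the feature flow above
INITIAL = 1500.0  # initial rating
HOME_ADV = 50.0   # home advantage added to the home side's rating


def expected_home(r_home: float, r_away: float) -> float:
    """Standard Elo expected score for the home side, including home advantage."""
    return 1.0 / (1.0 + 10 ** ((r_away - (r_home + HOME_ADV)) / 400.0))


def elo_features(matches):
    """Return the (home, away) ratings as they stood before each match.

    `matches` is an iterable of (tournament_id, home, away, score) tuples in
    chronological order; score is 1.0 / 0.5 / 0.0 from the home side's view.
    Ratings are emitted before the update, so no match sees its own result.
    """
    ratings = defaultdict(lambda: INITIAL)  # keyed by (tournament_id, team)
    out = []
    for tid, home, away, score in matches:
        rh, ra = ratings[(tid, home)], ratings[(tid, away)]
        out.append((rh, ra))
        delta = K * (score - expected_home(rh, ra))
        ratings[(tid, home)] = rh + delta
        ratings[(tid, away)] = ra - delta
    return out
```

Emitting the pre-match ratings is the design choice that matters for training: a feature computed after the update would leak the match outcome into its own feature row.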

3.3 Model flow

dataset + splits + features_meta
  → classification_models (screening, fracs=[0.001, 0.002])  ⚠️ smoke
  → tune_xgb (n_trials=2, frac=0.1)                          ⚠️ smoke
  → final_train (isotonic calibration, calib_frac=0.15)
  → MLflow Registry soccer_clf@champion
  → worker_process_init → PredictionService (in-memory, no hot-reload)
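
The last step above loads the model once at worker start, with no hot-reload. One possible shape for a reload check is sketched below. It is not the project's code: `ReloadingModelHolder` and its two injected callables are hypothetical stand-ins for "resolve `soccer_clf@champion` to a version" and "load that version from the registry", and no MLflow call is shown.

```python
from typing import Any, Callable, Optional


class ReloadingModelHolder:
    """Keeps a model in memory and swaps it when the registry version changes."""

    def __init__(self, get_version: Callable[[], str],
                 load_model: Callable[[str], Any]) -> None:
        self._get_version = get_version  # hypothetical: resolves the champion alias
        self._load_model = load_model    # hypothetical: loads a given version
        self.version: Optional[str] = None
        self.model: Any = None

    def refresh(self) -> bool:
        """Reload if the champion version changed; return True when a swap happened."""
        latest = self._get_version()
        if latest == self.version:
            return False
        # Load before assigning, so a failed load leaves the old model in place.
        new_model = self._load_model(latest)
        self.model, self.version = new_model, latest
        return True
```

A worker could call `refresh()` on a timer or before each batch of requests; which trigger fits depends on how expensive the version lookup is.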

3.4 Execution flow

Manual/CI:  dvc repro → register_model → manual restart of worker-ml
Production ETL:  Airflow @hourly → PATCH /sources/livescores/ (X-Token) → Celery api
Inference sync:  UI → POST /predict/ → Celery ml (timeout=30s) → PredictionService
Inference async: UI → POST /predict/async/ → task_id → Redis → polling /monitoring/task_status/
Batch lookup:    UI → GET /predict/{match_id} → FeatureLookupService → match_features.parquet
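
The async inference path above (submit, receive a `task_id`, poll for status) can be sketched from the client side as follows. The HTTP call is injected as a hypothetical `fetch_status` callable so the loop stays testable without a running API. The `"status"` key is an assumption about the response shape; `SUCCESS` and `FAILURE` are the standard Celery terminal state names.

```python
import time
from typing import Callable, Optional


def poll_task(fetch_status: Callable[[str], dict], task_id: str,
              timeout_s: float = 30.0, interval_s: float = 0.5,
              sleep: Callable[[float], None] = time.sleep) -> Optional[dict]:
    """Poll until the task reaches a terminal state or the deadline passes.

    `fetch_status` is a hypothetical wrapper around the task status endpoint.
    Returns the final payload, or None if the deadline is reached first.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        payload = fetch_status(task_id)
        if payload.get("status") in {"SUCCESS", "FAILURE"}:
            return payload
        sleep(interval_s)
    return None  # the caller decides how to handle a timeout
```

Returning `None` on timeout rather than raising keeps the decision with the caller, which for a UI usually means showing a "still running" state instead of an error.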

4. Contracts

| Type | Contract | Source of truth |
| --- | --- | --- |
| Feature contract | feature names and types | `features_meta.parquet` |
| Model contract | sklearn Pipeline + XGBoost, `mlflow.sklearn`, input=features_meta, output=proba[0,1,2] | MLflow Registry `soccer_clf@champion` |
| API contract | Pydantic schemas, auto-generated OpenAPI | `src/app/schemas/predict.py` |
| Data contract | `.parquet` at every boundary, versioning via `.minio.json` | DVC outs + `.minio.json` |
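
To make the feature contract concrete, the sketch below checks a single feature row against an expected name-to-type mapping. The mapping is a stand-in for whatever `features_meta.parquet` stores (its schema is not shown in this report), and the function name and the feature names in the usage example are hypothetical.

```python
def check_feature_contract(expected: dict, row: dict) -> list:
    """Return a list of human-readable violations; empty means the row conforms.

    `expected` maps feature name -> Python type (an assumed, simplified view
    of the feature contract). `row` is one feature row as a plain dict.
    """
    problems = []
    for name, typ in expected.items():
        if name not in row:
            problems.append(f"missing feature: {name}")
        elif not isinstance(row[name], typ):
            problems.append(
                f"{name}: expected {typ.__name__}, got {type(row[name]).__name__}"
            )
    for name in row:
        if name not in expected:
            problems.append(f"unexpected feature: {name}")
    return problems
```

For example, `check_feature_contract({"elo_diff": float}, {"elo_diff": "12.5"})` reports a type mismatch instead of letting a string reach the model.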

5. Potential risks (delta: unchanged since 2026-04-24)

| ID | Risk | Severity | Status |
| --- | --- | --- | --- |
| R1 | `params.yaml` is in smoke mode (`n_trials=2`, `fracs=[0.001, 0.002]`) | 🔴 HIGH | Unresolved |
| R2 | The DVC pipeline is run manually; there is no auto-trigger after ingestion | 🔴 HIGH | Unresolved |
| R3 | No model hot-reload in serving; the worker must be restarted after `register_model` | 🔴 HIGH | Unresolved |
| R4 | The `stats.py` router exists but is not registered in `main.py` (dead endpoint) | 🟡 MEDIUM | Unresolved |
| R5 | `batch_inference` staleness: serving can run on stale features with no signal | 🔴 HIGH | Unresolved |
| R6 | No metric gate before `champion` promotion; a degraded model can become champion | 🔴 HIGH | Unresolved |
| R7 | No drift detection (Evidently is not integrated) | 🟡 MEDIUM | Unresolved |
| R8 | Single-node K8s, no HA for PostgreSQL or MinIO | 🟡 MEDIUM | Unresolved (known) |
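
As an illustration of what closing R6 could look like, here is a minimal sketch of a metric gate evaluated before promotion. The function name, the `log_loss` metric, and the tolerance parameter are assumptions; the report does not say which metric the project would gate on.

```python
from typing import Optional


def should_promote(candidate: dict, champion: Optional[dict],
                   metric: str = "log_loss", higher_is_better: bool = False,
                   tolerance: float = 0.0) -> bool:
    """Gate for champion promotion: True only if the candidate is not worse.

    With no existing champion the candidate is promoted. `tolerance` is the
    amount of degradation that is still accepted.
    """
    if champion is None:
        return True
    cand, champ = candidate[metric], champion[metric]
    if higher_is_better:
        return cand >= champ - tolerance
    return cand <= champ + tolerance
```

A registration step could call this with the candidate's and the current champion's evaluation metrics and skip the alias update when it returns `False`.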

6. Component table

| Component | Layer | Responsibility | Inputs | Outputs |
| --- | --- | --- | --- | --- |
| `src/data/source.py` | Data | MinIO download | MinIO bucket | `match.parquet`, `match_raw.parquet` |
| `src/data/preprocess.py` | Data | Outlier removal, split finished/future | `match_raw.parquet` | `finished.parquet`, `future.parquet` |
| `src/features/stats_matches.py` | Features | Rolling match statistics | `finished.parquet` | feature columns |
| `src/features/elo.py` | Features | ELO ratings per tournament | `finished.parquet` | elo columns |
| `src/pipelines/inference.py` | Pipeline | Batch feature computation for future matches | `future.parquet`, `finished.parquet`, `features_meta.parquet` | `match_features.parquet` |
| `src/models/classification.py` | Model | Model screening, CV training | `dataset.parquet`, `folds.parquet`, `features_meta.parquet` | MLflow runs |
| `src/models/final_train.py` | Model | Final training + calibration | `dataset.parquet`, tuned params | MLflow model artifact |
| `src/pipelines/register_model.py` | Pipeline | MLflow registry promotion | `final_run_id.json` | `soccer_clf@champion` |
| `src/app/tasks/predict.py` | Serving | Celery ML task, model inference | features dict | probabilities |
| `src/app/services/predict.py` | Serving | Feature lookup from parquet/MinIO | `match_id` | feature row |
| `airflow/dags/etl_livescores_*.py` | Orchestration | Scheduled livescores ETL | none | PostgreSQL rows |
| `airflow/dags/etl_export_01.py` | Orchestration | Manual data export | none | MinIO parquet |

7. Overall assessment

System maturity: MEDIUM

Strengths:
- Complete end-to-end MLOps stack (DVC, MLflow, Celery, Airflow, Prometheus, K8s)
- features_meta.parquet as the single feature contract
- GE validation gates at 3 levels (raw, finished, features)
- Clear separation of batch_inference and training (independent DVC branches)
- Prometheus metrics for inference latency, confidence, and model info
- Detailed documentation and audit trail

Main risks:
- params.yaml is in smoke mode (R1): a production deploy would ship an undertrained model
- No auto-trigger of dvc repro after ingestion (R2)
- No model hot-reload (R3): a deploy requires a manual restart of worker-ml
- No metric gate before champion promotion (R6)

Open questions:
- The current champion in the MLflow Registry and its metrics (requires 05_mlflow_registry_audit)
- The actual refresh frequency of batch_inference (requires 08_orchestration_audit)

8. References to the previous full cycle

The detailed audits from 2026-04-24 remain valid (production code has not changed):

| Audit | File |
| --- | --- |
| 01 Data | `report/01_data_audit_20260424.md` |
| 02 Features | `report/02_feature_audit_20260424.md` |
| 03 Training | `report/03_training_evaluation_audit_20260424.md` |
| 04 DVC Pipeline | `report/04_pipeline_dvc_hydra_audit_20260424.md` |
| 05 MLflow | `report/05_mlflow_registry_audit_20260424.md` |
| 06 Train/Serve | `report/06_train_serve_consistency_audit_20260424.md` |
| 07 Serving | `report/07_serving_audit_20260424.md` |
| 08 Orchestration | `report/08_orchestration_audit_20260424.md` |
| 09 UI | `report/09_ui_audit_20260424.md` |
| 10 Ops/Security | `report/10_ops_security_observability_audit_20260424.md` |
| 11 Docs/Testing | `report/11_docs_testing_audit_20260424.md` |

The next full cycle (01–11) is recommended after the next production change in src/, dvc.yaml, or airflow/.