Serving Audit Report: SoccerPredictAI

Date: 2026-04-24
Auditor: GitHub Copilot (Claude Sonnet 4.6)
Scope: FastAPI endpoints, model loading, Celery async, batch serving, error handling
Method: analysis of src/app/routers/, src/app/services/predict.py, src/app/tasks/predict.py, src/app/schemas/predict.py, src/app/worker_ml.py

1. API Endpoints Audit

1.1 Endpoint inventory

Endpoint	Method	Purpose	Sync/Async	Auth
`GET /predict/matches/`	GET	List of upcoming matches from batch inference	Sync	❌ (none)
`POST /predict/`	POST	Prediction from inline features	Sync (Celery 30s timeout)	❌
`GET /predict/{match_id}`	GET	Prediction by ID (lookup)	Sync (Celery 30s timeout)	❌
`POST /predict/async/`	POST	Async prediction → task_id	Async	❌
`GET /predict/model/info`	GET	MLflow model metadata	Sync (Celery)	❌
`GET /monitoring/task_status/{task_id}`	GET	Celery task status	Sync	❌
`GET /monitoring/celery/queues`	GET	Queue statistics	Sync	❌
`GET /monitoring/celery/workers`	GET	Worker statistics	Sync	❌
`GET /livescores/`	GET	Livescores from PostgreSQL	Sync	❌
`PATCH /sources/livescores/`	PATCH	Trigger scraping	Async (Celery)	✅ X-Token
`GET /sources/export/{table}`	GET	Export table → MinIO	Async (Celery)	✅ X-Token
`GET /healthcheck/`	GET	Liveness check	Sync	❌
`GET /metrics`	GET	Prometheus scrape	Sync	❌

⚠️ P1 Security: the /predict/* endpoints require no authentication, so anyone can request predictions. This may be intentional (a public service), but it is not documented as a decision.
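If the decision is to protect /predict/*, the X-Token scheme already used on /sources/* could be reused. The following is a minimal, framework-independent sketch of the token comparison only; the function name `is_authorized` and the way the expected token is supplied are assumptions for illustration, not the project's actual code.

```python
import hmac
from typing import Optional


def is_authorized(presented_token: Optional[str], expected_token: str) -> bool:
    """Return True only when the presented X-Token matches the configured one.

    Hypothetical helper: in a FastAPI app this would sit behind a dependency
    that reads the X-Token header and raises 401/403 on a False result.
    """
    if not presented_token or not expected_token:
        # Missing header or unconfigured secret: fail closed.
        return False
    # compare_digest takes time independent of where the strings differ,
    # which avoids leaking the match position through response timing.
    return hmac.compare_digest(presented_token.encode(), expected_token.encode())
```

Failing closed on an empty configured secret means a misconfigured deployment rejects requests instead of silently accepting any token.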

1.2 Request / Response schemas

Endpoint	Request schema	Response schema	Validated	Discrepancies
`POST /predict/`	`PredictRequest`	`PredictResponse`	✅ Pydantic	none
`GET /predict/{match_id}`	path param `int`	`PredictResponse`	✅	none
`POST /predict/async/`	`AsyncPredictRequest` (match_id: int)	`AsyncPredictResponse`	✅	none
`GET /predict/model/info`	n/a	`ModelInfoResponse`	✅	none
`GET /predict/matches/`	n/a	`list[dict]`	⚠️	No Pydantic response schema

⚠️ P2: GET /predict/matches/ returns list[dict] without a Pydantic response model, so the response is not schema-validated.

1.3 Error responses

Scenario	HTTP code	Implemented
match_id not found	404	✅
ML worker timeout (30s)	504	✅
ML inference error	500	✅
Invalid request body	422 (Pydantic)	✅
MLflow unavailable	500 (after retries)	✅
Celery unavailable	none	⚠️ returns 500, no explicit 503
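The table above amounts to a mapping from failure type to HTTP status, with broker unavailability as the missing case. A minimal sketch of such a mapping follows; `MatchNotFound` and `BrokerUnavailable` are hypothetical stand-ins (in a Celery application a broker connection failure typically surfaces as `kombu.exceptions.OperationalError`), and the function name is an assumption.

```python
class MatchNotFound(Exception):
    """Hypothetical stand-in for 'match_id not found'."""


class BrokerUnavailable(Exception):
    """Hypothetical stand-in for a Celery broker connection failure."""


def http_status_for(exc: Exception) -> int:
    """Map a failure raised while serving a prediction to an HTTP status code."""
    if isinstance(exc, MatchNotFound):
        return 404  # unknown match_id
    if isinstance(exc, BrokerUnavailable):
        return 503  # dependency is down; the client may retry later
    if isinstance(exc, TimeoutError):
        return 504  # ML worker did not answer within the sync timeout
    return 500      # any other inference error
```

Returning 503 instead of 500 lets clients and load balancers distinguish a temporarily unavailable dependency from a defect in the service itself.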

2. Model Loading Audit

2.1 Loading mechanism

# worker_process_init signal → PredictionService.load() → mlflow.pyfunc.load_model()
model_uri = f"models:/{self._model_name}@{self._model_stage}"
# = "models:/soccer_clf@champion"

Aspect	Status
Model source	MLflow Registry via the `champion` alias ✅
Hardcoded path	❌ none
Lazy loading with double-checked locking	✅ thread-safe
Cold-start avoidance (worker_process_init)	✅
Fallback on cold init outside a worker	✅ lazy fallback with a warning

2.2 Reload problem

⚠️ P1: the model is loaded once, when the worker process starts. When the champion alias is updated in the MLflow Registry, the worker does not reload the model. A manual restart of the Celery workers is required.
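One possible remedy is to re-resolve the alias on an interval and reload only when it points at a new version. The sketch below is an illustration under assumptions, not the project's code: `AliasWatcher`, `resolve_version`, and `load_model` are hypothetical names, and in a real worker `resolve_version` might wrap `MlflowClient.get_model_version_by_alias` while `load_model` wraps `mlflow.pyfunc.load_model`.

```python
import time
from typing import Any, Callable, Optional


class AliasWatcher:
    """Reloads a model when the registry alias resolves to a new version."""

    def __init__(
        self,
        resolve_version: Callable[[], str],
        load_model: Callable[[str], Any],
        check_interval: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolve_version = resolve_version
        self._load_model = load_model
        self._check_interval = check_interval
        self._clock = clock
        self._version: Optional[str] = None
        self._model: Any = None
        self._next_check = 0.0

    def get(self) -> Any:
        now = self._clock()
        # Query the registry on first use, then at most once per interval.
        if self._version is None or now >= self._next_check:
            self._next_check = now + self._check_interval
            version = self._resolve_version()
            if version != self._version:
                self._model = self._load_model(version)
                self._version = version
        return self._model
```

The sketch is not thread-safe on its own; in a multi-threaded worker it would need the same locking that the existing lazy loader uses.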

2.3 pyfunc predict_proba fallback

raw = model.predict(df)
if hasattr(raw, "ndim") and raw.ndim == 2:
    return raw  # Already (N, 3) probabilities

# Fallback: 1-D label output → try sklearn predict_proba

✅ A robust fallback across different MLflow flavours.

3. Batch Inference Serving

Batch artifacts path

DVC batch_inference → data/predictions/match_features.parquet
    → (optional) upload to MinIO predictions/match_features.parquet
    ↓
FeatureLookupService._load()
    ├── local file if it exists (dev)
    └── MinIO (production K8s)

Caching strategy

# MinIO re-check interval: FEATURE_CACHE_CHECK_INTERVAL = 60s (default)
# Cache invalidation: MinIO LastModified changed → reload

Aspect	Status
Local file cache	✅ by mtime
MinIO stale detection	✅ by LastModified with 60s polling
Graceful degraded mode	✅ stale cache + warning when MinIO is unavailable
Thread-safety	⚠️ no lock around the MinIO reload, so concurrent requests can trigger a double load
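The double-load risk in the last row can be closed by re-checking the staleness condition under a lock. A minimal sketch, assuming hypothetical names throughout: `LockedFeatureCache`, `get_last_modified` (which would read the MinIO object's LastModified), and `download` (which would fetch and parse the parquet file) are illustrative and not taken from `FeatureLookupService`.

```python
import threading
from typing import Any, Callable, Optional


class LockedFeatureCache:
    """Reloads cached features at most once per LastModified change."""

    def __init__(
        self,
        get_last_modified: Callable[[], float],
        download: Callable[[], Any],
    ) -> None:
        self._get_last_modified = get_last_modified
        self._download = download
        self._lock = threading.Lock()
        self._loaded_stamp: Optional[float] = None
        self._data: Any = None

    def get(self) -> Any:
        stamp = self._get_last_modified()
        if stamp != self._loaded_stamp:          # fast path: no lock when fresh
            with self._lock:
                if stamp != self._loaded_stamp:  # re-check: another thread may have reloaded
                    self._data = self._download()
                    self._loaded_stamp = stamp
        return self._data
```

Threads that arrive while a reload is in progress block on the lock, then find the stamp already updated and return the fresh data without downloading again.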

Redis prediction cache

Aspect	Status
Cache key	`predict:{match_id}:{run_id}`
TTL	`PREDICTION_CACHE_TTL` env (default 3600s)
Redis unavailable	graceful degradation (caching disabled)
Cache hit indicator	`cached: true` in the response
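The behaviour in the table can be sketched as a read-through cache that treats a cache failure as a miss. This is an illustration under assumptions: `cached_predict`, `cache_get`, and `cache_set` are hypothetical names, and the builtin `ConnectionError` stands in for whatever exception the real Redis client raises.

```python
from typing import Any, Callable, Dict, Optional


def prediction_cache_key(match_id: int, run_id: str) -> str:
    """Build the key in the documented format predict:{match_id}:{run_id}."""
    return f"predict:{match_id}:{run_id}"


def cached_predict(
    match_id: int,
    run_id: str,
    predict: Callable[[int], Dict[str, Any]],
    cache_get: Callable[[str], Optional[Dict[str, Any]]],
    cache_set: Callable[[str, Dict[str, Any], int], None],
    ttl: int = 3600,
) -> Dict[str, Any]:
    key = prediction_cache_key(match_id, run_id)
    try:
        hit = cache_get(key)
    except ConnectionError:
        hit = None  # cache is down: behave as if nothing was cached
    if hit is not None:
        return {**hit, "cached": True}
    result = predict(match_id)
    try:
        cache_set(key, result, ttl)
    except ConnectionError:
        pass  # degraded mode: serve the prediction without caching it
    return {**result, "cached": False}
```

Because `run_id` is part of the key, predictions cached under a previous model run are never returned for a new one; they simply expire with the TTL.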

4. Celery Task Audit

predict_match task

Aspect	Value
Queue	`ml`
max_retries	2
default_retry_delay	10s
Idempotency	✅ (same input → same output; Redis cache)
State updates	✅ `PROGRESS` state
task_time_limit	3600s
task_acks_late	✅ (the ack is sent after the task has run, not on receipt)
task_reject_on_worker_lost	✅ (the task is requeued if the worker is lost)

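⚠️ P2: with max_retries=2 and default_retry_delay=10s, the total retry delay is 20s. That window alone fits inside the 30s sync timeout in FastAPI, but once the execution time of each attempt is added, the client may already have received a 504 while the task is still retrying, and the last retry may still run afterwards.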
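The timing can be checked with simple arithmetic. The sketch below assumes every attempt fails after the same execution time and that each retry waits exactly `default_retry_delay`; the function name is hypothetical.

```python
def worst_case_task_seconds(exec_seconds: float, max_retries: int, retry_delay: float) -> float:
    """Total wall time when the first attempt and every retry fail.

    attempts = 1 initial run + max_retries retries; a delay precedes each retry.
    """
    attempts = max_retries + 1
    return attempts * exec_seconds + max_retries * retry_delay


SYNC_TIMEOUT = 30.0  # FastAPI sync timeout from the audit

# Fast failures (1s each): 3 * 1 + 2 * 10 = 23s, inside the timeout.
fast = worst_case_task_seconds(1.0, max_retries=2, retry_delay=10.0)

# Slower failures (5s each): 3 * 5 + 2 * 10 = 35s, past the timeout,
# so the last retry would run after the client has received 504.
slow = worst_case_task_seconds(5.0, max_retries=2, retry_delay=10.0)
```

Under these assumptions, any attempt taking longer than about 3.3s (10s of remaining budget split across three attempts) pushes the final retry past the 30s timeout.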

Async flow (POST /predict/async/)

POST /predict/async/
    ↓
predict_match_task.apply_async(queue="ml")
    ↓ returns task_id
AsyncPredictResponse(task_id=...)
    ↓
GET /monitoring/task_status/{task_id}
    ↓
{"status": "SUCCESS"|"FAILURE"|"PROGRESS", "result": {...}}

✅ The full async flow is implemented. A polling endpoint exists.
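From the client side, the flow above reduces to submitting once and polling until a terminal state. A minimal sketch of the polling half, assuming a hypothetical `fetch_status` callable that performs `GET /monitoring/task_status/{task_id}` and returns the decoded JSON; the function name, intervals, and timeout are illustrative.

```python
import time
from typing import Any, Callable, Dict

TERMINAL_STATES = {"SUCCESS", "FAILURE"}


def wait_for_result(
    fetch_status: Callable[[str], Dict[str, Any]],
    task_id: str,
    poll_interval: float = 1.0,
    timeout: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Poll the task status until it is SUCCESS or FAILURE, or give up."""
    deadline = clock() + timeout
    while True:
        payload = fetch_status(task_id)
        if payload.get("status") in TERMINAL_STATES:
            return payload
        if clock() >= deadline:
            raise TimeoutError(f"task {task_id} did not finish within {timeout}s")
        # Any other state (for example PROGRESS) means: wait and ask again.
        sleep(poll_interval)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting; a production client would likely add backoff between polls.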

5. Findings

ID	Severity	Description
SRV-01	P1	No auth on the prediction endpoints (`/predict/*`): open access
SRV-02	P1	No mechanism for automatically reloading the model when the champion alias changes
SRV-03	P2	`GET /predict/matches/` returns `list[dict]` without a Pydantic response schema, so the format is not guaranteed
SRV-04	P2	`FeatureLookupService._load()` has no threading.Lock around the MinIO reload, so concurrent requests can trigger parallel loads
SRV-05	P2	`predict_match` retries (2×10s) + FastAPI sync timeout (30s): race condition, the last retry may run after the 504 has been returned to the client
SRV-06	P2	No explicit 503 when the Celery broker is unavailable; the client gets a 500 or a timeout
SRV-07	P3	`asyncio.get_event_loop()` is legacy as of Python 3.10+ and deprecated in 3.12; `asyncio.get_running_loop()` should be used instead
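For SRV-07, the replacement is mechanical inside coroutines. The audit does not show where `asyncio.get_event_loop()` is called, so the sketch below assumes a common case, offloading a blocking call to the default executor; `blocking_call` and `run_blocking` are hypothetical names.

```python
import asyncio


def blocking_call(x: int) -> int:
    """Stand-in for a blocking operation such as waiting on a task result."""
    return x * 2


async def run_blocking(x: int) -> int:
    # get_running_loop() must be called from a coroutine or callback and
    # raises RuntimeError otherwise, instead of implicitly creating a loop.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, blocking_call, x)
```

Called as `asyncio.run(run_blocking(21))`, this returns the doubled value computed in a worker thread while the event loop stays free.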