Runtime View¶
This page describes how the system behaves at runtime for its two core prediction paths: synchronous prediction and asynchronous prediction. It also covers cache interaction, model loading, and runtime invariants.
Runtime Invariants¶
These are architectural invariants — properties that must hold regardless of how implementation details evolve:
- The model serving any request is always a registered MLflow model with the `champion` alias. No local model file is ever loaded directly.
- The request input schema is always validated against the Pydantic `PredictRequest` schema before any inference logic runs. Malformed requests are rejected with a structured error.
- Feature assembly at inference time uses the same code path (`src/features/`) as the offline pipeline. There is no duplicate feature implementation for serving.
- Model promotion is the explicit, logged handoff point between the offline pipeline and the serving layer. Models enter serving only via the MLflow Registry.
Caching is an optimization, not an invariant. Redis is used to avoid redundant inference for repeated queries. The inference path must remain correct and available when Redis is unavailable. See Degraded Modes.
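To make the schema-validation invariant concrete, here is a minimal sketch of what the Pydantic request model could look like. The field names are illustrative assumptions; the actual `PredictRequest` lives in the application code under `src/app/`.

```python
# Hypothetical sketch of the request schema; field names are assumptions.
from pydantic import BaseModel, Field


class PredictRequest(BaseModel):
    """Match context submitted to the prediction endpoints."""

    home_team: str = Field(..., min_length=1)
    away_team: str = Field(..., min_length=1)
    match_date: str  # e.g. "2025-08-30"; the real schema may use a proper date type


# FastAPI rejects any request that fails this validation with a 422
# before feature assembly or inference code runs.
```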
Sync Prediction Path¶
Endpoint: POST /predict
Latency: bounded by the Celery `ml` queue timeout (see Operational Targets)
Step-by-step:
1. Client sends POST /predict with match context.
2. FastAPI validates the request via Pydantic schema; returns 422 on schema violation.
3. FastAPI checks Redis cache for an existing prediction for this input key.
4. On cache hit: return cached result immediately (sub-second; no inference or feature assembly).
5. On cache miss: FastAPI enqueues a task on the Celery ml queue via RabbitMQ.
6. celery-worker-ml picks up the task.
7. Worker assembles feature vectors using src/features/ (same logic as offline pipeline).
8. Worker runs inference using the lazily loaded model (loaded once per worker process on first request).
9. Worker writes the result to Redis with a configured TTL.
10. FastAPI receives the task result (sync wait, 30s timeout) and returns it to the client.
11. On timeout: FastAPI returns 504 Gateway Timeout.
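A minimal sketch of this flow, assuming illustrative names for the Celery task, cache helpers, and schema (none of these identifiers are taken from the actual codebase):

```python
# Sketch of the sync path: cache check, Celery dispatch, bounded wait.
from celery.exceptions import TimeoutError as CeleryTimeoutError
from fastapi import FastAPI, HTTPException

from src.app.cache import cache_key_for, redis_client  # hypothetical helpers
from src.app.schemas import PredictRequest             # hypothetical module path
from src.app.tasks import predict_task                 # Celery task routed to the "ml" queue

app = FastAPI()


@app.post("/predict")
def predict(request: PredictRequest):
    key = cache_key_for(request)             # deterministic hash of the input (step 3)
    cached = redis_client.get(key)
    if cached is not None:
        return {"cached": True, "result": cached}  # step 4: cache hit (deserialization omitted)

    async_result = predict_task.delay(request.model_dump())  # step 5: enqueue via RabbitMQ
    try:
        result = async_result.get(timeout=30)      # step 10: sync wait on the worker
    except CeleryTimeoutError:
        raise HTTPException(status_code=504, detail="Prediction timed out")  # step 11
    return {"cached": False, "result": result}
```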
Async Prediction Path¶
Endpoints:
- POST /predict/async/ → submit job, returns task_id
- GET /monitoring/task_status/{task_id} → poll for result
Step-by-step:
1. Client sends POST /predict/async/ with match context.
2. FastAPI validates the request; enqueues the task on the Celery ml queue.
3. FastAPI immediately returns 202 Accepted with a task_id.
4. Client polls GET /monitoring/task_status/{task_id} until status is success or failure.
5. celery-worker-ml processes the task (same inference logic as sync path).
6. Result is stored in Redis keyed by task ID.
7. On next poll: FastAPI retrieves the result and returns 200 OK.
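A sketch of the submit/poll pair under the same assumptions (hypothetical helper and task names):

```python
# Sketch of the async path: immediate 202 with task_id, then polling.
from fastapi import FastAPI
from fastapi.responses import JSONResponse

from src.app.schemas import PredictRequest  # hypothetical module path
from src.app.tasks import predict_task      # same Celery task as the sync path

app = FastAPI()


@app.post("/predict/async/", status_code=202)
def submit_prediction(request: PredictRequest):
    async_result = predict_task.delay(request.model_dump())  # step 2: enqueue on the "ml" queue
    return {"task_id": async_result.id}                      # step 3: 202 Accepted + task_id


@app.get("/monitoring/task_status/{task_id}")
def task_status(task_id: str):
    result = predict_task.AsyncResult(task_id)                # result backend lookup (Redis)
    if not result.ready():
        return {"task_id": task_id, "status": result.status}  # still pending or running
    if result.failed():
        return JSONResponse(status_code=500, content={"task_id": task_id, "status": "failure"})
    return {"task_id": task_id, "status": "success", "result": result.get()}  # step 7
```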
Cache Interaction Rules¶
| Scenario | Behavior |
|---|---|
| Cache hit on `POST /predict` | Return cached result; skip Celery dispatch entirely |
| Cache miss on `POST /predict` | Dispatch to Celery; write result to cache after completion |
| Cache hit on `GET /predict/{match_id}` | Return from batch-inference parquet (pre-computed, not Redis) |
| Async result retrieval | Keyed by `task_id`; TTL-bounded |
| Cache eviction | TTL-based; no manual invalidation today |
Cache key structure: keys are derived from a hash of the input feature vector (implementation detail in `src/app/`).
Current limitations:
- No explicit cache invalidation on model version change. When a new model is promoted, stale cache entries may be served until TTL expires.
- 📋 Planned: cache key to include model version or invalidate on promotion event.
Cache Strategy¶
Redis serves as an optimization layer — not a source of truth, and not an architectural invariant. The inference path must function correctly without it (see Degraded Modes).
Role of Redis in this system:
- Reduces redundant inference for repeated queries about the same match input.
- Stores assembled prediction results, not raw feature vectors.
- Operates on a TTL-based expiry: no entry persists indefinitely.
Cache key structure (conceptual):
Cache keys are derived from a deterministic hash of the input feature vector. This ensures that:
- Structurally identical inputs (same match context) produce the same key.
- Structurally different inputs (different match context or feature values) always produce different keys.
- The key is not tied to the model version today (a known limitation — see Consistency Model below).
Keys for async task results are keyed by task_id, not by input hash.
TTL-based expiration:
All cache entries expire automatically after a configured TTL. There is no manual expiration logic in the normal path. The TTL is a trade-off parameter:
- Too short → poor cache hit rate; redundant inference.
- Too long → stale predictions served after match conditions change or a new model is promoted.
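A conceptual sketch of the key derivation and the TTL-bounded write (hashing scheme, key prefix, and TTL value are illustrative; the real logic lives in `src/app/`):

```python
# Sketch of deterministic cache keys and TTL-bounded writes.
import hashlib
import json

import redis

CACHE_TTL_SECONDS = 3600  # illustrative; the real TTL is a configured trade-off parameter
redis_client = redis.Redis(host="redis", port=6379)


def cache_key_for(match_context: dict) -> str:
    """Identical match context always hashes to the same key."""
    canonical = json.dumps(match_context, sort_keys=True)  # order-independent serialization
    return "prediction:" + hashlib.sha256(canonical.encode()).hexdigest()


def store_prediction(match_context: dict, prediction: dict) -> None:
    # SETEX attaches an expiry, so no entry persists indefinitely.
    redis_client.setex(cache_key_for(match_context), CACHE_TTL_SECONDS, json.dumps(prediction))
```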
Consistency Model¶
Redis provides eventual consistency with respect to model version. This is an accepted trade-off, not a defect.
What this means in practice:
- After a new model is promoted to `champion`, existing cache entries produced by the previous model remain valid until TTL expires.
- Stale predictions from the previous model will be served to clients during the TTL window after a promotion event.
- There is no strict synchronization between the MLflow Registry (model promotion) and the Redis cache (cached predictions).
Why this is acceptable:
- Predictions are probabilistic and advisory. A small window of stale results does not constitute a system failure.
- The TTL window is bounded. Stale entries are not permanent.
- The operational complexity of tight cache-model synchronization is not justified at current scale or SLA.
Current status and limitation:
Cache invalidation on model promotion is not implemented. It is documented as a known limitation in Known Architectural Limitations and as a planned improvement in Roadmap.
Model Loading Rules¶
- The model is loaded lazily: on first inference request in each worker process, not at startup.
- The `model_uri` is resolved from the MLflow Registry using the `champion` alias.
- The same `PredictionService` singleton is reused for all subsequent requests in that process.
- The model artifact includes preprocessing steps packaged as an MLflow `pyfunc` wrapper.
Consequence: the first request to a freshly started worker is slower (model load time). Subsequent requests serve from the in-process loaded model.
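A sketch of the lazy-loading rule, assuming an illustrative registered-model name and a simplified singleton (the real `PredictionService` may be structured differently):

```python
# Sketch of lazy, once-per-process model loading from the MLflow Registry.
import mlflow.pyfunc

MODEL_URI = "models:/match_predictor@champion"  # "match_predictor" is an assumed model name


class PredictionService:
    """Loads the champion model on first use and reuses it for the process lifetime."""

    _model = None

    @classmethod
    def model(cls):
        if cls._model is None:
            # First request in this worker process pays the MLflow/MinIO download cost.
            cls._model = mlflow.pyfunc.load_model(MODEL_URI)
        return cls._model

    @classmethod
    def predict(cls, features):
        # The pyfunc wrapper applies its packaged preprocessing before inference.
        return cls.model().predict(features)
```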
Batch Feature Lookup Path¶
Endpoint: GET /predict/{match_id}
This is a separate, lighter path for pre-computed predictions:
- `celery-worker-ml` runs a `batch_inference` DVC stage that assembles feature vectors for all upcoming matches and writes them to `data/predictions/match_features.parquet`.
- The FastAPI `GET /predict/{match_id}` handler reads from this parquet file (or a cached version of it).
- No model inference is triggered at request time; the response is a pre-computed feature lookup.
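A sketch of the lookup handler, assuming a `match_id` column and a direct parquet read (caching of the dataframe is omitted):

```python
# Sketch of the pre-computed lookup path: a parquet read, no model inference.
import pandas as pd
from fastapi import FastAPI, HTTPException

app = FastAPI()
PREDICTIONS_PATH = "data/predictions/match_features.parquet"


@app.get("/predict/{match_id}")
def get_precomputed_prediction(match_id: int):
    df = pd.read_parquet(PREDICTIONS_PATH)   # produced by the batch_inference DVC stage
    rows = df[df["match_id"] == match_id]    # "match_id" column name is an assumption
    if rows.empty:
        raise HTTPException(status_code=404, detail="No pre-computed entry for this match")
    return rows.iloc[0].to_dict()
```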
Performance Notes¶
Concrete latency targets are documented in System Requirements — Operational Targets. This section describes the qualitative profile of each path.
| Path | Latency profile | Bottleneck |
|---|---|---|
| Cache hit | Sub-second | Redis lookup; no model inference |
| Cache miss — sync | Bounded by Celery `ml` queue timeout | Feature assembly + model inference + RabbitMQ round-trip |
| Async submission | Near-instant | FastAPI enqueue only; no blocking on result |
| Model load (first request per worker) | One-time cost on first request in each worker process | MLflow artifact download from MinIO |
Degraded Modes¶
The system is designed to continue serving predictions when non-critical components are unavailable.
| Degraded component | Behavior | Severity |
|---|---|---|
| Redis unavailable | Cache bypassed; all requests fall through to full inference path | P2 — degraded performance; inference still functional |
| MLflow unavailable (registry read) | Running workers continue to serve from already-loaded model; new worker processes fail to start | P2 — no impact until worker restart |
| RabbitMQ unavailable | All inference fails (sync and async); no task dispatch possible | P1 — full serving outage |
| Celery worker-ml down | Sync requests time out; async tasks queue but do not process | P1 — inference unavailable |
See Failure Modes for recovery procedures.
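The Redis degraded mode amounts to treating any cache error as a miss. A minimal sketch of that guard (the helper's shape is an assumption):

```python
# Sketch of the cache-bypass guard used when Redis is unavailable.
import logging

from redis.exceptions import RedisError

logger = logging.getLogger(__name__)


def get_cached_prediction(redis_client, key: str):
    """Return a cached result, or None on a miss or when Redis is unreachable."""
    try:
        return redis_client.get(key)
    except RedisError:
        # P2 degradation: log and fall through to the full inference path.
        logger.warning("Redis unavailable; bypassing cache for key %s", key)
        return None
```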
Current Limitations¶
| Limitation | Impact | Status |
|---|---|---|
| No Redis HA | Redis pod failure = cache miss on all requests; sync path still works (slower) | 🚧 Known |
| No streaming inference | Batch-only feature input | 📋 Planned |
| Cache not invalidated on model promotion | Stale results served until TTL | 📋 Planned fix |
| Single RabbitMQ broker | Queue unavailable = all inference fails | 🚧 Known |
Latency Trade-offs¶
The sync inference path (POST /predict) is bounded by the Celery ml queue timeout (p95 < 30 s,
per Operational Targets). This is intentionally generous. Several design decisions
contribute to this latency profile:
Why sync inference can take up to 30 s:
- Feature computation is expensive. Feature assembly at inference time reuses the same logic from `src/features/` as the offline pipeline. This means time-windowed statistical aggregations, ELO calculations, and other stateful computations run in the request path.
- Offline logic is reused by design, not by accident. The decision to share feature code between training and serving is an explicit architectural invariant (see Runtime Invariants). This eliminates training/serving skew at the cost of inference latency. The trade-off is accepted.
- No precompute layer exists yet. There is no online feature store or precomputed feature cache. Every cache miss triggers full feature assembly from scratch. This is a known limitation.
- Model load on first request. A freshly started worker must download and load the model artifact from MinIO/MLflow before the first inference completes. Subsequent requests are faster (in-process model).
- RabbitMQ round-trip. Every cache miss dispatches through RabbitMQ, adding message broker round-trip latency to the total.
The 30 s bound is a p95 operational target, not a hard SLA. It reflects the current implementation profile, not an architectural ceiling.
Why Sync Path Uses Celery¶
The synchronous POST /predict endpoint dispatches to a Celery worker and waits for the result,
rather than executing inference inline in the FastAPI process. This appears to add latency — and it does.
The trade-off is deliberate.
Reasons for this design:
- Uniform execution path. Sync and async inference both run in `celery-worker-ml`. There is a single inference implementation, not two divergent paths. This eliminates the risk of sync and async paths producing different results from the same input.
- Process isolation. FastAPI HTTP workers remain responsive under model load, feature assembly, or I/O delays. A slow inference job in a Celery worker cannot block the HTTP process.
- Reuse of async infrastructure. The same queue, the same workers, and the same feature/model code are used for both sync and async jobs. There is no separate serving stack to maintain.
- Predictable failure modes. Task timeout, queue unavailability, and worker crash all have well-defined outcomes and recovery paths. Inline inference in FastAPI would scatter these failure modes across the HTTP layer.
The trade-off: sync latency includes Celery dispatch overhead (RabbitMQ round-trip + task pickup). This is acceptable given the 30 s budget and the benefit of a single unified inference path.
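A sketch of what the single, shared inference task could look like (broker URL, task name, queue routing, and import paths are all illustrative assumptions, not the actual wiring):

```python
# Sketch of the one inference implementation used by both sync and async paths.
from celery import Celery

from src.app.prediction_service import PredictionService  # hypothetical import path
from src.features.assemble import build_feature_vector    # hypothetical import path

app = Celery("worker", broker="amqp://rabbitmq//", backend="redis://redis:6379/0")
app.conf.task_routes = {"predict_task": {"queue": "ml"}}   # route to the dedicated ml queue


@app.task(name="predict_task")
def predict_task(match_context: dict) -> dict:
    # Same feature code path as the offline pipeline: no duplicate implementation.
    features = build_feature_vector(match_context)
    prediction = PredictionService.predict(features)
    return {"prediction": prediction}  # serialization details omitted in this sketch
```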
Future Optimization Paths¶
These are potential improvements to the inference latency profile. They are listed here as architectural options, not commitments. None are currently implemented.
- Precomputed features (offline feature assembly): Run feature assembly as a batch job before match time; store feature vectors in a lookup store. Inference at request time becomes a vector lookup + model forward pass only.
- Lighter model for sync path: Replace or supplement the current model with a faster variant (e.g., smaller gradient boosting tree, logistic regression) for the latency-sensitive sync path. The current model would remain available for async batch scoring.
- Dedicated online feature store (optional): A purpose-built feature store (e.g., Feast, Redis-backed) could decouple feature freshness from request-time computation. Not warranted at current scale.
- Cache warming on model promotion: When a new `champion` model is promoted, pre-populate the Redis cache for known upcoming matches. This avoids the first-request cold-start penalty after a model switch.
- Worker pre-warming: Start inference workers and trigger a dummy prediction at deploy time to force model load before the first real request.
Related¶
- Container View — Celery worker types, Redis, RabbitMQ
- Component View — inference components and feature assembly
- Failure Modes — what happens when cache, queue, or model is unavailable
- Serving — API Contract — full endpoint specification