Runtime View¶
This page describes how the system behaves at runtime for its two core prediction paths: synchronous prediction and asynchronous prediction. It also covers cache interaction, model loading, and runtime invariants.
Runtime Invariants¶
These are architectural invariants — properties that must hold regardless of how implementation details evolve:
- The model serving any request is always a registered MLflow model with the `champion` alias. No local model file is ever loaded directly.
- The request input schema is always validated against the Pydantic `PredictRequest` schema before any inference logic runs. Malformed requests are rejected with a structured error.
- Feature assembly at inference time uses the same code path (`src/features/`) as the offline pipeline. There is no duplicate feature implementation for serving.
- Model promotion is the explicit, logged handoff point between the offline pipeline and the serving layer. Models enter serving only via the MLflow Registry.
Caching is an optimization, not an invariant. Redis is used to avoid redundant inference for repeated queries. The inference path must remain correct and available when Redis is unavailable. See Degraded Modes.
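To make the schema-validation invariant concrete, here is a minimal sketch of what the Pydantic request model could look like. The field names are illustrative assumptions; the actual `PredictRequest` lives in the application code under `src/app/`.

```python
# Hypothetical sketch of the request schema; field names are assumptions.
from pydantic import BaseModel, Field


class PredictRequest(BaseModel):
    """Match context submitted to the prediction endpoints."""

    home_team: str = Field(..., min_length=1)
    away_team: str = Field(..., min_length=1)
    match_date: str  # e.g. "2025-08-30"; the real schema may use a proper date type


# FastAPI rejects any request that fails this validation with a 422
# before feature assembly or inference code runs.
```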
Sync Prediction Path¶
Endpoint: POST /predict
Latency: bounded by the Celery `ml` queue timeout (see Operational Targets)
Step-by-step:
1. Client sends POST /predict with match context.
2. FastAPI validates the request via Pydantic schema; returns 422 on schema violation.
3. FastAPI checks Redis cache for an existing prediction for this input key.
4. On cache hit: return cached result immediately (sub-second; no inference or feature assembly).
5. On cache miss: FastAPI enqueues a task on the Celery ml queue via RabbitMQ.
6. celery-worker-ml picks up the task.
7. Worker assembles feature vectors using src/features/ (same logic as offline pipeline).
8. Worker runs inference using the lazily loaded model (loaded once per worker process on first request).
9. Worker writes the result to Redis with a configured TTL.
10. FastAPI receives the task result (sync wait, 30s timeout) and returns it to the client.
11. On timeout: FastAPI returns 504 Gateway Timeout.
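A minimal sketch of this flow, assuming illustrative names for the Celery task, cache helpers, and schema (none of these identifiers are taken from the actual codebase):

```python
# Sketch of the sync path: cache check, Celery dispatch, bounded wait.
from celery.exceptions import TimeoutError as CeleryTimeoutError
from fastapi import FastAPI, HTTPException

from src.app.cache import cache_key_for, redis_client  # hypothetical helpers
from src.app.schemas import PredictRequest             # hypothetical module path
from src.app.tasks import predict_task                 # Celery task routed to the "ml" queue

app = FastAPI()


@app.post("/predict")
def predict(request: PredictRequest):
    key = cache_key_for(request)             # deterministic hash of the input (step 3)
    cached = redis_client.get(key)
    if cached is not None:
        return {"cached": True, "result": cached}  # step 4: cache hit (deserialization omitted)

    async_result = predict_task.delay(request.model_dump())  # step 5: enqueue via RabbitMQ
    try:
        result = async_result.get(timeout=30)      # step 10: sync wait on the worker
    except CeleryTimeoutError:
        raise HTTPException(status_code=504, detail="Prediction timed out")  # step 11
    return {"cached": False, "result": result}
```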
Async Prediction Path¶
Endpoints:
- POST /predict/async/ → submit job, returns task_id
- GET /monitoring/task_status/{task_id} → poll for result
Step-by-step:
1. Client sends POST /predict/async/ with match context.
2. FastAPI validates the request; enqueues the task on the Celery ml queue.
3. FastAPI immediately returns 202 Accepted with a task_id.
4. Client polls GET /monitoring/task_status/{task_id} until status is success or failure.
5. celery-worker-ml processes the task (same inference logic as sync path).
6. Result is stored in Redis keyed by task ID.
7. On next poll: FastAPI retrieves the result and returns 200 OK.
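A sketch of the submit/poll pair under the same assumptions (hypothetical helper and task names):

```python
# Sketch of the async path: immediate 202 with task_id, then polling.
from fastapi import FastAPI
from fastapi.responses import JSONResponse

from src.app.schemas import PredictRequest  # hypothetical module path
from src.app.tasks import predict_task      # same Celery task as the sync path

app = FastAPI()


@app.post("/predict/async/", status_code=202)
def submit_prediction(request: PredictRequest):
    async_result = predict_task.delay(request.model_dump())  # step 2: enqueue on the "ml" queue
    return {"task_id": async_result.id}                      # step 3: 202 Accepted + task_id


@app.get("/monitoring/task_status/{task_id}")
def task_status(task_id: str):
    result = predict_task.AsyncResult(task_id)                # result backend lookup (Redis)
    if not result.ready():
        return {"task_id": task_id, "status": result.status}  # still pending or running
    if result.failed():
        return JSONResponse(status_code=500, content={"task_id": task_id, "status": "failure"})
    return {"task_id": task_id, "status": "success", "result": result.get()}  # step 7
```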
Cache Interaction Rules¶
| Scenario | Behavior |
|---|---|
| Cache hit on `POST /predict` | Return cached result; skip Celery dispatch entirely |
| Cache miss on `POST /predict` | Dispatch to Celery; write result to cache after completion |
| Cache hit on `GET /predict/{match_id}` | Return from batch-inference parquet (pre-computed, not Redis) |
| Async result retrieval | Keyed by `task_id`; TTL-bounded |
| Cache eviction | TTL-based; no manual invalidation today |
Cache key structure: keys are derived from a hash of the input feature vector (implementation detail in `src/app/`).
Current limitations:
- No explicit cache invalidation on model version change. When a new model is promoted, stale cache entries may be served until TTL expires.
- 📋 Planned: cache key to include model version or invalidate on promotion event.
Cache Strategy¶
Redis serves as an optimization layer — not a source of truth, and not an architectural invariant. The inference path must function correctly without it (see Degraded Modes).
Role of Redis in this system:
- Reduces redundant inference for repeated queries about the same match input.
- Stores assembled prediction results, not raw feature vectors.
- Operates on a TTL-based expiry: no entry persists indefinitely.
Cache key structure (conceptual):
Cache keys are derived from a deterministic hash of the input feature vector. This ensures that:
- Structurally identical inputs (same match context) produce the same key.
- Structurally different inputs (different match context or feature values) always produce different keys.
- The key is not tied to the model version today (a known limitation — see Consistency Model below).
Keys for async task results are keyed by task_id, not by input hash.
TTL-based expiration:
All cache entries expire automatically after a configured TTL. There is no manual expiration logic in the normal path. The TTL is a trade-off parameter:
- Too short → poor cache hit rate; redundant inference.
- Too long → stale predictions served after match conditions change or a new model is promoted.
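A conceptual sketch of the key derivation and the TTL-bounded write (hashing scheme, key prefix, and TTL value are illustrative; the real logic lives in `src/app/`):

```python
# Sketch of deterministic cache keys and TTL-bounded writes.
import hashlib
import json

import redis

CACHE_TTL_SECONDS = 3600  # illustrative; the real TTL is a configured trade-off parameter
redis_client = redis.Redis(host="redis", port=6379)


def cache_key_for(match_context: dict) -> str:
    """Identical match context always hashes to the same key."""
    canonical = json.dumps(match_context, sort_keys=True)  # order-independent serialization
    return "prediction:" + hashlib.sha256(canonical.encode()).hexdigest()


def store_prediction(match_context: dict, prediction: dict) -> None:
    # SETEX attaches an expiry, so no entry persists indefinitely.
    redis_client.setex(cache_key_for(match_context), CACHE_TTL_SECONDS, json.dumps(prediction))
```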
Consistency Model¶
Redis provides eventual consistency with respect to model version. This is an accepted trade-off, not a defect.
What this means in practice:
- After a new model is promoted to `champion`, existing cache entries produced by the previous model remain valid until TTL expires.
- Stale predictions from the previous model will be served to clients during the TTL window after a promotion event.
- There is no strict synchronization between the MLflow Registry (model promotion) and the Redis cache (cached predictions).
Why this is acceptable:
- Predictions are probabilistic and advisory. A small window of stale results does not constitute a system failure.
- The TTL window is bounded. Stale entries are not permanent.
- The operational complexity of tight cache-model synchronization is not justified at current scale or SLA.
Current status and limitation:
Cache invalidation on model promotion is not implemented. It is documented as a known limitation in Known Architectural Limitations and as a planned improvement in Roadmap.
Model Loading Rules¶
- The model is loaded lazily: on first inference request in each worker process, not at startup.
- The `model_uri` is resolved from the MLflow Registry using the `champion` alias.
- The same `PredictionService` singleton is reused for all subsequent requests in that process.
- The model artifact includes preprocessing steps packaged as an MLflow `pyfunc` wrapper.
Consequence: the first request to a freshly started worker is slower (model load time). Subsequent requests serve from the in-process loaded model.
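A sketch of the lazy-loading rule, assuming an illustrative registered-model name and a simplified singleton (the real `PredictionService` may be structured differently):

```python
# Sketch of lazy, once-per-process model loading from the MLflow Registry.
import mlflow.pyfunc

MODEL_URI = "models:/match_predictor@champion"  # "match_predictor" is an assumed model name


class PredictionService:
    """Loads the champion model on first use and reuses it for the process lifetime."""

    _model = None

    @classmethod
    def model(cls):
        if cls._model is None:
            # First request in this worker process pays the MLflow/MinIO download cost.
            cls._model = mlflow.pyfunc.load_model(MODEL_URI)
        return cls._model

    @classmethod
    def predict(cls, features):
        # The pyfunc wrapper applies its packaged preprocessing before inference.
        return cls.model().predict(features)
```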
Batch Feature Lookup Path¶
Endpoint: GET /predict/{match_id}
This is a separate, lighter path for pre-computed predictions:
- `celery-worker-ml` runs a `batch_inference` DVC stage that assembles feature vectors for all upcoming matches and writes them to `data/predictions/match_features.parquet`.
- The FastAPI `GET /predict/{match_id}` handler reads from this parquet file (or a cached version of it).
- No model inference is triggered at request time; the response is a pre-computed feature lookup.
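A sketch of the lookup handler, assuming a `match_id` column and a direct parquet read (caching of the dataframe is omitted):

```python
# Sketch of the pre-computed lookup path: a parquet read, no model inference.
import pandas as pd
from fastapi import FastAPI, HTTPException

app = FastAPI()
PREDICTIONS_PATH = "data/predictions/match_features.parquet"


@app.get("/predict/{match_id}")
def get_precomputed_prediction(match_id: int):
    df = pd.read_parquet(PREDICTIONS_PATH)   # produced by the batch_inference DVC stage
    rows = df[df["match_id"] == match_id]    # "match_id" column name is an assumption
    if rows.empty:
        raise HTTPException(status_code=404, detail="No pre-computed entry for this match")
    return rows.iloc[0].to_dict()
```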
Performance Notes¶
Concrete latency targets are documented in System Requirements — Operational Targets. This section describes the qualitative profile of each path.
| Path | Latency profile | Bottleneck |
|---|---|---|
| Cache hit | Sub-second | Redis lookup; no model inference |
| Cache miss — sync | Bounded by Celery `ml` queue timeout | Feature assembly + model inference + RabbitMQ round-trip |
| Async submission | Near-instant | FastAPI enqueue only; no blocking on result |
| Model load (first request per worker) | One-time cost on first request in each worker process | MLflow artifact download from MinIO |
Degraded Modes¶
The system is designed to continue serving predictions when non-critical components are unavailable.
| Degraded component | Behavior | Severity |
|---|---|---|
| Redis unavailable | Cache bypassed; all requests fall through to full inference path | P2 — degraded performance; inference still functional |
| MLflow unavailable (registry read) | Running workers continue to serve from already-loaded model; new worker processes fail to start | P2 — no impact until worker restart |
| RabbitMQ unavailable | All inference fails (sync and async); no task dispatch possible | P1 — full serving outage |
| Celery worker-ml down | Sync requests time out; async tasks queue but do not process | P1 — inference unavailable |
See Failure Modes for recovery procedures.
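The Redis degraded mode amounts to treating any cache error as a miss. A minimal sketch of that guard (the helper's shape is an assumption):

```python
# Sketch of the cache-bypass guard used when Redis is unavailable.
import logging

from redis.exceptions import RedisError

logger = logging.getLogger(__name__)


def get_cached_prediction(redis_client, key: str):
    """Return a cached result, or None on a miss or when Redis is unreachable."""
    try:
        return redis_client.get(key)
    except RedisError:
        # P2 degradation: log and fall through to the full inference path.
        logger.warning("Redis unavailable; bypassing cache for key %s", key)
        return None
```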
Current Limitations¶
| Limitation | Impact | Status |
|---|---|---|
| No Redis HA | Redis pod failure = cache miss on all requests; sync path still works (slower) | 🚧 Known |
| No streaming inference | Batch-only feature input | 📋 Planned |
| Cache not invalidated on model promotion | Stale results served until TTL | 📋 Planned fix |
| Single RabbitMQ broker | Queue unavailable = all inference fails | 🚧 Known |
Latency Trade-offs¶
The sync inference path (POST /predict) is bounded by the Celery ml queue timeout (p95 < 30 s,
per Operational Targets). This is intentionally generous. Several design decisions
contribute to this latency profile:
Why sync inference can take up to 30 s:
- Feature computation is expensive. Feature assembly at inference time reuses the same logic from `src/features/` as the offline pipeline. This means time-windowed statistical aggregations, ELO calculations, and other stateful computations run in the request path.
- Offline logic is reused by design, not by accident. The decision to share feature code between training and serving is an explicit architectural invariant (see Runtime Invariants). This eliminates training/serving skew at the cost of inference latency. The trade-off is accepted.
- No precompute layer exists yet. There is no online feature store or precomputed feature cache. Every cache miss triggers full feature assembly from scratch. This is a known limitation.
- Model load on first request. A freshly started worker must download and load the model artifact from MinIO/MLflow before the first inference completes. Subsequent requests are faster (in-process model).
- RabbitMQ round-trip. Every cache miss dispatches through RabbitMQ, adding message broker round-trip latency to the total.
The 30 s bound is a p95 operational target, not a hard SLA. It reflects the current implementation profile, not an architectural ceiling.
Why Sync Path Uses Celery¶
The synchronous POST /predict endpoint dispatches to a Celery worker and waits for the result,
rather than executing inference inline in the FastAPI process. This appears to add latency — and it does.
The trade-off is deliberate.
Reasons for this design:
- Uniform execution path. Sync and async inference both run in `celery-worker-ml`. There is a single inference implementation, not two divergent paths. This eliminates the risk of sync and async paths producing different results from the same input.
- Process isolation. FastAPI HTTP workers remain responsive under model load, feature assembly, or I/O delays. A slow inference job in a Celery worker cannot block the HTTP process.
- Reuse of async infrastructure. The same queue, the same workers, and the same feature/model code are used for both sync and async jobs. There is no separate serving stack to maintain.
- Predictable failure modes. Task timeout, queue unavailability, and worker crash all have well-defined outcomes and recovery paths. Inline inference in FastAPI would scatter these failure modes across the HTTP layer.
The trade-off: sync latency includes Celery dispatch overhead (RabbitMQ round-trip + task pickup). This is acceptable given the 30 s budget and the benefit of a single unified inference path.
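A sketch of what the single, shared inference task could look like (broker URL, task name, queue routing, and import paths are all illustrative assumptions, not the actual wiring):

```python
# Sketch of the one inference implementation used by both sync and async paths.
from celery import Celery

from src.app.prediction_service import PredictionService  # hypothetical import path
from src.features.assemble import build_feature_vector    # hypothetical import path

app = Celery("worker", broker="amqp://rabbitmq//", backend="redis://redis:6379/0")
app.conf.task_routes = {"predict_task": {"queue": "ml"}}   # route to the dedicated ml queue


@app.task(name="predict_task")
def predict_task(match_context: dict) -> dict:
    # Same feature code path as the offline pipeline: no duplicate implementation.
    features = build_feature_vector(match_context)
    prediction = PredictionService.predict(features)
    return {"prediction": prediction}  # serialization details omitted in this sketch
```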
Future Optimization Paths¶
These are potential improvements to the inference latency profile. They are listed here as architectural options, not commitments. None are currently implemented.
- Precomputed features (offline feature assembly): Run feature assembly as a batch job before match time; store feature vectors in a lookup store. Inference at request time becomes a vector lookup + model forward pass only.
- Lighter model for sync path: Replace or supplement the current model with a faster variant (e.g., smaller gradient boosting tree, logistic regression) for the latency-sensitive sync path. The current model would remain available for async batch scoring.
- Dedicated online feature store (optional): A purpose-built feature store (e.g., Feast, Redis-backed) could decouple feature freshness from request-time computation. Not warranted at current scale.
- Cache warming on model promotion: When a new `champion` model is promoted, pre-populate the Redis cache for known upcoming matches. This avoids the first-request cold-start penalty after a model switch.
- Worker pre-warming: Start inference workers and trigger a dummy prediction at deploy time to force model load before the first real request.
Related¶
- Container View — Celery worker types, Redis, RabbitMQ
- Component View — inference components and feature assembly
- Failure Modes — what happens when cache, queue, or model is unavailable
- Serving — API Contract — full endpoint specification