
Architecture Roadmap

This page documents planned architectural improvements in priority order. All items on this page are 📋 Planned — none are implemented unless otherwise stated.

The roadmap is driven by engineering maturity gaps, not feature requests. Each item is justified by a concrete architectural need, not speculative scope expansion.

The v1.0 deliverables below are the binding scope for the next 1–2 weeks. The Near-term / Mid-term / Long-term sections that follow are the post-v1 backlog and are explicitly out of scope for v1.0.


v1.0 — Demo Track (1–2 weeks, in progress)

These items are the v1.0 Definition of Done as fixed in Requirements §Definition of Done. They are the only items committed for the current cycle. All other roadmap items below are deferred to post-v1.

v1.1 Public prediction UI (DoD-01, DoD-02)

Architectural reason: A read-only Streamlit UI that lists matches and renders the champion-model 1x2 prediction is the system's user-facing demonstration of the end-to-end pipeline. Without it, the value of the data, training, and serving layers is invisible to a non-operator visitor.

Current state: src/ui/app/pages/ is empty. The Streamlit app exposes livescores only; predictions are not rendered. APIClient lacks /predict/* methods.

Target:

  • src/ui/app/pages/predictions.py — match list (past / current / future) with the champion-model 1x2 prediction per match.
  • src/ui/app/pages/model_metrics.py — historical quality metrics (accuracy, log-loss, calibration, historical ROI) for champion and challenger models, read from MLflow / evaluation artifacts, as information only (no strategy recommendations).
  • APIClient extended with /predict/{match_id} and /predict/model/info.

Scope: src/ui/app/pages/, src/ui/app/shared/api_client.py. No changes to serving or training code.
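The APIClient extension could be sketched as follows. This is a minimal stand-in, assuming the real class in src/ui/app/shared/api_client.py holds a requests-style session; only the two endpoint paths come from the target above, everything else (constructor shape, error handling) is illustrative.

```python
class APIClient:
    """Minimal sketch of the two new prediction methods; the real client's
    base-URL and error handling may differ."""

    def __init__(self, base_url: str, session=None):
        self.base_url = base_url.rstrip("/")
        if session is None:
            import requests  # assumed HTTP client dependency
            session = requests.Session()
        self.session = session

    def get_prediction(self, match_id: int) -> dict:
        # GET /predict/{match_id} — champion-model 1x2 prediction for one match
        resp = self.session.get(f"{self.base_url}/predict/{match_id}")
        resp.raise_for_status()
        return resp.json()

    def get_model_info(self) -> dict:
        # GET /predict/model/info — champion model name / version / alias
        resp = self.session.get(f"{self.base_url}/predict/model/info")
        resp.raise_for_status()
        return resp.json()
```

Keeping the session injectable makes the new methods unit-testable from the UI layer without a live API.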

v1.2 Production training parameters (DoD-03)

Architectural reason: The current params.yaml is in smoke mode (classification.fracs_for_train=[0.001, 0.002], tuning.n_trials=2). A model trained with these parameters cannot be honestly described as a champion. v1.0 requires the registered champion to be trained with parameters representative of the production regime.

Current state: Smoke parameters active by default. Active MLflow experiment is matches_clf_smoke.

Target: Production-scale fracs_for_train and n_trials in params.yaml; one full dvc repro cycle producing a registered champion in a non-smoke experiment.

Scope: params.yaml, one full pipeline run, MLflow tag review.
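The shape of the change to params.yaml might look like the following. The values are placeholders to illustrate scale only (not tuned settings), and the experiment key is an assumption about how the non-smoke experiment name is configured.

```yaml
# Illustrative only — real production values must be chosen against the
# available compute budget; key names mirror the smoke settings quoted above.
classification:
  fracs_for_train: [0.5, 1.0]   # placeholder, up from [0.001, 0.002]
tuning:
  n_trials: 50                  # placeholder, up from 2
mlflow:
  experiment: matches_clf       # non-smoke experiment name (assumed key)
```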

v1.3 Docs ↔ code reconciliation (DoD-04)

Architectural reason: Several docs/status.md claims contradict the code (UI Streamlit predictions claim, GE-gate naming) and the contract test in tests/contract/test_pipeline_contracts.py is CI-red because EXPECTED_STAGES references validate_interim, which is absent from dvc.yaml. Documentation that contradicts code is worse than no documentation.

Current state: 2 formal contradictions (C-01, C-02) and 1 stale-wording finding (#37) carried over from audit cycle 2026-04-28.

Target: Resolve contradictions in either direction (fix code or fix docs) so that no ✅ Operational claim in docs/status.md is unsupported and the contract test is green.

Scope: docs/status.md, tests/contract/test_pipeline_contracts.py, possibly dvc.yaml.
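The contract being repaired can be expressed as a simple set comparison between the test's expected stages and what dvc.yaml actually declares. The stage names below are hypothetical, not the repository's actual list; the point is that a stale entry like validate_interim surfaces as a non-empty difference.

```python
# Hypothetical stage names, standing in for the real EXPECTED_STAGES list
# in tests/contract/test_pipeline_contracts.py.
EXPECTED_STAGES = {"prepare", "featurize", "train", "register_model"}

def dvc_stage_names(dvc_config: dict) -> set:
    """Stage names from a parsed dvc.yaml structure."""
    return set(dvc_config.get("stages", {}))

def check_contract(dvc_config: dict) -> set:
    """Stages the test expects but dvc.yaml does not declare (empty = green)."""
    return EXPECTED_STAGES - dvc_stage_names(dvc_config)
```

Whichever direction the fix goes (edit the test or edit dvc.yaml), the green condition is the same: this difference must be empty.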

v1.4 Public-surface guardrails (DoD-05)

Architectural reason: The public deployment is intentionally unauthenticated (see Non-Goals). To make this safe, the surface must be read-only, rate-limited, and clearly labelled as a demo.

Current state: CORS allow_origins=["*"], no rate limiting, no disclaimer.

Target:

  • nginx ingress rate limit on the public UI / API host.
  • Visible "demo only — not betting advice" disclaimer in the UI footer.
  • CORS narrowed to the deployed UI origin.

Scope: k8s/helm/, src/ui/app/shared/, FastAPI CORS middleware config.
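The CORS narrowing can be isolated into one testable function; a minimal sketch, assuming the origin is injected via a UI_ORIGIN environment variable (the variable name and the default URL are illustrative, not real config):

```python
import os

def cors_settings() -> dict:
    """CORS kwargs pinned to the deployed UI origin instead of a wildcard."""
    ui_origin = os.environ.get("UI_ORIGIN", "https://demo.example.com")
    return {
        "allow_origins": [ui_origin],  # replaces allow_origins=["*"]
        "allow_methods": ["GET", "POST"],
        "allow_headers": ["*"],
        "allow_credentials": False,    # no auth on the public surface
    }
```

In the FastAPI app this would be applied as `app.add_middleware(CORSMiddleware, **cors_settings())`, and a unit test can assert the wildcard origin never reappears.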

Explicitly deferred from v1.0 (kept here for traceability):

  • Champion-vs-challenger gate (R6)
  • Model hot-reload on alias change (R3)
  • Automated retrain DAG (R2 / R5 / D-03)
  • Evidently drift detection (R7)
  • Grafana dashboards + Prometheus alerting (OPS-04, OR-04)
  • Authenticated /predict/* (SRV-01)
  • Online model selection from the UI
  • Neural-network challengers

These remain in the Near-term / Mid-term / Long-term sections below.


Near-term (0–3 months, post-v1)

1. Automated Staging → Production Promotion Policy

Current state: Model promotion from Staging to Production (champion alias) is manual. A reviewer must inspect MLflow metrics and manually update the alias.

Problem: Manual gates are only as reliable as the discipline with which they are applied. A promotion made without review degrades model quality silently.

Target: Define an explicit metric threshold policy (e.g., log_loss < X on holdout set) enforced by the register_model DVC stage or a post-training CI step. The system should block promotion if the policy is not met, and optionally notify the operator.

Scope: src/pipelines/register_model.py + MLflow client automation + CI gate.
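The gate itself reduces to a pure policy check that the register_model stage (or a CI step) can run before touching the champion alias. A minimal sketch, assuming holdout metrics are available as a dict pulled from the MLflow run; the threshold values are placeholders, not a tuned policy:

```python
# Placeholder policy: metric -> (bound kind, bound value).
PROMOTION_POLICY = {
    "log_loss": ("max", 1.0),   # candidate log_loss must be below this
    "accuracy": ("min", 0.45),  # candidate accuracy must be at least this
}

def may_promote(metrics: dict) -> tuple:
    """Return (ok, violations) for a candidate model's holdout metrics."""
    violations = []
    for name, (kind, bound) in PROMOTION_POLICY.items():
        value = metrics.get(name)
        if value is None:
            violations.append(f"{name}: metric missing")
        elif kind == "max" and value >= bound:
            violations.append(f"{name}={value} not below {bound}")
        elif kind == "min" and value < bound:
            violations.append(f"{name}={value} below {bound}")
    return (not violations, violations)
```

The calling script would exit non-zero when the check fails, which is what lets a CI gate block promotion and surface the violations to the operator.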


2. Grafana Dashboards

Architectural reason: Observability is a stated quality attribute of this system. Prometheus metrics are already collected; the gap is visualization. Without dashboards, the observability layer is instrumented but not operationally usable.

Current state: Prometheus collects metrics across FastAPI, Celery workers, RabbitMQ, and cluster infrastructure. Grafana is deployed but dashboards are not defined.

Problem: Metrics are not actionable without a dashboard — an operator cannot assess service health at a glance.

Target: Define and provision at minimum:

  • Inference service dashboard (request rate, p50/p95 latency, error rate, cache hit ratio).
  • Celery queue dashboard (queue depth per queue, task processing rate).
  • Infrastructure dashboard (CPU, memory, node metrics from kube-state-metrics + node-exporter).

Scope: Grafana dashboard JSON definitions in k8s/helm/monitoring/.


3. Prometheus Alerting Rules

Architectural reason: The system's reliability requirement depends on detecting failures before they become extended outages. Purely reactive detection via manual inspection does not meet single-maintainer operability requirements.

Current state: Prometheus scrapes metrics but no alerting rules are configured. Failures are detected reactively (Airflow UI, K8s events, or manual log inspection).

Target: Define alerting rules for:

  • API error rate > threshold.
  • Celery queue depth > threshold (stuck inference).
  • No scraping job completed in 24 h.
  • Pod CrashLoopBackOff.

Scope: Prometheus alerting rules in k8s/helm/monitoring/.
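The first rule from the list above could be sketched like this. The http_requests_total metric name and the 5% / 10-minute thresholds are assumptions that must be aligned with what the FastAPI service actually exports:

```yaml
groups:
  - name: demo-alerts
    rules:
      - alert: ApiErrorRateHigh
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "API 5xx error rate above 5% for 10 minutes"
```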


Mid-term (3–9 months)

4. Evidently Offline Drift Reports

Current state: Drift detection is architecturally designed but not implemented. The system logs prediction inputs but does not analyze distribution shifts.

Target: Scheduled batch job (Airflow DAG) that:

  1. Loads recent prediction inputs from PostgreSQL or MinIO.
  2. Runs an Evidently comparison against the training data distribution.
  3. Writes an HTML report to MinIO.
  4. Links the report from docs/evidence/monitoring.md.

Scope: New Airflow DAG + src/monitoring/drift.py + MinIO artifact store + MkDocs link.

Not yet: No automated retraining trigger based on drift (see item 5).
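The comparison in step 2 would be done by Evidently's report API in the real DAG; as a stand-in, the kind of statistic involved can be illustrated with a population stability index (PSI) over one feature, binned on the reference distribution:

```python
import math
from bisect import bisect_right

def psi(reference: list, current: list, bins: int = 10) -> float:
    """Population stability index of `current` vs `reference`, with bins laid
    over the reference range and small-count smoothing to avoid log(0)."""
    lo, hi = min(reference), max(reference)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def shares(values):
        counts = [0] * bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        return [(c + 1e-6) / (len(values) + bins * 1e-6) for c in counts]

    ref_s, cur_s = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_s, cur_s))
```

A PSI near zero means the input distribution matches training; a large value is exactly the signal the scheduled report would surface (and, later, feed into the retraining trigger of item 5).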


5. Formalized Retraining Triggers

Architectural reason: The system's prediction quality degrades over time as match statistics evolve (team form, tactical changes, new seasons). Without a defined trigger, the model training cadence is undocumented, ad hoc, and dependent on operator judgment rather than system policy.

Current state: Retraining is manual — the operator runs dvc repro when new data is available.

Target: Define and implement at least one of:

  • Time-based trigger (Airflow DAG at fixed cadence: weekly/monthly).
  • Data-volume trigger (N new matches ingested since the last training run).
  • Drift trigger (Evidently report exceeds threshold — depends on item 4).

Scope: Airflow DAG + trigger condition logic + CI/CD integration with dvc repro.
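The trigger condition logic is small enough to sketch directly; the 500-match and 30-day thresholds below are placeholders for whatever the policy eventually fixes:

```python
from datetime import datetime, timedelta, timezone

RETRAIN_AFTER_MATCHES = 500          # placeholder data-volume threshold
RETRAIN_AFTER = timedelta(days=30)   # placeholder time-based cadence

def should_retrain(new_matches: int, last_trained_at: datetime,
                   now: datetime = None) -> bool:
    """Fire on either the data-volume or the time-based condition."""
    now = now or datetime.now(timezone.utc)
    return (new_matches >= RETRAIN_AFTER_MATCHES
            or now - last_trained_at >= RETRAIN_AFTER)
```

An Airflow sensor or short-circuit operator evaluating this predicate would gate the downstream `dvc repro` task, turning the currently ad hoc cadence into documented policy.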


6. Cache Invalidation on Model Promotion

Current state: Redis cache is TTL-based. When a new model is promoted to champion, stale predictions from the previous model remain in cache until TTL expires.

Target: On model promotion, emit an event (or hook) that flushes the Redis prediction cache. Mechanism: post-promotion script or Celery task triggered by registry alias change.

Scope: src/app/tasks/ + model registration script.
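The flush itself can be scoped to prediction keys only, so unrelated cached data (e.g. livescores) survives promotion. A duck-typed sketch, assuming a "predict:" key prefix (an assumption about the cache key scheme) and any client exposing the redis-py scan_iter/delete methods:

```python
def flush_prediction_cache(client, prefix: str = "predict:") -> int:
    """Delete all cached predictions; returns the number of keys removed.
    `client` is anything with redis-py's scan_iter/delete interface."""
    deleted = 0
    for key in client.scan_iter(match=prefix + "*"):
        client.delete(key)
        deleted += 1
    return deleted
```

Whether invoked from a post-promotion script or a Celery task, scanning by prefix keeps the operation idempotent and avoids a blanket FLUSHDB.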


Long-term (9+ months)

7. High-Availability Kubernetes (if scale justifies)

Current state: Single-node K8s on healserver. No HA.

Consideration: If prediction volume or data ingestion frequency grows significantly, or if the project moves toward multi-user / multi-tenant serving, a managed K8s cluster (GKE, EKS, or AKS) would provide automatic failover, node autoscaling, and managed control plane.

Decision criteria: Volume above roughly 1,000 requests/day, or sustained operational issues with the single-node setup.

Note: Helm charts are already parameterized for portability. Migration requires only config changes.


8. Online Feature Store

Current state: Features are assembled at inference time from historical rolling statistics. This works for the current prediction horizon (future matches known in advance).

Consideration: If the prediction use case expands to include in-game or near-real-time events, an online feature store (e.g., Feast, Hopsworks, or Redis-backed feature registry) would provide low-latency feature retrieval without repeated computation.

Decision criteria: Use case requires features updated faster than batch pipeline cadence.


9. Streaming Ingestion

Current state: Data ingested in scheduled batches (Airflow DAG → Selenoid → PostgreSQL). This matches the current prediction use case: future matches are known in advance and predictions do not need to respond to sub-hour events.

Consideration: Only justified if the prediction use case changes to require in-game or near-real-time event data. No such requirement exists today.

Decision criteria: New prediction targets requiring sub-hour data freshness AND a data provider that supports streaming delivery. Both conditions must hold; absent them, batch ingestion is correct.


What Is Not on the Roadmap

  • Betting execution or portfolio management automation.
  • Support for sports other than football.
  • Multi-tenant user management or per-user prediction APIs.
  • Real-time UI beyond the existing Streamlit interface.