
Design Trade-offs

Key architectural decisions, the alternatives that were considered, and why each choice was made. This page is the complement to the ADR section — ADRs are formal records; this page is the readable summary with rationale.

Each decision follows the template: Context → Decision → Alternatives → Why chosen → Consequences → Current status → Revisit when.


Decisions Driven by Constraints

These bullet points summarize how the system's hard constraints directly forced or ruled out specific architectural choices. They provide the "why" for decisions that might otherwise look like arbitrary preferences.

  • Single maintainer → avoid distributed complexity: Kafka, Spark, and managed queues introduce operational surface that a single operator cannot own reliably. Celery + RabbitMQ is understood, debuggable, and has known recovery paths.
  • Low cost → self-hosted everything: S3, W&B, Neptune, and managed Kubernetes are all rejected on budget grounds. MinIO, self-hosted MLflow, and a single-VPS K8s cluster provide equivalent capability at near-zero licensing cost.
  • Reproducibility → DVC + MLflow + dependency locking: The requirement that any historical model be reproducible from a clean checkout makes artifact-driven pipelines and content-addressed storage mandatory, not optional.
  • No HA requirement → single-node K8s acceptable: Without uptime SLAs, full infrastructure redundancy is not justified. Single-node K8s is documented as a known limitation, not a design gap.
  • External unstable data → validation layer (Great Expectations): WhoScored.com can change layout without notice. Great Expectations gates prevent silently broken data from propagating to training or serving.
  • Batch retraining acceptable → no streaming architecture: Because predictions are not required to update in real time, a calendar-driven Airflow + DVC pipeline is sufficient. No Kafka, Flink, or real-time feature store is warranted.
  • No API authentication requirement → no auth infrastructure: The API is exposed to a known, operator-controlled frontend. No multi-tenant access or public API surface justifies authentication infrastructure today.

1. Pipeline orchestration: DVC + Airflow

Context: The system needs both scheduled data ingestion (recurring, calendar-driven) and reproducible ML pipelines (artifact-driven, triggered by data changes).

Decision: Use Airflow for ETL scheduling and DVC for ML pipeline orchestration.

Alternatives considered:

| Option | Why rejected |
|---|---|
| Airflow for everything | ML stages have artifact dependencies, not time dependencies. Airflow graphs for DVC-style pipelines are awkward and lose reproducibility. |
| DVC for everything | DVC has no scheduler; it cannot drive daily scraping jobs. |
| Prefect / Dagster | Extra infra complexity; no established integration with the DVC artifact graph. |

Why chosen: Each tool is used exclusively for what it was designed for. Airflow owns scheduling; DVC owns artifact-driven reproducibility.
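The split can be sketched as a minimal Airflow DAG whose final task hands off to DVC. This is an illustrative configuration sketch, not the project's actual DAG: the `dag_id`, task names, module paths, and repository location are all assumptions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Illustrative only: Airflow owns the calendar-driven schedule;
# `dvc repro` owns the artifact-driven rebuild of downstream stages.
with DAG(
    dag_id="daily_ingest",           # assumed name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    scrape = BashOperator(task_id="scrape", bash_command="python -m etl.scrape")
    validate = BashOperator(task_id="validate", bash_command="python -m etl.validate")
    # Handoff point: DVC only rebuilds stages whose inputs changed.
    repro = BashOperator(task_id="dvc_repro", bash_command="cd /opt/project && dvc repro")

    scrape >> validate >> repro
```

The handoff keeps the reproducibility guarantee inside DVC: even when Airflow triggers the run, what gets rebuilt is determined by artifact hashes, not by the clock.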

Consequences:

- ETL and ML pipelines are owned by different tools — two places to check on failures.
- Benefit: clean separation; each tool's failure mode is isolated.

Current status: ✅ Both operational. See ADR-0001.

Revisit when: DVC–Airflow handoff becomes a persistent operational friction point, or when a unified orchestrator (e.g., Prefect, Dagster) clearly reduces total complexity.


2. Data versioning: DVC + MinIO

Context: Training data must be reproducible — every model must be traceable to the exact dataset used.

Decision: DVC for version control, MinIO as the remote storage backend (S3-compatible, self-hosted).

Alternatives considered:

| Option | Why rejected |
|---|---|
| LakeFS | Requires an additional Kubernetes service; adds storage overhead. DVC is simpler and Git-native. |
| Delta Lake | Heavier; requires Spark-compatible infrastructure; overkill for current data volume. |
| Git LFS | Not designed for ML datasets; poor DVC integration. |
| S3 (AWS) | Cost and SaaS dependency; MinIO gives the same API on-premises. |

Why chosen: DVC provides content-addressed, Git-native versioning with zero SaaS cost. MinIO provides S3 API compatibility at self-hosted cost.
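Content addressing is the property doing the work here: DVC stores each file in its cache under a digest of the file's bytes (MD5 for single files), so a Git commit pins exact data by hash. A minimal stdlib illustration of the idea (the helper name is ours, not DVC's):

```python
import hashlib

def content_address(data: bytes) -> str:
    # Content addressing: the storage key is a digest of the bytes
    # themselves, so identical content always lands at the same key
    # and any change produces a new one. (DVC uses MD5 this way.)
    return hashlib.md5(data).hexdigest()

# Equal content, equal address; changed content, new address.
assert content_address(b"2024,home,2,1\n") == content_address(b"2024,home,2,1\n")
assert content_address(b"2024,home,2,1\n") != content_address(b"2024,home,2,2\n")
```

This is why `dvc checkout` of any historical commit restores the exact dataset: the commit records the hashes, and the cache maps hashes back to bytes.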

Consequences:

- Every dataset version is content-addressed; `dvc checkout <commit>` restores any historical state.
- MinIO requires operational care (storage, backups) but has zero licensing cost.

Current status: ✅ Operational. See ADR-0002.

Revisit when: Data volume requires lakehouse capabilities (e.g., ACID transactions, schema evolution), or a managed cloud environment removes the self-hosting constraint.


3. Experiment tracking and model registry: MLflow

Context: ML experiments need parameter logging, metric comparison, artifact storage, and a model registry with staging/production lifecycle.

Decision: MLflow (self-hosted, backed by local filesystem + MinIO artifacts).

Alternatives considered:

| Option | Why rejected |
|---|---|
| Weights & Biases | SaaS dependency; cost at scale; data leaves the system. |
| Neptune.ai | Same SaaS concerns. |
| Comet | Same SaaS concerns. |
| Custom logging | Reinvents the wheel; no registry lifecycle management. |

Why chosen: Full control over experiment data and artifact storage. No SaaS cost or data egress.

Consequences:

- Ops overhead: the MLflow server must be running for tracking to work.
- Promotion aliases (champion, challenger) are used consistently across code and docs.
- Manual promotion gate today; an automated policy is on the roadmap.

Current status: ✅ Operational. Model promotion gate is currently manual. See ADR-0003.

Revisit when: Collaboration requirements or regulatory compliance justify a hosted ML platform; or when the volume of experiments makes self-hosted MLflow operationally burdensome.


4. Sync vs async inference: FastAPI + Celery/RabbitMQ

Context: Prediction requests have two use cases — interactive (user waits) and batch-style (background, user polls for result).

Decision: FastAPI handles HTTP; Celery with RabbitMQ handles async task execution. Sync endpoint dispatches to Celery ml queue with a 30 s timeout.

Alternatives considered:

| Option | Why rejected |
|---|---|
| Kafka | Significant ops overhead for current throughput; consumer-group complexity not justified. |
| AWS SQS | SaaS dependency; couples the infrastructure to AWS. |
| Redis as broker | Less durable than RabbitMQ for task queues; no native dead-letter support. |
| Inline sync-only | Cannot support long-running batch workloads without blocking the HTTP worker. |

Why chosen: Celery + RabbitMQ is the standard Python async task pattern; well-understood ops; supports both sync-dispatch and fire-and-forget modes.

Consequences:

- Two inference paths (sync + async) require both FastAPI and Celery worker pods to be healthy.
- RabbitMQ is a single broker today (no clustering) — acceptable at current scale.
- The async path returns a task_id for polling via the status endpoint.
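The polling contract on the async path can be sketched as a small pure function. Everything here is illustrative: the state strings mirror Celery's conventions, but the real endpoint's field names and response shape are not specified in this document.

```python
from typing import Any, Optional

def status_body(task_id: str, state: str, result: Optional[Any] = None) -> dict:
    # Hypothetical sketch of the status endpoint's mapping from a
    # Celery task state to a JSON response body. Field names assumed.
    if state == "SUCCESS":
        return {"task_id": task_id, "status": "done", "result": result}
    if state == "FAILURE":
        return {"task_id": task_id, "status": "failed"}
    # PENDING / STARTED / RETRY: the client keeps polling with task_id.
    return {"task_id": task_id, "status": "pending"}

assert status_body("abc123", "SUCCESS", {"home_win": 0.61})["status"] == "done"
assert status_body("abc123", "STARTED")["status"] == "pending"
```

The same mapping serves both modes: the sync endpoint blocks on the result (with the 30 s timeout), while the fire-and-forget path hands the task_id back immediately.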

Current status: ✅ Both paths operational. See ADR-0005.

Revisit when: Request throughput or latency SLA cannot be met by Celery + RabbitMQ; or when use cases require a streaming inference path.


5. Deployment: Kubernetes + Helm on VPS

Context: The system needs a real deployment environment with service discovery, health probes, rolling updates, and resource limits.

Decision: Kubernetes on a self-managed VPS, deployed via Helm.

Alternatives considered:

| Option | Why rejected |
|---|---|
| Managed EKS / GKE | Cost. For a portfolio project, $200+/month is not justified. |
| Docker Compose only | No health probes, no rolling restarts, no resource limits. |
| Nomad | Less industry-standard; fewer available resources and integrations. |

Why chosen: Full K8s operational experience — networking, storage, ingress, secrets. Helm charts are portable to managed K8s without code changes.

Consequences:

- Full K8s operational discipline: networking, storage, ingress, secrets, rolling updates.
- Single-node cluster means no HA; documented as a known limitation.
- Helm charts are parameterized — trivial to move to managed K8s.

Current status: ✅ Deployed. Single-node VPS.

Revisit when: Operational load from single-node management exceeds benefit; or when prediction volume or multi-user requirements justify true HA.


6. Secrets management: SOPS + age

Context: Credentials for PostgreSQL, MinIO, RabbitMQ, and API keys must be version-controlled safely and injected at runtime without ever appearing in plaintext in the repo.

Decision: SOPS for encryption with age as the key provider. Encrypted files committed to Git. Decryption key stored in GitLab CI protected variables.

Alternatives considered:

| Option | Why rejected |
|---|---|
| HashiCorp Vault | Heavy infrastructure; overkill for a one-maintainer project. |
| AWS Secrets Manager | SaaS dependency; cost; couples to AWS. |
| .env files not committed | No version control; no auditability; easy to lose. |
| GPG-only | More complex key management; age is simpler and safer by default. |

Why chosen: Secrets are version-controlled (encrypted), auditable, and never require a separate secrets service. age is simpler than GPG for single-key use cases.

Consequences:

- Secrets are auditable in Git history (in encrypted form).
- Key rotation requires re-encrypting all secret files.
- Zero secrets ever appear in logs, build artifacts, or container images.

Current status: ✅ Operational. See ADR-0004 and Security.

Revisit when: Team size or compliance requirements justify a dedicated secrets management service (e.g., Vault); or when key rotation frequency makes SOPS re-encryption impractical.


7. Prediction and feature cache: Redis

Context: The sync inference path (POST /predict) involves Celery task dispatch and feature assembly. For repeated queries about the same match (e.g., polling from the UI), full re-inference is unnecessary.

Decision: Redis as an in-cluster cache for prediction results and assembled feature vectors. Cache key is a hash of the input feature vector. TTL-based expiry.

Alternatives considered:

| Option | Why rejected |
|---|---|
| In-process dict (per worker) | Not shared across worker processes; cache misses on every new task dispatch. |
| PostgreSQL as cache | Wrong tool; read/write overhead; not designed for TTL-based ephemeral data. |
| Memcached | Similar capability; Redis chosen for richer data structures and existing familiarity in the stack. |
| No cache | Adds unnecessary latency for repeated queries; wastes compute on redundant inference. |

Why chosen: Redis provides a shared, fast, TTL-managed cache visible to all Celery workers and FastAPI instances. Low operational overhead in K8s.
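The key scheme ("a hash of the input feature vector") can be sketched in a few lines of stdlib Python. The canonical-JSON step, the key prefix, and the hash algorithm are assumptions; the document only specifies that the key is a hash of the features.

```python
import hashlib
import json

def prediction_cache_key(features: dict) -> str:
    # Canonicalize first (sorted keys, no whitespace) so two logically
    # equal feature dicts serialize to the same bytes, then hash.
    # The "pred:" prefix and SHA-256 choice are illustrative assumptions.
    canonical = json.dumps(features, sort_keys=True, separators=(",", ":"))
    return "pred:" + hashlib.sha256(canonical.encode()).hexdigest()

# Key order does not matter; any value change yields a new key.
assert prediction_cache_key({"home": 1, "away": 2}) == prediction_cache_key({"away": 2, "home": 1})
assert prediction_cache_key({"home": 1, "away": 2}) != prediction_cache_key({"home": 1, "away": 3})
```

Note that the model identity is not part of the key, which is consistent with the stale-until-TTL behavior noted in the consequences: entries written before a promotion remain valid keys afterward.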

Consequences:

- Cache invalidation on model promotion is not yet implemented (stale results are served until TTL expiry).
- Single Redis instance today — no HA; Redis unavailability degrades performance but does not break inference.
- The TTL strategy must be tuned: too short = poor cache hit rate; too long = stale predictions post-match.

Current status: ✅ Implemented. Cache invalidation on promotion is 📋 Planned.

Decision: no cache invalidation on model promotion (current state): When a new champion model is promoted, existing Redis cache entries are not flushed, so predictions from the previous model are served until TTL expires. This is an explicit and accepted trade-off:

- Implementing promotion-triggered invalidation requires a coordination hook between the offline pipeline (MLflow promotion event) and the runtime cache (Redis flush) — a coupling across the offline/online boundary.
- Predictions are probabilistic and advisory; a bounded stale window is not a correctness failure.
- The TTL bound limits exposure; the trade-off is revisited once the consistency window becomes operationally unacceptable.

Revisit when: TTL-based invalidation creates unacceptable consistency windows post-promotion; or when data access patterns change significantly (e.g., fewer repeated queries).


Summary Table

| Decision | Choice | Key reason | Status |
|---|---|---|---|
| ML pipeline orchestration | DVC | Git-native, artifact-driven, reproducible | ✅ |
| ETL scheduling | Airflow | Calendar-driven, proper DAG retries | ✅ |
| Data versioning | DVC + MinIO | Content-addressed, zero SaaS dependency | ✅ |
| Experiment tracking | MLflow (self-hosted) | Full control, no SaaS cost | ✅ |
| Async inference | Celery + RabbitMQ | Simpler ops than Kafka at current scale | ✅ |
| Deployment | Kubernetes + Helm | Real production parity, portable charts | ✅ |
| Secrets | SOPS + age | Encrypted-in-Git, no external secrets service | ✅ |
| Prediction cache | Redis | Shared fast cache, TTL-based, low ops overhead | ✅ |