
Design Trade-offs

Key architectural decisions, the alternatives that were considered, and why each choice was made. This page is the complement to the ADR section — ADRs are formal records; this page is the readable summary with rationale.

Each decision follows the template: Context → Decision → Alternatives → Why chosen → Consequences → Current status → Revisit when.


Decisions Driven by Constraints

These bullet points summarize how the system's hard constraints directly forced or ruled out specific architectural choices. They provide the "why" for decisions that might otherwise look like arbitrary preferences.

  • Single maintainer → avoid distributed complexity: Kafka, Spark, and managed queues introduce operational surface that a single operator cannot own reliably. Celery + RabbitMQ is understood, debuggable, and has known recovery paths.
  • Low cost → self-hosted everything: S3, W&B, Neptune, and managed Kubernetes are all rejected on budget grounds. MinIO, self-hosted MLflow, and a single-VPS K8s cluster provide equivalent capability at near-zero licensing cost.
  • Reproducibility → DVC + MLflow + dependency locking: The requirement that any historical model be reproducible from a clean checkout makes artifact-driven pipelines and content-addressed storage mandatory, not optional.
  • No HA requirement → single-node K8s acceptable: Without uptime SLAs, full infrastructure redundancy is not justified. Single-node K8s is documented as a known limitation, not a design gap.
  • External unstable data → validation layer (Great Expectations): WhoScored.com can change layout without notice. Great Expectations gates prevent silently broken data from propagating to training or serving.
  • Batch retraining acceptable → no streaming architecture: Because predictions are not required to update in real time, a calendar-driven Airflow + DVC pipeline is sufficient. No Kafka, Flink, or real-time feature store is warranted.
  • No API authentication requirement → no auth infrastructure: The API is exposed to a known, operator-controlled frontend. No multi-tenant access or public API surface justifies authentication infrastructure today.

1. Pipeline orchestration: DVC + Airflow

Context: The system needs both scheduled data ingestion (recurring, calendar-driven) and reproducible ML pipelines (artifact-driven, triggered by data changes).

Decision: Use Airflow for ETL scheduling and DVC for ML pipeline orchestration.

Alternatives considered:

| Option | Why rejected |
|---|---|
| Airflow for everything | ML stages have artifact dependencies, not time dependencies. Airflow graphs for DVC-style pipelines are awkward and lose reproducibility. |
| DVC for everything | DVC has no scheduler; it cannot drive daily scraping jobs. |
| Prefect / Dagster | Extra infra complexity; no established integration with the DVC artifact graph. |

Why chosen: Each tool is used exclusively for what it was designed for. Airflow owns scheduling; DVC owns artifact-driven reproducibility.
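The split can be sketched as a minimal Airflow DAG whose final task hands off to DVC. This is an illustrative configuration sketch, not the project's actual DAG: the `dag_id`, task names, module paths, and repository location are all assumptions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Illustrative only: Airflow owns the calendar-driven schedule;
# `dvc repro` owns the artifact-driven rebuild of downstream stages.
with DAG(
    dag_id="daily_ingest",           # assumed name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    scrape = BashOperator(task_id="scrape", bash_command="python -m etl.scrape")
    validate = BashOperator(task_id="validate", bash_command="python -m etl.validate")
    # Handoff point: DVC only rebuilds stages whose inputs changed.
    repro = BashOperator(task_id="dvc_repro", bash_command="cd /opt/project && dvc repro")

    scrape >> validate >> repro
```

The handoff keeps the reproducibility guarantee inside DVC: even when Airflow triggers the run, what gets rebuilt is determined by artifact hashes, not by the clock.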

Consequences:

- ETL and ML pipelines are owned by different tools — two places to check on failures.
- Benefit: clean separation; each tool's failure mode is isolated.

Current status: ✅ Both operational. See ADR-0001.

Revisit when: DVC–Airflow handoff becomes a persistent operational friction point, or when a unified orchestrator (e.g., Prefect, Dagster) clearly reduces total complexity.


2. Data versioning: DVC + MinIO

Context: Training data must be reproducible — every model must be traceable to the exact dataset used.

Decision: DVC for version control, MinIO as the remote storage backend (S3-compatible, self-hosted).

Alternatives considered:

| Option | Why rejected |
|---|---|
| LakeFS | Requires an additional Kubernetes service; adds storage overhead. DVC is simpler and Git-native. |
| Delta Lake | Heavier; requires Spark-compatible infrastructure; overkill for current data volume. |
| Git LFS | Not designed for ML datasets; poor DVC integration. |
| S3 (AWS) | Cost and SaaS dependency; MinIO gives the same API on-premises. |

Why chosen: DVC provides content-addressed, Git-native versioning with zero SaaS cost. MinIO provides S3 API compatibility at self-hosted cost.
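Content addressing is the property doing the work here: DVC stores each file in its cache under a digest of the file's bytes (MD5 for single files), so a Git commit pins exact data by hash. A minimal stdlib illustration of the idea (the helper name is ours, not DVC's):

```python
import hashlib

def content_address(data: bytes) -> str:
    # Content addressing: the storage key is a digest of the bytes
    # themselves, so identical content always lands at the same key
    # and any change produces a new one. (DVC uses MD5 this way.)
    return hashlib.md5(data).hexdigest()

# Equal content, equal address; changed content, new address.
assert content_address(b"2024,home,2,1\n") == content_address(b"2024,home,2,1\n")
assert content_address(b"2024,home,2,1\n") != content_address(b"2024,home,2,2\n")
```

This is why `dvc checkout` of any historical commit restores the exact dataset: the commit records the hashes, and the cache maps hashes back to bytes.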

Consequences:

- Every dataset version is content-addressed; `dvc checkout <commit>` restores any historical state.
- MinIO requires operational care (storage, backups) but has zero licensing cost.

Current status: ✅ Operational. See ADR-0002.

Revisit when: Data volume requires lakehouse capabilities (e.g., ACID transactions, schema evolution), or a managed cloud environment removes the self-hosting constraint.


3. Experiment tracking and model registry: MLflow

Context: ML experiments need parameter logging, metric comparison, artifact storage, and a model registry with staging/production lifecycle.

Decision: MLflow (self-hosted, backed by local filesystem + MinIO artifacts).

Alternatives considered:

| Option | Why rejected |
|---|---|
| Weights & Biases | SaaS dependency; cost at scale; data leaves the system. |
| Neptune.ai | Same SaaS concerns. |
| Comet | Same SaaS concerns. |
| Custom logging | Reinvents the wheel; no registry lifecycle management. |

Why chosen: Full control over experiment data and artifact storage. No SaaS cost or data egress.

Consequences:

- Ops overhead: the MLflow server must be running for tracking to work.
- Promotion aliases (champion, challenger) are used consistently across code and docs.
- Manual promotion gate today; an automated policy is on the roadmap.

Current status: ✅ Operational. Model promotion gate is currently manual. See ADR-0003.

Revisit when: Collaboration requirements or regulatory compliance justify a hosted ML platform; or when the volume of experiments makes self-hosted MLflow operationally burdensome.


4. Sync vs async inference: FastAPI + Celery/RabbitMQ

Context: Prediction requests have two use cases — interactive (user waits) and batch-style (background, user polls for result).

Decision: FastAPI handles HTTP; Celery with RabbitMQ handles async task execution. Sync endpoint dispatches to Celery ml queue with a 30 s timeout.

Alternatives considered:

| Option | Why rejected |
|---|---|
| Kafka | Significant ops overhead for current throughput; consumer-group complexity not justified. |
| AWS SQS | SaaS dependency; couples the infrastructure to AWS. |
| Redis as broker | Less durable than RabbitMQ for task queues; no native dead-letter support. |
| Inline sync-only | Cannot support long-running batch workloads without blocking the HTTP worker. |

Why chosen: Celery + RabbitMQ is the standard Python async task pattern; well-understood ops; supports both sync-dispatch and fire-and-forget modes.

Consequences:

- Two inference paths (sync + async) require both FastAPI and Celery worker pods to be healthy.
- RabbitMQ is a single broker today (no clustering) — acceptable at current scale.
- The async path returns a task_id for polling via the status endpoint.
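The polling contract on the async path can be sketched as a small pure function. Everything here is illustrative: the state strings mirror Celery's conventions, but the real endpoint's field names and response shape are not specified in this document.

```python
from typing import Any, Optional

def status_body(task_id: str, state: str, result: Optional[Any] = None) -> dict:
    # Hypothetical sketch of the status endpoint's mapping from a
    # Celery task state to a JSON response body. Field names assumed.
    if state == "SUCCESS":
        return {"task_id": task_id, "status": "done", "result": result}
    if state == "FAILURE":
        return {"task_id": task_id, "status": "failed"}
    # PENDING / STARTED / RETRY: the client keeps polling with task_id.
    return {"task_id": task_id, "status": "pending"}

assert status_body("abc123", "SUCCESS", {"home_win": 0.61})["status"] == "done"
assert status_body("abc123", "STARTED")["status"] == "pending"
```

The same mapping serves both modes: the sync endpoint blocks on the result (with the 30 s timeout), while the fire-and-forget path hands the task_id back immediately.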

Current status: ✅ Both paths operational. See ADR-0005.

Revisit when: Request throughput or latency SLA cannot be met by Celery + RabbitMQ; or when use cases require a streaming inference path.


5. Deployment: Kubernetes + Helm on VPS

Context: The system needs a real deployment environment with service discovery, health probes, rolling updates, and resource limits.

Decision: Kubernetes on a self-managed VPS, deployed via Helm.

Alternatives considered:

| Option | Why rejected |
|---|---|
| Managed EKS / GKE | Cost. For a portfolio project, $200+/month is not justified. |
| Docker Compose only | No health probes, no rolling restarts, no resource limits. |
| Nomad | Less industry-standard; fewer available resources and integrations. |

Why chosen: Full K8s operational experience — networking, storage, ingress, secrets. Helm charts are portable to managed K8s without code changes.

Consequences:

- Full K8s operational discipline: networking, storage, ingress, secrets, rolling updates.
- Single-node cluster means no HA; documented as a known limitation.
- Helm charts are parameterized — trivial to move to managed K8s.

Current status: ✅ Deployed. Single-node VPS.

Revisit when: Operational load from single-node management exceeds benefit; or when prediction volume or multi-user requirements justify true HA.


6. Secrets management: SOPS + age

Context: Credentials for PostgreSQL, MinIO, RabbitMQ, and API keys must be version-controlled safely and injected at runtime without ever appearing in plaintext in the repo.

Decision: SOPS for encryption with age as the key provider. Encrypted files committed to Git. Decryption key stored in GitLab CI protected variables.

Alternatives considered:

| Option | Why rejected |
|---|---|
| HashiCorp Vault | Heavy infrastructure; overkill for a one-maintainer project. |
| AWS Secrets Manager | SaaS dependency; cost; couples to AWS. |
| .env files not committed | No version control; no auditability; easy to lose. |
| GPG-only | More complex key management; age is simpler and safer by default. |

Why chosen: Secrets are version-controlled (encrypted), auditable, and never require a separate secrets service. age is simpler than GPG for single-key use cases.

Consequences:

- Secrets are auditable in Git history (in encrypted form).
- Key rotation requires re-encrypting all secret files.
- Zero secrets ever appear in logs, build artifacts, or container images.

Current status: ✅ Operational. See ADR-0004 and Security.

Revisit when: Team size or compliance requirements justify a dedicated secrets management service (e.g., Vault); or when key rotation frequency makes SOPS re-encryption impractical.


7. Prediction and feature cache: Redis

Context: The sync inference path (POST /predict) involves Celery task dispatch and feature assembly. For repeated queries about the same match (e.g., polling from the UI), full re-inference is unnecessary.

Decision: Redis as an in-cluster cache for prediction results and assembled feature vectors. Cache key is a hash of the input feature vector. TTL-based expiry.

Alternatives considered:

| Option | Why rejected |
|---|---|
| In-process dict (per worker) | Not shared across worker processes; cache misses on every new task dispatch. |
| PostgreSQL as cache | Wrong tool; read/write overhead; not designed for TTL-based ephemeral data. |
| Memcached | Similar capability; Redis chosen for richer data structures and existing familiarity in the stack. |
| No cache | Adds unnecessary latency for repeated queries; wastes compute on redundant inference. |

Why chosen: Redis provides a shared, fast, TTL-managed cache visible to all Celery workers and FastAPI instances. Low operational overhead in K8s.
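The key scheme ("a hash of the input feature vector") can be sketched in a few lines of stdlib Python. The canonical-JSON step, the key prefix, and the hash algorithm are assumptions; the document only specifies that the key is a hash of the features.

```python
import hashlib
import json

def prediction_cache_key(features: dict) -> str:
    # Canonicalize first (sorted keys, no whitespace) so two logically
    # equal feature dicts serialize to the same bytes, then hash.
    # The "pred:" prefix and SHA-256 choice are illustrative assumptions.
    canonical = json.dumps(features, sort_keys=True, separators=(",", ":"))
    return "pred:" + hashlib.sha256(canonical.encode()).hexdigest()

# Key order does not matter; any value change yields a new key.
assert prediction_cache_key({"home": 1, "away": 2}) == prediction_cache_key({"away": 2, "home": 1})
assert prediction_cache_key({"home": 1, "away": 2}) != prediction_cache_key({"home": 1, "away": 3})
```

Note that the model identity is not part of the key, which is consistent with the stale-until-TTL behavior noted in the consequences: entries written before a promotion remain valid keys afterward.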

Consequences:

- Cache invalidation on model promotion is not yet implemented (stale results are served until TTL expiry).
- Single Redis instance today — no HA; Redis unavailability degrades performance but does not break inference.
- The TTL strategy must be tuned: too short = poor cache hit rate; too long = stale predictions post-match.

Current status: ✅ Implemented. Cache invalidation on promotion is 📋 Planned.

Decision: no cache invalidation on model promotion (current state): When a new champion model is promoted, existing Redis cache entries are not flushed, so predictions from the previous model are served until TTL expires. This is an explicit and accepted trade-off:

- Implementing promotion-triggered invalidation requires a coordination hook between the offline pipeline (MLflow promotion event) and the runtime cache (Redis flush) — a coupling across the offline/online boundary.
- Predictions are probabilistic and advisory; a bounded stale window is not a correctness failure.
- The TTL bound limits exposure; the trade-off is revisited once the consistency window becomes operationally unacceptable.

Revisit when: TTL-based invalidation creates unacceptable consistency windows post-promotion; or when data access patterns change significantly (e.g., fewer repeated queries).


Summary Table

| Decision | Choice | Key reason | Status |
|---|---|---|---|
| ML pipeline orchestration | DVC | Git-native, artifact-driven, reproducible | ✅ |
| ETL scheduling | Airflow | Calendar-driven, proper DAG retries | ✅ |
| Data versioning | DVC + MinIO | Content-addressed, zero SaaS dependency | ✅ |
| Experiment tracking | MLflow (self-hosted) | Full control, no SaaS cost | ✅ |
| Async inference | Celery + RabbitMQ | Simpler ops than Kafka at current scale | ✅ |
| Deployment | Kubernetes + Helm | Real production parity, portable charts | ✅ |
| Secrets | SOPS + age | Encrypted-in-Git, no external secrets service | ✅ |
| Prediction cache | Redis | Shared fast cache, TTL-based, low ops overhead | ✅ |