Lessons Learned

Honest retrospective: what worked well, what didn't, and what I'd change next time.

Work in progress

This page is a draft and will be finalized at project completion.

What Worked Well

  • DVC + MLflow together — clear separation between data versioning and experiment tracking; made pipeline reproducibility straightforward.
  • Temporal split enforced early — catching leakage at the test level prevented subtle bugs from propagating to production metrics.
  • Hydra for config — switching between dev/prod/test configs without code changes was a big productivity win.
  • FastAPI dependency injection — kept route handlers thin and made testing services easy.
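
The temporal-split point above can be made concrete. A minimal sketch of the kind of leakage guard used at the test level — the `match_date` field and the cutoff value are illustrative, not the project's actual schema:

```python
from datetime import date

def temporal_split(rows, cutoff):
    """Split rows strictly by time: everything on or before the cutoff
    date goes to train, everything after it goes to test."""
    train = [r for r in rows if r["match_date"] <= cutoff]
    test = [r for r in rows if r["match_date"] > cutoff]
    return train, test

def assert_no_leakage(train, test):
    """Guard: the newest training row must strictly predate the oldest
    test row, otherwise future information leaks into training."""
    if train and test:
        assert max(r["match_date"] for r in train) < min(r["match_date"] for r in test)

rows = [
    {"match_date": date(2023, 1, 10), "y": 0},
    {"match_date": date(2023, 6, 1), "y": 1},
    {"match_date": date(2024, 2, 3), "y": 1},
]
train, test = temporal_split(rows, cutoff=date(2023, 12, 31))
assert_no_leakage(train, test)
```

Running a check like this in the test suite is what catches an accidental random split before it reaches production metrics.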

What Was Harder Than Expected

  • Kubernetes networking on a single-node VPS — getting Ingress + cert-manager + internal service resolution working took significant iteration.
  • Feature parity between offline and online paths — ensuring that features computed in the DVC pipeline exactly matched what the API computes at inference time required explicit contracts and tests.
  • Airflow on Docker with custom dependencies — conflicts between Airflow's bundled packages and project requirements forced a custom Dockerfile build.
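
The offline/online parity problem boils down to having a single source of truth for feature computation. A hedged sketch of the contract-plus-test pattern — the feature names (`goal_diff`, `win_rate`) are hypothetical stand-ins for the real feature set:

```python
def compute_features(raw):
    """Single source of truth: both the offline (DVC) pipeline and the
    online API import this function, so the two paths cannot drift."""
    return {
        "goal_diff": raw["goals_for"] - raw["goals_against"],
        "win_rate": raw["wins"] / max(raw["played"], 1),
    }

def check_parity(sample_rows, offline_features):
    """Parity test: replay a sample of offline rows through the online
    code path and require exact agreement with the stored features."""
    for raw, expected in zip(sample_rows, offline_features):
        assert compute_features(raw) == expected

feats = compute_features({"goals_for": 5, "goals_against": 2, "wins": 3, "played": 4})
check_parity(
    [{"goals_for": 5, "goals_against": 2, "wins": 3, "played": 4}],
    [feats],
)
```

The key design choice is that the API never reimplements a feature: it imports the same function the batch pipeline uses, and the parity test exists to catch anyone who breaks that rule.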

What I'd Do Differently

  • Set up a proper feature store (e.g., Redis + Feast) earlier to avoid the offline/online parity problem.
  • Use a managed database (even free tier) to avoid ops overhead on PostgreSQL in K8s.
  • Start with structured logging and log aggregation from day one instead of retrofitting.
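
"Structured logging from day one" is cheap to set up. A minimal stdlib-only sketch — no third-party logger assumed — that emits one JSON object per line so an aggregator can filter on fields instead of regexes:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line with stable keys."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exc"] = self.formatException(record.exc_info)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("api").info("prediction served")
```

Retrofitting this later means touching every log call site and every dashboard query, which is exactly the overhead the bullet above warns about.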

Open Questions

  • When to introduce streaming ingestion (Kafka) vs. the current Airflow batch approach.
  • Whether a dedicated vector store adds value for match context embeddings.