# Lessons Learned
Honest retrospective: what worked well, what didn't, and what I'd change next time.
> **Work in progress.** This page is still a draft; the full retrospective will be written at project completion.
## What Worked Well
- **DVC + MLflow together:** a clear separation between data versioning and experiment tracking made pipeline reproducibility straightforward.
- **Temporal split enforced early:** catching leakage at the test level prevented subtle bugs from propagating into production metrics.
- **Hydra for config:** switching between dev/prod/test configs without code changes was a major productivity win.
- **FastAPI dependency injection:** kept route handlers thin and made services easy to test in isolation.
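The "temporal split enforced early" point can be made concrete with a small guard test. This is an illustrative sketch, not the project's actual test suite: the row layout, column position of the date, and function names are all assumptions.

```python
from datetime import date

# Hypothetical rows of (match_date, label); the real schema is richer.
rows = [
    (date(2023, 1, 5), 0),
    (date(2023, 6, 1), 1),
    (date(2024, 2, 10), 0),
    (date(2024, 7, 3), 1),
]

def temporal_split(rows, cutoff):
    """Split strictly by date: rows on or before the cutoff train,
    rows after it test. No random shuffling across time."""
    train = [r for r in rows if r[0] <= cutoff]
    test = [r for r in rows if r[0] > cutoff]
    return train, test

def assert_no_leakage(train, test):
    """Test-level guard: the newest training row must be strictly
    older than the oldest test row."""
    if train and test:
        assert max(r[0] for r in train) < min(r[0] for r in test), \
            "temporal leakage: train dates overlap test dates"

train, test = temporal_split(rows, cutoff=date(2023, 12, 31))
assert_no_leakage(train, test)
```

Running a check like `assert_no_leakage` inside the test suite is what turns "we split by time" from a convention into an enforced invariant.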
## What Was Harder Than Expected
- **Kubernetes networking on a single-node VPS:** getting Ingress, cert-manager, and internal service resolution working together took significant iteration.
- **Feature parity between offline and online paths:** ensuring that features computed in the DVC pipeline exactly matched what the API computes at inference time required explicit contracts and tests.
- **Airflow on Docker with custom dependencies:** conflicts between Airflow's bundled packages and the project's requirements forced a custom Dockerfile.
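The "explicit contracts and tests" approach to offline/online parity can be sketched as a single shared feature function imported by both paths, plus a test that both paths agree. All names and the toy feature here are illustrative assumptions, not the project's real feature set.

```python
# One feature function is the single source of truth; both the batch
# (DVC) pipeline and the API inference path import it rather than
# reimplementing the logic.

def goal_diff_feature(record: dict) -> float:
    """Shared feature: goal difference, computed identically everywhere."""
    return float(record["goals_for"] - record["goals_against"])

def offline_features(record: dict) -> dict:
    # Batch pipeline path: delegates to the shared function.
    return {"goal_diff": goal_diff_feature(record)}

def online_features(record: dict) -> dict:
    # API inference path: delegates to the same shared function.
    return {"goal_diff": goal_diff_feature(record)}

# Parity test: both paths must produce identical features
# for the same raw record.
sample = {"goals_for": 3, "goals_against": 1}
assert offline_features(sample) == online_features(sample)
```

The test is cheap, but it only catches drift if the two paths genuinely share the record shape; schema contracts (e.g., validated input models) are the other half of the guarantee.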
## What I'd Do Differently
- Set up a proper feature store (e.g., Redis + Feast) earlier to avoid the offline/online parity problem.
- Use a managed database (even a free tier) to avoid the ops overhead of running PostgreSQL in Kubernetes.
- Start with structured logging and log aggregation from day one instead of retrofitting.
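"Structured logging from day one" can be as small as a JSON formatter on the standard library logger, so every line is machine-parseable from the start. A minimal sketch using only the Python stdlib; the field names (`request_id`, `model_version`) are hypothetical examples of context worth attaching.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so an aggregator can
    filter and query on fields instead of grepping free text."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Pick up optional context passed via logger.info(..., extra={...}).
        for key in ("request_id", "model_version"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("prediction served",
            extra={"request_id": "abc123", "model_version": "v2"})
```

Retrofitting this later means rewriting every log call site that interpolated context into the message string, which is exactly the pain the "day one" advice is about.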
## Open Questions
- When to introduce streaming ingestion (Kafka) vs. the current Airflow batch approach.
- Whether a dedicated vector store adds value for match context embeddings.