Lessons Learned

Honest retrospective: what worked well, what didn't, and what I'd change next time.

Work in progress

This page is a draft and will be finalized at project completion.

What Worked Well

  • DVC + MLflow together — clear separation between data versioning and experiment tracking; made pipeline reproducibility straightforward.
  • Temporal split enforced early — catching leakage at the test level prevented subtle bugs from propagating to production metrics.
  • Hydra for config — switching between dev/prod/test configs without code changes was a big productivity win.
  • FastAPI dependency injection — kept route handlers thin and made testing services easy.
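
The temporal-split point above can be made concrete. A minimal sketch of the kind of leakage guard used at the test level — the `match_date` field and the cutoff value are illustrative, not the project's actual schema:

```python
from datetime import date

def temporal_split(rows, cutoff):
    """Split rows strictly by time: everything on or before the cutoff
    date goes to train, everything after it goes to test."""
    train = [r for r in rows if r["match_date"] <= cutoff]
    test = [r for r in rows if r["match_date"] > cutoff]
    return train, test

def assert_no_leakage(train, test):
    """Guard: the newest training row must strictly predate the oldest
    test row, otherwise future information leaks into training."""
    if train and test:
        assert max(r["match_date"] for r in train) < min(r["match_date"] for r in test)

rows = [
    {"match_date": date(2023, 1, 10), "y": 0},
    {"match_date": date(2023, 6, 1), "y": 1},
    {"match_date": date(2024, 2, 3), "y": 1},
]
train, test = temporal_split(rows, cutoff=date(2023, 12, 31))
assert_no_leakage(train, test)
```

Running a check like this in the test suite is what catches an accidental random split before it reaches production metrics.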

What Was Harder Than Expected

  • Kubernetes networking on a single-node VPS — getting Ingress + cert-manager + internal service resolution working took significant iteration.
  • Feature parity between offline and online paths — ensuring that features computed in the DVC pipeline exactly matched what the API computes at inference time required explicit contracts and tests.
  • Airflow on Docker with custom dependencies — conflicts between Airflow's bundled packages and project requirements forced a custom Dockerfile build.
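
The offline/online parity problem boils down to having a single source of truth for feature computation. A hedged sketch of the contract-plus-test pattern — the feature names (`goal_diff`, `win_rate`) are hypothetical stand-ins for the real feature set:

```python
def compute_features(raw):
    """Single source of truth: both the offline (DVC) pipeline and the
    online API import this function, so the two paths cannot drift."""
    return {
        "goal_diff": raw["goals_for"] - raw["goals_against"],
        "win_rate": raw["wins"] / max(raw["played"], 1),
    }

def check_parity(sample_rows, offline_features):
    """Parity test: replay a sample of offline rows through the online
    code path and require exact agreement with the stored features."""
    for raw, expected in zip(sample_rows, offline_features):
        assert compute_features(raw) == expected

feats = compute_features({"goals_for": 5, "goals_against": 2, "wins": 3, "played": 4})
check_parity(
    [{"goals_for": 5, "goals_against": 2, "wins": 3, "played": 4}],
    [feats],
)
```

The key design choice is that the API never reimplements a feature: it imports the same function the batch pipeline uses, and the parity test exists to catch anyone who breaks that rule.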

What I'd Do Differently

  • Set up a proper feature store (e.g., Redis + Feast) earlier to avoid the offline/online parity problem.
  • Use a managed database (even free tier) to avoid ops overhead on PostgreSQL in K8s.
  • Start with structured logging and log aggregation from day one instead of retrofitting.
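
"Structured logging from day one" is cheap to set up. A minimal stdlib-only sketch — no third-party logger assumed — that emits one JSON object per line so an aggregator can filter on fields instead of regexes:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line with stable keys."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exc"] = self.formatException(record.exc_info)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("api").info("prediction served")
```

Retrofitting this later means touching every log call site and every dashboard query, which is exactly the overhead the bullet above warns about.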

Open Questions

  • When to introduce streaming ingestion (Kafka) vs. the current Airflow batch approach.
  • Whether a dedicated vector store adds value for match context embeddings.