
Model Registry & Promotion

Status: ✅ Implemented — Registration and candidate promotion automated via DVC pipeline; candidate → champion gate is manual.

Purpose

Document how models move from training into serving, what the lifecycle stages mean, how promotion is gated, and how rollback works.


Role of the registry

The MLflow Model Registry is the single handoff point between training and serving. The serving layer (PredictionService) loads the model from the registry by name and alias. No model reaches production without passing through this boundary.


Alias scheme (4 levels)

| Alias | Meaning | Gate | Who sets it | CI job |
|---|---|---|---|---|
| ci-smoke | Toy model (frac=0.001, n_trials=2); pipeline wiring check only, never used by serving | None (always assigned) | register_model DVC stage | train:smoke |
| smoke | Real-data model with full feature set and reduced trials; lifecycle entry point | None (always assigned) | register_model DVC stage | train:test |
| candidate | Passed quality gate; ready for manual review | final.logloss ≤ current candidate's final.logloss + tolerance (see promote_model parameters) | promote_model DVC stage | train:test |
| champion | Currently serving live predictions | Manual sign-off (see Promotion Policy) | Developer / scheduled DAG | |

ci-smoke is set by the experiment=smoke Hydra overlay (conf/experiment/smoke.yaml). All other aliases use the base config (conf/config.yaml).

A single model version can carry multiple aliases simultaneously (e.g. a new version becomes smoke immediately, then candidate after the gate passes, then champion after manual review).
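
To make the "multiple aliases on one version" point concrete, the sketch below inverts an alias-to-version map into a version-to-aliases view. The input dictionary is hypothetical example data; in practice a map of this shape could be read from the registered model's aliases through the MLflow client, which is an assumption about how one would inspect the registry rather than something this pipeline is documented to do.

```python
from collections import defaultdict


def versions_to_aliases(alias_map: dict[str, str]) -> dict[str, list[str]]:
    """Invert an alias -> version map into version -> sorted alias list."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for alias, version in alias_map.items():
        grouped[str(version)].append(alias)
    return {version: sorted(names) for version, names in grouped.items()}


# Hypothetical registry state: version 7 holds both smoke and candidate,
# while an older version 5 is still champion.
example = {"smoke": "7", "candidate": "7", "champion": "5"}
print(versions_to_aliases(example))
```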


Stage 1: Registration (register_model)

The register_model DVC stage is the final automated step after final_train. It reads data/models/final_run_id.json, creates or updates the registered model, and assigns the initial alias to the new version:

  • ci-smoke: when run from the train:smoke CI job (experiment=smoke, toy data)
  • smoke: when run from the train:test CI job (real data, lifecycle entry point)

This operation is idempotent: re-running with the same run ID is safe.

The model name and initial alias are controlled via params.yaml (register_model.model_name, register_model.model_stage).
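
One way an idempotent registration step could look is sketched below. It is not the project's actual register_model code. The function name register_version and the artifact_path default of "model" are assumptions, and client is expected to behave like mlflow.MlflowClient. The sketch reuses an existing version when the run ID has already been registered, which is what makes a re-run safe.

```python
def register_version(client, model_name, run_id, alias, artifact_path="model"):
    """Register the run's model at most once and point `alias` at it.

    `client` is expected to behave like mlflow.MlflowClient.
    `artifact_path` is an assumption; the source does not name it.
    """
    # Create the registered model only if it does not exist yet.
    if not client.search_registered_models(filter_string=f"name='{model_name}'"):
        client.create_registered_model(model_name)

    # Reuse the version already created for this run, if any, so that
    # re-running with the same run ID does not create a duplicate.
    existing = [
        mv
        for mv in client.search_model_versions(f"name='{model_name}'")
        if mv.run_id == run_id
    ]
    if existing:
        version = existing[0].version
    else:
        source = f"runs:/{run_id}/{artifact_path}"
        version = client.create_model_version(
            model_name, source, run_id=run_id
        ).version

    # Re-assigning an alias to the same version is harmless.
    client.set_registered_model_alias(model_name, alias, version)
    return version
```

Calling the function twice with the same run ID returns the same version number both times; only the alias assignment is repeated.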


Stage 2: Candidate promotion (promote_model)

The promote_model DVC stage runs after register_model. It fetches final.logloss for the new version and compares it with the final.logloss of the version that currently holds the candidate alias:

new_logloss ≤ current_candidate_logloss + tolerance  →  sets 'candidate' alias
new_logloss  > current_candidate_logloss + tolerance  →  logs warning, no change

If no current candidate exists (fresh registry), promotion always proceeds. A gate failure does not fail the DVC pipeline — it is an expected outcome.

Parameters (in params.yaml under promote_model):

| Parameter | Default | Meaning |
|---|---|---|
| metric | final.logloss | MLflow metric key to compare |
| tolerance | 0.005 | Max allowed degradation vs current candidate |
| candidate_alias | candidate | Alias name to assign on pass |

Result is written to data/models/promoted_model.json.
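
The gate decision described in this stage reduces to a single comparison, sketched below as a pure function. The name passes_candidate_gate is hypothetical, and the default tolerance simply mirrors the documented default; how the real stage fetches the metrics and writes its result file is not shown.

```python
from typing import Optional


def passes_candidate_gate(
    new_logloss: float,
    current_logloss: Optional[float],
    tolerance: float = 0.005,
) -> bool:
    """Return True when the new version may take the candidate alias."""
    if current_logloss is None:
        # Fresh registry: there is no candidate to compare against.
        return True
    # Lower logloss is better; allow at most `tolerance` of degradation.
    return new_logloss <= current_logloss + tolerance


print(passes_candidate_gate(0.503, 0.500))  # within tolerance
print(passes_candidate_gate(0.510, 0.500))  # degraded by more than tolerance
print(passes_candidate_gate(0.700, None))   # no current candidate
```

A False result corresponds to the "logs warning, no change" branch, so a caller would log it and exit normally rather than raise.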


Stage 3: Champion promotion (manual)

Promoting from candidate to champion requires manual approval. This is a deliberate quality gate — not a missing feature. See Promotion Policy for full checklist and rationale.

Promotion is performed with the MLflow Python client:

# Assumes MLFLOW_TRACKING_URI points at the tracking server
python - <<'EOF'
import mlflow

MODEL_NAME = "soccer-match-outcome"
client = mlflow.MlflowClient()

# Resolve the version that currently holds the candidate alias
candidate = client.get_model_version_by_alias(MODEL_NAME, "candidate")

# Point the champion alias at that version
client.set_registered_model_alias(MODEL_NAME, "champion", candidate.version)
print(f"champion → version {candidate.version}")
EOF

Serving coupling

PredictionService in src/app/services/predict.py loads the model as:

mlflow.pyfunc.load_model(f"models:/{model_name}@{stage}")

stage is read from settings.mlflow.model_stage (inference.model_stage in params.yaml). It is currently set to smoke (CI / dev environment); set it to candidate for staging and to champion for production.

Re-pointing an alias in the registry takes effect at the next worker restart, with no redeployment needed: the model is lazy-loaded once per worker process and then cached in memory.
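
The load-once behaviour can be illustrated with a small sketch. This is not the actual PredictionService code: make_model_getter is a hypothetical helper, and the loader is injected so the caching logic can be shown without a running registry. In the real service the loader would presumably be mlflow.pyfunc.load_model.

```python
from functools import lru_cache
from typing import Any, Callable


def make_model_getter(
    loader: Callable[[str], Any], model_name: str, alias: str
) -> Callable[[], Any]:
    """Build a zero-argument getter that loads the model on first call only."""

    @lru_cache(maxsize=1)
    def get_model() -> Any:
        # The alias is resolved here, once per process. Later registry
        # changes are not seen until the process restarts.
        return loader(f"models:/{model_name}@{alias}")

    return get_model
```

Because the cached object lives in process memory, each worker resolves the alias independently the first time it serves a request, which is why a restart is what picks up a re-pointed alias.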


Rollback

Rollback is a registry operation only — no retraining required:

  1. Re-assign the champion alias to any previous version in the MLflow UI or CLI.
  2. Restart workers (or wait for cache invalidation).

This is safe because all prior model versions remain stored as MLflow artifacts in MinIO.
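
The alias reassignment in step 1 could be scripted along the following lines. This is a sketch under assumptions: rollback_champion is a hypothetical helper, and client is expected to behave like mlflow.MlflowClient. It checks that the target version exists before moving the alias and returns the previous version so the change can be logged or undone.

```python
def rollback_champion(client, model_name, target_version, alias="champion"):
    """Point `alias` at an earlier version; return (previous, new) versions."""
    target = str(target_version)
    previous = client.get_model_version_by_alias(model_name, alias).version
    # Fail early if the target version is not in the registry.
    client.get_model_version(model_name, target)
    client.set_registered_model_alias(model_name, alias, target)
    return previous, target
```

Workers still need to be restarted afterwards, as in step 2, before they serve the restored version.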


Implementation status

| Aspect | Status |
|---|---|
| Automated registration (smoke alias) via register_model DVC stage | ✅ Implemented |
| Automated candidate gate (candidate alias) via promote_model DVC stage | ✅ Implemented |
| Manual candidate → champion gate | 📋 Manual approval required |
| Rollback via registry alias reassignment | ✅ Supported |