Model Registry & Promotion
Status: ✅ Implemented — Registration and candidate promotion automated via DVC pipeline; candidate → champion gate is manual.
Purpose
Document how models move from training into serving, what the lifecycle stages mean, how promotion is gated, and how rollback works.
Role of the registry
The MLflow Model Registry is the single handoff point between training and serving.
The serving layer (PredictionService) loads the model from the registry by name and alias.
No model reaches production without passing through this boundary.
Alias scheme (4 levels)
| Alias | Meaning | Gate | Who sets it | CI job |
|---|---|---|---|---|
| ci-smoke | Toy model (frac=0.001, n_trials=2); pipeline wiring check only; never used by serving | None (always assigned) | register_model DVC stage | train:smoke |
| smoke | Real-data model (full feature set, reduced trials); lifecycle entry point | None (always assigned) | register_model DVC stage | train:test |
| candidate | Passed quality gate; ready for manual review | final.logloss ≤ current candidate + tolerance (see promote_model parameters) | promote_model DVC stage | train:test |
| champion | Currently serving live predictions | Manual sign-off (see Promotion Policy) | Developer / scheduled DAG | n/a |
ci-smoke is set by the experiment=smoke Hydra overlay (conf/experiment/smoke.yaml).
All other aliases use the base config (conf/config.yaml).
A single model version can carry multiple aliases simultaneously (e.g. a new version
becomes smoke immediately, then candidate after the gate passes, then champion
after manual review).
Stage 1: Registration (register_model)
The register_model DVC stage is the final automated step after final_train.
It reads data/models/final_run_id.json, creates or updates the registered model,
and assigns the initial alias to the new version:
- ci-smoke: when run from the train:smoke CI job (experiment=smoke, toy data)
- smoke: when run from the train:test CI job (real data, lifecycle entry point)
This operation is idempotent: re-running with the same run ID is safe.
The model name and initial alias are controlled via params.yaml
(register_model.model_name, register_model.model_stage).
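The idempotent behaviour described above could be sketched roughly as follows. This is an illustration only, not the project's actual stage code: the function name `register_once`, the `{"run_id": ...}` layout of final_run_id.json, and the `model` artifact path are all assumptions, and the sketch presumes the registered model already exists. The client is passed in so that an `mlflow.MlflowClient()` (or a test double) can be supplied.

```python
import json
from pathlib import Path
from typing import Any


def register_once(client: Any, model_name: str, alias: str, run_id_file: Path) -> str:
    """Register the run's model as a new version unless that run is already registered."""
    # ASSUMPTION: the file holds {"run_id": "..."}; the real schema may differ.
    run_id = json.loads(run_id_file.read_text())["run_id"]

    # Reuse a version created from the same run; this is what makes re-runs safe.
    versions = client.search_model_versions(f"name='{model_name}'")
    existing = [v for v in versions if v.run_id == run_id]
    if existing:
        version = existing[0].version
    else:
        # ASSUMPTION: the model was logged under the artifact path "model".
        created = client.create_model_version(
            model_name, f"runs:/{run_id}/model", run_id=run_id
        )
        version = created.version

    # Setting an alias simply moves it, so repeating the call is harmless.
    client.set_registered_model_alias(model_name, alias, version)
    return str(version)
```

Calling it twice with the same run ID file should return the same version number and leave exactly one version in the registry for that run.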
Stage 2: Candidate promotion (promote_model)
The promote_model DVC stage runs after register_model.
It fetches final.logloss for the new version and compares it to the current
candidate alias:
new_logloss ≤ current_candidate_logloss + tolerance → sets 'candidate' alias
new_logloss > current_candidate_logloss + tolerance → logs warning, no change
If no current candidate exists (fresh registry), promotion always proceeds.
A gate failure does not fail the DVC pipeline — it is an expected outcome.
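As a minimal sketch of the comparison above (the function name `passes_candidate_gate` is hypothetical, and the default tolerance is taken from the promote_model parameters), the gate reduces to a single inequality plus the fresh-registry case:

```python
from typing import Optional


def passes_candidate_gate(
    new_logloss: float,
    current_logloss: Optional[float],
    tolerance: float = 0.005,
) -> bool:
    """Return True when the new version may take the 'candidate' alias."""
    # Fresh registry: no current candidate, so promotion always proceeds.
    if current_logloss is None:
        return True
    # Lower logloss is better; allow at most `tolerance` of degradation.
    return new_logloss <= current_logloss + tolerance
```

A caller would log a warning and exit normally when this returns False, since a gate failure is an expected outcome rather than a pipeline error.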
Parameters (in params.yaml under promote_model):
| Parameter | Default | Meaning |
|---|---|---|
| metric | final.logloss | MLflow metric key to compare |
| tolerance | 0.005 | Max allowed degradation vs current candidate |
| candidate_alias | candidate | Alias name to assign on pass |
Result is written to data/models/promoted_model.json.
Stage 3: Champion promotion (manual)
Promoting from candidate to champion requires manual approval. This is a deliberate
quality gate — not a missing feature. See Promotion Policy for
full checklist and rationale.
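Promotion is performed with the MLflow Python client, which resolves the candidate version by alias:
# List registered versions (optional; the promote step below looks up the candidate by alias)
python -c "import mlflow; [print(v.version, v.run_id) for v in mlflow.MlflowClient().search_model_versions(\"name='soccer-match-outcome'\")]"
# Promote the current candidate to champion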
python - <<'EOF'
import mlflow
client = mlflow.MlflowClient()
candidate = client.get_model_version_by_alias("soccer-match-outcome", "candidate")
client.set_registered_model_alias("soccer-match-outcome", "champion", candidate.version)
print(f"champion → version {candidate.version}")
EOF
Serving coupling
PredictionService in src/app/services/predict.py loads the model from the registry by registered name and alias.
The alias (stage) is read from settings.mlflow.model_stage (inference.model_stage in params.yaml).
Currently set to smoke (CI / dev environment).
Change to candidate for staging, champion for production.
Changing the alias in the registry takes effect on next worker restart without redeployment. The model is lazy-loaded once per worker process and cached in memory.
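The load-once-and-cache behaviour can be illustrated with a small sketch. It is not the actual PredictionService code: `model_uri` and `make_cached_loader` are hypothetical names, and the loader is injected so the caching logic can be exercised without a tracking server. The `models:/<name>@<alias>` form is MLflow's URI scheme for loading a registered model by alias.

```python
from functools import lru_cache
from typing import Any, Callable


def model_uri(name: str, alias: str) -> str:
    """Build the MLflow registry URI for a model addressed by alias."""
    return f"models:/{name}@{alias}"


def make_cached_loader(load: Callable[[str], Any]) -> Callable[[str, str], Any]:
    """Wrap `load` so each (name, alias) pair is loaded once per process."""

    @lru_cache(maxsize=None)
    def get_model(name: str, alias: str) -> Any:
        # First call performs the load; later calls return the cached object.
        return load(model_uri(name, alias))

    return get_model
```

In a real worker the loader would presumably be `mlflow.pyfunc.load_model`, for example `get_model = make_cached_loader(mlflow.pyfunc.load_model)`. Because the cache lives in process memory, a moved alias is only picked up after the worker restarts, which matches the behaviour described above.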
Rollback
Rollback is a registry operation only — no retraining required:
- Re-assign the champion alias to any previous version in the MLflow UI or CLI.
- Restart workers (or wait for cache invalidation).
This is safe because all prior model versions remain stored as MLflow artifacts in MinIO.
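A rollback done through the Python client might look like the sketch below. The helper name `rollback_champion` is hypothetical; the client methods it calls (`get_model_version_by_alias`, `get_model_version`, `set_registered_model_alias`) are standard `MlflowClient` methods, and the client is passed in so a test double can be used.

```python
from typing import Any


def rollback_champion(
    client: Any, model_name: str, to_version: str, alias: str = "champion"
) -> str:
    """Point `alias` at `to_version` and return the version it pointed to before."""
    previous = client.get_model_version_by_alias(model_name, alias).version
    # Fail early if the target version does not exist in the registry.
    client.get_model_version(model_name, to_version)
    client.set_registered_model_alias(model_name, alias, to_version)
    return str(previous)
```

Returning the previous version makes it easy to record what was replaced, or to undo the rollback. Workers still need a restart before they serve the re-aliased version.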
Implementation status
| Aspect | Status |
|---|---|
| Automated registration (smoke alias) via register_model DVC stage | ✅ Implemented |
| Automated candidate gate (candidate alias) via promote_model DVC stage | ✅ Implemented |
| Manual candidate → champion gate | 📋 Manual approval required |
| Rollback via registry alias reassignment | ✅ Supported |
Related
- Promotion Policy: hard gates and checklist for champion
- MLflow: experiment tracking
- Model Contract: breaking change policy
- Training Pipeline: pipeline stage sequence
- Serving: how the serving layer loads models
- Status