Experiment Tracking (MLflow)¶
Purpose¶
Document what MLflow is responsible for in this system, how runs are structured, what each training stage logs, and how traceability back to data and code is maintained.
MLflow is the observability and traceability layer for all ML training stages. It is not used for serving, feature storage, or monitoring.
Responsibilities¶
| MLflow responsibility | Covered |
|---|---|
| Experiment and run tracking | ✅ |
| Parameter and metric logging | ✅ |
| Model artifact storage | ✅ |
| Model registry (versioning + lifecycle) | ✅ — see Model Registry |
| Lineage to Git commit and DVC dataset version | ✅ |
| Serving / inference | ✗ — handled by FastAPI + Celery |
| Feature storage | ✗ — features are DVC-tracked Parquet files |
| Drift detection / monitoring | ✗ — planned, not yet integrated |
Experiments¶
Two experiments are used across all pipeline stages:
| Experiment name | Purpose |
|---|---|
| `matches_clf` | All production runs: train_eval, ablation, tuning, final_train |
| `matches_clf_smoke` | All smoke / fast-dev runs |
Ablation variants (elo_only, stats_only, full_no_h2h, full) are not separate
experiments. They run inside the same experiment and are identified by run-level tags
(pipeline.variant, features.profile). This keeps the experiment list short and
makes cross-variant comparison straightforward from a single experiment view.
The experiment name is controlled by params.yaml → classification.experiment_name
(shared with ablation.experiment_name).
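A minimal sketch of how a stage can resolve its experiment name from that key; the loader below and the `_smoke` suffix convention are assumptions based on the two experiment names listed above, not the pipeline's actual helper.

```python
import mlflow
import yaml

def resolve_experiment_name(params_path: str = "params.yaml", smoke: bool = False) -> str:
    with open(params_path) as f:
        params = yaml.safe_load(f)
    name = params["classification"]["experiment_name"]  # e.g. "matches_clf"
    return f"{name}_smoke" if smoke else name  # assumed smoke-suffix convention

# Every stage then points MLflow at the resolved experiment before starting runs.
mlflow.set_experiment(resolve_experiment_name())
```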
Run structure¶
Run hierarchy¶
All pipeline stages use a two-level hierarchy where applicable:
- Parent run (`pipeline.scope=parent`): groups a set of related child runs; holds summary params and pipeline context tags.
- Child run (`pipeline.scope=child`): one run per model or Optuna trial; holds metrics, the model artifact, and full lineage tags.

Standalone runs (final_train) use `pipeline.scope=parent` with no children.
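As a rough illustration, the hierarchy maps onto MLflow's nested runs. The model list and tag values below are placeholders; only the tag names and run-name format follow the tables in this section.

```python
import mlflow

# Illustrative parent/child layout; placeholder models and tag values.
with mlflow.start_run(run_name="train_eval | frac=1.0 | feat=full"):
    mlflow.set_tags({"pipeline.scope": "parent", "pipeline.stage": "train_eval"})

    for model_name in ["xgb", "logreg", "dummy"]:
        with mlflow.start_run(run_name=f"model | {model_name}", nested=True):
            mlflow.set_tags({"pipeline.scope": "child", "model.family": model_name})
            # fit the model here, then log metrics and the model artifact
```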
Standard run tags¶
Every run (parent and child) carries:
| Tag | Values | Source |
|---|---|---|
| `pipeline.run_kind` | `smoke` \| `train_eval` \| `ablation` \| `tuning` \| `final_train` | `build_pipeline_context_tags` |
| `pipeline.stage` | `train_eval` \| `ablation` \| `tuning` \| `optuna_trial` \| `final_train` | `build_run_scope_tags` |
| `pipeline.scope` | `parent` \| `child` | `build_run_scope_tags` |
| `pipeline.variant` | `baseline` \| `elo_only` \| `stats_only` \| `full_no_h2h` \| `full` | `build_run_scope_tags` |
| `features.profile` | `full` \| `elo_only` \| `stats_only` \| `no_h2h` | `build_run_scope_tags` |
| `model.family` | `gradient_boosting` \| `linear` \| `dummy` \| ... | `build_run_scope_tags` (child only) |
| `pipeline.git_sha` | git short SHA | `build_pipeline_context_tags` |
| `pipeline.dvc_exp_name` | DVC experiment name | `build_pipeline_context_tags` |
| `pipeline.params_hash` | 16-char SHA-256 of params.yaml | `build_pipeline_context_tags` |
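For reference, a minimal sketch of how the context tags can be derived. The signature is an assumption, and the real `build_pipeline_context_tags` also resolves `pipeline.dvc_exp_name`, which is omitted here.

```python
import hashlib
import subprocess

def build_pipeline_context_tags(run_kind: str, params_path: str = "params.yaml") -> dict:
    """Sketch only: 16-char SHA-256 of params.yaml plus the short git SHA."""
    with open(params_path, "rb") as f:
        params_hash = hashlib.sha256(f.read()).hexdigest()[:16]
    git_sha = subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()
    return {
        "pipeline.run_kind": run_kind,
        "pipeline.git_sha": git_sha,
        "pipeline.params_hash": params_hash,
    }
```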
Data lineage tags (child runs):
| Tag | Contents |
|---|---|
| `data.version` / `data.hash` | DVC MD5 of the dataset parquet |
| `data.source_bucket` | MinIO bucket name |
| `data.source_key` | MinIO object key |
| `data.source_etag` | MinIO ETag (quotes stripped) |
| `data.source_last_modified` | Last-PUT timestamp |
| `data.ingested_at` | Ingestion timestamp from sidecar |
| `data.train_start/end` | Temporal bounds of the training split |
| `data.test_start/end` | Temporal bounds of the holdout split |
| `data.train_rows` / `data.test_rows` | Row counts |
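A minimal sketch of attaching these tags to a child run. Every value below is a made-up placeholder; in the pipeline they come from the DVC lock file and the MinIO ingestion sidecar.

```python
import mlflow

# All values here are placeholders for illustration only.
mlflow.set_tags({
    "data.version": "0123456789abcdef0123456789abcdef",  # DVC MD5 of dataset.parquet
    "data.source_bucket": "example-bucket",
    "data.source_key": "example/path/dataset.parquet",
    "data.train_rows": "100000",
    "data.test_rows": "10000",
})
```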
Run name format¶
| Stage | Scope | Example run name |
|---|---|---|
| train_eval | parent | smoke \| train_eval \| frac=0.001 \| feat=full |
| train_eval | parent | train_eval \| frac=1.0 \| feat=full |
| ablation | parent | smoke \| ablation \| variant=elo_only \| frac=0.001 |
| ablation | parent | ablation \| variant=elo_only \| frac=1.0 |
| tuning | parent | smoke \| tuning \| frac=0.1 |
| tuning | child | trial \| 000 |
| final_train | parent | smoke \| final_train \| model=xgb |
| train_eval / ablation | child | model \| xgb |
classification_models stage¶
One parent run per frac/variant combination, with nested child runs per model.
The stage selects the best model by holdout log-loss and writes the winning run ID to
`data/models/run_id.json`.
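A minimal sketch of that selection step; the metric key `holdout.logloss` is an assumption, not necessarily the key the stage actually logs.

```python
import json
import mlflow

def select_and_record_winner(child_runs: list[mlflow.entities.Run],
                             out_path: str = "data/models/run_id.json") -> str:
    # "holdout.logloss" is an assumed metric key; adjust to the stage's real key.
    best = min(child_runs, key=lambda r: r.data.metrics["holdout.logloss"])
    with open(out_path, "w") as f:
        json.dump({"run_id": best.info.run_id}, f, indent=2)
    return best.info.run_id
```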
ablation_study stage¶
Runs inside the same matches_clf experiment as classification_models.
Each feature subset gets its own parent run tagged with pipeline.variant
and features.profile for filtering.
tune_xgb stage¶
One parent run ({run_kind} | tuning | frac={frac}), with one nested child run per
Optuna trial.
Each trial child run logs:
- XGBoost parameters (xgb.*)
- Mean CV log-loss across walk-forward folds (cv.logloss_mean)
- Trial number
Parent run logs the best parameters (best.*) and best CV log-loss at study completion.
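A sketch of how one trial could be logged as a nested child run. The search space and `run_walk_forward_cv` are placeholders; only the logged names (`xgb.*`, `cv.logloss_mean`, the trial number) follow the description above.

```python
import mlflow
import optuna

def run_walk_forward_cv(params: dict) -> float:
    """Placeholder for the walk-forward CV used by the real stage."""
    return 0.95

def objective(trial: optuna.Trial) -> float:
    params = {
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
    }
    cv_logloss = run_walk_forward_cv(params)

    # One nested child run per trial, named "trial | 000", "trial | 001", ...
    with mlflow.start_run(run_name=f"trial | {trial.number:03d}", nested=True):
        mlflow.log_params({f"xgb.{k}": v for k, v in params.items()})
        mlflow.log_param("trial.number", trial.number)
        mlflow.log_metric("cv.logloss_mean", cv_logloss)
    return cv_logloss
```

At study completion the parent run can record the tuned values with a `best.` prefix, e.g. `mlflow.log_params({f"best.{k}": v for k, v in study.best_params.items()})`.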
final_train stage¶
The definitive training run. This is the only run that evaluates on the held-out test set.
Tags logged:
| Tag | Value |
|---|---|
| `pipeline.stage` | `final_train` |
| `pipeline.scope` | `parent` |
| `pipeline.variant` | `baseline` |
| `model.family` | Algorithm family (e.g., `gradient_boosting`) |
Parameters logged:
| Param group | Contents |
|---|---|
| `model.*` | Model name |
| `data.*` | Training rows, holdout rows, target column, fraction |
| `features.*` | Number of numeric and categorical features |
| `best.*` | Best hyperparameters from the tuning stage |
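For illustration, the grouped prefixes can be logged in a single call; all key names and values below are placeholders rather than the stage's exact parameters.

```python
import mlflow

best_params = {"max_depth": 6, "learning_rate": 0.05}  # placeholder for xgb_best_params.json

mlflow.log_params({
    "model.name": "xgb",
    "data.train_rows": 100_000,
    "data.target_col": "result",
    "features.n_numeric": 40,
    "features.n_categorical": 3,
    **{f"best.{k}": v for k, v in best_params.items()},
})
```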
Metrics logged:
- Holdout log-loss, accuracy, Brier score
- ECE (calibration error) — before and after calibration when calibration is enabled
- Precision, recall, F1 per class
Artifacts logged:
- `model/`: the serialised sklearn pipeline (or calibrated wrapper) as an MLflow pyfunc
- Calibration curves
- Confusion matrix (multi-class)
- Feature importance plot
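One plausible way to produce those artifacts, with stand-in objects so the snippet runs on its own; the real stage logs the fitted production pipeline and its actual evaluation plots.

```python
import matplotlib.pyplot as plt
import mlflow
import mlflow.sklearn
from sklearn.dummy import DummyClassifier

# Stand-ins for the fitted pipeline and the evaluation figures.
fitted_pipeline = DummyClassifier(strategy="prior").fit([[0], [1]], [0, 1])
fig, ax = plt.subplots()
ax.set_title("calibration curve (placeholder)")

with mlflow.start_run():
    mlflow.sklearn.log_model(fitted_pipeline, artifact_path="model")  # also exposes a pyfunc flavor
    mlflow.log_figure(fig, "plots/calibration_curve.png")
```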
Dataset input logged:
- MLflow dataset object pointing to the training parquet with source path
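A minimal sketch of logging that dataset input, assuming the training frame is read straight from the DVC-tracked parquet:

```python
import mlflow
import mlflow.data
import pandas as pd

train_df = pd.read_parquet("data/processed/dataset.parquet")
dataset = mlflow.data.from_pandas(train_df, source="data/processed/dataset.parquet")

with mlflow.start_run():
    # Recorded under the run's "Datasets" section in the MLflow UI.
    mlflow.log_input(dataset, context="training")
```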
Lineage chain¶
Each final_train run is traceable through:
MLflow run
├── tag: data.version = DVC hash of data/processed/dataset.parquet
├── tag: pipeline.git_sha = git short SHA
├── tag: pipeline.params_hash = SHA-256 of params.yaml
├── MLflow dataset input: source = filesystem path to dataset.parquet
└── params: best.* = tuned hyperparameters from xgb_best_params.json
Given an MLflow run ID, you can recover:
- The exact dataset version (via data.version tag → dvc checkout).
- The exact code version (via pipeline.git_sha tag).
- The exact configuration (via logged params + params.yaml in git).
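In code, the recovery step is a tag lookup on the run; this sketch assumes the MLflow tracking URI is already configured.

```python
import mlflow

def recover_lineage(run_id: str) -> dict:
    tags = mlflow.get_run(run_id).data.tags
    return {
        "dataset_version": tags["data.version"],      # feed into `dvc checkout`
        "git_sha": tags["pipeline.git_sha"],          # feed into `git checkout`
        "params_hash": tags["pipeline.params_hash"],  # compare against params.yaml
    }
```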
Filtering runs in the UI¶
- All ablation runs: filter by `tags.pipeline.stage = 'ablation'`
- Elo-only runs: filter by `tags.pipeline.variant = 'elo_only'`
- All smoke runs: filter by `tags.pipeline.run_kind = 'smoke'`
- The winning final-model runs: filter by `tags.pipeline.stage = 'final_train'`
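The same filters work programmatically through the search API; a small example, assuming the production experiment name from params.yaml:

```python
import mlflow

ablation_parents = mlflow.search_runs(
    experiment_names=["matches_clf"],
    filter_string="tags.pipeline.stage = 'ablation' AND tags.pipeline.scope = 'parent'",
)
print(ablation_parents[["run_id", "tags.pipeline.variant"]])
```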
What MLflow is not responsible for¶
- Serving: models are loaded from the registry by `PredictionService`; MLflow tracking is not in the hot path.
- Feature computation: features are DVC-tracked Parquet files, not MLflow datasets.
- Monitoring / drift detection: Evidently is the planned monitoring tool; not yet integrated.
- A/B testing: not implemented. Promotion is manual via the registry.
Running the MLflow UI¶
Navigate to the matches_clf experiment to see all training and tuning runs.