Experiment Tracking (MLflow)

Purpose

Document what MLflow is responsible for in this system, how runs are structured, what each training stage logs, and how traceability back to data and code is maintained.

MLflow is the observability and traceability layer for all ML training stages. It is not used for serving, feature storage, or monitoring.


Responsibilities

| MLflow responsibility | Covered |
| --- | --- |
| Experiment and run tracking | ✅ |
| Parameter and metric logging | ✅ |
| Model artifact storage | ✅ |
| Model registry (versioning + lifecycle) | ✅ — see Model Registry |
| Lineage to Git commit and DVC dataset version | ✅ |
| Serving / inference | ✗ — handled by FastAPI + Celery |
| Feature storage | ✗ — features are DVC-tracked Parquet files |
| Drift detection / monitoring | ✗ — planned, not yet integrated |

Experiments

Two experiments are used across all pipeline stages:

| Experiment name | Purpose |
| --- | --- |
| matches_clf | All production runs: train_eval, ablation, tuning, final_train |
| matches_clf_smoke | All smoke / fast-dev runs |

Ablation variants (elo_only, stats_only, full_no_h2h, full) are not separate experiments. They run inside the same experiment and are identified by run-level tags (pipeline.variant, features.profile). This keeps the experiment list short and makes cross-variant comparison straightforward from a single experiment view.

The experiment name is controlled by params.yaml → classification.experiment_name (shared with ablation.experiment_name).
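
As a minimal sketch, stage code could resolve the experiment name from params.yaml and set it before starting any runs. The loading helper below is illustrative, not the project's actual code; only the classification.experiment_name key comes from this page:

```python
import mlflow
import yaml

def resolve_experiment_name(params_path: str = "params.yaml") -> str:
    """Illustrative helper: read the experiment name from params.yaml."""
    with open(params_path) as fh:
        params = yaml.safe_load(fh)
    # "matches_clf" for production runs, "matches_clf_smoke" for smoke runs
    return params["classification"]["experiment_name"]

mlflow.set_experiment(resolve_experiment_name())
```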


Run structure

Run hierarchy

All pipeline stages use a two-level hierarchy where applicable:

  • Parent run (pipeline.scope=parent): groups a set of related child runs; holds summary params and pipeline context tags.
  • Child run (pipeline.scope=child): one run per model or Optuna trial; holds metrics, model artifact, and full lineage tags.

Standalone runs (final_train) use pipeline.scope=parent with no children.
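
A minimal sketch of the two-level structure using the fluent MLflow API (tag keys follow the tables below; run names and metric values are placeholders):

```python
import mlflow

# Parent run groups a set of related child runs and carries the pipeline context tags.
with mlflow.start_run(run_name="train_eval | frac=1.0 | feat=full"):
    mlflow.set_tags({"pipeline.scope": "parent", "pipeline.stage": "train_eval"})

    for model_name in ["xgb", "logreg"]:  # illustrative model list
        # One nested child run per model, holding metrics and the model artifact.
        with mlflow.start_run(run_name=f"model | {model_name}", nested=True):
            mlflow.set_tags({"pipeline.scope": "child", "model.family": model_name})
            mlflow.log_metric("holdout_logloss", 0.98)  # placeholder value
```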

Standard run tags

Every run (parent and child) carries:

| Tag | Values | Source |
| --- | --- | --- |
| pipeline.run_kind | smoke \| train_eval \| ablation \| tuning \| final_train | build_pipeline_context_tags |
| pipeline.stage | train_eval \| ablation \| tuning \| optuna_trial \| final_train | build_run_scope_tags |
| pipeline.scope | parent \| child | build_run_scope_tags |
| pipeline.variant | baseline \| elo_only \| stats_only \| full_no_h2h \| full | build_run_scope_tags |
| features.profile | full \| elo_only \| stats_only \| no_h2h | build_run_scope_tags |
| model.family | gradient_boosting \| linear \| dummy \| ... | build_run_scope_tags (child only) |
| pipeline.git_sha | git short SHA | build_pipeline_context_tags |
| pipeline.dvc_exp_name | DVC experiment name | build_pipeline_context_tags |
| pipeline.params_hash | 16-char SHA-256 of params.yaml | build_pipeline_context_tags |
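
A hedged sketch of what a build_pipeline_context_tags-style helper could produce (the project's real helper is not shown here; the git call, the 16-char hash truncation, and the DVC_EXP_NAME environment variable are assumptions consistent with the table above):

```python
import hashlib
import os
import subprocess

def build_pipeline_context_tags(run_kind: str, params_path: str = "params.yaml") -> dict:
    """Illustrative reconstruction of the pipeline context tags."""
    git_sha = subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()
    with open(params_path, "rb") as fh:
        params_hash = hashlib.sha256(fh.read()).hexdigest()[:16]  # 16-char SHA-256
    return {
        "pipeline.run_kind": run_kind,
        "pipeline.git_sha": git_sha,
        # Assumption: DVC exposes the experiment name via DVC_EXP_NAME during `dvc exp run`.
        "pipeline.dvc_exp_name": os.environ.get("DVC_EXP_NAME", ""),
        "pipeline.params_hash": params_hash,
    }
```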

Data lineage tags (child runs):

| Tag | Contents |
| --- | --- |
| data.version / data.hash | DVC MD5 of the dataset Parquet |
| data.source_bucket | MinIO bucket name |
| data.source_key | MinIO object key |
| data.source_etag | MinIO ETag (quotes stripped) |
| data.source_last_modified | Last-PUT timestamp |
| data.ingested_at | Ingestion timestamp from sidecar |
| data.train_start/end | Temporal bounds of training split |
| data.test_start/end | Temporal bounds of holdout split |
| data.train_rows / data.test_rows | Row counts |
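
As one example of where the data.version / data.hash values could come from, the MD5 that DVC records for the processed dataset can be read straight from its .dvc file (the file path is an assumption; the MinIO-related tags come from the ingestion sidecar and are not shown):

```python
import mlflow
import yaml

def dvc_md5(dvc_file: str = "data/processed/dataset.parquet.dvc") -> str:
    """Read the MD5 that DVC recorded for the tracked dataset."""
    with open(dvc_file) as fh:
        meta = yaml.safe_load(fh)
    return meta["outs"][0]["md5"]

# Inside an active MLflow run:
dataset_hash = dvc_md5()
mlflow.set_tags({"data.version": dataset_hash, "data.hash": dataset_hash})
```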

Run name format

| Stage | Scope | Example run name |
| --- | --- | --- |
| train_eval | parent | smoke \| train_eval \| frac=0.001 \| feat=full |
| train_eval | parent | train_eval \| frac=1.0 \| feat=full |
| ablation | parent | smoke \| ablation \| variant=elo_only \| frac=0.001 |
| ablation | parent | ablation \| variant=elo_only \| frac=1.0 |
| tuning | parent | smoke \| tuning \| frac=0.1 |
| tuning | child | trial \| 000 |
| final_train | parent | smoke \| final_train \| model=xgb |
| train_eval / ablation | child | model \| xgb |
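
The names are plain pipe-delimited strings, so a hypothetical builder matching the examples above could be as simple as:

```python
def build_run_name(*parts: str) -> str:
    """Join run-name components in the 'kind | stage | key=value' style shown above."""
    return " | ".join(parts)

build_run_name("smoke", "train_eval", "frac=0.001", "feat=full")
# -> 'smoke | train_eval | frac=0.001 | feat=full'
```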

classification_models stage

This stage creates one parent run per frac/variant combination, with nested child runs per model. It selects the best model by holdout log-loss and writes the winning run ID to data/models/run_id.json.
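
A hedged sketch of the selection step using mlflow.search_runs (the holdout_logloss metric key and the exact filter are assumptions; the experiment name and output path come from this page):

```python
import json
import mlflow

# Rank child runs by holdout log-loss, lowest first, and keep the best one.
runs = mlflow.search_runs(
    experiment_names=["matches_clf"],
    filter_string="tags.pipeline.scope = 'child' AND tags.pipeline.stage = 'train_eval'",
    order_by=["metrics.holdout_logloss ASC"],  # metric key is an assumption
    max_results=1,
)
best_run_id = runs.loc[0, "run_id"]

# Persist the winning run ID for downstream stages.
with open("data/models/run_id.json", "w") as fh:
    json.dump({"run_id": best_run_id}, fh)
```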

ablation_study stage

Runs inside the same matches_clf experiment as classification_models. Each feature subset gets its own parent run tagged with pipeline.variant and features.profile for filtering.

tune_xgb stage

One parent run ({run_kind} | tuning | frac={frac}), with one nested child run per Optuna trial. Each trial child run logs:

  • XGBoost parameters (xgb.*)
  • Mean CV log-loss across walk-forward folds (cv.logloss_mean)
  • Trial number

Parent run logs the best parameters (best.*) and best CV log-loss at study completion.
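
A minimal sketch of that structure with Optuna (the search space, the run_walk_forward_cv stub, and the best.cv_logloss metric name are assumptions; tag, param, and metric keys otherwise follow this page):

```python
import mlflow
import optuna

def run_walk_forward_cv(params: dict) -> float:
    """Stub standing in for the project's walk-forward CV; returns the mean fold log-loss."""
    return 0.98  # dummy value for illustration

def objective(trial: optuna.Trial) -> float:
    params = {
        "max_depth": trial.suggest_int("max_depth", 3, 10),  # illustrative search space
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
    }
    cv_logloss_mean = run_walk_forward_cv(params)

    # One nested child run per trial, under the active tuning parent run.
    with mlflow.start_run(run_name=f"trial | {trial.number:03d}", nested=True):
        mlflow.set_tags({"pipeline.scope": "child", "pipeline.stage": "optuna_trial"})
        mlflow.log_params({f"xgb.{k}": v for k, v in params.items()})
        mlflow.log_metric("cv.logloss_mean", cv_logloss_mean)
        mlflow.log_param("trial_number", trial.number)
    return cv_logloss_mean

# Inside the tuning parent run: optimise, then record the winners on the parent.
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)  # illustrative trial budget
mlflow.log_params({f"best.{k}": v for k, v in study.best_params.items()})
mlflow.log_metric("best.cv_logloss", study.best_value)  # metric name is an assumption
```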

final_train stage

The definitive training run. This is the only run that evaluates on the held-out test set.

Tags logged:

| Tag | Value |
| --- | --- |
| pipeline.stage | final_train |
| pipeline.scope | parent |
| pipeline.variant | baseline |
| model.family | Algorithm family (e.g., gradient_boosting) |

Parameters logged:

| Param group | Contents |
| --- | --- |
| model.* | Model name |
| data.* | Training rows, holdout rows, target column, fraction |
| features.* | Number of numeric and categorical features |
| best.* | Best hyperparameters from the tuning stage |

Metrics logged:

  • Holdout log-loss, accuracy, Brier score
  • ECE (expected calibration error) — before and after calibration when calibration is enabled; see the computation sketch after this list
  • Precision, recall, F1 per class
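
ECE has no sklearn built-in; a simple equal-width-bin version (one common definition, not necessarily the exact implementation used in this project) looks like:

```python
import numpy as np

def expected_calibration_error(y_true: np.ndarray, y_prob: np.ndarray, n_bins: int = 10) -> float:
    """Equal-width-bin ECE over the predicted top-class probability."""
    confidences = y_prob.max(axis=1)          # confidence of the predicted class
    predictions = y_prob.argmax(axis=1)
    accuracies = (predictions == y_true).astype(float)

    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(accuracies[mask].mean() - confidences[mask].mean())
    return ece
```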

Artifacts logged:

  • model/ — the serialised sklearn pipeline (or calibrated wrapper) as MLflow pyfunc
  • Calibration curves
  • Confusion matrix (multi-class)
  • Feature importance plot

Dataset input logged:

  • MLflow dataset object pointing to the training parquet with source path
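
Taken together, the artifact and dataset-input logging might look roughly like this (fitted_pipeline, train_df, and the plot file paths are assumptions passed in by the caller; the MLflow calls themselves are standard):

```python
import mlflow
import mlflow.data
import mlflow.sklearn

def log_final_train_outputs(fitted_pipeline, train_df) -> None:
    """Illustrative: log the model, dataset input, and diagnostic plots inside an active run."""
    # Log the fitted sklearn pipeline under model/ (sklearn + pyfunc flavours).
    mlflow.sklearn.log_model(fitted_pipeline, artifact_path="model")

    # Log the training dataset as an MLflow dataset input with its source path.
    train_dataset = mlflow.data.from_pandas(train_df, source="data/processed/dataset.parquet")
    mlflow.log_input(train_dataset, context="training")

    # Diagnostic plots (file names are illustrative).
    for plot in ("calibration_curve.png", "confusion_matrix.png", "feature_importance.png"):
        mlflow.log_artifact(f"reports/{plot}")
```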

Lineage chain

Each final_train run is traceable through:

MLflow run
  ├── tag: data.version = DVC hash of data/processed/dataset.parquet
  ├── tag: pipeline.git_sha = git short SHA
  ├── tag: pipeline.params_hash = SHA-256 of params.yaml
  ├── MLflow dataset input: source = filesystem path to dataset.parquet
  └── params: best.* = tuned hyperparameters from xgb_best_params.json

Given an MLflow run ID, you can recover:

  • The exact dataset version (via the data.version tag → dvc checkout)
  • The exact code version (via the pipeline.git_sha tag)
  • The exact configuration (via logged params + params.yaml in git)
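
A hedged sketch of recovering that lineage from a run ID with the MlflowClient (commands are printed rather than executed; tag names come from this page):

```python
from mlflow.tracking import MlflowClient

def print_lineage(run_id: str) -> None:
    """Show how to get back to the exact data and code behind a run."""
    tags = MlflowClient().get_run(run_id).data.tags
    print("dataset version :", tags["data.version"])
    print("git commit      :", tags["pipeline.git_sha"])
    print("params hash     :", tags["pipeline.params_hash"])
    print()
    print("# reproduce the inputs with:")
    print(f"git checkout {tags['pipeline.git_sha']}")
    print("dvc checkout data/processed/dataset.parquet")
```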

Filtering runs in the UI

To find all ablation runs: filter by tags.pipeline.stage = 'ablation'

To find elo-only runs: filter by tags.pipeline.variant = 'elo_only'

To find all smoke runs: filter by tags.pipeline.run_kind = 'smoke'

To find the winning final model runs: filter by tags.pipeline.stage = 'final_train'
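
The same filters work programmatically through mlflow.search_runs, which returns a pandas DataFrame; for example (experiment name from this page):

```python
import mlflow

# All elo_only ablation runs in the production experiment.
ablation_runs = mlflow.search_runs(
    experiment_names=["matches_clf"],
    filter_string="tags.pipeline.stage = 'ablation' AND tags.pipeline.variant = 'elo_only'",
)
print(ablation_runs[["run_id", "start_time"]])
```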


What MLflow is not responsible for

  • Serving: models are loaded from the registry by PredictionService; MLflow tracking is not in the hot path.
  • Feature computation: features are DVC-tracked Parquet files, not MLflow datasets.
  • Monitoring / drift detection: Evidently is the planned monitoring tool; not yet integrated.
  • A/B testing: not implemented. Promotion is manual via the registry.

Running the MLflow UI

mlflow ui --port 5001

Navigate to the matches_clf experiment to see all training and tuning runs.