Experiment Tracking (MLflow)

Purpose

Document what MLflow is responsible for in this system, how runs are structured, what each training stage logs, and how traceability back to data and code is maintained.

MLflow is the observability and traceability layer for all ML training stages. It is not used for serving, feature storage, or monitoring.


Responsibilities

| MLflow responsibility | Covered |
| --- | --- |
| Experiment and run tracking | ✅ |
| Parameter and metric logging | ✅ |
| Model artifact storage | ✅ |
| Model registry (versioning + lifecycle) | ✅ — see Model Registry |
| Lineage to Git commit and DVC dataset version | ✅ |
| Serving / inference | ✗ — handled by FastAPI + Celery |
| Feature storage | ✗ — features are DVC-tracked Parquet files |
| Drift detection / monitoring | ✗ — planned, not yet integrated |

Experiments

Two experiments are used across all pipeline stages:

| Experiment name | Purpose |
| --- | --- |
| matches_clf | All production runs: train_eval, ablation, tuning, final_train |
| matches_clf_smoke | All smoke / fast-dev runs |

Ablation variants (elo_only, stats_only, full_no_h2h, full) are not separate experiments. They run inside the same experiment and are identified by run-level tags (pipeline.variant, features.profile). This keeps the experiment list short and makes cross-variant comparison straightforward from a single experiment view.

The experiment name is controlled by params.yaml → classification.experiment_name (shared with ablation.experiment_name).
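
As a minimal sketch, stage code could resolve the experiment name from params.yaml and set it before starting any runs. The loading helper below is illustrative, not the project's actual code; only the classification.experiment_name key comes from this page:

```python
import mlflow
import yaml

def resolve_experiment_name(params_path: str = "params.yaml") -> str:
    """Illustrative helper: read the experiment name from params.yaml."""
    with open(params_path) as fh:
        params = yaml.safe_load(fh)
    # "matches_clf" for production runs, "matches_clf_smoke" for smoke runs
    return params["classification"]["experiment_name"]

mlflow.set_experiment(resolve_experiment_name())
```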


Run structure

Run hierarchy

All pipeline stages use a two-level hierarchy where applicable:

  • Parent run (pipeline.scope=parent): groups a set of related child runs; holds summary params and pipeline context tags.
  • Child run (pipeline.scope=child): one run per model or Optuna trial; holds metrics, model artifact, and full lineage tags.

Standalone runs (final_train) use pipeline.scope=parent with no children.
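
A minimal sketch of the two-level structure using the fluent MLflow API (tag keys follow the tables below; run names and metric values are placeholders):

```python
import mlflow

# Parent run groups a set of related child runs and carries the pipeline context tags.
with mlflow.start_run(run_name="train_eval | frac=1.0 | feat=full"):
    mlflow.set_tags({"pipeline.scope": "parent", "pipeline.stage": "train_eval"})

    for model_name in ["xgb", "logreg"]:  # illustrative model list
        # One nested child run per model, holding metrics and the model artifact.
        with mlflow.start_run(run_name=f"model | {model_name}", nested=True):
            mlflow.set_tags({"pipeline.scope": "child", "model.family": model_name})
            mlflow.log_metric("holdout_logloss", 0.98)  # placeholder value
```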

Standard run tags

Every run (parent and child) carries:

| Tag | Values | Source |
| --- | --- | --- |
| pipeline.run_kind | smoke \| train_eval \| ablation \| tuning \| final_train | build_pipeline_context_tags |
| pipeline.stage | train_eval \| ablation \| tuning \| optuna_trial \| final_train | build_run_scope_tags |
| pipeline.scope | parent \| child | build_run_scope_tags |
| pipeline.variant | baseline \| elo_only \| stats_only \| full_no_h2h \| full | build_run_scope_tags |
| features.profile | full \| elo_only \| stats_only \| no_h2h | build_run_scope_tags |
| model.family | gradient_boosting \| linear \| dummy \| ... | build_run_scope_tags (child only) |
| pipeline.git_sha | git short SHA | build_pipeline_context_tags |
| pipeline.dvc_exp_name | DVC experiment name | build_pipeline_context_tags |
| pipeline.params_hash | 16-char SHA-256 of params.yaml | build_pipeline_context_tags |
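
A hedged sketch of what a build_pipeline_context_tags-style helper could produce (the project's real helper is not shown here; the git call, the 16-char hash truncation, and the DVC_EXP_NAME environment variable are assumptions consistent with the table above):

```python
import hashlib
import os
import subprocess

def build_pipeline_context_tags(run_kind: str, params_path: str = "params.yaml") -> dict:
    """Illustrative reconstruction of the pipeline context tags."""
    git_sha = subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()
    with open(params_path, "rb") as fh:
        params_hash = hashlib.sha256(fh.read()).hexdigest()[:16]  # 16-char SHA-256
    return {
        "pipeline.run_kind": run_kind,
        "pipeline.git_sha": git_sha,
        # Assumption: DVC exposes the experiment name via DVC_EXP_NAME during `dvc exp run`.
        "pipeline.dvc_exp_name": os.environ.get("DVC_EXP_NAME", ""),
        "pipeline.params_hash": params_hash,
    }
```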

Data lineage tags (child runs):

| Tag | Contents |
| --- | --- |
| data.version / data.hash | DVC MD5 of the dataset Parquet |
| data.source_bucket | MinIO bucket name |
| data.source_key | MinIO object key |
| data.source_etag | MinIO ETag (quotes stripped) |
| data.source_last_modified | Last-PUT timestamp |
| data.ingested_at | Ingestion timestamp from sidecar |
| data.train_start/end | Temporal bounds of training split |
| data.test_start/end | Temporal bounds of holdout split |
| data.train_rows / data.test_rows | Row counts |
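
As one example of where the data.version / data.hash values could come from, the MD5 that DVC records for the processed dataset can be read straight from its .dvc file (the file path is an assumption; the MinIO-related tags come from the ingestion sidecar and are not shown):

```python
import mlflow
import yaml

def dvc_md5(dvc_file: str = "data/processed/dataset.parquet.dvc") -> str:
    """Read the MD5 that DVC recorded for the tracked dataset."""
    with open(dvc_file) as fh:
        meta = yaml.safe_load(fh)
    return meta["outs"][0]["md5"]

# Inside an active MLflow run:
dataset_hash = dvc_md5()
mlflow.set_tags({"data.version": dataset_hash, "data.hash": dataset_hash})
```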

Run name format

| Stage | Scope | Example run name |
| --- | --- | --- |
| train_eval | parent | smoke \| train_eval \| frac=0.001 \| feat=full |
| train_eval | parent | train_eval \| frac=1.0 \| feat=full |
| ablation | parent | smoke \| ablation \| variant=elo_only \| frac=0.001 |
| ablation | parent | ablation \| variant=elo_only \| frac=1.0 |
| tuning | parent | smoke \| tuning \| frac=0.1 |
| tuning | child | trial \| 000 |
| final_train | parent | smoke \| final_train \| model=xgb |
| train_eval / ablation | child | model \| xgb |
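
The names are plain pipe-delimited strings, so a hypothetical builder matching the examples above could be as simple as:

```python
def build_run_name(*parts: str) -> str:
    """Join run-name components in the 'kind | stage | key=value' style shown above."""
    return " | ".join(parts)

build_run_name("smoke", "train_eval", "frac=0.001", "feat=full")
# -> 'smoke | train_eval | frac=0.001 | feat=full'
```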

classification_models stage

This stage creates one parent run per frac/variant combination, with nested child runs per model. It selects the best model by holdout log-loss and writes the winning run ID to data/models/run_id.json.
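
A hedged sketch of the selection step using mlflow.search_runs (the holdout_logloss metric key and the exact filter are assumptions; the experiment name and output path come from this page):

```python
import json
import mlflow

# Rank child runs by holdout log-loss, lowest first, and keep the best one.
runs = mlflow.search_runs(
    experiment_names=["matches_clf"],
    filter_string="tags.pipeline.scope = 'child' AND tags.pipeline.stage = 'train_eval'",
    order_by=["metrics.holdout_logloss ASC"],  # metric key is an assumption
    max_results=1,
)
best_run_id = runs.loc[0, "run_id"]

# Persist the winning run ID for downstream stages.
with open("data/models/run_id.json", "w") as fh:
    json.dump({"run_id": best_run_id}, fh)
```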

ablation_study stage

Runs inside the same matches_clf experiment as classification_models. Each feature subset gets its own parent run tagged with pipeline.variant and features.profile for filtering.

tune_xgb stage

One parent run ({run_kind} | tuning | frac={frac}), with one nested child run per Optuna trial. Each trial child run logs:

  • XGBoost parameters (xgb.*)
  • Mean CV log-loss across walk-forward folds (cv.logloss_mean)
  • Trial number

Parent run logs the best parameters (best.*) and best CV log-loss at study completion.
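
A minimal sketch of that structure with Optuna (the search space, the run_walk_forward_cv stub, and the best.cv_logloss metric name are assumptions; tag, param, and metric keys otherwise follow this page):

```python
import mlflow
import optuna

def run_walk_forward_cv(params: dict) -> float:
    """Stub standing in for the project's walk-forward CV; returns the mean fold log-loss."""
    return 0.98  # dummy value for illustration

def objective(trial: optuna.Trial) -> float:
    params = {
        "max_depth": trial.suggest_int("max_depth", 3, 10),  # illustrative search space
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
    }
    cv_logloss_mean = run_walk_forward_cv(params)

    # One nested child run per trial, under the active tuning parent run.
    with mlflow.start_run(run_name=f"trial | {trial.number:03d}", nested=True):
        mlflow.set_tags({"pipeline.scope": "child", "pipeline.stage": "optuna_trial"})
        mlflow.log_params({f"xgb.{k}": v for k, v in params.items()})
        mlflow.log_metric("cv.logloss_mean", cv_logloss_mean)
        mlflow.log_param("trial_number", trial.number)
    return cv_logloss_mean

# Inside the tuning parent run: optimise, then record the winners on the parent.
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)  # illustrative trial budget
mlflow.log_params({f"best.{k}": v for k, v in study.best_params.items()})
mlflow.log_metric("best.cv_logloss", study.best_value)  # metric name is an assumption
```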

final_train stage

The definitive training run. This is the only run that evaluates on the held-out test set.

Tags logged:

| Tag | Value |
| --- | --- |
| pipeline.stage | final_train |
| pipeline.scope | parent |
| pipeline.variant | baseline |
| model.family | Algorithm family (e.g., gradient_boosting) |

Parameters logged:

| Param group | Contents |
| --- | --- |
| model.* | Model name |
| data.* | Training rows, holdout rows, target column, fraction |
| features.* | Number of numeric and categorical features |
| best.* | Best hyperparameters from the tuning stage |

Metrics logged:

  • Holdout log-loss, accuracy, Brier score
  • ECE (expected calibration error) — before and after calibration when calibration is enabled; see the computation sketch after this list
  • Precision, recall, F1 per class
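
ECE has no sklearn built-in; a simple equal-width-bin version (one common definition, not necessarily the exact implementation used in this project) looks like:

```python
import numpy as np

def expected_calibration_error(y_true: np.ndarray, y_prob: np.ndarray, n_bins: int = 10) -> float:
    """Equal-width-bin ECE over the predicted top-class probability."""
    confidences = y_prob.max(axis=1)          # confidence of the predicted class
    predictions = y_prob.argmax(axis=1)
    accuracies = (predictions == y_true).astype(float)

    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(accuracies[mask].mean() - confidences[mask].mean())
    return ece
```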

Artifacts logged:

  • model/ — the serialised sklearn pipeline (or calibrated wrapper) as MLflow pyfunc
  • Calibration curves
  • Confusion matrix (multi-class)
  • Feature importance plot

Dataset input logged:

  • MLflow dataset object pointing to the training parquet with source path
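
Taken together, the artifact and dataset-input logging might look roughly like this (fitted_pipeline, train_df, and the plot file paths are assumptions passed in by the caller; the MLflow calls themselves are standard):

```python
import mlflow
import mlflow.data
import mlflow.sklearn

def log_final_train_outputs(fitted_pipeline, train_df) -> None:
    """Illustrative: log the model, dataset input, and diagnostic plots inside an active run."""
    # Log the fitted sklearn pipeline under model/ (sklearn + pyfunc flavours).
    mlflow.sklearn.log_model(fitted_pipeline, artifact_path="model")

    # Log the training dataset as an MLflow dataset input with its source path.
    train_dataset = mlflow.data.from_pandas(train_df, source="data/processed/dataset.parquet")
    mlflow.log_input(train_dataset, context="training")

    # Diagnostic plots (file names are illustrative).
    for plot in ("calibration_curve.png", "confusion_matrix.png", "feature_importance.png"):
        mlflow.log_artifact(f"reports/{plot}")
```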

Lineage chain

Each final_train run is traceable through:

MLflow run
  ├── tag: data.version = DVC hash of data/processed/dataset.parquet
  ├── tag: pipeline.git_sha = git short SHA
  ├── tag: pipeline.params_hash = SHA-256 of params.yaml
  ├── MLflow dataset input: source = filesystem path to dataset.parquet
  └── params: best.* = tuned hyperparameters from xgb_best_params.json

Given an MLflow run ID, you can recover:

  • The exact dataset version (via the data.version tag → dvc checkout)
  • The exact code version (via the pipeline.git_sha tag)
  • The exact configuration (via logged params + params.yaml in git)
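
A hedged sketch of recovering that lineage from a run ID with the MlflowClient (commands are printed rather than executed; tag names come from this page):

```python
from mlflow.tracking import MlflowClient

def print_lineage(run_id: str) -> None:
    """Show how to get back to the exact data and code behind a run."""
    tags = MlflowClient().get_run(run_id).data.tags
    print("dataset version :", tags["data.version"])
    print("git commit      :", tags["pipeline.git_sha"])
    print("params hash     :", tags["pipeline.params_hash"])
    print()
    print("# reproduce the inputs with:")
    print(f"git checkout {tags['pipeline.git_sha']}")
    print("dvc checkout data/processed/dataset.parquet")
```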

Filtering runs in the UI

To find all ablation runs: filter by tags.pipeline.stage = 'ablation'

To find elo-only runs: filter by tags.pipeline.variant = 'elo_only'

To find all smoke runs: filter by tags.pipeline.run_kind = 'smoke'

To find the winning final model runs: filter by tags.pipeline.stage = 'final_train'
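
The same filters work programmatically through mlflow.search_runs, which returns a pandas DataFrame; for example (experiment name from this page):

```python
import mlflow

# All elo_only ablation runs in the production experiment.
ablation_runs = mlflow.search_runs(
    experiment_names=["matches_clf"],
    filter_string="tags.pipeline.stage = 'ablation' AND tags.pipeline.variant = 'elo_only'",
)
print(ablation_runs[["run_id", "start_time"]])
```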


What MLflow is not responsible for

  • Serving: models are loaded from the registry by PredictionService; MLflow tracking is not in the hot path.
  • Feature computation: features are DVC-tracked Parquet files, not MLflow datasets.
  • Monitoring / drift detection: Evidently is the planned monitoring tool; not yet integrated.
  • A/B testing: not implemented. Promotion is manual via the registry.

Running the MLflow UI

mlflow ui --port 5001

Navigate to the matches_clf experiment to see all training and tuning runs.