Data Contracts & Quality Gates (Great Expectations)¶
Why data contracts¶
Upstream sources are unstable, so data quality cannot be assumed. Data contracts make explicit what “valid data” means for the downstream ML pipelines.
Great Expectations usage¶
Great Expectations is used to:
- validate schema consistency,
- enforce value ranges and nullability,
- detect distribution anomalies.
Expectations are defined for:
- raw datasets,
- processed datasets,
- feature tables.
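The check types above can be sketched as hand-rolled expectations over a batch of rows. This is plain Python illustrating the kind of rules Great Expectations expresses (schema, nullability, value ranges), not GE's actual API; the column names and bounds are hypothetical examples, not taken from the real contract.

```python
# Hand-rolled sketch of the expectation types listed above.
# Column names and bounds are hypothetical, not from the real contract.

rows = [
    {"user_id": "u1", "age": 34},
    {"user_id": "u2", "age": 29},
]

def expect_columns(rows, expected):
    """Schema consistency: every row has exactly the expected columns."""
    return all(set(r) == set(expected) for r in rows)

def expect_not_null(rows, column):
    """Nullability: no missing values in a critical column."""
    return all(r.get(column) is not None for r in rows)

def expect_between(rows, column, lo, hi):
    """Value range: a numeric column stays within agreed bounds."""
    return all(lo <= r[column] <= hi for r in rows)

checks = {
    "schema": expect_columns(rows, ["user_id", "age"]),
    "user_id_not_null": expect_not_null(rows, "user_id"),
    "age_in_range": expect_between(rows, "age", 0, 120),
}
```

In Great Expectations these would be declared as named expectations in a suite and evaluated against each batch of raw, processed, and feature data.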
Blocking vs non-blocking checks¶
- Blocking checks:
    - schema mismatch,
    - missing critical columns,
    - invalid identifiers.
- Non-blocking checks:
    - distribution drift,
    - outlier rate increase.
Blocking failures stop the ML pipeline.
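The gate logic can be sketched as follows: blocking failures raise and halt the pipeline, while non-blocking failures are surfaced as warnings and the run continues. The check names and the flat name-to-pass mapping are illustrative assumptions, not GE's result format.

```python
# Minimal sketch of a blocking vs non-blocking quality gate.
# Check names and result shape are illustrative, not GE's result format.

BLOCKING = {"schema_mismatch", "missing_critical_columns", "invalid_identifiers"}

def apply_quality_gate(results: dict) -> list:
    """results maps check name -> passed. Raises on any blocking failure;
    returns the remaining (non-blocking) failures to report as warnings."""
    failed = [name for name, passed in results.items() if not passed]
    blocking_failures = [name for name in failed if name in BLOCKING]
    if blocking_failures:
        raise RuntimeError(f"Blocking data-contract failures: {blocking_failures}")
    return failed  # non-blocking failures only; pipeline continues

warnings = apply_quality_gate({
    "schema_mismatch": True,
    "missing_critical_columns": True,
    "invalid_identifiers": True,
    "distribution_drift": False,  # non-blocking: logged, run continues
})
```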
Execution point¶
Data contracts are executed:
- as part of DVC pipelines,
- before feature engineering and training.
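A validation stage can be wired into `dvc.yaml` so that feature engineering only runs after the contract passes. This is a sketch under assumed names: the stage names, scripts, and paths (`scripts/run_checkpoint.py`, `expectations/raw_contract.json`, etc.) are hypothetical, not taken from the actual repository.

```yaml
# dvc.yaml — sketch of a validation stage gating feature engineering.
# Stage, script, and path names are hypothetical.
stages:
  validate_raw:
    cmd: python scripts/run_checkpoint.py --suite raw_contract
    deps:
      - data/raw
      - expectations/raw_contract.json
    outs:
      - reports/raw_validation.json
  featurize:
    cmd: python scripts/featurize.py
    deps:
      - data/raw
      - reports/raw_validation.json  # forces validation to run first
    outs:
      - data/features
```

Making the validation report an explicit dependency of `featurize` is what gives DVC the ordering: if the contract stage fails, no report is produced and the downstream stages never run.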
Contract as code¶
Expectations are versioned alongside code and datasets. Any change to contracts is reviewed and traceable.