MLflow Integration¶
Integrate DTA provenance tracking with MLflow for complete ML experiment and data lineage.
Overview¶
MLflow is an open-source platform for managing the ML lifecycle, including experimentation, reproducibility, and deployment. By combining MLflow with DTA provenance metadata, you get:
- Experiment tracking - Track ML experiments with full data provenance
- Reproducibility - Link model runs to exact data versions
- DTA compliance - Provenance metadata for regulatory requirements
- Complete lineage - Know exactly which data was used for each model
Installation¶
# Install with MLflow support
pip install 'dta-provenance[mlflow]'
# Or install MLflow manually (quote the spec so the shell
# does not treat >= as a redirect)
pip install 'mlflow>=2.0'
Quick Start¶
1. Set up MLflow tracking¶
# Use local tracking (default)
export MLFLOW_TRACKING_URI=file:./mlruns
# Or use remote tracking server
export MLFLOW_TRACKING_URI=http://mlflow-server:5000
2. Log provenance to MLflow¶
# Create your provenance metadata
cat > provenance.json <<EOF
{
  "source": {
    "datasetName": "Customer Churn Dataset Q1 2024",
    "providerName": "Internal CRM",
    "datasetUrl": "s3://data-bucket/customer-churn-q1.csv",
    "license": "Internal Use Only"
  },
  "provenance": {
    "dataGenerationMethod": "Exported from CRM system with SQL query",
    "dateDataGenerated": "2024-01-15T00:00:00Z",
    "dataType": "Tabular",
    "dataFormat": "CSV"
  },
  "use": {
    "intendedUse": "ML model training for churn prediction",
    "legalRightsToUse": "Internal use only",
    "sensitiveData": true,
    "sensitiveDataCategories": ["PII", "Behavioral"]
  }
}
EOF
# Log provenance to the active MLflow run
# (start a run first, e.g. with mlflow.start_run() in Python,
# or pass --run-id as shown below)
dta-provenance mlflow-log --metadata provenance.json
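Before logging, a quick sanity check that the hand-written JSON contains all three top-level DTA sections can catch mistakes early. A minimal sketch (the `missing_sections` helper is illustrative, not part of the dta-provenance API, which may validate more strictly):

```python
REQUIRED_SECTIONS = ("source", "provenance", "use")

def missing_sections(doc: dict) -> list:
    """Return the top-level DTA sections absent from a provenance document."""
    return [s for s in REQUIRED_SECTIONS if s not in doc]

# Example: a document that forgot the 'use' section
doc = {
    "source": {"datasetName": "Customer Churn Dataset Q1 2024"},
    "provenance": {"dataType": "Tabular", "dataFormat": "CSV"},
}
print(missing_sections(doc))  # ['use']
```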
3. Use in Python with MLflow¶
import mlflow
from pathlib import Path
from src.provenance import ProvenanceTracker, load_provenance_file
from src.integrations.mlflow_integration import MLflowProvenanceBridge
# Load provenance metadata
metadata = load_provenance_file(Path('provenance.json'))
# Initialize bridge
bridge = MLflowProvenanceBridge(Path('.'))
tracker = ProvenanceTracker(Path('.'))
# Start MLflow run with provenance tracking
with mlflow.start_run(run_name='churn-model-v1'):
    # Train model
    model = train_model(data)

    # Log model metrics
    mlflow.log_metric('accuracy', 0.95)
    mlflow.log_metric('f1_score', 0.92)

    # Create Git commit with MLflow tracking
    commit_hash, run_id = bridge.commit_with_mlflow_tracking(
        file_paths=[Path('data/training_data.csv')],
        metadata=metadata.to_dict(),
        message='Training data for churn model v1',
        tracker=tracker,
        experiment_name='churn-prediction'
    )

    print(f"Git commit: {commit_hash}")
    print(f"MLflow run: {run_id}")
How It Works¶
Provenance Metadata in MLflow¶
When you use mlflow-log, the DTA provenance metadata is stored in MLflow as tags:
# Tags stored in MLflow run
{
"dta.compliant": "true",
"dta.version": "1.0.0",
"dta.source.datasetName": "Customer Churn Dataset Q1 2024",
"dta.source.providerName": "Internal CRM",
"dta.provenance.dataType": "Tabular",
"dta.provenance.dataFormat": "CSV",
"dta.use.intendedUse": "ML model training for churn prediction",
"dta.use.sensitiveData": "True",
"git.commit_hash": "abc123def456...",
"git.commit_message": "Training data for churn model v1"
}
Additionally, the complete provenance metadata is stored as an artifact for full detail.
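The dot-separated tag names above follow directly from flattening the nested metadata, since MLflow tag values must be strings. A minimal sketch of that flattening (illustrative; the actual bridge may select or sanitize fields differently):

```python
def to_mlflow_tags(metadata: dict, prefix: str = "dta") -> dict:
    """Flatten nested provenance metadata into dot-separated MLflow tag names.

    Nested dicts recurse with an extended prefix, lists are comma-joined,
    and scalars are str()-converted (hence "True" rather than "true").
    """
    tags = {}
    for key, value in metadata.items():
        name = f"{prefix}.{key}"
        if isinstance(value, dict):
            tags.update(to_mlflow_tags(value, prefix=name))
        elif isinstance(value, list):
            tags[name] = ",".join(str(v) for v in value)
        else:
            tags[name] = str(value)
    return tags

tags = to_mlflow_tags({
    "source": {"datasetName": "Customer Churn Dataset Q1 2024"},
    "use": {"sensitiveData": True},
})
print(tags["dta.source.datasetName"])  # Customer Churn Dataset Q1 2024
print(tags["dta.use.sensitiveData"])   # True
```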
Bidirectional Linking¶
The integration creates bidirectional links between Git and MLflow:
- Git → MLflow: Git commit metadata includes mlflow_run_id
- MLflow → Git: MLflow run tags include git.commit_hash
This allows you to:
- Start from Git: Find which MLflow runs used a specific commit
- Start from MLflow: Find which Git commit contains the data for a run
- Audit trail: Complete bidirectional traceability
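Starting from Git, finding the linked runs amounts to searching MLflow for the git.commit_hash tag. A sketch of how that search filter might be built (assuming the tag layout shown above; tag names containing dots must be backtick-quoted in MLflow filter strings):

```python
def commit_filter(commit_hash: str) -> str:
    """Build an MLflow search filter matching runs tagged with a Git commit."""
    return f"tags.`git.commit_hash` = '{commit_hash}'"

# Starting from Git: find the MLflow runs that used the data in a commit.
# import mlflow
# runs = mlflow.search_runs(filter_string=commit_filter('abc123def456'))
print(commit_filter('abc123def456'))
```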
Usage Examples¶
Log to Specific Run¶
# Log to specific MLflow run ID
dta-provenance mlflow-log \
  --metadata provenance.json \
  --run-id abc123-run-id
Use Custom Tracking URI¶
# Log to remote MLflow server
dta-provenance mlflow-log \
  --metadata provenance.json \
  --tracking-uri http://mlflow-server:5000
Query Runs by Dataset¶
from pathlib import Path
from src.integrations.mlflow_integration import MLflowProvenanceBridge
# Initialize bridge
bridge = MLflowProvenanceBridge(Path('.'))
# Find all runs that used a specific dataset
runs = bridge.get_mlflow_runs_by_dataset('Customer Churn Dataset Q1 2024')
for run in runs:
    print(f"Run ID: {run['run_id']}")
    print(f"Status: {run['status']}")
    print(f"Metrics: {run['metrics']}")
Read Provenance from MLflow¶
from pathlib import Path
from src.integrations.mlflow_integration import MLflowProvenanceBridge
bridge = MLflowProvenanceBridge(Path('.'))
# Read provenance metadata from MLflow run
metadata = bridge.read_mlflow_provenance('run-id-123')
if metadata:
    print(f"Dataset: {metadata['source']['datasetName']}")
    print(f"Data Type: {metadata['provenance']['dataType']}")
    print(f"Sensitive: {metadata['use']['sensitiveData']}")
Workflow: ML Model Development¶
Complete workflow for ML model development with MLflow + DTA:
import mlflow
from pathlib import Path
from src.provenance import ProvenanceTracker, ProvenanceMetadata
from src.integrations.mlflow_integration import MLflowProvenanceBridge
# 1. Create provenance metadata
metadata = ProvenanceMetadata(
    source={
        "datasetName": "Customer Churn Training Set",
        "providerName": "Data Engineering Team",
        "datasetUrl": "s3://data-lake/churn/train.csv"
    },
    provenance={
        "dataGenerationMethod": "Stratified sampling from CRM database",
        "dateDataGenerated": "2024-01-15T00:00:00Z",
        "dataType": "Tabular",
        "dataFormat": "CSV"
    },
    use={
        "intendedUse": "Train churn prediction model",
        "legalRightsToUse": "Internal use only",
        "sensitiveData": True,
        "sensitiveDataCategories": ["PII", "Behavioral"]
    }
)
# 2. Initialize tracking
tracker = ProvenanceTracker(Path('.'))
bridge = MLflowProvenanceBridge(Path('.'))
# 3. Train model with full tracking
mlflow.set_experiment('churn-prediction')
with mlflow.start_run(run_name='random-forest-v1') as run:
    # Log parameters
    mlflow.log_param('model_type', 'RandomForest')
    mlflow.log_param('n_estimators', 100)

    # Train model
    model = train_random_forest(data, n_estimators=100)

    # Log metrics
    accuracy = evaluate_model(model, test_data)
    mlflow.log_metric('accuracy', accuracy)

    # Create Git commit with MLflow tracking
    commit_hash, run_id = bridge.commit_with_mlflow_tracking(
        file_paths=[Path('data/train.csv')],
        metadata=metadata.to_dict(),
        message=f'Training data for RandomForest v1 (accuracy: {accuracy:.3f})',
        tracker=tracker
    )

    # Log model
    mlflow.sklearn.log_model(model, 'model')

    print("Model trained and logged!")
    print(f"Git commit: {commit_hash}")
    print(f"MLflow run: {run_id}")
# 4. Later: retrieve full lineage
lineage = bridge.read_mlflow_provenance(run_id)
print(f"Model was trained on: {lineage['source']['datasetName']}")
Advanced Features¶
Experiment Organization¶
# Organize by project
mlflow.set_experiment('customer-analytics/churn-prediction')
# Organize by model type
mlflow.set_experiment('models/classification')
# Organize by data version
mlflow.set_experiment('data-v2/experiments')
Comparing Runs¶
from pathlib import Path
from src.integrations.mlflow_integration import MLflowProvenanceBridge

bridge = MLflowProvenanceBridge(Path('.'))
# Get all runs for a dataset
runs = bridge.get_mlflow_runs_by_dataset('Customer Churn Dataset')
# Compare metrics across runs
for run in runs:
    print(f"Run {run['run_id']}: accuracy={run['metrics'].get('accuracy', 'N/A')}")
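Once the runs for a dataset are collected, picking the strongest model is a small reduction over the metric dicts. A sketch assuming the run shape returned by `get_mlflow_runs_by_dataset` above:

```python
def best_run(runs, metric="accuracy"):
    """Return the run with the highest value for a metric,
    skipping runs that never logged it. Returns None if no run qualifies."""
    scored = [r for r in runs if metric in r.get("metrics", {})]
    return max(scored, key=lambda r: r["metrics"][metric], default=None)

runs = [
    {"run_id": "a1", "metrics": {"accuracy": 0.91}},
    {"run_id": "b2", "metrics": {"accuracy": 0.95}},
    {"run_id": "c3", "metrics": {}},  # failed run, no metrics logged
]
print(best_run(runs)["run_id"])  # b2
```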
CI/CD Integration¶
# .github/workflows/train-model.yml
name: Train Model

on: [push]

jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'
      - name: Install dependencies
        run: |
          pip install 'dta-provenance[mlflow]'
          pip install -r requirements.txt
      - name: Train model with provenance
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
        run: |
          python train_model.py
      - name: Log provenance to MLflow
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
        run: |
          dta-provenance mlflow-log --metadata provenance.json
Benefits¶
Data Scientists¶
- Track which data version produced which model
- Reproduce exact results from any experiment
- Compare models trained on different datasets
- Meet compliance requirements automatically
MLOps Engineers¶
- Immutable audit trail of all experiments
- Bidirectional Git ↔ MLflow linking
- Integration with CI/CD pipelines
- Automated compliance metadata
Compliance Officers¶
- Complete provenance metadata per DTA standards
- Cryptographic proof via Git commits
- Query which data was used for which models
- Evidence for regulatory audits
Remote Tracking Server¶
Set up a remote MLflow tracking server for team collaboration:
# Start MLflow server
mlflow server \
  --backend-store-uri postgresql://user:pass@localhost/mlflow \
  --default-artifact-root s3://mlflow-artifacts \
  --host 0.0.0.0 \
  --port 5000
# Configure clients to use remote server
export MLFLOW_TRACKING_URI=http://mlflow-server:5000
# Or in Python
import mlflow
mlflow.set_tracking_uri('http://mlflow-server:5000')
Limitations¶
- MLflow required - MLflow must be installed wherever provenance is logged, locally and in CI/CD
- Active run needed - Logging requires an active MLflow run context (or an explicit run ID)
- Network access - Remote tracking server requires network connectivity
- Storage costs - Remote artifact storage (S3, Azure, GCS) costs apply
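Because MLflow is an optional extra, scripts that use the bridge can detect its absence and degrade gracefully rather than crash on import. A sketch using only the standard library:

```python
import importlib.util

# True when the optional mlflow dependency is installed
MLFLOW_AVAILABLE = importlib.util.find_spec("mlflow") is not None

if not MLFLOW_AVAILABLE:
    print("MLflow not installed; skipping experiment tracking. "
          "Install it with: pip install 'dta-provenance[mlflow]'")
```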
Troubleshooting¶
"MLflow is not installed"¶
"No active MLflow run"¶
Start an MLflow run before logging:
import mlflow
with mlflow.start_run():
    # Now you can log provenance
    bridge.log_provenance_to_mlflow(metadata)
Or provide a run ID explicitly:
dta-provenance mlflow-log --metadata provenance.json --run-id <run-id>
"Cannot connect to tracking server"¶
Check the MLflow tracking URI and verify that the server is reachable:
echo $MLFLOW_TRACKING_URI
curl http://mlflow-server:5000/health
"Artifact logging failed"¶
Ensure artifact storage is configured and accessible:
# Check MLflow server storage options
mlflow server --help
# Test artifact logging against an existing run
mlflow artifacts log-artifact --local-file test.txt --run-id <run-id>
Best Practices¶
- Always log provenance - Don't train models without data provenance
- Use experiments - Organize runs into meaningful experiments
- Bidirectional linking - Use commit_with_mlflow_tracking for automatic linking
- Query by dataset - Track which datasets are most effective
- Version everything - Version code (Git), data (DVC/Git), and experiments (MLflow)
- Document assumptions - Include data quality notes in provenance metadata
Related Resources¶
- dta-provenance mlflow-log command
- MLflow Documentation
- DTA Standards
- DVC Integration - Version data files with DVC