DVC Integration

Integrate DTA provenance tracking with DVC (Data Version Control) for complete data lineage.

Overview

DVC is a version control system for data files, ML models, and datasets. By combining DVC with DTA provenance metadata, you get:

  • Data versioning - Track large data files outside Git
  • Content-addressable storage - Immutable data identified by hash
  • DTA compliance - Provenance metadata for regulatory requirements
  • Complete lineage - Know exactly which data version was used

Installation

# Install with DVC support
pip install 'dta-provenance[dvc]'

# Or install manually
pip install 'dvc>=3.0'

Quick Start

1. Initialize DVC in your repository

git init
dvc init

2. Track a file with DVC and provenance

# Create your provenance metadata
cat > provenance.json <<EOF
{
  "source": {
    "datasetName": "Customer Data Q1 2024",
    "providerName": "Internal CRM"
  },
  "provenance": {
    "dataGenerationMethod": "Exported from CRM system",
    "dateDataGenerated": "2024-01-15",
    "dataType": "Tabular",
    "dataFormat": "CSV"
  },
  "use": {
    "intendedUse": "Customer analytics",
    "legalRightsToUse": "Internal use only",
    "sensitiveData": true,
    "sensitiveDataCategories": ["PII", "Financial"]
  }
}
EOF

# Track file with DVC and enrich metadata
dta-provenance dvc-track customers.csv \
    --metadata provenance.json \
    --output provenance-with-dvc.json

3. Commit with enriched provenance

git add customers.csv.dvc .gitignore provenance-with-dvc.json
dta-provenance commit customers.csv.dvc provenance-with-dvc.json \
    --metadata provenance-with-dvc.json \
    --message "Add Q1 customer data with DVC tracking"

How It Works

DVC Metadata Enrichment

When you use dvc-track, the DTA provenance metadata is enriched with DVC-specific fields:

{
  "source": { ... },
  "provenance": { ... },
  "use": { ... },
  "metadata": {
    "dvc": {
      "tracked": true,
      "dvc_file": "customers.csv.dvc",
      "md5": "abc123def456...",
      "size": "1048576",
      "path": "customers.csv"
    }
  }
}

This allows you to:

  • Verify data integrity - Compare MD5 hashes
  • Track file versions - DVC file changes are tracked in Git
  • Reconstruct exact state - Use DVC checkout with commit hash
  • Audit compliance - Full provenance + version history
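For integrity verification, the recorded hash can be checked directly. The sketch below (a minimal example; the helper name is ours, and the `metadata.dvc.md5` path comes from the enriched JSON shown above) recomputes a file's MD5 and compares it against the enriched provenance:

```python
import hashlib
import json
from pathlib import Path

def verify_dvc_integrity(data_file: Path, enriched_json: Path) -> bool:
    """Compare a file's MD5 against the hash recorded in enriched provenance."""
    recorded = json.loads(enriched_json.read_text())["metadata"]["dvc"]["md5"]
    digest = hashlib.md5()
    with data_file.open("rb") as f:
        # Hash in chunks so large data files don't need to fit in memory
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest() == recorded
```

A `False` result means the workspace copy no longer matches the version the provenance record was written for.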

DVC File Structure

DVC creates .dvc files that are tracked in Git:

# customers.csv.dvc
outs:
- md5: abc123def456
  size: 1048576
  path: customers.csv

The actual data is stored in .dvc/cache/ and can be pushed to remote storage (S3, GCS, Azure, etc.).
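Given the `.dvc` file format above, the cached copy of a file can be located from its hash. The sketch below is an assumption-laden illustration: it parses only the simple single-output layout shown here, and it assumes the DVC 3.x cache layout (`.dvc/cache/files/md5/<first two hex chars>/<rest>`); older DVC versions store objects directly under `.dvc/cache/`:

```python
import re
from pathlib import Path

def cache_path_for(dvc_file: Path, repo_root: Path) -> Path:
    """Resolve the cache location for the first output listed in a .dvc file."""
    text = dvc_file.read_text()
    # Pull the 32-hex-digit MD5 out of the simple single-output format
    match = re.search(r"md5:\s*([0-9a-f]{32})", text)
    if match is None:
        raise ValueError(f"no md5 hash found in {dvc_file}")
    md5 = match.group(1)
    # DVC 3.x layout; older versions omit the files/md5/ prefix
    return repo_root / ".dvc" / "cache" / "files" / "md5" / md5[:2] / md5[2:]
```

This is mainly useful for audit tooling that needs to confirm a cached object exists without shelling out to DVC.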

Usage Examples

Track Multiple Files

dta-provenance dvc-track train.csv test.csv \
    --metadata provenance.json \
    --output provenance-enriched.json

Track Without Existing Metadata

If you don't have provenance metadata yet, you can still track with DVC and create metadata later:

# Track with DVC first
dta-provenance dvc-track dataset.csv

# Creates enriched metadata with DVC info only
# Add full DTA provenance fields later

Check DVC Status Programmatically

from pathlib import Path
from src.integrations.dvc_integration import DVCProvenanceBridge

# Initialize bridge
bridge = DVCProvenanceBridge(Path('.'))

# Get DVC status
status = bridge.get_dvc_status()
print(f"Clean: {status['is_clean']}")

# Read DVC metadata for a file
dvc_info = bridge.read_dvc_provenance(Path('data.csv'))
if dvc_info:
    print(f"MD5: {dvc_info['metadata']['md5']}")

Enrich Existing Metadata

If you already have DVC-tracked files and want to add provenance:

from pathlib import Path
from src.integrations.dvc_integration import DVCProvenanceBridge
from src.provenance import load_provenance_file

# Load existing provenance
metadata = load_provenance_file(Path('provenance.json'))
metadata_dict = metadata.to_dict()

# Enrich with DVC info
bridge = DVCProvenanceBridge(Path('.'))
enriched = bridge.enrich_metadata_from_dvc(Path('data.csv'), metadata_dict)

# Save enriched metadata
import json
with open('provenance-with-dvc.json', 'w') as f:
    json.dump(enriched, f, indent=2)

Workflow: Data Science Project

Complete workflow for a data science project with DVC + DTA:

# 1. Initialize project
git init
dvc init

# 2. Add raw data with provenance
dta-provenance dvc-track raw_data.csv \
    --metadata raw_data_provenance.json \
    --output raw_data_enriched.json

git add raw_data.csv.dvc raw_data_enriched.json .gitignore
git commit -m "Add raw data with provenance"

# 3. Process data (creates processed_data.csv)
python process_data.py

# 4. Track processed data with updated provenance
cat > processed_provenance.json <<EOF
{
  "source": {
    "datasetName": "Processed Customer Data",
    "providerName": "Internal CRM",
    "datasetVersion": "1.0-processed"
  },
  "provenance": {
    "dataGenerationMethod": "Cleaned and normalized from raw CRM export",
    "dateDataGenerated": "2024-01-16",
    "dataType": "Tabular",
    "dataFormat": "CSV",
    "qualityIndicators": "Removed duplicates, normalized dates"
  },
  "use": {
    "intendedUse": "ML model training",
    "legalRightsToUse": "Internal use only",
    "sensitiveData": true,
    "sensitiveDataCategories": ["PII"],
    "privacyMeasures": "Anonymized customer IDs"
  }
}
EOF

dta-provenance dvc-track processed_data.csv \
    --metadata processed_provenance.json \
    --output processed_enriched.json

git add processed_data.csv.dvc processed_enriched.json
dta-provenance commit processed_data.csv.dvc \
    --metadata processed_enriched.json \
    --message "Add processed data"

# 5. Push data to remote storage
dvc remote add -d storage s3://my-bucket/dvc-cache
dvc push

git push

Remote Storage

DVC supports multiple remote storage backends:

# Amazon S3
dvc remote add -d storage s3://mybucket/dvcstore

# Google Cloud Storage
dvc remote add -d storage gs://mybucket/dvcstore

# Azure Blob Storage
dvc remote add -d storage azure://mycontainer/dvcstore

# SSH/SFTP
dvc remote add -d storage ssh://user@server/path

# Local or network filesystem
dvc remote add -d storage /mnt/dvc-storage

Push and pull data:

# Push local data to remote
dvc push

# Pull data from remote
dvc pull

# Get specific version
git checkout abc123
dvc checkout

Benefits

Data Scientists

  • Track large datasets without bloating Git
  • Reproduce exact results from any commit
  • Share data efficiently with team
  • Meet compliance requirements automatically

MLOps Engineers

  • Immutable data artifacts with cryptographic hashes
  • Audit trail of all data transformations
  • Integration with CI/CD pipelines
  • Consistent environments across team

Compliance Officers

  • Complete provenance metadata per DTA standards
  • Cryptographic proof of data integrity
  • Full audit trail in Git history
  • Evidence for regulatory audits

Limitations

  • DVC required - Both locally and in CI/CD
  • Storage costs - Need remote storage for team collaboration
  • Initial setup - Requires DVC initialization and remote configuration
  • Learning curve - Team must understand DVC commands

Troubleshooting

"DVC is not installed"

pip install 'dta-provenance[dvc]'
# or
pip install 'dvc>=3.0'

"Not a DVC repository"

# Initialize DVC first
dvc init
git commit -m "Initialize DVC"

"File is not DVC-tracked"

Ensure the file has been added with dvc add, or use the dvc-track command, which adds it automatically.

"DVC command failed"

Check DVC installation and configuration:

dvc version
dvc status
dvc remote list

Best Practices

  1. Always enrich metadata - Don't track with DVC alone; add provenance metadata
  2. Commit .dvc files - Always commit .dvc files and .gitignore to Git
  3. Use remote storage - Configure DVC remote for team collaboration
  4. Version processed data - Track both raw and processed datasets
  5. Document transformations - Update provenance metadata when processing data
  6. Test data integrity - Verify MD5 hashes after pulling data
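Practice 6 can be automated after a dvc pull. The sketch below (the helper name is ours, and it assumes the simple single-output .dvc layout shown earlier on this page) walks every .dvc file in the repository and checks the referenced workspace file against its recorded MD5:

```python
import hashlib
import re
from pathlib import Path

def verify_pulled_data(repo_root: Path) -> dict:
    """Check every single-output .dvc file against its workspace file's MD5.

    Returns a mapping of data-file path -> True/False (False also covers
    files that are missing from the workspace).
    """
    results = {}
    for dvc_file in repo_root.rglob("*.dvc"):
        text = dvc_file.read_text()
        md5_m = re.search(r"md5:\s*([0-9a-f]{32})", text)
        path_m = re.search(r"path:\s*(\S+)", text)
        if not (md5_m and path_m):
            continue  # skip .dvc files in formats this sketch doesn't parse
        data_file = dvc_file.parent / path_m.group(1)
        if not data_file.exists():
            results[str(data_file)] = False
            continue
        actual = hashlib.md5(data_file.read_bytes()).hexdigest()
        results[str(data_file)] = actual == md5_m.group(1)
    return results
```

Running this in CI after dvc pull gives a quick pass/fail signal before any downstream processing touches the data.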