DTA Provenance Standards Demo

What is This?

A comprehensive demonstration of data provenance tracking using the official Data & Trust Alliance standards v1.0.0.

It includes two complete implementations:

  1. Git-Native Approach - Cryptographic audit logs using Git commits
  2. Blockchain Approach - Smart contracts on Ethereum/Polygon

Use it to understand when blockchain is warranted and when a traditional solution is enough for data provenance in AI/ML pipelines, supply chains, and regulated industries.

Why This Project?

  • Official Standards - Implements DTA v1.0.0 specification
  • Production Quality - Comprehensive tests, CLI tools, extensive documentation
  • Real-World Examples - Healthcare, ML training, IoT sensors, financial data
  • Educational - Learn what works (and what doesn't) in provenance tracking
  • Honest Comparison - Shows why most projects DON'T need blockchain

Features

Git-Native Implementation

from src.provenance import ProvenanceTracker, ProvenanceMetadata

# Create DTA-compliant metadata
metadata = ProvenanceMetadata(
    source={"datasetName": "Training Data", "providerName": "ML Team"},
    provenance={"dataGenerationMethod": "SQL export", ...},
    use={"intendedUse": "Model training", ...}
)

# Commit with provenance
tracker = ProvenanceTracker("./my-project")
commit_hash = tracker.commit_with_provenance(
    ["dataset.csv"],
    metadata,
    "Add training data v1.0"
)

# Verify integrity
is_valid, message = tracker.verify_integrity(commit_hash)
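
The snippet above does not show what the integrity check does internally. As a rough illustration of the general idea, and not the project's actual `verify_integrity` implementation, the sketch below records a SHA-256 digest of a file and later recomputes it to detect changes. The helper names `file_sha256` and `check_file` are hypothetical and exist only for this example.

```python
import hashlib
from pathlib import Path
from typing import Tuple


def file_sha256(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large datasets are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_file(path: Path, recorded_digest: str) -> Tuple[bool, str]:
    """Compare the file's current digest with a digest recorded earlier (hypothetical helper)."""
    current = file_sha256(path)
    if current == recorded_digest:
        return True, "digest matches"
    return False, f"digest mismatch: expected {recorded_digest}, got {current}"
```

Returning a `(bool, str)` pair mirrors the `is_valid, message` shape used above, so a caller can both branch on the result and log a readable reason.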

Blockchain Implementation

// Register provenance on-chain
function registerProvenance(
    string memory _datasetName,
    string memory _metadataURI,  // IPFS/Arweave
    bytes32 _metadataHash
) public returns (bytes32 recordId)
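
The `_metadataHash` parameter is a 32-byte value, so the caller has to derive it off-chain before registering a record. The source does not say which hash function or serialization the project uses. The sketch below assumes SHA-256 over a canonical JSON encoding (sorted keys, compact separators), and the function names `metadata_hash` and `to_bytes32_hex` are hypothetical. Ethereum tooling often uses Keccak-256 instead, which Python's `hashlib` does not provide (`hashlib.sha3_256` is the standardized SHA-3 variant and yields different digests).

```python
import hashlib
import json


def metadata_hash(metadata: dict) -> bytes:
    """Return a 32-byte SHA-256 digest of the metadata (assumed scheme, not the project's)."""
    # Sorting keys and fixing separators makes the encoding independent of dict ordering.
    canonical = json.dumps(
        metadata, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).digest()


def to_bytes32_hex(digest: bytes) -> str:
    """Format a 32-byte digest as the 0x-prefixed hex string commonly used for bytes32."""
    if len(digest) != 32:
        raise ValueError("expected a 32-byte digest")
    return "0x" + digest.hex()
```

Hashing a canonical form matters because two JSON documents with the same fields in a different order would otherwise produce different hashes and fail verification.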

CLI Tools

# Validate DTA standards
dta-provenance validate metadata.json

# Commit with provenance
dta-provenance commit dataset.csv \
  --metadata provenance.json \
  --message "Add training data"

# Verify integrity
dta-provenance verify HEAD

# Generate audit trail
dta-provenance trace dataset.csv
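
To give a feel for what a validation step might check, here is a minimal structural sketch. It only tests for the three top-level sections used in the Python example above (`source`, `provenance`, `use`). It is not the `dta-provenance validate` implementation, and the full DTA standard may well require more fields than this. The names `REQUIRED_SECTIONS`, `find_problems`, and `validate_text` are hypothetical.

```python
import json
from typing import List

# Assumption: these are the top-level sections seen in the example metadata above.
REQUIRED_SECTIONS = ("source", "provenance", "use")


def find_problems(metadata: dict) -> List[str]:
    """Return a list of structural problems; an empty list means none were found."""
    problems = []
    for section in REQUIRED_SECTIONS:
        value = metadata.get(section)
        if value is None:
            problems.append(f"missing section: {section}")
        elif not isinstance(value, dict):
            problems.append(f"section is not an object: {section}")
    return problems


def validate_text(text: str) -> List[str]:
    """Parse JSON text and run the structural checks on it."""
    try:
        metadata = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(metadata, dict):
        return ["top-level value is not an object"]
    return find_problems(metadata)
```

Collecting every problem in a list, instead of stopping at the first, lets a user fix a metadata file in one pass.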

Use Cases

Healthcare Imaging

Complete HIPAA-compliant provenance for medical imaging datasets with de-identification documentation.

ML Training Data

Track provenance of training datasets from HuggingFace with license compliance verification.

IoT Sensor Streams

Real-time environmental monitoring with sensor calibration and quality indicators.

Financial Transactions

Anonymized payment data with multi-layer privacy protection and risk assessment.

When to Use What?

| Scenario | Recommendation | Why? |
| --- | --- | --- |
| Internal ML pipeline | Git-Native | Fast, free, integrates with existing workflows |
| Cross-company supply chain | Blockchain (maybe) | ⚠️ Only if trust is an issue AND you can't use APIs |
| Single organization | Git-Native | No need for blockchain's trust properties |
| Regulatory audit trail | Either | Both provide cryptographic integrity |
| High-frequency updates | Git-Native | No gas fees, instant commits |
| Public transparency | Blockchain | Immutable, publicly verifiable |

Installation

git clone https://github.com/Ricoledan/dta-provenance-demo.git
cd dta-provenance-demo

# Option 1: Nix
nix develop

# Option 2: Docker
docker-compose up -d
# Access Jupyter: http://localhost:8888
# Blockchain RPC: http://localhost:8545

# Option 3: Manual setup
# Git-native
cd git-native
pip install -r requirements.txt
pip install -e .

# Blockchain (run from the git-native directory entered above)
cd ../blockchain
npm install

License

MIT License - see LICENSE for details.

The DTA standards are used under their original license. See Credits for full attribution.


Ready to get started? See the Quick Start Guide.