Quick Start
Overview¶
This project demonstrates data provenance tracking using the official Data & Trust Alliance (DTA) standards v1.0.0. It provides production-ready implementations across multiple platforms and tools, showing when and how to implement provenance tracking in real-world applications.
Two Complete Implementations¶
- Git-Native Approach - Cryptographic audit logs using Git commits with trailers
- Blockchain Approach - Immutable smart contracts on Ethereum-compatible networks
Key Capabilities¶
- Official DTA v1.0.0 Standards - Full implementation of Data & Trust Alliance specifications
- Production Quality - 156 passing tests with 67% code coverage
- Multiple Integration Points - DVC, MLflow, SBOM, API server, VS Code extension
- Real-World Examples - Healthcare imaging, ML training, IoT sensors, financial transactions
- Complete Documentation - Comprehensive guides, API references, and tutorials
- Honest Technical Comparison - Detailed analysis of when blockchain is (and isn't) appropriate
Note: The official DTA JSON Schema file is incomplete (missing closing brace). This project follows the DTA specification correctly, but automated schema validation against the official schema is not available. See standards/official/README.md for details.
Table of Contents¶
- Quick Start
- Features
- Architecture
- Examples
- Project Structure
- Installation
- Running Tests
- Deployment
- Contributing
- License
Quick Start¶
Docker Compose (Recommended)¶
The fastest way to get started with all services:
# Clone the repository
git clone https://github.com/Ricoledan/dta-provenance-demo.git
cd dta-provenance-demo/git-native
# Start all services
docker-compose up -d
# Access services
# - API Server: http://localhost:8000
# - API Docs: http://localhost:8000/docs
# - Dashboard: http://localhost:3001
Git-Native CLI¶
# Install Python package
cd git-native
pip install -e .
# Validate DTA compliance
dta-provenance validate ../standards/examples/healthcare-imaging.json
# Create test repository
mkdir my-project && cd my-project
git init
# Commit with provenance
echo "data" > dataset.csv
dta-provenance commit dataset.csv \
--metadata ../../standards/examples/healthcare-imaging.json \
--message "Add training dataset"
# Verify integrity
dta-provenance verify HEAD
# Trace lineage
dta-provenance trace dataset.csv
Blockchain Deployment¶
# Install dependencies
cd blockchain
npm install
# Start local Hardhat network
npx hardhat node
# Deploy contract (in another terminal)
npx hardhat run scripts/deploy.js --network localhost
# Run tests
npx hardhat test
# Deploy to Polygon testnet
cp .env.example .env # Add your private key and RPC URL
npm run deploy:polygon-amoy
Features¶
Core Implementations¶
Git-Native Provenance Tracking¶
Python-based implementation using Git commits with structured trailers.
Features:
- DTA v1.0.0 compliant metadata storage in Git commit messages
- Cryptographic integrity verification using SHA-256 hashing
- Complete audit trail generation from Git history
- GPG/SSH signature support for commits
- CLI tool with rich terminal output
- Python library for programmatic access
Commands:
dta-provenance commit # Create provenance-tracked commits
dta-provenance verify # Verify cryptographic integrity
dta-provenance trace # Generate audit trails
dta-provenance validate # Check DTA compliance
dta-provenance show # Display metadata from commits
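The trailer-based storage described above can be sketched in a few lines of Python. This is an illustration only, not the project's actual code: the trailer keys `DTA-Dataset-Name` and `DTA-Metadata-Hash` and both function names are hypothetical, and the canonical JSON encoding is one reasonable choice rather than the one the tool necessarily uses.

```python
import hashlib
import json


def metadata_hash(metadata: dict) -> str:
    """Return the SHA-256 hex digest of a canonical JSON encoding."""
    # Sorting keys and fixing separators makes the digest independent of
    # key order and whitespace in the original file.
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_commit_message(subject: str, metadata: dict) -> str:
    """Append provenance trailers to a commit subject line."""
    # Trailer keys are hypothetical; the project may use different names.
    trailers = [
        f"DTA-Dataset-Name: {metadata['source']['datasetName']}",
        f"DTA-Metadata-Hash: {metadata_hash(metadata)}",
    ]
    # Git expects a blank line between the message body and the trailer block.
    return subject + "\n\n" + "\n".join(trailers) + "\n"
```

Hashing a canonical encoding rather than the raw file bytes means that reordering keys or reindenting the JSON does not change the recorded digest.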
Blockchain Provenance Registry¶
Solidity smart contract implementation for immutable provenance records.
Features:
- ERC-compliant smart contract on Ethereum/Polygon/Arbitrum
- On-chain metadata hash verification
- Provider-based access control
- Event emission for off-chain indexing
- Gas-optimized operations (~228K for registration)
- Multi-network deployment support
Networks Supported:
- Ethereum Mainnet
- Polygon (Mainnet & Amoy Testnet)
- Arbitrum One (Mainnet & Sepolia Testnet)
- Local Hardhat Network
Advanced Integrations¶
1. Pre-commit Hooks¶
Automated validation integrated with the pre-commit framework.
Features:
- Automatic metadata validation on file staging
- Commit message trailer verification
- DTA compliance checking before commits
- Configurable validation rules
- Works with existing pre-commit configurations
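As a rough illustration of commit message trailer verification, the sketch below parses the final paragraph of a commit message and reports which required trailers are absent. The trailer names and function names are hypothetical and are not taken from the project's hooks.py.

```python
import re

# Hypothetical trailer names; substitute whatever keys the project writes.
REQUIRED_TRAILERS = ("DTA-Dataset-Name", "DTA-Metadata-Hash")
TRAILER_LINE = re.compile(r"^([A-Za-z0-9-]+):\s*(.+)$")


def parse_trailers(message: str) -> dict:
    """Collect `Key: value` lines from the last paragraph of a commit message."""
    paragraphs = [p for p in message.strip().split("\n\n") if p.strip()]
    if len(paragraphs) < 2:
        return {}  # a lone subject line has no trailer block
    trailers = {}
    for line in paragraphs[-1].splitlines():
        match = TRAILER_LINE.match(line.strip())
        if match:
            trailers[match.group(1)] = match.group(2).strip()
    return trailers


def missing_trailers(message: str) -> list:
    """Return the required trailer keys that the message does not contain."""
    found = parse_trailers(message)
    return [key for key in REQUIRED_TRAILERS if key not in found]
```

A commit-msg hook built on this idea would read the message file Git passes to it and exit with a non-zero status when `missing_trailers` returns a non-empty list.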
Installation:
2. DVC Integration¶
Bridge between DVC data versioning and DTA provenance metadata.
Features:
- Automatic DVC file tracking with provenance enrichment
- Metadata storage in metadata.dvc fields
- Hash verification across Git and DVC
- Support for remote storage backends (S3, GCS, Azure)
- CLI command for streamlined workflows
Usage:
pip install 'dta-provenance[dvc]'
dta-provenance dvc-track dataset.csv \
--metadata provenance.json \
--output enriched-metadata.json
3. MLflow Integration¶
Bidirectional linking between Git commits and MLflow experiment runs.
Features:
- Git commits store mlflow_run_id in metadata
- MLflow runs store git.commit_hash in tags
- Complete DTA metadata logged as MLflow artifacts
- Query runs by dataset name
- Experiment tracking with provenance compliance
Usage:
4. SBOM Integration¶
Software Bill of Materials generation with CycloneDX format.
Features:
- Generate SBOMs from Python dependencies
- Link SBOM to provenance metadata
- Dependency analysis and license tracking
- Security scanning integration (Trivy, Grype)
- NTIA minimum elements compliance
Usage:
pip install 'dta-provenance[sbom]'
dta-provenance sbom-generate \
--project-name my-ml-project \
--metadata provenance.json
5. REST API Server¶
FastAPI-based server for programmatic provenance queries.
Features:
- 5 RESTful endpoints for provenance operations
- OpenAPI/Swagger documentation
- CORS support for web applications
- Pydantic validation for type safety
- Docker deployment ready
Endpoints:
- GET /health - Health check
- GET /provenance/{commit_hash} - Retrieve metadata
- GET /audit-trail/{file_path} - Full audit history
- POST /validate - DTA compliance validation
- GET /lineage/{file_path} - Lineage graph data
Usage:
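A minimal client sketch using only the Python standard library, assuming the server is running at http://localhost:8000 as in the Docker Compose setup. The helper names are hypothetical, and because the response bodies are not documented here, the decoded JSON is returned as-is.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "http://localhost:8000"  # default from the Docker Compose setup


def endpoint_url(kind: str, identifier: str, base_url: str = BASE_URL) -> str:
    """Build the URL for one of the path-parameter GET endpoints."""
    if kind not in ("provenance", "audit-trail", "lineage"):
        raise ValueError(f"unknown endpoint: {kind}")
    # Percent-encode spaces and other unsafe characters; "/" is left intact.
    # Whether nested paths are accepted depends on how the server declares
    # the route, which is an assumption here.
    return f"{base_url}/{kind}/{quote(identifier)}"


def get_json(url: str, timeout: float = 5.0):
    """Fetch a URL and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

With the services up, `get_json(endpoint_url("audit-trail", "dataset.csv"))` would request the audit history for dataset.csv, and `get_json(BASE_URL + "/health")` would call the health check.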
6. Frontend Dashboard¶
Interactive web application for provenance visualization and validation.
Features:
- Provenance metadata viewer with hierarchical display
- Audit trail timeline with CSV/JSON export
- Interactive D3.js lineage graph with force-directed layout
- Real-time DTA compliance validator
- Responsive design with Material theme
- Nginx-based production deployment
Access: http://localhost:3001 (when started with Docker Compose)
7. VS Code Extension¶
IDE integration for real-time validation and Git integration.
Features:
- Automatic validation on JSON file save
- Inline diagnostics for DTA compliance errors
- 5 code snippets for rapid development
- JSON Schema for IntelliSense support
- Git provenance history viewer
- Status bar with compliance scores
- Configurable validation strictness
Installation:
cd git-native/vscode-extension
npm install && npm run compile
vsce package
code --install-extension dta-provenance-validator-0.1.0.vsix
Testing & Quality Assurance¶
- Python Tests: 121 tests covering all modules (pytest)
- Blockchain Tests: 35 tests for smart contracts (Hardhat)
- Coverage: 67% overall, 98%+ for integration modules
- CI/CD: GitHub Actions for automated testing
- Linting: Ruff, Black, MyPy for Python; ESLint for TypeScript
- Documentation Tests: MkDocs strict mode validation
Architecture¶
Git-Native Architecture¶
┌─────────────────────────────────────────────────────────────────┐
│ Application Layer │
├─────────────────────────────────────────────────────────────────┤
│ CLI Tool │ API Server │ VS Code Extension │ Dashboard │
├─────────────────────────────────────────────────────────────────┤
│ Core Library Layer │
├─────────────────────────────────────────────────────────────────┤
│ ProvenanceTracker │ ProvenanceVerifier │ Visualizer │
├─────────────────────────────────────────────────────────────────┤
│ Integration Layer │
├─────────────────────────────────────────────────────────────────┤
│ DVC Bridge │ MLflow Bridge │ SBOM Bridge │
├─────────────────────────────────────────────────────────────────┤
│ Storage Layer │
├─────────────────────────────────────────────────────────────────┤
│ Git Repository │ Filesystem │ Remote Storage │
└─────────────────────────────────────────────────────────────────┘
Blockchain Architecture¶
┌─────────────────────────────────────────────────────────────────┐
│ Client Applications │
├─────────────────────────────────────────────────────────────────┤
│ Web3.js/Ethers.js │ Frontend DApps │ Scripts │
├─────────────────────────────────────────────────────────────────┤
│ Smart Contract Layer │
├─────────────────────────────────────────────────────────────────┤
│ ProvenanceRegistry.sol │
│ - registerProvenance() - verifyRecord() │
│ - validateMetadata() - updateMetadata() │
├─────────────────────────────────────────────────────────────────┤
│ Blockchain Networks │
├─────────────────────────────────────────────────────────────────┤
│ Ethereum │ Polygon │ Arbitrum │ Local Hardhat │
└─────────────────────────────────────────────────────────────────┘
Data Flow¶
Git-Native Flow:
1. User creates/modifies data files
2. DTA metadata prepared (JSON)
3. Files staged in Git
4. Provenance metadata embedded in commit message as trailers
5. SHA-256 hash computed for integrity
6. Commit signed (optional GPG/SSH)
7. Metadata retrievable from commit history
Blockchain Flow:
1. Data processed off-chain
2. Metadata hash computed (SHA-256)
3. Transaction submitted to smart contract
4. On-chain verification and storage
5. Event emitted for indexing
6. Metadata retrievable by record ID or provider
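The blockchain flow can be modelled locally to show what ends up on-chain. The class below is an in-memory stand-in written for illustration, not a client for ProvenanceRegistry.sol: the method signatures, the event name `ProvenanceRegistered`, and the sequential record IDs are all assumptions.

```python
import hashlib
import json


def metadata_hash(metadata: dict) -> str:
    """SHA-256 hex digest of a canonical JSON encoding of the metadata."""
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class InMemoryRegistry:
    """Local stand-in for the contract: stores hashes, not the metadata itself."""

    def __init__(self):
        self._records = {}  # record ID -> (provider, metadata hash)
        self.events = []    # stands in for emitted contract events

    def register_provenance(self, provider: str, metadata: dict) -> int:
        """Store the metadata digest and return the new record ID."""
        record_id = len(self._records) + 1
        self._records[record_id] = (provider, metadata_hash(metadata))
        # Hypothetical event name, recorded so an indexer could pick it up.
        self.events.append(("ProvenanceRegistered", record_id, provider))
        return record_id

    def verify_record(self, record_id: int, metadata: dict) -> bool:
        """True when the supplied metadata hashes to the stored digest."""
        record = self._records.get(record_id)
        return record is not None and record[1] == metadata_hash(metadata)
```

Only the digest is stored, so the full metadata stays off-chain and any later change to it makes `verify_record` return False.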
Examples¶
Healthcare Imaging Dataset¶
Tracking medical imaging data with HIPAA compliance requirements.
{
"source": {
"datasetName": "Pneumonia X-ray Detection Dataset",
"providerName": "University Hospital Medical Imaging",
"datasetVersion": "2.1.0",
"providerWebsite": "https://hospital.example.edu/imaging"
},
"provenance": {
"dataGenerationMethod": "Chest X-ray DICOM to PNG conversion with radiologist annotations",
"dateDataGenerated": "2023-11-15",
"dataType": "Image",
"dataFormat": "PNG with JSON annotations",
"qualityIndicators": "Board-certified radiologist review, inter-rater agreement 0.94"
},
"use": {
"intendedUse": "Training AI models for pneumonia detection in chest X-rays",
"legalRightsToUse": "IRB-approved research protocol #2023-456",
"sensitiveData": true,
"sensitiveDataCategories": ["Health Information (HIPAA)"],
"privacyMeasures": "De-identified per HIPAA Safe Harbor, face regions removed"
}
}
Full example: standards/examples/healthcare-imaging.json
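A quick structural check of a metadata document like the one above might look like the following. The list of expected fields is drawn from this example only and is an assumption; it is not the official DTA schema, and `find_missing_fields` is a hypothetical helper rather than part of the dta-provenance package.

```python
# Field names taken from the example above; this is not the official schema.
EXPECTED_FIELDS = {
    "source": ["datasetName", "providerName"],
    "provenance": ["dataGenerationMethod", "dateDataGenerated", "dataType"],
    "use": ["intendedUse", "legalRightsToUse", "sensitiveData"],
}


def find_missing_fields(metadata: dict) -> list:
    """Return dotted paths such as 'use.intendedUse' that are absent."""
    missing = []
    for section, fields in EXPECTED_FIELDS.items():
        block = metadata.get(section)
        if not isinstance(block, dict):
            # The whole section is missing or is not a JSON object.
            missing.append(section)
            continue
        missing.extend(f"{section}.{name}" for name in fields if name not in block)
    return missing
```

Loading the file with `json.load` and passing the result to `find_missing_fields` returns an empty list when every expected field is present.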
ML Training Pipeline¶
Tracking Hugging Face model training with experiment provenance.
# Track dataset with DVC
dta-provenance dvc-track train.csv --metadata train-provenance.json
# Log to MLflow during training
python train.py # Internally uses MLflowProvenanceBridge
# Generate SBOM for dependencies
dta-provenance sbom-generate --project-name sentiment-model
# Create provenance-tracked commit
dta-provenance commit model.bin \
--metadata enriched-provenance.json \
--message "Fine-tuned BERT for sentiment analysis"
Full example: standards/examples/ml-training-huggingface.json
IoT Sensor Stream¶
Real-time sensor data with temporal provenance tracking.
Full example: standards/examples/iot-sensor-stream.json
Financial Transactions¶
Trading data with compliance and audit requirements.
Full example: standards/examples/financial-transactions.json
Project Structure¶
dta-provenance-demo/
├── .github/
│ └── workflows/ # CI/CD pipelines (test, docs, lint)
├── blockchain/
│ ├── contracts/ # Solidity smart contracts
│ │ └── ProvenanceRegistry.sol
│ ├── scripts/ # Deployment and interaction scripts
│ ├── test/ # Hardhat test suite (35 tests)
│ ├── deployments/ # Network deployment records
│ ├── hardhat.config.js # Multi-network configuration
│ └── .env.example # Environment template
├── docs/
│ ├── tutorials/ # Step-by-step guides (10 tutorials)
│ │ ├── pre-commit-hooks.md
│ │ ├── dvc-integration.md
│ │ ├── mlflow-integration.md
│ │ ├── sbom-integration.md
│ │ ├── blockchain-networks.md
│ │ ├── api-server.md
│ │ ├── dashboard.md
│ │ └── vscode-extension.md
│ ├── examples/ # Real-world use cases
│ ├── ARCHITECTURE.md # System design documentation
│ ├── DTA_STANDARDS.md # Standards specification
│ └── COMPARISON.md # Git vs Blockchain analysis
├── git-native/
│ ├── src/
│ │ ├── provenance.py # Core tracking library
│ │ ├── verify.py # Validation and verification
│ │ ├── cli.py # Command-line interface (9 commands)
│ │ ├── api.py # FastAPI REST server
│ │ ├── hooks.py # Pre-commit integration
│ │ ├── visualize.py # Graph generation
│ │ └── integrations/ # External tool bridges
│ │ ├── dvc_integration.py
│ │ ├── mlflow_integration.py
│ │ └── sbom_integration.py
│ ├── tests/ # Comprehensive test suite (121 tests)
│ ├── dashboard/ # Vite + D3.js frontend
│ │ ├── src/
│ │ │ ├── components/ # UI components
│ │ │ │ ├── provenance-viewer.js
│ │ │ │ ├── audit-trail.js
│ │ │ │ ├── lineage-graph.js
│ │ │ │ └── validator.js
│ │ │ └── main.js
│ │ ├── Dockerfile # Multi-stage build
│ │ └── nginx.conf # Reverse proxy config
│ ├── vscode-extension/ # TypeScript IDE extension
│ │ ├── src/
│ │ │ ├── extension.ts # Main activation
│ │ │ └── validator.ts # Validation logic
│ │ ├── snippets/ # Code snippets (5 templates)
│ │ └── schemas/ # JSON Schema
│ ├── docker-compose.yml # Multi-service orchestration
│ ├── pyproject.toml # Python package configuration
│ └── requirements.txt # Core dependencies
├── standards/
│ ├── official/ # DTA v1.0.0 schema and docs
│ └── examples/ # Reference implementations
│ ├── healthcare-imaging.json
│ ├── ml-training-huggingface.json
│ ├── iot-sensor-stream.json
│ └── financial-transactions.json
├── .pre-commit-config.yaml # Automated validation hooks
├── mkdocs.yml # Documentation site configuration
├── docker-compose.yml # Root-level service orchestration
└── README.md # This file
Installation¶
System Requirements¶
- Python: 3.9 or higher
- Node.js: 18 or higher
- Git: 2.30 or higher
- Docker: 20.10 or higher (optional, for containerized deployment)
Python Package¶
cd git-native
# Core installation
pip install -e .
# With all integrations
pip install -e '.[all]'
# Specific integrations
pip install -e '.[dvc]' # DVC support
pip install -e '.[mlflow]' # MLflow support
pip install -e '.[sbom]' # SBOM generation
pip install -e '.[api]' # API server
Blockchain¶
VS Code Extension¶
cd git-native/vscode-extension
npm install
npm run compile
# Package and install
npm install -g @vscode/vsce
vsce package
code --install-extension dta-provenance-validator-0.1.0.vsix
Dashboard¶
Running Tests¶
Python Tests¶
cd git-native
# Run all tests
pytest
# Run with coverage
pytest --cov=src --cov-report=html
# Run specific test file
pytest tests/test_provenance.py
# Run specific test
pytest tests/test_provenance.py::test_metadata_hash_consistency
Test Summary:
- 121 total tests
  - Core provenance: 20 tests
  - Verification: 18 tests
  - CLI: 19 tests
  - API: 24 tests
  - DVC integration: 13 tests
  - MLflow integration: 16 tests
  - SBOM integration: 12 tests
  - Hooks: 14 tests
Blockchain Tests¶
cd blockchain
# Run all tests
npx hardhat test
# Run with gas reporting
REPORT_GAS=true npx hardhat test
# Run specific test
npx hardhat test --grep "should register a new provenance record"
Test Summary:
- 35 total tests
  - Deployment: 2 tests
  - Registration: 7 tests
  - Retrieval: 2 tests
  - Verification: 4 tests
  - Validation: 3 tests
  - Updates: 5 tests
  - Provider tracking: 3 tests
  - Gas optimization: 2 tests
Integration Tests¶
# API integration tests
cd git-native
pytest tests/test_api.py -v
# Full integration test
./scripts/integration-test.sh
Deployment¶
Docker Compose¶
Production-ready deployment with all services:
cd git-native
docker-compose up -d
# Services available:
# - API: http://localhost:8000
# - Dashboard: http://localhost:3001
# - Docs: http://localhost:8000/docs
# View logs
docker-compose logs -f
# Stop services
docker-compose down
Blockchain Networks¶
Testnets (Free)¶
cd blockchain
# Configure environment
cp .env.example .env
# Add PRIVATE_KEY, RPC URLs, and API keys
# Deploy to Polygon Amoy testnet
npm run deploy:polygon-amoy
# Deploy to Arbitrum Sepolia testnet
npm run deploy:arbitrum-sepolia
# Interact with deployed contract
npm run interact:polygon-amoy
Mainnets (Costs Gas)¶
# Deploy to Polygon mainnet
npm run deploy:polygon
# Deploy to Arbitrum One mainnet
npm run deploy:arbitrum
Cost Comparison:
- Ethereum Mainnet: ~$50-200 per transaction
- Polygon Mainnet: ~$0.01 per transaction
- Arbitrum One: ~$0.01 per transaction
- Testnets: Free (faucet tokens)
Production API¶
# Using systemd
sudo cp scripts/dta-api.service /etc/systemd/system/
sudo systemctl enable dta-api
sudo systemctl start dta-api
# Using Nginx reverse proxy
sudo cp scripts/nginx-api.conf /etc/nginx/sites-available/dta-api
sudo ln -s /etc/nginx/sites-available/dta-api /etc/nginx/sites-enabled/
sudo systemctl reload nginx
Documentation Site¶
Documentation¶
Comprehensive documentation is available at https://ricoledan.github.io/dta-provenance-demo
Key Documentation Pages¶
Getting Started:
- Quick Start Guide
- Installation Instructions
- Docker Setup
- Nix Setup
Technical Documentation:
- Architecture Overview
- DTA Standards Specification
- Git vs Blockchain Comparison
Tutorials:
- Basic Usage
- Pre-commit Hooks
- DVC Integration
- MLflow Integration
- SBOM Integration
- Blockchain Networks
- API Server
- Frontend Dashboard
- VS Code Extension
Examples:
- Healthcare Imaging
- ML Training
- IoT Sensors
- Financial Data
When to Use Git vs Blockchain¶
Use Git-Native When:¶
- You have a centralized authority or trust anchor
- Cost and performance are critical
- You need rich querying capabilities
- Your team is already familiar with Git
- Compliance requires audit trails but not decentralization
- Data privacy is a concern (on-chain data is public)
Advantages:
- Zero operational cost
- Instant operations (no mining/confirmation time)
- Flexible querying with Git tools
- Proven cryptographic security
- Works offline
- Private by default
Use Blockchain When:¶
- Multiple untrusting parties need to verify provenance
- No single entity can be trusted as authority
- Immutability must be guaranteed by consensus
- Public verifiability is required
- Smart contract automation is valuable
Advantages:
- Truly decentralized trust model
- Censorship resistant
- Public verifiability
- Programmable via smart contracts
- Network effect and interoperability
Cost Comparison¶
| Operation | Git-Native | Ethereum | Polygon | Arbitrum |
|---|---|---|---|---|
| Registration | Free | $50-200 | $0.01 | $0.01 |
| Verification | Free | $20-100 | $0.005 | $0.005 |
| Query | Free | $5-20 | $0.002 | $0.002 |
| Storage | Local disk | On-chain (expensive) | On-chain | On-chain |
Conclusion: For most ML and data science applications, Git-native provenance provides sufficient guarantees at zero cost. Blockchain adds value primarily when decentralization and public verifiability are strict requirements.
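To see how the per-operation figures scale, the sketch below multiplies the registration costs from the table by a number of registrations. The cost ranges are copied from the table as rough estimates; the function name and the figure of 1,000 registrations are illustrative assumptions.

```python
# Per-registration cost ranges in USD, copied from the table above.
REGISTRATION_COST_USD = {
    "Git-Native": (0.0, 0.0),
    "Ethereum": (50.0, 200.0),
    "Polygon": (0.01, 0.01),
    "Arbitrum": (0.01, 0.01),
}


def projected_cost(network: str, registrations: int) -> tuple:
    """Return the (low, high) total cost in USD for a number of registrations."""
    low, high = REGISTRATION_COST_USD[network]
    return (round(low * registrations, 2), round(high * registrations, 2))


# Print the projected range for 1,000 registrations on each option.
for network in REGISTRATION_COST_USD:
    low, high = projected_cost(network, 1000)
    print(f"{network:<10} ${low:,.2f} to ${high:,.2f}")
```

At these rates, 1,000 registrations come to about $10 on Polygon or Arbitrum and between $50,000 and $200,000 on Ethereum mainnet.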
Contributing¶
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
Development Setup¶
# Clone the repository
git clone https://github.com/Ricoledan/dta-provenance-demo.git
cd dta-provenance-demo
# Install development dependencies
cd git-native
pip install -e '.[dev]'
# Run linters
ruff check src/
black --check src/
mypy src/
# Run tests
pytest
# Build documentation
cd ..
mkdocs serve
Pull Request Process¶
- Fork the repository
- Create a feature branch (git checkout -b feature/your-feature)
- Make your changes with tests
- Ensure all tests pass (pytest and npx hardhat test)
- Update documentation as needed
- Commit your changes (git commit -m 'Add feature')
- Push to the branch (git push origin feature/your-feature)
- Open a Pull Request
Code Standards¶
- Python: Follow PEP 8, use type hints, maintain >80% test coverage
- TypeScript: Follow ESLint rules, use strict mode
- Solidity: Follow Solidity style guide, optimize for gas
- Documentation: Use clear, concise language with code examples
- Commit messages: Use conventional commits format
License¶
This project is licensed under the MIT License - see the LICENSE file for details.
Credits¶
Standards¶
- Data & Trust Alliance - DTA Provenance Standards v1.0.0
Technologies¶
- GitPython - Git repository interaction
- FastAPI - Modern Python web framework
- Hardhat - Ethereum development environment
- D3.js - Data visualization library
- MkDocs Material - Documentation theme
Contributors¶
See CONTRIBUTING.md for the list of contributors.
Support¶
- Documentation: https://ricoledan.github.io/dta-provenance-demo
- Issues: https://github.com/Ricoledan/dta-provenance-demo/issues
- Discussions: https://github.com/Ricoledan/dta-provenance-demo/discussions