Frequently Asked Questions

General Questions

What is data provenance?

Data provenance is the documentation of data's origin, movement, and transformations throughout its lifecycle. It answers questions like:

- Where did this data come from?
- How was it collected or generated?
- What transformations were applied?
- Who has access to it?
- How should it be used?
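Pictured as data, a minimal provenance record might look like this (field names and values are illustrative, not a formal schema):

```python
# Illustrative provenance record; field names are hypothetical,
# not taken from any formal standard.
provenance_record = {
    "origin": "NOAA public weather archive",          # where did it come from?
    "collectionMethod": "automated station feeds",    # how was it collected?
    "transformations": ["deduplicated", "converted to SI units"],
    "accessList": ["ml-research-team"],               # who has access?
    "intendedUse": "model training, non-commercial",  # how should it be used?
}
print(sorted(provenance_record))
```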

Why does data provenance matter?

Data provenance is increasingly critical for:

  1. Regulatory Compliance
     - EU AI Act requires training data documentation
     - GDPR mandates data processing records
     - FDA requires provenance for AI/ML medical devices
  2. Trust & Transparency
     - Users want to know where AI training data comes from
     - Researchers need reproducible data pipelines
     - Auditors require complete audit trails
  3. Risk Management
     - Identify data biases early
     - Detect privacy issues before deployment
     - Track data quality issues to source

What are the Data & Trust Alliance (DTA) standards?

The DTA standards v1.0.0 define 22 fields organized into three categories:

- SOURCE (8 fields): Where data came from
- PROVENANCE (6 fields): How data was created
- USE (8 fields): How data should be used
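As a sketch, the three categories map onto a nested structure like this. Only the handful of field names mentioned elsewhere in this FAQ are shown; the full 22-field list is in the DTA Standards Guide:

```python
# Skeleton of the three DTA categories. Only fields that appear elsewhere
# in this FAQ are filled in; see the DTA Standards Guide for all 22 fields.
metadata = {
    "source": {                       # 8 fields: where data came from
        "datasetName": "customer-support-logs-2024",
    },
    "provenance": {                   # 6 fields: how data was created
        "dataGenerationMethod": "human-generated",
    },
    "use": {                          # 8 fields: how data should be used
        "intendedUse": "fine-tune a support chatbot",
    },
}
assert set(metadata) == {"source", "provenance", "use"}
```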

See DTA Standards Guide for full details.

Technical Questions

Do I need blockchain for data provenance?

Almost certainly not. The Git-native approach works for 95% of use cases.

Use Git-native when:

- ✅ Single organization controls the data
- ✅ Internal ML pipelines
- ✅ Privacy-sensitive data
- ✅ High-frequency updates
- ✅ Cost matters

Only use blockchain when ALL of these apply:

- ⚠️ Multiple untrusting parties
- ⚠️ No central authority can be trusted
- ⚠️ Automated settlement needed (smart contracts)
- ⚠️ Public transparency valuable
- ⚠️ Update frequency low enough
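The "ALL of these apply" rule can be sketched as a tiny decision helper (names are illustrative, not part of the project's API):

```python
def recommend_backend(untrusting_parties: bool,
                      no_trusted_authority: bool,
                      needs_smart_contracts: bool,
                      public_transparency: bool,
                      low_update_frequency: bool) -> str:
    """Illustrative helper: choose blockchain only if *every* criterion
    holds; otherwise default to the Git-native approach."""
    criteria = [untrusting_parties, no_trusted_authority,
                needs_smart_contracts, public_transparency,
                low_update_frequency]
    return "blockchain" if all(criteria) else "git-native"

print(recommend_backend(True, True, True, True, False))  # one criterion fails
```

A single failed criterion is enough to fall back to Git-native, which mirrors the advice above.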

See detailed comparison.

Is this production-ready?

Git-native implementation: Yes, production-quality with:

- ✅ Comprehensive test coverage
- ✅ CLI tools
- ✅ Error handling
- ✅ Documentation

Blockchain implementation: Educational/demo quality. Would need:

- ⚠️ Access control improvements
- ⚠️ Gas optimization
- ⚠️ Security audit
- ⚠️ Deployment infrastructure

Can I use this with my existing ML pipeline?

Yes! The Python library integrates easily with:

- MLflow
- DVC (Data Version Control)
- Jupyter notebooks
- Standard Python ML tools

See the ML Training example.

What about IPFS/Arweave for decentralized storage?

These are great for immutable file storage and complement either approach:

- Store full datasets on IPFS/Arweave
- Store provenance metadata hash on-chain (blockchain) or in Git
- Reference IPFS/Arweave URIs in metadata

The blockchain implementation shows this pattern with metadataURI fields.
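A sketch of that hybrid pattern, where the URI and hash values are placeholders: bulk data lives on IPFS/Arweave, while Git (or a contract's `metadataURI` field) carries only a small record plus a digest anyone can verify against the off-chain copy.

```python
import hashlib
import json

# Hybrid storage sketch. Placeholder values stand in for a real IPFS CID
# and a real dataset hash.
metadata = {
    "datasetURI": "ipfs://<CID-of-dataset>",       # placeholder CID
    "datasetSha256": "<sha256-of-dataset-bytes>",  # verifies the off-chain copy
}

# Deterministic digest of the metadata itself, suitable for committing to
# Git or anchoring on-chain:
anchor = hashlib.sha256(
    json.dumps(metadata, sort_keys=True).encode()
).hexdigest()
print(anchor)
```

Because the JSON is serialized with sorted keys, the digest is reproducible: anyone holding the same metadata can recompute and compare it.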

How does this compare to DVC (Data Version Control)?

DVC:

- Focuses on versioning large files efficiently
- Handles data storage and retrieval
- Integrates with cloud storage (S3, GCS, Azure)

This project:

- Focuses on metadata and provenance tracking
- Implements DTA standards compliance
- Provides cryptographic integrity verification

They're complementary, so you can use both together:

- DVC for efficient file versioning
- DTA provenance for metadata tracking

Compliance Questions

Does this meet GDPR requirements?

The DTA standards align with GDPR requirements, specifically:

- Article 5(1)(a): Lawfulness, fairness, and transparency
- Article 5(2): Accountability
- Article 30: Records of processing activities

This project helps you document:

- Legal basis for processing (legalRightsToUse)
- Privacy measures (privacyMeasures)
- Data categories (sensitiveDataCategories)
- Processing locations (dataProcessingLocation)
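For instance, those four fields might be filled in like this for a single dataset (values are illustrative, not legal advice):

```python
# Illustrative values for the GDPR-relevant fields named above.
gdpr_fields = {
    "legalRightsToUse": "consent under GDPR Art. 6(1)(a)",
    "privacyMeasures": ["pseudonymization", "field-level encryption"],
    "sensitiveDataCategories": [],           # none remain after filtering
    "dataProcessingLocation": "EU (Frankfurt)",
}
assert "legalRightsToUse" in gdpr_fields
```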

However, this is a documentation tool, not a complete GDPR compliance solution. Consult legal counsel for full compliance.

What about the EU AI Act?

The EU AI Act requires training data provenance for high-risk AI systems. The DTA standards cover:

- Data sources and origins
- Collection methodology
- Quality indicators
- Bias assessment
- Privacy measures

See DTA Standards Guide for regulatory mapping.

Can this be used for FDA submissions?

The provenance tracking provides documentation useful for FDA AI/ML submissions:

- Data collection procedures
- Quality controls
- Version tracking
- Audit trails

However, FDA compliance requires additional documentation beyond provenance. Use this as part of a broader quality management system.

Usage Questions

How do I get started?

  1. Install the tools (Nix, Docker, or manual)
  2. Follow the Quick Start Guide
  3. Try the Interactive Jupyter Notebook
  4. Explore real-world examples

Can I modify the DTA standards?

The DTA standards are extensible - you can add custom fields while maintaining core compliance:

metadata = ProvenanceMetadata(
    source={"datasetName": "...", ...},  # Required DTA fields
    provenance={"dataGenerationMethod": "...", ...},
    use={"intendedUse": "...", ...},
    # Add custom fields
    custom={
        "internalTrackingId": "12345",
        "departmentCode": "ML-RESEARCH",
        "customQualityMetrics": {...}
    }
)

How do I validate my metadata?

# CLI
dta-provenance validate my-metadata.json

# Python
from src.verify import validate_provenance_file
report = validate_provenance_file("my-metadata.json")
print(report)

Can I use this in a CI/CD pipeline?

Yes! The GitHub Actions workflow shows integration:

- name: Validate provenance
  run: |
    dta-provenance validate metadata.json
    if [ $? -ne 0 ]; then
      echo "Provenance validation failed"
      exit 1
    fi

See GitHub Actions workflow.

Community Questions

How can I contribute?

See the Contributing Guide for:

- Code contributions
- Documentation improvements
- Bug reports
- Feature requests
- Examples and use cases

Who maintains this project?

This is an open-source educational project demonstrating DTA standards implementation. Contributions welcome!

Where can I get help?

Blockchain-Specific Questions

What blockchain networks are supported?

The demo supports:

- Hardhat local network (for testing)
- Ethereum mainnet/testnets
- Polygon (with configuration)
- Any EVM-compatible network

See blockchain/hardhat.config.js for network configuration.

What are the gas costs?

From our tests:

- Registration: ~228,000 gas (~$5-20 depending on network)
- Verification: ~29,000 gas (~$0.50-5)
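To see where those dollar figures come from, here is the back-of-envelope arithmetic under assumed prices (30 gwei gas price and $3,000/ETH; both vary widely, which is why the FAQ gives a range):

```python
# Back-of-envelope cost of one on-chain registration, with assumed prices.
GAS_REGISTRATION = 228_000   # gas used, from the project's tests
GAS_PRICE_GWEI = 30          # assumed network gas price
ETH_PRICE_USD = 3_000        # assumed ETH price

cost_eth = GAS_REGISTRATION * GAS_PRICE_GWEI * 1e-9  # gwei -> ETH
cost_usd = cost_eth * ETH_PRICE_USD
print(f"{cost_eth:.5f} ETH ~= ${cost_usd:.2f}")
```

At lower gas prices (or on an L2 like Polygon) the same registration lands near the bottom of the $5-20 range.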

This is why Git-native is recommended for most use cases: no gas fees!

Can I use a private blockchain?

Yes! Deploy to:

- Hyperledger Besu
- Quorum
- Polygon Supernets
- Other private EVM networks

Private blockchains offer lower costs but lose the public transparency benefit.

Still Have Questions?