VS Code Extension Tutorial¶
This guide shows you how to use the DTA Provenance Validator VS Code extension for validating and managing DTA v1.0.0 provenance metadata.
Installation¶
From Source (Development)¶
1. Navigate to the extension directory
2. Install dependencies
3. Compile TypeScript
4. Package the extension
5. Install in VS Code
From VS Code Marketplace (Future)¶
Once published:
1. Open VS Code
2. Press Cmd+Shift+X (Ctrl+Shift+X on Windows/Linux) to open Extensions
3. Search for "DTA Provenance Validator"
4. Click Install
Features¶
1. Automatic Validation¶
The extension automatically validates DTA provenance metadata files when you save them.
Supported file patterns:
- **/provenance*.json
- **/dta-metadata.json
- Any JSON file with "provenance" or "metadata" in the name
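As a rough illustration of these patterns, the matching rule can be approximated in a few lines of Python. This is a sketch of the behaviour described above, not the extension's actual activation logic; the function name `is_provenance_file` is hypothetical, and case-insensitive matching is an assumption.

```python
from pathlib import PurePosixPath


def is_provenance_file(path: str) -> bool:
    """Return True if the path looks like a DTA provenance metadata file.

    Approximates the documented patterns: a .json file whose name contains
    "provenance" or "metadata" (which also covers **/provenance*.json and
    **/dta-metadata.json).
    """
    name = PurePosixPath(path).name.lower()  # assumption: case-insensitive
    if not name.endswith(".json"):
        return False
    return "provenance" in name or "metadata" in name


print(is_provenance_file("data/provenance-v2.json"))   # True
print(is_provenance_file("config/dta-metadata.json"))  # True
print(is_provenance_file("package.json"))              # False
```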
What it validates:
- All required fields (source, provenance, use)
- Recommended optional fields
- Data type values
- Date formats (ISO 8601)
- Semantic rules (e.g., sensitiveDataCategories when sensitiveData=true)
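The date-format check, for instance, could look roughly like the Python sketch below. It is an illustration only: the helper name `is_iso8601` is hypothetical, and `datetime.fromisoformat` accepts a subset of ISO 8601 on older Python versions, so the extension's own check may be stricter or looser.

```python
from datetime import datetime


def is_iso8601(value) -> bool:
    """Return True if value parses as an ISO 8601 date or date-time."""
    if not isinstance(value, str):
        return False
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11,
    # so normalise it to an explicit UTC offset first.
    normalised = value[:-1] + "+00:00" if value.endswith("Z") else value
    try:
        datetime.fromisoformat(normalised)
    except ValueError:
        return False
    return True


print(is_iso8601("2024-01-15T00:00:00Z"))  # True
print(is_iso8601("15/01/2024"))            # False
```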
Example:
Create provenance.json:
{
  "source": {
    "datasetName": "Customer Data",
    "providerName": "Analytics Team"
  },
  "provenance": {
    "dataGenerationMethod": "SQL export",
    "dateDataGenerated": "2024-01-15T00:00:00Z",
    "dataType": "Tabular",
    "dataFormat": "CSV"
  },
  "use": {
    "intendedUse": "ML training",
    "legalRightsToUse": "Internal",
    "sensitiveData": false
  }
}
Save the file (Cmd+S) and see:
- ✅ Validation results in the output panel
- Status bar showing compliance score
- Inline diagnostics for any issues
2. Manual Validation¶
Validate any file on demand:
- Open a JSON file
- Press Cmd+Shift+P
- Type "DTA: Validate Provenance Metadata"
- Press Enter
Results appear in:
- Output panel (DTA Provenance channel)
- Status bar (compliance score)
- Inline diagnostics (red/yellow squiggles)
3. Code Snippets¶
Speed up metadata creation with snippets:
Complete Template¶
- Type dta-template
- Press Tab
- Fill in the placeholders
{
  "source": {
    "datasetName": "Dataset Name", // Tab to next field
    "datasetVersion": "1.0.0",
    "providerName": "Provider Name",
    ...
  },
  ...
}
Minimal Template¶
For quick setup with required fields only:
- Type dta-minimal
- Press Tab
Section Templates¶
Add individual sections:
- dta-source - Source metadata
- dta-provenance - Provenance metadata
- dta-use - Use metadata
4. JSON Schema Integration¶
The extension provides JSON schema support for:
IntelliSense:
- Field name auto-completion
- Value suggestions (e.g., dataType values)
- Hover documentation
Usage:
- Open a provenance JSON file
- Start typing a field name
- Press Ctrl+Space for suggestions
Example:
5. Git Provenance History¶
View DTA provenance metadata from Git commits:
- Open any file in a Git repository
- Press Cmd+Shift+P
- Type "DTA: Show Git Provenance"
- Press Enter
What you see:
- Commits containing DTA provenance metadata
- Complete metadata from each commit
- Full commit history for the file
- Author, date, and commit message
Example Git commit with provenance:
git commit -m "$(cat <<'EOF'
Add training dataset
DTA-Provenance-Version: 1.0.0
DTA-Provenance-Hash: abc123...
DTA-Dataset-Name: Customer Churn Data
DTA-Provenance-Metadata: {"source":{...},"provenance":{...},"use":{...}}
EOF
)"
The extension extracts and displays this metadata in a readable format.
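To make the extraction step concrete, here is a minimal Python sketch that pulls the `DTA-*` lines out of a commit message shaped like the one above. It is not the extension's implementation (the extension is written in TypeScript); the function name `parse_dta_trailers` and the shortened sample message are illustrative assumptions.

```python
import json


def parse_dta_trailers(message: str) -> dict:
    """Collect "DTA-*: value" lines from a commit message into a dict."""
    trailers = {}
    for line in message.splitlines():
        # Split on the first colon only, so JSON values stay intact.
        key, sep, value = line.partition(":")
        if sep and key.startswith("DTA-"):
            trailers[key.strip()] = value.strip()
    return trailers


# Hypothetical commit message, shortened from the example above.
message = """Add training dataset

DTA-Provenance-Version: 1.0.0
DTA-Dataset-Name: Customer Churn Data
DTA-Provenance-Metadata: {"source": {"datasetName": "Customer Churn Data"}}
"""

trailers = parse_dta_trailers(message)
metadata = json.loads(trailers["DTA-Provenance-Metadata"])
print(trailers["DTA-Provenance-Version"])  # 1.0.0
print(metadata["source"]["datasetName"])   # Customer Churn Data
```

The message text itself can be obtained with `git log -1 --format=%B <commit>`.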
Configuration¶
Settings¶
Open Settings with Cmd+, (Ctrl+, on Windows/Linux) and search for "DTA".
dta.validateOnSave (default: true)
- Automatically validate when saving files
- Disable for manual validation only
dta.showStatusBar (default: true)
- Show compliance score in status bar
- Disable to hide status bar item
dta.strictValidation (default: false)
- Treat warnings as errors
- Useful for CI/CD pipelines
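The effect of `dta.strictValidation` can be pictured with a small Python sketch. The function `has_failed` is hypothetical and only models the rule stated above (warnings count as errors in strict mode); it does not reflect the extension's internal code.

```python
def has_failed(errors: list, warnings: list, strict: bool = False) -> bool:
    """Decide whether a validation run counts as failed.

    strict mirrors dta.strictValidation: when True, warnings fail the
    run just like errors do.
    """
    if errors:
        return True
    return strict and bool(warnings)


warnings = ["privacyMeasures strongly recommended when sensitiveData is true"]
print(has_failed([], warnings, strict=False))  # False
print(has_failed([], warnings, strict=True))   # True
```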
Example Configuration¶
.vscode/settings.json:
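As a sketch, the three settings above can be written to `.vscode/settings.json` with their documented defaults from Python. The helper `write_settings` is hypothetical and not part of the extension or the CLI; note that it overwrites any existing settings file.

```python
import json
from pathlib import Path

# Documented defaults for the three settings listed above.
settings = {
    "dta.validateOnSave": True,
    "dta.showStatusBar": True,
    "dta.strictValidation": False,
}


def write_settings(workspace: Path) -> Path:
    """Write the settings to <workspace>/.vscode/settings.json.

    Overwrites the file if it already exists.
    """
    target = workspace / ".vscode" / "settings.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return target


# Show the JSON that would be written.
print(json.dumps(settings, indent=2))
```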
Validation Examples¶
Valid Metadata¶
{
  "source": {
    "datasetName": "Healthcare Imaging Dataset",
    "datasetVersion": "2.1.0",
    "providerName": "City Hospital Research Dept",
    "providerWebsite": "https://hospital.example.com/research"
  },
  "provenance": {
    "dataGenerationMethod": "CT scans from clinical imaging systems",
    "dateDataGenerated": "2023-06-01T00:00:00Z",
    "dataType": "Image",
    "dataFormat": "DICOM",
    "qualityIndicators": {
      "completeness": 0.98,
      "deIdentificationScore": 1.0
    }
  },
  "use": {
    "intendedUse": "Training ML models for disease detection",
    "legalRightsToUse": "Institutional approval with consent",
    "restrictions": ["No commercial use", "Research only"],
    "sensitiveData": true,
    "sensitiveDataCategories": ["Health/Medical"],
    "privacyMeasures": ["De-identification", "HIPAA compliance"],
    "retentionPolicy": "10 years per IRB requirements",
    "attribution": "City Hospital Research Department, 2023"
  }
}
Result: ✅ 100% compliance
Missing Required Fields¶
{
  "source": {
    "datasetName": "My Dataset"
    // Missing: providerName
  },
  "provenance": {
    "dataGenerationMethod": "Manual entry"
    // Missing: dateDataGenerated, dataType, dataFormat
  },
  "use": {
    "intendedUse": "Analysis"
    // Missing: legalRightsToUse, sensitiveData
  }
}
Result: ❌ Errors
- Missing required field: source.providerName
- Missing required field: provenance.dateDataGenerated
- Missing required field: provenance.dataType
- Missing required field: provenance.dataFormat
- Missing required field: use.legalRightsToUse
- Missing required field: use.sensitiveData
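A required-field check that produces messages of this shape can be sketched in Python as follows. The field list is inferred from the examples in this tutorial rather than taken from the DTA specification, and `missing_required` is a hypothetical helper, not the extension's code.

```python
# Required fields per section, inferred from the examples in this tutorial.
REQUIRED_FIELDS = {
    "source": ["datasetName", "providerName"],
    "provenance": ["dataGenerationMethod", "dateDataGenerated",
                   "dataType", "dataFormat"],
    "use": ["intendedUse", "legalRightsToUse", "sensitiveData"],
}


def missing_required(metadata: dict) -> list:
    """Return one error message per missing required section or field."""
    errors = []
    for section, fields in REQUIRED_FIELDS.items():
        block = metadata.get(section)
        if not isinstance(block, dict):
            errors.append(f"Missing required section: {section}")
            continue
        for field in fields:
            if field not in block:
                errors.append(f"Missing required field: {section}.{field}")
    return errors


incomplete = {
    "source": {"datasetName": "My Dataset"},
    "provenance": {"dataGenerationMethod": "Manual entry"},
    "use": {"intendedUse": "Analysis"},
}
for error in missing_required(incomplete):
    print(error)
```

Run against the incomplete metadata above, the sketch reports the same six fields listed in the result.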
Semantic Validation Error¶
{
  "source": {
    "datasetName": "Patient Records",
    "providerName": "Hospital"
  },
  "provenance": {
    "dataGenerationMethod": "EHR export",
    "dateDataGenerated": "2024-01-15T00:00:00Z",
    "dataType": "Tabular",
    "dataFormat": "CSV"
  },
  "use": {
    "intendedUse": "Research",
    "legalRightsToUse": "IRB approved",
    "sensitiveData": true
    // Missing: sensitiveDataCategories (required when sensitiveData=true)
  }
}
Result: ❌ Error
- sensitiveDataCategories is required when sensitiveData is true
⚠️ Warning:
- privacyMeasures strongly recommended when sensitiveData is true
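The conditional rule behind this error and warning can be expressed as a short Python sketch. `check_sensitive_data` is a hypothetical name, and treating an empty list the same as a missing field is an assumption rather than documented behaviour.

```python
def check_sensitive_data(use: dict) -> tuple:
    """Apply the sensitiveData rules to the "use" section.

    Returns (errors, warnings). Assumption: an empty list is treated
    the same as a missing field.
    """
    errors, warnings = [], []
    if use.get("sensitiveData") is True:
        if not use.get("sensitiveDataCategories"):
            errors.append(
                "sensitiveDataCategories is required when sensitiveData is true"
            )
        if not use.get("privacyMeasures"):
            warnings.append(
                "privacyMeasures strongly recommended when sensitiveData is true"
            )
    return errors, warnings


errors, warnings = check_sensitive_data(
    {"intendedUse": "Research", "legalRightsToUse": "IRB approved",
     "sensitiveData": True}
)
print(errors)
print(warnings)
```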
Integration with Python CLI¶
The VS Code extension complements the Python CLI tool:
Workflow Example¶
1. Create metadata in VS Code:
Use snippets to create provenance.json, then validate it with the extension.
2. Commit with Python CLI:
dta-provenance commit dataset.csv \
--metadata provenance.json \
--message "Add customer dataset v1.0"
3. View in VS Code:
Open dataset.csv, run "DTA: Show Git Provenance" to see metadata.
4. Verify integrity:
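Conceptually, an integrity check recomputes a hash of the metadata and compares it with the recorded `DTA-Provenance-Hash`. The Python sketch below assumes SHA-256 over canonical JSON (sorted keys, compact separators); this tutorial does not state which algorithm the DTA tooling uses, so treat the hashing scheme and the function names `metadata_hash` and `verify` as hypothetical.

```python
import hashlib
import json


def metadata_hash(metadata: dict) -> str:
    """Hash the metadata as canonical JSON (assumed scheme: SHA-256)."""
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(metadata: dict, recorded_hash: str) -> bool:
    """Return True if the metadata still matches the recorded hash."""
    return metadata_hash(metadata) == recorded_hash


original = {"source": {"datasetName": "Customer Data"}}
recorded = metadata_hash(original)

tampered = {"source": {"datasetName": "Customer Data (edited)"}}
print(verify(original, recorded))  # True
print(verify(tampered, recorded))  # False
```

Sorting keys before hashing means two files that differ only in key order produce the same hash.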
CI/CD Integration¶
Use strict validation in CI:
# .github/workflows/validate.yml
- name: Validate metadata
  run: |
    code --install-extension dta-provenance-validator-*.vsix
    # Run validation tests
Or use Python CLI:
Troubleshooting¶
Extension Not Activating¶
Check:
1. File is .json format
2. Filename contains "provenance" or "metadata"
3. Open Command Palette: "Developer: Show Running Extensions"
Fix:
- Reload VS Code: Cmd+Shift+P → "Developer: Reload Window"
Validation Not Working¶
Check:
1. Settings: dta.validateOnSave is enabled
2. No JSON syntax errors (must parse as valid JSON first)
3. Check Output panel: "DTA Provenance" channel
Fix:
- Run manual validation: Cmd+Shift+P → "DTA: Validate Provenance Metadata"
- Check for JSON parse errors first
Git Provenance Not Showing¶
Check:
1. File is in a Git repository
2. File has commit history
3. Git is installed and in PATH
Fix:
Advanced Usage¶
Custom File Patterns¶
Edit .vscode/settings.json:
Workspace Recommendations¶
Create .vscode/extensions.json:
Team members get prompted to install the extension.
Multi-root Workspaces¶
Configure per-folder:
{
  "folders": [
    {
      "path": "project1",
      "settings": {
        "dta.strictValidation": true
      }
    },
    {
      "path": "project2",
      "settings": {
        "dta.strictValidation": false
      }
    }
  ]
}
Next Steps¶
- Tutorial: Pre-commit Hooks - Automate validation
- Tutorial: DVC Integration - Large file tracking
- Tutorial: MLflow Integration - ML experiment tracking
- Reference: API Reference - Python library API