# Content-Addressable Storage (CAS) with Symlink Farm Architecture

## Table of Contents

- [Overview](#overview)
- [Architecture Design](#architecture-design)
- [Database Models](#database-models)
- [Storage Backends](#storage-backends)
- [Symlink Farm Views](#symlink-farm-views)
- [Automatic Synchronization](#automatic-synchronization)
- [Migration Strategy](#migration-strategy)
- [Verification and Repair](#verification-and-repair)
- [Configuration](#configuration)
- [Workflow Examples](#workflow-examples)
- [Benefits](#benefits)
- [Implementation Checklist](#implementation-checklist)
- [Future Enhancements](#future-enhancements)

## Overview

### Problem Statement

ArchiveBox currently stores files in a timestamp-based structure:
```
/data/archive/{timestamp}/{extractor}/filename.ext
```

This leads to:
- **Massive duplication**: `jquery.min.js` stored 1000x across different snapshots
- **No S3 support**: direct coupling to the local filesystem
- **Inflexible organization**: hard to browse by domain, date, or user

### Solution: Content-Addressable Storage + Symlink Farm

**Core Concept:**
1. **Store files once** in content-addressable storage (CAS), keyed by hash
2. **Create symlink farms** in multiple human-readable views
3. **Database as source of truth**, with automatic sync
4. **Support S3 and local storage** via django-storages

**Storage Layout:**
```
/data/
├── cas/                          # Content-addressable storage (deduplicated)
│   └── sha256/
│       └── ab/
│           └── cd/
│               └── abcdef123...  # Actual file (stored once)
│
├── archive/                      # Human-browseable views (all symlinks)
│   ├── by_domain/
│   │   └── example.com/
│   │       └── 20241225/
│   │           └── 019b54ee-28d9-72dc/
│   │               ├── wget/
│   │               │   └── index.html -> ../../../../../../cas/sha256/ab/cd/abcdef...
│   │               └── singlefile/
│   │                   └── page.html -> ../../../../../../cas/sha256/ef/12/ef1234...
│   │
│   ├── by_date/
│   │   └── 20241225/
│   │       └── example.com/
│   │           └── 019b54ee-28d9-72dc/
│   │               └── wget/
│   │                   └── index.html -> ../../../../../../cas/sha256/ab/cd/abcdef...
│   │
│   ├── by_user/
│   │   └── squash/
│   │       └── 20241225/
│   │           └── example.com/
│   │               └── 019b54ee-28d9-72dc/
│   │
│   └── by_timestamp/             # Legacy compatibility
│       └── 1735142400.123/
│           └── wget/
│               └── index.html -> ../../../../cas/sha256/ab/cd/abcdef...
```
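
The two-level fan-out (`ab/cd/`, taken from the first four hex digits of the hash) keeps any single directory from accumulating millions of entries. As a minimal sketch of the mapping (a standalone illustration; `cas_path_for` is a hypothetical helper, not part of the models below):

```python
import hashlib
from pathlib import Path

CAS_ROOT = Path('/data/cas')  # assumption: matches the layout above

def cas_path_for(file_path: Path, algorithm: str = 'sha256') -> Path:
    """Hash a file's content and derive its content-addressed storage path."""
    hasher = hashlib.new(algorithm)
    with open(file_path, 'rb') as f:
        for chunk in iter(lambda: f.read(65536), b''):
            hasher.update(chunk)
    digest = hasher.hexdigest()
    # e.g. sha256/ab/cd/abcdef123... -- identical content always maps here
    return CAS_ROOT / algorithm / digest[:2] / digest[2:4] / digest
```

Two identical files anywhere in the archive resolve to the same path, which is what makes deduplication automatic.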

## Architecture Design

### Core Principles

1. **Database = Source of Truth**: The `SnapshotFile` model is authoritative
2. **Symlinks = Materialized Views**: Auto-generated from DB, disposable
3. **Atomic Updates**: Symlinks created/deleted with DB transactions
4. **Idempotent**: Operations can be safely retried
5. **Self-Healing**: Automatic detection and repair of drift
6. **Content-Addressable**: Files deduplicated by SHA-256 hash
7. **Storage Agnostic**: Works with local filesystem, S3, Azure, etc.

### Space Overhead Analysis

Symlinks are incredibly cheap:
```
Typical symlink size:
- ext4/XFS: ~60-100 bytes
- ZFS: ~120 bytes
- btrfs: ~80 bytes

Example calculation:
100,000 files × 4 views = 400,000 symlinks
400,000 symlinks × 100 bytes = 40 MB

Space saved by deduplication:
- Average 30% duplicate content across archives
- 100GB archive → saves ~30GB
- Symlink overhead: 0.04GB (0.13% of savings!)

Verdict: Symlinks are FREE compared to deduplication savings
```

## Database Models

### Blob Model

```python
# archivebox/core/models.py

from pathlib import Path

from django.db import models


class Blob(models.Model):
    """
    Immutable content-addressed blob.
    Stored as: /cas/{hash_algorithm}/{ab}/{cd}/{full_hash}
    """

    # Content identification
    hash_algorithm = models.CharField(max_length=16, default='sha256', db_index=True)
    hash = models.CharField(max_length=128, db_index=True)
    size = models.BigIntegerField()

    # Storage location
    storage_backend = models.CharField(
        max_length=32,
        default='local',
        choices=[
            ('local', 'Local Filesystem'),
            ('s3', 'S3'),
            ('azure', 'Azure Blob Storage'),
            ('gcs', 'Google Cloud Storage'),
        ],
        db_index=True,
    )

    # Metadata
    mime_type = models.CharField(max_length=255, blank=True)
    created_at = models.DateTimeField(auto_now_add=True, db_index=True)

    # Reference counting (for garbage collection)
    ref_count = models.IntegerField(default=0, db_index=True)

    class Meta:
        unique_together = [('hash_algorithm', 'hash', 'storage_backend')]
        indexes = [
            models.Index(fields=['hash_algorithm', 'hash']),
            models.Index(fields=['ref_count']),
            models.Index(fields=['storage_backend', 'created_at']),
        ]
        constraints = [
            # Ensure ref_count never goes negative
            models.CheckConstraint(
                check=models.Q(ref_count__gte=0),
                name='blob_ref_count_non_negative'
            ),
        ]

    def __str__(self):
        return f"Blob({self.hash[:16]}..., refs={self.ref_count})"

    @property
    def storage_path(self) -> str:
        """Content-addressed path: sha256/ab/cd/abcdef123..."""
        h = self.hash
        return f"{self.hash_algorithm}/{h[:2]}/{h[2:4]}/{h}"

    def get_file_url(self):
        """Get a URL to access this blob via the configured storage backend"""
        from django.core.files.storage import default_storage
        return default_storage.url(self.storage_path)


class SnapshotFile(models.Model):
    """
    Links a Snapshot to its files (many-to-many through Blob).
    Preserves original path information for backwards compatibility.
    """

    snapshot = models.ForeignKey(
        Snapshot,
        on_delete=models.CASCADE,
        related_name='files'
    )
    blob = models.ForeignKey(
        Blob,
        on_delete=models.PROTECT  # PROTECT: a blob can't be deleted while still referenced
    )

    # Original path information
    extractor = models.CharField(max_length=32)       # 'wget', 'singlefile', etc.
    relative_path = models.CharField(max_length=512)  # 'output.html', 'warc/example.warc.gz'

    # Metadata
    created_at = models.DateTimeField(auto_now_add=True, db_index=True)

    class Meta:
        unique_together = [('snapshot', 'extractor', 'relative_path')]
        indexes = [
            models.Index(fields=['snapshot', 'extractor']),
            models.Index(fields=['blob']),
            models.Index(fields=['created_at']),
        ]

    def __str__(self):
        return f"{self.snapshot.id}/{self.extractor}/{self.relative_path}"

    @property
    def logical_path(self) -> Path:
        """Virtual path as it would appear in the old structure"""
        return Path(self.snapshot.output_dir) / self.extractor / self.relative_path

    def save(self, *args, **kwargs):
        """Override save to ensure paths are normalized"""
        # Normalize path (no leading slash, use forward slashes)
        self.relative_path = self.relative_path.lstrip('/').replace('\\', '/')
        super().save(*args, **kwargs)
```
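
Because `Blob` carries both `size` and `ref_count`, deduplication savings can be estimated straight from the database with no filesystem scan. A sketch using standard Django aggregation (`dedup_stats` is a hypothetical helper, not part of the models above):

```python
from django.db.models import BigIntegerField, F, Sum

from core.models import Blob

def dedup_stats() -> dict:
    """Estimate bytes saved by deduplication using Blob metadata alone."""
    stats = Blob.objects.aggregate(
        # Bytes actually stored on disk (each blob exists exactly once)
        stored_bytes=Sum('size'),
        # Bytes as seen through all references (what non-CAS storage would use)
        logical_bytes=Sum(F('size') * F('ref_count'), output_field=BigIntegerField()),
    )
    stored = stats['stored_bytes'] or 0
    logical = stats['logical_bytes'] or 0
    return {'stored': stored, 'logical': logical, 'saved': logical - stored}
```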

### Updated Snapshot Model

```python
class Snapshot(ModelWithOutputDir, ...):
    # ... existing fields ...

    @property
    def output_dir(self) -> Path:
        """
        Returns the primary view directory for browsing.
        Falls back to the legacy location if needed.
        """
        # Try the by_timestamp view first (best compatibility)
        by_timestamp = CONSTANTS.ARCHIVE_DIR / 'by_timestamp' / self.timestamp
        if by_timestamp.exists():
            return by_timestamp

        # Fall back to the legacy location (pre-CAS archives)
        legacy = CONSTANTS.ARCHIVE_DIR / self.timestamp
        if legacy.exists():
            return legacy

        # Default to by_timestamp for new snapshots
        return by_timestamp

    def get_output_dir(self, view: str = 'by_timestamp') -> Path:
        """Get the output directory for a specific view"""
        from storage.views import ViewManager
        from urllib.parse import urlparse

        if view not in ViewManager.VIEWS:
            raise ValueError(f"Unknown view: {view}")

        if view == 'by_domain':
            domain = urlparse(self.url).netloc or 'unknown'
            date = self.created_at.strftime('%Y%m%d')
            return CONSTANTS.ARCHIVE_DIR / 'by_domain' / domain / date / str(self.id)

        elif view == 'by_date':
            domain = urlparse(self.url).netloc or 'unknown'
            date = self.created_at.strftime('%Y%m%d')
            return CONSTANTS.ARCHIVE_DIR / 'by_date' / date / domain / str(self.id)

        elif view == 'by_user':
            domain = urlparse(self.url).netloc or 'unknown'
            date = self.created_at.strftime('%Y%m%d')
            user = self.created_by.username
            return CONSTANTS.ARCHIVE_DIR / 'by_user' / user / date / domain / str(self.id)

        elif view == 'by_timestamp':
            return CONSTANTS.ARCHIVE_DIR / 'by_timestamp' / self.timestamp

        # Fallback (not reached for the views registered above)
        return self.output_dir
```

### Updated ArchiveResult Model

```python
class ArchiveResult(models.Model):
    # ... existing fields ...

    # Note: the output_dir field is removed (it was deprecated)
    # Keep: output (relative path to the primary output file)

    @property
    def output_files(self):
        """Get all files for this extractor"""
        return self.snapshot.files.filter(extractor=self.extractor)

    @property
    def primary_output_file(self):
        """Get the primary output file (e.g., 'output.html')"""
        if self.output:
            return self.snapshot.files.filter(
                extractor=self.extractor,
                relative_path=self.output
            ).first()
        return None
```

## Storage Backends

### Django Storage Configuration

Define exactly one `STORAGES` dict; the two below are alternatives (a second assignment would simply override the first):

```python
# settings.py or archivebox/config/settings.py

# Option A: local development/testing
STORAGES = {
    "default": {
        "BACKEND": "django.core.files.storage.FileSystemStorage",
        "OPTIONS": {
            "location": "/data/cas",
            "base_url": "/cas/",
        },
    },
    "staticfiles": {
        "BACKEND": "django.contrib.staticfiles.storage.StaticFilesStorage",
    },
}

# Option B: production with S3 (use instead of Option A)
STORAGES = {
    "default": {
        "BACKEND": "storages.backends.s3.S3Storage",
        "OPTIONS": {
            "bucket_name": "archivebox-blobs",
            "region_name": "us-east-1",
            "default_acl": "private",
            "object_parameters": {
                "StorageClass": "INTELLIGENT_TIERING",  # Auto-optimize storage costs
            },
        },
    },
}
```
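
Since `Blob.storage_path` is a backend-relative name, blob reads and writes can go through Django's storage API and stay backend-agnostic. A minimal sketch (assuming the `default` storage above is rooted at the CAS directory; `store_blob` is a hypothetical helper, not the `BlobManager` below):

```python
from django.core.files.base import ContentFile
from django.core.files.storage import default_storage

def store_blob(content: bytes, storage_path: str) -> str:
    """Write blob content once, then return a URL, on whichever backend is configured."""
    if not default_storage.exists(storage_path):
        # save() streams to local disk or S3 depending on STORAGES["default"]
        default_storage.save(storage_path, ContentFile(content))
    return default_storage.url(storage_path)
```

Note that the local-filesystem `BlobManager` below deliberately bypasses this API in favor of hardlinks; routing ingestion through `default_storage` is what an S3-backed deployment would do instead.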

### Blob Manager

```python
# archivebox/storage/ingest.py

import hashlib
import mimetypes
import os
import shutil
from pathlib import Path

from django.db import transaction
from django.db.models import F

from core.models import Blob, SnapshotFile


class BlobManager:
    """Manages content-addressed blob storage with deduplication"""

    @staticmethod
    def hash_file(file_path: Path, algorithm='sha256') -> str:
        """Calculate the content hash of a file"""
        hasher = hashlib.new(algorithm)
        with open(file_path, 'rb') as f:
            for chunk in iter(lambda: f.read(65536), b''):
                hasher.update(chunk)
        return hasher.hexdigest()

    @staticmethod
    def ingest_file(
        file_path: Path,
        snapshot,
        extractor: str,
        relative_path: str,
        mime_type: str = '',
        create_views: bool = True,
    ) -> SnapshotFile:
        """
        Ingest a file into blob storage with deduplication.

        Args:
            file_path: Path to the file to ingest
            snapshot: Snapshot this file belongs to
            extractor: Extractor name (wget, singlefile, etc.)
            relative_path: Relative path within the extractor dir
            mime_type: MIME type of the file
            create_views: Whether to create symlink views

        Returns:
            SnapshotFile reference
        """
        from storage.views import ViewManager

        # Calculate hash
        file_hash = BlobManager.hash_file(file_path)
        file_size = file_path.stat().st_size

        with transaction.atomic():
            # Check if the blob already exists (deduplication!)
            blob, created = Blob.objects.get_or_create(
                hash_algorithm='sha256',
                hash=file_hash,
                storage_backend='local',
                defaults={
                    'size': file_size,
                    'mime_type': mime_type,
                }
            )

            if created:
                # New blob - store in CAS
                cas_path = ViewManager.get_cas_path(blob)
                cas_path.parent.mkdir(parents=True, exist_ok=True)

                # Use a hardlink if possible (instant), copy if not
                try:
                    os.link(file_path, cas_path)
                except OSError:
                    shutil.copy2(file_path, cas_path)

                print(f"✓ Stored new blob: {file_hash[:16]}... ({file_size:,} bytes)")
            else:
                print(f"✓ Deduplicated: {file_hash[:16]}... (saved {file_size:,} bytes)")

            # Create the snapshot file reference
            snapshot_file, ref_created = SnapshotFile.objects.get_or_create(
                snapshot=snapshot,
                extractor=extractor,
                relative_path=relative_path,
                defaults={'blob': blob}
            )

            # Increment the reference count atomically, and only for new
            # references, so retried ingests stay idempotent
            if ref_created:
                Blob.objects.filter(pk=blob.pk).update(ref_count=F('ref_count') + 1)

            # Create symlink views (the post_save signal also does this, but we can force it here)
            if create_views:
                views = ViewManager.create_symlinks(snapshot_file)
                print(f"  Created {len(views)} view symlinks")

            return snapshot_file

    @staticmethod
    def ingest_directory(
        dir_path: Path,
        snapshot,
        extractor: str
    ) -> list[SnapshotFile]:
        """Ingest all files from a directory"""
        snapshot_files = []

        for file_path in dir_path.rglob('*'):
            if file_path.is_file():
                relative_path = str(file_path.relative_to(dir_path))
                mime_type, _ = mimetypes.guess_type(str(file_path))

                snapshot_file = BlobManager.ingest_file(
                    file_path,
                    snapshot,
                    extractor,
                    relative_path,
                    mime_type or ''
                )
                snapshot_files.append(snapshot_file)

        return snapshot_files
```

## Symlink Farm Views

### View Classes

```python
# archivebox/storage/views.py

import logging
import os
from pathlib import Path
from typing import Protocol
from urllib.parse import urlparse

from config import CONSTANTS  # assumed import path; provides DATA_DIR / ARCHIVE_DIR
from core.models import Blob, SnapshotFile

logger = logging.getLogger(__name__)


class SnapshotView(Protocol):
    """Protocol for generating browseable views of snapshots"""

    def get_view_path(self, snapshot_file: SnapshotFile) -> Path:
        """Get the human-readable path for this file in this view"""
        ...


class ByDomainView:
    """View: /archive/by_domain/{domain}/{YYYYMMDD}/{snapshot_id}/{extractor}/{filename}"""

    def get_view_path(self, snapshot_file: SnapshotFile) -> Path:
        snapshot = snapshot_file.snapshot
        domain = urlparse(snapshot.url).netloc or 'unknown'
        date = snapshot.created_at.strftime('%Y%m%d')

        return (
            CONSTANTS.ARCHIVE_DIR / 'by_domain' / domain / date /
            str(snapshot.id) / snapshot_file.extractor / snapshot_file.relative_path
        )


class ByDateView:
    """View: /archive/by_date/{YYYYMMDD}/{domain}/{snapshot_id}/{extractor}/{filename}"""

    def get_view_path(self, snapshot_file: SnapshotFile) -> Path:
        snapshot = snapshot_file.snapshot
        domain = urlparse(snapshot.url).netloc or 'unknown'
        date = snapshot.created_at.strftime('%Y%m%d')

        return (
            CONSTANTS.ARCHIVE_DIR / 'by_date' / date / domain /
            str(snapshot.id) / snapshot_file.extractor / snapshot_file.relative_path
        )


class ByUserView:
    """View: /archive/by_user/{username}/{YYYYMMDD}/{domain}/{snapshot_id}/{extractor}/{filename}"""

    def get_view_path(self, snapshot_file: SnapshotFile) -> Path:
        snapshot = snapshot_file.snapshot
        user = snapshot.created_by.username
        domain = urlparse(snapshot.url).netloc or 'unknown'
        date = snapshot.created_at.strftime('%Y%m%d')

        return (
            CONSTANTS.ARCHIVE_DIR / 'by_user' / user / date / domain /
            str(snapshot.id) / snapshot_file.extractor / snapshot_file.relative_path
        )


class LegacyTimestampView:
    """View: /archive/by_timestamp/{timestamp}/{extractor}/{filename}"""

    def get_view_path(self, snapshot_file: SnapshotFile) -> Path:
        snapshot = snapshot_file.snapshot

        return (
            CONSTANTS.ARCHIVE_DIR / 'by_timestamp' / snapshot.timestamp /
            snapshot_file.extractor / snapshot_file.relative_path
        )


class ViewManager:
    """Manages symlink farm views"""

    VIEWS = {
        'by_domain': ByDomainView(),
        'by_date': ByDateView(),
        'by_user': ByUserView(),
        'by_timestamp': LegacyTimestampView(),
    }

    @staticmethod
    def get_cas_path(blob: Blob) -> Path:
        """Get the CAS storage path for a blob"""
        h = blob.hash
        return (
            CONSTANTS.DATA_DIR / 'cas' / blob.hash_algorithm /
            h[:2] / h[2:4] / h
        )

    @staticmethod
    def create_symlinks(snapshot_file: SnapshotFile, views: list[str] | None = None) -> dict[str, Path]:
        """
        Create symlinks for all views of a file.
        If any operation fails, all are rolled back.
        """
        from config.common import STORAGE_CONFIG

        if views is None:
            views = STORAGE_CONFIG.ENABLED_VIEWS

        cas_path = ViewManager.get_cas_path(snapshot_file.blob)

        # Verify the CAS file exists before creating symlinks
        if not cas_path.exists():
            raise FileNotFoundError(f"CAS file missing: {cas_path}")

        created = {}
        cleanup_on_error = []

        try:
            for view_name in views:
                if view_name not in ViewManager.VIEWS:
                    continue

                view = ViewManager.VIEWS[view_name]
                view_path = view.get_view_path(snapshot_file)

                # Create the parent directory
                view_path.parent.mkdir(parents=True, exist_ok=True)

                # Create a relative symlink (more portable)
                rel_target = os.path.relpath(cas_path, view_path.parent)

                # Remove any existing symlink/file
                if view_path.exists() or view_path.is_symlink():
                    view_path.unlink()

                # Create the symlink
                view_path.symlink_to(rel_target)
                created[view_name] = view_path
                cleanup_on_error.append(view_path)

            return created

        except Exception as e:
            # Rollback: remove partially created symlinks
            for path in cleanup_on_error:
                try:
                    if path.exists() or path.is_symlink():
                        path.unlink()
                except Exception as cleanup_error:
                    logger.error(f"Failed to cleanup {path}: {cleanup_error}")

            raise RuntimeError(f"Failed to create symlinks: {e}") from e

    @staticmethod
    def create_symlinks_idempotent(snapshot_file: SnapshotFile, views: list[str] | None = None):
        """
        Idempotent version - safe to call multiple times.
        Returns a dict of created symlinks, or an empty dict if already correct.
        """
        from config.common import STORAGE_CONFIG

        if views is None:
            views = STORAGE_CONFIG.ENABLED_VIEWS

        cas_path = ViewManager.get_cas_path(snapshot_file.blob)
        needs_update = False

        # Check that all symlinks exist and point at the correct target
        for view_name in views:
            if view_name not in ViewManager.VIEWS:
                continue

            view = ViewManager.VIEWS[view_name]
            view_path = view.get_view_path(snapshot_file)

            if not view_path.is_symlink():
                needs_update = True
                break

            # Check that the symlink resolves to the correct target
            try:
                current_target = view_path.resolve()
                if current_target != cas_path.resolve():
                    needs_update = True
                    break
            except Exception:
                needs_update = True
                break

        if needs_update:
            return ViewManager.create_symlinks(snapshot_file, views)

        return {}  # Already correct

    @staticmethod
    def cleanup_symlinks(snapshot_file: SnapshotFile):
        """Remove all symlinks for a file"""
        from config.common import STORAGE_CONFIG

        for view_name in STORAGE_CONFIG.ENABLED_VIEWS:
            if view_name not in ViewManager.VIEWS:
                continue

            view = ViewManager.VIEWS[view_name]
            view_path = view.get_view_path(snapshot_file)

            if view_path.exists() or view_path.is_symlink():
                view_path.unlink()
                logger.info(f"Removed symlink: {view_path}")
```
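
As a quick illustration of how the view registry is used (assuming an existing `snapshot_file` instance), the same `SnapshotFile` resolves to one path per registered view:

```python
from storage.views import ViewManager

# Print every location where this file will be browseable
for view_name, view in ViewManager.VIEWS.items():
    print(f"{view_name}: {view.get_view_path(snapshot_file)}")
```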

## Automatic Synchronization

### Django Signals for Sync

```python
# archivebox/storage/signals.py

import logging

from django.db.models.signals import post_save, post_delete, pre_delete
from django.dispatch import receiver
from django.db import transaction
from django.db.models import F

from core.models import SnapshotFile, Blob

logger = logging.getLogger(__name__)


@receiver(post_save, sender=SnapshotFile)
def sync_symlinks_on_save(sender, instance, created, **kwargs):
    """
    Automatically create/update symlinks when a SnapshotFile is saved.
    Deferred via transaction.on_commit so symlinks only appear once the
    DB row is durably committed.
    """
    from config.common import STORAGE_CONFIG

    if not STORAGE_CONFIG.AUTO_SYNC_SYMLINKS:
        return

    if not created:
        return

    def _create_symlinks():
        try:
            from storage.views import ViewManager
            views = ViewManager.create_symlinks(instance)
            logger.info(f"Created {len(views)} symlinks for {instance.relative_path}")
        except Exception as e:
            logger.error(f"Failed to create symlinks for {instance.id}: {e}")
            # Don't raise - drift can be repaired later by verify_storage

    # New file - create all symlinks after the transaction commits
    transaction.on_commit(_create_symlinks)


@receiver(pre_delete, sender=SnapshotFile)
def sync_symlinks_on_delete(sender, instance, **kwargs):
    """
    Remove symlinks when a SnapshotFile is deleted.
    Runs BEFORE deletion so we still have the data.
    """
    try:
        from storage.views import ViewManager
        ViewManager.cleanup_symlinks(instance)
        logger.info(f"Removed symlinks for {instance.relative_path}")
    except Exception as e:
        logger.error(f"Failed to remove symlinks for {instance.id}: {e}")


@receiver(post_delete, sender=SnapshotFile)
def cleanup_unreferenced_blob(sender, instance, **kwargs):
    """
    Decrement the blob's reference count and clean it up once unreferenced.
    """
    try:
        blob = instance.blob

        # Atomic decrement
        Blob.objects.filter(pk=blob.pk).update(ref_count=F('ref_count') - 1)

        # Reload to get the updated count
        blob.refresh_from_db()

        # Garbage collect if there are no more references
        if blob.ref_count <= 0:
            from storage.views import ViewManager
            cas_path = ViewManager.get_cas_path(blob)

            if cas_path.exists():
                cas_path.unlink()
                logger.info(f"Garbage collected blob {blob.hash[:16]}...")

            blob.delete()

    except Exception as e:
        logger.error(f"Failed to cleanup blob: {e}")
```

### App Configuration

```python
# archivebox/storage/apps.py

from django.apps import AppConfig

class StorageConfig(AppConfig):
    default_auto_field = 'django.db.models.BigAutoField'
    name = 'storage'

    def ready(self):
        import storage.signals  # Register signal handlers
```
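
For `ready()` to fire, the app has to be registered like any other Django app (standard Django wiring, shown under the `storage` app label assumed throughout this doc):

```python
# settings.py (excerpt)

INSTALLED_APPS = [
    # ... existing ArchiveBox apps ...
    'core',
    'storage',  # runs StorageConfig.ready(), which registers the sync signal handlers
]
```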

## Migration Strategy

### Migration Command

```python
# archivebox/core/management/commands/migrate_to_cas.py

import shutil

from django.core.management.base import BaseCommand

from config import CONSTANTS  # assumed import path; provides ARCHIVE_DIR
from core.models import Blob, Snapshot
from storage.ingest import BlobManager

class Command(BaseCommand):
    help = 'Migrate existing archives to content-addressable storage'

    def add_arguments(self, parser):
        parser.add_argument('--dry-run', action='store_true', help='Show what would be done')
        parser.add_argument('--views', nargs='+', default=['by_timestamp', 'by_domain', 'by_date'])
        parser.add_argument('--cleanup-legacy', action='store_true', help='Delete old files after migration')

    def handle(self, *args, **options):
        dry_run = options['dry_run']
        views = options['views']
        cleanup = options['cleanup_legacy']

        snapshots = Snapshot.objects.all().order_by('created_at')
        total = snapshots.count()

        if dry_run:
            self.stdout.write(self.style.WARNING('DRY RUN - No changes will be made'))

        self.stdout.write(f"Found {total} snapshots to migrate")

        total_files = 0
        total_saved = 0
        total_bytes = 0
        error_count = 0

        for i, snapshot in enumerate(snapshots, 1):
            self.stdout.write(f"\n[{i}/{total}] Processing {snapshot.url[:60]}...")

            legacy_dir = CONSTANTS.ARCHIVE_DIR / snapshot.timestamp

            if not legacy_dir.exists():
                self.stdout.write("  Skipping (no legacy dir)")
                continue

            # Process each extractor directory
            for extractor_dir in legacy_dir.iterdir():
                if not extractor_dir.is_dir():
                    continue

                extractor = extractor_dir.name
                self.stdout.write(f"  Processing extractor: {extractor}")

                if dry_run:
                    file_count = sum(1 for p in extractor_dir.rglob('*') if p.is_file())
                    self.stdout.write(f"    Would ingest {file_count} files")
                    continue

                # Track blobs before ingestion
                blobs_before = Blob.objects.count()

                try:
                    # Ingest all files from this extractor
                    ingested = BlobManager.ingest_directory(
                        extractor_dir,
                        snapshot,
                        extractor
                    )

                    total_files += len(ingested)

                    # Calculate deduplication savings
                    blobs_after = Blob.objects.count()
                    new_blobs = blobs_after - blobs_before
                    dedup_count = len(ingested) - new_blobs

                    if dedup_count > 0:
                        # Rough estimate: the exact set of deduplicated files isn't
                        # tracked here, so sample dedup_count of the ingested blobs
                        dedup_bytes = sum(f.blob.size for f in ingested[-dedup_count:])
                        total_saved += dedup_bytes
                        self.stdout.write(
                            f"    ✓ Ingested {len(ingested)} files "
                            f"({new_blobs} new, {dedup_count} deduplicated, "
                            f"saved {dedup_bytes / 1024 / 1024:.1f} MB)"
                        )
                    else:
                        total_bytes_added = sum(f.blob.size for f in ingested)
                        total_bytes += total_bytes_added
                        self.stdout.write(
                            f"    ✓ Ingested {len(ingested)} files "
                            f"({total_bytes_added / 1024 / 1024:.1f} MB)"
                        )

                except Exception as e:
                    error_count += 1
                    self.stdout.write(self.style.ERROR(f"    ✗ Error: {e}"))
                    continue

            # Cleanup legacy files
            if cleanup and not dry_run:
                try:
                    shutil.rmtree(legacy_dir)
                    self.stdout.write(f"  Cleaned up legacy dir: {legacy_dir}")
                except Exception as e:
                    self.stdout.write(self.style.WARNING(f"  Failed to cleanup: {e}"))

            # Progress update
            if i % 10 == 0:
                self.stdout.write(
                    f"\nProgress: {i}/{total} | "
                    f"Files: {total_files:,} | "
                    f"Saved: {total_saved / 1024 / 1024:.1f} MB | "
                    f"Errors: {error_count}"
                )

        # Final summary
        self.stdout.write("\n" + "="*80)
        self.stdout.write(self.style.SUCCESS("Migration Complete!"))
        self.stdout.write(f"  Snapshots processed: {total}")
        self.stdout.write(f"  Files ingested: {total_files:,}")
        self.stdout.write(f"  Space saved by deduplication: {total_saved / 1024 / 1024:.1f} MB")
        self.stdout.write(f"  Errors: {error_count}")
        self.stdout.write(f"  Symlink views created: {', '.join(views)}")
```

### Rebuild Views Command

```python
# archivebox/core/management/commands/rebuild_views.py

import shutil

from django.core.management.base import BaseCommand

from config import CONSTANTS  # assumed import path; provides ARCHIVE_DIR
from core.models import SnapshotFile
from storage.views import ViewManager

class Command(BaseCommand):
    help = 'Rebuild symlink farm views from the database'

    def add_arguments(self, parser):
        parser.add_argument(
            '--views',
            nargs='+',
            default=['by_timestamp', 'by_domain', 'by_date'],
            help='Which views to rebuild'
        )
        parser.add_argument(
            '--clean',
            action='store_true',
            help='Remove old symlinks before rebuilding'
        )

    def handle(self, *args, **options):
        views = options['views']
        clean = options['clean']

        # Clean old views
        if clean:
            self.stdout.write("Cleaning old views...")
            for view_name in views:
                view_dir = CONSTANTS.ARCHIVE_DIR / view_name
                if view_dir.exists():
                    shutil.rmtree(view_dir)
                    self.stdout.write(f"  Removed {view_dir}")

        # Rebuild all symlinks
        total_symlinks = 0
        total_files = SnapshotFile.objects.count()

        self.stdout.write(f"Rebuilding symlinks for {total_files:,} files...")

        # .iterator() streams rows to keep memory bounded on large archives
        for i, snapshot_file in enumerate(
            SnapshotFile.objects.select_related('snapshot', 'blob').iterator(),
            1
        ):
            try:
                created = ViewManager.create_symlinks(snapshot_file, views=views)
                total_symlinks += len(created)
            except Exception as e:
                self.stdout.write(self.style.ERROR(
                    f"Failed to create symlinks for {snapshot_file}: {e}"
                ))

            if i % 1000 == 0:
                self.stdout.write(f"  Created {total_symlinks:,} symlinks...")

        self.stdout.write(
            self.style.SUCCESS(
                f"\n✓ Rebuilt {total_symlinks:,} symlinks across {len(views)} views"
            )
        )
```

## Verification and Repair

### Storage Verification Command

```python
# archivebox/core/management/commands/verify_storage.py

from django.core.management.base import BaseCommand

from config import CONSTANTS  # assumed import path; provides ARCHIVE_DIR
from config.common import STORAGE_CONFIG
from core.models import SnapshotFile, Blob
from storage.views import ViewManager

class Command(BaseCommand):
    help = 'Verify storage consistency between the DB and the filesystem'

    def add_arguments(self, parser):
        parser.add_argument('--fix', action='store_true', help='Fix issues found')
        parser.add_argument('--vacuum', action='store_true', help='Remove orphaned symlinks')

    def handle(self, *args, **options):
        fix = options['fix']
        vacuum = options['vacuum']

        issues = {
            'missing_cas_files': [],
            'missing_symlinks': [],
            'incorrect_symlinks': [],
            'orphaned_symlinks': [],
            'orphaned_blobs': [],
        }

        self.stdout.write("Checking database → filesystem consistency...")

        # Check 1: Verify all blobs exist in CAS
        self.stdout.write("\n1. Verifying CAS files...")
        for blob in Blob.objects.all():
            cas_path = ViewManager.get_cas_path(blob)
            if not cas_path.exists():
                issues['missing_cas_files'].append(blob)
                self.stdout.write(self.style.ERROR(
                    f"✗ Missing CAS file: {cas_path} (blob {blob.hash[:16]}...)"
                ))

        # Check 2: Verify all SnapshotFiles have correct symlinks
        self.stdout.write("\n2. Verifying symlinks...")
        total_files = SnapshotFile.objects.count()

        for i, sf in enumerate(SnapshotFile.objects.select_related('snapshot', 'blob'), 1):
            if i % 100 == 0:
                self.stdout.write(f"  Checked {i}/{total_files} files...")

            cas_path = ViewManager.get_cas_path(sf.blob)

            for view_name in STORAGE_CONFIG.ENABLED_VIEWS:
                view = ViewManager.VIEWS[view_name]
                view_path = view.get_view_path(sf)

                if not view_path.exists() and not view_path.is_symlink():
                    issues['missing_symlinks'].append((sf, view_name, view_path))

                    if fix:
                        try:
                            ViewManager.create_symlinks_idempotent(sf, [view_name])
                            self.stdout.write(self.style.SUCCESS(
                                f"✓ Created missing symlink: {view_path}"
                            ))
                        except Exception as e:
                            self.stdout.write(self.style.ERROR(
                                f"✗ Failed to create symlink: {e}"
                            ))

                elif view_path.is_symlink():
                    # Verify the symlink points at the correct CAS file
                    try:
                        current_target = view_path.resolve()
                        if current_target != cas_path.resolve():
                            issues['incorrect_symlinks'].append((sf, view_name, view_path))

                            if fix:
                                ViewManager.create_symlinks_idempotent(sf, [view_name])
                                self.stdout.write(self.style.SUCCESS(
                                    f"✓ Fixed incorrect symlink: {view_path}"
                                ))
                    except Exception as e:
                        self.stdout.write(self.style.ERROR(
                            f"✗ Broken symlink: {view_path} - {e}"
                        ))

        # Check 3: Find orphaned symlinks
        if vacuum:
            self.stdout.write("\n3. Checking for orphaned symlinks...")

            # Get all valid view paths from the DB
            valid_paths = set()
            for sf in SnapshotFile.objects.all():
                for view_name in STORAGE_CONFIG.ENABLED_VIEWS:
                    view = ViewManager.VIEWS[view_name]
                    valid_paths.add(view.get_view_path(sf))

            # Scan the filesystem for symlinks
            for view_name in STORAGE_CONFIG.ENABLED_VIEWS:
                view_base = CONSTANTS.ARCHIVE_DIR / view_name
                if not view_base.exists():
                    continue

                for path in view_base.rglob('*'):
                    if path.is_symlink() and path not in valid_paths:
                        issues['orphaned_symlinks'].append(path)

                        if fix:
                            path.unlink()
                            self.stdout.write(self.style.SUCCESS(
                                f"✓ Removed orphaned symlink: {path}"
                            ))

        # Check 4: Find orphaned blobs
        self.stdout.write("\n4. Checking for orphaned blobs...")
        orphaned_blobs = Blob.objects.filter(ref_count=0)

        for blob in orphaned_blobs:
            issues['orphaned_blobs'].append(blob)

            if fix:
                cas_path = ViewManager.get_cas_path(blob)
                if cas_path.exists():
                    cas_path.unlink()
                blob.delete()
                self.stdout.write(self.style.SUCCESS(
                    f"✓ Removed orphaned blob: {blob.hash[:16]}..."
                ))

        # Summary
        self.stdout.write("\n" + "="*80)
        self.stdout.write(self.style.WARNING("Storage Verification Summary:"))
        self.stdout.write(f"  Missing CAS files: {len(issues['missing_cas_files'])}")
        self.stdout.write(f"  Missing symlinks: {len(issues['missing_symlinks'])}")
        self.stdout.write(f"  Incorrect symlinks: {len(issues['incorrect_symlinks'])}")
        self.stdout.write(f"  Orphaned symlinks: {len(issues['orphaned_symlinks'])}")
        self.stdout.write(f"  Orphaned blobs: {len(issues['orphaned_blobs'])}")

        total_issues = sum(len(v) for v in issues.values())

        if total_issues == 0:
            self.stdout.write(self.style.SUCCESS("\n✓ Storage is consistent!"))
        elif fix:
            self.stdout.write(self.style.SUCCESS(f"\n✓ Fixed {total_issues} issues"))
        else:
            self.stdout.write(self.style.WARNING(
                f"\n⚠ Found {total_issues} issues. Run with --fix to repair."
            ))
```

## Configuration

```python
# archivebox/config/common.py

class StorageConfig(BaseConfigSet):
    toml_section_header: str = "STORAGE_CONFIG"

    # Existing fields
    TMP_DIR: Path = Field(default=CONSTANTS.DEFAULT_TMP_DIR)
    LIB_DIR: Path = Field(default=CONSTANTS.DEFAULT_LIB_DIR)
    OUTPUT_PERMISSIONS: str = Field(default="644")
    RESTRICT_FILE_NAMES: str = Field(default="windows")
    ENFORCE_ATOMIC_WRITES: bool = Field(default=True)
    DIR_OUTPUT_PERMISSIONS: str = Field(default="755")

    # New CAS fields
    USE_CAS: bool = Field(
        default=True,
        description="Use content-addressable storage with deduplication"
    )

    ENABLED_VIEWS: list[str] = Field(
        default=['by_timestamp', 'by_domain', 'by_date'],
        description="Which symlink farm views to maintain"
    )

    AUTO_SYNC_SYMLINKS: bool = Field(
        default=True,
        description="Automatically create/update symlinks via signals"
    )

    VERIFY_ON_STARTUP: bool = Field(
        default=False,
        description="Verify storage consistency on startup"
    )

    VERIFY_INTERVAL_HOURS: int = Field(
        default=24,
        description="Run periodic storage verification (0 to disable)"
    )

    CLEANUP_TEMP_FILES: bool = Field(
        default=True,
        description="Remove temporary extractor files after ingestion"
    )

    # Literal restricts the allowed values (pydantic-style; Field has no `choices` kwarg)
    CAS_BACKEND: Literal['local', 's3', 'azure', 'gcs'] = Field(
        default='local',
        description="Storage backend for CAS blobs"
    )
```

## Workflow Examples

### Example 1: Normal Operation

```python
# Extractor writes files to a temporary directory
extractor_dir = Path('/tmp/wget-output')

# After extraction completes, ingest into CAS
from storage.ingest import BlobManager

ingested_files = BlobManager.ingest_directory(
    extractor_dir,
    snapshot,
    'wget'
)

# Behind the scenes:
# 1. Each file hashed (SHA-256)
# 2. Blob created/found in DB (deduplication)
# 3. File stored in CAS (if new)
# 4. SnapshotFile created in DB
# 5. post_save signal fires
# 6. Symlinks automatically created in all enabled views
# ✓ DB and filesystem in perfect sync
```

### Example 2: Browse Archives

```bash
# Users can browse in multiple ways:

# By domain (great for site collections)
$ ls /data/archive/by_domain/example.com/20241225/
019b54ee-28d9-72dc/

# By date (great for time-based browsing)
$ ls /data/archive/by_date/20241225/
example.com/
github.com/
wikipedia.org/

# By user (great for multi-user setups)
$ ls /data/archive/by_user/squash/20241225/
example.com/
github.com/

# Legacy timestamp (backwards compatibility)
$ ls /data/archive/by_timestamp/1735142400.123/
wget/
singlefile/
screenshot/
```

### Example 3: Crash Recovery

```bash
# The system crashes after the DB save but before symlinks are created:
# - DB has the SnapshotFile record ✓
# - Symlinks missing ✗

# Next verification run:
$ python -m archivebox verify_storage --fix

# Output:
# Checking database → filesystem consistency...
# ✗ Missing symlink: /data/archive/by_domain/example.com/.../index.html
# ✓ Created missing symlink
# ✓ Fixed 1 issues

# Storage is now consistent!
```

### Example 4: Migration from Legacy

```bash
# Preview the migration of all existing archives to CAS
$ python -m archivebox migrate_to_cas --dry-run

# Output:
# DRY RUN - No changes will be made
# Found 1000 snapshots to migrate
# [1/1000] Processing https://example.com...
#   Processing extractor: wget
#     Would ingest 15 files
#   Processing extractor: singlefile
#     Would ingest 1 files
# ...

# Run the actual migration
$ python -m archivebox migrate_to_cas

# Output:
# [1/1000] Processing https://example.com...
#     ✓ Ingested 15 files (3 new, 12 deduplicated, saved 2.4 MB)
# ...
# Migration Complete!
#   Snapshots processed: 1000
#   Files ingested: 45,231
#   Space saved by deduplication: 12.3 GB
```

## Benefits

### Space Savings

- **Massive deduplication**: Common files (jquery, fonts, images) stored once
- **30-70% typical savings** across archives
- **Symlink overhead**: ~0.1% of saved space (negligible)

### Flexibility

- **Multiple views**: Browse by domain, date, user, timestamp
- **Add views anytime**: Run `rebuild_views` to add a new organization scheme
- **No data migration needed**: Just rebuild symlinks

### S3 Support

- **Use django-storages**: Drop-in S3, Azure, GCS support
- **Hybrid mode**: Hot data local, cold data in S3
- **Cost optimization**: S3 Intelligent-Tiering for automatic cost reduction

### Data Integrity

- **Database as truth**: Symlinks are disposable, can be rebuilt
- **Automatic sync**: Signals keep symlinks current
- **Self-healing**: Verification detects and fixes drift
- **Atomic operations**: Transaction-safe

### Backwards Compatibility

- **Legacy view**: `by_timestamp` maintains the old structure
- **Gradual migration**: Old and new archives coexist
- **Zero downtime**: Archives keep working during migration

### Developer Experience

- **Human-browseable**: Easy to inspect and debug
- **Standard tools work**: cp, rsync, tar, zip all work normally
- **Multiple organization schemes**: Find archives multiple ways
- **Easy backups**: Symlinks handled correctly by modern tools

## Implementation Checklist

- [ ] Create database models (Blob, SnapshotFile)
- [ ] Create migrations for new models
- [ ] Implement BlobManager (ingest.py)
- [ ] Implement ViewManager (views.py)
- [ ] Implement Django signals (signals.py)
- [ ] Create migrate_to_cas command
- [ ] Create rebuild_views command
- [ ] Create verify_storage command
- [ ] Update Snapshot.output_dir property
- [ ] Update ArchiveResult to use SnapshotFile
- [ ] Add StorageConfig settings
- [ ] Configure django-storages
- [ ] Test with local filesystem
- [ ] Test with S3
- [ ] Document for users
- [ ] Update backup procedures

## Future Enhancements

- [ ] Web UI for browsing CAS blobs
- [ ] API endpoints for file access
- [ ] Content-aware compression (compress similar files together)
- [ ] IPFS backend support
- [ ] Automatic tiering (hot → warm → cold → glacier)
- [ ] Deduplication statistics dashboard
- [ ] Export to WARC with CAS metadata