ArchiveBox CLI Refactor TODO
Design Decisions
- Keep `archivebox add` as a high-level convenience command
- Unified `archivebox run` for processing (replaces per-model run and orchestrator)
- Expose all models including binary, process, machine
- Clean break from old command structure (no backward compatibility aliases)
Final Architecture
archivebox <model> <action> [args...] [--filters]
archivebox run [stdin JSONL]
Actions (4 per model; process and machine are read-only, see Models):
create - Create records (from args, stdin, or JSONL), dedupes by indexed fields
list - Query records (with filters, returns JSONL)
update - Modify records (from stdin JSONL, PATCH semantics)
delete - Remove records (from stdin JSONL, requires --yes)
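The PATCH semantics of update and the dedupe behavior of create can be sketched as below. This is a minimal illustration on plain dicts, not the actual ArchiveBox implementation; the helper names `apply_patch` and `dedupe` and the choice of `url` as the indexed field are assumptions.

```python
def apply_patch(record: dict, patch: dict) -> dict:
    """Return a copy of `record` where only the fields present in `patch` change.

    Fields missing from `patch` keep their old value (PATCH, not PUT, semantics).
    """
    updated = dict(record)
    updated.update(patch)
    return updated


def dedupe(records, key_fields):
    """Yield the first record seen for each combination of `key_fields`.

    `key_fields` stands in for the "indexed fields" mentioned above (assumed).
    """
    seen = set()
    for rec in records:
        key = tuple(rec.get(field) for field in key_fields)
        if key in seen:
            continue
        seen.add(key)
        yield rec


snapshot = {"url": "https://example.com", "tags": "old", "status": "queued"}
patched = apply_patch(snapshot, {"tags": "new"})  # status and url are untouched

unique = list(dedupe(
    [{"url": "https://example.com"}, {"url": "https://example.com"}, {"url": "https://foo.com"}],
    key_fields=["url"],
))
```

Returning a copy rather than mutating in place keeps the original record available for comparison or logging.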
Unified Run Command:
archivebox run - Process queued work
- With stdin JSONL: Process piped records, exit when complete
- Without stdin (TTY): Run orchestrator in foreground until killed
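One way to implement the two modes is to check whether stdin is a terminal. The sketch below assumes that approach; `choose_mode` and `read_jsonl` are hypothetical helper names, not part of ArchiveBox.

```python
import io
import json
import sys


def read_jsonl(stream):
    """Yield one parsed object per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def choose_mode(stream=None) -> str:
    """Return "foreground" when the stream is a TTY, otherwise "piped"."""
    stream = sys.stdin if stream is None else stream
    return "foreground" if stream.isatty() else "piped"


# Simulate piped input; an in-memory StringIO is never a TTY.
piped = io.StringIO('{"id": 1}\n\n{"id": 2}\n')
mode = choose_mode(piped)
records = list(read_jsonl(piped)) if mode == "piped" else []
```

In the real command, the "foreground" branch would start the orchestrator loop, and the "piped" branch would process the records and then exit.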
Models (7 total):
crawl - Crawl jobs
snapshot - Individual archived pages
archiveresult - Plugin extraction results
tag - Tags/labels
binary - Detected binaries (chrome, wget, etc.)
process - Process execution records (read-only)
machine - Machine/host records (read-only)
Implementation Checklist
Phase 1: Unified Run Command
Phase 2: Core Model Commands
Phase 3: System Model Commands
Phase 4: Registry & Cleanup
Phase 5: Tests for New Commands
Usage Examples
Basic CRUD
# Create
archivebox crawl create https://example.com https://foo.com --depth=1
archivebox snapshot create https://example.com --tag=news
# List with filters
archivebox crawl list --status=queued
archivebox snapshot list --url__icontains=example.com
archivebox archiveresult list --status=failed --plugin=screenshot
# Update (reads JSONL from stdin, applies changes)
archivebox snapshot list --tag=old | archivebox snapshot update --tag=new
# Delete (requires --yes)
archivebox crawl list --url__icontains=example.com | archivebox crawl delete --yes
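The double-underscore filter flags above resemble Django-style field lookups. The sketch below assumes that reading and shows how such flags could be parsed and applied to in-memory records. `parse_filters` and `matches` are hypothetical helpers, and only the `exact` and `icontains` lookups are handled.

```python
def parse_filters(argv):
    """Turn ["--status=queued", "--url__icontains=x"] into a dict of lookups."""
    filters = {}
    for arg in argv:
        if arg.startswith("--") and "=" in arg:
            key, value = arg[2:].split("=", 1)
            filters[key] = value
    return filters


def matches(record: dict, filters: dict) -> bool:
    """Return True if `record` satisfies every filter."""
    for key, expected in filters.items():
        field, _, lookup = key.partition("__")
        actual = str(record.get(field, ""))
        if lookup == "icontains":
            if expected.lower() not in actual.lower():
                return False
        elif lookup in ("", "exact"):
            if actual != expected:
                return False
        else:
            raise ValueError(f"unsupported lookup: {lookup}")
    return True


filters = parse_filters(["--status=queued", "--url__icontains=example.com"])
rows = [
    {"url": "https://Example.com/page", "status": "queued"},
    {"url": "https://foo.com", "status": "queued"},
]
selected = [row for row in rows if matches(row, filters)]
```

If the models are Django models, the parsed dict could instead be passed directly to a queryset as `Model.objects.filter(**filters)`.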
Unified Run Command
# Run orchestrator in foreground (replaces `archivebox orchestrator`)
archivebox run
# Process specific records (pipe any JSONL type, exits when done)
archivebox snapshot list --status=queued | archivebox run
archivebox archiveresult list --status=failed | archivebox run
archivebox crawl list --status=queued | archivebox run
# Mixed types work too - run handles any JSONL
cat mixed_records.jsonl | archivebox run
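Handling mixed input implies routing each record by its kind. The sketch below groups JSONL lines by a discriminator field. The field name `type` and the sample values are assumptions, since the record schema is not specified here.

```python
import json
from collections import defaultdict


def group_by_type(lines, type_field="type"):
    """Group parsed JSONL records by their `type_field` value.

    Records without the field are collected under "unknown".
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        groups[record.get(type_field, "unknown")].append(record)
    return dict(groups)


mixed = [
    '{"type": "Snapshot", "url": "https://example.com"}',
    '{"type": "Crawl", "urls": "https://foo.com"}',
    "",
    '{"type": "Snapshot", "url": "https://foo.com"}',
    '{"url": "https://no-type.example"}',
]
grouped = group_by_type(mixed)
```

Each group could then be handed to the worker for that model, which lets a single `run` invocation process any combination of record types.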
Composed Workflows
# Full pipeline (the explicit equivalent of `archivebox add`)
archivebox crawl create https://example.com --status=queued \
| archivebox snapshot create --status=queued \
| archivebox archiveresult create --status=queued \
| archivebox run
# Re-run failed extractions
archivebox archiveresult list --status=failed | archivebox run
# Delete all snapshots for a domain
archivebox snapshot list --url__icontains=spam.com | archivebox snapshot delete --yes
Keep archivebox add as convenience
# This remains the simple user-friendly interface:
archivebox add https://example.com --depth=1 --tag=news
# Internally equivalent to the composed pipeline above