Critical Realization: Each plugin must produce at most ONE ArchiveResult output (infrastructure plugins like chrome produce none). This is fundamental to ArchiveBox's architecture: a single plugin can never emit multiple ArchiveResults.
DO NOT CONFUSE THESE CONCEPTS:
- Plugin = Directory name (e.g., chrome, consolelog, screenshot)
  - Source location: archivebox/plugins/<plugin_name>/
  - Output location: users/{username}/snapshots/YYYYMMDD/{domain}/{snap_id}/{plugin_name}/
- Hook = Individual script file (e.g., on_Snapshot__20_chrome_tab.bg.js)
- Extractor = ArchiveResult.extractor field = PLUGIN NAME (not hook name)

CORRECT: ArchiveResult.extractor = 'chrome' (plugin name)
WRONG:   ArchiveResult.extractor = '20_chrome_tab.bg' (hook name)

Output Directory = users/{username}/snapshots/YYYYMMDD/{domain}/{snap_id}/{plugin_name}/
- users/default/snapshots/20251227/example.com/019b-6397-6a5b/chrome/ contains outputs from ALL chrome hooks
- archive/{timestamp}/ with symlink for backwards compatibility

Example 1: Chrome Plugin (Infrastructure - NO ArchiveResult)
Plugin name: 'chrome'
ArchiveResult: NONE (infrastructure only)
Output directory: users/default/snapshots/20251227/example.com/019b-6397-6a5b/chrome/
Hooks:
- on_Snapshot__20_chrome_tab.bg.js # Launches Chrome, opens tab
- on_Snapshot__30_chrome_navigate.js # Navigates to URL
- on_Snapshot__45_chrome_tab_cleanup.py # Kills Chrome on cleanup
Writes (temporary infrastructure files, deleted on cleanup):
- chrome/cdp_url.txt # Other plugins read this to connect
- chrome/target_id.txt # Tab ID for CDP connection
- chrome/page_loaded.txt # Navigation completion marker
- chrome/navigation.json # Navigation state
- chrome/hook.pid # For cleanup
NO ArchiveResult JSON is produced - this is pure infrastructure.
On SIGTERM: Chrome exits, chrome/ directory is deleted.
Example 2: Screenshot Plugin (Output Plugin - CREATES ArchiveResult)
Plugin name: 'screenshot'
ArchiveResult.extractor: 'screenshot'
Output directory: users/default/snapshots/20251227/example.com/019b-6397-6a5b/screenshot/
Hooks:
- on_Snapshot__34_screenshot.js
Process:
1. Reads ../chrome/cdp_url.txt to get Chrome connection
2. Connects to Chrome CDP
3. Takes screenshot
4. Writes to: screenshot/screenshot.png
5. Emits ArchiveResult JSON to stdout
Creates ArchiveResult with status=succeeded, output_files={'screenshot.png': {}}
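Step 5 can be sketched as follows; a minimal, hypothetical version of the JSON emission (the field names follow the examples in this document, everything else is illustrative):

```javascript
// Hypothetical sketch of step 5: emit exactly ONE ArchiveResult JSON
// object on stdout. Note that extractor is the PLUGIN name
// ('screenshot'), never the hook name ('34_screenshot').
const result = {
  extractor: 'screenshot',                 // plugin name, not hook name
  status: 'succeeded',
  output_files: { 'screenshot.png': {} },  // files written to screenshot/
};
process.stdout.write(JSON.stringify(result) + '\n');
```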
Example 3: PDF Plugin (Output Plugin - CREATES ArchiveResult)
Plugin name: 'pdf'
ArchiveResult.extractor: 'pdf'
Output directory: users/default/snapshots/20251227/example.com/019b-6397-6a5b/pdf/
Hooks:
- on_Snapshot__35_pdf.js
Process:
1. Reads ../chrome/cdp_url.txt to get Chrome connection
2. Connects to Chrome CDP
3. Generates PDF
4. Writes to: pdf/output.pdf
5. Emits ArchiveResult JSON to stdout
Creates ArchiveResult with status=succeeded, output_files={'output.pdf': {}}
Lifecycle:
1. Chrome hooks run → create chrome/ dir with infrastructure files
2. Screenshot/PDF/etc hooks run → read chrome/cdp_url.txt, write to their own dirs
3. Snapshot.cleanup() called → sends SIGTERM to background hooks
4. Chrome receives SIGTERM → exits, deletes chrome/ dir
5. Screenshot/PDF/etc dirs remain with their outputs
DO NOT:
- Create an ArchiveResult per hook, or set ArchiveResult.extractor to a hook name
DO:
- Emit at most one ArchiveResult per plugin, with extractor set to the plugin name
This principle drove the entire consolidation strategy:
Location: archivebox/plugins/chrome/
This plugin provides shared Chrome infrastructure for other plugins. It manages the browser lifecycle but produces NO ArchiveResult - only infrastructure files in a single chrome/ output directory.
Consolidates these former plugins:
- chrome_session/ → Merged
- chrome_navigate/ → Merged
- chrome_cleanup/ → Merged
- chrome_extensions/ → Utilities merged

Hook Files:
chrome/
├── on_Crawl__00_chrome_install_config.py # Configure Chrome settings
├── on_Crawl__00_chrome_install.py # Install Chrome binary
├── on_Crawl__30_chrome_launch.bg.js # Launch Chrome (Crawl-level, bg)
├── on_Snapshot__20_chrome_tab.bg.js # Open tab (Snapshot-level, bg)
├── on_Snapshot__30_chrome_navigate.js # Navigate to URL (foreground)
├── on_Snapshot__45_chrome_tab_cleanup.py # Close tab, kill bg hooks
├── chrome_extension_utils.js # Extension utilities
├── config.json # Configuration
└── tests/test_chrome.py # Tests
Output Directory (Infrastructure Only):
chrome/
├── cdp_url.txt # WebSocket URL for CDP connection
├── pid.txt # Chrome process PID
├── target_id.txt # Current tab target ID
├── page_loaded.txt # Navigation completion marker
├── final_url.txt # Final URL after redirects
├── navigation.json # Navigation state (NEW)
└── hook.pid # Background hook PIDs (for cleanup)
New: navigation.json
Tracks navigation state with wait condition and timing:
{
"waitUntil": "networkidle2",
"elapsed": 1523,
"url": "https://example.com",
"finalUrl": "https://example.com/",
"status": 200,
"timestamp": "2025-12-27T22:15:30.123Z"
}
Fields:
- waitUntil - Wait condition: networkidle0, networkidle2, domcontentloaded, or load
- elapsed - Navigation time in milliseconds
- url - Original requested URL
- finalUrl - Final URL after redirects (success only)
- status - HTTP status code (success only)
- error - Error message (failure only)
- timestamp - ISO 8601 completion timestamp

These remain SEPARATE plugins because each produces a distinct output/ArchiveResult. Each plugin references ../chrome for infrastructure.
archivebox/plugins/consolelog/
└── on_Snapshot__21_consolelog.bg.js
  - Output: console.jsonl (browser console messages)
  - Reads: ../chrome for CDP URL

archivebox/plugins/ssl/
└── on_Snapshot__23_ssl.bg.js
  - Output: ssl.jsonl (SSL/TLS certificate details)
  - Reads: ../chrome for CDP URL

archivebox/plugins/responses/
└── on_Snapshot__24_responses.bg.js
  - Output: responses/ directory with index.jsonl (network responses)
  - Reads: ../chrome for CDP URL

archivebox/plugins/redirects/
└── on_Snapshot__31_redirects.bg.js
  - Output: redirects.jsonl (redirect chain)
  - Reads: ../chrome for CDP URL
  - Uses Network.requestWillBeSent to capture redirects from the initial request

archivebox/plugins/staticfile/
└── on_Snapshot__31_staticfile.bg.js
  - Reads: ../chrome for CDP URL

Merged: chrome_session, chrome_navigate, chrome_cleanup, chrome_extensions → chrome/ (same directory)

redirects Plugin:
- Uses the Network.requestWillBeSent event with the redirectResponse parameter

staticfile Plugin:
- Uses page.on('response') to capture Content-Type from the initial request

- navigation.json file in chrome/ output directory
- waitUntil condition and elapsed milliseconds
- chrome_session/on_CrawlEnd__99_chrome_cleanup.py (manual cleanup hook)
- core/models.py and crawls/models.py work correctly

═══ CRAWL LEVEL ═══
00. chrome_install_config.py Configure Chrome settings
00. chrome_install.py Install Chrome binary
30. chrome_launch.bg.js Launch Chrome browser (STAYS RUNNING)
═══ PER-SNAPSHOT LEVEL ═══
Phase 1: PRE-NAVIGATION (Background hooks setup)
20. chrome_tab.bg.js Open new tab (STAYS ALIVE)
21. consolelog.bg.js Setup console listener (STAYS ALIVE)
23. ssl.bg.js Setup SSL listener (STAYS ALIVE)
24. responses.bg.js Setup network response listener (STAYS ALIVE)
31. redirects.bg.js Setup redirect listener (STAYS ALIVE)
31. staticfile.bg.js Setup staticfile detector (STAYS ALIVE)
Phase 2: NAVIGATION (Foreground - synchronization point)
30. chrome_navigate.js Navigate to URL (BLOCKS until page loaded)
↓
Writes navigation.json with waitUntil & elapsed
Writes page_loaded.txt marker
↓
All background hooks can now finalize
Phase 3: POST-NAVIGATION (Background hooks finalize)
(All .bg hooks save their data and wait for cleanup signal)
Phase 4: OTHER EXTRACTORS (use loaded page)
34. screenshot.js
37. singlefile.js
... (other extractors that need loaded page)
Phase 5: CLEANUP
45. chrome_tab_cleanup.py Close tab
Kill background hooks (SIGTERM → SIGKILL)
Update ArchiveResults
All .bg.js hooks follow this pattern:
Key files written:
- hook.pid - Process ID for cleanup mechanism
- Output data files (console.jsonl, ssl.jsonl, etc.)

Snapshot-level cleanup (core/models.py):
import os, signal, time

def cleanup(self):
    """Kill background hooks and close resources (sketch)."""
    # Scan OUTPUT_DIR for hook.pid files left by background hooks
    for pid_file in self.OUTPUT_DIR.glob('**/hook.pid'):
        pid = int(pid_file.read_text().strip())
        try:
            os.kill(pid, signal.SIGTERM)   # request graceful exit
            time.sleep(2)                  # wait for graceful exit
            os.kill(pid, signal.SIGKILL)   # force-kill if still alive
        except ProcessLookupError:
            pass                           # process already exited
        # Update the hook's ArchiveResult to FAILED if it never finished
Crawl-level cleanup (crawls/models.py):
def cleanup(self):
"""Kill Crawl-level background hooks (Chrome browser)."""
# Similar pattern for Crawl-level resources
# Kills Chrome launch process
State machine integration:
- SnapshotMachine and CrawlMachine call cleanup() when entering the sealed state

Crawl output structure:
- Template: users/{user_id}/crawls/{YYYYMMDD}/{crawl_id}/
- Example: users/1/crawls/20251227/abc-def-123/
- Crawl-level Chrome: users/1/crawls/20251227/abc-def-123/chrome/

Snapshot output structure:
- Template: archive/{timestamp}/
- Plugin subdirs: archive/{timestamp}/chrome/, archive/{timestamp}/consolelog/, etc.

Within chrome plugin:
- Use . or OUTPUT_DIR to reference the chrome/ directory the hooks are running in
- Example: fs.writeFileSync(path.join(OUTPUT_DIR, 'navigation.json'), ...)

From output plugins to chrome (same snapshot):
- Use ../chrome to reference Chrome infrastructure in the same snapshot
- Example: const CHROME_SESSION_DIR = '../chrome';
- Files read: cdp_url.txt, target_id.txt, page_loaded.txt

From snapshot hooks to crawl chrome:
- Use the CRAWL_OUTPUT_DIR environment variable (set by hooks.py)
- Example: path.join(process.env.CRAWL_OUTPUT_DIR, 'chrome') to find crawl-level Chrome

Navigation synchronization:
- Background hooks wait for ../chrome/page_loaded.txt before finalizing
- The marker is written by chrome_navigate.js after navigation completes

One ArchiveResult Per Plugin
Chrome as Infrastructure
Background Hooks for CDP
- CDP listener hooks run in the background (.bg.js)

Foreground for Synchronization
- chrome_navigate.js is foreground (not .bg)

Automatic Cleanup
Clear Separation
✓ Architectural Clarity - Clear separation between infrastructure and outputs
✓ Correct Output Model - One ArchiveResult per plugin
✓ Better Performance - CDP listeners capture data from the initial request
✓ No Duplication - Single Chrome infrastructure used by all
✓ Proper Lifecycle - Background hooks cleaned up automatically
✓ Maintainable - Easy to understand, debug, and extend
✓ Consistent - All background hooks follow the same pattern
✓ Observable - Navigation state tracked for debugging
Run tests:
sudo -u testuser bash -c 'source .venv/bin/activate && python -m pytest archivebox/plugins/chrome/tests/ -v'
For developers:
- Write to the chrome/ output dir (not chrome_session/)
- Read ../chrome/cdp_url.txt from output plugins
- Wait for ../chrome/page_loaded.txt
- Check ../chrome/navigation.json

For users: