Snapshot.run() should enforce that snapshot hooks run in 10 discrete, sequential "steps" (the tens digit of each hook's two-digit run order): 0*, 1*, 2*, 3*, 4*, 5*, 6*, 7*, 8*, 9*.
For every discovered hook script, ArchiveBox should create an ArchiveResult in queued state, then manage running them using retry_at and inline logic to enforce this ordering.
Data model:
- ArchiveResult.hook_name (CharField, nullable) - just the filename, e.g., 'on_Snapshot__20_chrome_tab.bg.js'
- ArchiveResult.plugin - still important (plugin directory name)
- The step number is derived from hook_name via extract_step(hook_name) - not stored
- Snapshot.current_step (IntegerField 0-9, default=0)
- SnapshotMachine state transitions drive step advancement

Execution flow:
- Snapshot.run() discovers all hooks upfront and creates one ArchiveResult per hook with hook_name set
- Workers only run an ArchiveResult once extract_step(ar.hook_name) <= snapshot.current_step
- Snapshot.advance_step_if_ready() increments current_step when all foreground hooks for the current step are done
- ArchiveResult.run(): if self.hook_name is set, run that single hook; if self.hook_name is None, discover all hooks for self.plugin and run them sequentially
- Background hooks are detected by .bg. in the filename (e.g., on_Snapshot__20_chrome_tab.bg.js)
- Snapshot.advance_step_if_ready() checks all foreground ArchiveResults at current_step; when current_step=9 and all foreground hooks are done, it kills any remaining background hooks via Snapshot.cleanup()
- The step is parsed with re.search(r'__(\d{2})_', hook_name), defaulting to 9 if there is no match

Hook scripts are numbered 00 to 99 to control:
Hook scripts are launched strictly sequentially based on their filename alphabetical order, and run in sets of several per step before moving on to the next step.
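The two filename-derived helpers this design relies on, extract_step() and is_background_hook(), are small enough to sketch directly. This is hypothetical code matching the spec above, not the actual archivebox/hooks.py; the tens-digit mapping is inferred from the 0*-9* step scheme:

```python
import re

def extract_step(hook_name: str) -> int:
    """Step = tens digit of the hook's two-digit run order (00-99 -> steps 0-9)."""
    match = re.search(r'__(\d{2})_', hook_name)
    if not match:
        return 9  # hooks without a __XX_ number run in the last step
    return int(match.group(1)) // 10

def is_background_hook(hook_name: str) -> bool:
    """Background hooks are marked with .bg. in the filename."""
    return '.bg.' in hook_name

print(extract_step('on_Snapshot__20_chrome_tab.bg.js'))     # 2
print(is_background_hook('on_Snapshot__50_screenshot.js'))  # False
```

Because the step is recomputed from the filename on every check, renumbering a hook never requires a database migration.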
Naming Format:
on_{ModelName}__{run_order}_{human_readable_description}[.bg].{ext}
Examples:
on_Snapshot__00_this_would_run_first.sh
on_Snapshot__05_start_ytdlp_download.bg.sh
on_Snapshot__10_chrome_tab_opened.js
on_Snapshot__50_screenshot.js
on_Snapshot__53_media.bg.py
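For illustration, the whole naming format can be captured with one regex. This is a sketch; the group names here are my own, not ArchiveBox API:

```python
import re

# Hypothetical parser for on_{ModelName}__{run_order}_{description}[.bg].{ext}
HOOK_NAME_RE = re.compile(
    r'^on_(?P<model>[A-Za-z]+)'   # e.g. Snapshot
    r'__(?P<order>\d{2})'         # two-digit run order, 00-99
    r'_(?P<desc>.+?)'             # human-readable description
    r'(?P<bg>\.bg)?'              # optional background marker
    r'\.(?P<ext>\w+)$'            # script extension
)

m = HOOK_NAME_RE.match('on_Snapshot__05_start_ytdlp_download.bg.sh')
print(m.group('model'), m.group('order'), m.group('bg'), m.group('ext'))
# Snapshot 05 .bg sh
```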
Background hooks get their full PLUGINNAME_TIMEOUT and are only force-killed when the Snapshot seals (via Snapshot.cleanup()). Important: A .bg script started in step 2 can keep running through steps 3, 4, 5... until the Snapshot seals or the hook exits naturally.
These are naming conventions and guidelines, not enforced checkpoints. They provide semantic organization for plugin ordering:
00-09: Initial setup, validation, feature detection
10-19: Browser/tab lifecycle setup
- Chrome browser launch
- Tab creation and CDP connection
20-29: Page loading and settling
- Navigate to URL
- Wait for page load
- Initial response capture (responses, ssl, consolelog as .bg listeners)
30-39: DOM manipulation before archiving
- Hide popups/banners
- Solve captchas
- Expand comments/details sections
- Inject custom CSS/JS
- Accessibility modifications
40-49: Final pre-archiving checks
- Verify page is fully adjusted
- Wait for any pending modifications
50-59: Extractors that need exclusive DOM access
- singlefile (MUST NOT be .bg)
- screenshot (MUST NOT be .bg)
- pdf (MUST NOT be .bg)
- dom (MUST NOT be .bg)
- title
- headers
- readability
- mercury
These MUST run sequentially as they temporarily modify the DOM
during extraction, then revert it. Running in parallel would corrupt results.
60-69: Extractors that don't need DOM or run on downloaded files
- wget
- git
- media (.bg - can run for hours)
- gallerydl (.bg)
- forumdl (.bg)
- papersdl (.bg)
70-79: Browser/tab teardown
- Close tabs
- Cleanup Chrome resources
80-89: Reprocess outputs from earlier extractors
- OCR of images
- Audio/video transcription
- URL parsing from downloaded content (rss, html, json, txt, csv, md)
- LLM analysis/summarization of outputs
90-99: Save to indexes and finalize
- Index text content to Sonic/SQLite FTS
- Create symlinks
- Generate merkle trees
- Final status updates
Hooks receive configuration as CLI flags (CSV or JSON-encoded):
--url="https://example.com"
--snapshot-id="1234-5678-uuid"
--config='{"some_key": "some_value"}'
--plugins=git,media,favicon,title
--timeout=50
--enable-something
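A hook written in Python might parse those flags like this. The flag names come from the examples above; the argparse wiring itself is an assumption, not ArchiveBox's actual helper:

```python
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument('--url')
parser.add_argument('--snapshot-id', dest='snapshot_id')
parser.add_argument('--config', type=json.loads, default={})               # JSON-encoded
parser.add_argument('--plugins', type=lambda s: s.split(','), default=[])  # CSV
parser.add_argument('--timeout', type=int, default=60)
parser.add_argument('--enable-something', action='store_true')

args = parser.parse_args([
    '--url=https://example.com',
    '--snapshot-id=1234-5678-uuid',
    '--config={"some_key": "some_value"}',
    '--plugins=git,media,favicon,title',
    '--timeout=50',
    '--enable-something',
])
print(args.config['some_key'], args.plugins[0], args.timeout)  # some_value git 50
```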
All configuration comes from env vars, defined in plugin_dir/config.json JSONSchema:
WGET_BINARY=/usr/bin/wget
WGET_TIMEOUT=60
WGET_USER_AGENT="Mozilla/5.0..."
WGET_EXTRA_ARGS="--no-check-certificate"
SAVE_WGET=True
Required: Every plugin must support PLUGINNAME_TIMEOUT for self-termination.
Hooks read/write files to:
- $CWD: Their own output subdirectory (e.g., archive/snapshots/{id}/wget/)
- $CWD/..: Parent directory (to read outputs from other hooks)

This allows hooks to build on the outputs of hooks that ran before them.
Hooks emit one JSONL line per database record they want to create or update:
{"type": "Tag", "name": "sci-fi"}
{"type": "ArchiveResult", "id": "1234-uuid", "status": "succeeded", "output_str": "wget/index.html"}
{"type": "Snapshot", "id": "5678-uuid", "title": "Example Page"}
See archivebox/misc/jsonl.py and model from_json() / from_jsonl() methods for full list of supported types and fields.
Hooks should emit human-readable output or debug info to stderr. There are no guarantees this will be persisted long-term. Use stdout JSONL or filesystem for outputs that matter.
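Since stderr carries debug noise and stdout carries JSONL, the consuming side can tolerate stray output by only parsing lines that look like JSON objects. This is a sketch, not ArchiveBox's actual parser (see archivebox/misc/jsonl.py for that):

```python
import json

def parse_jsonl_output(stdout_text: str) -> list[dict]:
    """Collect JSON object lines from hook stdout, skipping anything else."""
    records = []
    for line in stdout_text.splitlines():
        line = line.strip()
        if not line.startswith('{'):
            continue  # not a JSONL record, ignore
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # malformed line, ignore
    return records

out = '{"type": "Tag", "name": "sci-fi"}\nsome stray log line\n'
print(parse_jsonl_output(out))  # [{'type': 'Tag', 'name': 'sci-fi'}]
```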
If hooks emit no meaningful long-term outputs, they should delete any temporary files themselves to avoid wasting space. However, the ArchiveResult DB row should be kept so we know the hook ran, when it ran, and what the outcome was.
Hooks are expected to listen for polite SIGINT/SIGTERM, finish up quickly, and exit cleanly. Beyond that, they may be SIGKILL'd at ArchiveBox's discretion.
If hooks double-fork or spawn long-running processes: They must output a .pid file in their directory so zombies can be swept safely.
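A hook that spawns a detached child could satisfy both requirements like this. This is a sketch; the exact .pid filename convention beyond "a .pid file in their directory" is an assumption:

```python
import signal
import subprocess
import sys
from pathlib import Path

def write_pid_file(child: subprocess.Popen, outdir: Path = Path('.')) -> Path:
    """Record a spawned child's PID so a sweeper can clean up zombies later."""
    pidfile = outdir / f'{child.pid}.pid'
    pidfile.write_text(str(child.pid))
    return pidfile

def handle_signal(signum, frame):
    # Polite SIGINT/SIGTERM: finish up quickly and exit cleanly
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_signal)
signal.signal(signal.SIGINT, handle_signal)
```

The sweeper side can then iterate over *.pid files, SIGTERM any still-running PIDs, and unlink the files.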
Hooks can fail in several ways. ArchiveBox handles each differently:
Exit: 0 (success)
JSONL: {"type": "ArchiveResult", "status": "failed", "output_str": "404 Not Found"}
This means: "I ran successfully, but the resource wasn't available." Don't retry this.
Use cases:
Exit: Non-zero (1, 2, etc.)
JSONL: None (or incomplete)
This means: "Something went wrong, I couldn't complete." Treat this ArchiveResult as "missing" and set retry_at for later.
Use cases:
Behavior:
- Set retry_at on the ArchiveResult
- Retried later by archivebox update

Exit: Non-zero
JSONL: Partial records emitted before crash
Behavior:
- Apply any valid records emitted before the crash
- Set retry_at for a later retry

Exit: 0
JSONL: {"type": "ArchiveResult", "status": "succeeded", "output_str": "output/file.html"}
This is the happy path.
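The four outcomes above boil down to a small decision table. A sketch of how the orchestrator might classify a finished hook; the function name and shape are my own, not ArchiveBox internals:

```python
def classify_hook_result(exit_code: int, records: list[dict]) -> str:
    """Map (exit code, emitted JSONL records) to an outcome."""
    ar = next((r for r in records if r.get('type') == 'ArchiveResult'), None)
    if exit_code == 0 and ar and ar.get('status') == 'succeeded':
        return 'succeeded'  # happy path
    if exit_code == 0 and ar and ar.get('status') == 'failed':
        return 'failed'     # permanent failure (e.g. 404), don't retry
    return 'retry'          # crash, timeout, or missing record: set retry_at

print(classify_hook_result(0, [{'type': 'ArchiveResult', 'status': 'failed'}]))  # failed
print(classify_hook_result(1, []))  # retry
```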
archivebox/plugins/{plugin_name}/
├── config.json # JSONSchema: env var config options
├── binaries.jsonl # Runtime dependencies: apt|brew|pip|npm|env
├── on_Snapshot__XX_name.py # Hook script (foreground)
├── on_Snapshot__XX_name.bg.py # Hook script (background)
└── tests/
└── test_name.py
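Given that layout, discovering hooks is just a glob over the plugin directories. A sketch; ArchiveBox's real discovery logic may differ:

```python
from pathlib import Path

def discover_hooks(plugins_dir: Path, model: str = 'Snapshot') -> list[Path]:
    """Find all on_{Model}__* hook scripts across plugins, in launch order."""
    hooks: list[Path] = []
    for plugin_dir in Path(plugins_dir).iterdir():
        if plugin_dir.is_dir():
            hooks.extend(plugin_dir.glob(f'on_{model}__*'))
    # Launch order is strictly alphabetical by filename
    return sorted(hooks, key=lambda p: p.name)
```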
Data model and utilities:
- Snapshot.current_step (IntegerField 0-9, default=0)
- ArchiveResult.hook_name (CharField, nullable) - just the filename
- Migration: 0034_snapshot_current_step.py
- extract_step(hook_name) utility in archivebox/hooks.py - parses the __XX_ pattern
- is_background_hook(hook_name) utility in archivebox/hooks.py - checks for .bg. in the filename

Core logic changes:
- Snapshot.create_pending_archiveresults() in archivebox/core/models.py: create one ArchiveResult per hook, with hook_name set
- ArchiveResult.run() in archivebox/core/models.py: hook_name set means run that single hook; hook_name None means discover all plugin hooks (existing behavior)
- Snapshot.advance_step_if_ready() method: increments current_step if ready
- SnapshotMachine.is_finished() in archivebox/core/statemachines.py: call advance_step_if_ready() before checking if done
- archivebox/workers/worker.py: only run ArchiveResults where extract_step(ar.hook_name) <= snapshot.current_step
- Add the .bg suffix to long-running hooks (media, gallerydl, forumdl, papersdl)

No special migration needed:
- Existing ArchiveResults with hook_name=None continue to work (discover all plugin hooks at runtime)
- New ArchiveResults get hook_name set (single hook per AR)
- ArchiveResult.run() handles both cases naturally

Completed Renames:
# Step 5: DOM Extraction (sequential, non-background)
singlefile/on_Snapshot__37_singlefile.py → singlefile/on_Snapshot__50_singlefile.py ✅
screenshot/on_Snapshot__34_screenshot.js → screenshot/on_Snapshot__51_screenshot.js ✅
pdf/on_Snapshot__35_pdf.js → pdf/on_Snapshot__52_pdf.js ✅
dom/on_Snapshot__36_dom.js → dom/on_Snapshot__53_dom.js ✅
title/on_Snapshot__32_title.js → title/on_Snapshot__54_title.js ✅
readability/on_Snapshot__52_readability.py → readability/on_Snapshot__55_readability.py ✅
headers/on_Snapshot__33_headers.js → headers/on_Snapshot__55_headers.js ✅
mercury/on_Snapshot__53_mercury.py → mercury/on_Snapshot__56_mercury.py ✅
htmltotext/on_Snapshot__54_htmltotext.py → htmltotext/on_Snapshot__57_htmltotext.py ✅
# Step 6: Post-DOM Extraction (background for long-running)
wget/on_Snapshot__50_wget.py → wget/on_Snapshot__61_wget.py ✅
git/on_Snapshot__12_git.py → git/on_Snapshot__62_git.py ✅
media/on_Snapshot__51_media.py → media/on_Snapshot__63_media.bg.py ✅
gallerydl/on_Snapshot__52_gallerydl.py → gallerydl/on_Snapshot__64_gallerydl.bg.py ✅
forumdl/on_Snapshot__53_forumdl.py → forumdl/on_Snapshot__65_forumdl.bg.py ✅
papersdl/on_Snapshot__54_papersdl.py → papersdl/on_Snapshot__66_papersdl.bg.py ✅
# Step 7: URL Extraction (parse_* hooks moved from step 6)
parse_html_urls/on_Snapshot__60_parse_html_urls.py → parse_html_urls/on_Snapshot__70_parse_html_urls.py ✅
parse_txt_urls/on_Snapshot__62_parse_txt_urls.py → parse_txt_urls/on_Snapshot__71_parse_txt_urls.py ✅
parse_rss_urls/on_Snapshot__61_parse_rss_urls.py → parse_rss_urls/on_Snapshot__72_parse_rss_urls.py ✅
parse_netscape_urls/on_Snapshot__63_parse_netscape_urls.py → parse_netscape_urls/on_Snapshot__73_parse_netscape_urls.py ✅
parse_jsonl_urls/on_Snapshot__64_parse_jsonl_urls.py → parse_jsonl_urls/on_Snapshot__74_parse_jsonl_urls.py ✅
parse_dom_outlinks/on_Snapshot__40_parse_dom_outlinks.js → parse_dom_outlinks/on_Snapshot__75_parse_dom_outlinks.js ✅
Q: Should hooks coordinate with each other through ArchiveBox?
A: No. Keep plugins decoupled. Let them use simple filesystem coordination if needed.

Q: Won't keeping an ArchiveResult row for every hook run bloat the database?
A: We can delete old successful ArchiveResults periodically, or archive them to cold storage. The important data is in the filesystem outputs.

Q: Can a .bg hook exit before the Snapshot seals?
A: Yes. The .bg suffix means "don't block step progression," not "run until step 99." Natural exit is the best case.
#!/usr/bin/env python3
# archivebox/plugins/screenshot/on_Snapshot__51_screenshot.py
# Runs at step 5, blocks step progression until complete
# Gets killed if it exceeds SCREENSHOT_TIMEOUT
import json
import subprocess
import sys

# get_env_int() and cmd (the browser screenshot command) are defined
# elsewhere in the plugin; elided here for brevity.

timeout = get_env_int('SCREENSHOT_TIMEOUT') or get_env_int('TIMEOUT', 60)
try:
    result = subprocess.run(cmd, capture_output=True, timeout=timeout)
    if result.returncode == 0:
        print(json.dumps({
            "type": "ArchiveResult",
            "status": "succeeded",
            "output_str": "screenshot.png",
        }))
        sys.exit(0)
    else:
        # Temporary failure - will be retried
        sys.exit(1)
except subprocess.TimeoutExpired:
    # Timeout - will be retried
    sys.exit(1)
#!/usr/bin/env python3
# archivebox/plugins/ytdlp/on_Snapshot__63_ytdlp.bg.py
# Runs at step 6, doesn't block step progression
# Gets full YTDLP_TIMEOUT (e.g., 3600s) regardless of when step 99 completes
import json
import subprocess
import sys

# get_env_int() and url (from the --url CLI flag) are defined above; elided here.

timeout = get_env_int('YTDLP_TIMEOUT') or get_env_int('TIMEOUT', 3600)
try:
    result = subprocess.run(['yt-dlp', url], capture_output=True, timeout=timeout)
    if result.returncode == 0:
        print(json.dumps({
            "type": "ArchiveResult",
            "status": "succeeded",
            "output_str": "media/",
        }))
        sys.exit(0)
    else:
        # Hard failure - don't retry
        print(json.dumps({
            "type": "ArchiveResult",
            "status": "failed",
            "output_str": "Video unavailable",
        }))
        sys.exit(0)  # Exit 0 to record the failure
except subprocess.TimeoutExpired:
    # Timeout - will be retried
    sys.exit(1)
#!/usr/bin/env node
// archivebox/plugins/ssl/on_Snapshot__23_ssl.bg.js
// Sets up a listener, captures SSL info, then exits naturally
// No SIGTERM handler needed - already exits when done
const fs = require('fs');
// connectToChrome() and waitForNavigation() are plugin helpers, elided here.

async function main() {
    const page = await connectToChrome();
    // Set up listener
    page.on('response', async (response) => {
        const securityDetails = response.securityDetails();
        if (securityDetails) {
            fs.writeFileSync('ssl.json', JSON.stringify(securityDetails));
        }
    });
    // Wait for navigation (done by another hook)
    await waitForNavigation();
    // Emit result
    console.log(JSON.stringify({
        type: 'ArchiveResult',
        status: 'succeeded',
        output_str: 'ssl.json',
    }));
    process.exit(0); // Natural exit - don't await indefinitely
}

main().catch(e => {
    console.error(`ERROR: ${e.message}`);
    process.exit(1); // Will be retried
});
This plan provides deterministic step ordering for hooks, one ArchiveResult row per hook for clear bookkeeping, and a controlled lifecycle for long-running background scripts.
The main implementation work is refactoring Snapshot.run() to enforce step ordering and manage .bg script lifecycles. Plugin renumbering is straightforward mechanical work.