CRITICAL: All hooks must follow this unified architecture. This pattern applies to ALL models: Crawl, Dependency, Snapshot, ArchiveResult, etc.
1. Model.run() discovers and executes hooks
2. Hooks emit JSONL to stdout
3. Model.run() parses JSONL and creates DB records
4. New DB records trigger their own Model.run()
5. Cycle repeats
Example Flow:
Crawl.run()
→ runs on_Crawl__* hooks
→ hooks emit JSONL: {type: 'Dependency', bin_name: 'wget', ...}
→ Crawl.run() creates Dependency record in DB
→ Dependency.run() is called automatically
→ runs on_Dependency__* hooks
→ hooks emit JSONL: {type: 'Binary', name: 'wget', ...}
→ Dependency.run() creates Binary record in DB
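Under stated assumptions (the real hook discovery and ORM calls are replaced by stubs, and `MODEL_REGISTRY`/`process_hook_stdout` are illustrative names, not the real API), the cycle can be sketched as:

```python
import json

class Dependency:
    """Stand-in for the Django model - real code would use Dependency.objects.create()."""
    created = []

    def __init__(self, **fields):
        self.fields = fields
        Dependency.created.append(self)

    def run(self):
        # Step 4: the new record would now discover and run its own on_Dependency__* hooks
        pass

MODEL_REGISTRY = {'Dependency': Dependency}  # illustrative; real code dispatches on 'type'

def process_hook_stdout(stdout: str) -> None:
    """Steps 2-5: parse JSONL emitted by hooks, create records, trigger their run()."""
    for line in stdout.splitlines():
        line = line.strip()
        if not line.startswith('{'):
            continue  # ignore non-JSONL log lines
        obj = json.loads(line)
        model = MODEL_REGISTRY.get(obj.pop('type', None))
        if model:
            model(**obj).run()

process_hook_stdout('{"type": "Dependency", "bin_name": "wget", "bin_providers": "apt,brew"}')
```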
Model.run() executes hooks directly - No helper methods in statemachines. Statemachine just calls Model.run().
Hooks emit JSONL - Any line starting with { that has a type field creates/updates that model.
print(json.dumps({'type': 'Dependency', 'bin_name': 'wget', ...}))
print(json.dumps({'type': 'Binary', 'name': 'wget', ...}))
JSONL fields = Model fields - JSONL keys must match Django model field names exactly. No transformation.
# ✅ CORRECT - matches Dependency model
{'type': 'Dependency', 'bin_name': 'wget', 'bin_providers': 'apt,brew', 'overrides': {...}}
# ❌ WRONG - uses different field names
{'type': 'Dependency', 'name': 'wget', 'providers': 'apt,brew', 'custom_cmds': {...}}
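One cheap way to enforce this rule is to reject records whose keys aren't model fields before creating them. This is a sketch: `validate_jsonl_fields` and the hardcoded field set are illustrative; in real code the allowed set would come from Django's `Model._meta.get_fields()`.

```python
def validate_jsonl_fields(obj: dict, allowed_fields: set) -> dict:
    """Strip 'type' and reject JSONL records whose keys don't match the model's fields."""
    record = {k: v for k, v in obj.items() if k != 'type'}
    unknown = set(record) - allowed_fields
    if unknown:
        raise ValueError(f"JSONL keys {sorted(unknown)} don't match model fields")
    return record

# Illustrative subset of Dependency model field names
DEPENDENCY_FIELDS = {'bin_name', 'bin_providers', 'overrides'}

validate_jsonl_fields(
    {'type': 'Dependency', 'bin_name': 'wget', 'bin_providers': 'apt,brew'},
    DEPENDENCY_FIELDS,
)
```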
No hardcoding - Never hardcode binary names, provider names, or anything else. Use discovery.
# ✅ CORRECT - discovers all on_Dependency hooks dynamically
run_hooks(event_name='Dependency', ...)
# ❌ WRONG - hardcodes provider list
for provider in ['pip', 'npm', 'apt', 'brew']:
run_hooks(event_name=f'Dependency__install_using_{provider}_provider', ...)
Trust abx-pkg - Never use shutil.which(), subprocess.run([bin, '--version']), or manual hash calculation.
# ✅ CORRECT - abx-pkg handles everything
from abx_pkg import Binary, PipProvider, EnvProvider
binary = Binary(name='wget', binproviders=[PipProvider(), EnvProvider()]).load()
# binary.abspath, binary.version, binary.sha256 are all populated automatically
# ❌ WRONG - manual detection
abspath = shutil.which('wget')
version = subprocess.run(['wget', '--version'], ...).stdout
Hooks check if they can handle requests - Each hook decides internally if it can handle the dependency.
# In on_Dependency__install_using_pip_provider.py
if bin_providers != '*' and 'pip' not in bin_providers.split(','):
sys.exit(0) # Can't handle this, exit cleanly
Minimal transformation - Statemachine/Model.run() should do minimal JSONL parsing, just create records.
# ✅ CORRECT - simple JSONL parsing
obj = json.loads(line)
if obj.pop('type', None) == 'Dependency':
    Dependency.objects.create(**obj)  # pop 'type' first so remaining keys match model fields
# ❌ WRONG - complex transformation logic
if obj.get('type') == 'Dependency':
dep = Dependency.objects.create(name=obj['bin_name']) # renaming fields
dep.custom_commands = transform_overrides(obj['overrides']) # transforming data
Follow the same pattern as ArchiveResult.run() (archivebox/core/models.py:1030):
def run(self):
"""Execute this Model by running hooks and processing JSONL output."""
# 1. Discover hooks
hook = discover_hook_for_model(self)
# 2. Run hook
results = run_hook(hook, output_dir=..., ...)
# 3. Parse JSONL and update self
for line in results['stdout'].splitlines():
obj = json.loads(line)
if obj.get('type') == self.__class__.__name__:
self.status = obj.get('status')
self.output = obj.get('output')
# ... apply other fields
# 4. Create side-effect records
for line in results['stdout'].splitlines():
obj = json.loads(line)
if obj.get('type') != self.__class__.__name__:
create_record_from_jsonl(obj) # Creates Binary, etc.
self.save()
Purpose: Check if binary exists, emit Dependency if not found.
#!/usr/bin/env python3
import sys
import json
def find_wget() -> dict | None:
"""Find wget binary using abx-pkg."""
try:
from abx_pkg import Binary, AptProvider, BrewProvider, EnvProvider
binary = Binary(name='wget', binproviders=[AptProvider(), BrewProvider(), EnvProvider()])
loaded = binary.load()
if loaded and loaded.abspath:
return {
'name': 'wget',
'abspath': str(loaded.abspath),
'version': str(loaded.version) if loaded.version else None,
'sha256': loaded.sha256 if hasattr(loaded, 'sha256') else None,
'binprovider': loaded.binprovider.name if loaded.binprovider else 'env',
}
except Exception:
pass
return None
def main():
result = find_wget()
if result and result.get('abspath'):
# Binary found - emit Binary and Machine config
print(json.dumps({
'type': 'Binary',
'name': result['name'],
'abspath': result['abspath'],
'version': result['version'],
'sha256': result['sha256'],
'binprovider': result['binprovider'],
}))
print(json.dumps({
'type': 'Machine',
'_method': 'update',
'key': 'config/WGET_BINARY',
'value': result['abspath'],
}))
sys.exit(0)
else:
# Binary not found - emit Dependency
print(json.dumps({
'type': 'Dependency',
'bin_name': 'wget',
'bin_providers': 'apt,brew,env',
'overrides': {}, # Empty if no special install requirements
}))
print("wget binary not found", file=sys.stderr)
sys.exit(1)
if __name__ == '__main__':
main()
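The wget hook above emits empty `overrides`; a dependency that does need custom install packages would emit something like the following (a sketch - the yt-dlp binary and package names are illustrative, not from this plan):

```python
import json

dependency = {
    'type': 'Dependency',
    'bin_name': 'yt-dlp',           # hypothetical example binary
    'bin_providers': 'pip,brew,env',
    # 'overrides' uses the abx-pkg per-provider format
    'overrides': {
        'pip': {'packages': ['yt-dlp[default]']},
        'brew': {'packages': ['yt-dlp']},
    },
}
print(json.dumps(dependency))
```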
Rules:
- Use Binary(...).load() from abx-pkg - handles finding binary, version, hash automatically
- Emit Binary JSONL if found
- Emit Dependency JSONL if not found
- overrides field must match the abx-pkg format: {'pip': {'packages': ['pkg']}, 'apt': {'packages': ['pkg']}}
- Never use shutil.which(), subprocess.run(), manual version detection, or hash calculation

Purpose: Install binary if not already installed.
#!/usr/bin/env python3
import json
import sys
import rich_click as click
from abx_pkg import Binary, PipProvider
@click.command()
@click.option('--dependency-id', required=True)
@click.option('--bin-name', required=True)
@click.option('--bin-providers', default='*')
@click.option('--overrides', default=None, help="JSON-encoded overrides dict")
def main(dependency_id: str, bin_name: str, bin_providers: str, overrides: str | None):
"""Install binary using pip."""
# Check if this hook can handle this dependency
if bin_providers != '*' and 'pip' not in bin_providers.split(','):
click.echo(f"pip provider not allowed for {bin_name}", err=True)
sys.exit(0) # Exit cleanly - not an error, just can't handle
# Parse overrides
overrides_dict = None
if overrides:
try:
full_overrides = json.loads(overrides)
overrides_dict = full_overrides.get('pip', {}) # Extract pip section
except json.JSONDecodeError:
pass
# Install using abx-pkg
provider = PipProvider()
try:
binary = Binary(name=bin_name, binproviders=[provider], overrides=overrides_dict or {}).install()
except Exception as e:
click.echo(f"pip install failed: {e}", err=True)
sys.exit(1)
if not binary.abspath:
sys.exit(1)
# Emit Binary JSONL
print(json.dumps({
'type': 'Binary',
'name': bin_name,
'abspath': str(binary.abspath),
'version': str(binary.version) if binary.version else '',
'sha256': binary.sha256 or '',
'binprovider': 'pip',
'dependency_id': dependency_id,
}))
sys.exit(0)
if __name__ == '__main__':
main()
Rules:
- Check the bin_providers parameter - exit cleanly (code 0) if the hook can't handle it
- Accept the overrides parameter as a full dict, extract your provider's section
- Use Binary(...).install() from abx-pkg - handles the actual installation
- Emit Binary JSONL on success

class Dependency(models.Model):
def run(self):
"""Execute dependency installation by running all on_Dependency hooks."""
import json
from pathlib import Path
from django.conf import settings
# Check if already installed
if self.is_installed:
return self.binaries.first()
from archivebox.hooks import run_hooks
# Create output directory
DATA_DIR = getattr(settings, 'DATA_DIR', Path.cwd())
output_dir = Path(DATA_DIR) / 'tmp' / f'dependency_{self.id}'
output_dir.mkdir(parents=True, exist_ok=True)
# Build kwargs for hooks
hook_kwargs = {
'dependency_id': str(self.id),
'bin_name': self.bin_name,
'bin_providers': self.bin_providers,
'overrides': json.dumps(self.overrides) if self.overrides else None,
}
# Run ALL on_Dependency hooks - each decides if it can handle this
results = run_hooks(
event_name='Dependency',
output_dir=output_dir,
timeout=600,
**hook_kwargs
)
# Process results - parse JSONL and create Binary records
for result in results:
if result['returncode'] != 0:
continue
for line in result['stdout'].strip().split('\n'):
if not line.strip():
continue
try:
obj = json.loads(line)
if obj.get('type') == 'Binary':
# Create Binary record - fields match JSONL exactly
if not obj.get('name') or not obj.get('abspath') or not obj.get('version'):
continue
machine = Machine.current()
binary, _ = Binary.objects.update_or_create(
machine=machine,
name=obj['name'],
defaults={
'abspath': obj['abspath'],
'version': obj['version'],
'sha256': obj.get('sha256') or '',
'binprovider': obj.get('binprovider') or 'env',
'dependency': self,
}
)
if self.is_installed:
return binary
except json.JSONDecodeError:
continue
return None
Rules:
- Call run_hooks(event_name='ModelName', ...) with the model name

This plan implements support for long-running background hooks that run concurrently with other extractors, while maintaining proper result collection, cleanup, and state management.
Key Changes:
- Background hooks are marked with a .bg.js/.bg.py/.bg.sh suffix
- Hooks emit JSONL ({type: 'ModelName', ...})
- run_hook() is generic - just parses JSONL, doesn't know about specific models
- Model.run() extends records of its own type with computed fields (output_files, output_size, etc.)
- The old output field is split into output_str (human-readable) and output_json (structured)
- New computed fields: output_files (dict), output_size (bytes), output_mimetypes (CSV)

New ArchiveResult Fields:
# Output fields (replace old 'output' field)
output_str = TextField() # Human-readable summary: "Downloaded 5 files"
output_json = JSONField() # Structured metadata (headers, redirects, etc.)
output_files = JSONField() # Dict: {'index.html': {}, 'style.css': {}}
output_size = BigIntegerField() # Total bytes across all files
output_mimetypes = CharField() # CSV sorted by size: "text/html,text/css,image/png"
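For illustration, `output_size` and `output_mimetypes` can be derived from the per-file sizes the runner collects, roughly like this (a sketch; the file list is hypothetical):

```python
from collections import defaultdict

# Hypothetical (relative_path, mimetype, size_bytes) tuples collected by the runner
files = [
    ('index.html', 'text/html', 4096),
    ('style.css', 'text/css', 1024),
    ('images/logo.png', 'image/png', 2048),
]

# Aggregate bytes per mimetype
mime_sizes = defaultdict(int)
for _path, mime, size in files:
    mime_sizes[mime] += size

output_size = sum(size for _path, _mime, size in files)

# CSV of mimetypes sorted by total size, descending
output_mimetypes = ','.join(
    mime for mime, _ in sorted(mime_sizes.items(), key=lambda kv: kv[1], reverse=True)
)
```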
output_files Structure:
- Values are {} for now, extensible for future metadata
- Queryable via ArchiveResult.objects.filter(output_files__has_key='index.html')
- Can add size, hash, mime_type to values later without a migration

# archivebox/core/migrations/00XX_archiveresult_background_hooks.py
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('core', 'XXXX_previous_migration'),
('machine', 'XXXX_latest_machine_migration'),
]
operations = [
# Add new fields (keep old 'output' temporarily for migration)
migrations.AddField(
model_name='archiveresult',
name='output_str',
field=models.TextField(
blank=True,
help_text='Human-readable output summary (e.g., "Downloaded 5 files")'
),
),
migrations.AddField(
model_name='archiveresult',
name='output_json',
field=models.JSONField(
null=True,
blank=True,
help_text='Structured metadata (headers, redirects, etc.) - should NOT duplicate ArchiveResult fields'
),
),
migrations.AddField(
model_name='archiveresult',
name='output_files',
field=models.JSONField(
default=dict,
help_text='Dict of {relative_path: {metadata}} - values are empty dicts for now, extensible for future metadata'
),
),
migrations.AddField(
model_name='archiveresult',
name='output_size',
field=models.BigIntegerField(
default=0,
help_text='Total recursive size in bytes of all output files'
),
),
migrations.AddField(
model_name='archiveresult',
name='output_mimetypes',
field=models.CharField(
max_length=512,
blank=True,
help_text='CSV of mimetypes sorted by size descending'
),
),
# Add binary FK (optional)
migrations.AddField(
model_name='archiveresult',
name='binary',
field=models.ForeignKey(
'machine.Binary',
on_delete=models.SET_NULL,
null=True,
blank=True,
help_text='Primary binary used by this hook (optional)'
),
),
]
Data migration for the old .output field:

# archivebox/core/migrations/00XX_migrate_output_field.py
from django.db import migrations
import json
def migrate_output_field(apps, schema_editor):
"""
Migrate existing 'output' field to new split fields.
Logic:
- If output contains JSON {...}, move to output_json
- If output is a file path and exists in output_files, ensure it's first
- Otherwise, move to output_str
"""
ArchiveResult = apps.get_model('core', 'ArchiveResult')
for ar in ArchiveResult.objects.all():
old_output = ar.output or ''
# Case 1: JSON output
if old_output.strip().startswith('{'):
try:
parsed = json.loads(old_output)
ar.output_json = parsed
ar.output_str = ''
except json.JSONDecodeError:
# Not valid JSON, treat as string
ar.output_str = old_output
# Case 2: File path (check if it looks like a relative path)
elif '/' in old_output or '.' in old_output:
# Might be a file path - if it's in output_files, it's already there
# output_files is now a dict, so no reordering needed
ar.output_str = old_output # Keep as string for display
# Case 3: Plain string summary
else:
ar.output_str = old_output
ar.save(update_fields=['output_str', 'output_json', 'output_files'])
def reverse_migrate(apps, schema_editor):
"""Reverse migration - copy output_str back to output."""
ArchiveResult = apps.get_model('core', 'ArchiveResult')
for ar in ArchiveResult.objects.all():
ar.output = ar.output_str or ''
ar.save(update_fields=['output'])
class Migration(migrations.Migration):
dependencies = [
('core', '00XX_archiveresult_background_hooks'),
]
operations = [
migrations.RunPython(migrate_output_field, reverse_migrate),
# Now safe to remove old 'output' field
migrations.RemoveField(
model_name='archiveresult',
name='output',
),
]
Contract:
- Hooks emit a single record with type: 'ArchiveResult'
- Hooks MAY include: status, output_str, output_json, cmd (optional)
- Hooks MUST NOT include: output_files, output_size, output_mimetypes (runner calculates these)
- output_json should NOT duplicate ArchiveResult fields (no status, start_ts, etc. in output_json)
- Runner computes: output_files, output_size, output_mimetypes, start_ts, end_ts, binary FK

Example outputs:
// Simple string output
console.log(JSON.stringify({
type: 'ArchiveResult',
output_str: 'This is the page title',
}));
// With structured metadata and optional fields (headers, redirects, etc.)
console.log(JSON.stringify({
type: 'ArchiveResult',
status: 'succeeded',
output_str: 'Got https://example.com headers',
output_json: {'content-type': 'text/html', 'server': 'nginx', 'status-code': 200, 'content-length': 234235},
}));
// With explicit cmd (cmd first arg should match Binary.bin_abspath or XYZ_BINARY env var so ArchiveResult.run() can FK to the Binary)
console.log(JSON.stringify({
type: 'ArchiveResult',
status: 'succeeded',
output_str: 'Archived with wget',
cmd: ['/some/abspath/to/wget', '-p', '-k', 'https://example.com']
}));
// BAD: Don't duplicate ArchiveResult fields in output_json
console.log(JSON.stringify({
type: 'ArchiveResult',
status: 'succeeded',
output_json: {
status: 'succeeded', // ❌ BAD - this should be up a level on ArchiveResult.status, not inside output_json
title: 'the page title', // ❌ BAD - if the extractor's main output is just a string then it belongs in output_str
custom_data: 1234, // ✅ GOOD - custom fields only
},
output_files: {'index.html': {}}, // ❌ BAD - runner calculates this for us, no need to return it manually
}));
run_hook() is a generic JSONL parser - it doesn't know about ArchiveResult, Binary, or any specific model. It just:
- Executes the script
- Parses JSONL output (any line starting with { that has a type field)
- Returns a list of dicts
# archivebox/hooks.py
def run_hook(
script: Path,
output_dir: Path,
timeout: int = 300,
config_objects: Optional[List[Any]] = None,
**kwargs: Any
) -> Optional[List[dict]]:
"""
Execute a hook script and parse JSONL output.
This function is generic and doesn't know about specific model types.
It just executes the script and parses any JSONL lines with 'type' field.
Each Model.run() method handles its own record types differently:
- ArchiveResult.run() extends ArchiveResult records with computed fields
- Dependency.run() creates Binary records from hook output
- Crawl.run() can create Dependency records, Snapshots, or Binary records from hook output
Returns:
List of dicts with 'type' field, each extended with metadata:
[
{
'type': 'ArchiveResult',
'status': 'succeeded',
'plugin': 'wget',
'plugin_hook': 'archivebox/plugins/wget/on_Snapshot__21_wget.py',
'output_str': '...',
# ... other hook-reported fields
},
{
'type': 'Binary',
'name': 'wget',
'plugin': 'wget',
'plugin_hook': 'archivebox/plugins/wget/on_Snapshot__21_wget.py',
# ... other hook-reported fields
}
]
None if background hook (still running)
"""
Key Insight: Hooks output JSONL. Any line with {type: 'ModelName', ...} creates/updates that model. The type field determines what gets created. Each Model.run() method decides how to handle records of its own type.
# archivebox/hooks.py
def create_model_record(record: dict) -> Any:
"""
Generic helper to create/update model instances from hook output.
Args:
record: Dict with 'type' field and model data
Returns:
Created/updated model instance
"""
from archivebox.machine.models import Binary, Dependency
model_type = record.pop('type')
if model_type == 'Binary':
obj, created = Binary.objects.get_or_create(**record) # if model requires custom logic implement Binary.from_jsonl(**record)
return obj
elif model_type == 'Dependency':
obj, created = Dependency.objects.get_or_create(**record)
return obj
# ... Snapshot, ArchiveResult, etc. add more types as needed
else:
raise ValueError(f"Unknown record type: {model_type}")
CRITICAL: This phase MUST be done FIRST, before updating core code. Do this manually, one plugin at a time. Do NOT batch-update multiple plugins at once. Do NOT skip any plugins or checks.
Why First? Updating plugins to output clean JSONL before changing core code means the transition is safe and incremental. The current run_hook() can continue to work during the plugin updates.
All plugins should follow a consistent pattern for checking and declaring dependencies.
RENAME ALL HOOKS:
- From: on_Crawl__*_validate_*.{sh,py,js}
- To: on_Crawl__*_install_*.{sh,py,js}

Rationale: "install" is clearer than "validate" for what these hooks actually do.
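The rename can be scripted; this sketch runs against a temp directory so it is safe to execute as-is (in a tracked repo, substitute `git mv` and point the glob at the real plugins directory):

```shell
# Demonstrate the validate -> install rename on throwaway files
tmpdir=$(mktemp -d)
touch "$tmpdir/on_Crawl__01_validate_wget.py" "$tmpdir/on_Crawl__02_validate_ytdlp.sh"

for f in "$tmpdir"/on_Crawl__*_validate_*; do
    mv "$f" "$(echo "$f" | sed 's/_validate_/_install_/')"
done

ls "$tmpdir"
```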
ALL install hooks MUST follow this pattern:
Example Standard Pattern:
#!/usr/bin/env python3
"""
Check for wget binary and emit Dependency if not found.
"""
import os
import sys
import json
from pathlib import Path
def main():
# 1. Get configured binary name/path from env
binary_path = os.environ.get('WGET_BINARY', 'wget')
# 2. Check if Binary exists for this binary
# (In practice, this check happens via database query in the actual implementation)
# For install hooks, we emit a Dependency that the system will process
# 3. Emit Dependency JSONL if needed
# The bin provider will check Binary and install if missing
    dependency = {
        'type': 'Dependency',
        'bin_name': Path(binary_path).name if '/' in binary_path else binary_path,
        'bin_providers': 'apt,brew,pkg',  # CSV in priority order - keys must match Dependency model fields
        'overrides': {},  # abx-pkg overrides format, empty if no special requirements
    }
print(json.dumps(dependency))
return 0
if __name__ == '__main__':
sys.exit(main())
ALL hooks MUST respect user-configured binary paths:
- Read the XYZ_BINARY env var (e.g., WGET_BINARY, YTDLP_BINARY, CHROME_BINARY)
- Support absolute paths: WGET_BINARY=/usr/local/bin/wget2
- Support bare names: WGET_BINARY=wget2
- If WGET_BINARY=wget2, check for wget2, not wget

Example Config Handling:
# Get configured binary (could be path or name)
binary_path = os.environ.get('WGET_BINARY', 'wget')
# Extract just the binary name for Binary lookup
if '/' in binary_path:
# Absolute path: /usr/local/bin/wget2 -> wget2
bin_name = Path(binary_path).name
else:
# Just a name: wget2 -> wget2
bin_name = binary_path
# Now check Binary for bin_name (not hardcoded 'wget')
All on_Snapshot__*.* hooks must follow the output format specified in Phase 2. Key points for implementation:
CRITICAL Legacy Issues to Fix:
- Remove the RESULT_JSON= prefix - old hooks use console.log('RESULT_JSON=' + ...)
- Remove --version calls - hooks should NOT run binary version checks
- Emit plain JSONL instead: console.log(JSON.stringify(result))

Before (WRONG):
console.log(`VERSION=${version}`);
console.log(`START_TS=${startTime.toISOString()}`);
console.log(`RESULT_JSON=${JSON.stringify(result)}`);
After (CORRECT):
console.log(JSON.stringify({type: 'ArchiveResult', status: 'succeeded', output_str: 'Done'}));
See Phase 2 for complete JSONL format specification and examples.
ALL on_Snapshot hooks MUST:
- Use the binary path from the XYZ_BINARY env var

Example:
// ✅ CORRECT - uses env var
const wgetBinary = process.env.WGET_BINARY || 'wget';
const cmd = [wgetBinary, '-p', '-k', url];
// Execute command...
const result = execSync(cmd.join(' '));
// Report cmd in output for binary FK
console.log(JSON.stringify({
type: 'ArchiveResult',
status: 'succeeded',
output_str: 'Downloaded page',
cmd: cmd, // ✅ Includes configured binary
}));
// ❌ WRONG - hardcoded binary name
const cmd = ['wget', '-p', '-k', url]; // Ignores WGET_BINARY
For EACH plugin, verify ALL of these:
- Renamed on_Crawl__*_validate_* to on_Crawl__*_install_*
- Install hook reads the XYZ_BINARY env var and handles both absolute paths + bin names
- Install hook emits {"type": "Dependency", ...} JSONL (uses configured bin_name)
- Snapshot hook reads the XYZ_BINARY env var and uses it in cmd
- Snapshot hook emits clean JSONL (no RESULT_JSON= prefix)
- No --version commands (some hooks still do for compatibility checks)
- Reports cmd array with configured binary path (Python hooks)

MANDATORY PROCESS:
Why one-by-one?
After updating each plugin, verify:
- Hook runs without errors: python3 on_Crawl__01_install_wget.py
- Output is valid JSONL: python3 ... | jq .
- Hook respects the XYZ_BINARY env var
- Every record has a type field: ... | jq .type

When auditing plugins, watch for these common mistakes:
- Hardcoded binary names: Binary.filter(name='wget') → should use the configured name
- Legacy output: RESULT_JSON=, VERSION=, START_TS= lines
- Emitting runner-computed fields: output_files, start_ts, duration in JSONL
- Ignoring XYZ_BINARY env vars
- Running --version command executions

See sections 4.1 and 4.2 for detailed before/after examples.
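A small audit helper can catch several of these mistakes mechanically when run over a hook's captured stdout. This is a sketch (`audit_hook_stdout` is a hypothetical name; the checks mirror the list above):

```python
import json

def audit_hook_stdout(stdout: str) -> list:
    """Return a list of problems found in a hook's stdout."""
    problems = []
    for n, line in enumerate(stdout.splitlines(), 1):
        line = line.strip()
        if not line:
            continue
        # Legacy prefixed output that should have been removed
        if line.startswith(('RESULT_JSON=', 'VERSION=', 'START_TS=')):
            problems.append(f'line {n}: legacy prefixed output')
            continue
        if not line.startswith('{'):
            continue  # plain log lines are tolerated here; prefer stderr for them
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f'line {n}: invalid JSON')
            continue
        if 'type' not in obj:
            problems.append(f'line {n}: JSONL record missing "type"')
        # Computed fields belong to the runner, not the hook
        for banned in ('output_files', 'output_size', 'output_mimetypes', 'start_ts', 'duration'):
            if banned in obj:
                problems.append(f'line {n}: hook should not emit {banned!r}')
    return problems
```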
Note: Only do this AFTER Phase 4 (plugin standardization) is complete. By then, all plugins will output clean JSONL and this implementation will work smoothly.
# archivebox/hooks.py
def find_binary_for_cmd(cmd: List[str], machine_id: str) -> Optional[str]:
"""
Find Binary for a command, trying abspath first then name.
Only matches binaries on the current machine.
Args:
cmd: Command list (e.g., ['/usr/bin/wget', '-p', 'url'])
machine_id: Current machine ID
Returns:
Binary ID if found, None otherwise
"""
if not cmd:
return None
from archivebox.machine.models import Binary
bin_path_or_name = cmd[0]
# Try matching by absolute path first
binary = Binary.objects.filter(
abspath=bin_path_or_name,
machine_id=machine_id
).first()
if binary:
return str(binary.id)
# Fallback: match by binary name
bin_name = Path(bin_path_or_name).name
binary = Binary.objects.filter(
name=bin_name,
machine_id=machine_id
).first()
return str(binary.id) if binary else None
def run_hook(
script: Path,
output_dir: Path,
timeout: int = 300,
config_objects: Optional[List[Any]] = None,
**kwargs: Any
) -> Optional[List[dict]]:
"""
Execute a hook script and parse JSONL output.
This is a GENERIC function that doesn't know about specific model types.
It just executes and parses JSONL (any line with {type: 'ModelName', ...}).
Runner responsibilities:
- Detect background hooks (.bg. in filename)
- Capture stdout/stderr to log files
- Parse JSONL output and add plugin metadata
- Clean up log files and PID files
Hook responsibilities:
- Emit JSONL: {type: 'ArchiveResult', status, output_str, output_json, cmd}
- Can emit multiple types: {type: 'Binary', ...}
- Write actual output files
Args:
script: Path to hook script
output_dir: Working directory (where output files go)
timeout: Max execution time in seconds
config_objects: Config override objects (Machine, Crawl, Snapshot)
**kwargs: CLI arguments passed to script
Returns:
List of dicts with 'type' field for foreground hooks
None for background hooks (still running)
"""
import time
from datetime import datetime, timezone
from archivebox.machine.models import Machine
start_time = time.time()
# 1. SETUP
is_background = '.bg.' in script.name # Detect .bg.js/.bg.py/.bg.sh
effective_timeout = timeout * 10 if is_background else timeout
# Infrastructure files (ALL hooks)
stdout_file = output_dir / 'stdout.log'
stderr_file = output_dir / 'stderr.log'
pid_file = output_dir / 'hook.pid'
# Capture files before execution
files_before = set(output_dir.rglob('*')) if output_dir.exists() else set()
start_ts = datetime.now(timezone.utc)
# 2. BUILD COMMAND
ext = script.suffix.lower()
if ext == '.sh':
interpreter_cmd = ['bash', str(script)]
elif ext == '.py':
interpreter_cmd = ['python3', str(script)]
elif ext == '.js':
interpreter_cmd = ['node', str(script)]
else:
interpreter_cmd = [str(script)]
# Build CLI arguments from kwargs
cli_args = []
for key, value in kwargs.items():
if key.startswith('_'):
continue
arg_key = f'--{key.replace("_", "-")}'
if isinstance(value, bool):
if value:
cli_args.append(arg_key)
elif value is not None and value != '':
if isinstance(value, (dict, list)):
cli_args.append(f'{arg_key}={json.dumps(value)}')
else:
str_value = str(value).strip()
if str_value:
cli_args.append(f'{arg_key}={str_value}')
full_cmd = interpreter_cmd + cli_args
# 3. SET UP ENVIRONMENT
env = os.environ.copy()
# ... (existing env setup from current run_hook implementation)
# 4. CREATE OUTPUT DIRECTORY
output_dir.mkdir(parents=True, exist_ok=True)
# 5. EXECUTE PROCESS
try:
with open(stdout_file, 'w') as out, open(stderr_file, 'w') as err:
process = subprocess.Popen(
full_cmd,
cwd=str(output_dir),
stdout=out,
stderr=err,
env=env,
)
# Write PID for all hooks
pid_file.write_text(str(process.pid))
if is_background:
# Background hook - return immediately, don't wait
return None
# Foreground hook - wait for completion
try:
returncode = process.wait(timeout=effective_timeout)
except subprocess.TimeoutExpired:
process.kill()
process.wait()
returncode = -1
with open(stderr_file, 'a') as err:
err.write(f'\nHook timed out after {effective_timeout}s')
# 6. COLLECT RESULTS (foreground only)
end_ts = datetime.now(timezone.utc)
stdout = stdout_file.read_text() if stdout_file.exists() else ''
stderr = stderr_file.read_text() if stderr_file.exists() else ''
# Parse ALL JSONL output (any line with {type: 'ModelName', ...})
records = []
for line in stdout.splitlines():
line = line.strip()
if not line or not line.startswith('{'):
continue
try:
data = json.loads(line)
if 'type' in data:
# Add plugin metadata to every record
plugin_name = script.parent.name # Directory name (e.g., 'wget')
data['plugin'] = plugin_name
data['plugin_hook'] = str(script.relative_to(Path.cwd()))
records.append(data)
except json.JSONDecodeError:
continue
# 7. CLEANUP
# Delete empty logs (keep non-empty for debugging)
if stdout_file.exists() and stdout_file.stat().st_size == 0:
stdout_file.unlink()
if stderr_file.exists() and stderr_file.stat().st_size == 0:
stderr_file.unlink()
# Delete ALL .pid files on success
if returncode == 0:
for pf in output_dir.glob('*.pid'):
pf.unlink(missing_ok=True)
# 8. RETURN RECORDS
# Returns list of dicts, each with 'type' field and plugin metadata
return records
    except Exception:
        # On error, return empty list (hook failed, no records created)
        return []
Note: Only do this AFTER Phase 5 (run_hook() implementation) is complete.
# archivebox/core/models.py
def run(self):
"""
Execute this ArchiveResult's extractor and update status.
For foreground hooks: Waits for completion and updates immediately
For background hooks: Returns immediately, leaves status='started'
This method extends any ArchiveResult records from hook output with
computed fields (output_files, output_size, binary FK, etc.).
"""
from django.utils import timezone
from archivebox.hooks import BUILTIN_PLUGINS_DIR, USER_PLUGINS_DIR, run_hook, find_binary_for_cmd, create_model_record
from archivebox.machine.models import Machine
config_objects = [self.snapshot.crawl, self.snapshot] if self.snapshot.crawl else [self.snapshot]
# Find hook for this extractor
hook = None
for base_dir in (BUILTIN_PLUGINS_DIR, USER_PLUGINS_DIR):
if not base_dir.exists():
continue
matches = list(base_dir.glob(f'*/on_Snapshot__{self.extractor}.*'))
if matches:
hook = matches[0]
break
if not hook:
self.status = self.StatusChoices.FAILED
self.output_str = f'No hook found for: {self.extractor}'
self.retry_at = None
self.save()
return
# Use plugin directory name instead of extractor name
plugin_name = hook.parent.name
extractor_dir = Path(self.snapshot.output_dir) / plugin_name
start_ts = timezone.now()
# Run the hook (returns list of JSONL records)
records = run_hook(
hook,
output_dir=extractor_dir,
config_objects=config_objects,
url=self.snapshot.url,
snapshot_id=str(self.snapshot.id),
)
# BACKGROUND HOOK - still running
if records is None:
self.status = self.StatusChoices.STARTED
self.start_ts = start_ts
self.pwd = str(extractor_dir)
self.save()
return
# FOREGROUND HOOK - process records
end_ts = timezone.now()
# Find the ArchiveResult record (enforce single output)
ar_records = [r for r in records if r.get('type') == 'ArchiveResult']
assert len(ar_records) <= 1, f"Hook {hook} output {len(ar_records)} ArchiveResults, expected 0-1"
if ar_records:
hook_data = ar_records[0]
# Apply hook's data
status_str = hook_data.get('status', 'failed')
status_map = {
'succeeded': self.StatusChoices.SUCCEEDED,
'failed': self.StatusChoices.FAILED,
'skipped': self.StatusChoices.SKIPPED,
}
self.status = status_map.get(status_str, self.StatusChoices.FAILED)
self.output_str = hook_data.get('output_str', '')
self.output_json = hook_data.get('output_json')
# Set extractor from plugin metadata
self.extractor = hook_data['plugin']
# Determine binary FK from cmd (ArchiveResult-specific logic)
if 'cmd' in hook_data:
self.cmd = json.dumps(hook_data['cmd'])
machine = Machine.current()
binary_id = find_binary_for_cmd(hook_data['cmd'], machine.id)
if binary_id:
self.binary_id = binary_id
else:
# No ArchiveResult output - hook didn't report, treat as failed
self.status = self.StatusChoices.FAILED
self.output_str = 'Hook did not output ArchiveResult'
# Set timestamps and metadata
self.start_ts = start_ts
self.end_ts = end_ts
self.pwd = str(extractor_dir)
self.retry_at = None
# POPULATE OUTPUT FIELDS FROM FILESYSTEM (ArchiveResult-specific)
if extractor_dir.exists():
self._populate_output_fields(extractor_dir)
self.save()
# Create any side-effect records (Binary, Dependency, etc.)
for record in records:
if record['type'] != 'ArchiveResult':
create_model_record(record) # Generic helper that dispatches by type
# Clean up empty output directory (no real files after excluding logs/pids)
if extractor_dir.exists():
try:
# Check if only infrastructure files remain
remaining_files = [
f for f in extractor_dir.rglob('*')
if f.is_file() and f.name not in ('stdout.log', 'stderr.log', 'hook.pid', 'listener.pid')
]
if not remaining_files:
# Remove infrastructure files
for pf in extractor_dir.glob('*.log'):
pf.unlink(missing_ok=True)
for pf in extractor_dir.glob('*.pid'):
pf.unlink(missing_ok=True)
# Try to remove directory if empty
if not any(extractor_dir.iterdir()):
extractor_dir.rmdir()
except (OSError, RuntimeError):
pass
# Queue discovered URLs, trigger indexing, etc.
self._queue_urls_for_crawl(extractor_dir)
if self.status == self.StatusChoices.SUCCEEDED:
# Update snapshot title if this is title extractor
extractor_name = get_extractor_name(self.extractor)
if extractor_name == 'title':
self._update_snapshot_title(extractor_dir)
# Trigger search indexing
self.trigger_search_indexing()
def _populate_output_fields(self, output_dir: Path) -> None:
"""
Walk output directory and populate output_files, output_size, output_mimetypes fields.
Args:
output_dir: Directory containing output files
"""
import mimetypes
from collections import defaultdict
exclude_names = {'stdout.log', 'stderr.log', 'hook.pid', 'listener.pid'}
# Track mimetypes and sizes for aggregation
mime_sizes = defaultdict(int)
total_size = 0
output_files = {} # Dict keyed by relative path
for file_path in output_dir.rglob('*'):
# Skip non-files and infrastructure files
if not file_path.is_file():
continue
if file_path.name in exclude_names:
continue
# Get file stats
stat = file_path.stat()
mime_type, _ = mimetypes.guess_type(str(file_path))
mime_type = mime_type or 'application/octet-stream'
# Track for ArchiveResult fields
relative_path = str(file_path.relative_to(output_dir))
output_files[relative_path] = {} # Empty dict, extensible for future metadata
mime_sizes[mime_type] += stat.st_size
total_size += stat.st_size
# Populate ArchiveResult fields
self.output_files = output_files # Dict preserves insertion order (Python 3.7+)
self.output_size = total_size
# Build output_mimetypes CSV (sorted by size descending)
sorted_mimes = sorted(mime_sizes.items(), key=lambda x: x[1], reverse=True)
self.output_mimetypes = ','.join(mime for mime, _ in sorted_mimes)
Since output_files is a dict keyed by relative path, you can use Django's JSON field lookups:
# Check if a specific file exists
ArchiveResult.objects.filter(output_files__has_key='index.html')
# Check if any of multiple files exist (OR)
from django.db.models import Q
ArchiveResult.objects.filter(
Q(output_files__has_key='index.html') |
Q(output_files__has_key='index.htm')
)
# Get all results that have favicon
ArchiveResult.objects.filter(output_files__has_key='favicon.ico')
# Check in Python (after fetching)
if 'index.html' in archiveresult.output_files:
print("Found index.html")
# Get list of all paths
paths = list(archiveresult.output_files.keys())
# Count files
file_count = len(archiveresult.output_files)
# Future: When we add metadata, queries still work
# output_files = {'index.html': {'size': 4096, 'hash': 'abc...'}}
# Note: keys containing '.' (like 'index.html') can't be traversed with __ lookups,
# so annotate with KeyTransform first:
from django.db.models.fields.json import KeyTransform
ArchiveResult.objects.annotate(
    index_html=KeyTransform('index.html', 'output_files'),
).filter(index_html__size__gt=1000)  # size > 1KB
Structure for Future Extension:
Current (empty metadata):
{
'index.html': {},
'style.css': {},
'images/logo.png': {}
}
Future (with optional metadata):
{
'index.html': {
'size': 4096,
'hash': 'abc123...',
'mime_type': 'text/html'
},
'style.css': {
'size': 2048,
'hash': 'def456...',
'mime_type': 'text/css'
}
}
All existing queries continue to work unchanged - the dict structure is backward compatible.
This phase adds support for long-running background hooks that don't block other extractors.
Background hooks are identified by .bg. suffix in filename:
- on_Snapshot__21_consolelog.bg.js ← background
- on_Snapshot__11_favicon.js ← foreground

Files to rename:
# Use .bg. suffix (not __background)
mv archivebox/plugins/consolelog/on_Snapshot__21_consolelog.js \
archivebox/plugins/consolelog/on_Snapshot__21_consolelog.bg.js
mv archivebox/plugins/ssl/on_Snapshot__23_ssl.js \
archivebox/plugins/ssl/on_Snapshot__23_ssl.bg.js
mv archivebox/plugins/responses/on_Snapshot__24_responses.js \
archivebox/plugins/responses/on_Snapshot__24_responses.bg.js
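The detection helper itself isn't shown above; a minimal sketch (the name `is_background_hook` matches the tests later in this document, the implementation is an assumption):

```python
from pathlib import Path

def is_background_hook(hook_path: Path) -> bool:
    """Background hooks carry a `.bg.` marker before their extension."""
    # e.g. on_Snapshot__21_consolelog.bg.js -> True
    #      on_Snapshot__11_favicon.js       -> False
    return '.bg.' in hook_path.name
```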
Update hook content to emit proper JSON:
Each hook should emit:
console.log(JSON.stringify({
type: 'ArchiveResult',
status: 'succeeded', // or 'failed' or 'skipped'
output_str: 'Captured 15 console messages', // human-readable summary
output_json: { // optional structured metadata
// ... specific to each hook
}
}));
Location: archivebox/core/models.py or new archivebox/core/background_hooks.py
import os
import json
from pathlib import Path
from typing import List

def find_background_hooks(snapshot) -> List['ArchiveResult']:
"""
Find all ArchiveResults that are background hooks still running.
Args:
snapshot: Snapshot instance
Returns:
List of ArchiveResults with status='started'
"""
return list(snapshot.archiveresult_set.filter(
status=ArchiveResult.StatusChoices.STARTED
))
def check_background_hook_completed(archiveresult: 'ArchiveResult') -> bool:
"""
Check if background hook process has exited.
Args:
archiveresult: ArchiveResult instance
Returns:
True if completed (process exited), False if still running
"""
extractor_dir = Path(archiveresult.pwd)
pid_file = extractor_dir / 'hook.pid'
if not pid_file.exists():
return True # No PID file = completed or failed to start
try:
pid = int(pid_file.read_text().strip())
os.kill(pid, 0) # Signal 0 = check if process exists
return False # Still running
except (OSError, ValueError):
return True # Process exited or invalid PID
def finalize_background_hook(archiveresult: 'ArchiveResult') -> None:
"""
Collect final results from completed background hook.
Same logic as ArchiveResult.run() but for background hooks that already started.
Args:
archiveresult: ArchiveResult instance to finalize
"""
from django.utils import timezone
from archivebox.machine.models import Machine
extractor_dir = Path(archiveresult.pwd)
stdout_file = extractor_dir / 'stdout.log'
stderr_file = extractor_dir / 'stderr.log'
# Read logs
stdout = stdout_file.read_text() if stdout_file.exists() else ''
# Parse JSONL output (same as run_hook)
records = []
for line in stdout.splitlines():
line = line.strip()
if not line or not line.startswith('{'):
continue
try:
data = json.loads(line)
if 'type' in data:
records.append(data)
except json.JSONDecodeError:
continue
# Find the ArchiveResult record
ar_records = [r for r in records if r.get('type') == 'ArchiveResult']
assert len(ar_records) <= 1, f"Background hook output {len(ar_records)} ArchiveResults, expected 0-1"
if ar_records:
hook_data = ar_records[0]
# Apply hook's data
status_str = hook_data.get('status', 'failed')
status_map = {
'succeeded': ArchiveResult.StatusChoices.SUCCEEDED,
'failed': ArchiveResult.StatusChoices.FAILED,
'skipped': ArchiveResult.StatusChoices.SKIPPED,
}
archiveresult.status = status_map.get(status_str, ArchiveResult.StatusChoices.FAILED)
archiveresult.output_str = hook_data.get('output_str', '')
archiveresult.output_json = hook_data.get('output_json')
# Determine binary FK from cmd
if 'cmd' in hook_data:
archiveresult.cmd = json.dumps(hook_data['cmd'])
machine = Machine.current()
binary_id = find_binary_for_cmd(hook_data['cmd'], machine.id)
if binary_id:
archiveresult.binary_id = binary_id
else:
# No output = failed
archiveresult.status = ArchiveResult.StatusChoices.FAILED
archiveresult.output_str = 'Background hook did not output ArchiveResult'
archiveresult.end_ts = timezone.now()
archiveresult.retry_at = None
# POPULATE OUTPUT FIELDS FROM FILESYSTEM
if extractor_dir.exists():
archiveresult._populate_output_fields(extractor_dir)
archiveresult.save()
# Create any side-effect records
for record in records:
if record['type'] != 'ArchiveResult':
create_model_record(record)
# Cleanup
for pf in extractor_dir.glob('*.pid'):
pf.unlink(missing_ok=True)
if stdout_file.exists() and stdout_file.stat().st_size == 0:
stdout_file.unlink()
if stderr_file.exists() and stderr_file.stat().st_size == 0:
stderr_file.unlink()
Location: archivebox/core/statemachines.py
class SnapshotMachine(StateMachine, strict_states=True):
# ... existing states ...
def is_finished(self) -> bool:
"""
Check if snapshot archiving is complete.
A snapshot is finished when:
1. No pending archiveresults remain (queued/started foreground hooks)
2. All background hooks have completed
"""
# Check if any pending archiveresults exist
if self.snapshot.pending_archiveresults().exists():
return False
# Check and finalize background hooks
background_hooks = find_background_hooks(self.snapshot)
for bg_hook in background_hooks:
if not check_background_hook_completed(bg_hook):
return False # Still running
# Completed - finalize it
finalize_background_hook(bg_hook)
# All done
return True
Deduplication is handled by external filesystem tools like fdupes (hardlinks), ZFS dedup, Btrfs duperemove, or rdfind. Users can run these tools periodically on the archive directory to identify and link duplicate files. ArchiveBox doesn't need to track hashes or manage deduplication itself - the filesystem layer handles it transparently.
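To illustrate what those tools do at the filesystem layer, here is a toy Python sketch of hash-based hardlink deduplication (a simplified equivalent of `rdfind -makehardlinks true`; the function name is hypothetical, and real tools are far faster and safer):

```python
import hashlib
import os
from pathlib import Path

def hardlink_duplicates(archive_dir: Path) -> int:
    """Replace byte-identical files with hardlinks to one canonical copy."""
    seen = {}  # sha256 -> first file seen with that content
    linked = 0
    for path in sorted(archive_dir.rglob('*')):
        if not path.is_file() or path.stat().st_nlink > 1:
            continue  # skip dirs and files already hardlinked
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        original = seen.setdefault(digest, path)
        if original is not path:
            path.unlink()
            os.link(original, path)  # same inode, no extra disk space
            linked += 1
    return linked
```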
# tests/test_background_hooks.py
def test_background_hook_detection():
"""Test .bg. suffix detection"""
assert is_background_hook(Path('on_Snapshot__21_test.bg.js'))
assert not is_background_hook(Path('on_Snapshot__21_test.js'))
def test_find_binary_by_abspath():
"""Test binary matching by absolute path"""
machine = Machine.current()
binary = Binary.objects.create(
name='wget',
abspath='/usr/bin/wget',
machine=machine
)
cmd = ['/usr/bin/wget', '-p', 'url']
assert find_binary_for_cmd(cmd, machine.id) == str(binary.id)
def test_find_binary_by_name():
"""Test binary matching by name fallback"""
machine = Machine.current()
binary = Binary.objects.create(
name='wget',
abspath='/usr/local/bin/wget',
machine=machine
)
cmd = ['wget', '-p', 'url']
assert find_binary_for_cmd(cmd, machine.id) == str(binary.id)
def test_parse_hook_json():
"""Test JSON parsing from stdout"""
stdout = '''
Some log output
{"type": "ArchiveResult", "status": "succeeded", "output_str": "test"}
More output
'''
result = parse_hook_output_json(stdout)
assert result['status'] == 'succeeded'
assert result['output_str'] == 'test'
def test_foreground_hook_execution(snapshot):
"""Test foreground hook runs and returns results"""
ar = ArchiveResult.objects.create(
snapshot=snapshot,
extractor='11_favicon',
status=ArchiveResult.StatusChoices.QUEUED
)
ar.run()
ar.refresh_from_db()
assert ar.status in [
ArchiveResult.StatusChoices.SUCCEEDED,
ArchiveResult.StatusChoices.FAILED
]
assert ar.start_ts is not None
assert ar.end_ts is not None
assert ar.output_size >= 0
def test_background_hook_execution(snapshot):
"""Test background hook starts but doesn't block"""
ar = ArchiveResult.objects.create(
snapshot=snapshot,
extractor='21_consolelog',
status=ArchiveResult.StatusChoices.QUEUED
)
start = time.time()
ar.run()
duration = time.time() - start
ar.refresh_from_db()
# Should return quickly (< 5 seconds)
assert duration < 5
# Should be in 'started' state
assert ar.status == ArchiveResult.StatusChoices.STARTED
# PID file should exist
assert (Path(ar.pwd) / 'hook.pid').exists()
def test_background_hook_finalization(snapshot):
"""Test background hook finalization after completion"""
# Start background hook
ar = ArchiveResult.objects.create(
snapshot=snapshot,
extractor='21_consolelog',
status=ArchiveResult.StatusChoices.STARTED,
pwd='/path/to/output'
)
# Simulate completion (hook writes output and exits)
# ...
# Finalize
finalize_background_hook(ar)
ar.refresh_from_db()
assert ar.status == ArchiveResult.StatusChoices.SUCCEEDED
assert ar.end_ts is not None
assert ar.output_size > 0
cd archivebox
python manage.py makemigrations core --name archiveresult_background_hooks
- _populate_output_fields() to walk directory and populate summary fields
- create_model_record() for any side-effect records (Binary, etc.)
- find_background_hooks()
- check_background_hook_completed()
- finalize_background_hook()

If a background hook runs longer than MAX_LIFETIME after all foreground hooks complete, force kill it.
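A sketch of that force-kill check, reusing the hook.pid convention from above (the function name and the MAX_LIFETIME value are assumptions, not the final implementation):

```python
import os
import signal
import time
from pathlib import Path

MAX_LIFETIME = 300  # seconds; assumed config value

def kill_overdue_background_hook(extractor_dir: Path, started_ts: float) -> bool:
    """SIGKILL a background hook whose PID file has outlived MAX_LIFETIME."""
    pid_file = extractor_dir / 'hook.pid'
    if not pid_file.exists():
        return False  # already exited and cleaned up
    if time.time() - started_ts < MAX_LIFETIME:
        return False  # still within its allowed lifetime
    try:
        pid = int(pid_file.read_text().strip())
        os.kill(pid, signal.SIGKILL)  # force kill, no chance to trap
        return True
    except (OSError, ValueError):
        return False  # process already gone or PID file corrupt
```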
Background hooks could write progress to a file that gets polled:
fs.writeFileSync('progress.txt', '50%');
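On the polling side, a minimal Python reader might look like this (the file name matches the JS example above; the function name and polling semantics are assumptions):

```python
from pathlib import Path
from typing import Optional

def read_hook_progress(extractor_dir: Path) -> Optional[str]:
    """Return the hook's last reported progress, or None if not reported yet."""
    progress_file = extractor_dir / 'progress.txt'
    if not progress_file.exists():
        return None
    return progress_file.read_text().strip() or None
```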
If needed in future, extend to support multiple JSON outputs by collecting all {type: 'ArchiveResult'} lines.
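A sketch of that extension, relaxing the `len(ar_records) <= 1` assertion used in finalize_background_hook() above (the function name is hypothetical):

```python
import json

def parse_all_archiveresults(stdout: str) -> list:
    """Collect every {type: 'ArchiveResult'} JSONL line from hook stdout."""
    records = []
    for line in stdout.splitlines():
        line = line.strip()
        if not line.startswith('{'):
            continue  # skip plain log output
        try:
            data = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that only look like JSON
        if data.get('type') == 'ArchiveResult':
            records.append(data)
    return records
```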
Store all binaries used by a hook (not just primary), useful for hooks that chain multiple tools.
If needed, extend output_files values to include per-file metadata:
output_files = {
'index.html': {
'size': 4096,
'hash': 'abc123...',
'mime_type': 'text/html',
'modified_at': '2025-01-15T10:30:00Z'
}
}
Can query with custom SQL for complex per-file queries (e.g., "find all results with any file > 50KB"). Summary fields (output_size, output_mimetypes) remain as denormalized cache for performance.
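For example, on SQLite such a per-file query can expand the JSON dict with `json_each` (the table and column names here are illustrative, not the exact Django schema):

```python
import json
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE archiveresult (id INTEGER PRIMARY KEY, output_files TEXT)')
conn.execute('INSERT INTO archiveresult VALUES (1, ?)',
             (json.dumps({'index.html': {'size': 4096}, 'video.mp4': {'size': 120000}}),))
conn.execute('INSERT INTO archiveresult VALUES (2, ?)',
             (json.dumps({'favicon.ico': {'size': 1024}}),))

# "find all results with any file > 50KB": one json_each row per file entry
rows = conn.execute('''
    SELECT DISTINCT ar.id
    FROM archiveresult ar, json_each(ar.output_files) AS f
    WHERE json_extract(f.value, '$.size') > 51200
''').fetchall()
```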
This report documents the Phase 4 plugin audit and Phase 1-7 implementation work.
Created migrations:
- archivebox/core/migrations/0029_archiveresult_hook_fields.py - Adds new fields
- archivebox/core/migrations/0030_migrate_output_field.py - Migrates old output field

New ArchiveResult fields:
- output_str (TextField) - human-readable summary
- output_json (JSONField) - structured metadata
- output_files (JSONField) - dict of {relative_path: {}}
- output_size (BigIntegerField) - total bytes
- output_mimetypes (CharField) - CSV of mimetypes sorted by size
- binary (ForeignKey to Binary) - optional

Updated archivebox/hooks.py:
- Parses JSONL records ({type: 'ModelName', ...}) from hook stdout
- Replaced the old RESULT_JSON= output format
- Detects background hooks via the .bg. filename suffix
- Added find_binary_for_cmd() helper
- Added create_model_record() for Binary/Machine

Updated archivebox/core/models.py:
- Creates records from HookResult
- _populate_output_fields() method
- _set_binary_from_cmd() method
- create_model_record() for side-effect records

Added to archivebox/core/models.py:
- is_background_hook() method
- check_background_completed() method
- finalize_background_hook() method

Updated archivebox/core/statemachines.py:
- SnapshotMachine.is_finished() checks/finalizes background hooks

| Plugin | Hook | Status | Notes |
|---|---|---|---|
| apt | on_Dependency__install_using_apt_provider.py | ✅ OK | Emits {type: 'Binary'} JSONL |
| brew | on_Dependency__install_using_brew_provider.py | ✅ OK | Emits {type: 'Binary'} JSONL |
| custom | on_Dependency__install_using_custom_bash.py | ✅ OK | Emits {type: 'Binary'} JSONL |
| env | on_Dependency__install_using_env_provider.py | ✅ OK | Emits {type: 'Binary'} JSONL |
| npm | on_Dependency__install_using_npm_provider.py | ✅ OK | Emits {type: 'Binary'} JSONL |
| pip | on_Dependency__install_using_pip_provider.py | ✅ OK | Emits {type: 'Binary'} JSONL |
| Plugin | Hook | Status | Notes |
|---|---|---|---|
| chrome_session | on_Crawl__00_install_chrome.py | ✅ RENAMED | Emits Binary/Dependency JSONL |
| chrome_session | on_Crawl__00_install_chrome_config.py | ✅ RENAMED | Emits config JSONL |
| wget | on_Crawl__00_install_wget.py | ✅ RENAMED | Emits Binary/Dependency JSONL |
| wget | on_Crawl__00_install_wget_config.py | ✅ RENAMED | Emits config JSONL |
| singlefile | on_Crawl__00_install_singlefile.py | ✅ RENAMED | Emits Binary/Dependency JSONL |
| readability | on_Crawl__00_install_readability.py | ✅ RENAMED | Emits Binary/Dependency JSONL |
| media | on_Crawl__00_install_ytdlp.py | ✅ RENAMED | Emits Binary/Dependency JSONL |
| git | on_Crawl__00_install_git.py | ✅ RENAMED | Emits Binary/Dependency JSONL |
| forumdl | on_Crawl__00_install_forumdl.py | ✅ RENAMED | Emits Binary/Dependency JSONL |
| gallerydl | on_Crawl__00_install_gallerydl.py | ✅ RENAMED | Emits Binary/Dependency JSONL |
| mercury | on_Crawl__00_install_mercury.py | ✅ RENAMED | Emits Binary/Dependency JSONL |
| papersdl | on_Crawl__00_install_papersdl.py | ✅ RENAMED | Emits Binary/Dependency JSONL |
| search_backend_ripgrep | on_Crawl__00_install_ripgrep.py | ✅ RENAMED | Emits Binary/Dependency JSONL |
| Plugin | Hook | Status | Notes |
|---|---|---|---|
| favicon | on_Snapshot__11_favicon.py | ✅ UPDATED | Now outputs clean JSONL |
| git | on_Snapshot__12_git.py | ✅ UPDATED | Now outputs clean JSONL with cmd |
| archivedotorg | on_Snapshot__13_archivedotorg.py | ✅ UPDATED | Now outputs clean JSONL |
| title | on_Snapshot__32_title.js | ✅ UPDATED | Now outputs clean JSONL |
| singlefile | on_Snapshot__37_singlefile.py | ✅ UPDATED | Now outputs clean JSONL with cmd |
| wget | on_Snapshot__50_wget.py | ✅ UPDATED | Now outputs clean JSONL with cmd |
| media | on_Snapshot__51_media.py | ✅ UPDATED | Now outputs clean JSONL with cmd |
| readability | on_Snapshot__52_readability.py | ✅ UPDATED | Now outputs clean JSONL with cmd |
All JS hooks have been updated to use clean JSONL format:
| Plugin | Hook | Status | Notes |
|---|---|---|---|
| chrome_session | on_Snapshot__20_chrome_session.js | ✅ UPDATED | Clean JSONL with cmd_version |
| consolelog | on_Snapshot__21_consolelog.bg.js | ✅ UPDATED | Renamed to background hook |
| ssl | on_Snapshot__23_ssl.bg.js | ✅ UPDATED | Renamed to background hook |
| responses | on_Snapshot__24_responses.bg.js | ✅ UPDATED | Renamed to background hook |
| chrome_navigate | on_Snapshot__30_chrome_navigate.js | ✅ UPDATED | Clean JSONL output |
| redirects | on_Snapshot__31_redirects.js | ✅ UPDATED | Clean JSONL output |
| title | on_Snapshot__32_title.js | ✅ UPDATED | Clean JSONL output |
| headers | on_Snapshot__33_headers.js | ✅ UPDATED | Clean JSONL output |
| screenshot | on_Snapshot__34_screenshot.js | ✅ UPDATED | Clean JSONL output |
| pdf | on_Snapshot__35_pdf.js | ✅ UPDATED | Clean JSONL output |
| dom | on_Snapshot__36_dom.js | ✅ UPDATED | Clean JSONL output |
| seo | on_Snapshot__38_seo.js | ✅ UPDATED | Clean JSONL output |
| accessibility | on_Snapshot__39_accessibility.js | ✅ UPDATED | Clean JSONL output |
| parse_dom_outlinks | on_Snapshot__40_parse_dom_outlinks.js | ✅ UPDATED | Clean JSONL output |
The following hooks have been renamed with .bg. suffix:
- on_Snapshot__21_consolelog.js → on_Snapshot__21_consolelog.bg.js
- on_Snapshot__23_ssl.js → on_Snapshot__23_ssl.bg.js
- on_Snapshot__24_responses.js → on_Snapshot__24_responses.bg.js

Files modified:

- archivebox/hooks.py - Updated run_hook() and added helpers
- archivebox/core/models.py - Updated ArchiveResult model and run() method
- archivebox/core/statemachines.py - Updated SnapshotMachine.is_finished()
- archivebox/core/admin_archiveresults.py - Updated to use output_str
- archivebox/core/templatetags/core_tags.py - Updated to use output_str
- archivebox/core/migrations/0029_archiveresult_hook_fields.py (new)
- archivebox/core/migrations/0030_migrate_output_field.py (new)
- archivebox/plugins/archivedotorg/on_Snapshot__13_archivedotorg.py
- archivebox/plugins/favicon/on_Snapshot__11_favicon.py
- archivebox/plugins/git/on_Snapshot__12_git.py
- archivebox/plugins/media/on_Snapshot__51_media.py
- archivebox/plugins/readability/on_Snapshot__52_readability.py
- archivebox/plugins/singlefile/on_Snapshot__37_singlefile.py
- archivebox/plugins/wget/on_Snapshot__50_wget.py
- archivebox/plugins/chrome_session/on_Snapshot__20_chrome_session.js
- archivebox/plugins/consolelog/on_Snapshot__21_consolelog.bg.js (renamed)
- archivebox/plugins/ssl/on_Snapshot__23_ssl.bg.js (renamed)
- archivebox/plugins/responses/on_Snapshot__24_responses.bg.js (renamed)
- archivebox/plugins/chrome_navigate/on_Snapshot__30_chrome_navigate.js
- archivebox/plugins/redirects/on_Snapshot__31_redirects.js
- archivebox/plugins/title/on_Snapshot__32_title.js
- archivebox/plugins/headers/on_Snapshot__33_headers.js
- archivebox/plugins/screenshot/on_Snapshot__34_screenshot.js
- archivebox/plugins/pdf/on_Snapshot__35_pdf.js
- archivebox/plugins/dom/on_Snapshot__36_dom.js
- archivebox/plugins/seo/on_Snapshot__38_seo.js
- archivebox/plugins/accessibility/on_Snapshot__39_accessibility.js
- archivebox/plugins/parse_dom_outlinks/on_Snapshot__40_parse_dom_outlinks.js

All phases of the hook architecture implementation are now complete.
Total hooks updated: 41 hooks across 6 dependency providers, 13 install hooks (renamed from validate), 8 Python snapshot hooks, and 14 JS snapshot hooks (3 of which are background hooks).