Status: 🟡 In Progress (2/13 phases complete)
Started: 2025-12-28
Estimated Files to Update: ~150+ files
File Created: archivebox/core/migrations/0033_rename_extractor_add_hook_name.py
Changes:
- Uses migrations.RenameField() to rename extractor → plugin
- Adds hook_name field (CharField, max_length=255, indexed, default='')

File Updated: archivebox/core/models.py
- Updated indexable() method to use plugin__in and plugin=method
- Renamed ARCHIVE_METHODS_INDEXING_PRECEDENCE to EXTRACTOR_INDEXING_PRECEDENCE

Field Changes:
- extractor → plugin
- hook_name (stores full filename like on_Snapshot__50_wget.py)

Method Updates:
- get_extractor_choices() → get_plugin_choices()
- __str__(): Now uses self.plugin
- save(): Logs plugin instead of extractor
- get_absolute_url(): Uses self.plugin
- extractor_module property → plugin_module property
- output_exists(): Checks self.plugin directory
- embed_path(): Uses self.plugin for paths
- create_output_dir(): Creates self.plugin directory
- output_dir_name: Returns self.plugin
- run(): All references to extractor → plugin (including extractor_dir → plugin_dir)
- update_from_output(): All references updated to plugin/plugin_dir
- _update_snapshot_title(): Parameter renamed to plugin_dir
- trigger_search_indexing(): Passes plugin=self.plugin
- output_dir property: Returns plugin directory
- is_background_hook(): Uses plugin_dir

Method Updates:
- create_pending_archiveresults(): Uses get_enabled_plugins(), filters by plugin=plugin
- result_icons (calc_icons): Maps by r.plugin, calls get_plugin_name() and get_plugin_icon()
- _merge_archive_results_from_index(): Maps by (ar.plugin, ar.start_ts), supports both 'extractor' and 'plugin' keys for backwards compat
- _create_archive_result_if_missing(): Supports both 'extractor' and 'plugin' keys, creates with plugin=plugin
- write_index_json(): Writes 'plugin': ar.plugin in archive_results
- canonical_outputs(): Updates find_best_output_in_dir() to use plugin_name, accesses result.plugin, creates keys like {result.plugin}_path
- latest_outputs(): Uses get_plugins(), filters by plugin=plugin
- retry_failed_archiveresults(): Updated docstring to reference "plugins" instead of "extractors"

Total Lines Changed in models.py: ~50+ locations
Refactor the ArchiveResult model and standardize terminology across the codebase:
- Rename the extractor field to plugin in the ArchiveResult model
- Add a hook_name field to store the specific hook filename that executed
- Rename the CLI flags --extract/--extractors to --plugins

    class ArchiveResult(ModelWithOutputDir, ...):
        extractor = models.CharField(max_length=32, db_index=True)  # e.g., "screenshot", "wget"
        # New fields from migration 0029:
        output_str, output_json, output_files, output_size, output_mimetypes
        binary = ForeignKey('machine.Binary', ...)
        # No hook_name field yet

- ArchiveResult.run() discovers hooks for the plugin (e.g., wget/on_Snapshot__50_wget.py)
- run_hook() executes each hook script, captures output as HookResult
- update_from_output() parses JSONL and updates ArchiveResult fields

extractor field is used in ~100 locations:
Create migration 0033_rename_extractor_add_hook_name.py:
- Rename extractor → plugin (preserve index, constraints)
- Add hook_name = CharField(max_length=255, blank=True, default='', db_index=True)
- hook_name stores the full hook filename: on_Snapshot__50_wget.py, on_Crawl__10_chrome_session.js, etc.

Decision: Full filename chosen for explicitness and easy grep-ability
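One way to sanity-check what the migration should preserve is to replay the same two schema steps against a throwaway SQLite database. The sketch below uses only the standard library rather than Django, so it illustrates the intended end state and is not the migration itself. The table name `core_archiveresult`, the index names, and the reduced column set are assumptions made for illustration.

```python
import sqlite3

# Hypothetical, heavily reduced stand-in for the ArchiveResult table.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE core_archiveresult (
        id INTEGER PRIMARY KEY,
        extractor VARCHAR(32) NOT NULL
    );
    CREATE INDEX archiveresult_extractor_idx ON core_archiveresult (extractor);
    INSERT INTO core_archiveresult (extractor) VALUES ('wget'), ('screenshot');
    """
)

# Step 1: rename the column in place (requires SQLite 3.25 or newer).
conn.execute("ALTER TABLE core_archiveresult RENAME COLUMN extractor TO plugin")

# Step 2: add hook_name; existing rows pick up the empty-string default.
conn.execute(
    "ALTER TABLE core_archiveresult "
    "ADD COLUMN hook_name VARCHAR(255) NOT NULL DEFAULT ''"
)
conn.execute(
    "CREATE INDEX archiveresult_hook_name_idx ON core_archiveresult (hook_name)"
)

# Existing data survives the rename, and hook_name is empty everywhere.
rows = conn.execute(
    "SELECT plugin, hook_name FROM core_archiveresult ORDER BY id"
).fetchall()

# The pre-existing index should now point at the renamed column.
indexed_columns = [
    row[2]
    for row in conn.execute("PRAGMA index_info('archiveresult_extractor_idx')")
]
```

After the two steps, `rows` holds the original values under the new column name, and the old index follows the rename, which is the "preserve index, constraints" requirement in miniature.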
Critical Files to Update:

ArchiveResult Model (lines 1679-1820):
- extractor → plugin
- hook_name = models.CharField(...)
- f'...-> {self.plugin}'

ArchiveResultManager (lines 1669-1677):
- filter(plugin__in=INDEXABLE_METHODS, ...)
- When(plugin=method, ...)

Snapshot Model (lines 1000-1600):
- archiveresult_set.filter(plugin=...)

Function Renames:
- get_extractors() → get_plugins() (lines 479-504)
- get_parser_extractors() → get_parser_plugins() (lines 507-514)
- get_extractor_name() → get_plugin_name() (lines 517-530)
- is_parser_extractor() → is_parser_plugin() (lines 533-536)
- get_enabled_extractors() → get_enabled_plugins() (lines 553-566)
- get_extractor_template() → get_plugin_template() (line 1048)
- get_extractor_icon() → get_plugin_icon() (line 1068)
- get_all_extractor_icons() → get_all_plugin_icons() (line 1092)

Update HookResult TypedDict (lines 63-73):
- hook_name: str to store hook filename
- plugin: str (if not already present)

Update run_hook() (lines 141-389):
- hook_name key

Update ArchiveResult.run() (lines 1838-1914):

Update ArchiveResult.update_from_output() (lines 1916-2073):

Constants to Rename:
- ARCHIVE_METHODS_INDEXING_PRECEDENCE → EXTRACTOR_INDEXING_PRECEDENCE

Comments/Docstrings: Update all function docstrings to use "plugin" terminology
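The HookResult and run_hook() changes listed above could look roughly like the sketch below. The field set is a reduced, hypothetical one (the real HookResult has other keys), and `describe_hook()` and `make_hook_result()` are illustrative helpers, not existing functions in hooks.py. It assumes hooks live at `<plugin>/<hook filename>`, as in the wget/on_Snapshot__50_wget.py example.

```python
from pathlib import Path
from typing import Tuple, TypedDict


class HookResult(TypedDict):
    # Reduced, hypothetical field set for illustration only.
    returncode: int
    stdout: str
    stderr: str
    plugin: str     # plugin directory name, e.g. "wget"
    hook_name: str  # full hook filename, e.g. "on_Snapshot__50_wget.py"


def describe_hook(hook_path: Path) -> Tuple[str, str]:
    """Return (plugin, hook_name) for a hook script path.

    Assumes the layout <plugin>/<hook filename>.
    """
    return hook_path.parent.name, hook_path.name


def make_hook_result(
    hook_path: Path, returncode: int, stdout: str, stderr: str
) -> HookResult:
    """Build a result dict that records which hook file produced it."""
    plugin, hook_name = describe_hook(hook_path)
    return HookResult(
        returncode=returncode,
        stdout=stdout,
        stderr=stderr,
        plugin=plugin,
        hook_name=hook_name,
    )


example = make_hook_result(Path("plugins/wget/on_Snapshot__50_wget.py"), 0, "", "")
```

Storing the full filename in `hook_name` keeps the value greppable against the plugin directory, which matches the decision recorded for the migration.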
Update archiveresult_to_jsonl() (lines 173-200):
- 'extractor': result.extractor → 'plugin': result.plugin
- 'hook_name': result.hook_name

Update JSONL parsing:

Decision: Support both keys on import for smooth migration, always export new format
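The import/export decision above can be sketched as a pair of small functions. This is a minimal illustration, not the actual jsonl.py code: `ArchiveResultStub`, `archiveresult_to_record()`, and `record_to_archiveresult()` are hypothetical names, and the record is reduced to the two fields discussed here.

```python
import json
from dataclasses import dataclass


@dataclass
class ArchiveResultStub:
    """Hypothetical stand-in for the ArchiveResult model."""

    plugin: str
    hook_name: str = ""


def archiveresult_to_record(result: ArchiveResultStub) -> dict:
    """Export: always write the new keys, never 'extractor'."""
    return {"plugin": result.plugin, "hook_name": result.hook_name}


def record_to_archiveresult(record: dict) -> ArchiveResultStub:
    """Import: prefer 'plugin', fall back to the legacy 'extractor' key."""
    plugin = record.get("plugin") or record.get("extractor")
    if not plugin:
        raise ValueError("record has neither a 'plugin' nor an 'extractor' key")
    return ArchiveResultStub(plugin=plugin, hook_name=record.get("hook_name", ""))


# A line written by an older version still imports cleanly...
legacy_line = '{"extractor": "wget"}'
migrated = record_to_archiveresult(json.loads(legacy_line))

# ...and is re-exported in the new format only.
exported = json.loads(json.dumps(archiveresult_to_record(migrated)))
```

Keeping the fallback on the import side only means old JSONL files remain readable while nothing new is ever written with the legacy key.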
archivebox_extract.py (lines 1-230):
- --plugin stays (already correct!)
- results.filter(plugin=plugin)
- result.plugin

archivebox_add.py:
- 'EXTRACTORS': plugins → 'PLUGINS': plugins (if not already)

archivebox_update.py:
- --plugins flag (currently may be --extractors or --extract)

tests/test_oneshot.py:
- --extract=... → --plugins=...

v1_core.py (ArchiveResult API):
- extractor: str → plugin: str
- hook_name: str = ''
- q=[..., 'plugin', ...]
- plugin: Optional[str] = Field(None, q='plugin__icontains')

v1_cli.py (CLI API):
- extract: str → plugins: str
- extractors: str → plugins: str
- args.plugins → plugins parameter

admin_archiveresults.py:
- 'plugin' instead of 'extractor'
- order_by('plugin')

admin_snapshots.py:
forms.py:
- get_archive_methods() → get_plugin_choices()
- archive_methods → plugins

views.py:
- archiveresult_objects[result.plugin] = result

templatetags/core_tags.py:
- extractor_icon() → plugin_icon()
- extractor_thumbnail() → plugin_thumbnail()
- extractor_embed() → plugin_embed()
- result.extractor → result.plugin

Update HTML templates (if any directly reference extractor):
- {{ result.extractor }} and similar → {{ result.plugin }}

templates/admin/progress_monitor.html:
- extractor.extractor and a.extractor to use plugin field

ArchiveResultWorker:
- extractor → plugin (lines 348, 350)
- qs.filter(plugin=self.plugin)

ArchiveResultMachine:
- self.archiveresult.plugin instead of extractor

Update test files:
- WHERE extractor = '60_parse_html_urls' (line 163)
- .extractor attribute: Change to .plugin

This phase standardizes terminology throughout the codebase to use consistent "plugin" nomenclature.
via_extractor → plugin Rename (14 files):
- Rename via_extractor to just plugin

Logging Functions (archivebox/misc/logging_util.py):
- log_archive_method_started() → log_extractor_started() (line 326)
- log_archive_method_finished() → log_extractor_finished() (line 330)

Form Functions (archivebox/core/forms.py):
- get_archive_methods() → get_plugin_choices() (line 15)
- archive_methods → plugins (lines 24, 29)

Comments and Docstrings (81 files with "extractor" references):

Package Manager Plugin Documentation:

String Literals in Error Messages:
- archivebox/core/models.py - ArchiveResult, ArchiveResultManager, Snapshot
- archivebox/core/migrations/0033_*.py - New migration
- archivebox/hooks.py - All hook execution and discovery functions
- archivebox/misc/jsonl.py - Serialization/deserialization
- archivebox/cli/archivebox_extract.py
- archivebox/cli/archivebox_add.py
- archivebox/cli/archivebox_update.py
- archivebox/api/v1_core.py
- archivebox/api/v1_cli.py
- archivebox/core/admin_archiveresults.py
- archivebox/core/views.py
- archivebox/core/templatetags/core_tags.py
- archivebox/workers/worker.py
- archivebox/core/statemachines.py
- tests/test_oneshot.py
- archivebox/tests/test_hooks.py
- archivebox/tests/test_migrations_helpers.py - Schema SQL definitions
- tests/test_recursive_crawl.py - SQL queries with field names
- archivebox/cli/tests_piping.py - Test function docstrings
- archivebox/misc/logging_util.py - Rename logging functions
- archivebox/core/forms.py - Rename form helper and field
- archivebox/templates/admin/progress_monitor.html - JavaScript field refs

    def forwards(apps, schema_editor):
        ArchiveResult = apps.get_model('core', 'ArchiveResult')
        # All existing records get empty hook_name
        ArchiveResult.objects.all().update(hook_name='')

BREAKING CHANGES (per user requirements - no backwards compat):
- --plugins (no aliases)
- extractor removed, plugin required
- plugin_*

PARTIAL COMPAT (for migration):
- via_extractor → plugin (not via_plugin, just "plugin")
- extractor = CharField(max_length=32)
- migrations.RenameField('ArchiveResult', 'extractor', 'plugin') for clean migration

Breaking Changes:
- --extract, --extractors → --plugins (no aliases)
- extractor field → plugin field (no backwards compat)
- extractor_* → plugin_* (users must update custom templates)
- archive_methods → plugins

Migration Required: Yes - all instances must run migrations before upgrading
Estimated Impact: ~150+ files will need updates across the entire codebase
Note: Migration can be tested immediately - the migration file is ready to run!