The migrations currently LOSE DATA during the v0.7.2 → v0.9.0 upgrade:

- extractor field data is not being copied to the plugin field
- output field data is not being copied to the output_str field
- timestamp fields (added, updated) may not be properly transformed

Sample databases for testing are available at:

- /Users/squash/Local/Code/archiveboxes/archivebox-migration-path/archivebox-v0.7.2/data/index.sqlite3
- /Users/squash/Local/Code/archiveboxes/archivebox-migration-path/archivebox-v0.8.6rc0/data/index.sqlite3
Schema comparison reports:

- /tmp/schema_comparison_report.md
- /tmp/table_presence_matrix.md
```bash
# Test 1: fresh install
rm -rf /tmp/test_fresh && mkdir -p /tmp/test_fresh
DATA_DIR=/tmp/test_fresh python -m archivebox init
DATA_DIR=/tmp/test_fresh python -m archivebox status

# Test 2: upgrade from v0.7.2
rm -rf /tmp/test_v072 && mkdir -p /tmp/test_v072
cp /Users/squash/Local/Code/archiveboxes/archivebox-migration-path/archivebox-v0.7.2/data/index.sqlite3 /tmp/test_v072/
DATA_DIR=/tmp/test_v072 python -m archivebox init
DATA_DIR=/tmp/test_v072 python -m archivebox status

# Test 3: upgrade from v0.8.6rc0
rm -rf /tmp/test_v086 && mkdir -p /tmp/test_v086
cp /Users/squash/Local/Code/archiveboxes/archivebox-migration-path/archivebox-v0.8.6rc0/data/index.sqlite3 /tmp/test_v086/
DATA_DIR=/tmp/test_v086 python -m archivebox init
DATA_DIR=/tmp/test_v086 python -m archivebox status
```
After each test, compare original vs migrated data:
```bash
# Check ArchiveResult data preservation
echo "=== ORIGINAL ==="
sqlite3 /path/to/original.db "SELECT id, extractor, output, status FROM core_archiveresult LIMIT 5;"
echo "=== MIGRATED ==="
sqlite3 /tmp/test_vXXX/index.sqlite3 "SELECT id, plugin, output_str, status FROM core_archiveresult LIMIT 5;"

# Check Snapshot data preservation
echo "=== ORIGINAL SNAPSHOTS ==="
sqlite3 /path/to/original.db "SELECT id, url, title, added, updated FROM core_snapshot LIMIT 5;"
echo "=== MIGRATED SNAPSHOTS ==="
sqlite3 /tmp/test_vXXX/index.sqlite3 "SELECT id, url, title, bookmarked_at, created_at, modified_at FROM core_snapshot LIMIT 5;"

# Check Tag data preservation
echo "=== ORIGINAL TAGS ==="
sqlite3 /path/to/original.db "SELECT * FROM core_tag;"
echo "=== MIGRATED TAGS ==="
sqlite3 /tmp/test_vXXX/index.sqlite3 "SELECT * FROM core_tag;"

# Check snapshot-tag relationships
sqlite3 /tmp/test_vXXX/index.sqlite3 "SELECT COUNT(*) FROM core_snapshot_tags;"
```
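The spot checks above can also be wrapped in a small script; a minimal sketch using only the Python stdlib (the table list and the two database paths are assumptions to fill in):

```python
import sqlite3

# Tables whose row counts must survive the migration (assumed names).
TABLES = ('core_snapshot', 'core_archiveresult', 'core_tag')

def table_counts(db_path, tables=TABLES):
    """Return {table: row_count} for the given tables in a SQLite database."""
    con = sqlite3.connect(db_path)
    try:
        # Table names cannot be bound as parameters; these are trusted constants.
        return {t: con.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
                for t in tables}
    finally:
        con.close()

def assert_counts_preserved(original_db, migrated_db, tables=TABLES):
    """Fail loudly if any table lost (or gained) rows during migration."""
    before = table_counts(original_db, tables)
    after = table_counts(migrated_db, tables)
    for t in tables:
        assert before[t] == after[t], (
            f"{t}: {before[t]} rows before migration, {after[t]} after")
```

Usage would be e.g. `assert_counts_preserved('/path/to/original.db', '/tmp/test_v072/index.sqlite3')` after running the init test.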
CRITICAL: verify that row counts match and that the renamed-field data (extractor/output, added/updated) survived the migration.
Use this approach for complex migrations:
Python: Detect existing schema version

```python
from django.db import connection

def get_table_columns(table_name):
    cursor = connection.cursor()
    cursor.execute(f"PRAGMA table_info({table_name})")
    return {row[1] for row in cursor.fetchall()}

cols = get_table_columns('core_archiveresult')
has_extractor = 'extractor' in cols
has_plugin = 'plugin' in cols
```
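The same detection works outside Django with plain sqlite3, which is handy for inspecting the sample databases directly (a standalone variant for debugging, not the migration code itself):

```python
import sqlite3

def get_table_columns(db_path, table_name):
    """Column names for table_name; PRAGMA table_info row[1] is the name."""
    con = sqlite3.connect(db_path)
    try:
        return {row[1] for row in con.execute(f"PRAGMA table_info({table_name})")}
    finally:
        con.close()

# e.g. 'extractor' in get_table_columns('/tmp/test_v072/index.sqlite3',
#                                       'core_archiveresult')  -> old schema
```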
SQL: Modify database structure during migration

```sql
CREATE TABLE core_archiveresult_new (...);
INSERT INTO core_archiveresult_new SELECT ... FROM core_archiveresult;
DROP TABLE core_archiveresult;
ALTER TABLE core_archiveresult_new RENAME TO core_archiveresult;
```
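Because SQLite's ALTER TABLE cannot drop or retype columns (in older versions), this create/copy/drop/rename dance is the standard rebuild pattern. A self-contained toy illustration (simplified columns, not the real schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE core_archiveresult (id INTEGER PRIMARY KEY, extractor TEXT)")
con.executemany("INSERT INTO core_archiveresult (extractor) VALUES (?)",
                [("wget",), ("pdf",)])

# 1. create the replacement table, 2. copy rows, 3. drop old, 4. rename new
con.executescript("""
CREATE TABLE core_archiveresult_new (id INTEGER PRIMARY KEY, extractor TEXT);
INSERT INTO core_archiveresult_new (id, extractor)
    SELECT id, extractor FROM core_archiveresult;
DROP TABLE core_archiveresult;
ALTER TABLE core_archiveresult_new RENAME TO core_archiveresult;
""")

rows = con.execute("SELECT extractor FROM core_archiveresult ORDER BY id").fetchall()
# [('wget',), ('pdf',)]
```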
Python: Copy data between old and new field names

```python
if 'extractor' in cols and 'plugin' in cols:
    cursor.execute("UPDATE core_archiveresult SET plugin = COALESCE(extractor, '')")
```
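The COALESCE matters because legacy rows may contain NULLs while the new columns are expected to be non-null; a quick standalone check of the copy semantics:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE core_archiveresult (extractor TEXT, plugin TEXT)")
con.executemany("INSERT INTO core_archiveresult (extractor) VALUES (?)",
                [("wget",), (None,)])

# NULL extractor becomes '' rather than NULL in the new column
con.execute("UPDATE core_archiveresult SET plugin = COALESCE(extractor, '')")
result = con.execute("SELECT plugin FROM core_archiveresult ORDER BY rowid").fetchall()
# [('wget',), ('',)]
```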
SQL: Drop old columns/tables

```sql
-- Django's RemoveField will handle this
```
Django: Register the end state so Django knows what the schema should be

```python
migrations.SeparateDatabaseAndState(
    database_operations=[...],  # Your SQL/Python migrations
    state_operations=[...],     # Tell Django what the final schema looks like
)
```
core/migrations/0023_upgrade_to_0_9_0.py: Raw SQL migration that upgrades tables from v0.7.2/v0.8.6 schema
core/migrations/0025_alter_archiveresultoptions...py: Django-generated migration
crawls/migrations/0002_upgrade_from_0_8_6.py: Handles the crawls_crawl table upgrade (seed_id + persona (VARCHAR) → urls + persona_id (UUID FK))

Always run from the archivebox/ subdirectory (NOT from a data dir):
```bash
cd archivebox/
./manage.py makemigrations
./manage.py makemigrations --check   # Verify no unreflected changes
```
This works because archivebox/manage.py has:
```python
os.environ.setdefault('ARCHIVEBOX_DATA_DIR', '.')
```
To apply migrations, always run from inside a data directory using archivebox init:
```bash
# WRONG - Don't do this:
cd /some/data/dir
../path/to/archivebox/manage.py migrate

# RIGHT - Do this:
DATA_DIR=/some/data/dir python -m archivebox init
```
Why? Because archivebox init sets DATA_DIR and the Django environment correctly before running the migrations.

v0.7.2 schema (OLD):
- core_archiveresult: id (INTEGER), uuid, extractor, output, cmd, pwd, cmd_version, start_ts, end_ts, status, snapshot_id
- core_snapshot: id, url, timestamp, title, added, updated, crawl_id
- core_tag: id (INTEGER), name, slug

v0.8.6rc0 schema (OLD):
- core_archiveresult: id, abid (not uuid!), extractor, output, created_at, modified_at, retry_at, status, ...
- core_snapshot: id, url, bookmarked_at, created_at, modified_at, crawl_id, status, retry_at, ...
- core_tag: id (UUID/CHAR!), name, slug, abid, created_at, modified_at, created_by_id
- crawls_crawl: id, seed_id, persona (VARCHAR), max_depth, tags_str, status, retry_at, ...

v0.9.0 schema (NEW):
- core_archiveresult: id (INTEGER), uuid, plugin (not extractor!), output_str (not output!), hook_name, created_at, modified_at, output_files, output_json, output_size, output_mimetypes, retry_at, ...
- core_snapshot: id, url, bookmarked_at (not added!), created_at, modified_at (not updated!), crawl_id, parent_snapshot_id, status, retry_at, current_step, depth, fs_version, ...
- core_tag: id (INTEGER!), name, slug, created_at, modified_at, created_by_id
- crawls_crawl: id, urls (not seed_id!), persona_id (not persona!), label, notes, output_dir, ...

WRONG:
```python
# In core/migrations/0023_upgrade_to_0_9_0.py
cursor.execute("""
    CREATE TABLE core_archiveresult_new (
        id INTEGER PRIMARY KEY,
        plugin VARCHAR(32),   -- ❌ New field!
        output_str TEXT,      -- ❌ New field!
        ...
    )
""")
```
RIGHT:
```python
# In core/migrations/0023_upgrade_to_0_9_0.py - Keep OLD field names!
cursor.execute("""
    CREATE TABLE core_archiveresult_new (
        id INTEGER PRIMARY KEY,
        extractor VARCHAR(32),   -- ✓ OLD field name
        output VARCHAR(1024),    -- ✓ OLD field name
        ...
    )
""")
```
Why: If you create new fields in SQL, Django's AddField operation in migration 0025 will overwrite them with default values, losing your data!
WRONG:
```python
# In core/migrations/0023
cursor.execute("""
    INSERT INTO core_archiveresult_new (plugin, output_str, ...)
    SELECT COALESCE(extractor, ''), COALESCE(output, ''), ...
    FROM core_archiveresult
""")
```
RIGHT: Keep old field names in SQL, let Django AddField create new columns, then copy:
```python
# In core/migrations/0025 (AFTER AddField operations)
from django.db import connection

def copy_old_to_new(apps, schema_editor):
    cursor = connection.cursor()
    cursor.execute("UPDATE core_archiveresult SET plugin = COALESCE(extractor, '')")
    cursor.execute("UPDATE core_archiveresult SET output_str = COALESCE(output, '')")
```
WRONG:
```python
cursor.execute("SELECT COUNT(*) FROM core_archiveresult")
if cursor.fetchone()[0] == 0:
    return  # Skip migration
```
Why: Fresh installs run migrations 0001-0022 which CREATE empty tables with old schema. Migration 0023 must still upgrade the schema even if tables are empty!
RIGHT: Detect schema version by checking column names:
```python
cols = get_table_columns('core_archiveresult')
has_extractor = 'extractor' in cols
if has_extractor:
    ...  # Old schema - needs upgrade
```
WRONG:

```bash
cd /path/to/data/dir
python manage.py makemigrations
```

RIGHT:

```bash
cd archivebox/   # The archivebox package directory
./manage.py makemigrations
```
WRONG:
```sql
INSERT INTO new_table SELECT uuid FROM old_table
WHERE EXISTS (SELECT 1 FROM pragma_table_info('old_table') WHERE name='uuid');
```
Why: SQLite still evaluates the uuid column reference even if WHERE clause is false, causing "no such column" errors.
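This failure is easy to reproduce with plain sqlite3: the statement errors at prepare time, before the WHERE clause is ever evaluated:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE old_table (abid TEXT)")   # no 'uuid' column
con.execute("CREATE TABLE new_table (uuid TEXT)")

try:
    con.execute(
        "INSERT INTO new_table SELECT uuid FROM old_table "
        "WHERE EXISTS (SELECT 1 FROM pragma_table_info('old_table') "
        "WHERE name='uuid')")
except sqlite3.OperationalError as e:
    print(e)  # no such column: uuid
```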
RIGHT: Use Python to detect schema, then run appropriate SQL:
```python
if 'uuid' in get_table_columns('old_table'):
    cursor.execute("INSERT INTO new_table SELECT uuid FROM old_table")
else:
    cursor.execute("INSERT INTO new_table SELECT abid AS uuid FROM old_table")
```
v0.8.6rc0 has Tag.id as UUID, but v0.9.0 needs INTEGER. The conversion must preserve tag names/slugs and remap the core_snapshot_tags foreign keys to the new integer ids.
See core/migrations/0023_upgrade_to_0_9_0.py PART 3 for the correct approach.
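A toy sketch of the general idea (not the actual PART 3 code): build a UUID → INTEGER mapping, rebuild the tag table, and rewrite the join-table foreign keys so snapshot–tag relationships survive:

```python
import sqlite3
import uuid

con = sqlite3.connect(":memory:")
# toy v0.8.6-style schema; tag_id left untyped so it can hold old UUIDs or new ints
con.executescript("""
CREATE TABLE core_tag (id TEXT PRIMARY KEY, name TEXT, slug TEXT);
CREATE TABLE core_snapshot_tags (snapshot_id TEXT, tag_id);
""")
t1, t2 = str(uuid.uuid4()), str(uuid.uuid4())
con.executemany("INSERT INTO core_tag VALUES (?, ?, ?)",
                [(t1, "docs", "docs"), (t2, "news", "news")])
con.executemany("INSERT INTO core_snapshot_tags VALUES (?, ?)",
                [("snap1", t1), ("snap2", t1), ("snap2", t2)])

# assign sequential INTEGER ids, then rewrite the tag table and the join table
mapping = {old: n for n, (old,) in enumerate(
    con.execute("SELECT id FROM core_tag ORDER BY name"), start=1)}
con.execute("CREATE TABLE core_tag_new (id INTEGER PRIMARY KEY, name TEXT, slug TEXT)")
for old_id, new_id in mapping.items():
    con.execute("INSERT INTO core_tag_new SELECT ?, name, slug FROM core_tag WHERE id = ?",
                (new_id, old_id))
    con.execute("UPDATE core_snapshot_tags SET tag_id = ? WHERE tag_id = ?",
                (new_id, old_id))
con.executescript("DROP TABLE core_tag; ALTER TABLE core_tag_new RENAME TO core_tag;")
```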
When you manually change the database with SQL, you MUST tell Django what the final state is:
```python
migrations.SeparateDatabaseAndState(
    database_operations=[
        migrations.RunPython(my_sql_function),
    ],
    state_operations=[
        migrations.RemoveField('archiveresult', 'extractor'),
        migrations.RemoveField('archiveresult', 'output'),
    ],
)
```
Without state_operations, Django won't know the old fields are gone and makemigrations --check will show unreflected changes.
```python
print('Migrating ArchiveResult from v0.7.2 schema...')
print(f'DEBUG: has_uuid={has_uuid}, has_abid={has_abid}, row_count={row_count}')
```
This helps diagnose which migration path is being taken.
Always test all three paths: fresh install, v0.7.2 upgrade, and v0.8.6 upgrade.
After all changes:
```bash
cd archivebox/
./manage.py makemigrations --check
# Should output: No changes detected
```
As of 2025-01-01, migrations have these issues:
- extractor → plugin field data not copied
- output → output_str field data not copied

Files involved:

- core/migrations/0023_upgrade_to_0_9_0.py
- core/migrations/0025_...py — fix the typo: {extractor" in cols} → {"extractor" in cols}
- crawls/migrations/0002_upgrade_from_0_8_6.py

Add print(f'DEBUG:...) statements where helpful, and run ./manage.py makemigrations --check to ensure no unreflected changes.

| Old Field (v0.7.2/v0.8.6) | New Field (v0.9.0) | Notes |
|---|---|---|
| extractor | plugin | Rename |
| output | output_str | Rename |
| added | bookmarked_at | Rename + also use for created_at |
| updated | modified_at | Rename |
| abid | uuid | v0.8.6 only, field rename |
| Tag.id (UUID) | Tag.id (INTEGER) | v0.8.6 only, type conversion |
| seed_id | urls | Crawl table, v0.8.6 only |
| persona (VARCHAR) | persona_id (UUID FK) | Crawl table, v0.8.6 only |
Final verification checklist:

- extractor → plugin (check first 5 rows)
- output → output_str (check first 5 rows)
- added → bookmarked_at (compare timestamps)
- updated → modified_at (compare timestamps)
- abid → uuid field
- ./manage.py makemigrations --check shows no changes
- archivebox status shows correct snapshot/link counts
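A sketch for the field-by-field items above, comparing original vs migrated databases keyed on row id (table/column names and the NULL → '' convention follow the COALESCE copy described earlier; paths are placeholders):

```python
import sqlite3

def assert_field_migrated(original_db, migrated_db, table, old_col, new_col,
                          key="id", limit=5):
    """Check that old_col values from the original db landed in new_col."""
    old = dict(sqlite3.connect(original_db).execute(
        f"SELECT {key}, {old_col} FROM {table} LIMIT {limit}"))
    new = dict(sqlite3.connect(migrated_db).execute(
        f"SELECT {key}, {new_col} FROM {table}"))
    for k, old_val in old.items():
        # NULLs are expected to become '' per the COALESCE copy
        assert new.get(k) == (old_val or ''), (
            f"{table}.{new_col}[{k}] = {new.get(k)!r}, expected {old_val!r}")

# e.g. assert_field_migrated('/path/to/original.db', '/tmp/test_v072/index.sqlite3',
#                            'core_archiveresult', 'extractor', 'plugin')
```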