@@ -0,0 +1,172 @@
+# ArchiveBox UI
+
+## Page: Getting Started
+
+### What do you want to capture?
+
+- Save some URLs now -> [Add page]
+  - Paste some URLs to archive now
+  - Upload a file containing URLs (bookmarks.html export, RSS.xml feed, markdown file, word doc, PDF, etc.)
+  - Pull in URLs to archive from a remote location (e.g. RSS feed URL, remote TXT file, JSON file, etc.)
+
+- Import URLs from a browser -> [Import page]
+  - Desktop: Get the ArchiveBox Chrome/Firefox extension
+  - Mobile: Get the ArchiveBox iOS App / Android App
+  - Upload a bookmarks.html export file
+  - Upload a browser_history.sqlite3 export file
+
+- Import URLs from a 3rd party bookmarking service -> [Sync page]
+  - Pocket
+  - Pinboard
+  - Instapaper
+  - Wallabag
+  - Zapier, N8N, IFTTT, etc.
+  - Upload a bookmarks.html export, bookmarks.json, RSS, etc. file
+
+- Archive URLs on a schedule -> [Schedule page]
+
+- Archive an entire website -> [Crawl page]
+  - What starting URL/domain?
+  - How deep?
+  - Follow links to external domains?
+  - Follow links to parent URLs?
+  - Maximum number of pages to save?
+  - Maximum number of requests/minute?
+
+- Crawl for URLs with a search engine and save the results automatically
+  - Save results matching a search query (e.g. "site:example.com")
+
+- Save a social media feed (e.g. `https://x.com/user/1234567890`)
+
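+Each bracketed destination above could be a plain Django URL route. A minimal sketch, assuming hypothetical view names like `AddView` and `CrawlView` (none of these routes or names are final):
+
+```python
+# urls.py -- hypothetical routes for the pages referenced above
+from django.urls import path
+from . import views  # assumed views module with one class-based view per page
+
+urlpatterns = [
+    path('add/', views.AddView.as_view(), name='add'),                # paste/upload URLs now
+    path('import/', views.ImportView.as_view(), name='import'),       # browser extensions + export files
+    path('sync/', views.SyncView.as_view(), name='sync'),             # 3rd party bookmarking services
+    path('schedule/', views.ScheduleView.as_view(), name='schedule'), # recurring archiving
+    path('crawl/', views.CrawlView.as_view(), name='crawl'),          # whole-site archiving
+]
+```
+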
+--------------------------------------------------------------------------------
+
+### Crawls App
+
+- Archive an entire website -> [Crawl page]
+  - What are the seed URLs?
+  - How many hops to follow?
+  - Follow links to external domains?
+  - Follow links to parent URLs?
+  - Maximum number of pages to save?
+  - Maximum number of requests/minute?
+
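+The questions above could map one-to-one onto model fields. A minimal sketch; field names like `seed_urls` and `max_depth` are assumptions, not a final schema:
+
+```python
+# models.py -- hypothetical Crawl model sketch (names/types are assumptions)
+from django.db import models
+
+class Crawl(models.Model):
+    seed_urls = models.JSONField(default=list)            # starting URLs for the crawl
+    max_depth = models.PositiveIntegerField(default=1)    # how many hops to follow from the seeds
+    follow_external = models.BooleanField(default=False)  # follow links to external domains?
+    follow_parents = models.BooleanField(default=False)   # follow links to parent URLs?
+    max_pages = models.PositiveIntegerField(null=True, blank=True)                # cap on pages saved (None = no cap)
+    max_requests_per_minute = models.PositiveIntegerField(null=True, blank=True)  # politeness rate limit
+```
+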
+--------------------------------------------------------------------------------
+
+### Scheduler App
+
+- Archive URLs on a schedule -> [Schedule page]
+  - What URL(s)?
+  - How often?
+  - Do you want to discard old snapshots after x amount of time?
+  - Any filter rules?
+  - Want to be notified when changes are detected? -> redirect to [Alerts app / create new alert(crawl=self)]
+
+- Choose Schedule to check for new URLs: Schedule.objects.get(pk=xyz)
+  - 1 minute
+  - 5 minutes
+  - 1 hour
+  - 1 day
+
+- Choose Destination Crawl to archive the URLs with: Crawl.objects.get(pk=xyz)
+  - Tags
+  - Persona
+  - Created By ID
+  - Config
+  - Filters
+    - URL patterns to include
+    - URL patterns to exclude
+    - ONLY_NEW= ignore URLs already saved once / save a URL each time it appears / only save if the last save was > x time ago
+
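+The two `objects.get(pk=xyz)` pickers above imply a Schedule that points at a Destination Crawl. A minimal sketch; field names like `interval` and `only_new` are assumptions:
+
+```python
+# models.py -- hypothetical Schedule model sketch (names/types are assumptions)
+from django.db import models
+
+class Schedule(models.Model):
+    interval = models.DurationField()                             # how often to check: 1 min / 5 min / 1 hour / 1 day
+    crawl = models.ForeignKey('Crawl', on_delete=models.CASCADE)  # Destination Crawl (carries tags, persona, config)
+    discard_after = models.DurationField(null=True, blank=True)   # discard old snapshots after this long (None = keep)
+    filter_rules = models.JSONField(default=dict)                 # URL patterns to include/exclude
+    only_new = models.CharField(max_length=16, default='once')    # ONLY_NEW: 'once' / 'every_time' / 'if_stale'
+```
+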
+--------------------------------------------------------------------------------
+
+### Sources App (for managing the sources that ArchiveBox pulls URLs in from)
+
+- Add a new source to pull URLs in from (WIZARD)
+  - [x] Web UI
+  - [x] CLI
+  - Choose URI:
+    - Local server filesystem path (directory to monitor for new files containing URLs)
+    - Remote URL (RSS/JSON/XML feed)
+    - Chrome browser profile sync (log in with a Google account to pull bookmarks/history)
+    - Pocket, Pinboard, Instapaper, Wallabag, etc.
+    - Zapier, N8N, IFTTT, etc.
+    - Google Drive (directory to monitor for new files containing URLs)
+    - Remote server FTP/SFTP/SCP path (directory to monitor for new files containing URLs)
+    - AWS/S3/B2/GCP bucket (directory to monitor for new files containing URLs)
+    - XBrowserSync (log in to pull bookmarks)
+  - Choose extractor
+    - auto
+    - RSS
+    - Pocket
+    - etc.
+  - Specify extra Config, e.g.
+    - credentials
+    - extractor tuning options (e.g. verify_ssl, cookies, etc.)
+
+- Provide credentials for the source
+  - API Key
+  - Username / Password
+  - OAuth
+
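+Each completed wizard run could persist as one row holding the chosen URI, extractor, and config. A minimal sketch; field names like `uri` and `extractor` are assumptions, not ArchiveBox's actual schema:
+
+```python
+# models.py -- hypothetical Source model sketch (names/types are assumptions)
+from django.db import models
+
+class Source(models.Model):
+    uri = models.CharField(max_length=2048)                      # file path, feed URL, bucket URI, service name, etc.
+    extractor = models.CharField(max_length=32, default='auto')  # 'auto' / 'rss' / 'pocket' / ...
+    config = models.JSONField(default=dict)                      # extractor tuning options (verify_ssl, cookies, ...)
+    credentials = models.JSONField(default=dict)                 # API key / username+password / OAuth tokens
+```
+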
+--------------------------------------------------------------------------------
+
+### Alerts App
+
+- Create a new alert, choose condition
+  - Get notified when a site goes down (<x% success ratio for Snapshots)
+  - Get notified when a site changes visually more than x% (screenshot diff)
+  - Get notified when a site's text content changes more than x% (text diff)
+  - Get notified when a keyword appears
+  - Get notified when a keyword disappears
+  - Get notified when an AI prompt returns some result
+
+- Choose alert threshold:
+  - any condition is met
+  - all conditions are met
+  - condition is met for x% of URLs
+  - condition is met for x% of time
+
+- Choose how to notify: (List[AlertDestination])
+  - maximum alert frequency
+  - destination type: email / Slack / Webhook / Google Sheet / logfile
+  - destination info:
+    - email address(es)
+    - Slack channel
+    - Webhook URL
+
+- Choose scope:
+  - Choose ArchiveResult scope (extractors): (a query that returns an ArchiveResult QuerySet)
+    - All extractors
+    - Only screenshots
+    - Only readability / mercury text
+    - Only video
+    - Only HTML
+    - Only headers
+
+  - Choose Snapshot scope (URL): (a query that returns a Snapshot QuerySet)
+    - All domains
+    - Specific domain
+    - All domains in a tag
+    - All domains in a tag category
+    - All URLs matching a certain regex pattern
+
+  - Choose Crawl scope: (a query that returns a Crawl QuerySet)
+    - All crawls
+    - Specific crawls
+    - Crawls by a certain user
+    - Crawls using a certain persona
+
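+Pulling the condition, threshold, scope, and destination choices together, an Alert row might look roughly like this. A minimal sketch; every field name and type is an assumption, not a final schema:
+
+```python
+# models.py -- hypothetical Alert model sketch (all names/types are assumptions)
+from django.db import models
+
+class Alert(models.Model):
+    condition = models.CharField(max_length=32)             # 'site_down' / 'screenshot_diff' / 'text_diff' / 'keyword' / 'ai_prompt'
+    threshold = models.JSONField(default=dict)              # e.g. {'mode': 'any'} or {'mode': 'pct_urls', 'pct': 50}
+    archiveresult_filter = models.JSONField(default=dict)   # kwargs for ArchiveResult.objects.filter(...)
+    snapshot_filter = models.JSONField(default=dict)        # kwargs for Snapshot.objects.filter(...)
+    crawl_filter = models.JSONField(default=dict)           # kwargs for Crawl.objects.filter(...)
+    destinations = models.ManyToManyField('AlertDestination')  # where to deliver alerts (see class below)
+```
+
+Storing each scope as serialized filter kwargs is one way to persist "a query that returns a QuerySet" while keeping the scope editable from the UI.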
+
+from django.db import models
+
+class AlertDestination(models.Model):
+    destination_type = models.CharField(max_length=32)  # email, slack, webhook, google_sheet, local logfile, b2/s3/gcp bucket, etc.
+    maximum_frequency = models.DurationField()          # send at most one alert per this interval
+    filter_rules = models.JSONField(default=dict)       # rules limiting which alerts get delivered
+    credentials = models.JSONField(default=dict)        # API key / token / address for the destination
+    alert_template = models.TextField()                 # Jinja2 JSON/text template that gets populated with alert contents