@@ -0,0 +1,172 @@
+# ArchiveBox UI
+
+## Page: Getting Started
+
+### What do you want to capture?
+
+- Save some URLs now -> [Add page]
+  - Paste some URLs to archive now
+  - Upload a file containing URLs (bookmarks.html export, RSS.xml feed, markdown file, word doc, PDF, etc.)
+  - Pull in URLs to archive from a remote location (e.g. RSS feed URL, remote TXT file, JSON file, etc.)
+
+- Import URLs from a browser -> [Import page]
+  - Desktop: Get the ArchiveBox Chrome/Firefox extension
+  - Mobile: Get the ArchiveBox iOS App / Android App
+  - Upload a bookmarks.html export file
+  - Upload a browser_history.sqlite3 export file
+
+- Import URLs from a 3rd party bookmarking service -> [Sync page]
+  - Pocket
+  - Pinboard
+  - Instapaper
+  - Wallabag
+  - Zapier, N8N, IFTTT, etc.
+  - Upload a bookmarks.html export, bookmarks.json, RSS, etc. file
+
+- Archive URLs on a schedule -> [Schedule page]
+
+- Archive an entire website -> [Crawl page]
+  - What starting URL/domain?
+  - How deep?
+  - Follow links to external domains?
+  - Follow links to parent URLs?
+  - Maximum number of pages to save?
+  - Maximum number of requests/minute?
+
+- Crawl for URLs with a search engine and save the results automatically
+  - Save results matching a search query (e.g. "site:example.com")
+
+- Save a social media feed (e.g. `https://x.com/user/1234567890`)
+
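+Each bracketed destination above could be a plain Django URL route. A minimal sketch, assuming hypothetical view names like `AddView` and `CrawlView` (none of these routes or names are final):
+
+```python
+# urls.py -- hypothetical routes for the pages referenced above
+from django.urls import path
+from . import views  # assumed views module with one class-based view per page
+
+urlpatterns = [
+    path('add/', views.AddView.as_view(), name='add'),                # paste/upload URLs now
+    path('import/', views.ImportView.as_view(), name='import'),       # browser extensions + export files
+    path('sync/', views.SyncView.as_view(), name='sync'),             # 3rd party bookmarking services
+    path('schedule/', views.ScheduleView.as_view(), name='schedule'), # recurring archiving
+    path('crawl/', views.CrawlView.as_view(), name='crawl'),          # whole-site archiving
+]
+```
+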
+--------------------------------------------------------------------------------
+
+### Crawls App
+
+- Archive an entire website -> [Crawl page]
+  - What are the seed URLs?
+  - How many hops to follow?
+  - Follow links to external domains?
+  - Follow links to parent URLs?
+  - Maximum number of pages to save?
+  - Maximum number of requests/minute?
+
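+The questions above could map one-to-one onto model fields. A minimal sketch; field names like `seed_urls` and `max_depth` are assumptions, not a final schema:
+
+```python
+# models.py -- hypothetical Crawl model sketch (names/types are assumptions)
+from django.db import models
+
+class Crawl(models.Model):
+    seed_urls = models.JSONField(default=list)            # starting URLs for the crawl
+    max_depth = models.PositiveIntegerField(default=1)    # how many hops to follow from the seeds
+    follow_external = models.BooleanField(default=False)  # follow links to external domains?
+    follow_parents = models.BooleanField(default=False)   # follow links to parent URLs?
+    max_pages = models.PositiveIntegerField(null=True, blank=True)                # cap on pages saved (None = no cap)
+    max_requests_per_minute = models.PositiveIntegerField(null=True, blank=True)  # politeness rate limit
+```
+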
+--------------------------------------------------------------------------------
+
+### Scheduler App
+
+- Archive URLs on a schedule -> [Schedule page]
+  - What URL(s)?
+  - How often?
+  - Do you want to discard old snapshots after x amount of time?
+  - Any filter rules?
+  - Want to be notified when changes are detected? -> redirect to [Alerts app / create new alert(crawl=self)]
+
+- Choose Schedule to check for new URLs: Schedule.objects.get(pk=xyz)
+  - 1 minute
+  - 5 minutes
+  - 1 hour
+  - 1 day
+
+- Choose Destination Crawl to archive the URLs with: Crawl.objects.get(pk=xyz)
+  - Tags
+  - Persona
+  - Created By ID
+  - Config
+  - Filters
+    - URL patterns to include
+    - URL patterns to exclude
+    - ONLY_NEW= ignore URLs already saved once / save a URL each time it appears / only save if the last save was > x time ago
+
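+The two `objects.get(pk=xyz)` pickers above imply a Schedule that points at a Destination Crawl. A minimal sketch; field names like `interval` and `only_new` are assumptions:
+
+```python
+# models.py -- hypothetical Schedule model sketch (names/types are assumptions)
+from django.db import models
+
+class Schedule(models.Model):
+    interval = models.DurationField()                             # how often to check: 1 min / 5 min / 1 hour / 1 day
+    crawl = models.ForeignKey('Crawl', on_delete=models.CASCADE)  # Destination Crawl (carries tags, persona, config)
+    discard_after = models.DurationField(null=True, blank=True)   # discard old snapshots after this long (None = keep)
+    filter_rules = models.JSONField(default=dict)                 # URL patterns to include/exclude
+    only_new = models.CharField(max_length=16, default='once')    # ONLY_NEW: 'once' / 'every_time' / 'if_stale'
+```
+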
+--------------------------------------------------------------------------------
+
+### Sources App (for managing the sources that ArchiveBox pulls URLs in from)
+
+- Add a new source to pull URLs in from (WIZARD)
+  - [x] Web UI
+  - [x] CLI
+  - Choose URI:
+    - Local server filesystem path (directory to monitor for new files containing URLs)
+    - Remote URL (RSS/JSON/XML feed)
+    - Chrome browser profile sync (log in with a Google account to pull bookmarks/history)
+    - Pocket, Pinboard, Instapaper, Wallabag, etc.
+    - Zapier, N8N, IFTTT, etc.
+    - Google Drive (directory to monitor for new files containing URLs)
+    - Remote server FTP/SFTP/SCP path (directory to monitor for new files containing URLs)
+    - AWS/S3/B2/GCP bucket (directory to monitor for new files containing URLs)
+    - XBrowserSync (log in to pull bookmarks)
+  - Choose extractor
+    - auto
+    - RSS
+    - Pocket
+    - etc.
+  - Specify extra Config, e.g.
+    - credentials
+    - extractor tuning options (e.g. verify_ssl, cookies, etc.)
+
+- Provide credentials for the source
+  - API Key
+  - Username / Password
+  - OAuth
+
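+Each completed wizard run could persist as one row holding the chosen URI, extractor, and config. A minimal sketch; field names like `uri` and `extractor` are assumptions, not ArchiveBox's actual schema:
+
+```python
+# models.py -- hypothetical Source model sketch (names/types are assumptions)
+from django.db import models
+
+class Source(models.Model):
+    uri = models.CharField(max_length=2048)                      # file path, feed URL, bucket URI, service name, etc.
+    extractor = models.CharField(max_length=32, default='auto')  # 'auto' / 'rss' / 'pocket' / ...
+    config = models.JSONField(default=dict)                      # extractor tuning options (verify_ssl, cookies, ...)
+    credentials = models.JSONField(default=dict)                 # API key / username+password / OAuth tokens
+```
+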
+--------------------------------------------------------------------------------
+
+### Alerts App
+
+- Create a new alert, choose condition
+  - Get notified when a site goes down (<x% success ratio for Snapshots)
+  - Get notified when a site changes visually more than x% (screenshot diff)
+  - Get notified when a site's text content changes more than x% (text diff)
+  - Get notified when a keyword appears
+  - Get notified when a keyword disappears
+  - Get notified when an AI prompt returns some result
+
+- Choose alert threshold:
+  - any condition is met
+  - all conditions are met
+  - condition is met for x% of URLs
+  - condition is met for x% of time
+
+- Choose how to notify: (List[AlertDestination])
+  - maximum alert frequency
+  - destination type: email / Slack / Webhook / Google Sheet / logfile
+  - destination info:
+    - email address(es)
+    - Slack channel
+    - Webhook URL
+
+- Choose scope:
+  - Choose ArchiveResult scope (extractors): (a query that returns an ArchiveResult QuerySet)
+    - All extractors
+    - Only screenshots
+    - Only readability / mercury text
+    - Only video
+    - Only HTML
+    - Only headers
+
+  - Choose Snapshot scope (URL): (a query that returns a Snapshot QuerySet)
+    - All domains
+    - Specific domain
+    - All domains in a tag
+    - All domains in a tag category
+    - All URLs matching a certain regex pattern
+
+  - Choose Crawl scope: (a query that returns a Crawl QuerySet)
+    - All crawls
+    - Specific crawls
+    - Crawls by a certain user
+    - Crawls using a certain persona
+
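+Pulling the condition, threshold, scope, and destination choices together, an Alert row might look roughly like this. A minimal sketch; every field name and type is an assumption, not a final schema:
+
+```python
+# models.py -- hypothetical Alert model sketch (all names/types are assumptions)
+from django.db import models
+
+class Alert(models.Model):
+    condition = models.CharField(max_length=32)             # 'site_down' / 'screenshot_diff' / 'text_diff' / 'keyword' / 'ai_prompt'
+    threshold = models.JSONField(default=dict)              # e.g. {'mode': 'any'} or {'mode': 'pct_urls', 'pct': 50}
+    archiveresult_filter = models.JSONField(default=dict)   # kwargs for ArchiveResult.objects.filter(...)
+    snapshot_filter = models.JSONField(default=dict)        # kwargs for Snapshot.objects.filter(...)
+    crawl_filter = models.JSONField(default=dict)           # kwargs for Crawl.objects.filter(...)
+    destinations = models.ManyToManyField('AlertDestination')  # where to deliver alerts (see class below)
+```
+
+Storing each scope as serialized filter kwargs is one way to persist "a query that returns a QuerySet" while keeping the scope editable from the UI.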
+
+from django.db import models
+
+class AlertDestination(models.Model):
+    destination_type = models.CharField(max_length=32)  # email, slack, webhook, google_sheet, local logfile, b2/s3/gcp bucket, etc.
+    maximum_frequency = models.DurationField()          # send at most one alert per this interval
+    filter_rules = models.JSONField(default=dict)       # rules limiting which alerts get delivered
+    credentials = models.JSONField(default=dict)        # API key / token / address for the destination
+    alert_template = models.TextField()                 # Jinja2 JSON/text template that gets populated with alert contents