
collapse README sections to reduce length and link to PUID PGID and root_squash info

Nick Sweeting 2 years ago
parent commit 64bfd7667e
1 changed file with 52 additions and 1 deletion: README.md


@@ -564,12 +564,22 @@ MAX_MEDIA_SIZE=1500m       # default: 750m  raise/lower youtubedl output size
 PUBLIC_INDEX=True          # default: True  whether anon users can view index
 PUBLIC_SNAPSHOTS=True      # default: True  whether anon users can view pages
 PUBLIC_ADD_VIEW=False      # default: False whether anon users can add new URLs
+
+CHROME_USER_AGENT="Mozilla/5.0 ..."  # change these to get around bot blocking
+WGET_USER_AGENT="Mozilla/5.0 ..."
+CURL_USER_AGENT="Mozilla/5.0 ..."
 ```
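For example, all three can point at the same browser-like string so every extractor presents a consistent identity (a sketch in the same config format as above; the UA value is an illustrative placeholder, not a recommendation, and whether any given string bypasses bot detection varies by site):

```
# Illustrative placeholder UA — pick a string matching a real browser you use
CHROME_USER_AGENT="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.0.0 Safari/537.36"
WGET_USER_AGENT="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.0.0 Safari/537.36"
CURL_USER_AGENT="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.0.0 Safari/537.36"
```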
 
 <br/>
 
 ## Dependencies
 
+To achieve high-fidelity archives in as many situations as possible, ArchiveBox depends on a variety of high-quality 3rd-party tools and libraries that specialize in extracting different types of content.
+
+<br/>
+<details>
+<summary><i>Expand to learn more about ArchiveBox's dependencies...</i></summary>
+
 For better security, easier updating, and to avoid polluting your host system with extra dependencies, **it is strongly recommended to use the official [Docker image](https://github.com/ArchiveBox/ArchiveBox/wiki/Docker)** with everything pre-installed for the best experience.
 
 These optional dependencies used for archiving sites include:
@@ -601,12 +611,18 @@ Installing directly on **Windows without Docker or WSL/WSL2/Cygwin is not offici
 
 For detailed information about upgrading ArchiveBox and its dependencies, see: https://github.com/ArchiveBox/ArchiveBox/wiki/Upgrading-or-Merging-Archives
 
+</details>
+
 <br/>
 
 ## Archive Layout
 
 All of ArchiveBox's state (including the index, snapshot data, and config file) is stored in a single folder called the "ArchiveBox data folder". All `archivebox` CLI commands must be run from inside this folder, and you first create it by running `archivebox init`.
 
+<br/>
+<details>
+<summary><i>Expand to learn more about ArchiveBox's on-disk data layout...</i></summary>
+
 The on-disk layout is optimized to be easy to browse by hand and durable long-term. The main index is a standard `index.sqlite3` database in the root of the data folder (it can also be exported as static JSON/HTML), and the archive snapshots are organized by date-added timestamp in the `./archive/` subfolder.
 
 <img src="https://user-images.githubusercontent.com/511499/117453293-c7b91600-af12-11eb-8a3f-aa48b0f9da3c.png" width="400px" align="right">
@@ -630,12 +646,17 @@ The on-disk layout is optimized to be easy to browse by hand and durable long-te
 
 Each snapshot subfolder `./archive/<timestamp>/` includes a static `index.json` and `index.html` describing its contents, and the snapshot extractor outputs are plain files within the folder.
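Because the layout is plain files, snapshot metadata is easy to read programmatically. A minimal sketch, assuming the `./archive/<timestamp>/index.json` layout described above (the demo folder name and the `url` field are illustrative assumptions):

```python
import json
from pathlib import Path

def snapshot_index(data_dir: str, timestamp: str) -> dict:
    """Load the static index.json from one snapshot subfolder."""
    index_path = Path(data_dir) / 'archive' / timestamp / 'index.json'
    return json.loads(index_path.read_text())

# Build a tiny fake data folder just to demonstrate the layout
snap = Path('demo_data') / 'archive' / '1617836952.339045'
snap.mkdir(parents=True, exist_ok=True)
(snap / 'index.json').write_text(json.dumps({'url': 'https://example.com'}))

print(snapshot_index('demo_data', '1617836952.339045')['url'])
# → https://example.com
```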
 
+</details>
 
 <br/>
 
 ## Static Archive Exporting
 
-You can export the main index to browse it statically without needing to run a server.
+You can export the main index to browse it statically as plain HTML files in a folder (without needing to run a server).
+
+<br/>
+<details>
+<summary><i>Expand to learn how to export your ArchiveBox collection...</i></summary>
 
 > **Note**
 > These exports are not paginated; exporting many URLs or the entire archive at once may be slow. Use the filtering CLI flags on the `archivebox list` command to export specific Snapshots or ranges.
@@ -652,6 +673,7 @@ archivebox list --csv=timestamp,url,title > index.csv  # export to csv spreadshe
 
 The paths in the static exports are relative; make sure to keep them next to your `./archive` folder when backing them up or viewing them.
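The CSV export is plain text, so any spreadsheet tool or script can consume it. A minimal sketch of reading it back with Python's stdlib (the sample rows are made up for illustration):

```python
import csv
import io

# Simulated contents of: archivebox list --csv=timestamp,url,title > index.csv
index_csv = io.StringIO(
    "timestamp,url,title\n"
    "1617836952.339045,https://example.com,Example Domain\n"
)

rows = list(csv.DictReader(index_csv))
print(rows[0]['url'])
# → https://example.com
```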
 
+</details>
 
 <br/>
 
@@ -667,6 +689,10 @@ The paths in the static exports are relative, make sure to keep them next to you
 
 <a id="archiving-private-urls"></a>
 
+<br/>
+<details>
+<summary><i>Click to expand...</i></summary>
+
 If you're importing pages with private content or URLs containing secret tokens you don't want public (e.g. Google Docs, paywalled content, unlisted videos, etc.), **you may want to disable some of the extractor methods to avoid leaking that content to 3rd-party APIs or the public**.
 
 ```bash
@@ -687,8 +713,16 @@ archivebox config --set SAVE_FAVICON=False          # disable favicon fetching (
 archivebox config --set CHROME_BINARY=chromium      # ensure it's using Chromium instead of Chrome
 ```
 
+</details>
+<br/>
+
+
 ### Security Risks of Viewing Archived JS
 
+<br/>
+<details>
+<summary><i>Click to expand...</i></summary>
+
 Be aware that malicious archived JS can access the contents of other pages in your archive when viewed. Because the Web UI serves all viewed snapshots from a single domain, they share a request context and **typical CSRF/CORS/XSS/CSP protections do not work to prevent cross-site request attacks**. See the [Security Overview](https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview#stealth-mode) page and [Issue #239](https://github.com/ArchiveBox/ArchiveBox/issues/239) for more details.
 
 ```bash
@@ -705,8 +739,15 @@ The admin UI is also served from the same origin as replayed JS, so malicious pa
 
 *Note: Only the `wget` & `dom` extractor methods execute archived JS when viewing snapshots; all other archive methods produce static output that does not execute JS on viewing. If you are worried about these issues, you should disable those extractors using `archivebox config --set SAVE_WGET=False SAVE_DOM=False`.*
 
+</details>
+<br/>
+
 ### Saving Multiple Snapshots of a Single URL
 
+<br/>
+<details>
+<summary><i>Click to expand...</i></summary>
+
 First-class support for saving multiple snapshots of each site over time will be [added eventually](https://github.com/ArchiveBox/ArchiveBox/issues/179) (along with the ability to view diffs of the changes between runs). For now **ArchiveBox is designed to only archive each unique URL with each extractor type once**. The workaround to take multiple snapshots of the same URL is to make them slightly different by adding a hash:
 
 ```bash
@@ -717,12 +758,22 @@ archivebox add 'https://example.com#2020-10-25'
 
 The <img src="https://user-images.githubusercontent.com/511499/115942091-73c02300-a476-11eb-958e-5c1fc04da488.png" alt="Re-Snapshot Button" height="24px"/> button in the Admin UI is a shortcut for this hash-date workaround.
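The hash-date workaround is also easy to script when adding URLs in bulk. A small sketch that builds a uniquely-suffixed URL per date (the ISO date format is arbitrary; any unique fragment works):

```python
from datetime import date

def dated_url(url: str, on: date) -> str:
    """Append a date fragment so ArchiveBox treats it as a new unique URL."""
    return f"{url}#{on.isoformat()}"

print(dated_url('https://example.com', date(2020, 10, 25)))
# → https://example.com#2020-10-25
# The result can then be passed to: archivebox add '<url>'
```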
 
+</details>
+<br/>
+
 ### Storage Requirements
 
+<br/>
+<details>
+<summary><i>Click to expand...</i></summary>
+
 Because ArchiveBox is designed to ingest a firehose of browser history and bookmark feeds to a local disk, it can be much more disk-space intensive than a centralized service like the Internet Archive or Archive.today. **ArchiveBox can use anywhere from ~1gb to ~50gb per 1000 articles**, depending mostly on whether you're saving audio & video using `SAVE_MEDIA=True` and whether you lower `MEDIA_MAX_SIZE=750mb`.
 
 Disk usage can be reduced by using a compressed/deduplicated filesystem like ZFS/BTRFS, or by turning off extractor methods you don't need. You can also deduplicate content with a tool like [fdupes](https://github.com/adrianlopezroche/fdupes) or [rdfind](https://github.com/pauldreik/rdfind). **Don't store large collections on older filesystems like EXT3/FAT**, as they may not be able to handle more than 50k directory entries in the `archive/` folder. **Try to keep the `index.sqlite3` file on a local drive (not a network mount)** or SSD for maximum performance; the `archive/` folder, however, can live on a network mount or slower HDD.
 
+If using Docker or NFS/SMB/FUSE for the `data/archive/` folder, you may need to set [`PUID` & `PGID`](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#puid--pgid) and [disable `root_squash`](https://github.com/ArchiveBox/ArchiveBox/issues/1304) on your fileshare server.
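A back-of-the-envelope estimator based on the ~1gb to ~50gb per 1000 articles range above (the per-article figures are the rough bounds quoted in this README, not measurements of your workload):

```python
def estimated_storage_gb(num_articles: int, save_media: bool = True) -> float:
    """Rough disk-usage estimate: ~1gb per 1000 articles without media,
    up to ~50gb per 1000 articles with SAVE_MEDIA=True (upper-bound guess)."""
    gb_per_1000 = 50.0 if save_media else 1.0
    return num_articles / 1000 * gb_per_1000

print(estimated_storage_gb(10_000, save_media=False))  # → 10.0
print(estimated_storage_gb(10_000, save_media=True))   # → 500.0
```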
+
+</details>
 <br/>
 
 ---