extra/bulk_download/README.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40


## Download Fatcat Fulltext from web.archive.org in Bulk

These quick-and-dirty directions use UNIX utilities to download from the
Internet Archive (either in the wayback machine or archive.org). To make a
proper mirror (eg, for research or preservation use), you would want to verify
hashes (fixity), handle additional retries, and handle files which are not
preserved in Internet Archive, retain linkage between files and fatcat
identifiers, etc.

You can download a file entity dump from the most recent "Bulk Metadata Export"
item from the [snapshots and exports collection](https://archive.org/details/fatcat_snapshots_and_exports?sort=-publicdate).

Create a TSV file containing the SHA1 and a single URL for each file
entity:

    zcat file_export.json.gz \
        | grep '"application/pdf"'
        | jq -cr '.sha1 as $sha1 | .urls | map(select((.url | startswith("https://web.archive.org/web/")) or (.url | startswith("https://archive.org/download/")))) | select(. != []) | [$sha1, .[0].url] | @tsv' \
        > fatcat_files_sha1_iaurl.tsv

Then use the GNU `parallel` command to call `curl` in parallel to fetch files.
The `-j` argument controls parallelism. Please don't create exessive load on
Internet Archive infrastructure by downloading with too many threads. 10
parallel threads is a decent amount of load.

    cat fatcat_files_sha1_iaurl.tsv \
        | awk '{print "curl -Lfs --write-out \"%{http_code}\\t" $1 "\\t%{url_effective}\\n\" \"" $2 "\" -o ", $1 ".pdf"}' \
        | parallel --bar -j4 {} \
        > fetch_status.log

This will write out a status log containing the HTTP status code, expected file
SHA1, and attempted URL. You can check for errors (and potentially try) with:

    grep -v "^200" fetch_status.log

Or, count status codes:

    awk '{s[$1]++} END {for(k in s){print k, s[k]}}' fetch_status.log | sort -nrk2