datacite: resolve formatting issues in tests

author: Martin Czygan <martin@archive.org> 2020-07-10 21:32:41 +0000
committer: Martin Czygan <martin@archive.org> 2020-07-10 21:32:41 +0000
commit: 3c266e07771271241aa8cff3e3199a45109362af (patch)
tree: 73fa6aedf1bbfeffeac9c94593f5f9c4f2dd645b /extra/bulk_download
parent: fdf1028c19b0623e30b91e49ffa65ed130dcfdc1 (diff)
parent: c9d8550be4bab808c2bad0b0d3642a71075202c0 (diff)
download: fatcat-3c266e07771271241aa8cff3e3199a45109362af.tar.gz
fatcat-3c266e07771271241aa8cff3e3199a45109362af.zip
1 files changed, 40 insertions, 0 deletions
diff --git a/extra/bulk_download/README.md b/extra/bulk_download/README.md
new file mode 100644
index 00000000..83b92fd9
--- /dev/null
+++ b/extra/bulk_download/README.md
@@ -0,0 +1,40 @@
+
+## Download Fatcat Fulltext from web.archive.org in Bulk
+
+These quick-and-dirty directions use UNIX utilities to download from the
+Internet Archive (either the Wayback Machine or archive.org). To make a
+proper mirror (e.g., for research or preservation use), you would want to
+verify hashes (fixity), handle additional retries, handle files which are
+not preserved in the Internet Archive, retain linkage between files and
+fatcat identifiers, etc.
+
+You can download a file entity dump from the most recent "Bulk Metadata Export"
+item from the [snapshots and exports collection](https://archive.org/details/fatcat_snapshots_and_exports?sort=-publicdate).
+
+Create a TSV file containing the SHA1 and a single URL for each file
+entity:
+
+    zcat file_export.json.gz \
+        | grep '"application/pdf"' \
+        | jq -cr '.sha1 as $sha1 | .urls | map(select((.url | startswith("https://web.archive.org/web/")) or (.url | startswith("https://archive.org/download/")))) | select(. != []) | [$sha1, .[0].url] | @tsv' \
+        > fatcat_files_sha1_iaurl.tsv
+
+Then use the GNU `parallel` command to call `curl` in parallel to fetch files.
+The `-j` argument controls parallelism. Please don't create excessive load on
+Internet Archive infrastructure by downloading with too many threads. 10
+parallel threads is a decent amount of load.
+
+    cat fatcat_files_sha1_iaurl.tsv \
+        | awk '{print "curl -Lfs --write-out \"%{http_code}\\t" $1 "\\t%{url_effective}\\n\" \"" $2 "\" -o ", $1 ".pdf"}' \
+        | parallel --bar -j4 {} \
+        > fetch_status.log
+
+This will write out a status log containing the HTTP status code, expected file
+SHA1, and attempted URL. You can check for errors (and potentially retry) with:
+
+    grep -v "^200" fetch_status.log
+
+Or, count status codes:
+
+    cut -f1 fetch_status.log | sort | uniq -c | sort -nr
+
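Note on the README above: it lists hash verification (fixity) as something a proper mirror would need but does not show it. Because the `curl` step saves each file as `<sha1>.pdf`, one possible check is to recompute the SHA1 of every downloaded file and compare it with the file name. The following is a minimal sketch, not part of this commit: `verify_fixity` is a hypothetical helper name, and it assumes GNU coreutils `sha1sum` is available.

```shell
# Sketch: check that each downloaded <sha1>.pdf matches the SHA1 in its name.
# verify_fixity is a hypothetical helper, not part of fatcat.
verify_fixity() {
    dir="${1:-.}"
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue              # no matches: glob stays literal
        expected="$(basename "$f" .pdf)"     # SHA1 taken from the file name
        actual="$(sha1sum "$f" | cut -d' ' -f1)"
        if [ "$expected" = "$actual" ]; then
            printf 'OK\t%s\n' "$expected"
        else
            printf 'MISMATCH\t%s\t%s\n' "$expected" "$actual"
        fi
    done
}
```

Running `verify_fixity . | grep -v '^OK'` in the download directory would then list only the files whose content does not match the expected SHA1, for example truncated downloads.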