author | Martin Czygan <martin@archive.org> | 2020-07-10 21:32:41 +0000 |
---|---|---|
committer | Martin Czygan <martin@archive.org> | 2020-07-10 21:32:41 +0000 |
commit | 3c266e07771271241aa8cff3e3199a45109362af (patch) | |
tree | 73fa6aedf1bbfeffeac9c94593f5f9c4f2dd645b /extra | |
parent | fdf1028c19b0623e30b91e49ffa65ed130dcfdc1 (diff) | |
parent | c9d8550be4bab808c2bad0b0d3642a71075202c0 (diff) | |
datacite: resolve formatting issues in tests
Diffstat (limited to 'extra')
-rw-r--r-- | extra/bulk_download/README.md | 40 |
-rw-r--r-- | extra/elasticsearch/sql_queries.md | 8 |
2 files changed, 48 insertions, 0 deletions
diff --git a/extra/bulk_download/README.md b/extra/bulk_download/README.md
new file mode 100644
index 00000000..83b92fd9
--- /dev/null
+++ b/extra/bulk_download/README.md
@@ -0,0 +1,40 @@
+
+## Download Fatcat Fulltext from web.archive.org in Bulk
+
+These quick-and-dirty directions use UNIX utilities to download from the
+Internet Archive (either the Wayback Machine or archive.org). To make a
+proper mirror (e.g., for research or preservation use), you would want to verify
+hashes (fixity), handle additional retries, handle files which are not
+preserved in the Internet Archive, retain linkage between files and fatcat
+identifiers, etc.
+
+You can download a file entity dump from the most recent "Bulk Metadata Export"
+item from the [snapshots and exports collection](https://archive.org/details/fatcat_snapshots_and_exports?sort=-publicdate).
+
+Create a TSV file containing the SHA1 and a single URL for each file
+entity:
+
+    zcat file_export.json.gz \
+        | grep '"application/pdf"' \
+        | jq -cr '.sha1 as $sha1 | .urls | map(select((.url | startswith("https://web.archive.org/web/")) or (.url | startswith("https://archive.org/download/")))) | select(. != []) | [$sha1, .[0].url] | @tsv' \
+        > fatcat_files_sha1_iaurl.tsv
+
+Then use the GNU `parallel` command to call `curl` in parallel to fetch files.
+The `-j` argument controls parallelism. Please don't create excessive load on
+Internet Archive infrastructure by downloading with too many threads; 10
+parallel threads is already a decent amount of load (the example uses 4).
+
+    cat fatcat_files_sha1_iaurl.tsv \
+        | awk '{print "curl -Lfs --write-out \"%{http_code}\\t" $1 "\\t%{url_effective}\\n\" \"" $2 "\" -o ", $1 ".pdf"}' \
+        | parallel --bar -j4 {} \
+        > fetch_status.log
+
+This will write out a status log containing the HTTP status code, expected file
+SHA1, and final URL (after redirects). You can check for errors (and potentially retry) with:
+
+    grep -v "^200" fetch_status.log
+
+Or, count status codes:
+
+    cut -f1 fetch_status.log | sort | uniq -c | sort -nr
+
diff --git a/extra/elasticsearch/sql_queries.md b/extra/elasticsearch/sql_queries.md
new file mode 100644
index 00000000..3ea168e5
--- /dev/null
+++ b/extra/elasticsearch/sql_queries.md
@@ -0,0 +1,8 @@
+
+Top missing OA journals by `container_id`:
+
+    POST _xpack/sql?format=txt
+    {
+        "query": "SELECT container_id, count(*) from fatcat_release WHERE preservation = 'none' AND is_oa = true GROUP BY container_id ORDER BY count(*) DESC LIMIT 20"
+    }
+
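
Note on fixity: the README added above says a proper mirror would verify hashes, and its download step saves each file as `<sha1>.pdf`. A minimal sketch of such a check follows. It is not part of this commit. The function name `verify_sha1_dir` is hypothetical, and it assumes GNU coreutils `sha1sum` is available (on some systems the equivalent tool is `shasum`).

```shell
# Hypothetical helper, not part of the commit: for every <sha1>.pdf in a
# directory, recompute the SHA1 and compare it with the name of the file.
# Assumes GNU coreutils `sha1sum` is on the PATH.
verify_sha1_dir() {
    dir="${1:-.}"
    for f in "$dir"/*.pdf; do
        # If nothing matched, the glob stays literal; skip it.
        [ -e "$f" ] || continue
        expected="$(basename "$f" .pdf)"
        actual="$(sha1sum "$f" | cut -d' ' -f1)"
        if [ "$actual" = "$expected" ]; then
            printf 'OK\t%s\n' "$expected"
        else
            printf 'MISMATCH\t%s\t%s\n' "$expected" "$actual"
        fi
    done
}
```

Running `verify_sha1_dir . | grep '^MISMATCH'` in the download directory would list files whose content does not match the expected hash, for example HTML error pages saved under a `.pdf` name. These are candidates for a retry.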