author    Bryan Newbold <bnewbold@archive.org>    2022-02-08 17:49:39 -0800
committer Bryan Newbold <bnewbold@archive.org>    2022-02-08 17:49:50 -0800
commit    3a6fc1f1c26885fd7a44b13ee156fcdb61e6aadd (patch)
tree      077afcd3c48553dbc65760db047b2e81ba080a73 /notes/ingest/2022-01-13_doi_crawl.md
parent    067c97a59a4a8728add7b9e561082a5403be52e5 (diff)
more patch crawling
Diffstat (limited to 'notes/ingest/2022-01-13_doi_crawl.md')
-rw-r--r--    notes/ingest/2022-01-13_doi_crawl.md    80
1 file changed, 72 insertions(+), 8 deletions(-)
diff --git a/notes/ingest/2022-01-13_doi_crawl.md b/notes/ingest/2022-01-13_doi_crawl.md
index 6f3b2c8..09a3b46 100644
--- a/notes/ingest/2022-01-13_doi_crawl.md
+++ b/notes/ingest/2022-01-13_doi_crawl.md
@@ -1,6 +1,8 @@
Could roll this into the current patch crawl instead of starting a new crawl from scratch.
+This file is misnamed; these are mostly non-DOI-specific small updates.
+
## KBART "almost complete" experimentation
Random 10 releases:
@@ -133,15 +135,12 @@ many of these are likely to crawl successfully.
| pv -l \
| gzip \
> /srv/fatcat/tasks/ingest_nonoa_doi.json.gz
- # Expecting 8255693 release objects in search queries
-
-## Seeds: not daily, but OA DOI
+ # re-running 2022-02-08 after this VM was upgraded
+ # Expecting 8321448 release objects in search queries
+ # TODO: in-progress
-There are a bunch of things we are no longer attempting daily, but should do
-heritrix crawls of periodically.
-
-TODO: maybe in daily crawling, should check container coverage and see if most URLs are bright, and if so do ingest? hrm
-TODO: What are they? zenodo.org?
+This is large enough that it will probably be a bulk ingest, and then probably
+a follow-up crawl.
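+
+A minimal sketch of the likely bulk-ingest enqueue step (not yet run). This
+assumes the usual kafkacat pattern for sandcrawler bulk ingest; the broker
+hostname is a placeholder and the topic name is an assumption:
+
+    # KAFKA_BROKER is a placeholder; topic name assumed, not verified here
+    zcat /srv/fatcat/tasks/ingest_nonoa_doi.json.gz \
+        | pv -l \
+        | kafkacat -P -b $KAFKA_BROKER -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+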
## Seeds: HTML and XML links from HTML biblio
@@ -152,6 +151,71 @@ TODO: What are they? zenodo.org?
| gzip \
> ingest_file_result_fulltext_urls.2022-01-13.json.gz
+ # was this cut off at some point? gzip stream is truncated (salvage sketch below)
+
+ zcat ingest_file_result_fulltext_urls.2022-01-13.json.gz | wc -l
+ # gzip: ingest_file_result_fulltext_urls.2022-01-13.json.gz: unexpected end of file
+ # 2,538,433
+
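+One way to salvage the truncated dump, as a hedged sketch: re-compress whatever
+zcat can still read (the final JSON line may be incomplete and might need to be
+dropped); the `.clean` output filename is just an example:
+
+    # suppress the trailing "unexpected end of file" error, keep the readable prefix
+    zcat ingest_file_result_fulltext_urls.2022-01-13.json.gz 2>/dev/null \
+        | pv -l \
+        | gzip \
+        > ingest_file_result_fulltext_urls.2022-01-13.clean.json.gz
+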
+Prepare seedlists (to include in heritrix patch crawl):
+
+ zcat ingest_file_result_fulltext_urls.2022-01-13.json.gz \
+ | jq .html_biblio.xml_fulltext_url -r \
+ | rg '://' \
+ | sort -u -S 4G \
+ | pv -l \
+ | gzip \
+ > ingest_file_result_fulltext_urls.2022-01-13.xml_urls.txt.gz
+ # 1.24M 0:01:35 [12.9k/s]
+
+ zcat ingest_file_result_fulltext_urls.2022-01-13.json.gz \
+ | jq .html_biblio.html_fulltext_url -r \
+ | rg '://' \
+ | sort -u -S 4G \
+ | pv -l \
+ | gzip \
+ > ingest_file_result_fulltext_urls.2022-01-13.html_urls.txt.gz
+ # 549k 0:01:27 [6.31k/s]
+
+ zcat ingest_file_result_fulltext_urls.2022-01-13.xml_urls.txt.gz ingest_file_result_fulltext_urls.2022-01-13.html_urls.txt.gz \
+ | cut -f3 -d/ \
+ | sort -S 4G \
+ | uniq -c \
+ | sort -nr \
+ | head -n20
+
+ 534005 dlc.library.columbia.edu
+ 355319 www.degruyter.com
+ 196421 zenodo.org
+ 101450 serval.unil.ch
+ 100631 biblio.ugent.be
+ 47986 digi.ub.uni-heidelberg.de
+ 39187 www.emerald.com
+ 33195 www.cairn.info
+ 25703 boris.unibe.ch
+ 19516 journals.openedition.org
+ 15911 academic.oup.com
+ 11091 repository.dl.itc.u-tokyo.ac.jp
+ 9847 oxfordworldsclassics.com
+ 9698 www.thieme-connect.de
+ 9552 www.idunn.no
+ 9265 www.zora.uzh.ch
+ 8030 www.scielo.br
+ 6543 www.hanspub.org
+ 6229 asmedigitalcollection.asme.org
+ 5651 brill.com
+
+ zcat ingest_file_result_fulltext_urls.2022-01-13.xml_urls.txt.gz ingest_file_result_fulltext_urls.2022-01-13.html_urls.txt.gz \
+ | awk '{print "F+ " $1}' \
+ > ingest_file_result_fulltext_urls.2022-01-13.xml_and_html.schedule
+
+ wc -l ingest_file_result_fulltext_urls.2022-01-13.xml_and_html.schedule
+ 1785901 ingest_file_result_fulltext_urls.2022-01-13.xml_and_html.schedule
+
+Added to `JOURNALS-PATCH-CRAWL-2022-01`
+
## Seeds: most doi.org terminal non-success
Unless it is a 404, should retry.
+
+TODO: generate this list
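+
+A hedged sketch of how this list might be generated (not run): it assumes a
+local sandcrawler-db with an `ingest_file_result` table having `terminal_url`,
+`hit`, and `terminal_status_code` columns; database, table, and column names
+are assumptions, and the output filename is just an example.
+
+    psql sandcrawler -c "COPY (
+        SELECT terminal_url
+        FROM ingest_file_result
+        WHERE terminal_url LIKE '%//doi.org/%'
+          AND hit = false
+          AND (terminal_status_code IS NULL OR terminal_status_code != 404)
+    ) TO STDOUT" \
+        | sort -u -S 4G \
+        | pv -l \
+        | gzip \
+        > doi_terminal_nonsuccess_urls.$(date -I).txt.gz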