author    Bryan Newbold <bnewbold@archive.org>    2022-02-08 17:49:39 -0800
committer Bryan Newbold <bnewbold@archive.org>    2022-02-08 17:49:50 -0800
commit    3a6fc1f1c26885fd7a44b13ee156fcdb61e6aadd (patch)
tree      077afcd3c48553dbc65760db047b2e81ba080a73 /notes/ingest/2022-01-13_doi_crawl.md
parent    067c97a59a4a8728add7b9e561082a5403be52e5 (diff)
more patch crawling
Diffstat (limited to 'notes/ingest/2022-01-13_doi_crawl.md')
-rw-r--r--    notes/ingest/2022-01-13_doi_crawl.md    80
1 file changed, 72 insertions(+), 8 deletions(-)
diff --git a/notes/ingest/2022-01-13_doi_crawl.md b/notes/ingest/2022-01-13_doi_crawl.md
index 6f3b2c8..09a3b46 100644
--- a/notes/ingest/2022-01-13_doi_crawl.md
+++ b/notes/ingest/2022-01-13_doi_crawl.md
@@ -1,6 +1,8 @@
Could roll this into the current patch crawl instead of starting a new crawl from scratch.
+This file is misnamed; these are mostly non-DOI-specific small updates.
+
## KBART "almost complete" experimentation
Random 10 releases:
@@ -133,15 +135,12 @@ many of these are likely to crawl successfully.
| pv -l \
| gzip \
> /srv/fatcat/tasks/ingest_nonoa_doi.json.gz
- # Expecting 8255693 release objects in search queries
-
-## Seeds: not daily, but OA DOI
+ # re-running 2022-02-08 after this VM was upgraded
+ # Expecting 8321448 release objects in search queries
+ # TODO: in-progress
-There are a bunch of things we are no longer attempting daily, but should do
-heritrix crawls of periodically.
-
-TODO: maybe in daily crawling, should check container coverage and see if most URLs are bright, and if so do ingest? hrm
-TODO: What are they? zenodo.org?
+This is large enough that it will probably be a bulk ingest, and then probably
+a follow-up crawl.
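+
+A minimal sketch of the likely bulk-ingest enqueue step (not yet run). This
+assumes the usual kafkacat pattern for sandcrawler bulk ingest; the broker
+hostname is a placeholder and the topic name is an assumption:
+
+    # KAFKA_BROKER is a placeholder; topic name assumed, not verified here
+    zcat /srv/fatcat/tasks/ingest_nonoa_doi.json.gz \
+        | pv -l \
+        | kafkacat -P -b $KAFKA_BROKER -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+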
## Seeds: HTML and XML links from HTML biblio
@@ -152,6 +151,71 @@ TODO: What are they? zenodo.org?
| gzip \
> ingest_file_result_fulltext_urls.2022-01-13.json.gz
+ # was this cut off at some point? gzip stream is truncated (salvage sketch below)
+
+ zcat ingest_file_result_fulltext_urls.2022-01-13.json.gz | wc -l
+ # gzip: ingest_file_result_fulltext_urls.2022-01-13.json.gz: unexpected end of file
+ # 2,538,433
+
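+One way to salvage the truncated dump, as a hedged sketch: re-compress whatever
+zcat can still read (the final JSON line may be incomplete and might need to be
+dropped); the `.clean` output filename is just an example:
+
+    # suppress the trailing "unexpected end of file" error, keep the readable prefix
+    zcat ingest_file_result_fulltext_urls.2022-01-13.json.gz 2>/dev/null \
+        | pv -l \
+        | gzip \
+        > ingest_file_result_fulltext_urls.2022-01-13.clean.json.gz
+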
+Prepare seedlists (to include in heritrix patch crawl):
+
+ zcat ingest_file_result_fulltext_urls.2022-01-13.json.gz \
+ | jq .html_biblio.xml_fulltext_url -r \
+ | rg '://' \
+ | sort -u -S 4G \
+ | pv -l \
+ | gzip \
+ > ingest_file_result_fulltext_urls.2022-01-13.xml_urls.txt.gz
+ # 1.24M 0:01:35 [12.9k/s]
+
+ zcat ingest_file_result_fulltext_urls.2022-01-13.json.gz \
+ | jq .html_biblio.html_fulltext_url -r \
+ | rg '://' \
+ | sort -u -S 4G \
+ | pv -l \
+ | gzip \
+ > ingest_file_result_fulltext_urls.2022-01-13.html_urls.txt.gz
+ # 549k 0:01:27 [6.31k/s]
+
+ zcat ingest_file_result_fulltext_urls.2022-01-13.xml_urls.txt.gz ingest_file_result_fulltext_urls.2022-01-13.html_urls.txt.gz \
+ | cut -f3 -d/ \
+ | sort -S 4G \
+ | uniq -c \
+ | sort -nr \
+ | head -n20
+
+ 534005 dlc.library.columbia.edu
+ 355319 www.degruyter.com
+ 196421 zenodo.org
+ 101450 serval.unil.ch
+ 100631 biblio.ugent.be
+ 47986 digi.ub.uni-heidelberg.de
+ 39187 www.emerald.com
+ 33195 www.cairn.info
+ 25703 boris.unibe.ch
+ 19516 journals.openedition.org
+ 15911 academic.oup.com
+ 11091 repository.dl.itc.u-tokyo.ac.jp
+ 9847 oxfordworldsclassics.com
+ 9698 www.thieme-connect.de
+ 9552 www.idunn.no
+ 9265 www.zora.uzh.ch
+ 8030 www.scielo.br
+ 6543 www.hanspub.org
+ 6229 asmedigitalcollection.asme.org
+ 5651 brill.com
+
+ zcat ingest_file_result_fulltext_urls.2022-01-13.xml_urls.txt.gz ingest_file_result_fulltext_urls.2022-01-13.html_urls.txt.gz \
+ | awk '{print "F+ " $1}' \
+ > ingest_file_result_fulltext_urls.2022-01-13.xml_and_html.schedule
+
+ wc -l ingest_file_result_fulltext_urls.2022-01-13.xml_and_html.schedule
+ 1785901 ingest_file_result_fulltext_urls.2022-01-13.xml_and_html.schedule
+
+Added to `JOURNALS-PATCH-CRAWL-2022-01`
+
## Seeds: most doi.org terminal non-success
Unless it is a 404, should retry.
+
+TODO: generate this list
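+
+A hedged sketch of how this list might be generated (not run): it assumes a
+local sandcrawler-db with an `ingest_file_result` table having `terminal_url`,
+`hit`, and `terminal_status_code` columns; database, table, and column names
+are assumptions, and the output filename is just an example.
+
+    psql sandcrawler -c "COPY (
+        SELECT terminal_url
+        FROM ingest_file_result
+        WHERE terminal_url LIKE '%//doi.org/%'
+          AND hit = false
+          AND (terminal_status_code IS NULL OR terminal_status_code != 404)
+    ) TO STDOUT" \
+        | sort -u -S 4G \
+        | pv -l \
+        | gzip \
+        > doi_terminal_nonsuccess_urls.$(date -I).txt.gz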