author    Bryan Newbold <bnewbold@archive.org>  2022-02-08 17:49:39 -0800
committer Bryan Newbold <bnewbold@archive.org>  2022-02-08 17:49:50 -0800
commit    3a6fc1f1c26885fd7a44b13ee156fcdb61e6aadd (patch)
tree      077afcd3c48553dbc65760db047b2e81ba080a73 /notes/ingest/2022-01-13_doi_crawl.md
parent    067c97a59a4a8728add7b9e561082a5403be52e5 (diff)
more patch crawling
Diffstat (limited to 'notes/ingest/2022-01-13_doi_crawl.md')
 notes/ingest/2022-01-13_doi_crawl.md | 80 ++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 72 insertions(+), 8 deletions(-)
diff --git a/notes/ingest/2022-01-13_doi_crawl.md b/notes/ingest/2022-01-13_doi_crawl.md
index 6f3b2c8..09a3b46 100644
--- a/notes/ingest/2022-01-13_doi_crawl.md
+++ b/notes/ingest/2022-01-13_doi_crawl.md
@@ -1,6 +1,8 @@
 Could roll this in to current patch crawl instead of starting a new crawl from scratch.
 
+This file is misnamed; these are mostly non-DOI-specific small updates.
+
 ## KBART "almost complete" experimentation
 
 Random 10 releases:
 
@@ -133,15 +135,12 @@ many of these are likely to crawl successfully.
         | pv -l \
         | gzip \
         > /srv/fatcat/tasks/ingest_nonoa_doi.json.gz
-    # Expecting 8255693 release objects in search queries
-
-## Seeds: not daily, but OA DOI
+    # re-running 2022-02-08 after this VM was upgraded
+    # Expecting 8321448 release objects in search queries
+    # TODO: in-progress
 
-There are a bunch of things we are no longer attempting daily, but should do
-heritrix crawls of periodically.
-
-TODO: maybe in daily crawling, should check container coverage and see if most URLs are bright, and if so do ingest? hrm
-TODO: What are they? zenodo.org?
+This is large enough that it will probably be a bulk ingest, and then probably
+a follow-up crawl.
 
 ## Seeds: HTML and XML links from HTML biblio
 
@@ -152,6 +151,71 @@ TODO: What are they? zenodo.org?
         | gzip \
         > ingest_file_result_fulltext_urls.2022-01-13.json.gz
 
+    # cut this off at some point? gzip is terminated weird
+
+    zcat ingest_file_result_fulltext_urls.2022-01-13.json.gz | wc -l
+    # gzip: ingest_file_result_fulltext_urls.2022-01-13.json.gz: unexpected end of file
+    # 2,538,433
+
+Prepare seedlists (to include in heritrix patch crawl):
+
+    zcat ingest_file_result_fulltext_urls.2022-01-13.json.gz \
+        | jq .html_biblio.xml_fulltext_url -r \
+        | rg '://' \
+        | sort -u -S 4G \
+        | pv -l \
+        | gzip \
+        > ingest_file_result_fulltext_urls.2022-01-13.xml_urls.txt.gz
+    # 1.24M 0:01:35 [12.9k/s]
+
+    zcat ingest_file_result_fulltext_urls.2022-01-13.json.gz \
+        | jq .html_biblio.html_fulltext_url -r \
+        | rg '://' \
+        | sort -u -S 4G \
+        | pv -l \
+        | gzip \
+        > ingest_file_result_fulltext_urls.2022-01-13.html_urls.txt.gz
+    # 549k 0:01:27 [6.31k/s]
+
+    zcat ingest_file_result_fulltext_urls.2022-01-13.xml_urls.txt.gz ingest_file_result_fulltext_urls.2022-01-13.html_urls.txt.gz \
+        | cut -f3 -d/ \
+        | sort -S 4G \
+        | uniq -c \
+        | sort -nr \
+        | head -n20
+
+     534005 dlc.library.columbia.edu
+     355319 www.degruyter.com
+     196421 zenodo.org
+     101450 serval.unil.ch
+     100631 biblio.ugent.be
+      47986 digi.ub.uni-heidelberg.de
+      39187 www.emerald.com
+      33195 www.cairn.info
+      25703 boris.unibe.ch
+      19516 journals.openedition.org
+      15911 academic.oup.com
+      11091 repository.dl.itc.u-tokyo.ac.jp
+       9847 oxfordworldsclassics.com
+       9698 www.thieme-connect.de
+       9552 www.idunn.no
+       9265 www.zora.uzh.ch
+       8030 www.scielo.br
+       6543 www.hanspub.org
+       6229 asmedigitalcollection.asme.org
+       5651 brill.com
+
+    zcat ingest_file_result_fulltext_urls.2022-01-13.xml_urls.txt.gz ingest_file_result_fulltext_urls.2022-01-13.html_urls.txt.gz \
+        | awk '{print "F+ " $1}' \
+        > ingest_file_result_fulltext_urls.2022-01-13.xml_and_html.schedule
+
+    wc -l ingest_file_result_fulltext_urls.2022-01-13.xml_and_html.schedule
+    1785901 ingest_file_result_fulltext_urls.2022-01-13.xml_and_html.schedule
+
+Added to `JOURNALS-PATCH-CRAWL-2022-01`
+
 ## Seeds: most doi.org terminal non-success
 
 Unless it is a 404, should retry.
+
+TODO: generate this list
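On the "probably a bulk ingest" note in the second hunk: the usual pattern in these notes is to enqueue ingest-request JSON to the bulk Kafka topic. A minimal sketch only, assuming the release dump has already been transformed into ingest requests (that transform step is elided, and the `.requests` filename is hypothetical); the broker hostname is a placeholder and the topic name is worth double-checking:

    # sketch: enqueue ingest requests to the bulk ingest topic
    # broker hostname is a placeholder, not a real host
    zcat /srv/fatcat/tasks/ingest_nonoa_doi.requests.json.gz \
        | rg -v "\\\\" \
        | jq . -c \
        | kafkacat -P -b kafka-broker.example.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1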
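On the "gzip is terminated weird" problem in the third hunk: `zcat` still streams the readable prefix of a truncated gzip file before erroring out, so a clean copy can be salvaged by re-compressing that prefix (the final line may be a partial JSON record and worth dropping before use). A sketch; the `.salvaged` filename is illustrative:

    # zcat exits non-zero at the truncated trailer, but the readable
    # prefix still flows through the pipe; stderr noise is discarded
    zcat ingest_file_result_fulltext_urls.2022-01-13.json.gz 2>/dev/null \
        | gzip \
        > ingest_file_result_fulltext_urls.2022-01-13.salvaged.json.gz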
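For the trailing `TODO: generate this list`: one plausible shape is a query against the ingest results table, piped through the same seedlist post-processing used above. A sketch only: it assumes the sandcrawler `ingest_file_result` table and column names (`terminal_url`, `hit`, `terminal_status_code`), which should be checked against the live schema, and the output path is illustrative. Terminal 404s are excluded per the "unless it is a 404" note:

    # sketch: doi.org terminal URLs with non-success, non-404 ingest results
    psql sandcrawler -c "COPY (
        SELECT terminal_url
        FROM ingest_file_result
        WHERE terminal_url LIKE '%://doi.org/%'
          AND hit = false
          AND (terminal_status_code IS NULL OR terminal_status_code != 404)
    ) TO STDOUT" \
        | sort -u -S 4G \
        | pv -l \
        | gzip \
        > /srv/sandcrawler/tasks/doi_terminal_nonsuccess.txt.gz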