author    Bryan Newbold <bnewbold@archive.org>    2021-09-03 09:04:55 -0700
committer Bryan Newbold <bnewbold@archive.org>    2021-09-03 10:36:49 -0700
commit    f074a6aafd9af06866829d35555afe10286126fb
tree      f0c30485ddd6fae7abe996c3812fa6cc360974a7 /notes
parent    ffd6cd86bb8a4756d123decaa5f2ef03428f208f
commit old patch notes (will rework)
Diffstat (limited to 'notes')
-rw-r--r--  notes/ingest/2020-12-08_patch_crawl_notes.md | 110
1 file changed, 110 insertions, 0 deletions
diff --git a/notes/ingest/2020-12-08_patch_crawl_notes.md b/notes/ingest/2020-12-08_patch_crawl_notes.md
new file mode 100644
index 0000000..bb22b82
--- /dev/null
+++ b/notes/ingest/2020-12-08_patch_crawl_notes.md
@@ -0,0 +1,110 @@

Notes here about re-ingesting or re-crawling large batches. Goal around end of
2020 is to generate a broad patch crawl of terminal no-capture attempts for all
major sources crawled thus far. Have already tried running this process for
unpaywall.

For each source, want filtered ingest request JSON objects (filtering out
platforms that don't crawl well, and possibly things like figshare+zenodo), and
a broader seedlist (including terminal URLs). Will de-dupe all the seedlist
URLs, do a heritrix crawl with a new config, then re-ingest all the requests
individually.

Summary of what to do here:

    OA DOI: expecting some 2.4 million seeds
    OAI-PMH: expecting some 5 million no-capture URLs, plus more from missing PDF URLs (`no-pdf-link`)
    Unpaywall: another ~900k no-capture URLs (maybe filtered?)

For all sources, re-attempt requests with these status codes:

    no-capture
    cdx-error
    wayback-error
    petabox-error

And at least do bulk re-ingest for requests with this status, if updated
before 2020-11-20 or so:

    no-pdf-link

## OAI-PMH

Need to re-ingest all of the (many!) no-capture and no-pdf-link requests.

TODO: repec-specific URL extraction?

Skip these OAI prefixes:

    kb.dk
    bnf.fr
    hispana.mcu.es
    bdr.oai.bsb-muenchen.de
    ukm.si
    hsp.org

Skip these domains:

    www.kb.dk (kb.dk)
    kb-images.kb.dk (kb.dk)
    mdz-nbn-resolving.de (TODO: what prefix?)
    aggr.ukm.um.si (ukm.si)

Check PDF link extraction for these prefixes, or skip them (TODO):

    repec (mixed success)
    biodiversitylibrary.org
    juser.fz-juelich.de
    americanae.aecid.es
    www.irgrid.ac.cn
    hal
    espace.library.uq.edu.au
    igi.indrastra.com
    invenio.nusl.cz
    hypotheses.org
    t2r2.star.titech.ac.jp
    quod.lib.umich.edu

    domain: hemerotecadigital.bne.es
    domain: bib-pubdb1.desy.de
    domain: publikationen.bibliothek.kit.edu
    domain: edoc.mpg.de
    domain: bibliotecadigital.jcyl.es
    domain: lup.lub.lu.se
    domain: orbi.uliege.be

TODO:
- consider deleting ingest requests from skipped prefixes (large database use)


## Unpaywall

About 900k `no-capture`, and up to 2.5 million more `no-pdf-link`.
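The `no-capture` rows would feed the patch crawl seedlist. A minimal sketch of
that export, reusing the same tables and JOIN as the `no-pdf-link` dump just
below; the output path and the absence of extra platform filters here are
illustrative, not from an actual run:

    -- sketch: dump unpaywall no-capture rows for the seedlist
    -- (hypothetical output path; platform filters omitted)
    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            AND ingest_file_result.status = 'no-capture'
    ) TO '/grande/snapshots/unpaywall_nocapture_2020-12-08.rows.json';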
Re-bulk-ingest filtered requests which hit `no-pdf-link` before 2020-11-20:

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            AND date(ingest_request.created) < '2020-11-20'
            AND ingest_file_result.status = 'no-pdf-link'
            AND ingest_request.base_url NOT LIKE '%journals.sagepub.com%'
            AND ingest_request.base_url NOT LIKE '%pubs.acs.org%'
            AND ingest_request.base_url NOT LIKE '%ahajournals.org%'
            AND ingest_request.base_url NOT LIKE '%www.journal.csj.jp%'
            AND ingest_request.base_url NOT LIKE '%aip.scitation.org%'
            AND ingest_request.base_url NOT LIKE '%academic.oup.com%'
            AND ingest_request.base_url NOT LIKE '%tandfonline.com%'
            AND ingest_request.base_url NOT LIKE '%://archive.org/%'
            AND ingest_request.base_url NOT LIKE '%://web.archive.org/%'
            AND ingest_request.base_url NOT LIKE '%://www.archive.org/%'
    ) TO '/grande/snapshots/unpaywall_nopdflink_2020-12-08.rows.json';
    => COPY 1309990

    ./scripts/ingestrequest_row2json.py /grande/snapshots/unpaywall_nopdflink_2020-12-08.rows.json | pv -l | shuf > /grande/snapshots/unpaywall_nopdflink_2020-12-08.ingest_request.json
    => 1.31M 0:00:51 [25.6k/s]

    cat /grande/snapshots/unpaywall_nopdflink_2020-12-08.ingest_request.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
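To watch the bulk re-ingest drain after the kafkacat push, a status breakdown
over the same two tables is enough. A sketch, using only columns that appear
in the query above (not a query from the original notes):

    -- sketch: status counts for unpaywall PDF requests; re-run
    -- periodically and watch no-pdf-link / no-capture shrink
    SELECT ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.ingest_type = 'pdf'
        AND ingest_request.link_source = 'unpaywall'
    GROUP BY ingest_file_result.status
    ORDER BY COUNT(*) DESC
    LIMIT 20;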