daily ingest notes

author: Bryan Newbold <bnewbold@archive.org> 2020-09-02 16:11:04 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2020-09-02 16:11:04 -0700
commit: 07754188a6003a8f4cce1c6012b2a05df49269f8 (patch)
tree: f553dc41f3d3d25495ec9fe4f34828ad10ad56bb
parent: 8cc3cebd2392d16026214f5e92b99a322ef2e044 (diff)
download: sandcrawler-07754188a6003a8f4cce1c6012b2a05df49269f8.tar.gz
sandcrawler-07754188a6003a8f4cce1c6012b2a05df49269f8.zip
1 files changed, 202 insertions, 0 deletions
diff --git a/notes/ingest/2020-08_daily_improvements.md b/notes/ingest/2020-08_daily_improvements.md
new file mode 100644
index 0000000..da57065
--- /dev/null
+++ b/notes/ingest/2020-08_daily_improvements.md
@@ -0,0 +1,202 @@
+
+Goal is to increase rate of successful daily changelog crawling, but reduce
+wasted attempts.
+
+Status by domain, past 30 days:
+
+                    domain                |     status      | count 
+    --------------------------------------+-----------------+-------
+     arxiv.org                            | success         | 21792
+     zenodo.org                           | success         | 10646
+     res.mdpi.com                         | success         | 10449
+     springernature.figshare.com          | no-pdf-link     | 10430
+     s3-eu-west-1.amazonaws.com           | success         |  8966
+     zenodo.org                           | no-pdf-link     |  8137
+     hkvalidate.perfdrive.com             | no-pdf-link     |  5943
+     www.ams.org:80                       | no-pdf-link     |  5799
+     assets.researchsquare.com            | success         |  4651
+     pdf.sciencedirectassets.com          | success         |  4145
+     fjfsdata01prod.blob.core.windows.net | success         |  3500
+     sage.figshare.com                    | no-pdf-link     |  3174
+     onlinelibrary.wiley.com              | no-pdf-link     |  2869
+     www.e-periodica.ch                   | no-pdf-link     |  2709
+     revistas.uned.es                     | success         |  2631
+     figshare.com                         | no-pdf-link     |  2500
+     www.sciencedirect.com                | link-loop       |  2477
+     linkinghub.elsevier.com              | gateway-timeout |  1878
+     downloads.hindawi.com                | success         |  1819
+     www.scielo.br                        | success         |  1691
+     jps.library.utoronto.ca              | success         |  1590
+     www.ams.org                          | no-pdf-link     |  1568
+     digi.ub.uni-heidelberg.de            | no-pdf-link     |  1496
+     research-repository.griffith.edu.au  | success         |  1412
+     journals.plos.org                    | success         |  1330
+    (25 rows)
+
+Status by DOI prefix, past 30 days:
+
+     doi_prefix |         status          | count 
+    ------------+-------------------------+-------
+     10.6084    | no-pdf-link             | 14410   <- figshare; small fraction success
+     10.6084    | success                 |  4007
+     10.6084    | cdx-error               |  1746
+
+     10.13140   | gateway-timeout         |  9689   <- researchgate
+     10.13140   | cdx-error               |  4154
+
+     10.5281    | success                 |  9408   <- zenodo
+     10.5281    | no-pdf-link             |  6079
+     10.5281    | cdx-error               |  3200
+     10.5281    | wayback-error           |  2098
+
+     10.1090    | no-pdf-link             |  7420   <- AMS (ams.org)
+
+     10.3390    | success                 |  6599   <- MDPI
+     10.3390    | cdx-error               |  3032
+     10.3390    | wayback-error           |  1636
+
+     10.1088    | no-pdf-link             |  3227   <- IOP science
+
+     10.1101    | gateway-timeout         |  3168   <- coldspring harbor: press, biorxiv, medrxiv, etc
+     10.1101    | cdx-error               |  1147
+
+     10.21203   | success                 |  3124   <- researchsquare
+     10.21203   | cdx-error               |  1181
+
+     10.1016    | success                 |  3083   <- elsevier
+     10.1016    | cdx-error               |  2465
+     10.1016    | gateway-timeout         |  1682
+     10.1016    | wayback-error           |  1567
+
+     10.25384   | no-pdf-link             |  3058   <- sage figshare
+     10.25384   | success                 |  2456
+
+     10.1007    | gateway-timeout         |  2913   <- springer
+     10.1007    | cdx-error               |  1164
+
+     10.5944    | success                 |  2831
+     10.1186    | success                 |  2650
+     10.5169    | no-pdf-link             |  2644   <- www.e-periodica.ch
+     10.3389    | success                 |  2279
+     10.24411   | gateway-timeout         |  2184   <- cyberleninka.ru
+     10.1038    | gateway-timeout         |  2143   <- nature group
+     10.1177    | gateway-timeout         |  2038   <- SAGE
+     10.11588   | no-pdf-link             |  1574   <- journals.ub.uni-heidelberg.de (OJS?)
+     10.25904   | success                 |  1416
+     10.1155    | success                 |  1304
+     10.21994   | no-pdf-link             |  1268   <- loar.kb.dk
+     10.18720   | spn2-cdx-lookup-failure |  1232   <- elib.spbstu.ru
+     10.24411   | cdx-error               |  1202
+     10.1055    | no-pdf-link             |  1170   <- thieme-connect.de
+    (40 rows)
+
+code changes for ingest:
+x hkvalidate.perfdrive.com: just bail when we see this
+x skip large publishers which gateway-timeout (for now)
+    - springerlink (10.1007)
+    - nature group (10.1038)
+    - SAGE (10.1177)
+    - IOP (10.1088)
+
+fatcat:
+x figshare (by `doi_prefix`): if not versioned (suffix), skip crawl
+x zenodo: also try to not crawl if unversioned (group)
+x figshare import metadata
+
+sandcrawler:
+x ends with `cookieAbsent` or `cookieSet=1` -> status as cookie-blocked
+x https://profile.thieme.de/HTML/sso/ejournals/login.htm[...] => blocklist
+x verify that we do quick-get for arxiv.org + europmc.org (+ figshare/zenodo?)
+    => we were not!
+x shorten post-SPNv2 CDX pause? for throughput, given that we are re-trying anyways
+x ensure that we store uncrawled URL somewhere on no-capture status
+    => in HTML or last of hops
+    => not in DB, but that is a bigger change
+
+- try to get un-blocked:
+    - coldspring harbor has been blocking since 2020-06-22? yikes!
+    - cyberleninka.ru
+    - arxiv.org
+
+- no-pdf-link
+    x www.ams.org (10.1090)
+        => these seem to be stale captures, eg from 2008. newer captures have citation_pdf_url
+        => should consider recrawling all of ams.org?
+        => not sure why these crawl requests are happening only now
+        => on the order of 15k OA articles not in ia; 43k total not preserved
+        => force recrawl OA subset (DONE)
+    x www.e-periodica.ch (10.5169)
+        => TODO: dump un-preserved URLs, transform to PDF urls, heritrix crawl, re-ingest
+    x digi.ub.uni-heidelberg.de (10.11588)
+        => TODO: bulk re-enqueue? then heritrix crawl?
+    - https://loar.kb.dk/handle/1902/6988 (10.21994)
+        => TODO: bulk re-enqueue
+        => site was updated recently (august 2020); now it crawls fine. need to re-ingest all?
+        => 7433 hits
+    - thieme-connect.de (10.1055)
+        => 600k+ missing
+        => TODO: bulk re-enqueue? then heritrix crawl?
+        => https://profile.thieme.de/HTML/sso/ejournals/login.htm[...] => blocklist
+        => generally just need to re-crawl all?
+
+Unresolved:
+- why so many spn2-errors on https://elib.spbstu.ru/ (10.18720)?
+
+## figshare
+
+10.6084     regular figshare
+10.25384    SAGE figshare
+
+For sage, "collections" are bogus? can we detect these in datacite metadata?
+
+If figshare types like:
+
+    ris: "GEN",
+    bibtex: "misc",
+    citeproc: "article",
+    schemaOrg: "Collection",
+    resourceType: "Collection",
+    resourceTypeGeneral: "Collection"
+
+then mark as 'stub'.
+
+"Additional file" items don't seem like "stub"; -> "component".
+
+title:"Figure {} from " -> component
+
+current types are mostly: article, stub, dataset, graphic, article-journal
+
+If DOI starts with "sage.", then publisher is "Sage" (not figshare). Container
+name should be... sage.figshare.com?
+
+set version to the version from DOI
+
+## zenodo
+
+doi_prefix: 10.5281
+
+if on zenodo, and has a "Identical to" relation, then this is a pre-print. in
+that case, drop container_id and set container_name to zenodo.org. *But*, there
+are some journals now publishing exclusively to zenodo.org, so retain that
+metadata. examples:
+
+    "Detection of keyboard vibrations and effects on perceived piano quality"
+    https://fatcat.wiki/release/mufzkdgt2nbzfha44o7p7gkrpy
+
+    "Editing LAF: Educate, don't defend!"
+    https://zenodo.org/record/2583025
+
+version number not available in zenodo metadata
+
+## Gitlab MR Notes
+
+The main goal of this group of changes is to do a better job at daily ingest.
+
+Currently we have on the order of 20k new releases added to the index every day, and about half of them get are marked as OA (either CC license or via container being in DOAJ or ROAD), and pass some filters (eg, release_type), and are selected for ingest. Of those, about half fail to crawl to fulltext, either due to blocking (gateway-timeout, cookie tests, anti-bot detection, loginwall, etc). On the other hand, we don't attempt to crawl lots of "bronze" OA, which is content that is available from the publisher website, but isn't marked explicitly OA.
+
+Based on investigating daily crawling from the past month (will commit these notes to sandcrawler soon), I have identified some DOI prefixes that almost always fail ingest via SPNv2. I also have some patches to sandcrawler ingest to improve ability to crawl some large repositories etc.
+
+Some of the biggest "OA but failed to crawl" are from figshare and zenodo, which register a relatively large fraction of daily OA DOIs. We want to crawl most of that content, but both of these platforms register at least DOIs for each piece of content (a "group" DOI and a "versioned" DOI), and we only need to crawl one. There were also some changes needed to release-type filtering and assignment specific to these platforms, or based on the title of entities.
+
+This MR mixes changes to the datacite metadata import routing (including some refactors out of the main parse_record method) and behavior changes to the entity updater (which is where the code to decide about whether to send an ingest request on release creation lives). I will have a separate MR for importer metadata changes that don't impact ingest behavior.
+
author	Bryan Newbold <bnewbold@archive.org>	2020-09-02 16:11:04 -0700
committer	Bryan Newbold <bnewbold@archive.org>	2020-09-02 16:11:04 -0700
commit	07754188a6003a8f4cce1c6012b2a05df49269f8 (patch)
tree	f553dc41f3d3d25495ec9fe4f34828ad10ad56bb
parent	8cc3cebd2392d16026214f5e92b99a322ef2e044 (diff)
download	sandcrawler-07754188a6003a8f4cce1c6012b2a05df49269f8.tar.gz sandcrawler-07754188a6003a8f4cce1c6012b2a05df49269f8.zip