path: root/notes/ingest/2020-08_daily_improvements.md

Goal is to increase rate of successful daily changelog crawling, but reduce
wasted attempts.

Status by domain, past 30 days:

                    domain                |     status      | count 
    --------------------------------------+-----------------+-------
     arxiv.org                            | success         | 21792
     zenodo.org                           | success         | 10646
     res.mdpi.com                         | success         | 10449
     springernature.figshare.com          | no-pdf-link     | 10430
     s3-eu-west-1.amazonaws.com           | success         |  8966
     zenodo.org                           | no-pdf-link     |  8137
     hkvalidate.perfdrive.com             | no-pdf-link     |  5943
     www.ams.org:80                       | no-pdf-link     |  5799
     assets.researchsquare.com            | success         |  4651
     pdf.sciencedirectassets.com          | success         |  4145
     fjfsdata01prod.blob.core.windows.net | success         |  3500
     sage.figshare.com                    | no-pdf-link     |  3174
     onlinelibrary.wiley.com              | no-pdf-link     |  2869
     www.e-periodica.ch                   | no-pdf-link     |  2709
     revistas.uned.es                     | success         |  2631
     figshare.com                         | no-pdf-link     |  2500
     www.sciencedirect.com                | link-loop       |  2477
     linkinghub.elsevier.com              | gateway-timeout |  1878
     downloads.hindawi.com                | success         |  1819
     www.scielo.br                        | success         |  1691
     jps.library.utoronto.ca              | success         |  1590
     www.ams.org                          | no-pdf-link     |  1568
     digi.ub.uni-heidelberg.de            | no-pdf-link     |  1496
     research-repository.griffith.edu.au  | success         |  1412
     journals.plos.org                    | success         |  1330
    (25 rows)

Status by DOI prefix, past 30 days:

     doi_prefix |         status          | count 
    ------------+-------------------------+-------
     10.6084    | no-pdf-link             | 14410   <- figshare; small fraction success
     10.6084    | success                 |  4007
     10.6084    | cdx-error               |  1746

     10.13140   | gateway-timeout         |  9689   <- researchgate
     10.13140   | cdx-error               |  4154

     10.5281    | success                 |  9408   <- zenodo
     10.5281    | no-pdf-link             |  6079
     10.5281    | cdx-error               |  3200
     10.5281    | wayback-error           |  2098

     10.1090    | no-pdf-link             |  7420   <- AMS (ams.org)

     10.3390    | success                 |  6599   <- MDPI
     10.3390    | cdx-error               |  3032
     10.3390    | wayback-error           |  1636

     10.1088    | no-pdf-link             |  3227   <- IOP science

     10.1101    | gateway-timeout         |  3168   <- Cold Spring Harbor: press, biorxiv, medrxiv, etc
     10.1101    | cdx-error               |  1147

     10.21203   | success                 |  3124   <- researchsquare
     10.21203   | cdx-error               |  1181

     10.1016    | success                 |  3083   <- elsevier
     10.1016    | cdx-error               |  2465
     10.1016    | gateway-timeout         |  1682
     10.1016    | wayback-error           |  1567

     10.25384   | no-pdf-link             |  3058   <- sage figshare
     10.25384   | success                 |  2456

     10.1007    | gateway-timeout         |  2913   <- springer
     10.1007    | cdx-error               |  1164

     10.5944    | success                 |  2831
     10.1186    | success                 |  2650
     10.5169    | no-pdf-link             |  2644   <- www.e-periodica.ch
     10.3389    | success                 |  2279
     10.24411   | gateway-timeout         |  2184   <- cyberleninka.ru
     10.1038    | gateway-timeout         |  2143   <- nature group
     10.1177    | gateway-timeout         |  2038   <- SAGE
     10.11588   | no-pdf-link             |  1574   <- journals.ub.uni-heidelberg.de (OJS?)
     10.25904   | success                 |  1416
     10.1155    | success                 |  1304
     10.21994   | no-pdf-link             |  1268   <- loar.kb.dk
     10.18720   | spn2-cdx-lookup-failure |  1232   <- elib.spbstu.ru
     10.24411   | cdx-error               |  1202
     10.1055    | no-pdf-link             |  1170   <- thieme-connect.de
    (40 rows)
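
The "small fraction success" remark can be checked from the rows above. A minimal sketch, with counts copied from the table for three prefixes; since the table only lists the top 40 (prefix, status) pairs, statuses that did not make the cut are missing from the denominator, so these fractions are upper-bound estimates. The helper name `success_fraction` is made up for illustration.

```python
# Counts copied from the "Status by DOI prefix" table (top-40 rows only).
ROWS = [
    ("10.6084", "no-pdf-link", 14410),
    ("10.6084", "success", 4007),
    ("10.6084", "cdx-error", 1746),
    ("10.5281", "success", 9408),
    ("10.5281", "no-pdf-link", 6079),
    ("10.5281", "cdx-error", 3200),
    ("10.5281", "wayback-error", 2098),
    ("10.3390", "success", 6599),
    ("10.3390", "cdx-error", 3032),
    ("10.3390", "wayback-error", 1636),
]


def success_fraction(rows, prefix):
    """Share of listed attempts for `prefix` that ended in status 'success'."""
    total = sum(count for p, _, count in rows if p == prefix)
    ok = sum(count for p, status, count in rows if p == prefix and status == "success")
    return ok / total if total else 0.0


for prefix in ("10.6084", "10.5281", "10.3390"):
    print(prefix, f"{success_fraction(ROWS, prefix):.1%}")
```

For figshare (10.6084) this comes out at roughly one in five listed attempts.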

code changes for ingest:
x hkvalidate.perfdrive.com: just bail when we see this
x skip large publishers which gateway-timeout (for now)
    - springerlink (10.1007)
    - nature group (10.1038)
    - SAGE (10.1177)
    - IOP (10.1088)
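
Rough sketch of the two checks above, not the actual sandcrawler/fatcat code. The function names and the idea of checking each hop URL are assumptions; only the prefixes and the hostname come from these notes.

```python
from urllib.parse import urlparse

# Large publishers skipped for now (list above).
SKIP_DOI_PREFIXES = {"10.1007", "10.1038", "10.1177", "10.1088"}

# Anti-bot interstitial host: give up as soon as a hop lands here.
BAIL_HOSTS = {"hkvalidate.perfdrive.com"}


def doi_prefix(doi):
    """Return the registrant prefix of a DOI, e.g. '10.1007'."""
    return doi.strip().lower().split("/", 1)[0]


def should_skip_doi(doi):
    """True if the DOI belongs to a publisher we are not attempting for now."""
    return doi_prefix(doi) in SKIP_DOI_PREFIXES


def should_bail(url):
    """True if the URL is on a host where continuing the crawl is pointless."""
    host = urlparse(url).hostname or ""
    return host in BAIL_HOSTS
```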

fatcat:
x figshare (by `doi_prefix`): if not versioned (suffix), skip crawl
x zenodo: also try to not crawl if unversioned (group)
x figshare import metadata
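
A possible shape for the figshare rule, as a sketch only. It assumes versioned figshare DOIs end in a `.v<N>` suffix (eg `...figshare.1234567.v2`); the function name is hypothetical. Zenodo is not covered here because its DOIs carry no version suffix.

```python
import re

# Regular figshare and SAGE figshare DOI prefixes (from the tables above).
FIGSHARE_PREFIXES = ("10.6084/", "10.25384/")

# Assumption: a versioned figshare DOI ends in ".v" followed by digits.
VERSION_SUFFIX = re.compile(r"\.v\d+$")


def should_crawl_figshare(doi):
    """Skip unversioned figshare DOIs; leave all other DOIs alone."""
    doi = doi.strip().lower()
    if not doi.startswith(FIGSHARE_PREFIXES):
        return True  # not figshare, so this rule does not apply
    return bool(VERSION_SUFFIX.search(doi))
```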

sandcrawler:
x ends with `cookieAbsent` or `cookieSet=1` -> status as cookie-blocked
x https://profile.thieme.de/HTML/sso/ejournals/login.htm[...] => blocklist
x verify that we do quick-get for arxiv.org + europepmc.org (+ figshare/zenodo?)
    => we were not!
x shorten post-SPNv2 CDX pause? for throughput, given that we are re-trying anyways
x ensure that we store uncrawled URL somewhere on no-capture status
    => in HTML or last of hops
    => not in DB, but that is a bigger change
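
Sketch of the cookie and login-wall checks from the list above. The function name and the `blocked-url` status string are made up for illustration; only `cookie-blocked`, the two URL endings, and the thieme login URL prefix come from these notes.

```python
# URL endings that indicate a cookie test we did not pass.
COOKIE_SUFFIXES = ("cookieAbsent", "cookieSet=1")

# Substrings of known login-wall URLs (the thieme URL is truncated in the notes,
# so only the part before the elision is matched).
LOGIN_BLOCKLIST = ["://profile.thieme.de/HTML/sso/ejournals/login.htm"]


def classify_terminal_url(url):
    """Return a status for known dead-end URLs, or None to keep going."""
    if url.endswith(COOKIE_SUFFIXES):
        return "cookie-blocked"
    if any(pattern in url for pattern in LOGIN_BLOCKLIST):
        return "blocked-url"  # hypothetical status name
    return None
```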

- try to get un-blocked:
    - Cold Spring Harbor has been blocking since 2020-06-22? yikes!
    - cyberleninka.ru
    - arxiv.org

- no-pdf-link
    x www.ams.org (10.1090)
        => these seem to be stale captures, eg from 2008. newer captures have citation_pdf_url
        => should consider recrawling all of ams.org?
        => not sure why these crawl requests are happening only now
        => on the order of 15k OA articles not in ia; 43k total not preserved
        => force recrawl OA subset (DONE)
    x www.e-periodica.ch (10.5169)
        => TODO: dump un-preserved URLs, transform to PDF urls, heritrix crawl, re-ingest
    x digi.ub.uni-heidelberg.de (10.11588)
        => TODO: bulk re-enqueue? then heritrix crawl?
    - https://loar.kb.dk/handle/1902/6988 (10.21994)
        => TODO: bulk re-enqueue
        => site was updated recently (august 2020); now it crawls fine. need to re-ingest all?
        => 7433 hits
    - thieme-connect.de (10.1055)
        => 600k+ missing
        => TODO: bulk re-enqueue? then heritrix crawl?
        => https://profile.thieme.de/HTML/sso/ejournals/login.htm[...] => blocklist
        => generally just need to re-crawl all?

Unresolved:
- why so many spn2-errors on https://elib.spbstu.ru/ (10.18720)?

## figshare

10.6084     regular figshare
10.25384    SAGE figshare

For sage, "collections" are bogus? can we detect these in datacite metadata?

If figshare types like:

    ris: "GEN",
    bibtex: "misc",
    citeproc: "article",
    schemaOrg: "Collection",
    resourceType: "Collection",
    resourceTypeGeneral: "Collection"

then mark as 'stub'.

"Additional file" items don't seem like "stub"; -> "component".

title:"Figure {} from " -> component
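
The type rules above could be sketched roughly as follows. This is not the importer code: the function name is hypothetical, only `resourceTypeGeneral` is checked (not the other type fields), "Additional file" is assumed to be a title prefix, and the order of the checks is a guess.

```python
import re

# Matches titles like "Figure 3 from ..." or "Figure S2 from ...".
FIGURE_TITLE = re.compile(r"^Figure \S+ from ", re.IGNORECASE)


def figshare_release_type(title, resource_type_general, default):
    """Pick a release type for a figshare record, else return `default`."""
    if resource_type_general == "Collection":
        return "stub"
    title = (title or "").strip()
    # Assumption: these items are recognizable by their title prefix.
    if title.lower().startswith("additional file"):
        return "component"
    if FIGURE_TITLE.match(title):
        return "component"
    return default
```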

current types are mostly: article, stub, dataset, graphic, article-journal

If DOI starts with "sage.", then publisher is "Sage" (not figshare). Container
name should be... sage.figshare.com?

set version to the version from DOI
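
Sketch of deriving publisher and version from the DOI alone. Assumptions: "DOI starts with sage." refers to the part after the prefix (eg `10.25384/sage.1234567.v1`), and the version is the trailing `.v<N>`. The function and dict keys are illustrative; container name is left alone since that question is still open.

```python
import re

# Assumption: the version is the number in a trailing ".v<N>".
VERSION_RE = re.compile(r"\.v(\d+)$")


def figshare_fields_from_doi(doi):
    """Return the metadata overrides that can be read off a figshare DOI."""
    _, _, suffix = doi.strip().lower().partition("/")
    fields = {}
    if suffix.startswith("sage."):
        fields["publisher"] = "Sage"
    match = VERSION_RE.search(suffix)
    if match:
        fields["version"] = match.group(1)
    return fields
```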

## zenodo

doi_prefix: 10.5281

If on zenodo, and the record has an "Identical to" relation, then this is a
pre-print. In that case, drop container_id and set container_name to
zenodo.org. *But*, there are some journals now publishing exclusively to
zenodo.org, so retain that metadata for those. Examples:

    "Detection of keyboard vibrations and effects on perceived piano quality"
    https://fatcat.wiki/release/mufzkdgt2nbzfha44o7p7gkrpy

    "Editing LAF: Educate, don't defend!"
    https://zenodo.org/record/2583025

version number not available in zenodo metadata
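
One way the pre-print rule could look, as a sketch on a plain dict rather than the real fatcat entity. Assumptions: "Identical to" corresponds to the DataCite relation type `IsIdenticalTo`; how to recognize the zenodo-only journals is not settled in these notes, so that decision is passed in as a hypothetical `keep_container` flag.

```python
def apply_zenodo_preprint_rule(release, relation_types, keep_container=False):
    """Return a copy of `release` with container info reset for pre-prints.

    `release` is a plain dict with `container_id` and `container_name` keys
    (illustrative shape only); `relation_types` is the collection of DataCite
    relation type strings found on the record.
    """
    if keep_container or "IsIdenticalTo" not in relation_types:
        return release
    updated = dict(release)
    updated["container_id"] = None
    updated["container_name"] = "zenodo.org"
    return updated
```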

## Gitlab MR Notes

The main goal of this group of changes is to do a better job at daily ingest.

Currently we have on the order of 20k new releases added to the index every day. About half of them are marked as OA (either via a CC license or via the container being in DOAJ or ROAD), pass some filters (eg, release_type), and are selected for ingest. Of those, about half fail to crawl to fulltext due to blocking (gateway-timeout, cookie tests, anti-bot detection, loginwall, etc). On the other hand, we don't attempt to crawl lots of "bronze" OA, which is content that is available from the publisher website, but isn't marked explicitly OA.

Based on investigating daily crawling from the past month (will commit these notes to sandcrawler soon), I have identified some DOI prefixes that almost always fail ingest via SPNv2. I also have some patches to sandcrawler ingest to improve ability to crawl some large repositories etc.

Some of the biggest "OA but failed to crawl" sources are figshare and zenodo, which register a relatively large fraction of daily OA DOIs. We want to crawl most of that content, but both of these platforms register at least two DOIs for each piece of content (a "group" DOI and a "versioned" DOI), and we only need to crawl one. There were also some changes needed to release-type filtering and assignment, either specific to these platforms or based on the title of entities.

This MR mixes changes to the datacite metadata import routine (including some refactors out of the main parse_record method) and behavior changes to the entity updater (which is where the code that decides whether to send an ingest request on release creation lives). I will have a separate MR for importer metadata changes that don't impact ingest behavior.