proposals/2019_ingest.md


status: work-in-progress

This document proposes structure and systems for ingesting (crawling) paper
PDFs and other content as part of sandcrawler.

## Overview

The main abstraction is a sandcrawler "ingest request" object, which can be
created and submitted to one of several systems for automatic harvesting,
resulting in an "ingest result" metadata object. This result should contain
enough metadata to be automatically imported into fatcat as a file/release
mapping.

The structure and pipelines should be flexible enough to work with individual
PDF files, web captures, and datasets. It should work for on-demand
(interactive) ingest (for "save paper now" features), soft-real-time
(hourly/daily/queued), batches of hundreds or thousands of requests, and scale
up to batch ingest crawls of tens of millions of URLs. Most code should not
care about how or when content is actually crawled.

The motivation for this structure is to consolidate and automate the current ad
hoc systems for crawling, matching, and importing into fatcat. It is likely
that there will still be a few special cases with their own importers, but the
goal is that in almost all cases where we discover a new structured source of
content to ingest (eg, a new manifest of identifiers to URLs), we can quickly
transform the task into a list of ingest requests, then submit those requests
to an automated system to have them archived and inserted into fatcat with as
little manual effort as possible.

## Use Cases and Workflows

### Unpaywall Example

As a motivating example, consider how unpaywall crawls are done today:

- download and archive JSON dump from unpaywall. transform and filter into a
  TSV with DOI, URL, release-stage columns.
- filter out previously crawled URLs from this seed file, based on last dump,
  with the intent of not repeating crawls unnecessarily
- run heritrix3 crawl, usually by sharding seedlist over multiple machines.
  after crawl completes:
    - backfill CDX PDF subset into hbase (for future de-dupe)
    - generate CRL files etc and upload to archive items
- run arabesque over complete crawl logs. this takes time, is somewhat manual,
  and has scaling issues past a few million seeds
- depending on source/context, run fatcat import with arabesque results
- periodically run GROBID (and other transforms) over all new harvested files

Issues with this are:

- seedlist generation and arabesque step are toilsome (manual), and arabesque
  likely has metadata issues or otherwise "leaks" content
- brozzler pipeline is entirely separate
- results in re-crawls of content already in wayback, in particular links
  between large corpuses

New plan:

- download dump, filter, transform into ingest requests (mostly the same as
  before)
- load into ingest-request SQL table. only new rows (unique by source, type,
  and URL) are loaded. run a SQL query for new rows from the source with URLs
  that have not been ingested
- (optional) pre-crawl bulk/direct URLs using heritrix3, as before, to reduce
  later load on SPN
- run ingest script over the above SQL output. ingest first hits CDX/wayback,
  and falls back to SPNv2 (brozzler) for "hard" requests, or based on URL.
  ingest worker handles file metadata, GROBID, any other processing. results go
  to kafka, then SQL table
- either do a bulk fatcat import (via join query), or just have workers
  continuously import into fatcat from kafka ingest feed (with various quality
  checks)
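
As a rough sketch of the first step of the new plan, the function below turns one unpaywall-style record into ingest requests. The input field names (`doi`, `oa_locations`, `url_for_pdf`, `url`, `version`) and the version-to-`release_stage` mapping are assumptions for illustration, not a confirmed description of the dump format.

```python
from typing import Any, Dict, Iterator

# Assumed mapping from the dump's version labels to release stages.
VERSION_TO_STAGE = {
    "publishedVersion": "published",
    "acceptedVersion": "accepted",
    "submittedVersion": "submitted",
}


def unpaywall_to_requests(record: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one ingest request per distinct URL in a single dump record."""
    doi = (record.get("doi") or "").lower()
    if not doi:
        return
    seen = set()
    for loc in record.get("oa_locations") or []:
        # Prefer a direct PDF link, fall back to the landing page URL.
        url = loc.get("url_for_pdf") or loc.get("url")
        if not url or url in seen:
            continue
        seen.add(url)
        request = {
            "ingest_type": "pdf",
            "base_url": url,
            "link_source": "unpaywall",
            "link_source_id": doi,
            "ext_ids": {"doi": doi},
        }
        stage = VERSION_TO_STAGE.get(loc.get("version"))
        if stage:
            request["release_stage"] = stage
        yield request
```

Each yielded dict can then be loaded into the request table or pushed to a queue; nothing here depends on how or when the crawl happens.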

## Request/Response Schema

For now, plan is to have a single request type, and multiple similar but
separate result types, depending on the ingest type (file, fileset,
webcapture). The initial use case is single file PDF ingest.

NOTE: what about crawl requests where we don't know if we will get a PDF or
HTML? Or both? Let's just recrawl.

*IngestRequest*
  - `ingest_type`: required, one of `pdf`, `xml`, `html`, `dataset`. For
    backwards compatibility, `file` should be interpreted as `pdf`. `pdf` and
    `xml` return file ingest response; `html` and `dataset` not implemented but
    would be webcapture (wayback) and fileset (archive.org item or wayback?).
    In the future: `epub`, `video`, `git`, etc.
  - `base_url`: required, where to start crawl process
  - `link_source`: recommended, slug string. indicates the database or "authority"
    where URL/identifier match is coming from (eg, `doi`, `pmc`, `unpaywall`
    (doi), `s2` (semantic-scholar id), `spn` (fatcat release), `core` (CORE
    id), `mag` (MAG id))
  - `link_source_id`: recommended, identifier string. pairs with `link_source`.
  - `ingest_request_source`: recommended, slug string. tracks the service or
    user who submitted request. eg, `fatcat-changelog`, `editor_<ident>`,
    `savepapernow-web`
  - `release_stage`: optional. indicates the release stage of fulltext expected
    to be found at this URL
  - `fatcat`
    - `release_ident`: optional. if provided, indicates that ingest is expected
      to be fulltext copy of this release (though may be a sibling release
      under same work if `release_stage` doesn't match)
    - `work_ident`: optional, unused. might eventually be used if, eg,
      `release_stage` of ingested file doesn't match that of the `release_ident`
    - `edit_extra`: additional metadata to be included in any eventual fatcat
      commits.
  - `ext_ids`: matching fatcat schema. used for later lookups. sometimes
    `link_source` and id are sufficient.
    - `doi`
    - `pmcid`
    - ...
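
A minimal sketch of how a worker might check and normalize an incoming request against the fields above. The function name `normalize_request` is hypothetical, and rejecting a `link_source` without a `link_source_id` (or the reverse) is an assumption drawn from the two fields being described as a pair.

```python
from typing import Any, Dict

INGEST_TYPES = {"pdf", "xml", "html", "dataset"}


def normalize_request(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Return a cleaned copy of an ingest request, or raise ValueError."""
    request = dict(raw)
    ingest_type = request.get("ingest_type")
    # Backwards compatibility: `file` is interpreted as `pdf`.
    if ingest_type == "file":
        ingest_type = "pdf"
    if ingest_type not in INGEST_TYPES:
        raise ValueError(f"unknown ingest_type: {ingest_type!r}")
    request["ingest_type"] = ingest_type
    if not request.get("base_url"):
        raise ValueError("base_url is required")
    # Assumption: `link_source` and `link_source_id` must come together.
    if bool(request.get("link_source")) != bool(request.get("link_source_id")):
        raise ValueError("link_source and link_source_id must be given together")
    return request
```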

*FileIngestResult*
  - `request` (object): the full IngestRequest, copied
  - `status` (slug): 'success', 'error', etc
  - `hit` (boolean): whether we got something that looks like what was requested
  - `terminal` (object): last crawled resource (if any)
    - `terminal_url` (string; formerly `url`)
    - `terminal_dt` (string): wayback capture datetime
    - `terminal_status_code`
    - `terminal_sha1hex`: should match true `file_meta` SHA1 (not necessarily CDX SHA1)
      (in case of transport encoding difference)
  - `file_meta` (object): info about the terminal file
    - same schema as sandcrawler-db table
    - `size_bytes`
    - `md5hex`
    - `sha1hex`
    - `sha256hex`
    - `mimetype`: if not known, `application/octet-stream`
  - `cdx`: CDX record matching terminal resource. *MAY* be a revisit or partial
    record (eg, if via SPNv2)
    - same schema as sandcrawler-db table
  - `revisit_cdx` (optional): if `cdx` is a revisit record, this will be the
    best "original" location for retrieval of the body (matching `file_meta`)
    - same schema as sandcrawler-db table
  - `grobid`
    - same schema as sandcrawler-db table
    - `status` (string)
    - `status_code` (int)
    - `grobid_version` (string, from metadata)
    - `fatcat_release` (string, from metadata)
    - `metadata` (JSON) (with `grobid_version` and `fatcat_release` removed)
    - NOT `tei_xml` (strip from reply)
    - NOT `file_meta` (strip from reply)
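
For illustration, the `file_meta` object could be computed from the fetched body roughly as below. The helper name `gen_file_meta` is hypothetical, and the mimetype check is a placeholder that only recognizes the PDF magic bytes; a real worker would presumably use a proper file-type detector.

```python
import hashlib
from typing import Dict, Union


def gen_file_meta(blob: bytes) -> Dict[str, Union[int, str]]:
    """Compute size, hashes, and a best-effort mimetype for a response body."""
    # Placeholder sniffing (an assumption): anything not starting with the
    # PDF magic bytes is reported as unknown.
    if blob.startswith(b"%PDF-"):
        mimetype = "application/pdf"
    else:
        mimetype = "application/octet-stream"
    return {
        "size_bytes": len(blob),
        "md5hex": hashlib.md5(blob).hexdigest(),
        "sha1hex": hashlib.sha1(blob).hexdigest(),
        "sha256hex": hashlib.sha256(blob).hexdigest(),
        "mimetype": mimetype,
    }
```

The `sha1hex` computed this way is the "true" SHA1 of the body, which is what `terminal_sha1hex` is meant to match.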

In general, the `terminal_dt` and `terminal_url` fields, not the `cdx` record,
should be used to construct wayback links (eg, for insertion to fatcat).
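
A small sketch of that link construction, assuming the public `web.archive.org/web/<timestamp>/<url>` replay pattern and a 14-digit `terminal_dt`. The helper name `wayback_url` is hypothetical.

```python
def wayback_url(terminal: dict) -> str:
    """Build a replay link from the `terminal` object of an ingest result."""
    dt = terminal["terminal_dt"]
    # Assumption: capture datetimes are 14 digits, YYYYMMDDHHMMSS.
    if len(dt) != 14 or not dt.isdigit():
        raise ValueError(f"unexpected terminal_dt: {dt!r}")
    return f"https://web.archive.org/web/{dt}/{terminal['terminal_url']}"
```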

## New SQL Tables

Sandcrawler should persist status about:

- claimed locations (links) to fulltext copies of in-scope works, from indexes
  like unpaywall, MAG, semantic scholar, CORE
    - with enough context to help insert into fatcat if works are crawled and
      found. eg, external identifier that is indexed in fatcat, and
      release-stage
- state of attempting to crawl all such links
    - again, enough to insert into fatcat
    - also info about when/how crawl happened, particularly for failures, so we
      can do retries

Proposing two tables:

    -- source/source_id examples:
    --  unpaywall / doi
    --  mag / mag_id
    --  core / core_id
    --  s2 / semanticscholar_id
    --  doi / doi (for any base_url which is just https://doi.org/10..., regardless of why enqueued)
    --  pmc / pmcid (for any base_url like europepmc.org, regardless of why enqueued)
    --  arxiv / arxiv_id (for any base_url like arxiv.org, regardless of why enqueued)
    CREATE TABLE IF NOT EXISTS ingest_request (
        -- conceptually: source, source_id, ingest_type, url
        -- but we use this order for PRIMARY KEY so we have a free index on type/URL
        ingest_type             TEXT NOT NULL CHECK (octet_length(ingest_type) >= 1),
        base_url                TEXT NOT NULL CHECK (octet_length(base_url) >= 1),
        link_source             TEXT NOT NULL CHECK (octet_length(link_source) >= 1),
        link_source_id          TEXT NOT NULL CHECK (octet_length(link_source_id) >= 1),

        created                 TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL,
        release_stage           TEXT CHECK (octet_length(release_stage) >= 1),
        request                 JSONB,
        -- request isn't required, but can stash extra fields there for import, eg:
        --   ext_ids (source/source_id sometimes enough)
        --   release_ident (if ext_ids and source/source_id not specific enough; eg SPN)
        --   edit_extra
        -- ingest_request_source   TEXT NOT NULL CHECK (octet_length(ingest_request_source) >= 1),

        PRIMARY KEY (ingest_type, base_url, link_source, link_source_id)
    );

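    CREATE TABLE IF NOT EXISTS ingest_file_result (
        ingest_type             TEXT NOT NULL CHECK (octet_length(ingest_type) >= 1),
        base_url                TEXT NOT NULL CHECK (octet_length(base_url) >= 1),

        updated                 TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL,
        hit                     BOOLEAN NOT NULL,
        status                  TEXT,
        terminal_url            TEXT,
        terminal_dt             TEXT,
        terminal_status_code    INT,
        terminal_sha1hex        TEXT,

        PRIMARY KEY (ingest_type, base_url)
    );
    -- secondary indexes on terminal_url and terminal_sha1hex (index names are placeholders)
    CREATE INDEX IF NOT EXISTS ingest_file_result_terminal_url_idx ON ingest_file_result(terminal_url);
    CREATE INDEX IF NOT EXISTS ingest_file_result_terminal_sha1hex_idx ON ingest_file_result(terminal_sha1hex);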
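
To make the "only new rows" and "not yet ingested" steps concrete, here is a sketch using in-memory SQLite as a stand-in for the real database. The tables are cut down to the key columns, and `INSERT OR IGNORE` plays the role that `INSERT ... ON CONFLICT DO NOTHING` would play in PostgreSQL. The function names are hypothetical.

```python
import sqlite3


def load_requests(db, rows):
    """Insert requests, silently skipping rows whose primary key already exists."""
    db.executemany(
        "INSERT OR IGNORE INTO ingest_request"
        " (ingest_type, base_url, link_source, link_source_id) VALUES (?, ?, ?, ?)",
        rows,
    )


def pending_requests(db, link_source):
    """Requests from one source whose (ingest_type, base_url) has no result row yet."""
    return db.execute(
        """
        SELECT req.ingest_type, req.base_url, req.link_source_id
        FROM ingest_request AS req
        LEFT JOIN ingest_file_result AS res
            ON res.ingest_type = req.ingest_type AND res.base_url = req.base_url
        WHERE req.link_source = ? AND res.base_url IS NULL
        ORDER BY req.base_url
        """,
        (link_source,),
    ).fetchall()


db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE ingest_request (
        ingest_type TEXT NOT NULL, base_url TEXT NOT NULL,
        link_source TEXT NOT NULL, link_source_id TEXT NOT NULL,
        PRIMARY KEY (ingest_type, base_url, link_source, link_source_id)
    );
    CREATE TABLE ingest_file_result (
        ingest_type TEXT NOT NULL, base_url TEXT NOT NULL, hit BOOLEAN NOT NULL,
        PRIMARY KEY (ingest_type, base_url)
    );
""")

rows = [
    ("pdf", "https://example.com/a.pdf", "unpaywall", "10.1234/a"),
    ("pdf", "https://example.com/b.pdf", "unpaywall", "10.1234/b"),
]
load_requests(db, rows)
load_requests(db, rows)  # reloading the same dump adds nothing
db.execute("INSERT INTO ingest_file_result VALUES ('pdf', 'https://example.com/a.pdf', 1)")
print(pending_requests(db, "unpaywall"))
```

Because results are keyed on `(ingest_type, base_url)` only, a URL already ingested on behalf of one source is not re-crawled when another source submits it.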

## New Kafka Topics

- `sandcrawler-ENV.ingest-file-requests`
- `sandcrawler-ENV.ingest-file-results`

## Ingest Tool Design

The basics of the ingest tool are to:

- use native wayback python library to do fast/efficient lookups and redirect
  lookups
- starting from base-url, do a fetch to either target resource or landing page:
  follow redirects, at terminus should have both CDX metadata and response body
    - if no capture, or most recent is too old (based on request param), do
      SPNv2 (brozzler) fetches before wayback lookups
- if looking for PDF but got landing page (HTML), try to extract a PDF link
  from HTML using various tricks, then do another fetch. limit this
  recursion/spidering to just landing page (or at most one or two additional
  hops)
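
The steps above can be sketched as a small loop with the fetchers passed in as plain callables. Everything here is a sketch under assumptions: `lookup_wayback` and `save_page_now` are hypothetical stand-ins for the real wayback and SPNv2 clients, the status strings are made up, only the `citation_pdf_url` meta tag is tried, and the "capture too old" check is left out.

```python
from html.parser import HTMLParser
from typing import Callable, Optional, Tuple
from urllib.parse import urljoin

# A fetcher takes a URL and returns (terminal_url, mimetype, body), or None.
Fetcher = Callable[[str], Optional[Tuple[str, str, bytes]]]


class _PdfLinkParser(HTMLParser):
    """Collects the first <meta name="citation_pdf_url" content="..."> value."""

    def __init__(self) -> None:
        super().__init__()
        self.pdf_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and attr.get("name") == "citation_pdf_url" and not self.pdf_url:
            self.pdf_url = attr.get("content")


def extract_pdf_url(html_url: str, body: bytes) -> Optional[str]:
    """Return an absolute PDF URL found in a landing page, if any."""
    parser = _PdfLinkParser()
    parser.feed(body.decode("utf-8", errors="replace"))
    return urljoin(html_url, parser.pdf_url) if parser.pdf_url else None


def ingest_pdf(base_url: str, lookup_wayback: Fetcher, save_page_now: Fetcher,
               max_hops: int = 2) -> dict:
    """Fetch base_url, following at most `max_hops` extracted PDF links."""
    url = base_url
    for _ in range(max_hops + 1):
        # Prefer an existing capture; only crawl when wayback has nothing.
        resource = lookup_wayback(url) or save_page_now(url)
        if resource is None:
            return {"hit": False, "status": "no-capture", "terminal_url": url}
        terminal_url, mimetype, body = resource
        if mimetype == "application/pdf":
            return {"hit": True, "status": "success", "terminal_url": terminal_url}
        next_url = extract_pdf_url(terminal_url, body) if mimetype == "text/html" else None
        if not next_url:
            return {"hit": False, "status": "no-pdf-link", "terminal_url": terminal_url}
        url = next_url
    return {"hit": False, "status": "too-many-hops", "terminal_url": url}
```

Passing the fetchers in as arguments keeps the crawl logic testable with fakes and leaves the choice of wayback, SPN, or a direct HTTP client to the caller.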

Note that if we pre-crawled with heritrix3 (with `citation_pdf_url` link
following), then in the large majority of simple cases we should find the PDF
already captured in wayback, with no SPNv2 fetch needed.

## Design Issues

### Open Questions

Do direct aggregator/repository crawls need to go through this process? Eg
arxiv.org or pubmed central. I guess so, otherwise how do we get full file
metadata (size, other hashes)?

When recording hit status for a URL (ingest result), is that status dependent
on the crawl context? Eg, for save-paper-now we might want to require GROBID.
Semantics of `hit` should probably be consistent: if we got the filetype
expected based on type, not whether we would actually import to fatcat.
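
One possible reading of that `hit` semantics, as a sketch: `hit` depends only on whether the terminal file's mimetype fits the requested `ingest_type`, and never on GROBID status or import eligibility. The mimetype sets below are assumptions, and `is_hit` is a hypothetical name.

```python
# Assumed mimetypes per ingest type; `dataset` is left out on purpose.
EXPECTED_MIMETYPES = {
    "pdf": {"application/pdf"},
    "xml": {"application/xml", "text/xml"},
    "html": {"text/html", "application/xhtml+xml"},
}


def is_hit(ingest_type: str, file_meta: dict) -> bool:
    """True if the fetched file looks like the type that was requested."""
    return file_meta.get("mimetype") in EXPECTED_MIMETYPES.get(ingest_type, set())
```

Context-specific requirements (eg, requiring GROBID for save-paper-now) would then be applied by the consumer of the result rather than encoded in `hit`.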

Where to include knowledge about, eg, single-page abstract PDFs being bogus? Do
we just block crawling, set an ingest result status, or only filter at fatcat
import time? Definitely need to filter at fatcat import time to make sure
things don't slip through elsewhere.

### Yet Another PDF Harvester

This system could result in "yet another" set of publisher-specific heuristics
and hacks to crawl publicly available papers. Related existing work includes
[unpaywall's crawler][unpaywall_crawl], LOCKSS extraction code, dissem.in's
efforts, zotero's bibliography extractor, etc. The
["memento tracer"][memento_tracer] work is also similar. Many of these are even
in python! It would be great to reduce duplicated work and maintenance. An
analogous system in the wild is youtube-dl for downloading video from many
sources.

[unpaywall_crawl]: https://github.com/ourresearch/oadoi/blob/master/webpage.py
[memento_tracer]: http://tracer.mementoweb.org/

One argument against this would be that our use-case is closely tied to
save-page-now, wayback, and the CDX API. However, a properly modular
implementation of a paper downloader would allow components to be re-used, and
perhaps dependency injection for things like HTTP fetches to allow use of SPN
or similar. Another argument for modularity would be support for headless
crawling (eg, brozzler).

Note that this is an internal implementation detail; the ingest API would
abstract all this.

## Test Examples

Some example works that are difficult to crawl. Should have mechanisms to crawl
and unit tests for all these.

- <https://pubs.acs.org>
- <https://linkinghub.elsevier.com> / <https://sciencedirect.com>
- <https://www.osapublishing.org/captcha/?guid=39B0E947-C0FC-B5D8-2C0C-CCF004FF16B8>
- <https://utpjournals.press/action/cookieAbsent>
- <https://academic.oup.com/jes/article/3/Supplement_1/SUN-203/5484104>
- <http://www.jcancer.org/v10p4038.htm>