status: work-in-progress

This document proposes structure and systems for ingesting (crawling) paper
PDFs and other content as part of sandcrawler.

## Overview

The main abstraction is a sandcrawler "ingest request" object, which can be
created and submitted to one of several systems for automatic harvesting,
resulting in an "ingest result" metadata object. This result should contain
enough metadata to be automatically imported into fatcat as a file/release
mapping.

The structure and pipelines should be flexible enough to work with individual
PDF files, web captures, and datasets. It should work for on-demand
(interactive) ingest (for "save paper now" features), soft-real-time
(hourly/daily/queued) ingest, batches of hundreds or thousands of requests,
and scale up to batch ingest crawls of tens of millions of URLs. Most code
should not care about how or when content is actually crawled.

The motivation for this structure is to consolidate and automate the current
ad hoc systems for crawling, matching, and importing into fatcat. It is likely
that there will still be a few special cases with their own importers, but the
goal is that in almost all cases, when we discover a new structured source of
content to ingest (eg, a new manifest of identifiers to URLs), we can quickly
transform the task into a list of ingest requests, then submit those requests
to an automated system to have them archived and inserted into fatcat with as
little manual effort as possible.

## Use Cases and Workflows

### Unpaywall Example

As a motivating example, consider how unpaywall crawls are done today:

- download and archive JSON dump from unpaywall. transform and filter into a
  TSV with DOI, URL, release-stage columns.
- filter out previously crawled URLs from this seed file, based on the last
  dump, with the intent of not repeating crawls unnecessarily
- run heritrix3 crawl, usually by sharding the seedlist over multiple
  machines. after the crawl completes:
    - backfill the CDX PDF subset into hbase (for future de-dupe)
    - generate CRL files etc and upload to archive items
- run arabesque over the complete crawl logs. this takes time, is somewhat
  manual, and has scaling issues past a few million seeds
- depending on source/context, run fatcat import with arabesque results
- periodically run GROBID (and other transforms) over all new harvested files

Issues with this are:

- seedlist generation and the arabesque step are toilsome (manual), and
  arabesque likely has metadata issues or otherwise "leaks" content
- the brozzler pipeline is entirely separate
- results in re-crawls of content already in wayback, in particular links
  shared between large corpuses

New plan:

- download dump, filter, transform into ingest requests (mostly the same as
  before; see the sketch after this list)
- load into an ingest-request SQL table. only new rows (unique by source,
  type, and URL) are loaded. run a SQL query for new rows from the source with
  URLs that have not been ingested
- (optional) pre-crawl bulk/direct URLs using heritrix3, as before, to reduce
  later load on SPN
- run the ingest script over the above SQL output. ingest first hits
  CDX/wayback, and falls back to SPNv2 (brozzler) for "hard" requests, or
  based on URL. the ingest worker handles file metadata, GROBID, and any other
  processing. results go to kafka, then to the SQL table
- either do a bulk fatcat import (via join query), or just have workers
  continuously import into fatcat from the kafka ingest feed (with various
  quality checks)
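
To make the first step concrete, here is a minimal sketch (not production
code) of the dump-to-request transform, emitting one ingest request JSON
object per line for loading into the SQL table. The unpaywall field names
(`oa_locations`, `url_for_pdf`, `version`, `oa_status`) and the
version-to-`release_stage` mapping are assumptions about the dump format, not
something this proposal defines:

    import json
    import sys

    # assumed mapping from unpaywall "version" values to fatcat release_stage
    RELEASE_STAGE_MAP = {
        "publishedVersion": "published",
        "acceptedVersion": "accepted",
        "submittedVersion": "submitted",
    }

    def transform_row(row):
        """Yield zero or more ingest request dicts for one unpaywall dump row."""
        doi = row.get("doi")
        for loc in row.get("oa_locations") or []:
            url = loc.get("url_for_pdf") or loc.get("url")
            if not doi or not url:
                continue
            yield {
                "ingest_type": "pdf",
                "base_url": url,
                "link_source": "unpaywall",
                "link_source_id": doi,
                "ingest_request_source": "unpaywall-dump",
                "release_stage": RELEASE_STAGE_MAP.get(loc.get("version")),
                "oa_status": row.get("oa_status"),
                "ext_ids": {"doi": doi},
            }

    if __name__ == "__main__":
        # read unpaywall JSON dump lines on stdin, write ingest requests on stdout
        for line in sys.stdin:
            if not line.strip():
                continue
            for request in transform_row(json.loads(line)):
                print(json.dumps(request, sort_keys=True))
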

## Request/Response Schema

For now, the plan is to have a single request type, and multiple similar but
separate result types, depending on the ingest type (file, fileset,
webcapture). The initial use case is single file PDF ingest. An example
request is sketched after the field list below.

NOTE: what about crawl requests where we don't know if we will get a PDF or
HTML? Or both? Let's just recrawl.

*IngestRequest*
  - `ingest_type`: required, one of `pdf`, `xml`, `html`, `dataset`. For
    backwards compatibility, `file` should be interpreted as `pdf`. `pdf` and
    `xml` return a file ingest response; `html` and `dataset` are not
    implemented but would be webcapture (wayback) and fileset (archive.org
    item or wayback?). In the future: `epub`, `video`, `git`, etc.
  - `base_url`: required, where to start the crawl process
  - `link_source`: recommended, slug string. indicates the database or
    "authority" where the URL/identifier match is coming from (eg, `doi`,
    `pmc`, `unpaywall` (doi), `s2` (semantic-scholar id), `spn` (fatcat
    release), `core` (CORE id), `mag` (MAG id))
  - `link_source_id`: recommended, identifier string. pairs with `link_source`.
  - `ingest_request_source`: recommended, slug string. tracks the service or
    user who submitted the request. eg, `fatcat-changelog`, `editor_<ident>`,
    `savepapernow-web`
  - `release_stage`: optional. indicates the release stage of the fulltext
    expected to be found at this URL
  - `rel`: optional. indicates the link type
  - `force_recrawl`: optional. if true, will always use SPNv2 (won't check
    wayback first)
  - `oa_status`: optional. unpaywall schema
  - `edit_extra`: additional metadata to be included in any eventual fatcat
    commits
  - `fatcat`
    - `release_ident`: optional. if provided, indicates that the ingest is
      expected to be a fulltext copy of this release (though it may be a
      sibling release under the same work if `release_stage` doesn't match)
    - `work_ident`: optional, unused. might eventually be used if, eg, the
      `release_stage` of the ingested file doesn't match that of the
      `release_ident`
  - `ext_ids`: matching fatcat schema. used for later lookups. sometimes
    `link_source` and id are sufficient.
    - `doi`
    - `pmcid`
    - ...
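
As an illustration (not a normative example), a single request submitted by
the fatcat-changelog worker might look like the following; the DOI and release
ident are made-up placeholders:

    # Hypothetical IngestRequest, written as a Python dict for readability;
    # on the wire this would be a JSON object.
    example_request = {
        "ingest_type": "pdf",
        "base_url": "https://doi.org/10.1234/example-doi",
        "link_source": "doi",
        "link_source_id": "10.1234/example-doi",
        "ingest_request_source": "fatcat-changelog",
        "release_stage": "published",
        "fatcat": {
            "release_ident": "aaaaaaaaaaaaarceaaaaaaaaai",
        },
        "ext_ids": {
            "doi": "10.1234/example-doi",
        },
    }
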

*FileIngestResult*
  - `request` (object): the full IngestRequest, copied
  - `status` (slug): 'success', 'error', etc
  - `hit` (boolean): whether we got something that looks like what was requested
  - `terminal` (object): last crawled resource (if any)
    - `terminal_url` (string; formerly `url`)
    - `terminal_dt` (string): wayback capture datetime (string)
    - `terminal_status_code`
    - `terminal_sha1hex`: should match the true `file_meta` SHA1 (not
      necessarily the CDX SHA1, in case of transport encoding differences)
  - `file_meta` (object): info about the terminal file
    - same schema as sandcrawler-db table
    - `size_bytes`
    - `md5hex`
    - `sha1hex`
    - `sha256hex`
    - `mimetype`: if not known, `application/octet-stream`
  - `cdx`: CDX record matching the terminal resource. *MAY* be a revisit or
    partial record (eg, if via SPNv2)
    - same schema as sandcrawler-db table
  - `revisit_cdx` (optional): if `cdx` is a revisit record, this will be the
    best "original" location for retrieval of the body (matching `file_meta`)
    - same schema as sandcrawler-db table
  - `grobid`
    - same schema as sandcrawler-db table
    - `status` (string)
    - `status_code` (int)
    - `grobid_version` (string, from metadata)
    - `fatcat_release` (string, from metadata)
    - `metadata` (JSON) (with `grobid_version` and `fatcat_release` removed)
    - NOT `tei_xml` (stripped from the reply)
    - NOT `file_meta` (stripped from the reply)

In general, it is `terminal_dt` and `terminal_url` that should be used to
construct wayback links (eg, for insertion to fatcat), not the `cdx` record.

## New SQL Tables

Sandcrawler should persist status about:

- claimed locations (links) to fulltext copies of in-scope works, from indexes
  like unpaywall, MAG, semantic scholar, CORE
    - with enough context to help insert into fatcat if works are crawled and
      found. eg, an external identifier that is indexed in fatcat, and the
      release-stage
- the state of attempts to crawl all such links
    - again, enough to insert into fatcat
    - also info about when/how the crawl happened, particularly for failures,
      so we can do retries

Proposing two tables:

    -- source/source_id examples:
    --  unpaywall / doi
    --  mag / mag_id
    --  core / core_id
    --  s2 / semanticscholar_id
    --  doi / doi (for any base_url which is just https://doi.org/10..., regardless of why enqueued)
    --  pmc / pmcid (for any base_url like europmc.org, regardless of why enqueued)
    --  arxiv / arxiv_id (for any base_url like arxiv.org, regardless of why enqueued)
    CREATE TABLE IF NOT EXISTS ingest_request (
        -- conceptually: source, source_id, ingest_type, url
        -- but we use this order for PRIMARY KEY so we have a free index on type/URL
        ingest_type             TEXT NOT NULL CHECK (octet_length(ingest_type) >= 1),
        base_url                TEXT NOT NULL CHECK (octet_length(base_url) >= 1),
        link_source             TEXT NOT NULL CHECK (octet_length(link_source) >= 1),
        link_source_id          TEXT NOT NULL CHECK (octet_length(link_source_id) >= 1),

        created                 TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL,
        release_stage           TEXT CHECK (octet_length(release_stage) >= 1),
        request                 JSONB,
        -- request isn't required, but can stash extra fields there for import, eg:
        --   ext_ids (source/source_id sometimes enough)
        --   release_ident (if ext_ids and source/source_id not specific enough; eg SPN)
        --   edit_extra
        --   rel
        --   oa_status
        -- ingest_request_source   TEXT NOT NULL CHECK (octet_length(ingest_request_source) >= 1),

        PRIMARY KEY (ingest_type, base_url, link_source, link_source_id)
    );

    CREATE TABLE IF NOT EXISTS ingest_file_result (
        ingest_type             TEXT NOT NULL CHECK (octet_length(ingest_type) >= 1),
        base_url                TEXT NOT NULL CHECK (octet_length(base_url) >= 1),

        updated                 TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL,
        hit                     BOOLEAN NOT NULL,
        status                  TEXT,
        terminal_url            TEXT,   -- secondary index wanted
        terminal_dt             TEXT,
        terminal_status_code    INT,
        terminal_sha1hex        TEXT,   -- secondary index wanted

        PRIMARY KEY (ingest_type, base_url)
    );

## New Kafka Topics

- `sandcrawler-ENV.ingest-file-requests`
- `sandcrawler-ENV.ingest-file-results`
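
To tie the tables and topics together, here is a minimal sketch of selecting
not-yet-ingested requests for one source and feeding them to the request
topic. It assumes PostgreSQL access via `psycopg2` and Kafka via
`confluent_kafka`; the library choices, environment variables, DSN, and broker
address are illustrative, not decided by this proposal:

    import json
    import os

    import psycopg2
    from confluent_kafka import Producer

    # placeholder configuration; real values are deployment details
    ENV = os.environ.get("SANDCRAWLER_ENV", "qa")
    DSN = os.environ.get("SANDCRAWLER_DB", "dbname=sandcrawler")
    TOPIC = "sandcrawler-{}.ingest-file-requests".format(ENV)

    # requests from one source that have no ingest_file_result row yet
    UNINGESTED_SQL = """
        SELECT ingest_request.ingest_type,
               ingest_request.base_url,
               ingest_request.link_source,
               ingest_request.link_source_id,
               ingest_request.release_stage
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE ingest_request.link_source = %s
          AND ingest_file_result.base_url IS NULL
    """

    def enqueue_uningested(link_source):
        producer = Producer({"bootstrap.servers": "localhost:9092"})
        with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
            cur.execute(UNINGESTED_SQL, (link_source,))
            for (ingest_type, base_url, source, source_id, stage) in cur:
                request = {
                    "ingest_type": ingest_type,
                    "base_url": base_url,
                    "link_source": source,
                    "link_source_id": source_id,
                    "release_stage": stage,
                }
                producer.produce(TOPIC, json.dumps(request).encode("utf-8"))
        producer.flush()

    if __name__ == "__main__":
        enqueue_uningested("unpaywall")
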

## Ingest Tool Design

The basics of the ingest tool are to:

- use the native wayback python library to do fast/efficient lookups and
  redirect lookups
- starting from the base-url, do a fetch of either the target resource or a
  landing page: follow redirects, and at the terminus have both CDX metadata
  and the response body
    - if there is no capture, or the most recent one is too old (based on a
      request param), do SPNv2 (brozzler) fetches before wayback lookups
- if looking for a PDF but we got a landing page (HTML), try to extract a PDF
  link from the HTML using various tricks, then do another fetch. limit this
  recursion/spidering to just the landing page (or at most one or two
  additional hops)

Note that if we pre-crawled with heritrix3 (with `citation_pdf_url` link
following), then in the large majority of simple cases we expect the content
to already be in wayback, and SPNv2 should rarely be needed.

## Design Issues

### Open Questions

Do direct aggregator/repository crawls need to go through this process? Eg,
arxiv.org or pubmed central. I guess so, otherwise how do we get full file
metadata (size, other hashes)?

When recording hit status for a URL (ingest result), is that status dependent
on the crawl context? Eg, for save-paper-now we might want to require GROBID.
Semantics of `hit` should probably be consistent: whether we got the filetype
expected based on the ingest type, not whether we would actually import to
fatcat.

Where to include knowledge about, eg, single-page abstract PDFs being bogus?
Do we just block crawling, set an ingest result status, or only filter at
fatcat import time? Definitely need to filter at fatcat import time to make
sure things don't slip through elsewhere.

### Yet Another PDF Harvester

This system could result in "yet another" set of publisher-specific heuristics
and hacks to crawl publicly available papers. Related existing work includes
[unpaywall's crawler][unpaywall_crawl], LOCKSS extraction code, dissem.in's
efforts, zotero's bibliography extractor, etc. The
[memento tracer][memento_tracer] work is also similar. Many of these are even
in python! It would be great to reduce duplicated work and maintenance. An
analogous system in the wild is youtube-dl for downloading video from many
sources.

[unpaywall_crawl]: https://github.com/ourresearch/oadoi/blob/master/webpage.py
[memento_tracer]: http://tracer.mementoweb.org/

One argument against this would be that our use-case is closely tied to
save-page-now, wayback, and the CDX API. However, a properly modular
implementation of a paper downloader would allow components to be re-used, and
perhaps dependency injection for things like HTTP fetches to allow use of SPN
or similar. Another argument for modularity would be support for headless
crawling (eg, brozzler).

Note that this is an internal implementation detail; the ingest API would
abstract all this.

## Test Examples

Some example works that are difficult to crawl. We should have mechanisms to
crawl all of these, and unit tests for them (a test harness sketch follows the
list).

- <https://pubs.acs.org>
- <https://linkinghub.elsevier.com> / <https://sciencedirect.com>
- <https://www.osapublishing.org/captcha/?guid=39B0E947-C0FC-B5D8-2C0C-CCF004FF16B8>
- <https://utpjournals.press/action/cookieAbsent>
- <https://academic.oup.com/jes/article/3/Supplement_1/SUN-203/5484104>
- <http://www.jcancer.org/v10p4038.htm>
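
As a sketch only of what such tests might look like with pytest: the
`ingest_file` entrypoint, its module path, and the shape of its return value
are hypothetical placeholders (no such API is defined by this proposal), and
these would be slow network/integration tests rather than true unit tests:

    import pytest

    # hypothetical entrypoint; the real module path and signature are not yet defined
    from sandcrawler.ingest import ingest_file

    # taken from the "Test Examples" list above; some entries are bare hosts
    # and would need real article URLs substituted
    HARD_CASES = [
        "https://pubs.acs.org",
        "https://linkinghub.elsevier.com",
        "https://www.osapublishing.org/captcha/?guid=39B0E947-C0FC-B5D8-2C0C-CCF004FF16B8",
        "https://utpjournals.press/action/cookieAbsent",
        "https://academic.oup.com/jes/article/3/Supplement_1/SUN-203/5484104",
        "http://www.jcancer.org/v10p4038.htm",
    ]

    @pytest.mark.slow
    @pytest.mark.parametrize("base_url", HARD_CASES)
    def test_hard_case_ingest(base_url):
        # expect a PDF hit for each difficult publisher case
        result = ingest_file({"ingest_type": "pdf", "base_url": base_url})
        assert result["hit"] is True
        assert result["file_meta"]["mimetype"] == "application/pdf"
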