author     Bryan Newbold <bnewbold@archive.org>  2019-12-11 15:20:23 -0800
committer  Bryan Newbold <bnewbold@archive.org>  2019-12-11 15:22:53 -0800
commit     e6983247ee6f3b02a8c2fa74d5f09a4440d7511f (patch)
tree       086b36cc1385d13f16e99bc421cf2e6e56065f42
parent     a49ac726f2c42fcd1bcb6b1882a2d305a1f198e9 (diff)
update ingest proposal
-rw-r--r--  proposals/2019_ingest.md  156
1 file changed, 145 insertions, 11 deletions
diff --git a/proposals/2019_ingest.md b/proposals/2019_ingest.md
index bfe16f4..a631811 100644
--- a/proposals/2019_ingest.md
+++ b/proposals/2019_ingest.md
@@ -28,6 +28,50 @@ transform the task into a list of ingest requests, then submit those requests
to an automated system to have them archived and inserted into fatcat with as
little manual effort as possible.
+## Use Cases and Workflows
+
+### Unpaywall Example
+
+As a motivating example, consider how unpaywall crawls are done today:
+
+- download and archive JSON dump from unpaywall. transform and filter into a
+ TSV with DOI, URL, release-stage columns.
+- filter out previously crawled URLs from this seed file, based on last dump,
+ with the intent of not repeating crawls unnecessarily
+- run heritrix3 crawl, usually by sharding seedlist over multiple machines.
+ after crawl completes:
+ - backfill CDX PDF subset into hbase (for future de-dupe)
+ - generate CRL files etc and upload to archive items
+- run arabesque over complete crawl logs. this takes time, is somewhat manual,
+ and has scaling issues past a few million seeds
+- depending on source/context, run fatcat import with arabesque results
+- periodically run GROBID (and other transforms) over all new harvested files
+
+Issues with this are:
+
+- the seedlist generation and arabesque steps are toilsome (manual), and arabesque
+ likely has metadata issues or otherwise "leaks" content
+- brozzler pipeline is entirely separate
+- results in re-crawls of content already in wayback, in particular for links
+  shared between large corpora
+
+New plan:
+
+- download dump, filter, transform into ingest requests (mostly the same as
+ before)
+- load into ingest-request SQL table. only new rows (unique by source, type,
+  and URL) are loaded. then run a SQL query for new rows from the source with
+  URLs that have not yet been ingested (see the sketch after this list)
+- (optional) pre-crawl bulk/direct URLs using heritrix3, as before, to reduce
+ later load on SPN
+- run ingest script over the above SQL output. ingest first hits CDX/wayback,
+ and falls back to SPNv2 (brozzler) for "hard" requests, or based on URL.
+ ingest worker handles file metadata, GROBID, any other processing. results go
+ to kafka, then SQL table
+- either do a bulk fatcat import (via join query), or just have workers
+ continuously import into fatcat from kafka ingest feed (with various quality
+ checks)
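+
+A minimal sketch of the "load only new rows, then query un-ingested URLs"
+steps, assuming the `ingest_request` and `ingest_file_result` tables proposed
+below and the psycopg2 library (all values are placeholders):
+
+    import psycopg2
+
+    conn = psycopg2.connect("dbname=sandcrawler")  # illustrative connection string
+    with conn, conn.cursor() as cur:
+        # duplicate requests (same type/URL/source/source_id) are skipped
+        cur.execute("""
+            INSERT INTO ingest_request
+                (ingest_type, base_url, source, source_id, actor, release_stage)
+            VALUES (%s, %s, %s, %s, %s, %s)
+            ON CONFLICT (ingest_type, base_url, source, source_id) DO NOTHING
+        """, ("pdf", "https://doi.org/10.123/placeholder", "unpaywall",
+              "10.123/placeholder", "bulk-load", "published"))
+
+        # requests from this source which have never been ingested
+        cur.execute("""
+            SELECT req.ingest_type, req.base_url
+            FROM ingest_request req
+            LEFT JOIN ingest_file_result result
+                ON req.ingest_type = result.ingest_type
+                AND req.base_url = result.base_url
+            WHERE req.source = %s
+                AND result.base_url IS NULL
+        """, ("unpaywall",))
+        todo = cur.fetchall()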
+
## Request/Response Schema
For now, plan is to have a single request type, and multiple similar but
@@ -35,13 +79,14 @@ separate result types, depending on the ingest type (file, fileset,
webcapture). The initial use case is single file PDF ingest.
NOTE: what about crawl requests where we don't know if we will get a PDF or
-HTML? Or both?
+HTML? Or both? Let's just recrawl.
*IngestRequest*
- - `ingest_type`: required, one of `file`, `fileset`, or `webcapture`
+ - `ingest_type`: required, one of `pdf`, `xml`, `html`, `dataset`
- `base_url`: required, where to start crawl process
- - `project`/`source`: recommended, slug string. to track where this ingest
- request is coming from
+  - `source`: recommended, slug string indicating the database or "authority"
+    where the URL/identifier match is coming from (eg, `unpaywall`,
+    `semantic-scholar`, `save-paper-now`, `doi`)
+  - `source_id`: recommended, string. the identifier within the `source`
+    namespace (eg, a DOI for `unpaywall`, a MAG id for `mag`)
+  - `actor`: recommended, slug string. tracks the code or user that submitted
+    the request
- `fatcat`
- `release_stage`: optional
- `release_ident`: optional
@@ -52,7 +97,6 @@ HTML? Or both?
- `doi`
- `pmcid`
- ...
- - `expect_mimetypes`:
- `expect_hash`: optional, if we are expecting a specific file
- `sha1`
- ...
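+
+As a concrete example, a single request from an unpaywall dump might look
+like the following (all values are placeholders):
+
+    request = {
+        "ingest_type": "pdf",
+        "base_url": "https://doi.org/10.123/placeholder",
+        "source": "unpaywall",
+        "source_id": "10.123/placeholder",
+        "actor": "unpaywall-bulk-load",   # hypothetical actor slug
+        "fatcat": {
+            "release_stage": "published",
+        },
+        "ext_ids": {
+            "doi": "10.123/placeholder",
+        },
+    }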
@@ -62,7 +106,7 @@ HTML? Or both?
- terminal
- url
- status_code
- - wayback
+ - wayback (XXX: ?)
- datetime
- archive_url
- file_meta (same schema as sandcrawler-db table)
@@ -73,26 +117,116 @@ HTML? Or both?
- mimetype
- cdx (same schema as sandcrawler-db table)
- grobid (same schema as sandcrawler-db table)
- - version
+ - status
+ - grobid_version
- status_code
- xml_url
- - release_id
+ - fatcat_release (via biblio-glutton match)
+ - metadata (JSON)
- status (slug): 'success', 'error', etc
- hit (boolean): whether we got something that looks like what was requested
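+
+A successful PDF ingest result might serialize like the following (field names
+under `file_meta` follow the sandcrawler-db tables; all values are placeholders
+and some fields are omitted):
+
+    result = {
+        "hit": True,
+        "status": "success",
+        "terminal": {
+            "url": "https://publisher.example.com/fulltext.pdf",
+            "status_code": 200,
+        },
+        "wayback": {
+            "datetime": "20191211000000",
+            "archive_url": "https://web.archive.org/web/20191211000000/https://publisher.example.com/fulltext.pdf",
+        },
+        "file_meta": {
+            "mimetype": "application/pdf",
+        },
+        "grobid": {
+            "status": "success",
+            "grobid_version": "0.5.5",       # placeholder version
+            "status_code": 200,
+            "xml_url": "https://example.org/grobid/placeholder.tei.xml",
+            "fatcat_release": None,          # filled in on a biblio-glutton match
+            "metadata": {},
+        },
+    }
+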
-## Result Schema
+## New SQL Tables
+
+Sandcrawler should persist status about:
+
+- claimed locations (links) to fulltext copies of in-scope works, from indexes
+ like unpaywall, MAG, semantic scholar, CORE
+ - with enough context to help insert into fatcat if works are crawled and
+ found. eg, external identifier that is indexed in fatcat, and
+ release-stage
+- state of attempting to crawl all such links
+ - again, enough to insert into fatcat
+ - also info about when/how crawl happened, particularly for failures, so we
+ can do retries
+
+Proposing two tables:
+
+ -- source/source_id examples:
+ -- unpaywall / doi
+ -- mag / mag_id
+ -- core / core_id
+ -- s2 / semanticscholar_id
+ -- save-paper-now / fatcat_release
+ -- doi / doi (for any base_url which is just https://doi.org/10..., regardless of why enqueued)
+    -- pubmed / pmid (for any base_url like europepmc.org, regardless of why enqueued)
+ -- arxiv / arxiv_id (for any base_url like arxiv.org, regardless of why enqueued)
+ CREATE TABLE IF NOT EXISTS ingest_request (
+ -- conceptually: source, source_id, ingest_type, url
+ -- but we use this order for PRIMARY KEY so we have a free index on type/URL
+ ingest_type TEXT NOT NULL CHECK (octet_length(ingest_type) >= 1),
+        base_url TEXT NOT NULL CHECK (octet_length(base_url) >= 1),
+ source TEXT NOT NULL CHECK (octet_length(source) >= 1),
+ source_id TEXT NOT NULL CHECK (octet_length(source_id) >= 1),
+ actor TEXT NOT NULL CHECK (octet_length(actor) >= 1),
-## New API Endpoints
+ created TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL,
+ release_stage TEXT CHECK (octet_length(release_stage) >= 1),
+ request JSONB,
+ -- request isn't required, but can stash extra fields there for import, eg:
+ -- ext_ids (source/source_id sometimes enough)
+ -- fatcat_release (if ext_ids and source/source_id not specific enough; eg SPN)
+ -- edit_extra
+
+ PRIMARY KEY (ingest_type, base_url, source, source_id)
+ );
+
+ CREATE TABLE IF NOT EXISTS ingest_file_result (
+ ingest_type TEXT NOT NULL CHECK (octet_length(ingest_type) >= 1),
+        base_url TEXT NOT NULL CHECK (octet_length(base_url) >= 1),
+
+ updated TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL,
+ hit BOOLEAN NOT NULL,
+        status TEXT,
+        terminal_url TEXT,            -- candidate for a separate index
+        terminal_dt TEXT,
+        terminal_status_code INT,
+        terminal_sha1hex TEXT,        -- candidate for a separate index
+
+        PRIMARY KEY (ingest_type, base_url)
+    );
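+
+Ingest workers (or a separate persist worker consuming from Kafka) would
+upsert rows into `ingest_file_result`. A minimal sketch, assuming psycopg2 and
+placeholder values:
+
+    import psycopg2
+
+    conn = psycopg2.connect("dbname=sandcrawler")  # illustrative connection string
+    with conn, conn.cursor() as cur:
+        cur.execute("""
+            INSERT INTO ingest_file_result
+                (ingest_type, base_url, hit, status, terminal_url, terminal_dt,
+                 terminal_status_code, terminal_sha1hex)
+            VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
+            ON CONFLICT (ingest_type, base_url) DO UPDATE SET
+                updated = now(),
+                hit = EXCLUDED.hit,
+                status = EXCLUDED.status,
+                terminal_url = EXCLUDED.terminal_url,
+                terminal_dt = EXCLUDED.terminal_dt,
+                terminal_status_code = EXCLUDED.terminal_status_code,
+                terminal_sha1hex = EXCLUDED.terminal_sha1hex
+        """, ("pdf", "https://doi.org/10.123/placeholder", True, "success",
+              "https://publisher.example.com/fulltext.pdf", "20191211000000",
+              200, "0000000000000000000000000000000000000000"))
+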
## New Kafka Topics
- `sandcrawler-ENV.ingest-file-requests`
- `sandcrawler-ENV.ingest-file-results`
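+
+A sketch of enqueueing a single ingest request, assuming the kafka-python
+client (broker address and the `ENV` value are illustrative):
+
+    import json
+    from kafka import KafkaProducer
+
+    producer = KafkaProducer(
+        bootstrap_servers="localhost:9092",
+        value_serializer=lambda d: json.dumps(d).encode("utf-8"),
+    )
+    # an ingest request dict, as in the schema above (placeholder values)
+    request = {"ingest_type": "pdf", "base_url": "https://doi.org/10.123/placeholder",
+               "source": "unpaywall", "source_id": "10.123/placeholder", "actor": "bulk-load"}
+    producer.send("sandcrawler-qa.ingest-file-requests", request)
+    producer.flush()
+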
-## New Fatcat Features
+## Ingest Tool Design
+
+The basics of the ingest tool are to:
+
+- use native wayback python library to do fast/efficient lookups and redirect
+ lookups
+- starting from base-url, do a fetch to either target resource or landing page:
+ follow redirects, at terminus should have both CDX metadata and response body
+ - if no capture, or most recent is too old (based on request param), do
+ SPNv2 (brozzler) fetches before wayback lookups
+- if looking for PDF but got landing page (HTML), try to extract a PDF link
+ from HTML using various tricks, then do another fetch. limit this
+ recursion/spidering to just landing page (or at most one or two additional
+ hops)
+
+Note that if we pre-crawled with heritrix3 (with `citation_pdf_url` link
+following), then in the large majority of simple cases we should already find
+a recent capture in wayback and not need to fall back to SPNv2 at all.
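+
+A rough sketch of this flow; the helper functions (`cdx_lookup`,
+`spn2_capture`, `fetch_body`, `extract_pdf_link`) are hypothetical stand-ins,
+not existing wayback or SPN APIs:
+
+    # stubs for the hypothetical helpers; real implementations would wrap the
+    # wayback CDX lookup, SPNv2, and HTML link extraction
+    def cdx_lookup(url, follow_redirects=True): raise NotImplementedError
+    def spn2_capture(url): raise NotImplementedError
+    def fetch_body(cdx): raise NotImplementedError
+    def extract_pdf_link(html_body): raise NotImplementedError
+
+    def ingest_file(base_url, best_before=None, hops=1):
+        # most recent wayback capture, following redirects
+        cdx = cdx_lookup(base_url, follow_redirects=True)
+        if cdx is None or (best_before and cdx.datetime < best_before):
+            # no capture, or capture too old: trigger an SPNv2 (brozzler) fetch
+            cdx = spn2_capture(base_url)
+        body, mimetype = fetch_body(cdx)
+        if mimetype == "application/pdf":
+            return {"hit": True, "status": "success", "cdx": cdx}
+        if mimetype.startswith("text/html") and hops > 0:
+            # landing page: try to extract a PDF link, then do one more hop
+            pdf_url = extract_pdf_link(body)
+            if pdf_url:
+                return ingest_file(pdf_url, best_before=best_before, hops=hops - 1)
+        return {"hit": False, "status": "no-pdf-link", "cdx": cdx}
+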
## Design Issues
+### Open Questions
+
+Do direct aggregator/repository crawls need to go through this process? Eg
+arxiv.org or pubmed. I guess so, otherwise how do we get full file metadata
+(size, other hashes)?
+
+When recording hit status for a URL (ingest result), is that status dependent
+on the crawl context? Eg, for save-paper-now we might want to require GROBID.
+Semantics of `hit` should probably be consistent: whether we got the file type
+expected for the ingest type, not whether we would actually import to fatcat.
+
+Where to include knowledge about, eg, single-page abstract PDFs being bogus? Do
+we just block crawling, set an ingest result status, or only filter at fatcat
+import time? Definitely need to filter at fatcat import time to make sure
+things don't slip through elsewhere.
+
### Yet Another PDF Harvester
This system could result in "yet another" set of publisher-specific heuristics