author | Bryan Newbold <bnewbold@archive.org> | 2019-11-13 16:44:04 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2019-11-13 16:44:08 -0800 |
commit | cb126507009eb6691269fe4869f88b16d9c57e1b (patch) | |
tree | 90bca5a58f7dccd9bc065098c7e172063eb40fc0 /proposals | |
parent | d98577f9016466622593bedf2740ac28c3a2d606 (diff) | |
add structure of ingest proposal
Still needs some details fleshed out
Diffstat (limited to 'proposals')
-rw-r--r-- | proposals/2019_ingest.md | 129 |
1 file changed, 129 insertions, 0 deletions
diff --git a/proposals/2019_ingest.md b/proposals/2019_ingest.md
new file mode 100644
index 0000000..bfe16f4
--- /dev/null
+++ b/proposals/2019_ingest.md
@@ -0,0 +1,129 @@

status: work-in-progress

This document proposes structure and systems for ingesting (crawling) paper
PDFs and other content as part of sandcrawler.

## Overview

The main abstraction is a sandcrawler "ingest request" object, which can be
created and submitted to one of several systems for automatic harvesting,
resulting in an "ingest result" metadata object. This result should contain
enough metadata to be automatically imported into fatcat as a file/release
mapping.

The structure and pipelines should be flexible enough to work with individual
PDF files, web captures, and datasets. It should work for on-demand
(interactive) ingest (for "save paper now" features), soft-real-time
(hourly/daily/queued) ingest, and batches of hundreds or thousands of
requests, and should scale up to batch ingest crawls of tens of millions of
URLs. Most code should not care about how or when content is actually crawled.

The motivation for this structure is to consolidate and automate the current
ad hoc systems for crawling, matching, and importing into fatcat. It is likely
that there will still be a few special cases with their own importers, but the
goal is that in almost all cases where we discover a new structured source of
content to ingest (eg, a new manifest of identifiers to URLs), we can quickly
transform the task into a list of ingest requests, then submit those requests
to an automated system to have them archived and inserted into fatcat with as
little manual effort as possible.

## Request/Response Schema

For now, the plan is to have a single request type, and multiple similar but
separate result types depending on the ingest type (file, fileset,
webcapture). The initial use case is single-file PDF ingest.

NOTE: what about crawl requests where we don't know whether we will get a PDF
or HTML? Or both?

*IngestRequest*

- `ingest_type`: required, one of `file`, `fileset`, or `webcapture`
- `base_url`: required, where to start the crawl process
- `project`/`source`: recommended, slug string, to track where this ingest
  request is coming from
- `fatcat`
  - `release_stage`: optional
  - `release_ident`: optional
  - `work_ident`: optional
  - `edit_extra`: additional metadata to be included in any eventual fatcat
    commits; supplements project/source
- `ext_ids`
  - `doi`
  - `pmcid`
  - ...
- `expect_mimetypes`:
- `expect_hash`: optional, if we are expecting a specific file
  - `sha1`
  - ...

*FileIngestResult*

- request (object): the full IngestRequest, copied
- terminal
  - url
  - status_code
- wayback
  - datetime
  - archive_url
- file_meta (same schema as sandcrawler-db table)
  - size_bytes
  - md5
  - sha1
  - sha256
  - mimetype
- cdx (same schema as sandcrawler-db table)
- grobid (same schema as sandcrawler-db table)
  - version
  - status_code
  - xml_url
  - release_id
- status (slug): 'success', 'error', etc
- hit (boolean): whether we got something that looks like what was requested
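To make the field lists above concrete, here is an illustrative request/result
pair written as Python dicts. This is only a sketch: every value is a made-up
placeholder (identifiers, hashes, URLs, and version strings are not real), some
fields (eg, `cdx`) are omitted, and nothing here is normative beyond the field
names defined above.

```python
# Illustrative IngestRequest; all values are placeholders.
example_request = {
    "ingest_type": "file",
    "base_url": "https://example.com/article/paper.pdf",
    "project": "example-manifest-2019",
    "fatcat": {
        "release_stage": "published",
        "release_ident": "aaaaaaaaaaaaaaaaaaaaaaaaaa",
    },
    "ext_ids": {
        "doi": "10.1234/example-doi",
    },
    "expect_hash": {
        "sha1": "0000000000000000000000000000000000000000",
    },
}

# Illustrative FileIngestResult for a successful single-file PDF ingest.
example_result = {
    "request": example_request,
    "terminal": {
        "url": "https://example.com/article/paper.pdf",
        "status_code": 200,
    },
    "wayback": {
        "datetime": "20191113164404",
        "archive_url": "https://web.archive.org/web/20191113164404/"
                       "https://example.com/article/paper.pdf",
    },
    "file_meta": {
        "size_bytes": 123456,
        "md5": "placeholder",
        "sha1": "placeholder",
        "sha256": "placeholder",
        "mimetype": "application/pdf",
    },
    "grobid": {
        "version": "0.5.x-placeholder",
        "status_code": 200,
        "xml_url": "https://example.org/grobid/paper.tei.xml",
        "release_id": None,
    },
    "status": "success",
    "hit": True,
}
```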
## Result Schema

## New API Endpoints

## New Kafka Topics

- `sandcrawler-ENV.ingest-file-requests`
- `sandcrawler-ENV.ingest-file-results`

## New Fatcat Features

## Design Issues

### Yet Another PDF Harvester

This system could result in "yet another" set of publisher-specific heuristics
and hacks to crawl publicly available papers. Related existing work includes
[unpaywall's crawler][unpaywall_crawl], LOCKSS extraction code, dissem.in's
efforts, zotero's bibliography extractor, etc. The
"[memento tracer][memento_tracer]" work is also similar. Many of these are
even in python! It would be great to reduce duplicated work and maintenance.
An analogous system in the wild is youtube-dl for downloading video from many
sources.

[unpaywall_crawl]: https://github.com/ourresearch/oadoi/blob/master/webpage.py
[memento_tracer]: http://tracer.mementoweb.org/

One argument against this would be that our use case is closely tied to
save-page-now, wayback, and the CDX API. However, a properly modular
implementation of a paper downloader would allow components to be re-used, and
perhaps dependency injection for things like HTTP fetches to allow use of SPN
or similar. Another argument for modularity would be support for headless
crawling (eg, brozzler).

Note that this is an internal implementation detail; the ingest API would
abstract all this.

## Test Examples

Some example works that are difficult to crawl. We should have crawl
mechanisms and unit tests for all of these.

- <https://pubs.acs.org>
- <https://linkinghub.elsevier.com> / <https://sciencedirect.com>
- <https://www.osapublishing.org/captcha/?guid=39B0E947-C0FC-B5D8-2C0C-CCF004FF16B8>
- <https://utpjournals.press/action/cookieAbsent>
- <https://academic.oup.com/jes/article/3/Supplement_1/SUN-203/5484104>
- <http://www.jcancer.org/v10p4038.htm>
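As a rough end-to-end sketch, the snippet below shows how one of the difficult
test cases above might be submitted as an ingest request to the
`sandcrawler-ENV.ingest-file-requests` topic described earlier. This is
illustrative only: it assumes the `kafka-python` client library,
JSON-serialized message bodies, and a placeholder broker address, none of
which are fixed by this proposal.

```python
import json

from kafka import KafkaProducer  # assumes the kafka-python client library

# Placeholder broker address; real cluster configuration is out of scope here.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

# Minimal IngestRequest for one of the test example URLs listed above.
request = {
    "ingest_type": "file",
    "base_url": "https://academic.oup.com/jes/article/3/Supplement_1/SUN-203/5484104",
    "project": "test-examples",
}

# "ENV" stands in for the deployment environment (eg, qa or prod).
producer.send("sandcrawler-ENV.ingest-file-requests", value=request)
producer.flush()
```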