From 40e2e20378fb06e43cc93f67427f865a0de0a692 Mon Sep 17 00:00:00 2001 From: Bryan Newbold Date: Tue, 3 Nov 2020 11:29:22 -0800 Subject: commit WIP HTML ingest proposal --- proposals/20201026_html_ingest.md | 97 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 97 insertions(+) create mode 100644 proposals/20201026_html_ingest.md diff --git a/proposals/20201026_html_ingest.md b/proposals/20201026_html_ingest.md new file mode 100644 index 0000000..90bc6e5 --- /dev/null +++ b/proposals/20201026_html_ingest.md @@ -0,0 +1,97 @@ + +status: wip + +HTML Ingest Pipeline +======================== + +Basic goal: given an ingest request of type 'html', output an object (JSON) +which could be imported into fatcat. + +Should work with things like (scholarly) blog posts, micropubs, registrations, +protocols. Doesn't need to work with everything to start. "Platform" sites +(like youtube, figshare, etc) will probably be a different ingest worker. + +A current unknown is what the expected size of this metadata is. Both in number +of documents and amount of metadata per document. + +Example HTML articles to start testing: + +- complex distill article: +- old HTML journal: +- NIH pub: +- first mondays (OJS): +- d-lib: + +## Ingest Process + +Follow base URL to terminal document, which is assumed to be a status=200 HTML document. + +Verify that terminal document is fulltext. Extract both metadata and fulltext. + +Extract list of sub-resources. Filter out unwanted (eg favicon, analytics, +unnecessary), apply a sanity limit. Convert to fully qualified URLs. For each +sub-resource, fetch down to the terminal resource, and compute hashes/metadata. + +TODO: +- will probably want to parallelize sub-resource fetching. async? +- behavior when failure fetching sub-resources + + +## Ingest Result Schema + +JSON should + +The minimum that could be persisted for later table lookup are: + +- (url, datetime): CDX table +- sha1hex: `file_meta` table + +Probably makes most sense to have all this end up in a large JSON object though. + + +## New SQL Tables + +`html_meta` + surt, + timestamp (str?) + primary key: (surt, timestamp) + sha1hex (indexed) + updated + status + has_teixml + biblio (JSON) + resources (JSON) + +Also writes to `ingest_file_result`, `file_meta`, and `cdx`, all only for the base HTML document. + +## Fatcat API Wants + +Would be nice to have lookup by SURT+timestamp, and/or by sha1hex of terminal base file. + +`hide` option for cdx rows; also for fileset equivalent. + +## New Workers + +Could reuse existing worker, have code branch depending on type of ingest. + +ingest file worker + => same as existing worker, because could be calling SPN + +persist result + => same as existing worker + +persist html text + => talks to seaweedfs + + +## New Kafka Topics + +HTML ingest result topic (webcapture-ish) + +sandcrawler-ENV.html-teixml + JSON + same as other fulltext topics + +## TODO + +- refactor ingest worker to be more general -- cgit v1.2.3