author | Bryan Newbold <bnewbold@archive.org> | 2020-11-06 18:32:35 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2020-11-06 18:32:35 -0800 |
commit | 175019c96fced3e21d0f60ea1a4a37da6b8872ac (patch) | |
tree | f42fbbe9c8ac06ae9eb06373ab9eec96d2b3a177 /proposals | |
parent | b0b66c20c6ffb9d8acc626068964d7dfd5d3bcdc (diff) | |
parent | 47ca1a273912c8836630b0930b71a4e66fd2c85b (diff) | |
Merge branch 'bnewbold-html-ingest'
Diffstat (limited to 'proposals')
-rw-r--r-- | proposals/20201026_html_ingest.md | 127
-rw-r--r-- | proposals/20201103_xml_ingest.md | 81
2 files changed, 208 insertions, 0 deletions
diff --git a/proposals/20201026_html_ingest.md b/proposals/20201026_html_ingest.md
new file mode 100644
index 0000000..c06f180
--- /dev/null
+++ b/proposals/20201026_html_ingest.md
@@ -0,0 +1,127 @@

status: wip

HTML Ingest Pipeline
========================

Basic goal: given an ingest request of type 'html', output an object (JSON)
which could be imported into fatcat.

Should work with things like (scholarly) blog posts, micropubs, registrations,
protocols. Doesn't need to work with everything to start. "Platform" sites
(like youtube, figshare, etc) will probably be a different ingest worker.

A current unknown is the expected size of this metadata, both in number of
documents and in amount of metadata per document.

Example HTML articles to start testing:

- complex distill article: <https://distill.pub/2020/bayesian-optimization/>
- old HTML journal: <http://web.archive.org/web/20081120141926fw_/http://www.mundanebehavior.org/issues/v5n1/rosen.htm>
- NIH pub: <https://www.nlm.nih.gov/pubs/techbull/ja02/ja02_locatorplus_merge.html>
- first mondays (OJS): <https://firstmonday.org/ojs/index.php/fm/article/view/10274/9729>
- d-lib: <http://www.dlib.org/dlib/july17/williams/07williams.html>


## Ingest Process

Follow base URL to terminal document, which is assumed to be a status=200 HTML
document.

Verify that terminal document is fulltext. Extract both metadata and fulltext.

Extract list of sub-resources. Filter out unwanted resources (eg favicon,
analytics, other unnecessary files), apply a sanity limit. Convert to fully
qualified URLs. For each sub-resource, fetch down to the terminal resource, and
compute hashes/metadata.

Open questions:

- will probably want to parallelize sub-resource fetching. async?
- behavior when fetching a sub-resource fails


## Ingest Result Schema

JSON should be basically compatible with existing `ingest_file_result` objects,
with some new sub-objects.

Overall object (`IngestWebResult`):

- `status`: str
- `hit`: bool
- `error_message`: optional, if an error
- `hops`: optional, array of URLs
- `cdx`: optional; single CDX row of primary HTML document
- `terminal`: optional; same as ingest result
    - `terminal_url`
    - `terminal_dt`
    - `terminal_status_code`
    - `terminal_sha1hex`
- `request`: optional but usually present; ingest request object, verbatim
- `file_meta`: optional; file metadata about primary HTML document
- `html_biblio`: optional; extracted biblio metadata from primary HTML document
- `scope`: optional; detected/guessed scope (fulltext, etc)
- `html_resources`: optional; array of sub-resources. primary HTML is not included
- `html_body`: optional; just the status code and some metadata is passed through;
  the actual document would go through a different Kafka topic
    - `status`: str
    - `agent`: str, eg "trafilatura/0.4"
    - `tei_xml`: optional, str
    - `word_count`: optional, str


## New SQL Tables

`html_meta`

    sha1hex (primary key)
    updated (of SQL row)
    status
    scope
    has_teixml
    has_thumbnail
    word_count (from teixml fulltext)
    biblio (JSON)
    resources (JSON)

Also writes to `ingest_file_result`, `file_meta`, and `cdx`, all only for the
base HTML document.


## Fatcat API Wants

Would be nice to have lookup by SURT+timestamp, and/or by sha1hex of terminal
base file.

`hide` option for cdx rows; also for fileset equivalent.


## New Workers

Could reuse the existing worker, with a code branch depending on the type of
ingest (a rough sketch follows the worker list below).

ingest file worker
    => same as existing worker, because could be calling SPN

persist result
    => same as existing worker; adds persisting various HTML metadata

persist html text
    => talks to seaweedfs
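
A minimal sketch of what that type branch could look like, assuming one shared
worker class; the class name, method names, and status string here are
hypothetical, not the existing sandcrawler code:

```python
# Hypothetical sketch of type-based branching in a shared ingest worker.
# fetch_terminal()/process_*() are assumed helpers, not real sandcrawler methods.

class IngestWorker:

    def process(self, request: dict) -> dict:
        ingest_type = request.get("ingest_type", "pdf")
        # follow the base URL (possibly via SPN) down to the terminal resource
        resource = self.fetch_terminal(request["base_url"])
        if ingest_type == "pdf":
            return self.process_pdf(request, resource)
        if ingest_type == "html":
            # extract biblio metadata, TEI-XML body, and sub-resources
            return self.process_html(request, resource)
        if ingest_type == "xml":
            # single XML document; re-encoded to UTF-8 downstream
            return self.process_xml(request, resource)
        return {"status": "wrong-ingest-type", "hit": False, "request": request}
```

The per-type methods would share the fetch/SPN and CDX/file_meta plumbing and
differ only in extraction and in the shape of the result object.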

## New Kafka Topics

HTML ingest result topic (webcapture-ish)

sandcrawler-ENV.html-teixml
    JSON wrapping TEI-XML (same as other fulltext topics)
    key compaction and content compression enabled

JSON schema:

- `key` and `sha1hex`: str; used as kafka key
- `status`: str
- `tei_xml`: str, optional
- `word_count`: int, optional

## New S3/SeaweedFS Content

`sandcrawler` bucket, `html` folder, `.tei.xml` suffix.


diff --git a/proposals/20201103_xml_ingest.md b/proposals/20201103_xml_ingest.md
new file mode 100644
index 0000000..181cc11
--- /dev/null
+++ b/proposals/20201103_xml_ingest.md
@@ -0,0 +1,81 @@

status: wip

TODO:
x XML fulltext URL extractor (based on HTML biblio metadata, not PDF url extractor)
x differentiate JATS XML and scielo XML from generic XML?
    application/xml+jats is what fatcat is doing for abstracts
    but it should be application/jats+xml?
    application/tei+xml
    if startswith "<article " and "<article-meta>" => JATS
x refactor ingest worker to be more general
x have ingest code publish body to kafka topic
x write a persist worker
/ create/configure kafka topic
- test everything locally
- fatcat: ingest tool to create requests
- fatcat: entity updates worker creates XML ingest requests for specific sources
- fatcat: ingest file import worker allows XML results
- ansible: deployment of persist worker

XML Fulltext Ingest
====================

This document details changes to include XML fulltext ingest in the same way
that we currently ingest PDF fulltext.

Currently this will just fetch the single XML document, which is often lacking
figures, tables, and other required files.

## Text Encoding

Because we would like to treat XML as a string in a couple of contexts, but XML
can have multiple encodings (indicated in an XML header), we are in a bit of a
bind. Simply parsing into unicode and then re-encoding as UTF-8 could result in
a header/content mismatch. Any form of re-encoding will change the hash of the
document. For recording in fatcat, the file metadata will be passed through.
For storing in Kafka and blob store (for downstream analysis), we will parse
the raw XML document (as "bytes") with an XML parser, then re-output with UTF-8
encoding. The hash of the *original* XML file will be used as the key for
referring to this document. This is unintuitive, but similar to what we are
doing with PDF and HTML documents (extracting in a useful format, but keeping
the original document's hash as a key).

Unclear if we need to do this re-encode process for XML documents already in
UTF-8 encoding.
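
A short sketch of that parse-and-re-serialize step, assuming `lxml` is the XML
parser; the returned field names are illustrative, not a fixed schema. The sha1
of the original bytes becomes the key, and the re-encoded UTF-8 body is what
gets shipped to Kafka and the blob store:

```python
import hashlib

from lxml import etree


def reencode_xml(raw: bytes) -> dict:
    # key is the hash of the *original* file, not of the re-encoded output
    sha1hex = hashlib.sha1(raw).hexdigest()
    # lxml honors the encoding declared in the XML header when given bytes
    root = etree.fromstring(raw)
    utf8_body = etree.tostring(root, encoding="utf-8", xml_declaration=True)
    return {
        "key": sha1hex,
        "sha1hex": sha1hex,
        "status": "success",
        # field name is illustrative; this is the UTF-8 re-encoded document
        "jats_xml": utf8_body.decode("utf-8"),
    }
```

If the input is already valid UTF-8 this round-trip is mostly a no-op apart
from serializer normalization, which is why it is unclear whether it is worth
skipping in that case.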

## Ingest Worker

Could either re-use the HTML metadata extractor to fetch XML fulltext links, or
fork that code off to a separate method, like the PDF fulltext URL extractor.

Hopefully can re-use almost all of the PDF pipeline code, by making that ingest
worker class more generic and subclassing it.

Result objects are treated the same as PDF ingest results: the result object
has context about status, and if successful, file metadata and the CDX row of
the terminal object.

TODO: should it be assumed that XML fulltext will end up in the S3 bucket? or
should there be an `xml_meta` SQL table tracking this, like we have for PDFs
and HTML?

TODO: should we detect and specify the XML schema better? Eg, indicate if JATS.


## Persist Pipeline

### Kafka Topic

sandcrawler-ENV.xml-doc
    similar to other fulltext topics; JSON wrapping the XML
    key compaction, content compression

### S3/SeaweedFS

`sandcrawler` bucket, `xml` folder. Extension could depend on sub-type of XML?

### Persist Worker

New S3-only worker that pulls from the kafka topic and pushes to S3. Works
basically the same as PDF persist in S3-only mode, or like the pdf-text worker.
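
A rough sketch of that S3-only persist worker. The topic name and bucket/folder
come from this proposal, but the record fields, key prefix/extension scheme, and
client libraries (`confluent_kafka`, `boto3`) are assumptions, not the existing
sandcrawler implementation:

```python
import json

import boto3
from confluent_kafka import Consumer


def run_xml_persist_worker(kafka_hosts: str, s3_endpoint: str, env: str) -> None:
    """S3-only persist: consume XML fulltext records from Kafka and push them
    to SeaweedFS via its S3 API. Record shape (sha1hex key + XML body) is
    assumed to match whatever the re-encode step emits."""
    s3 = boto3.client("s3", endpoint_url=s3_endpoint)
    consumer = Consumer({
        "bootstrap.servers": kafka_hosts,
        "group.id": "persist-xml-doc",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe([f"sandcrawler-{env}.xml-doc"])

    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        record = json.loads(msg.value())
        sha1hex = record["sha1hex"]
        # `sandcrawler` bucket, `xml` folder (from this proposal); the hashed
        # prefix and `.jats.xml` extension are guesses, pending the sub-type question
        key = f"xml/{sha1hex[:2]}/{sha1hex[2:4]}/{sha1hex}.jats.xml"
        s3.put_object(
            Bucket="sandcrawler",
            Key=key,
            Body=record["jats_xml"].encode("utf-8"),
        )
```

In production this would also want batching, offset commits tied to successful
S3 writes, and error handling, the same concerns the existing PDF persist
workers already deal with.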