html: update proposal (docs)

author: Bryan Newbold <bnewbold@archive.org> 2020-11-06 18:25:55 -0800
committer: Bryan Newbold <bnewbold@archive.org> 2020-11-06 18:25:55 -0800
commit: 47ca1a273912c8836630b0930b71a4e66fd2c85b (patch)
tree: 1c08f0a42fe16b3401d34a6a63c4c19be8aead30
parent: b86a6fd5bb74f9f11e682b9a98f02b5dba8c4cc1 (diff)
download: sandcrawler-47ca1a273912c8836630b0930b71a4e66fd2c85b.tar.gz sandcrawler-47ca1a273912c8836630b0930b71a4e66fd2c85b.zip
1 files changed, 49 insertions, 19 deletions
diff --git a/proposals/20201026_html_ingest.md b/proposals/20201026_html_ingest.md
index 90bc6e5..c06f180 100644
--- a/proposals/20201026_html_ingest.md
+++ b/proposals/20201026_html_ingest.md
@@ -22,6 +22,7 @@ Example HTML articles to start testing:
 - first mondays (OJS): <https://firstmonday.org/ojs/index.php/fm/article/view/10274/9729>
 - d-lib: <http://www.dlib.org/dlib/july17/williams/07williams.html> 
 
+
 ## Ingest Process
 
 Follow base URL to terminal document, which is assumed to be a status=200 HTML document.
@@ -32,44 +33,65 @@ Extract list of sub-resources. Filter out unwanted (eg favicon, analytics,
 unnecessary), apply a sanity limit. Convert to fully qualified URLs. For each
 sub-resource, fetch down to the terminal resource, and compute hashes/metadata.
 
-TODO:
+Open questions:
+
 - will probably want to parallelize sub-resource fetching. async?
 - behavior when failure fetching sub-resources
 
 
 ## Ingest Result Schema
 
-JSON should
-
-The minimum that could be persisted for later table lookup are:
-
-- (url, datetime): CDX table 
-- sha1hex: `file_meta` table
-
-Probably makes most sense to have all this end up in a large JSON object though.
+JSON should be basically compatible with existing `ingest_file_result` objects,
+with some new sub-objects.
+
+Overall object (`IngestWebResult`):
+
+- `status`: str
+- `hit`: bool
+- `error_message`: optional, if an error
+- `hops`: optional, array of URLs
+- `cdx`: optional; single CDX row of primary HTML document
+- `terminal`: optional; same as ingest result
+    - `terminal_url`
+    - `terminal_dt`
+    - `terminal_status_code`
+    - `terminal_sha1hex`
+- `request`: optional but usually present; ingest request object, verbatim
+- `file_meta`: optional; file metadata about primary HTML document
+- `html_biblio`: optional; extracted biblio metadata from primary HTML document
+- `scope`: optional; detected/guessed scope (fulltext, etc)
+- `html_resources`: optional; array of sub-resources. primary HTML is not included
+- `html_body`: optional; just the status code and some metadata is passed through;
+  actual document would go through a different KafkaTopic
+    - `status`: str
+    - `agent`: str, eg "trafilatura/0.4"
+    - `tei_xml`: optional, str
+    - `word_count`: optional, str
 
 
 ## New SQL Tables
 
 `html_meta`
-    surt,
-    timestamp (str?)
-    primary key: (surt, timestamp)
-    sha1hex (indexed)
-    updated
+    sha1hex (primary key)
+    updated (of SQL row)
     status
+    scope
     has_teixml
+    has_thumbnail
+    word_count (from teixml fulltext)
     biblio (JSON)
     resources (JSON)
 
 Also writes to `ingest_file_result`, `file_meta`, and `cdx`, all only for the base HTML document.
 
+
 ## Fatcat API Wants
 
 Would be nice to have lookup by SURT+timestamp, and/or by sha1hex of terminal base file.
 
 `hide` option for cdx rows; also for fileset equivalent.
 
+
 ## New Workers
 
 Could reuse existing worker, have code branch depending on type of ingest.
@@ -78,7 +100,7 @@ ingest file worker
   => same as existing worker, because could be calling SPN
 
 persist result
-  => same as existing worker
+  => same as existing worker; adds persisting various HTML metadata
 
 persist html text
   => talks to seaweedfs
@@ -89,9 +111,17 @@ persist html text
 HTML ingest result topic (webcapture-ish)
 
 sandcrawler-ENV.html-teixml
-    JSON
-    same as other fulltext topics
+    JSON wrapping TEI-XML (same as other fulltext topics)
+    key compaction and content compression enabled
+
+JSON schema:
+
+- `key` and `sha1hex`: str; used as kafka key
+- `status`: str
+- `tei_xml`: str, optional
+- `word_count`: int, optional
+
+## New S3/SeaweedFS Content
 
-## TODO
+`sandcrawler` bucket, `html` folder, `.tei.xml` suffix.
 
-- refactor ingest worker to be more general
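
The `sandcrawler-ENV.html-teixml` message schema added in the last hunk can be sketched as a small Python dataclass. This is only an illustration under stated assumptions, not code from the sandcrawler repository: the class name `HtmlTeiXmlMessage` and the method `to_kafka` are hypothetical, and omitting unset optional fields from the JSON body is a choice made here, not something the proposal specifies. Note that the proposal lists `word_count` as `str` under `html_body` but as `int` in the Kafka schema; the sketch assumes `int`.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass
class HtmlTeiXmlMessage:
    """Hypothetical container for one html-teixml message (illustrative only)."""

    sha1hex: str                       # also used as the Kafka key
    status: str
    tei_xml: Optional[str] = None      # optional per the proposal
    word_count: Optional[int] = None   # optional per the proposal; assumed int

    def to_kafka(self) -> Tuple[bytes, bytes]:
        """Return (key, value) as bytes, ready to hand to a Kafka producer."""
        # The proposal lists both `key` and `sha1hex`; both carry the same value.
        body = {"key": self.sha1hex, **asdict(self)}
        # Assumption: optional fields that are unset are left out of the JSON.
        body = {k: v for k, v in body.items() if v is not None}
        value = json.dumps(body, sort_keys=True).encode("utf-8")
        return self.sha1hex.encode("utf-8"), value
```

Keying each message by `sha1hex` is what makes the topic's key compaction useful: a later message for the same document replaces the earlier one once compaction runs.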