-rw-r--r-- proposals/20201026_html_ingest.md 68
1 files changed, 49 insertions, 19 deletions
diff --git a/proposals/20201026_html_ingest.md b/proposals/20201026_html_ingest.md
index 90bc6e5..c06f180 100644
--- a/proposals/20201026_html_ingest.md
+++ b/proposals/20201026_html_ingest.md
@@ -22,6 +22,7 @@ Example HTML articles to start testing:
- first mondays (OJS): <https://firstmonday.org/ojs/index.php/fm/article/view/10274/9729>
- d-lib: <http://www.dlib.org/dlib/july17/williams/07williams.html>
+
## Ingest Process
Follow base URL to terminal document, which is assumed to be a status=200 HTML document.
@@ -32,44 +33,65 @@ Extract list of sub-resources. Filter out unwanted (eg favicon, analytics,
unnecessary), apply a sanity limit. Convert to fully qualified URLs. For each
sub-resource, fetch down to the terminal resource, and compute hashes/metadata.
-TODO:
+Open questions:
+
- will probably want to parallelize sub-resource fetching. async?
- behavior when failure fetching sub-resources
## Ingest Result Schema
-JSON should
-
-The minimum that could be persisted for later table lookup are:
-
-- (url, datetime): CDX table
-- sha1hex: `file_meta` table
-
-Probably makes most sense to have all this end up in a large JSON object though.
+JSON should be basically compatible with existing `ingest_file_result` objects,
+with some new sub-objects.
+
+Overall object (`IngestWebResult`):
+
+- `status`: str
+- `hit`: bool
+- `error_message`: optional, if an error
+- `hops`: optional, array of URLs
+- `cdx`: optional; single CDX row of primary HTML document
+- `terminal`: optional; same as ingest result
+ - `terminal_url`
+ - `terminal_dt`
+ - `terminal_status_code`
+ - `terminal_sha1hex`
+- `request`: optional but usually present; ingest request object, verbatim
+- `file_meta`: optional; file metadata about primary HTML document
+- `html_biblio`: optional; extracted biblio metadata from primary HTML document
+- `scope`: optional; detected/guessed scope (fulltext, etc)
+- `html_resources`: optional; array of sub-resources. primary HTML is not included
+- `html_body`: optional; just the status code and some metadata is passed through;
+ actual document would go through a different KafkaTopic
+ - `status`: str
+ - `agent`: str, eg "trafilatura/0.4"
+ - `tei_xml`: optional, str
+ - `word_count`: optional, str
## New SQL Tables
`html_meta`
- surt,
- timestamp (str?)
- primary key: (surt, timestamp)
- sha1hex (indexed)
- updated
+ sha1hex (primary key)
+ updated (of SQL row)
status
+ scope
has_teixml
+ has_thumbnail
+ word_count (from teixml fulltext)
biblio (JSON)
resources (JSON)
Also writes to `ingest_file_result`, `file_meta`, and `cdx`, all only for the base HTML document.
+
## Fatcat API Wants
Would be nice to have lookup by SURT+timestamp, and/or by sha1hex of terminal base file.
`hide` option for cdx rows; also for fileset equivalent.
+
## New Workers
Could reuse existing worker, have code branch depending on type of ingest.
@@ -78,7 +100,7 @@ ingest file worker
=> same as existing worker, because could be calling SPN
persist result
- => same as existing worker
+ => same as existing worker; adds persisting various HTML metadata
persist html text
=> talks to seaweedfs
@@ -89,9 +111,17 @@ persist html text
HTML ingest result topic (webcapture-ish)
sandcrawler-ENV.html-teixml
- JSON
- same as other fulltext topics
+ JSON wrapping TEI-XML (same as other fulltext topics)
+ key compaction and content compression enabled
+
+JSON schema:
+
+- `key` and `sha1hex`: str; used as kafka key
+- `status`: str
+- `tei_xml`: str, optional
+- `word_count`: int, optional
+
+## New S3/SeaweedFS Content
-## TODO
+`sandcrawler` bucket, `html` folder, `.tei.xml` suffix.
-- refactor ingest worker to be more general
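
The `IngestWebResult` object and the `html-teixml` message added in this diff can be sketched together in Python. This is a minimal illustration and not the sandcrawler implementation: the class and function names (`HtmlBody`, `to_result_json`, `to_teixml_message`) are hypothetical, `file_meta` is assumed to carry a `sha1hex` key, and `word_count` is typed as `int` to match the Kafka schema (the result schema lists it as `str`). Dropping `tei_xml` from the result JSON is one reading of "actual document would go through a different KafkaTopic".

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Optional


@dataclass
class HtmlBody:
    """`html_body` sub-object: extraction status plus some metadata."""
    status: str
    agent: Optional[str] = None       # eg "trafilatura/0.4"
    tei_xml: Optional[str] = None
    word_count: Optional[int] = None  # assumption: int, as in the Kafka schema


@dataclass
class IngestWebResult:
    """Field names follow the proposal; sub-objects are left as plain dicts."""
    status: str
    hit: bool
    error_message: Optional[str] = None
    hops: Optional[List[str]] = None
    cdx: Optional[Dict[str, Any]] = None
    terminal: Optional[Dict[str, Any]] = None
    request: Optional[Dict[str, Any]] = None
    file_meta: Optional[Dict[str, Any]] = None
    html_biblio: Optional[Dict[str, Any]] = None
    scope: Optional[str] = None
    html_resources: Optional[List[Dict[str, Any]]] = None
    html_body: Optional[HtmlBody] = None


def drop_none(value: Any) -> Any:
    """Recursively remove dict entries whose value is None."""
    if isinstance(value, dict):
        return {k: drop_none(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [drop_none(v) for v in value]
    return value


def to_result_json(result: IngestWebResult) -> Dict[str, Any]:
    """Ingest result as a dict, without unset fields or the TEI-XML payload."""
    obj = asdict(result)
    if obj.get("html_body"):
        obj["html_body"].pop("tei_xml", None)
    return drop_none(obj)


def to_teixml_message(result: IngestWebResult) -> Optional[Dict[str, Any]]:
    """Message for the html-teixml topic, or None if there is nothing to send."""
    if result.html_body is None or not result.file_meta:
        return None
    sha1hex = result.file_meta.get("sha1hex")  # assumed key name
    if not sha1hex:
        return None
    body = result.html_body
    return drop_none({
        "key": sha1hex,  # same value as sha1hex; used as the kafka key
        "sha1hex": sha1hex,
        "status": body.status,
        "tei_xml": body.tei_xml,
        "word_count": body.word_count,
    })


if __name__ == "__main__":
    example = IngestWebResult(
        status="success",
        hit=True,
        file_meta={"sha1hex": "0" * 40},  # placeholder hash
        html_body=HtmlBody(status="success", agent="trafilatura/0.4",
                           tei_xml="<TEI/>", word_count=1),
    )
    print(json.dumps(to_result_json(example), indent=2))
    print(json.dumps(to_teixml_message(example), indent=2))
```

Keying the fulltext message by `sha1hex` fits the "key compaction" note above: with compaction enabled, a re-ingest of the same document would replace the earlier message rather than accumulate alongside it.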