Diffstat (limited to 'proposals')
 proposals/20201026_html_ingest.md | 68
 1 file changed, 49 insertions(+), 19 deletions(-)
diff --git a/proposals/20201026_html_ingest.md b/proposals/20201026_html_ingest.md
index 90bc6e5..c06f180 100644
--- a/proposals/20201026_html_ingest.md
+++ b/proposals/20201026_html_ingest.md
@@ -22,6 +22,7 @@ Example HTML articles to start testing:
 - first mondays (OJS): <https://firstmonday.org/ojs/index.php/fm/article/view/10274/9729>
 - d-lib: <http://www.dlib.org/dlib/july17/williams/07williams.html>
 
+
 ## Ingest Process
 
 Follow base URL to terminal document, which is assumed to be a status=200 HTML document.
@@ -32,44 +33,65 @@ Extract list of sub-resources. Filter out unwanted (eg favicon, analytics,
 unnecessary), apply a sanity limit. Convert to fully qualified URLs. For each
 sub-resource, fetch down to the terminal resource, and compute hashes/metadata.
 
-TODO:
+Open questions:
+
 - will probably want to parallelize sub-resource fetching. async?
 - behavior when failure fetching sub-resources
 
 ## Ingest Result Schema
 
-JSON should
-
-The minimum that could be persisted for later table lookup are:
-
-- (url, datetime): CDX table
-- sha1hex: `file_meta` table
-
-Probably makes most sense to have all this end up in a large JSON object though.
+JSON should be basically compatible with existing `ingest_file_result` objects,
+with some new sub-objects.
+
+Overall object (`IngestWebResult`):
+
+- `status`: str
+- `hit`: bool
+- `error_message`: optional, if an error
+- `hops`: optional, array of URLs
+- `cdx`: optional; single CDX row of primary HTML document
+- `terminal`: optional; same as ingest result
+    - `terminal_url`
+    - `terminal_dt`
+    - `terminal_status_code`
+    - `terminal_sha1hex`
+- `request`: optional but usually present; ingest request object, verbatim
+- `file_meta`: optional; file metadata about primary HTML document
+- `html_biblio`: optional; extracted biblio metadata from primary HTML document
+- `scope`: optional; detected/guessed scope (fulltext, etc)
+- `html_resources`: optional; array of sub-resources. primary HTML is not included
+- `html_body`: optional; just the status code and some metadata are passed through;
+  actual document would go through a different Kafka topic
+    - `status`: str
+    - `agent`: str, eg "trafilatura/0.4"
+    - `tei_xml`: optional, str
+    - `word_count`: optional, int
 
 ## New SQL Tables
 
 `html_meta`
-    surt,
-    timestamp (str?)
-    primary key: (surt, timestamp)
-    sha1hex (indexed)
-    updated
+    sha1hex (primary key)
+    updated (of SQL row)
     status
+    scope
     has_teixml
+    has_thumbnail
+    word_count (from teixml fulltext)
     biblio (JSON)
     resources (JSON)
 
 Also writes to `ingest_file_result`, `file_meta`, and `cdx`, all only for the
 base HTML document.
 
+
 ## Fatcat API Wants
 
 Would be nice to have lookup by SURT+timestamp, and/or by sha1hex of terminal
 base file.
 
 `hide` option for cdx rows; also for fileset equivalent.
 
+
 ## New Workers
 
 Could reuse existing worker, have code branch depending on type of ingest.
@@ -78,7 +100,7 @@ ingest file worker
   => same as existing worker, because could be calling SPN
 
 persist result
-  => same as existing worker
+  => same as existing worker; adds persisting various HTML metadata
 
 persist html text
   => talks to seaweedfs
@@ -89,9 +111,17 @@ persist html text
 HTML ingest result topic (webcapture-ish)
 
 sandcrawler-ENV.html-teixml
-    JSON
-    same as other fulltext topics
+    JSON wrapping TEI-XML (same as other fulltext topics)
+    key compaction and content compression enabled
+
+JSON schema:
+
+- `key` and `sha1hex`: str; used as kafka key
+- `status`: str
+- `tei_xml`: str, optional
+- `word_count`: int, optional
+
+## New S3/SeaweedFS Content
 
-## TODO
+`sandcrawler` bucket, `html` folder, `.tei.xml` suffix.
 
-- refactor ingest worker to be more general
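
To make the "Ingest Process" hunk concrete, here is a minimal Python sketch of the sub-resource extraction step (follow base URL, extract, filter, qualify, limit). The tag/attribute pairs, blocklist substrings, and the limit of 200 are assumptions for illustration, not part of the proposal:

```python
import urllib.parse
from typing import List

from bs4 import BeautifulSoup

# assumed tag/attribute pairs that can reference sub-resources
SUBRESOURCE_TAGS = {"img": "src", "script": "src", "link": "href", "source": "src"}
BLOCKLIST_SUBSTRINGS = ["favicon", "analytics"]  # the "unwanted" filter from the proposal
MAX_SUBRESOURCES = 200  # assumed sanity limit

def extract_subresource_urls(html: str, base_url: str) -> List[str]:
    """Extract fully-qualified sub-resource URLs from a terminal HTML document."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for tag, attr in SUBRESOURCE_TAGS.items():
        for node in soup.find_all(tag):
            raw = node.get(attr)
            if not raw:
                continue
            url = urllib.parse.urljoin(base_url, raw)  # convert to fully qualified URL
            if any(bad in url for bad in BLOCKLIST_SUBSTRINGS):
                continue
            urls.append(url)
    return urls[:MAX_SUBRESOURCES]
```

Each returned URL would then be fetched down to its terminal resource and hashed; the parallelization question from the diff (async vs. thread pool) is left open.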
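The `IngestWebResult` schema added in the diff, transcribed as a hypothetical Python dataclass. Field names come from the proposal; the concrete types of the dict-valued sub-objects are assumptions:

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

@dataclass
class HtmlBody:
    # the html_body sub-object; the full document goes to a separate Kafka topic
    status: str
    agent: str  # eg "trafilatura/0.4"
    tei_xml: Optional[str] = None
    word_count: Optional[int] = None

@dataclass
class IngestWebResult:
    status: str
    hit: bool
    error_message: Optional[str] = None
    hops: Optional[List[str]] = None
    cdx: Optional[Dict[str, Any]] = None        # single CDX row of primary HTML document
    terminal: Optional[Dict[str, Any]] = None   # terminal_url, terminal_dt,
                                                # terminal_status_code, terminal_sha1hex
    request: Optional[Dict[str, Any]] = None    # ingest request object, verbatim
    file_meta: Optional[Dict[str, Any]] = None
    html_biblio: Optional[Dict[str, Any]] = None
    scope: Optional[str] = None                 # detected/guessed scope (fulltext, etc)
    html_resources: Optional[List[Dict[str, Any]]] = None
    html_body: Optional[HtmlBody] = None
```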
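The new `html_meta` table, expanded into hypothetical DDL. Column names and the primary key follow the diff; the concrete column types and constraints are assumptions modeled on typical Postgres schemas:

```python
# Hypothetical DDL for the html_meta table sketched in the diff; types and
# constraints are guesses, not the committed schema.
HTML_META_DDL = """
CREATE TABLE IF NOT EXISTS html_meta (
    sha1hex         TEXT PRIMARY KEY,
    updated         TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL,
    status          TEXT NOT NULL,
    scope           TEXT,
    has_teixml      BOOLEAN NOT NULL,
    has_thumbnail   BOOLEAN NOT NULL,
    word_count      INT CHECK (word_count >= 0),
    biblio          JSONB,
    resources       JSONB
);
"""
```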
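A hypothetical message on the `sandcrawler-ENV.html-teixml` topic, following the JSON schema from the diff (the sha1 hash and content here are made up). Keying messages by `sha1hex` is what makes broker-side log compaction keep only the latest extraction per document:

```python
import json

sha1hex = "1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b"  # made-up example hash
msg = {
    "key": sha1hex,       # also used as the kafka message key, so topic
    "sha1hex": sha1hex,   # compaction retains only the latest record per document
    "status": "success",
    "tei_xml": '<TEI xmlns="http://www.tei-c.org/ns/1.0">...</TEI>',
    "word_count": 1234,
}
record_key = msg["key"].encode("utf-8")
record_value = json.dumps(msg).encode("utf-8")  # compression handled by producer config
```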
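Finally, a sketch of the SeaweedFS object path implied by "`sandcrawler` bucket, `html` folder, `.tei.xml` suffix". The two-level sha1 prefix fan-out is an assumption, mirroring common blob-store layouts, and is not specified in the diff:

```python
def html_teixml_blob_path(sha1hex: str) -> str:
    # "sandcrawler" bucket, "html" folder, ".tei.xml" suffix; the
    # sha1hex[0:2]/sha1hex[2:4] fan-out is an assumed layout
    return "html/{}/{}/{}.tei.xml".format(sha1hex[0:2], sha1hex[2:4], sha1hex)
```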
