status: wip

HTML Ingest Pipeline
========================

Basic goal: given an ingest request of type 'html', output an object (JSON)
which could be imported into fatcat.

Should work with things like (scholarly) blog posts, micropubs, registrations,
and protocols. It doesn't need to work with everything to start. "Platform"
sites (like youtube, figshare, etc) will probably be handled by a different
ingest worker.

A current unknown is the expected size of this metadata, both in number of
documents and in amount of metadata per document.

Example HTML articles to start testing:

- complex distill article:
- old HTML journal:
- NIH pub:
- first mondays (OJS):
- d-lib:

## Ingest Process

Follow the base URL to the terminal document, which is assumed to be a
status=200 HTML document.

Verify that the terminal document is fulltext. Extract both metadata and
fulltext.

Extract the list of sub-resources. Filter out unwanted ones (eg, favicons,
analytics, anything unnecessary) and apply a sanity limit on count. Convert to
fully qualified URLs. For each sub-resource, fetch down to the terminal
resource and compute hashes/metadata.

TODO:

- will probably want to parallelize sub-resource fetching. async?
- decide behavior when fetching a sub-resource fails

## Ingest Result Schema

Output should be a JSON object. The minimum that could be persisted for later
table lookup is:

- (url, datetime): CDX table
- sha1hex: `file_meta` table

It probably makes the most sense to have all of this end up in one large JSON
object, though (a rough example is sketched at the end of these notes).

## New SQL Tables

`html_meta`
    surt
    timestamp (str?)
    primary key: (surt, timestamp)
    sha1hex (indexed)
    updated
    status
    has_teixml
    biblio (JSON)
    resources (JSON)

(a SQL sketch of this table is included at the end of these notes)

Also writes to `ingest_file_result`, `file_meta`, and `cdx`, all only for the
base HTML document.

## Fatcat API Wants

Would be nice to have lookup by SURT+timestamp, and/or by sha1hex of the
terminal base file.

A `hide` option for cdx rows; also for the fileset equivalent.

## New Workers

Could reuse the existing worker, with a code branch depending on the type of
ingest.

ingest file worker
    => same as existing worker, because it could be calling SPN

persist result
    => same as existing worker

persist html text
    => talks to seaweedfs

## New Kafka Topics

HTML ingest result topic (webcapture-ish)

sandcrawler-ENV.html-teixml
    JSON
    same as other fulltext topics

## TODO

- refactor the ingest worker to be more general
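
## Appendix: Schema Sketches

To make the "large JSON object" idea in the Ingest Result Schema section
concrete, here is a minimal sketch of what a single HTML ingest result might
look like. Only the top-level fields named above (surt, timestamp/datetime,
sha1hex, status, has_teixml, biblio, resources) come from these notes; the
nested field names and all values are made-up placeholders, not a settled
schema.

```json
{
  "surt": "org,example)/2020/some-article",
  "timestamp": "20201007123456",
  "sha1hex": "0000000000000000000000000000000000000000",
  "status": "success",
  "has_teixml": true,
  "biblio": {"title": "Example Article Title", "doi": null},
  "resources": [
    {
      "url": "https://example.org/static/figure1.png",
      "sha1hex": "1111111111111111111111111111111111111111",
      "mimetype": "image/png",
      "status_code": 200
    }
  ]
}
```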
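
And a minimal SQL sketch of the `html_meta` table described in the New SQL
Tables section, assuming a PostgreSQL-style database. The column names and the
(surt, timestamp) primary key come from the notes above; the concrete types
(including leaving `timestamp` as text, per the "str?" question) are
assumptions and would need review before any real migration.

```sql
CREATE TABLE IF NOT EXISTS html_meta (
    surt        TEXT NOT NULL,
    timestamp   TEXT NOT NULL,   -- notes say "str?"; could become TIMESTAMPTZ
    sha1hex     TEXT NOT NULL,
    updated     TIMESTAMPTZ DEFAULT now() NOT NULL,
    status      TEXT NOT NULL,
    has_teixml  BOOLEAN NOT NULL,
    biblio      JSONB,
    resources   JSONB,
    PRIMARY KEY (surt, timestamp)
);

-- sha1hex is called out as "indexed" in the notes above
CREATE INDEX IF NOT EXISTS html_meta_sha1hex_idx ON html_meta (sha1hex);
```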