status: in-progress

Trawling for Unstructured Scholarly Web Content
===============================================

## Background and Motivation

A long-term goal for sandcrawler has been the ability to pick through
unstructured web archive content (or even non-web collections), identify
potential in-scope research outputs, extract metadata for those outputs, and
merge the content into a catalog (fatcat).

This process requires integration of many existing tools (HTML and PDF
extraction; fuzzy bibliographic metadata matching; machine learning to identify
in-scope content; etc), as well as high-level curation, targeting, and
evaluation by human operators. The goal is to augment and improve the
productivity of human operators as much as possible.

This process will be similar to "ingest", in which we start with a specific URL
and have some additional context about the expected result (eg, content type,
external identifier). Trawling differs in that we start with a collection or
context (instead of a single URL); have little or no context about the content
we are looking for; and may even be creating a new catalog entry, as opposed to
matching to a known existing entry.


## Architecture

The core operation is to take a resource and run a flowchart of processing
steps on it, resulting in an overall status and possibly some related metadata.
The common case is that the resource is a PDF or HTML file coming from wayback
(with contextual metadata about the capture), but we should be flexible enough
to support more content types in the future, and should try to support plain
files with no context as well.

Some relatively simple wrapper code handles fetching resources and summarizing
status/counts.

Outside of the scope of sandcrawler, new fatcat code (an importer or similar)
will be needed to handle trawl results. It will probably make sense to
pre-filter (with `jq` or `rg`) before passing results to fatcat.

At this stage, trawl workers will probably be run manually. Some successful
outputs (like GROBID, HTML metadata) would be written to existing kafka topics
to be persisted, but there would not be any specific `trawl` SQL tables or
automation.

It will probably be helpful to have some kind of wrapper script that can run
sandcrawler trawl processes, then filter and pipe the output into a fatcat
importer, all from a single invocation, while reporting results.

TODO:
- for HTML imports, do we fetch the full webcapture stuff and return that?


## Methods of Operation

### `cdx_file`

An existing CDX file is provided on-disk locally.

### `cdx_api`

Simplified variants: `cdx_domain`, `cdx_surt`

Uses the CDX API to download records matching the configured filters, then
processes the resulting file.

Saves the CDX file intermediate result somewhere locally (working or tmp
directory), with a timestamp in the path, to make re-trying with `cdx_file`
fast and easy (a rough sketch of this pattern appears below).

### `archiveorg_web_collection`

Uses `cdx_collection.py` (or similar) to fetch a full CDX list by iterating
over the items in the collection, then processes it.

Saves the CDX file intermediate result somewhere locally (working or tmp
directory), with a timestamp in the path, to make re-trying with `cdx_file`
fast and easy.
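For concreteness, here is a minimal sketch (not sandcrawler's actual
implementation) of the fetch-and-save pattern shared by `cdx_api` and
`archiveorg_web_collection`, using the public wayback CDX API. The function
name, filter format, and local path layout are illustrative assumptions:

    import datetime
    import os

    import requests

    CDX_API = "https://web.archive.org/cdx/search/cdx"


    def fetch_cdx_api(url_pattern, match_type="prefix", filters=None,
                      work_dir="/tmp/trawl"):
        """Download CDX records matching a URL pattern and optional filters,
        saving them to a timestamped local file so the run can be re-tried
        later with the `cdx_file` method."""
        params = {
            "url": url_pattern,
            "matchType": match_type,  # eg, 'prefix', 'domain', 'exact'
            "output": "text",
        }
        if filters:
            # CDX filters look like 'mimetype:application/pdf' or 'statuscode:200'
            params["filter"] = filters

        resp = requests.get(CDX_API, params=params, timeout=120)
        resp.raise_for_status()

        # save intermediate CDX with a timestamp in the path for easy re-runs
        os.makedirs(work_dir, exist_ok=True)
        timestamp = datetime.datetime.utcnow().strftime("%Y%m%d%H%M%S")
        cdx_path = os.path.join(work_dir, "trawl-{}.cdx".format(timestamp))
        with open(cdx_path, "w") as f:
            f.write(resp.text)
        return cdx_path

A `cdx_domain` variant would simply set `matchType=domain`; a real fetch would
also likely need CDX API pagination and some rate limiting.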
### Others

- `archiveorg_file_collection`: fetch file list via archive.org metadata, then
  process each file

## Schema

Per-resource results:

    hit (bool)
        indicates whether the resource seems in scope and was processed
        successfully (roughly, status 'success' and an in-scope content_scope)
    status (str)
        success: fetched resource, ran processing, passed filters
        skip-cdx: filtered before even fetching resource
        skip-resource: filtered after fetching resource
        wayback-error (etc): problem fetching
    content_scope (str)
        filtered-{filtertype}
        article (etc)
        landing-page
    resource_type (str)
        pdf, html
    file_meta{}
    cdx{}
    revisit_cdx{}

    # below are resource_type specific
    grobid
    pdf_meta
    pdf_trio
    html_biblio
    (other heuristics and ML)

High-level request:

    trawl_method: str
    cdx_file_path
    default_filters: bool
    resource_filters[]
        scope: str
            surt_prefix, domain, host, mimetype, size, datetime,
            resource_type, http_status
        value: any
        values[]: any
        min: any
        max: any
    biblio_context{}: set of expected/default values
        container_id
        release_type
        release_stage
        url_rel

High-level summary / results:

    status
    request{}: the entire request object
    counts
        total_resources
        status{}
        content_scope{}
        resource_type{}

## Example Corpuses

All PDFs (`application/pdf`) in web.archive.org from before the year 2000.
Starting point would be a CDX list.

Spidering crawls starting from a set of OA journal homepage URLs.

Archive-It partner collections from research universities, particularly of
their own .edu domains. Starting point would be an archive.org collection, from
which WARC files or CDX lists can be accessed.

General archive.org PDF collections, such as
[ERIC](https://archive.org/details/ericarchive) or
[Document Cloud](https://archive.org/details/documentcloud).

Specific journal or publisher URL patterns. Starting point could be a domain,
hostname, SURT prefix, and/or URL regex.

Heuristic patterns over the full web.archive.org CDX index. For example, .edu
domains with user directories and a `.pdf` in the file path ("tilde" username
pattern).

Random samples of the entire Wayback corpus. For example, random samples
filtered by date, content type, TLD, etc. This would be true "trawling" over
the entire corpus.


## Other Ideas

Could have a web archive spidering mode: starting from a seed, fetch multiple
captures (at different timestamps), then extract outlinks from those, up to
some number of hops. An example application would be following links to
research group webpages or author homepages and trying to extract PDF links
from CVs, etc. A rough sketch of this idea follows.
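If that mode were prototyped, a minimal sketch of hop-limited outlink
spidering might look like the following (assumed names and behavior; it
fetches pages directly rather than resolving wayback captures via the CDX
API, which a real implementation would need to do):

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    import requests


    class LinkExtractor(HTMLParser):
        """Collects href values from <a> tags."""

        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)


    def spider_outlinks(seed_url, max_hops=2):
        """Breadth-first outlink spidering from a seed URL, up to max_hops
        hops, collecting any URLs that look like PDF links (eg, from CVs on
        author homepages)."""
        seen = {seed_url}
        pdf_links = set()
        queue = deque([(seed_url, 0)])
        while queue:
            url, hops = queue.popleft()
            try:
                resp = requests.get(url, timeout=30)
            except requests.RequestException:
                continue
            if "html" not in resp.headers.get("content-type", ""):
                continue
            parser = LinkExtractor()
            parser.feed(resp.text)
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute.lower().endswith(".pdf"):
                    pdf_links.add(absolute)
                elif hops + 1 <= max_hops and absolute not in seen:
                    seen.add(absolute)
                    queue.append((absolute, hops + 1))
        return pdf_links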