status: in-progress

Trawling for Unstructured Scholarly Web Content
===============================================

## Background and Motivation

A long-term goal for sandcrawler has been the ability to pick through unstructured web archive content (or even non-web collections), identify potential in-scope research outputs, extract metadata for those outputs, and merge the content into a catalog (fatcat).

This process requires integration of many existing tools (HTML and PDF extraction; fuzzy bibliographic metadata matching; machine learning to identify in-scope content; etc), as well as high-level curation, targeting, and evaluation by human operators. The goal is to augment and improve the productivity of human operators as much as possible.

This process will be similar to "ingest", where we start with a specific URL and have some additional context about the expected result (eg, content type, external identifier). Some differences with trawling are that we start with a collection or context (instead of a single URL); have little or no context about the content we are looking for; and may even be creating a new catalog entry, as opposed to matching to a known existing entry.

## Architecture

The core operation is to take a resource and run a flowchart of processing steps on it, resulting in an overall status and possibly some related metadata. The common case is that the resource is a PDF or HTML document coming from wayback (with contextual metadata about the capture), but we should be flexible enough to support more content types in the future, and should try to support plain files with no context as well.

Some relatively simple wrapper code handles fetching resources and summarizing status/counts.

Outside of the scope of sandcrawler, new fatcat code (an importer or similar) will be needed to handle trawl results. It will probably make sense to pre-filter (with `jq` or `rg`) before passing results to fatcat.

At this stage, trawl workers will probably be run manually. Some successful outputs (like GROBID, HTML metadata) would be written to existing kafka topics to be persisted, but there would not be any specific `trawl` SQL tables or automation.

It will probably be helpful to have some kind of wrapper script that can run sandcrawler trawl processes, then filter and pipe the output into a fatcat importer, all from a single invocation, while reporting results.

TODO:

- for HTML imports, do we fetch the full webcapture stuff and return that?

## Methods of Operation

### `cdx_file`

An existing CDX file is provided on-disk locally.

### `cdx_api`

Simplified variants: `cdx_domain`, `cdx_surt`

Uses the CDX API to download records matching the configured filters, then processes the resulting file (see the sketch below).

Saves the CDX file intermediate result somewhere locally (working or tmp directory), with a timestamp in the path, to make re-trying with `cdx_file` fast and easy.

### `archiveorg_web_collection`

Uses `cdx_collection.py` (or similar) to iterate over the collection and fetch a full CDX list, then processes it.

Saves the CDX file intermediate result somewhere locally (working or tmp directory), with a timestamp in the path, to make re-trying with `cdx_file` fast and easy.
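The CDX-fetching step for `cdx_api` (and its simplified variants) could look roughly like the following. This is a minimal sketch, not the actual sandcrawler implementation: it uses the public wayback CDX server endpoint and parameters, while the function name and local path convention for the saved intermediate file are hypothetical.

```python
"""Sketch of the `cdx_api` fetch-and-save step (illustrative only)."""

import datetime
import pathlib

import requests

CDX_API = "https://web.archive.org/cdx/search/cdx"


def fetch_cdx_snapshot(domain: str, mimetype: str = "application/pdf",
                       to_date: str = "2000",
                       work_dir: str = "/tmp/trawl") -> pathlib.Path:
    """Download matching CDX rows for a domain and save them locally."""
    params = {
        "url": domain,
        "matchType": "domain",
        "filter": f"mimetype:{mimetype}",
        "to": to_date,
        "output": "json",
        "limit": "100000",
    }
    resp = requests.get(CDX_API, params=params, timeout=120)
    resp.raise_for_status()

    # timestamp in the path makes it easy to re-run later via `cdx_file`
    timestamp = datetime.datetime.utcnow().strftime("%Y%m%d%H%M%S")
    out_path = pathlib.Path(work_dir) / f"{domain}.{timestamp}.cdx.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(resp.text)
    return out_path


if __name__ == "__main__":
    print(fetch_cdx_snapshot("example.edu"))
```

The saved file path would then be fed back in as the `cdx_file_path` of a follow-up request, which is the point of keeping the intermediate result around.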
### Others

- `archiveorg_file_collection`: fetch file list via archive.org metadata, then process each file

## Schema

Per-resource results:

    hit (bool)
        indicates whether the resource seems in scope and was processed
        successfully (roughly, status 'success' and a content_scope that was
        not filtered out)
    status (str)
        success: fetched resource, ran processing, passed filters
        skip-cdx: filtered before even fetching resource
        skip-resource: filtered after fetching resource
        wayback-error (etc): problem fetching resource
    content_scope (str)
        filtered-{filtertype}
        article (etc)
        landing-page
    resource_type (str)
        pdf, html
    file_meta{}
    cdx{}
    revisit_cdx{}

    # below are resource_type specific
    grobid
    pdf_meta
    pdf_trio
    html_biblio
    (other heuristics and ML)

High-level request (an illustrative example appears at the end of this document):

    trawl_method: str
    cdx_file_path
    default_filters: bool
    resource_filters[]
        scope: str
            one of: surt_prefix, domain, host, mimetype, size, datetime,
            resource_type, http_status
        value: any
        values[]: any
        min: any
        max: any
    biblio_context{}: set of expected/default values
        container_id
        release_type
        release_stage
        url_rel

High-level summary / results:

    status
    request{}: the entire request object
    counts
        total_resources
        status{}
        content_scope{}
        resource_type{}

## Example Corpuses

All PDFs (`application/pdf`) in web.archive.org from before the year 2000. Starting point would be a CDX list.

Spidering crawls starting from a set of OA journal homepage URLs.

Archive-It partner collections from research universities, particularly of their own .edu domains. Starting point would be an archive.org collection, from which WARC files or CDX lists can be accessed.

General archive.org PDF collections, such as [ERIC](https://archive.org/details/ericarchive) or [Document Cloud](https://archive.org/details/documentcloud).

Specific journal or publisher URL patterns. Starting point could be a domain, hostname, SURT prefix, and/or URL regex.

Heuristic patterns over the full web.archive.org CDX index. For example, .edu domains with user directories and a `.pdf` in the file path ("tilde" username pattern).

Random samples of the entire Wayback corpus. For example, random samples filtered by date, content type, TLD, etc. This would be true "trawling" over the entire corpus.

## Other Ideas

Could have a web archive spidering mode: starting from a seed, fetch multiple captures, extract outlinks from those, and repeat up to some number of hops. An example application would be following links to research group webpages or author homepages, and trying to extract PDF links from CVs, etc.
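As a concrete illustration of the request schema above, a trawl request for the "specific journal or publisher URL patterns" corpus might look something like the following. The field names come from the schema; the values, the SURT prefix, and the Python dict form are all hypothetical.

```python
# Illustrative trawl request following the schema in this document. All values
# are placeholders; filter semantics and the biblio_context vocabulary would
# need to match whatever the trawl workers and fatcat importer actually accept.
example_request = {
    "trawl_method": "cdx_api",
    "default_filters": True,
    "resource_filters": [
        # SURT prefix for a hypothetical publisher site
        {"scope": "surt_prefix", "value": "edu,example)/journals/"},
        {"scope": "mimetype", "values": ["application/pdf", "text/html"]},
        {"scope": "size", "min": 1024, "max": 100 * 1024 * 1024},
        {"scope": "http_status", "value": 200},
    ],
    "biblio_context": {
        "container_id": None,
        "release_type": "article-journal",
        "release_stage": "published",
        "url_rel": "web",
    },
}
```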