'trawling' proposal (in progress)

author: Bryan Newbold <bnewbold@archive.org> 2022-01-27 17:55:40 -0800
committer: Bryan Newbold <bnewbold@archive.org> 2022-01-27 17:55:40 -0800
commit: 6b59e1f4f08662ac9e6c3adb731af31e42f894a6 (patch)
tree: d13a4e9f09e2f89c14298874c937c033e6f8fee7
parent: c8e2462471a010e4ae368941b539e9404f3768fc (diff)
download: sandcrawler-6b59e1f4f08662ac9e6c3adb731af31e42f894a6.tar.gz
sandcrawler-6b59e1f4f08662ac9e6c3adb731af31e42f894a6.zip
1 files changed, 177 insertions, 0 deletions
diff --git a/proposals/2021-12-09_trawling.md b/proposals/2021-12-09_trawling.md
new file mode 100644
index 0000000..96c5f3f
--- /dev/null
+++ b/proposals/2021-12-09_trawling.md
@@ -0,0 +1,177 @@
+
+status: in-progress
+
+Trawling for Unstructured Scholarly Web Content
+===============================================
+
+## Background and Motivation
+
+A long-term goal for sandcrawler has been the ability to pick through
+unstructured web archive content (or even non-web collections), identify
+potential in-scope research outputs, extract metadata for those outputs, and
+merge the content into a catalog (fatcat).
+
+This process requires integration of many existing tools (HTML and PDF
+extraction; fuzzy bibliographic metadata matching; machine learning to identify
+in-scope content; etc.), as well as high-level curation, targeting, and
+evaluation by human operators. The goal is to augment and improve the
+productivity of human operators as much as possible.
+
+This process will be similar to "ingest", which is where we start with a
+specific URL and have some additional context about the expected result (eg,
+content type, external identifier). Some differences with trawling are that we
+start with a collection or context (instead of a single URL); have little or
+no context about the content we are looking for; and may even be creating a new
+catalog entry, as opposed to matching to a known existing entry.
+
+
+## Architecture
+
+The core operation is to take a resource and run a flowchart of processing
+steps on it, resulting in an overall status and possible related metadata. The
+common case is that the resource is a PDF or HTML coming from wayback (with
+contextual metadata about the capture), but we should be flexible enough to
+support more content types in the future, and should try to support plain
+files with no context as well.
+
+Some relatively simple wrapper code handles fetching resources and summarizing
+status/counts.
+
+Outside of the scope of sandcrawler, new fatcat code (importer or similar) will
+be needed to handle trawl results. It will probably make sense to pre-filter
+(with `jq` or `rg`) before passing results to fatcat.
+
+At this stage, trawl workers will probably be run manually. Some successful
+outputs (like GROBID, HTML metadata) would be written to existing kafka topics
+to be persisted, but there would not be any specific `trawl` SQL tables or
+automation.
+
+It will probably be helpful to have some kind of wrapper script that can run
+sandcrawler trawl processes, then filter and pipe the output into a fatcat
+importer, all from a single invocation, while reporting results.
+
+TODO:
+- for HTML imports, do we fetch the full webcapture stuff and return that?
+
+
+## Methods of Operation
+
+### `cdx_file`
+
+An existing CDX file is provided on-disk locally.
+
+### `cdx_api`
+
+Simplified variants: `cdx_domain`, `cdx_surt`
+
+Uses CDX API to download records matching the configured filters, then processes the file.
+
+Saves the CDX file intermediate result somewhere locally (working or tmp
+directory), with timestamp in the path, to make re-trying with `cdx_file` fast
+and easy.
+
+
+### `archiveorg_web_collection`
+
+Uses `cdx_collection.py` (or similar) to fetch a full CDX list, by iterating over
+the collection's items, then processes it.
+
+Saves the CDX file intermediate result somewhere locally (working or tmp
+directory), with timestamp in the path, to make re-trying with `cdx_file` fast
+and easy.
+
+### Others
+
+- `archiveorg_file_collection`: fetches a file list via archive.org metadata, then processes each file
+
+## Schema
+
+Per-resource results:
+
+    hit (bool)
+        indicates whether resource seems in scope and was processed successfully
+        (roughly, status 'success', and 
+    status (str)
+        success: fetched resource, ran processing, pa
+        skip-cdx: filtered before even fetching resource
+        skip-resource: filtered after fetching resource
+        wayback-error (etc): problem fetching
+    content_scope (str)
+        filtered-{filtertype}
+        article (etc)
+        landing-page
+    resource_type (str)
+        pdf, html
+    file_meta{}
+    cdx{}
+    revisit_cdx{}
+
+    # below are resource_type specific
+    grobid
+    pdf_meta
+    pdf_trio
+    html_biblio
+    (other heuristics and ML)
+
+High-level request:
+
+    trawl_method: str
+    cdx_file_path
+    default_filters: bool
+    resource_filters[]
+        scope: str
+            surt_prefix, domain, host, mimetype, size, datetime, resource_type, http_status
+        value: any
+        values[]: any
+        min: any
+        max: any
+    biblio_context{}: set of expected/default values
+        container_id
+        release_type
+        release_stage
+        url_rel
+
+High-level summary / results:
+
+    status
+    request{}: the entire request object
+    counts
+        total_resources
+        status{}
+        content_scope{}
+        resource_type{}
+
+## Example Corpuses
+
+All PDFs (`application/pdf`) in web.archive.org from before the year 2000.
+Starting point would be a CDX list.
+
+Spidering crawls starting from a set of OA journal homepage URLs.
+
+Archive-It partner collections from research universities, particularly of
+their own .edu domains. Starting point would be an archive.org collection, from
+which WARC files or CDX lists can be accessed.
+
+General archive.org PDF collections, such as
+[ERIC](https://archive.org/details/ericarchive) or
+[Document Cloud](https://archive.org/details/documentcloud).
+
+Specific Journal or Publisher URL patterns. Starting point could be a domain,
+hostname, SURT prefix, and/or URL regex.
+
+Heuristic patterns over the full web.archive.org CDX index. For example, .edu
+domains with user directories and a `.pdf` in the file path ("tilde" username
+pattern).
+
+Random samples of the entire Wayback corpus. For example, random samples filtered
+by date, content type, TLD, etc. This would be true "trawling" over the entire
+corpus.
+
+
+## Other Ideas
+
+Could have a web archive spidering mode: starting from a seed, fetch multiple
+captures (different captures), then extract outlinks from those, up to some
+number of hops. An example application would be links to research group
+webpages or author homepages, and to try to extract PDF links from CVs, etc.
+
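The request and result schemas in the proposal above can be illustrated with a minimal sketch of the pre-fetch filtering step: applying `resource_filters` to CDX rows, emitting `skip-cdx` results with a `filtered-{filtertype}` content scope, and rolling those up into summary `counts`. This is not part of the commit and is only one possible reading of the schema. The function names (`filter_matches`, `prefilter_cdx`, `summarize`) and the CDX record keys (`surt`, `mimetype`, `datetime`) are assumptions made for illustration, and the top-level summary `status` is left out because the proposal does not list its values.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]


def filter_matches(flt: Dict[str, Any], record: Record) -> bool:
    """Return True if one filter accepts the record (hypothetical semantics)."""
    scope = flt["scope"]
    if scope == "surt_prefix":
        # Prefix match against an assumed 'surt' key on the CDX record.
        prefixes = list(flt.get("values", []))
        if "value" in flt:
            prefixes.append(flt["value"])
        surt = record.get("surt") or ""
        return any(surt.startswith(p) for p in prefixes)
    actual = record.get(scope)
    if actual is None:
        return False
    if "value" in flt and actual != flt["value"]:
        return False
    if "values" in flt and actual not in flt["values"]:
        return False
    if "min" in flt and actual < flt["min"]:
        return False
    if "max" in flt and actual > flt["max"]:
        return False
    return True


def prefilter_cdx(
    request: Dict[str, Any], records: List[Record]
) -> Tuple[List[Record], List[Dict[str, Any]]]:
    """Split records into those worth fetching and 'skip-cdx' results."""
    passed: List[Record] = []
    skipped: List[Dict[str, Any]] = []
    for record in records:
        failed = next(
            (f for f in request.get("resource_filters", [])
             if not filter_matches(f, record)),
            None,
        )
        if failed is None:
            passed.append(record)
        else:
            skipped.append({
                "hit": False,
                "status": "skip-cdx",
                "content_scope": f"filtered-{failed['scope']}",
                "cdx": record,
            })
    return passed, skipped


def summarize(request: Dict[str, Any], results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build the 'counts' portion of the high-level summary."""
    keys = ("status", "content_scope", "resource_type")
    tallies = {k: Counter(r[k] for r in results if k in r) for k in keys}
    counts: Dict[str, Any] = {"total_resources": len(results)}
    counts.update({k: dict(v) for k, v in tallies.items()})
    return {"request": request, "counts": counts}


# Example data (made up): keep only PDFs and HTML captured before 2000.
example_request = {
    "trawl_method": "cdx_file",
    "resource_filters": [
        {"scope": "mimetype", "values": ["application/pdf", "text/html"]},
        {"scope": "datetime", "max": "20000101000000"},
    ],
}
example_records = [
    {"surt": "edu,example)/~user/paper.pdf", "mimetype": "application/pdf",
     "datetime": "19990304050607"},
    {"surt": "edu,example)/logo.png", "mimetype": "image/png",
     "datetime": "19990304050607"},
    {"surt": "edu,example)/new.pdf", "mimetype": "application/pdf",
     "datetime": "20150101000000"},
]
example_passed, example_skipped = prefilter_cdx(example_request, example_records)
example_summary = summarize(example_request, example_skipped)
```

Comparing `datetime` as 14-digit strings keeps the `min`/`max` filters simple, since fixed-width timestamps sort lexicographically. A real worker would go on to fetch and process the records in `example_passed` and add their results before summarizing.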