Diffstat (limited to 'proposals')
-rw-r--r-- | proposals/2018_original_sandcrawler_rfc.md | 180
-rw-r--r-- | proposals/2019_ingest.md | 2
-rw-r--r-- | proposals/20200129_pdf_ingest.md | 2
-rw-r--r-- | proposals/20200207_pdftrio.md | 5
-rw-r--r-- | proposals/20201012_no_capture.md | 5
-rw-r--r-- | proposals/20201103_xml_ingest.md | 19
-rw-r--r-- | proposals/2020_pdf_meta_thumbnails.md | 2
-rw-r--r-- | proposals/2021-04-22_crossref_db.md | 2
-rw-r--r-- | proposals/2021-12-09_trawling.md | 180
-rw-r--r-- | proposals/brainstorm/2021-debug_web_interface.md | 9
-rw-r--r-- | proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md | 36
11 files changed, 418 insertions, 24 deletions
diff --git a/proposals/2018_original_sandcrawler_rfc.md b/proposals/2018_original_sandcrawler_rfc.md new file mode 100644 index 0000000..ecf7ab8 --- /dev/null +++ b/proposals/2018_original_sandcrawler_rfc.md @@ -0,0 +1,180 @@ + +**Title:** Journal Archiving Pipeline + +**Author:** Bryan Newbold <bnewbold@archive.org> + +**Date:** March 2018 + +**Status:** work-in-progress + +This is an RFC-style technical proposal for a journal crawling, archiving, +extracting, resolving, and cataloging pipeline. + +Design work funded by a Mellon Foundation grant in 2018. + +## Overview + +Let's start with data stores first: + +- crawled original fulltext (PDF, JATS, HTML) ends up in petabox/global-wayback +- file-level extracted fulltext and metadata is stored in HBase, with the hash + of the original file as the key +- cleaned metadata is stored in a "catalog" relational (SQL) database (probably +PostgreSQL or some hip scalable NewSQL thing compatible with Postgres or +MariaDB) + +**Resources:** back-of-the-envelope, around 100 TB petabox storage total (for +100 million PDF files); 10-20 TB HBase table total. Can start small. + + +All "system" (aka, pipeline) state (eg, "what work has been done") is ephemeral +and is rederived relatively easily (but might be cached for performance). + +The overall "top-down", metadata-driven cycle is: + +1. Partners and public sources provide metadata (for catalog) and seed lists + (for crawlers) +2. Crawlers pull in fulltext and HTTP/HTML metadata from the public web +3. Extractors parse raw fulltext files (PDFs) and store structured metadata (in + HBase) +4. Data Mungers match extracted metadata (from HBase) against the catalog, or + create new records if none found. + +In the "bottom up" cycle, batch jobs run as map/reduce jobs against the +catalog, HBase, global wayback, and partner metadata datasets to identify +potential new public or already-archived content to process, and push tasks +to the crawlers, extractors, and mungers.
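The file-level store in the hunk above keys rows by the hash of the original file. A trivial sketch of computing such a key; hex-encoded SHA-1 is an assumption here, since the RFC does not name a specific hash:

```python
import hashlib

def file_row_key(file_bytes):
    # Row key for the file-level HBase table: hash of the original file.
    # SHA-1 (hex-encoded) is an assumed choice, not specified in the RFC.
    return hashlib.sha1(file_bytes).hexdigest()
```

Because the key derives purely from content, the same PDF crawled from many URLs maps to a single row, which is what makes the de-dupe checks later in the pipeline cheap.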
+ +## Partner Metadata + +Periodic Luigi scripts run on a regular VM to pull in metadata from partners. +All metadata is saved to either petabox (for public stuff) or HDFS (for +restricted). Scripts process/munge the data and push directly to the catalog +(for trusted/authoritative sources like Crossref, ISSN, PubMed, DOAJ); others +extract seedlists and push to the crawlers. + +**Resources:** 1 VM (could be a devbox), with a large attached disk (spinning +probably ok) + +## Crawling + +All fulltext content comes in from the public web via crawling, and all crawled +content ends up in global wayback. + +One or more VMs serve as perpetual crawlers, with multiple active ("perpetual") +Heritrix crawls operating with differing configuration. These could be +orchestrated (like h3), or just have the crawl jobs cut off and restarted every +year or so. + +In a starter configuration, there would be two crawl queues. One would target +direct PDF links, landing pages, author homepages, DOI redirects, etc. It would +process HTML and look for PDF outlinks, but wouldn't crawl recursively. + +HBase is used for de-dupe, with records (pointers) stored in WARCs. + +A second config would take seeds as entire journal websites, and would crawl +continuously. + +Other components of the system "push" tasks to the crawlers by copying schedule +files into the crawl action directories. + +WARCs would be uploaded into petabox via draintasker as usual, and CDX +derivation would be left to the derive process. Other processes are notified of +"new crawl content" being available when they see new unprocessed CDX files in +items from specific collections. draintasker could be configured to "cut" new +items every 24 hours at most to ensure this pipeline moves along regularly, or +we could come up with other hacks to get lower "latency" at this stage.
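The "copy schedule files into the crawl action directories" step above could look roughly like this sketch; the file naming and atomic-rename convention are illustrative assumptions, not documented sandcrawler behavior:

```python
import os
import time

def push_seeds_to_crawler(seed_urls, action_dir):
    """Write a schedule file of seed URLs into a Heritrix action directory.

    File naming is an assumption; the atomic rename ensures the crawler never
    picks up a half-written file.
    """
    fname = "seeds-{}.schedule".format(time.strftime("%Y%m%d%H%M%S"))
    tmp_path = os.path.join(action_dir, "." + fname)
    final_path = os.path.join(action_dir, fname)
    with open(tmp_path, "w") as f:
        for url in seed_urls:
            f.write(url + "\n")
    os.rename(tmp_path, final_path)
    return final_path
```

Any component (partner-metadata scripts, mungers, batch jobs) could call this to feed the perpetual crawls.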
+ +**Resources:** 1-2 crawler VMs, each with a large attached disk (spinning) + +### De-Dupe Efficiency + +We would certainly feed CDX info from all bulk journal crawling into HBase +before any additional large crawling, to get that level of de-dupe. + +Whether all GWB PDFs should be de-duped against is a policy question: is +there something special about the journal-specific crawls that makes it worth +having second copies? Eg, if we had previously domain crawled and access is +restricted, we then wouldn't be allowed to provide researcher access to those +files... on the other hand, we could extract for researchers given that we +"refound" the content at a new URL? + +Only fulltext files (PDFs) would be de-duped against (by content), so we'd be +recrawling lots of HTML. Presumably this is a fraction of crawl data size; what +fraction? + +Watermarked files would be refreshed repeatedly from the same PDF, and even +extracted/processed repeatedly (because the hash would be different). This is +hard to de-dupe/skip, because we would want to catch "content drift" (changes +in files). + +## Extractors + +Off-the-shelf PDF extraction software runs on high-CPU VM nodes (probably +GROBID running on 1-2 data nodes, which have 30+ CPU cores and plenty of RAM +and network throughput). + +A hadoop streaming job (written in python) takes a CDX file as task input. It +filters for only PDFs, and then checks each line against HBase to see if it has +already been extracted. If it hasn't, the script downloads directly from +petabox using the full CDX info (bypassing wayback, which would be a +bottleneck). It optionally runs any "quick check" scripts to see if the PDF +should be skipped ("definitely not a scholarly work"), then if it looks OK +submits the file over HTTP to the GROBID worker pool for extraction. The +results are pushed to HBase, and a short status line written to Hadoop.
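A minimal sketch of the filtering step in the streaming job just described, assuming the standard 11-field CDX line format. The HBase already-extracted check is omitted; the GROBID endpoint shown is the service's standard one, but the worker-pool URL is a placeholder:

```python
def parse_cdx_line(line):
    """Parse a standard 11-field CDX line into a dict (None if malformed)."""
    fields = line.strip().split(" ")
    if len(fields) != 11:
        return None
    return {
        "surt": fields[0], "datetime": fields[1], "url": fields[2],
        "mimetype": fields[3], "http_status": fields[4], "sha1b32": fields[5],
        "c_size": fields[8], "offset": fields[9], "warc": fields[10],
    }

def is_extractable_pdf(cdx):
    """Filter: only successfully-captured PDFs get sent to extraction."""
    return (cdx is not None
            and cdx["mimetype"] == "application/pdf"
            and cdx["http_status"] == "200")

def submit_to_grobid(pdf_bytes, grobid_uri="http://localhost:8070"):
    """POST a PDF to GROBID's fulltext endpoint; returns TEI-XML text."""
    import requests  # imported here so the pure CDX helpers above have no deps
    resp = requests.post(
        grobid_uri + "/api/processFulltextDocument",
        files={"input": pdf_bytes},
        timeout=180.0,
    )
    resp.raise_for_status()
    return resp.text
```

The streaming job would map over CDX lines, apply `is_extractable_pdf`, fetch the record bytes from petabox, and push `submit_to_grobid` results into HBase.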
The +overall Hadoop job has a reduce phase that generates a human-meaningful report +of job status (eg, number of corrupt files) for monitoring. + +A side job as part of extracting can "score" the extracted metadata to flag +problems with GROBID, to be used as potential training data for improvement. + +**Resources:** 1-2 datanode VMs; hadoop cluster time. Needed up-front for +backlog processing; less CPU needed over time. + +## Matchers + +The matcher runs as a "scan" HBase map/reduce job over new (unprocessed) HBase +rows. It pulls just the basic metadata (title, author, identifiers, abstract) +and calls the catalog API to identify potential match candidates. If no match +is found, and the metadata "looks good" based on some filters (to remove, eg, +spam), works are inserted into the catalog (eg, for those works that don't have +globally available identifiers or other metadata; "long tail" and legacy +content). + +**Resources:** Hadoop cluster time + +## Catalog + +The catalog is a versioned relational database. All scripts interact with an +API server (instead of connecting directly to the database). It should be +reliable and low-latency for simple reads, so it can be relied on to provide a +public-facing API and have public web interfaces built on top. This is in +contrast to Hadoop, which for the most part could go down with no public-facing +impact (other than fulltext API queries). The catalog does not contain +copyrightable material, but it does contain strong (verified) links to fulltext +content. Policy gets implemented here if necessary. + +A global "changelog" (append-only log) is used in the catalog to record every +change, allowing for easier replication (internal or external, to partners). As +little as possible is implemented in the catalog itself; instead helper and +cleanup bots use the API to propose and verify edits, similar to the wikidata +and git data models. + +Public APIs and any front-end services are built on the catalog.
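The matcher-to-catalog candidate lookup above needs some normalized bibliographic match key. This is a hedged illustration only; the RFC leaves the actual matching logic to the catalog API:

```python
import re
import unicodedata

def title_match_key(title):
    """Normalize a title into a compact lookup key.

    Strips accents, punctuation, case, and whitespace; an illustrative
    normalization, not the catalog's real algorithm.
    """
    t = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode()
    t = re.sub(r"[^a-z0-9]", "", t.lower())
    return t or None
```

Keys like this are deliberately lossy; candidates returned by the catalog would still need verification against the fuller metadata before a munger inserts or links anything.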
Elasticsearch +(for metadata or fulltext search) could build on top of the catalog. + +**Resources:** Unknown, but estimate 1+ TB of SSD storage each on 2 or more +database machines + +## Machine Learning and "Bottom Up" + +TBD. + +## Logistics + +Ansible is used to deploy all components. Luigi is used as a task scheduler for +batch jobs, with cron to initiate periodic tasks. Errors and actionable +problems are aggregated in Sentry. + +Logging, metrics, and other debugging and monitoring are TBD. + diff --git a/proposals/2019_ingest.md b/proposals/2019_ingest.md index c05c9df..768784f 100644 --- a/proposals/2019_ingest.md +++ b/proposals/2019_ingest.md @@ -1,5 +1,5 @@ -status: work-in-progress +status: deployed This document proposes structure and systems for ingesting (crawling) paper PDFs and other content as part of sandcrawler. diff --git a/proposals/20200129_pdf_ingest.md b/proposals/20200129_pdf_ingest.md index 620ed09..157607e 100644 --- a/proposals/20200129_pdf_ingest.md +++ b/proposals/20200129_pdf_ingest.md @@ -1,5 +1,5 @@ -status: planned +status: deployed 2020q1 Fulltext PDF Ingest Plan =================================== diff --git a/proposals/20200207_pdftrio.md b/proposals/20200207_pdftrio.md index 31a2db6..6f6443f 100644 --- a/proposals/20200207_pdftrio.md +++ b/proposals/20200207_pdftrio.md @@ -1,5 +1,8 @@ -status: in progress +status: deployed + +NOTE: while this has been used in production, as of December 2022 the results +are not used much in practice, and we don't score every PDF that comes along PDF Trio (ML Classification) ============================== diff --git a/proposals/20201012_no_capture.md b/proposals/20201012_no_capture.md index 27c14d1..7f6a1f5 100644 --- a/proposals/20201012_no_capture.md +++ b/proposals/20201012_no_capture.md @@ -1,5 +1,8 @@ -status: in-progress +status: work-in-progress + +NOTE: as of December 2022, bnewbold can't remember if this was fully +implemented or not.
Storing no-capture missing URLs in `terminal_url` ================================================= diff --git a/proposals/20201103_xml_ingest.md b/proposals/20201103_xml_ingest.md index 25ec973..34e00b0 100644 --- a/proposals/20201103_xml_ingest.md +++ b/proposals/20201103_xml_ingest.md @@ -1,22 +1,5 @@ -status: wip - -TODO: -x XML fulltext URL extractor (based on HTML biblio metadata, not PDF url extractor) -x differential JATS XML and scielo XML from generic XML? - application/xml+jats is what fatcat is doing for abstracts - but it should be application/jats+xml? - application/tei+xml - if startswith "<article " and "<article-meta>" => JATS -x refactor ingest worker to be more general -x have ingest code publish body to kafka topic -x write a persist worker -/ create/configure kafka topic -- test everything locally -- fatcat: ingest tool to create requests -- fatcat: entity updates worker creates XML ingest requests for specific sources -- fatcat: ingest file import worker allows XML results -- ansible: deployment of persist worker +status: deployed XML Fulltext Ingest ==================== diff --git a/proposals/2020_pdf_meta_thumbnails.md b/proposals/2020_pdf_meta_thumbnails.md index f231a7f..141ece8 100644 --- a/proposals/2020_pdf_meta_thumbnails.md +++ b/proposals/2020_pdf_meta_thumbnails.md @@ -1,5 +1,5 @@ -status: work-in-progress +status: deployed New PDF derivatives: thumbnails, metadata, raw text =================================================== diff --git a/proposals/2021-04-22_crossref_db.md b/proposals/2021-04-22_crossref_db.md index bead7a4..1d4c3f8 100644 --- a/proposals/2021-04-22_crossref_db.md +++ b/proposals/2021-04-22_crossref_db.md @@ -1,5 +1,5 @@ -status: work-in-progress +status: deployed Crossref DOI Metadata in Sandcrawler DB ======================================= diff --git a/proposals/2021-12-09_trawling.md b/proposals/2021-12-09_trawling.md new file mode 100644 index 0000000..33b6b4c --- /dev/null +++ 
b/proposals/2021-12-09_trawling.md @@ -0,0 +1,180 @@ + +status: work-in-progress + +NOTE: as of December 2022, the implementation of these features hasn't been +merged to the main branch. Development stalled in December 2021. + +Trawling for Unstructured Scholarly Web Content +=============================================== + +## Background and Motivation + +A long-term goal for sandcrawler has been the ability to pick through +unstructured web archive content (or even non-web collections), identify +potential in-scope research outputs, extract metadata for those outputs, and +merge the content into a catalog (fatcat). + +This process requires integration of many existing tools (HTML and PDF +extraction; fuzzy bibliographic metadata matching; machine learning to identify +in-scope content; etc), as well as high-level curation, targeting, and +evaluation by human operators. The goal is to augment and improve the +productivity of human operators as much as possible. + +This process will be similar to "ingest", which is where we start with a +specific URL and have some additional context about the expected result (eg, +content type, external identifier). Some differences with trawling are that we +start with a collection or context (instead of a single URL); have little or +no context about the content we are looking for; and may even be creating a new +catalog entry, as opposed to matching to a known existing entry. + + +## Architecture + +The core operation is to take a resource and run a flowchart of processing +steps on it, resulting in an overall status and possible related metadata. The +common case is that the resource is a PDF or HTML coming from wayback (with +contextual metadata about the capture), but we should be flexible in supporting +more content types in the future, and should try to support plain files with no +context as well. + +Some relatively simple wrapper code handles fetching resources and summarizing +status/counts.
+ +Outside of the scope of sandcrawler, new fatcat code (importer or similar) will +be needed to handle trawl results. It will probably make sense to pre-filter +(with `jq` or `rg`) before passing results to fatcat. + +At this stage, trawl workers will probably be run manually. Some successful +outputs (like GROBID, HTML metadata) would be written to existing kafka topics +to be persisted, but there would not be any specific `trawl` SQL tables or +automation. + +It will probably be helpful to have some kind of wrapper script that can run +sandcrawler trawl processes, then filter and pipe the output into a fatcat +importer, all from a single invocation, while reporting results. + +TODO: +- for HTML imports, do we fetch the full webcapture stuff and return that? + + +## Methods of Operation + +### `cdx_file` + +An existing CDX file is provided on-disk locally. + +### `cdx_api` + +Simplified variants: `cdx_domain`, `cdx_surt` + +Uses the CDX API to download records matching the configured filters, then processes the file. + +Saves the CDX file intermediate result somewhere locally (working or tmp +directory), with timestamp in the path, to make re-trying with `cdx_file` fast +and easy. + + +### `archiveorg_web_collection` + +Uses `cdx_collection.py` (or similar) to fetch a full CDX list by iterating over +the collection, then processes it. + +Saves the CDX file intermediate result somewhere locally (working or tmp +directory), with timestamp in the path, to make re-trying with `cdx_file` fast +and easy.
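The `cdx_api` method above might be sketched like this, assuming the public wayback CDX API; the query parameters shown are real API parameters, but the working-directory layout and file naming are illustrative assumptions:

```python
import os
import time
import urllib.parse
import urllib.request

CDX_API = "https://web.archive.org/cdx/search/cdx"

def cdx_save_path(work_dir, now=None):
    """Timestamped intermediate CDX path, so a run can be re-tried via `cdx_file`."""
    stamp = time.strftime("%Y%m%d%H%M%S", now or time.gmtime())
    return os.path.join(work_dir, "trawl-{}.cdx".format(stamp))

def fetch_cdx(url_pattern, work_dir, match_type="prefix"):
    """Download CDX records for a URL prefix and save them locally."""
    params = urllib.parse.urlencode({
        "url": url_pattern,
        "matchType": match_type,  # the API also accepts "domain", "host", "exact"
        "output": "text",
    })
    with urllib.request.urlopen("{}?{}".format(CDX_API, params), timeout=60) as resp:
        body = resp.read().decode("utf-8")
    os.makedirs(work_dir, exist_ok=True)
    path = cdx_save_path(work_dir)
    with open(path, "w") as f:
        f.write(body)
    return path
```

The simplified `cdx_domain` / `cdx_surt` variants would just pin `match_type` and the shape of `url_pattern`.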
+ +### Others + +- `archiveorg_file_collection`: fetch file list via archive.org metadata, then process each + +## Schema + +Per-resource results: + + hit (bool) + indicates whether resource seems in scope and was processed successfully + (roughly, status 'success', and + status (str) + success: fetched resource, ran processing, pa + skip-cdx: filtered before even fetching resource + skip-resource: filtered after fetching resource + wayback-error (etc): problem fetching + content_scope (str) + filtered-{filtertype} + article (etc) + landing-page + resource_type (str) + pdf, html + file_meta{} + cdx{} + revisit_cdx{} + + # below are resource_type specific + grobid + pdf_meta + pdf_trio + html_biblio + (other heuristics and ML) + +High-level request: + + trawl_method: str + cdx_file_path + default_filters: bool + resource_filters[] + scope: str + surt_prefix, domain, host, mimetype, size, datetime, resource_type, http_status + value: any + values[]: any + min: any + max: any + biblio_context{}: set of expected/default values + container_id + release_type + release_stage + url_rel + +High-level summary / results: + + status + request{}: the entire request object + counts + total_resources + status{} + content_scope{} + resource_type{} + +## Example Corpuses + +All PDFs (`application/pdf`) in web.archive.org from before the year 2000. +Starting point would be a CDX list. + +Spidering crawls starting from a set of OA journal homepage URLs. + +Archive-It partner collections from research universities, particularly of +their own .edu domains. Starting point would be an archive.org collection, from +which WARC files or CDX lists can be accessed. + +General archive.org PDF collections, such as +[ERIC](https://archive.org/details/ericarchive) or +[Document Cloud](https://archive.org/details/documentcloud). + +Specific Journal or Publisher URL patterns. Starting point could be a domain, +hostname, SURT prefix, and/or URL regex.
+ +Heuristic patterns over the full web.archive.org CDX index. For example, .edu +domains with user directories and a `.pdf` in the file path ("tilde" username +pattern). + +Random samples of the entire Wayback corpus. For example, random samples filtered +by date, content type, TLD, etc. This would be true "trawling" over the entire +corpus. + + +## Other Ideas + +Could have a web archive spidering mode: starting from a seed, fetch multiple +captures, then extract outlinks from those, up to some +number of hops. An example application would be links to research group +webpages or author homepages, and to try to extract PDF links from CVs, etc. + diff --git a/proposals/brainstorm/2021-debug_web_interface.md b/proposals/brainstorm/2021-debug_web_interface.md new file mode 100644 index 0000000..442b439 --- /dev/null +++ b/proposals/brainstorm/2021-debug_web_interface.md @@ -0,0 +1,9 @@ + +status: brainstorm idea + +Simple internal-only web interface to help debug ingest issues. + +- paste a hash, URL, or identifier and get a display of "everything we know" about it +- enter a URL/SURT prefix and get aggregate stats (?) +- enter a domain/host/prefix and get recent attempts/results +- pre-computed periodic reports on ingest pipeline (?) diff --git a/proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md b/proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md new file mode 100644 index 0000000..b3ad447 --- /dev/null +++ b/proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md @@ -0,0 +1,36 @@ + +status: brainstorming + +We continue to see issues with SPNv2-based crawling. Would like to have an +option to switch to higher-throughput heritrix-based crawling. + +SPNv2 path would stick around at least for save-paper-now style ingest. + + +## Sketch + +Ingest requests are created continuously by fatcat, with daily spikes. + +Ingest workers run mostly in "bulk" mode, aka they don't make SPNv2 calls.
+`no-capture` responses are recorded in the sandcrawler SQL database. + +Periodically (daily?), a script queries for new no-capture results, filtered to +the most recent period. These are processed a bit into a URL list, then +converted to a heritrix frontier, and sent to crawlers. This could either be an +h3 instance (?), or simple `scp` to a running crawl directory. + +The crawler crawls, with usual landing page config, and draintasker runs. + +TODO: can we have draintasker/heritrix set a maximum WARC life? Like 6 hours? +or, target a smaller draintasker item size, so they get updated more frequently + +Another SQL script dumps ingest requests from the *previous* period, and +re-submits them for bulk-style ingest (by workers). + +The end result would be things getting crawled and updated within a couple +days. + + +## Sketch 2 + +Upload URL list to petabox item, wait for heritrix derive to run (!)
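The periodic no-capture query in the first sketch might look something like this; direct `psycopg2` access plus the exact table and column names are assumptions, not the confirmed sandcrawler schema:

```python
def dedupe_urls(rows):
    """Collapse query result rows into a sorted, de-duplicated URL list."""
    return sorted({row[0] for row in rows if row and row[0]})

def dump_no_capture_urls(dsn, out_path, days=1):
    # psycopg2 imported here so dedupe_urls stays dependency-free; the
    # `ingest_file_result` table and its columns are assumed names.
    import psycopg2
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT terminal_url FROM ingest_file_result "
            "WHERE status = 'no-capture' "
            "AND updated > now() - %s * interval '1 day'",
            (days,),
        )
        urls = dedupe_urls(cur.fetchall())
    with open(out_path, "w") as f:
        for url in urls:
            f.write(url + "\n")
    return len(urls)
```

The output file is a plain URL list; turning it into an actual Heritrix frontier (or just `scp`-ing it into a running crawl directory) would be left to glue scripting.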