aboutsummaryrefslogtreecommitdiffstats
path: root/proposals
diff options
context:
space:
mode:
Diffstat (limited to 'proposals')
-rw-r--r--proposals/2018_original_sandcrawler_rfc.md180
-rw-r--r--proposals/2019_ingest.md2
-rw-r--r--proposals/20200129_pdf_ingest.md2
-rw-r--r--proposals/20200207_pdftrio.md5
-rw-r--r--proposals/20201012_no_capture.md5
-rw-r--r--proposals/20201103_xml_ingest.md19
-rw-r--r--proposals/2020_pdf_meta_thumbnails.md2
-rw-r--r--proposals/2021-04-22_crossref_db.md2
-rw-r--r--proposals/2021-12-09_trawling.md180
-rw-r--r--proposals/brainstorm/2021-debug_web_interface.md9
-rw-r--r--proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md36
11 files changed, 418 insertions, 24 deletions
diff --git a/proposals/2018_original_sandcrawler_rfc.md b/proposals/2018_original_sandcrawler_rfc.md
new file mode 100644
index 0000000..ecf7ab8
--- /dev/null
+++ b/proposals/2018_original_sandcrawler_rfc.md
@@ -0,0 +1,180 @@
+
+**Title:** Journal Archiving Pipeline
+
+**Author:** Bryan Newbold <bnewbold@archive.org>
+
+**Date:** March 2018
+
+**Status:** work-in-progress
+
+This is an RFC-style technical proposal for a journal crawling, archiving,
+extracting, resolving, and cataloging pipeline.
+
+Design work funded by a Mellon Foundation grant in 2018.
+
+## Overview
+
+Let's start with data stores first:
+
+- crawled original fulltext (PDF, JATS, HTML) ends up in petabox/global-wayback
+- file-level extracted fulltext and metadata is stored in HBase, with the hash
+ of the original file as the key
+- cleaned metadata is stored in a "catalog" relational (SQL) database (probably
+ PostgreSQL or some hip scalable NewSQL thing compatible with Postgres or
+ MariaDB)
+
+**Resources:** back-of-the-envelope, around 100 TB petabox storage total (for
+100 million PDF files); 10-20 TB HBase table total. Can start small.
+
+
+All "system" (aka, pipeline) state (eg, "what work has been done") is ephemeral
+and is rederived relatively easily (but might be cached for performance).
+
+The overall "top-down", metadata-driven cycle is:
+
+1. Partners and public sources provide metadata (for catalog) and seed lists
+ (for crawlers)
+2. Crawlers pull in fulltext and HTTP/HTML metadata from the public web
+3. Extractors parse raw fulltext files (PDFs) and store structured metadata (in
+ HBase)
+4. Data Mungers match extracted metadata (from HBase) against the catalog, or
+ create new records if none found.
+
+In the "bottom up" cycle, batch jobs run as map/reduce jobs against the
+catalog, HBase, global wayback, and partner metadata datasets to identify
+potential new public or already-archived content to process, and pushes tasks
+to the crawlers, extractors, and mungers.
+
+## Partner Metadata
+
+Periodic Luigi scripts run on a regular VM to pull in metadata from partners.
+All metadata is saved to either petabox (for public stuff) or HDFS (for
+restricted). Scripts process/munge the data and push directly to the catalog
+(for trusted/authoritative sources like Crossref, ISSN, PubMed, DOAJ); others
+extract seedlists and push to the crawlers (
+
+**Resources:** 1 VM (could be a devbox), with a large attached disk (spinning
+probably ok)
+
+## Crawling
+
+All fulltext content comes in from the public web via crawling, and all crawled
+content ends up in global wayback.
+
+One or more VMs serve as perpetual crawlers, with multiple active ("perpetual")
+Heritrix crawls operating with differing configuration. These could be
+orchestrated (like h3), or just have the crawl jobs cut off and restarted every
+year or so.
+
+In a starter configuration, there would be two crawl queues. One would target
+direct PDF links, landing pages, author homepages, DOI redirects, etc. It would
+process HTML and look for PDF outlinks, but wouldn't crawl recursively.
+
+HBase is used for de-dupe, with records (pointers) stored in WARCs.
+
+A second config would take seeds as entire journal websites, and would crawl
+continuously.
+
+Other components of the system "push" tasks to the crawlers by copying schedule
+files into the crawl action directories.
+
+WARCs would be uploaded into petabox via draintasker as usual, and CDX
+derivation would be left to the derive process. Other processes are notified of
+"new crawl content" being available when they see new unprocessed CDX files in
+items from specific collections. draintasker could be configured to "cut" new
+items every 24 hours at most to ensure this pipeline moves along regularly, or
+we could come up with other hacks to get lower "latency" at this stage.
+
+**Resources:** 1-2 crawler VMs, each with a large attached disk (spinning)
+
+### De-Dupe Efficiency
+
+We would certainly feed CDX info from all bulk journal crawling into HBase
+before any additional large crawling, to get that level of de-dupe.
+
+As to whether all GWB PDFs should be de-dupe against is a policy question: is
+there something special about the journal-specific crawls that makes it worth
+having second copies? Eg, if we had previously domain crawled and access is
+restricted, we then wouldn't be allowed to provide researcher access to those
+files... on the other hand, we could extract for researchers given that we
+"refound" the content at a new URL?
+
+Only fulltext files (PDFs) would be de-duped against (by content), so we'd be
+recrawling lots of HTML. Presumably this is a fraction of crawl data size; what
+fraction?
+
+Watermarked files would be refreshed repeatedly from the same PDF, and even
+extracted/processed repeatedly (because the hash would be different). This is
+hard to de-dupe/skip, because we would want to catch "content drift" (changes
+in files).
+
+## Extractors
+
+Off-the-shelf PDF extraction software runs on high-CPU VM nodes (probably
+GROBID running on 1-2 data nodes, which have 30+ CPU cores and plenty of RAM
+and network throughput).
+
+A hadoop streaming job (written in python) takes a CDX file as task input. It
+filters for only PDFs, and then checks each line against HBase to see if it has
+already been extracted. If it hasn't, the script downloads directly from
+petabox using the full CDX info (bypassing wayback, which would be a
+bottleneck). It optionally runs any "quick check" scripts to see if the PDF
+should be skipped ("definitely not a scholarly work"), then if it looks Ok
+submits the file over HTTP to the GROBID worker pool for extraction. The
+results are pushed to HBase, and a short status line written to Hadoop. The
+overall Hadoop job has a reduce phase that generates a human-meaningful report
+of job status (eg, number of corrupt files) for monitoring.
+
+A side job as part of extracting can "score" the extracted metadata to flag
+problems with GROBID, to be used as potential training data for improvement.
+
+**Resources:** 1-2 datanode VMs; hadoop cluster time. Needed up-front for
+backlog processing; less CPU needed over time.
+
+## Matchers
+
+The matcher runs as a "scan" HBase map/reduce job over new (unprocessed) HBasej
+rows. It pulls just the basic metadata (title, author, identifiers, abstract)
+and calls the catalog API to identify potential match candidates. If no match
+is found, and the metadata "look good" based on some filters (to remove, eg,
+spam), works are inserted into the catalog (eg, for those works that don't have
+globally available identifiers or other metadata; "long tail" and legacy
+content).
+
+**Resources:** Hadoop cluster time
+
+## Catalog
+
+The catalog is a versioned relational database. All scripts interact with an
+API server (instead of connecting directly to the database). It should be
+reliable and low-latency for simple reads, so it can be relied on to provide a
+public-facing API and have public web interfaces built on top. This is in
+contrast to Hadoop, which for the most part could go down with no public-facing
+impact (other than fulltext API queries). The catalog does not contain
+copywritable material, but it does contain strong (verified) links to fulltext
+content. Policy gets implemented here if necessary.
+
+A global "changelog" (append-only log) is used in the catalog to record every
+change, allowing for easier replication (internal or external, to partners). As
+little as possible is implemented in the catalog itself; instead helper and
+cleanup bots use the API to propose and verify edits, similar to the wikidata
+and git data models.
+
+Public APIs and any front-end services are built on the catalog. Elasticsearch
+(for metadata or fulltext search) could build on top of the catalog.
+
+**Resources:** Unknown, but estimate 1+ TB of SSD storage each on 2 or more
+database machines
+
+## Machine Learning and "Bottom Up"
+
+TBD.
+
+## Logistics
+
+Ansible is used to deploy all components. Luigi is used as a task scheduler for
+batch jobs, with cron to initiate periodic tasks. Errors and actionable
+problems are aggregated in Sentry.
+
+Logging, metrics, and other debugging and monitoring are TBD.
+
diff --git a/proposals/2019_ingest.md b/proposals/2019_ingest.md
index c05c9df..768784f 100644
--- a/proposals/2019_ingest.md
+++ b/proposals/2019_ingest.md
@@ -1,5 +1,5 @@
-status: work-in-progress
+status: deployed
This document proposes structure and systems for ingesting (crawling) paper
PDFs and other content as part of sandcrawler.
diff --git a/proposals/20200129_pdf_ingest.md b/proposals/20200129_pdf_ingest.md
index 620ed09..157607e 100644
--- a/proposals/20200129_pdf_ingest.md
+++ b/proposals/20200129_pdf_ingest.md
@@ -1,5 +1,5 @@
-status: planned
+status: deployed
2020q1 Fulltext PDF Ingest Plan
===================================
diff --git a/proposals/20200207_pdftrio.md b/proposals/20200207_pdftrio.md
index 31a2db6..6f6443f 100644
--- a/proposals/20200207_pdftrio.md
+++ b/proposals/20200207_pdftrio.md
@@ -1,5 +1,8 @@
-status: in progress
+status: deployed
+
+NOTE: while this has been used in production, as of December 2022 the results
+are not used much in practice, and we don't score every PDF that comes along
PDF Trio (ML Classification)
==============================
diff --git a/proposals/20201012_no_capture.md b/proposals/20201012_no_capture.md
index 27c14d1..7f6a1f5 100644
--- a/proposals/20201012_no_capture.md
+++ b/proposals/20201012_no_capture.md
@@ -1,5 +1,8 @@
-status: in-progress
+status: work-in-progress
+
+NOTE: as of December 2022, bnewbold can't remember if this was fully
+implemented or not.
Storing no-capture missing URLs in `terminal_url`
=================================================
diff --git a/proposals/20201103_xml_ingest.md b/proposals/20201103_xml_ingest.md
index 25ec973..34e00b0 100644
--- a/proposals/20201103_xml_ingest.md
+++ b/proposals/20201103_xml_ingest.md
@@ -1,22 +1,5 @@
-status: wip
-
-TODO:
-x XML fulltext URL extractor (based on HTML biblio metadata, not PDF url extractor)
-x differential JATS XML and scielo XML from generic XML?
- application/xml+jats is what fatcat is doing for abstracts
- but it should be application/jats+xml?
- application/tei+xml
- if startswith "<article " and "<article-meta>" => JATS
-x refactor ingest worker to be more general
-x have ingest code publish body to kafka topic
-x write a persist worker
-/ create/configure kafka topic
-- test everything locally
-- fatcat: ingest tool to create requests
-- fatcat: entity updates worker creates XML ingest requests for specific sources
-- fatcat: ingest file import worker allows XML results
-- ansible: deployment of persist worker
+status: deployed
XML Fulltext Ingest
====================
diff --git a/proposals/2020_pdf_meta_thumbnails.md b/proposals/2020_pdf_meta_thumbnails.md
index f231a7f..141ece8 100644
--- a/proposals/2020_pdf_meta_thumbnails.md
+++ b/proposals/2020_pdf_meta_thumbnails.md
@@ -1,5 +1,5 @@
-status: work-in-progress
+status: deployed
New PDF derivatives: thumbnails, metadata, raw text
===================================================
diff --git a/proposals/2021-04-22_crossref_db.md b/proposals/2021-04-22_crossref_db.md
index bead7a4..1d4c3f8 100644
--- a/proposals/2021-04-22_crossref_db.md
+++ b/proposals/2021-04-22_crossref_db.md
@@ -1,5 +1,5 @@
-status: work-in-progress
+status: deployed
Crossref DOI Metadata in Sandcrawler DB
=======================================
diff --git a/proposals/2021-12-09_trawling.md b/proposals/2021-12-09_trawling.md
new file mode 100644
index 0000000..33b6b4c
--- /dev/null
+++ b/proposals/2021-12-09_trawling.md
@@ -0,0 +1,180 @@
+
+status: work-in-progress
+
+NOTE: as of December 2022, the implementation on these features haven't been
+merged to the main branch. Development stalled in December 2021.
+
+Trawling for Unstructured Scholarly Web Content
+===============================================
+
+## Background and Motivation
+
+A long-term goal for sandcrawler has been the ability to pick through
+unstructured web archive content (or even non-web collection), identify
+potential in-scope research outputs, extract metadata for those outputs, and
+merge the content in to a catalog (fatcat).
+
+This process requires integration of many existing tools (HTML and PDF
+extraction; fuzzy bibliographic metadata matching; machine learning to identify
+in-scope content; etc), as well as high-level curration, targetting, and
+evaluation by human operators. The goal is to augment and improve the
+productivity of human operators as much as possible.
+
+This process will be similar to "ingest", which is where we start with a
+specific URL and have some additional context about the expected result (eg,
+content type, exernal identifier). Some differences with trawling are that we
+are start with a collection or context (instead of single URL); have little or
+no context about the content we are looking for; and may even be creating a new
+catalog entry, as opposed to matching to a known existing entry.
+
+
+## Architecture
+
+The core operation is to take a resource and run a flowchart of processing
+steps on it, resulting in an overall status and possible related metadata. The
+common case is that the resource is a PDF or HTML coming from wayback (with
+contextual metadata about the capture), but we should be flexible to supporting
+more content types in the future, and should try to support plain files with no
+context as well.
+
+Some relatively simple wrapper code handles fetching resources and summarizing
+status/counts.
+
+Outside of the scope of sandcrawler, new fatcat code (importer or similar) will
+be needed to handle trawl results. It will probably make sense to pre-filter
+(with `jq` or `rg`) before passing results to fatcat.
+
+At this stage, trawl workers will probably be run manually. Some successful
+outputs (like GROBID, HTML metadata) would be written to existing kafka topics
+to be persisted, but there would not be any specific `trawl` SQL tables or
+automation.
+
+It will probably be helpful to have some kind of wrapper script that can run
+sandcrawler trawl processes, then filter and pipe the output into fatcat
+importer, all from a single invocation, while reporting results.
+
+TODO:
+- for HTML imports, do we fetch the full webcapture stuff and return that?
+
+
+## Methods of Operation
+
+### `cdx_file`
+
+An existing CDX file is provided on-disk locally.
+
+### `cdx_api`
+
+Simplified variants: `cdx_domain`, `cdx_surt`
+
+Uses CDX API to download records matching the configured filters, then processes the file.
+
+Saves the CDX file intermediate result somewhere locally (working or tmp
+directory), with timestamp in the path, to make re-trying with `cdx_file` fast
+and easy.
+
+
+### `archiveorg_web_collection`
+
+Uses `cdx_collection.py` (or similar) to fetch a full CDX list, by iterating over
+then process it.
+
+Saves the CDX file intermediate result somewhere locally (working or tmp
+directory), with timestamp in the path, to make re-trying with `cdx_file` fast
+and easy.
+
+### Others
+
+- `archiveorg_file_collection`: fetch file list via archive.org metadata, then processes each
+
+## Schema
+
+Per-resource results:
+
+ hit (bool)
+ indicates whether resource seems in scope and was processed successfully
+ (roughly, status 'success', and
+ status (str)
+ success: fetched resource, ran processing, pa
+ skip-cdx: filtered before even fetching resource
+ skip-resource: filtered after fetching resource
+ wayback-error (etc): problem fetching
+ content_scope (str)
+ filtered-{filtertype}
+ article (etc)
+ landing-page
+ resource_type (str)
+ pdf, html
+ file_meta{}
+ cdx{}
+ revisit_cdx{}
+
+ # below are resource_type specific
+ grobid
+ pdf_meta
+ pdf_trio
+ html_biblio
+ (other heuristics and ML)
+
+High-level request:
+
+ trawl_method: str
+ cdx_file_path
+ default_filters: bool
+ resource_filters[]
+ scope: str
+ surt_prefix, domain, host, mimetype, size, datetime, resource_type, http_status
+ value: any
+ values[]: any
+ min: any
+ max: any
+ biblio_context{}: set of expected/default values
+ container_id
+ release_type
+ release_stage
+ url_rel
+
+High-level summary / results:
+
+ status
+ request{}: the entire request object
+ counts
+ total_resources
+ status{}
+ content_scope{}
+ resource_type{}
+
+## Example Corpuses
+
+All PDFs (`application/pdf`) in web.archive.org from before the year 2000.
+Starting point would be a CDX list.
+
+Spidering crawls starting from a set of OA journal homepage URLs.
+
+Archive-It partner collections from research universities, particularly of
+their own .edu domains. Starting point would be an archive.org collection, from
+which WARC files or CDX lists can be accessed.
+
+General archive.org PDF collections, such as
+[ERIC](https://archive.org/details/ericarchive) or
+[Document Cloud](https://archive.org/details/documentcloud).
+
+Specific Journal or Publisher URL patterns. Starting point could be a domain,
+hostname, SURT prefix, and/or URL regex.
+
+Heuristic patterns over full web.archive.org CDX index. For example, .edu
+domains with user directories and a `.pdf` in the file path ("tilde" username
+pattern).
+
+Random samples of entire Wayback corpus. For example, random samples filtered
+by date, content type, TLD, etc. This would be true "trawling" over the entire
+corpus.
+
+
+## Other Ideas
+
+Could have a web archive spidering mode: starting from a seed, fetch multiple
+captures (different captures), then extract outlinks from those, up to some
+number of hops. An example application would be links to research group
+webpages or author homepages, and to try to extract PDF links from CVs, etc.
+
diff --git a/proposals/brainstorm/2021-debug_web_interface.md b/proposals/brainstorm/2021-debug_web_interface.md
new file mode 100644
index 0000000..442b439
--- /dev/null
+++ b/proposals/brainstorm/2021-debug_web_interface.md
@@ -0,0 +1,9 @@
+
+status: brainstorm idea
+
+Simple internal-only web interface to help debug ingest issues.
+
+- paste a hash, URL, or identifier and get a display of "everything we know" about it
+- enter a URL/SURT prefix and get aggregate stats (?)
+- enter a domain/host/prefix and get recent attempts/results
+- pre-computed periodic reports on ingest pipeline (?)
diff --git a/proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md b/proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md
new file mode 100644
index 0000000..b3ad447
--- /dev/null
+++ b/proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md
@@ -0,0 +1,36 @@
+
+status: brainstorming
+
+We continue to see issues with heritrix3-based crawling. Would like to have an
+option to switch to higher-throughput heritrix-based crawling.
+
+SPNv2 path would stick around at least for save-paper-now style ingest.
+
+
+## Sketch
+
+Ingest requests are created continuously by fatcat, with daily spikes.
+
+Ingest workers run mostly in "bulk" mode, aka they don't make SPNv2 calls.
+`no-capture` responses are recorded in sandcrawler SQL database.
+
+Periodically (daily?), a script queries for new no-capture results, filtered to
+the most recent period. These are processed in a bit in to a URL list, then
+converted to a heritrix frontier, and sent to crawlers. This could either be an
+h3 instance (?), or simple `scp` to a running crawl directory.
+
+The crawler crawls, with usual landing page config, and draintasker runs.
+
+TODO: can we have draintasker/heritrix set a maximum WARC life? Like 6 hours?
+or, target a smaller draintasker item size, so they get updated more frequently
+
+Another SQL script dumps ingest requests from the *previous* period, and
+re-submits them for bulk-style ingest (by workers).
+
+The end result would be things getting crawled and updated within a couple
+days.
+
+
+## Sketch 2
+
+Upload URL list to petabox item, wait for heritrix derive to run (!)