Diffstat (limited to 'proposals')
 proposals/2018_original_sandcrawler_rfc.md | 180
 proposals/2019_ingest.md | 6
 proposals/20200129_pdf_ingest.md | 10
 proposals/20200207_pdftrio.md | 5
 proposals/20201012_no_capture.md | 7
 proposals/20201103_xml_ingest.md | 21
 proposals/2020_pdf_meta_thumbnails.md | 4
 proposals/2020_seaweed_s3.md | 2
 proposals/2021-04-22_crossref_db.md | 86
 proposals/2021-09-09_component_ingest.md | 114
 proposals/2021-09-09_fileset_ingest.md | 343
 proposals/2021-09-13_src_ingest.md | 53
 proposals/2021-09-21_spn_accounts.md | 14
 proposals/2021-10-28_grobid_refs.md | 125
 proposals/2021-12-09_trawling.md | 180
 proposals/brainstorm/2021-debug_web_interface.md | 9
 proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md | 36
 17 files changed, 1162 insertions(+), 33 deletions(-)
diff --git a/proposals/2018_original_sandcrawler_rfc.md b/proposals/2018_original_sandcrawler_rfc.md
new file mode 100644
index 0000000..ecf7ab8
--- /dev/null
+++ b/proposals/2018_original_sandcrawler_rfc.md
@@ -0,0 +1,180 @@
+
+**Title:** Journal Archiving Pipeline
+
+**Author:** Bryan Newbold <bnewbold@archive.org>
+
+**Date:** March 2018
+
+**Status:** work-in-progress
+
+This is an RFC-style technical proposal for a journal crawling, archiving,
+extracting, resolving, and cataloging pipeline.
+
+Design work funded by a Mellon Foundation grant in 2018.
+
+## Overview
+
+Let's start with data stores first:
+
+- crawled original fulltext (PDF, JATS, HTML) ends up in petabox/global-wayback
+- file-level extracted fulltext and metadata is stored in HBase, with the hash
+ of the original file as the key
+- cleaned metadata is stored in a "catalog" relational (SQL) database (probably
+ PostgreSQL or some hip scalable NewSQL thing compatible with Postgres or
+ MariaDB)
+
+**Resources:** back-of-the-envelope, around 100 TB petabox storage total (for
+100 million PDF files); 10-20 TB HBase table total. Can start small.
+
+
+All "system" (aka, pipeline) state (eg, "what work has been done") is ephemeral
+and is rederived relatively easily (but might be cached for performance).
+
+The overall "top-down", metadata-driven cycle is:
+
+1. Partners and public sources provide metadata (for catalog) and seed lists
+ (for crawlers)
+2. Crawlers pull in fulltext and HTTP/HTML metadata from the public web
+3. Extractors parse raw fulltext files (PDFs) and store structured metadata (in
+ HBase)
+4. Data Mungers match extracted metadata (from HBase) against the catalog, or
+ create new records if none found.
+
+In the "bottom up" cycle, batch jobs run as map/reduce jobs against the
+catalog, HBase, global wayback, and partner metadata datasets to identify
+potential new public or already-archived content to process, and push tasks
+to the crawlers, extractors, and mungers.
+
+## Partner Metadata
+
+Periodic Luigi scripts run on a regular VM to pull in metadata from partners.
+All metadata is saved to either petabox (for public stuff) or HDFS (for
+restricted). Scripts process/munge the data and push directly to the catalog
+(for trusted/authoritative sources like Crossref, ISSN, PubMed, DOAJ); others
+extract seedlists and push to the crawlers.
+
+**Resources:** 1 VM (could be a devbox), with a large attached disk (spinning
+probably ok)
+
+## Crawling
+
+All fulltext content comes in from the public web via crawling, and all crawled
+content ends up in global wayback.
+
+One or more VMs serve as perpetual crawlers, with multiple active ("perpetual")
+Heritrix crawls operating with differing configuration. These could be
+orchestrated (like h3), or just have the crawl jobs cut off and restarted every
+year or so.
+
+In a starter configuration, there would be two crawl queues. One would target
+direct PDF links, landing pages, author homepages, DOI redirects, etc. It would
+process HTML and look for PDF outlinks, but wouldn't crawl recursively.
+
+HBase is used for de-dupe, with records (pointers to content stored in WARCs).
+
+A second config would take seeds as entire journal websites, and would crawl
+continuously.
+
+Other components of the system "push" tasks to the crawlers by copying schedule
+files into the crawl action directories.
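+
+As a rough illustration (not part of the proposal's implementation), a task
+pusher might write batches of URLs into a crawler's action directory. The
+directory path and file naming here are hypothetical:
+
+    import datetime
+    import pathlib
+
+    # hypothetical action directory for one of the perpetual Heritrix crawls
+    ACTION_DIR = pathlib.Path("/srv/heritrix/jobs/journal-pdf/action")
+
+    def push_seeds(urls):
+        """Write a batch of URLs as a schedule file for the crawler to pick up."""
+        stamp = datetime.datetime.utcnow().strftime("%Y%m%d%H%M%S")
+        tmp_path = ACTION_DIR / f"seeds-{stamp}.schedule.tmp"
+        final_path = ACTION_DIR / f"seeds-{stamp}.schedule"
+        with open(tmp_path, "w") as f:
+            for url in urls:
+                f.write(url + "\n")
+        # rename so the crawler never sees a partially-written file
+        tmp_path.rename(final_path)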
+
+WARCs would be uploaded into petabox via draintasker as usual, and CDX
+derivation would be left to the derive process. Other processes are notified of
+"new crawl content" being available when they see new unprocessed CDX files in
+items from specific collections. draintasker could be configured to "cut" new
+items every 24 hours at most to ensure this pipeline moves along regularly, or
+we could come up with other hacks to get lower "latency" at this stage.
+
+**Resources:** 1-2 crawler VMs, each with a large attached disk (spinning)
+
+### De-Dupe Efficiency
+
+We would certainly feed CDX info from all bulk journal crawling into HBase
+before any additional large crawling, to get that level of de-dupe.
+
+Whether all GWB PDFs should be de-duped against is a policy question: is
+there something special about the journal-specific crawls that makes it worth
+having second copies? Eg, if we had previously domain crawled and access is
+restricted, we then wouldn't be allowed to provide researcher access to those
+files... on the other hand, we could extract for researchers given that we
+"refound" the content at a new URL?
+
+Only fulltext files (PDFs) would be de-duped against (by content), so we'd be
+recrawling lots of HTML. Presumably this is a fraction of crawl data size; what
+fraction?
+
+Watermarked files would be refreshed repeatedly from the same PDF, and even
+extracted/processed repeatedly (because the hash would be different). This is
+hard to de-dupe/skip, because we would want to catch "content drift" (changes
+in files).
+
+## Extractors
+
+Off-the-shelf PDF extraction software runs on high-CPU VM nodes (probably
+GROBID running on 1-2 data nodes, which have 30+ CPU cores and plenty of RAM
+and network throughput).
+
+A hadoop streaming job (written in python) takes a CDX file as task input. It
+filters for only PDFs, and then checks each line against HBase to see if it has
+already been extracted. If it hasn't, the script downloads directly from
+petabox using the full CDX info (bypassing wayback, which would be a
+bottleneck). It optionally runs any "quick check" scripts to see if the PDF
+should be skipped ("definitely not a scholarly work"), then, if it looks OK,
+submits the file over HTTP to the GROBID worker pool for extraction. The
+results are pushed to HBase, and a short status line written to Hadoop. The
+overall Hadoop job has a reduce phase that generates a human-meaningful report
+of job status (eg, number of corrupt files) for monitoring.
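+
+A minimal sketch of the streaming mapper stage described above (the HBase
+check, petabox fetch, and HBase write are hypothetical helper functions; the
+GROBID endpoint is the standard REST API path):
+
+    import sys
+    import requests
+
+    GROBID_URL = "http://grobid-worker:8070/api/processFulltextDocument"  # host is an assumption
+
+    def mapper(lines):
+        for line in lines:
+            cdx = line.split()
+            mimetype = cdx[3]   # standard 11-field CDX: mimetype at index 3
+            sha1 = cdx[5]       # digest at index 5
+            if mimetype != "application/pdf":
+                continue
+            if already_extracted(sha1):          # hypothetical HBase lookup
+                print("skip-duplicate")
+                continue
+            pdf_bytes = fetch_from_petabox(cdx)  # hypothetical direct petabox fetch
+            resp = requests.post(GROBID_URL, files={"input": pdf_bytes})
+            save_to_hbase(sha1, resp.text)       # hypothetical HBase write
+            print("extracted")
+
+    if __name__ == "__main__":
+        mapper(sys.stdin)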
+
+A side job as part of extracting can "score" the extracted metadata to flag
+problems with GROBID, to be used as potential training data for improvement.
+
+**Resources:** 1-2 datanode VMs; hadoop cluster time. Needed up-front for
+backlog processing; less CPU needed over time.
+
+## Matchers
+
+The matcher runs as a "scan" HBase map/reduce job over new (unprocessed) HBase
+rows. It pulls just the basic metadata (title, author, identifiers, abstract)
+and calls the catalog API to identify potential match candidates. If no match
+is found, and the metadata "looks good" based on some filters (to remove, eg,
+spam), works are inserted into the catalog (eg, for those works that don't have
+globally available identifiers or other metadata; "long tail" and legacy
+content).
+
+**Resources:** Hadoop cluster time
+
+## Catalog
+
+The catalog is a versioned relational database. All scripts interact with an
+API server (instead of connecting directly to the database). It should be
+reliable and low-latency for simple reads, so it can be relied on to provide a
+public-facing API and have public web interfaces built on top. This is in
+contrast to Hadoop, which for the most part could go down with no public-facing
+impact (other than fulltext API queries). The catalog does not contain
+copyrightable material, but it does contain strong (verified) links to fulltext
+content. Policy gets implemented here if necessary.
+
+A global "changelog" (append-only log) is used in the catalog to record every
+change, allowing for easier replication (internal or external, to partners). As
+little as possible is implemented in the catalog itself; instead helper and
+cleanup bots use the API to propose and verify edits, similar to the wikidata
+and git data models.
+
+Public APIs and any front-end services are built on the catalog. Elasticsearch
+(for metadata or fulltext search) could build on top of the catalog.
+
+**Resources:** Unknown, but estimate 1+ TB of SSD storage each on 2 or more
+database machines
+
+## Machine Learning and "Bottom Up"
+
+TBD.
+
+## Logistics
+
+Ansible is used to deploy all components. Luigi is used as a task scheduler for
+batch jobs, with cron to initiate periodic tasks. Errors and actionable
+problems are aggregated in Sentry.
+
+Logging, metrics, and other debugging and monitoring are TBD.
+
diff --git a/proposals/2019_ingest.md b/proposals/2019_ingest.md
index c649809..768784f 100644
--- a/proposals/2019_ingest.md
+++ b/proposals/2019_ingest.md
@@ -1,5 +1,5 @@
-status: work-in-progress
+status: deployed
This document proposes structure and systems for ingesting (crawling) paper
PDFs and other content as part of sandcrawler.
@@ -84,7 +84,7 @@ HTML? Or both? Let's just recrawl.
*IngestRequest*
- `ingest_type`: required, one of `pdf`, `xml`, `html`, `dataset`. For
backwards compatibility, `file` should be interpreted as `pdf`. `pdf` and
- `xml` return file ingest respose; `html` and `dataset` not implemented but
+ `xml` return file ingest response; `html` and `dataset` not implemented but
would be webcapture (wayback) and fileset (archive.org item or wayback?).
In the future: `epub`, `video`, `git`, etc.
- `base_url`: required, where to start crawl process
@@ -258,7 +258,7 @@ and hacks to crawl publicly available papers. Related existing work includes
[unpaywall's crawler][unpaywall_crawl], LOCKSS extraction code, dissem.in's
efforts, zotero's bibliography extractor, etc. The "memento tracer" work is
also similar. Many of these are even in python! It would be great to reduce
-duplicated work and maintenance. An analagous system in the wild is youtube-dl
+duplicated work and maintenance. An analogous system in the wild is youtube-dl
for downloading video from many sources.
[unpaywall_crawl]: https://github.com/ourresearch/oadoi/blob/master/webpage.py
diff --git a/proposals/20200129_pdf_ingest.md b/proposals/20200129_pdf_ingest.md
index 9469217..157607e 100644
--- a/proposals/20200129_pdf_ingest.md
+++ b/proposals/20200129_pdf_ingest.md
@@ -1,5 +1,5 @@
-status: planned
+status: deployed
2020q1 Fulltext PDF Ingest Plan
===================================
@@ -27,7 +27,7 @@ There are a few million papers in fatacat which:
2. are known OA, usually because publication is Gold OA
3. don't have any fulltext PDF in fatcat
-As a detail, some of these "known OA" journals actually have embargos (aka,
+As a detail, some of these "known OA" journals actually have embargoes (aka,
they aren't true Gold OA). In particular, those marked via EZB OA "color", and
recent pubmed central ids.
@@ -104,7 +104,7 @@ Actions:
update ingest result table with status.
- fetch new MAG and unpaywall seedlists, transform to ingest requests, persist
into ingest request table. use SQL to dump only the *new* URLs (not seen in
- previous dumps) using the created timestamp, outputing new bulk ingest
+ previous dumps) using the created timestamp, outputting new bulk ingest
request lists. if possible, de-dupe between these two. then start bulk
heritrix crawls over these two long lists. Probably sharded over several
machines. Could also run serially (first one, then the other, with
@@ -133,7 +133,7 @@ We have run GROBID+glutton over basically all of these PDFs. We should be able
to do a SQL query to select PDFs that:
- have at least one known CDX row
-- GROBID processed successfuly and glutton matched to a fatcat release
+- GROBID processed successfully and glutton matched to a fatcat release
- do not have an existing fatcat file (based on sha1hex)
- output GROBID metadata, `file_meta`, and one or more CDX rows
@@ -161,7 +161,7 @@ Coding Tasks:
Actions:
- update `fatcat_file` sandcrawler table
-- check how many PDFs this might ammount to. both by uniq SHA1 and uniq
+- check how many PDFs this might amount to. both by uniq SHA1 and uniq
`fatcat_release` matches
- do some manual random QA verification to check that this method results in
quality content in fatcat
diff --git a/proposals/20200207_pdftrio.md b/proposals/20200207_pdftrio.md
index 31a2db6..6f6443f 100644
--- a/proposals/20200207_pdftrio.md
+++ b/proposals/20200207_pdftrio.md
@@ -1,5 +1,8 @@
-status: in progress
+status: deployed
+
+NOTE: while this has been used in production, as of December 2022 the results
+are not used much in practice, and we don't score every PDF that comes along.
PDF Trio (ML Classification)
==============================
diff --git a/proposals/20201012_no_capture.md b/proposals/20201012_no_capture.md
index bb47ea2..7f6a1f5 100644
--- a/proposals/20201012_no_capture.md
+++ b/proposals/20201012_no_capture.md
@@ -1,5 +1,8 @@
-status: in-progress
+status: work-in-progress
+
+NOTE: as of December 2022, bnewbold can't remember if this was fully
+implemented or not.
Storing no-capture missing URLs in `terminal_url`
=================================================
@@ -29,7 +32,7 @@ The current status quo is to store the missing URL as the last element in the
pipeline that would read from the Kafka feed and extract them, but this would
be messy. Eg, re-ingesting would not update the old kafka messages, so we could
need some accounting of consumer group offsets after which missing URLs are
-truely missing.
+truly missing.
We could add a new `missing_url` database column and field to the JSON schema,
for this specific use case. This seems like unnecessary extra work.
diff --git a/proposals/20201103_xml_ingest.md b/proposals/20201103_xml_ingest.md
index 181cc11..34e00b0 100644
--- a/proposals/20201103_xml_ingest.md
+++ b/proposals/20201103_xml_ingest.md
@@ -1,22 +1,5 @@
-status: wip
-
-TODO:
-x XML fulltext URL extractor (based on HTML biblio metadata, not PDF url extractor)
-x differential JATS XML and scielo XML from generic XML?
- application/xml+jats is what fatcat is doing for abstracts
- but it should be application/jats+xml?
- application/tei+xml
- if startswith "<article " and "<article-meta>" => JATS
-x refactor ingest worker to be more general
-x have ingest code publish body to kafka topic
-x write a persist worker
-/ create/configure kafka topic
-- test everything locally
-- fatcat: ingest tool to create requests
-- fatcat: entity updates worker creates XML ingest requests for specific sources
-- fatcat: ingest file import worker allows XML results
-- ansible: deployment of persist worker
+status: deployed
XML Fulltext Ingest
====================
@@ -37,7 +20,7 @@ document. For recording in fatcat, the file metadata will be passed through.
For storing in Kafka and blob store (for downstream analysis), we will parse
the raw XML document (as "bytes") with an XML parser, then re-output with UTF-8
encoding. The hash of the *original* XML file will be used as the key for
-refering to this document. This is unintuitive, but similar to what we are
+referring to this document. This is unintuitive, but similar to what we are
doing with PDF and HTML documents (extracting in a useful format, but keeping
the original document's hash as a key).
diff --git a/proposals/2020_pdf_meta_thumbnails.md b/proposals/2020_pdf_meta_thumbnails.md
index 793d6b5..141ece8 100644
--- a/proposals/2020_pdf_meta_thumbnails.md
+++ b/proposals/2020_pdf_meta_thumbnails.md
@@ -1,5 +1,5 @@
-status: work-in-progress
+status: deployed
New PDF derivatives: thumbnails, metadata, raw text
===================================================
@@ -133,7 +133,7 @@ Deployment will involve:
Plan for processing/catchup is:
- test with COVID-19 PDF corpus
-- run extraction on all current fatcat files avaiable via IA
+- run extraction on all current fatcat files available via IA
- integrate with ingest pipeline for all new files
- run a batch catchup job over all GROBID-parsed files with no pdf meta
extracted, on basis of SQL table query
diff --git a/proposals/2020_seaweed_s3.md b/proposals/2020_seaweed_s3.md
index 5f4ff0b..677393b 100644
--- a/proposals/2020_seaweed_s3.md
+++ b/proposals/2020_seaweed_s3.md
@@ -316,7 +316,7 @@ grows very much with the number of volumes. Therefore, keep default volume size
and do not limit number of volumes `-volume.max 0` and do not use in-memory
index (rather leveldb)
-Status: done, 200M object upload via Python script sucessfully in about 6 days,
+Status: done, 200M object upload via Python script successfully in about 6 days,
memory usage was at a moderate 400M (~10% of RAM). Relatively constant
performance at about 400 `PutObject` requests/s (over 5 threads, each thread
was around 80 requests/s; then testing with 4 threads, each thread got to
diff --git a/proposals/2021-04-22_crossref_db.md b/proposals/2021-04-22_crossref_db.md
new file mode 100644
index 0000000..1d4c3f8
--- /dev/null
+++ b/proposals/2021-04-22_crossref_db.md
@@ -0,0 +1,86 @@
+
+status: deployed
+
+Crossref DOI Metadata in Sandcrawler DB
+=======================================
+
+Proposal is to have a local copy of Crossref API metadata records in
+sandcrawler DB, accessible by simple key lookup via postgrest.
+
+Initial goal is to include these in scholar work "bundles" (along with
+fulltext, etc), in particular as part of reference extraction pipeline. Around
+late 2020, many additional references became available via Crossref records,
+and have not been imported (updated) into fatcat. Reference storage in fatcat
+API is a scaling problem we would like to put off, so injecting content in this
+way is desirable.
+
+To start, working with a bulk dump made available by Crossref. In the future,
+might persist the daily feed so that we have a continuously up-to-date copy.
+
+Another application of Crossref-in-bundles is to identify overall scale of
+changes since initial Crossref metadata import.
+
+
+## Sandcrawler DB Schema
+
+The "updated" field in this case refers to the upstream timestamp, not the
+sandcrawler database update time.
+
+ CREATE TABLE IF NOT EXISTS crossref (
+ doi TEXT NOT NULL CHECK (octet_length(doi) >= 4 AND doi = LOWER(doi)),
+ indexed TIMESTAMP WITH TIME ZONE NOT NULL,
+ record JSON NOT NULL,
+ PRIMARY KEY(doi)
+ );
+
+For postgrest access, may need to also:
+
+ GRANT SELECT ON public.crossref TO web_anon;
+
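+A simple key lookup through postgrest might then look like this (hostname and
+port are assumptions, not the deployed configuration):
+
+    import requests
+
+    # postgrest exposes the table as a REST endpoint; filter with the `eq.` operator
+    resp = requests.get(
+        "http://sandcrawler-db.example.org:3030/crossref",
+        params={"doi": "eq.10.1145/3366650.3366668"},
+    )
+    resp.raise_for_status()
+    records = resp.json()  # list of matching rows (0 or 1, since doi is the primary key)
+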
+## SQL Backfill Command
+
+For an example file:
+
+ cat sample.json \
+ | jq -rc '[(.DOI | ascii_downcase), .indexed."date-time", (. | tostring)] | @tsv' \
+ | psql sandcrawler -c "COPY crossref (doi, indexed, record) FROM STDIN (DELIMITER E'\t');"
+
+For a full snapshot:
+
+ zcat crossref_public_data_file_2021_01.json.gz \
+ | pv -l \
+ | jq -rc '[(.DOI | ascii_downcase), .indexed."date-time", (. | tostring)] | @tsv' \
+ | psql sandcrawler -c "COPY crossref (doi, indexed, record) FROM STDIN (DELIMITER E'\t');"
+
+jq is the bottleneck (100% of a single CPU core).
+
+## Kafka Worker
+
+Pulls from the fatcat crossref ingest Kafka feed and persists into the crossref
+table.
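+
+A rough sketch of such a worker (library choices and connection details are
+illustrative, not the deployed configuration):
+
+    import json
+    import psycopg2
+    from confluent_kafka import Consumer
+
+    consumer = Consumer({
+        "bootstrap.servers": "kafka-broker:9092",   # assumption
+        "group.id": "persist-crossref",
+    })
+    consumer.subscribe(["fatcat-prod.api-crossref"])
+    db = psycopg2.connect("dbname=sandcrawler")
+
+    while True:
+        msg = consumer.poll(1.0)
+        if msg is None or msg.error():
+            continue
+        record = json.loads(msg.value())
+        with db.cursor() as cur:
+            # upsert, replacing the row with the newest upstream version
+            cur.execute(
+                """INSERT INTO crossref (doi, indexed, record)
+                   VALUES (%s, %s, %s)
+                   ON CONFLICT (doi) DO UPDATE
+                   SET indexed = EXCLUDED.indexed, record = EXCLUDED.record""",
+                (record["DOI"].lower(), record["indexed"]["date-time"], json.dumps(record)),
+            )
+        db.commit()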
+
+## SQL Table Disk Utilization
+
+An example backfill from early 2021, with about 120 million Crossref DOI
+records.
+
+Starting database size (with ingest running):
+
+ Filesystem Size Used Avail Use% Mounted on
+ /dev/vdb1 1.7T 896G 818G 53% /1
+
+ Size: 475.14G
+
+Ingest SQL command took:
+
+ 120M 15:06:08 [2.22k/s]
+ COPY 120684688
+
+Database size after:
+
+ Filesystem Size Used Avail Use% Mounted on
+ /dev/vdb1 1.7T 1.2T 498G 71% /1
+
+ Size: 794.88G
+
+So about 320 GByte of disk.
diff --git a/proposals/2021-09-09_component_ingest.md b/proposals/2021-09-09_component_ingest.md
new file mode 100644
index 0000000..09dee4f
--- /dev/null
+++ b/proposals/2021-09-09_component_ingest.md
@@ -0,0 +1,114 @@
+
+File Ingest Mode: 'component'
+=============================
+
+A new ingest type for downloading individual files which are a subset of a
+complete work.
+
+Some publishers now assign DOIs to individual figures, supplements, and other
+"components" of an over release or document.
+
+Initial mimetypes to allow:
+
+- image/jpeg
+- image/tiff
+- image/png
+- image/gif
+- audio/mpeg
+- video/mp4
+- video/mpeg
+- text/plain
+- text/csv
+- application/json
+- application/xml
+- application/pdf
+- application/gzip
+- application/x-bzip
+- application/x-bzip2
+- application/zip
+- application/x-rar
+- application/x-7z-compressed
+- application/x-tar
+- application/vnd.ms-powerpoint
+- application/vnd.ms-excel
+- application/msword
+- application/vnd.openxmlformats-officedocument.wordprocessingml.document
+- application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
+
+Intentionally not supporting:
+
+- text/html
+
+
+## Fatcat Changes
+
+In the file importer, allow the additional mimetypes for 'component' ingest.
+
+
+## Ingest Changes
+
+Allow additional terminal mimetypes for 'component' crawls.
+
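+One way this could look in the ingest worker (a sketch only; the constant and
+function names here are illustrative, not the existing code):
+
+    COMPONENT_MIMETYPES = {
+        "image/jpeg", "image/tiff", "image/png", "image/gif",
+        "audio/mpeg", "video/mp4", "video/mpeg",
+        "text/plain", "text/csv",
+        "application/json", "application/xml", "application/pdf",
+        "application/gzip", "application/x-bzip", "application/x-bzip2",
+        "application/zip", "application/x-rar", "application/x-7z-compressed",
+        "application/x-tar",
+        "application/vnd.ms-powerpoint", "application/vnd.ms-excel",
+        "application/msword",
+        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
+        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
+    }
+
+    def terminal_mimetype_allowed(ingest_type: str, mimetype: str) -> bool:
+        """Check whether a terminal resource mimetype is acceptable for this ingest type."""
+        if ingest_type == "component":
+            return mimetype in COMPONENT_MIMETYPES
+        if ingest_type == "pdf":
+            return mimetype == "application/pdf"
+        # other ingest types are handled elsewhere
+        return False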
+
+## Examples
+
+Hundreds of thousands: <https://fatcat.wiki/release/search?q=type%3Acomponent+in_ia%3Afalse>
+
+#### ACS Supplement File
+
+<https://doi.org/10.1021/acscatal.0c02627.s002>
+
+Redirects directly to .zip in browser. SPN is blocked by cookie check.
+
+#### Frontiers .docx Supplement
+
+<https://doi.org/10.3389/fpls.2019.01642.s001>
+
+Redirects to full article page. There is a pop-up for figshare, seems hard to process.
+
+#### Figshare Single File
+
+<https://doi.org/10.6084/m9.figshare.13646972.v1>
+
+As 'component' type in fatcat.
+
+Redirects to a landing page. Dataset ingest seems more appropriate for this entire domain.
+
+#### PeerJ supplement file
+
+<https://doi.org/10.7717/peerj.10257/supp-7>
+
+PeerJ is hard because it redirects to a single HTML page, which has links to
+supplements in the HTML. Perhaps a custom extractor will work.
+
+#### eLife
+
+<https://doi.org/10.7554/elife.38407.010>
+
+The current crawl mechanism makes it seemingly impossible to extract a specific
+supplement from the document as a whole.
+
+#### Zookeys
+
+<https://doi.org/10.3897/zookeys.895.38576.figure53>
+
+These are extract-able.
+
+#### OECD PDF Supplement
+
+<https://doi.org/10.1787/f08c6324-en>
+<https://www.oecd-ilibrary.org/trade/imports-of-services-billions-of-us-dollars_f08c6324-en>
+
+Has an Excel (.xls) link, great, but then paywall.
+
+#### Direct File Link
+
+<https://doi.org/10.1787/888934207500>
+
+This one is also OECD, but is a simple direct download.
+
+#### Protein Data Bank (PDB) Entry
+
+<https://doi.org/10.2210/pdb6ls2/pdb>
+
+Multiple files; dataset/fileset more appropriate for these.
diff --git a/proposals/2021-09-09_fileset_ingest.md b/proposals/2021-09-09_fileset_ingest.md
new file mode 100644
index 0000000..65c9ccf
--- /dev/null
+++ b/proposals/2021-09-09_fileset_ingest.md
@@ -0,0 +1,343 @@
+
+status: implemented
+
+Fileset Ingest Pipeline (for Datasets)
+======================================
+
+Sandcrawler currently has ingest support for individual files saved as `file`
+entities in fatcat (xml and pdf ingest types) and HTML files with
+sub-components saved as `webcapture` entities in fatcat (html ingest type).
+
+This document describes extensions to this ingest system to flexibly support
+groups of files, which may be represented in fatcat as `fileset` entities. The
+main new ingest type is `dataset`.
+
+Compared to the existing ingest process, there are two major complications with
+datasets:
+
+- the ingest process often requires more than parsing HTML files, and will be
+ specific to individual platforms and host software packages
+- the storage backend and fatcat entity type is flexible: a dataset might be
+ represented by a single file, multiple files combined in to a single .zip
+  represented by a single file, multiple files combined into a single .zip
+ an archive.org item
+
+The new concepts of "strategy" and "platform" are introduced to accommodate
+these complications.
+
+
+## Ingest Strategies
+
+The ingest strategy describes the fatcat entity type that will be output; the
+storage backend used; and whether an enclosing file format is used. The
+strategy to use can not be determined until the number and size of files is
+known. It is a function of file count, total file size, and publication
+platform.
+
+Strategy names are compact strings with the format
+`{storage_backend}-{fatcat_entity}`. A `-bundled` suffix after a `fileset`
+entity type indicates that metadata about multiple files is retained, but that
+in the storage backend only a single enclosing file (eg, `.zip`) will be
+stored.
+
+The supported strategies are:
+
+- `web-file`: single file of any type, stored in wayback, represented as fatcat `file`
+- `web-fileset`: multiple files of any type, stored in wayback, represented as fatcat `fileset`
+- `web-fileset-bundled`: single bundle file, stored in wayback, represented as fatcat `fileset`
+- `archiveorg-file`: single file of any type, stored in archive.org item, represented as fatcat `file`
+- `archiveorg-fileset`: multiple files of any type, stored in archive.org item, represented as fatcat `fileset`
+- `archiveorg-fileset-bundled`: single bundle file, stored in archive.org item, represented as fatcat `fileset`
+
+"Bundle" or "enclosing" files are things like .zip or .tar.gz. Not all .zip
+files are handled as bundles! Only when the transfer from the hosting platform
+is via a "download all as .zip" (or similar) do we consider a zipfile a
+"bundle" and index the interior files as a fileset.
+
+The term "bundle file" is used over "archive file" or "container file" to
+prevent confusion with the other use of those terms in the context of fatcat
+(container entities; archive; Internet Archive as an organization).
+
+The motivation for supporting both `web` and `archiveorg` is that `web` is
+somewhat simpler for small files, but `archiveorg` is better for larger groups
+of files (say more than 20) and larger total size (say more than 1 GByte total,
+or 128 MByte for any one file).
+
+The motivation for supporting "bundled" filesets is that there is only a single
+file to archive.
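+
+A sketch of how strategy selection might encode those thresholds (the numbers
+come from the paragraph above; `select_strategy()` and `has_bundle_url` are
+illustrative, not the `chose_strategy()` helper described below):
+
+    MAX_WEB_FILE_COUNT = 20
+    MAX_WEB_TOTAL_SIZE = 1 * 1024 * 1024 * 1024   # 1 GByte
+    MAX_WEB_SINGLE_SIZE = 128 * 1024 * 1024       # 128 MByte
+
+    def select_strategy(manifest, has_bundle_url: bool) -> str:
+        total_size = sum(f.size or 0 for f in manifest)
+        largest = max((f.size or 0 for f in manifest), default=0)
+        small_enough = (
+            len(manifest) <= MAX_WEB_FILE_COUNT
+            and total_size <= MAX_WEB_TOTAL_SIZE
+            and largest <= MAX_WEB_SINGLE_SIZE
+        )
+        if len(manifest) == 1:
+            return "web-file" if small_enough else "archiveorg-file"
+        if small_enough:
+            return "web-fileset-bundled" if has_bundle_url else "web-fileset"
+        return "archiveorg-fileset-bundled" if has_bundle_url else "archiveorg-fileset"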
+
+
+## Ingest Pseudocode
+
+1. Determine `platform`, which may involve resolving redirects and crawling a landing page.
+
+ a. currently we always crawl the ingest `base_url`, capturing a platform landing page
+ b. we don't currently handle the case of `base_url` leading to a non-HTML
+ terminal resource. the `component` ingest type does handle this
+
+2. Use platform-specific methods to fetch manifest metadata and decide on an `ingest_strategy`.
+
+ a. depending on platform, may include access URLs for multiple strategies
+ (eg, URL for each file and a bundle URL), metadata about the item for, eg,
+ archive.org item upload, etc
+
+3. Use strategy-specific methods to archive all files in platform manifest, and verify manifest metadata.
+
+4. Summarize status and return structured result metadata.
+
+ a. if the strategy was `web-file` or `archiveorg-file`, potentially submit an
+ `ingest_file_result` object down the file ingest pipeline (Kafka topic and
+ later persist and fatcat import workers), with `dataset-file` ingest
+ type (or `{ingest_type}-file` more generally).
+
+New python types:
+
+ FilesetManifestFile
+ path: str
+ size: Optional[int]
+ md5: Optional[str]
+ sha1: Optional[str]
+ sha256: Optional[str]
+ mimetype: Optional[str]
+ extra: Optional[Dict[str, Any]]
+
+ status: Optional[str]
+ platform_url: Optional[str]
+ terminal_url: Optional[str]
+ terminal_dt: Optional[str]
+
+ FilesetPlatformItem
+ platform_name: str
+ platform_status: str
+ platform_domain: Optional[str]
+ platform_id: Optional[str]
+ manifest: Optional[List[FilesetManifestFile]]
+ archiveorg_item_name: Optional[str]
+ archiveorg_item_meta
+ web_base_url
+ web_bundle_url
+
+ ArchiveStrategyResult
+ ingest_strategy: str
+ status: str
+ manifest: List[FilesetManifestFile]
+ file_file_meta: Optional[dict]
+ file_terminal: Optional[dict]
+ file_cdx: Optional[dict]
+ bundle_file_meta: Optional[dict]
+ bundle_terminal: Optional[dict]
+ bundle_cdx: Optional[dict]
+ bundle_archiveorg_path: Optional[dict]
+
+New python APIs/classes:
+
+ FilesetPlatformHelper
+ match_request(request, resource, html_biblio) -> bool
+ does the request and landing page metadata indicate a match for this platform?
+ process_request(request, resource, html_biblio) -> FilesetPlatformItem
+ do API requests, parsing, etc to fetch metadata and access URLs for this fileset/dataset. platform-specific
+ chose_strategy(item: FilesetPlatformItem) -> IngestStrategy
+ select an archive strategy for the given fileset/dataset
+
+ FilesetIngestStrategy
+ check_existing(item: FilesetPlatformItem) -> Optional[ArchiveStrategyResult]
+ check the given backend for an existing capture/archive; if found, return result
+ process(item: FilesetPlatformItem) -> ArchiveStrategyResult
+ perform an actual archival capture
+
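+A skeleton showing how a platform helper might implement that interface,
+assuming the classes sketched above exist (the helper name and URL pattern
+are illustrative; real helpers are platform-specific and more involved):
+
+    class DataverseHelper(FilesetPlatformHelper):
+        """Example helper for a hypothetical Dataverse-style repository."""
+
+        def match_request(self, request, resource, html_biblio) -> bool:
+            # illustrative check, assuming a dict-like request
+            base_url = request.get("base_url") or ""
+            return "/dataset.xhtml?persistentId=" in base_url
+
+        def process_request(self, request, resource, html_biblio) -> FilesetPlatformItem:
+            # call the platform API, build a manifest of FilesetManifestFile objects
+            ...
+
+        def chose_strategy(self, item: FilesetPlatformItem) -> IngestStrategy:
+            # apply the file-count / size thresholds described above
+            ...
+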
+## Limits and Failure Modes
+
+- `too-large-size`: total size of the fileset is too large for archiving.
+ initial limit is 64 GBytes, controlled by `max_total_size` parameter.
+- `too-many-files`: number of files (and thus file-level metadata) is too
+ large. initial limit is 200, controlled by `max_file_count` parameter.
+- `platform-scope / FilesetPlatformScopeError`: for when `base_url` leads to a
+ valid platform, which could be found via API or parsing, but has the wrong
+ scope. Eg, tried to fetch a dataset, but got a DOI which represents all
+ versions of the dataset, not a specific version.
+- `platform-restricted`/`PlatformRestrictedError`: for, eg, embargoes
+- `platform-404`: got to a landing page that seemed in-scope, but no
+  platform record was found anyway
+
+
+## New Sandcrawler Code and Worker
+
+ sandcrawler-ingest-fileset-worker@{1..6} (or up to 1..12 later)
+
+Worker consumes from ingest request topic, produces to fileset ingest results,
+and optionally produces to file ingest results.
+
+ sandcrawler-persist-ingest-fileset-worker@1
+
+Simply writes fileset ingest rows to SQL.
+
+
+## New Fatcat Worker and Code Changes
+
+ fatcat-import-ingest-fileset-worker
+
+This importer is modeled on file and web worker. Filters for `success` with
+strategy of `*-fileset*`.
+
+Existing `fatcat-import-ingest-file-worker` should be updated to allow
+`dataset` single-file imports, with largely same behavior and semantics as
+current importer (`component` mode).
+
+Existing fatcat transforms, and possibly even elasticsearch schemas, should be
+updated to include fileset status and `in_ia` flag for dataset type releases.
+
+Existing entity updates worker submits `dataset` type ingests to ingest request
+topic.
+
+
+## Ingest Result Schema
+
+Common with file results, and mostly relating to landing page HTML:
+
+ hit: bool
+ status: str
+ success
+ success-existing
+ success-file (for `web-file` or `archiveorg-file` only)
+ request: object
+ terminal: object
+ file_meta: object
+ cdx: object
+ revisit_cdx: object
+ html_biblio: object
+
+Additional fileset-specific fields:
+
+ manifest: list of objects
+ platform_name: str
+ platform_domain: str
+ platform_id: str
+ platform_base_url: str
+ ingest_strategy: str
+ archiveorg_item_name: str (optional, only for `archiveorg-*` strategies)
+ file_count: int
+ total_size: int
+ fileset_bundle (optional, only for `*-fileset-bundle` strategy)
+ file_meta
+ cdx
+ revisit_cdx
+ terminal
+ archiveorg_bundle_path
+ fileset_file (optional, only for `*-file` strategy)
+ file_meta
+ terminal
+ cdx
+ revisit_cdx
+
+If the strategy was `web-file` or `archiveorg-file` and the status is
+`success-file`, then an ingest file result will also be published to
+`sandcrawler-ENV.ingest-file-results`, using the same ingest type and fields as
+regular ingest.
+
+
+All fileset ingest results get published to ingest-fileset-result.
+
+Existing sandcrawler persist workers also subscribe to this topic and persist
+status and landing page terminal info to tables just like with file ingest.
+GROBID, HTML, and other metadata is not persisted in this path.
+
+If the ingest strategy was a single file (`*-file`), then an ingest file is
+also published to the ingest-file-result topic, with the `fileset_file`
+metadata, and ingest type `dataset-file`. This should only happen on success
+condition.
+
+
+## New SQL Tables
+
+Note that this table *complements* `ingest_file_result`, doesn't replace it.
+`ingest_file_result` could more accurately be called `ingest_result`.
+
+ CREATE TABLE IF NOT EXISTS ingest_fileset_platform (
+ ingest_type TEXT NOT NULL CHECK (octet_length(ingest_type) >= 1),
+ base_url TEXT NOT NULL CHECK (octet_length(base_url) >= 1),
+ updated TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL,
+ hit BOOLEAN NOT NULL,
+ status TEXT CHECK (octet_length(status) >= 1),
+
+ platform_name TEXT NOT NULL CHECK (octet_length(platform_name) >= 1),
+ platform_domain TEXT NOT NULL CHECK (octet_length(platform_domain) >= 1),
+ platform_id TEXT NOT NULL CHECK (octet_length(platform_id) >= 1),
+ ingest_strategy TEXT CHECK (octet_length(ingest_strategy) >= 1),
+ total_size BIGINT,
+ file_count BIGINT,
+ archiveorg_item_name TEXT CHECK (octet_length(archiveorg_item_name) >= 1),
+
+ archiveorg_item_bundle_path TEXT CHECK (octet_length(archiveorg_item_bundle_path) >= 1),
+ web_bundle_url TEXT CHECK (octet_length(web_bundle_url) >= 1),
+ web_bundle_dt TEXT CHECK (octet_length(web_bundle_dt) = 14),
+
+ manifest JSONB,
+ -- list, similar to fatcat fileset manifest, plus extra:
+ -- status (str)
+ -- path (str)
+ -- size (int)
+ -- md5 (str)
+ -- sha1 (str)
+ -- sha256 (str)
+ -- mimetype (str)
+ -- extra (dict)
+ -- platform_url (str)
+ -- terminal_url (str)
+ -- terminal_dt (str)
+
+ PRIMARY KEY (ingest_type, base_url)
+ );
+ CREATE INDEX ingest_fileset_platform_name_domain_id_idx ON ingest_fileset_platform(platform_name, platform_domain, platform_id);
+
+Persist worker should only insert into this table if `platform_name` is
+identified.
+
+## New Kafka Topic
+
+ sandcrawler-ENV.ingest-fileset-results 6x, no retention limit
+
+
+## Implementation Plan
+
+First implement ingest worker, including platform and strategy helpers, and
+test those as simple stdin/stdout CLI tools in sandcrawler repo to validate
+this proposal.
+
+Second implement fatcat importer and test locally and/or in QA.
+
+Lastly implement infrastructure, automation, and other "glue":
+
+- SQL schema
+- persist worker
+
+
+## Design Note: Single-File Datasets
+
+Should datasets and other groups of files which only contain a single file get
+imported as a fatcat `file` or `fileset`? This can be broken down further as
+documents (single PDF) vs other individual files.
+
+Advantages of `file`:
+
+- handles case of article PDFs being marked as dataset accidentally
+- `file` entities get de-duplicated with simple lookup (eg, on `sha1`)
+- conceptually simpler if individual files are `file` entity
+- easier to download individual files
+
+Advantages of `fileset`:
+
+- conceptually simpler if all `dataset` entities have `fileset` form factor
+- code path is simpler: one fewer strategy, and less complexity of sending
+ files down separate import path
+- metadata about platform is retained
+- would require no modification of existing fatcat file importer
+- fatcat import of archive.org-hosted `file` entities is not actually implemented yet?
+
+Decision is to do individual files. Fatcat fileset import worker should reject
+single-file (and empty) manifest filesets. Fatcat file import worker should
+accept all mimetypes for `dataset-file` (similar to `component`).
+
+
+## Example Entities
+
+See `notes/dataset_examples.txt`
diff --git a/proposals/2021-09-13_src_ingest.md b/proposals/2021-09-13_src_ingest.md
new file mode 100644
index 0000000..470827a
--- /dev/null
+++ b/proposals/2021-09-13_src_ingest.md
@@ -0,0 +1,53 @@
+
+File Ingest Mode: 'src'
+=======================
+
+Ingest type for "source" of works in document form. For example, tarballs of
+LaTeX source and figures, as published on arxiv.org and Pubmed Central.
+
+For now, presumption is that this would be a single file (`file` entity in
+fatcat).
+
+Initial mimetypes to allow:
+
+- text/x-tex
+- application/xml
+- application/gzip
+- application/x-bzip
+- application/x-bzip2
+- application/zip
+- application/x-tar
+- application/msword
+- application/vnd.openxmlformats-officedocument.wordprocessingml.document
+
+
+## Fatcat Changes
+
+In the file importer, allow the additional mimetypes for 'src' ingest.
+
+Might keep ingest disabled on the fatcat side, at least initially. Eg, until
+there is some sort of "file scope", or other ways of treating 'src' tarballs
+separate from PDFs or other fulltext formats.
+
+
+## Ingest Changes
+
+Allow additional terminal mimetypes for 'src' crawls.
+
+
+## Examples
+
+ arxiv:2109.00954v1
+ fatcat:release_akzp2lgqjbcbhpoeoitsj5k5hy
+ https://arxiv.org/format/2109.00954v1
+ https://arxiv.org/e-print/2109.00954v1
+
+ arxiv:1912.03397v2
+ https://arxiv.org/format/1912.03397v2
+ https://arxiv.org/e-print/1912.03397v2
+ NOT: https://arxiv.org/pdf/1912.03397v2
+
+ pmcid:PMC3767916
+ https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/08/03/PMC3767916.tar.gz
+
+For PMC, will need to use one of the .csv file lists to get the digit prefixes.
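+
+A rough sketch of resolving a PMCID to its tarball URL from one of those lists
+(the column names are assumptions about the `oa_file_list.csv` format and
+should be checked against the actual header):
+
+    import csv
+
+    def load_pmc_paths(csv_path: str) -> dict:
+        """Map PMCID -> relative oa_package path, eg 'oa_package/08/03/PMC3767916.tar.gz'."""
+        paths = {}
+        with open(csv_path, newline="") as f:
+            for row in csv.DictReader(f):
+                # assumed column names; verify against the real file
+                paths[row["AccessionID"]] = row["File"]
+        return paths
+
+    paths = load_pmc_paths("oa_file_list.csv")
+    url = "https://ftp.ncbi.nlm.nih.gov/pub/pmc/" + paths["PMC3767916"]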
diff --git a/proposals/2021-09-21_spn_accounts.md b/proposals/2021-09-21_spn_accounts.md
new file mode 100644
index 0000000..e41c162
--- /dev/null
+++ b/proposals/2021-09-21_spn_accounts.md
@@ -0,0 +1,14 @@
+
+Formalization of SPNv2 API requests from fatcat/sandcrawler
+
+Create two new system accounts, one for regular/daily ingest requests, one for
+priority requests (save-paper-now or as a flag with things like fatcat-ingest;
+"interactive"). These accounts should have @archive.org emails. Request the
+daily one to have the current rate limit as bnewbold@archive.org account; the
+priority queue can have less.
+
+Create new ingest kafka queues from scratch, one for priority and one for
+regular. Choose sizes carefully, probably keep 24x for the regular and do 6x or
+so (small) for priority queue.
+
+Deploy new priority workers; reconfigure/deploy broadly.
diff --git a/proposals/2021-10-28_grobid_refs.md b/proposals/2021-10-28_grobid_refs.md
new file mode 100644
index 0000000..1fc79b6
--- /dev/null
+++ b/proposals/2021-10-28_grobid_refs.md
@@ -0,0 +1,125 @@
+
+GROBID References in Sandcrawler DB
+===================================
+
+Want to start processing "unstructured" raw references coming from upstream
+metadata sources (distinct from upstream fulltext sources, like PDFs or JATS
+XML), and save the results in sandcrawler DB. From there, they will get pulled
+in to fatcat-scholar "intermediate bundles" and included in reference exports.
+
+The initial use case for this is to parse "unstructured" references deposited
+in Crossref, and include them in refcat.
+
+
+## Schema and Semantics
+
+The output JSON/dict schema for parsed references follows that of
+`grobid_tei_xml` version 0.1.x, for the `GrobidBiblio` field. The
+`unstructured` field that was parsed is included in the output, though it may
+not be byte-for-byte exact (see below). One notable change from the past (eg,
+older GROBID-parsed references) is that author `name` is now `full_name`. New
+fields include `editors` (same schema as `authors`), `book_title`, and
+`series_title`.
+
+The overall output schema matches that of the `grobid_refs` SQL table:
+
+ source: string, lower-case. eg 'crossref'
+ source_id: string, eg '10.1145/3366650.3366668'
+    source_ts: optional timestamp (full ISO datetime with timezone, eg a `Z`
+        suffix), which identifies the version of upstream metadata
+ refs_json: JSON, list of `GrobidBiblio` JSON objects
+
+References are re-processed on a per-article (or per-release) basis. All the
+references for an article are handled as a batch and output as a batch. If
+there are no upstream references, a row with `refs_json` as an empty list may be
+returned.
+
+Not all upstream references get re-parsed, even if an 'unstructured' field is
+available. If 'unstructured' is not available, no row is ever output. For
+example, if a reference includes `unstructured` (raw citation string), but also
+has structured metadata for authors, title, year, and journal name, we might
+not re-parse the `unstructured` string. Whether to re-parse is evaluated on a
+per-reference basis. This behavior may change over time.
+
+`unstructured` strings may be pre-processed before being submitted to GROBID.
+This is because many sources have systemic encoding issues. GROBID itself may
+also do some modification of the input citation string before returning it in
+the output. This means the `unstructured` string is not a reliable way to map
+between specific upstream references and parsed references. Instead, the `id`
+field (str) of `GrobidBiblio` gets set to any upstream "key" or "index"
+identifier used to track individual references. If there is only a numeric
+index, the `id` is that number as a string.
+
+The `key` or `id` may need to be woven back into the ref objects manually,
+because GROBID `processCitationList` takes just a list of raw strings, with no
+attached reference-level key or id.
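+
+A minimal sketch of that re-weaving step (`parse_citation_list()` is a
+hypothetical stand-in for the grobid_tei_xml call; not the actual worker code):
+
+    def parse_refs_with_ids(refs: list) -> list:
+        """refs: list of dicts with a 'key' (or index) and an 'unstructured' string."""
+        unstructured = [r["unstructured"] for r in refs]
+        parsed = parse_citation_list(unstructured)  # returns GrobidBiblio objects, in order
+        for i, (ref, biblio) in enumerate(zip(refs, parsed)):
+            # carry the upstream key through; fall back to the numeric index as a string
+            biblio.id = str(ref.get("key") or i)
+        return parsed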
+
+
+## New SQL Table and View
+
+We may want to do re-parsing of references from sources other than `crossref`,
+so there is a generic `grobid_refs` table. But it is also common to fetch both
+the crossref metadata and any re-parsed references together, so as a convenience
+there is a PostgreSQL view (virtual table) that includes both a crossref
+metadata record and parsed citations, if available. If downstream code cares a
+lot about having the refs and record be in sync, the `source_ts` field on
+`grobid_refs` can be matched against the `indexed` column of `crossref` (or the
+`.indexed.date-time` JSON field in the record itself).
+
+Remember that DOIs should always be lower-cased before querying, inserting,
+comparing, etc.
+
+ CREATE TABLE IF NOT EXISTS grobid_refs (
+ source TEXT NOT NULL CHECK (octet_length(source) >= 1),
+ source_id TEXT NOT NULL CHECK (octet_length(source_id) >= 1),
+ source_ts TIMESTAMP WITH TIME ZONE,
+ updated TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL,
+ refs_json JSON NOT NULL,
+ PRIMARY KEY(source, source_id)
+ );
+
+ CREATE OR REPLACE VIEW crossref_with_refs (doi, indexed, record, source_ts, refs_json) AS
+ SELECT
+ crossref.doi as doi,
+ crossref.indexed as indexed,
+ crossref.record as record,
+ grobid_refs.source_ts as source_ts,
+ grobid_refs.refs_json as refs_json
+ FROM crossref
+ LEFT JOIN grobid_refs ON
+ grobid_refs.source_id = crossref.doi
+ AND grobid_refs.source = 'crossref';
+
+Both `grobid_refs` and `crossref_with_refs` will be exposed through postgrest.
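+
+For example, fetching a Crossref record together with any re-parsed references
+via the view (hostname and port are assumptions; remember the lower-cased DOI):
+
+    import requests
+
+    resp = requests.get(
+        "http://sandcrawler-db.example.org:3030/crossref_with_refs",
+        params={"doi": "eq.10.1145/3366650.3366668"},
+    )
+    resp.raise_for_status()
+    for row in resp.json():
+        print(row["indexed"], row["refs_json"] is not None)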
+
+
+## New Workers / Tools
+
+For simplicity, to start, a single worker will consume from
+`fatcat-prod.api-crossref`, process citations with GROBID (if necessary), and
+insert into both the `crossref` and `grobid_refs` tables. This worker will run
+locally on the machine with sandcrawler-db.
+
+Another tool will support taking large chunks of Crossref JSON (as lines),
+filtering them, processing with GROBID, and printing JSON to stdout in the
+`grobid_refs` JSON schema.
+
+
+## Task Examples
+
+Command to process crossref records with refs tool:
+
+ cat crossref_sample.json \
+ | parallel -j5 --linebuffer --round-robin --pipe ./grobid_tool.py parse-crossref-refs - \
+ | pv -l \
+ > crossref_sample.parsed.json
+
+ # => 10.0k 0:00:27 [ 368 /s]
+
+Load directly in to postgres (after tables have been created):
+
+ cat crossref_sample.parsed.json \
+ | jq -rc '[.source, .source_id, .source_ts, (.refs_json | tostring)] | @tsv' \
+ | psql sandcrawler -c "COPY grobid_refs (source, source_id, source_ts, refs_json) FROM STDIN (DELIMITER E'\t');"
+
+ # => COPY 9999
diff --git a/proposals/2021-12-09_trawling.md b/proposals/2021-12-09_trawling.md
new file mode 100644
index 0000000..33b6b4c
--- /dev/null
+++ b/proposals/2021-12-09_trawling.md
@@ -0,0 +1,180 @@
+
+status: work-in-progress
+
+NOTE: as of December 2022, the implementation of these features hasn't been
+merged to the main branch. Development stalled in December 2021.
+
+Trawling for Unstructured Scholarly Web Content
+===============================================
+
+## Background and Motivation
+
+A long-term goal for sandcrawler has been the ability to pick through
+unstructured web archive content (or even non-web collections), identify
+potential in-scope research outputs, extract metadata for those outputs, and
+merge the content into a catalog (fatcat).
+
+This process requires integration of many existing tools (HTML and PDF
+extraction; fuzzy bibliographic metadata matching; machine learning to identify
+in-scope content; etc), as well as high-level curation, targeting, and
+evaluation by human operators. The goal is to augment and improve the
+productivity of human operators as much as possible.
+
+This process will be similar to "ingest", which is where we start with a
+specific URL and have some additional context about the expected result (eg,
+content type, external identifier). Some differences with trawling are that we
+start with a collection or context (instead of a single URL); have little or
+no context about the content we are looking for; and may even be creating a new
+catalog entry, as opposed to matching to a known existing entry.
+
+
+## Architecture
+
+The core operation is to take a resource and run a flowchart of processing
+steps on it, resulting in an overall status and possible related metadata. The
+common case is that the resource is a PDF or HTML coming from wayback (with
+contextual metadata about the capture), but we should be flexible to supporting
+more content types in the future, and should try to support plain files with no
+context as well.
+
+Some relatively simple wrapper code handles fetching resources and summarizing
+status/counts.
+
+Outside of the scope of sandcrawler, new fatcat code (importer or similar) will
+be needed to handle trawl results. It will probably make sense to pre-filter
+(with `jq` or `rg`) before passing results to fatcat.
+
+At this stage, trawl workers will probably be run manually. Some successful
+outputs (like GROBID, HTML metadata) would be written to existing kafka topics
+to be persisted, but there would not be any specific `trawl` SQL tables or
+automation.
+
+It will probably be helpful to have some kind of wrapper script that can run
+sandcrawler trawl processes, then filter and pipe the output into fatcat
+importer, all from a single invocation, while reporting results.
+
+TODO:
+- for HTML imports, do we fetch the full webcapture stuff and return that?
+
+
+## Methods of Operation
+
+### `cdx_file`
+
+An existing CDX file is provided on-disk locally.
+
+### `cdx_api`
+
+Simplified variants: `cdx_domain`, `cdx_surt`
+
+Uses CDX API to download records matching the configured filters, then processes the file.
+
+Saves the CDX file intermediate result somewhere locally (working or tmp
+directory), with timestamp in the path, to make re-trying with `cdx_file` fast
+and easy.
+
+
+### `archiveorg_web_collection`
+
+Uses `cdx_collection.py` (or similar) to fetch a full CDX list by iterating
+over the collection, then processes it.
+
+Saves the CDX file intermediate result somewhere locally (working or tmp
+directory), with timestamp in the path, to make re-trying with `cdx_file` fast
+and easy.
+
+### Others
+
+- `archiveorg_file_collection`: fetch the file list via archive.org metadata,
+  then process each file
+
+## Schema
+
+Per-resource results:
+
+ hit (bool)
+ indicates whether resource seems in scope and was processed successfully
+ (roughly, status 'success', and
+ status (str)
+ success: fetched resource, ran processing, pa
+ skip-cdx: filtered before even fetching resource
+ skip-resource: filtered after fetching resource
+ wayback-error (etc): problem fetching
+ content_scope (str)
+ filtered-{filtertype}
+ article (etc)
+ landing-page
+ resource_type (str)
+ pdf, html
+ file_meta{}
+ cdx{}
+ revisit_cdx{}
+
+ # below are resource_type specific
+ grobid
+ pdf_meta
+ pdf_trio
+ html_biblio
+ (other heuristics and ML)
+
+High-level request:
+
+ trawl_method: str
+ cdx_file_path
+ default_filters: bool
+ resource_filters[]
+ scope: str
+ surt_prefix, domain, host, mimetype, size, datetime, resource_type, http_status
+ value: any
+ values[]: any
+ min: any
+ max: any
+ biblio_context{}: set of expected/default values
+ container_id
+ release_type
+ release_stage
+ url_rel
+
+High-level summary / results:
+
+ status
+ request{}: the entire request object
+ counts
+ total_resources
+ status{}
+ content_scope{}
+ resource_type{}
+
+## Example Corpuses
+
+All PDFs (`application/pdf`) in web.archive.org from before the year 2000.
+Starting point would be a CDX list.
+
+Spidering crawls starting from a set of OA journal homepage URLs.
+
+Archive-It partner collections from research universities, particularly of
+their own .edu domains. Starting point would be an archive.org collection, from
+which WARC files or CDX lists can be accessed.
+
+General archive.org PDF collections, such as
+[ERIC](https://archive.org/details/ericarchive) or
+[Document Cloud](https://archive.org/details/documentcloud).
+
+Specific Journal or Publisher URL patterns. Starting point could be a domain,
+hostname, SURT prefix, and/or URL regex.
+
+Heuristic patterns over full web.archive.org CDX index. For example, .edu
+domains with user directories and a `.pdf` in the file path ("tilde" username
+pattern).
+
+Random samples of entire Wayback corpus. For example, random samples filtered
+by date, content type, TLD, etc. This would be true "trawling" over the entire
+corpus.
+
+
+## Other Ideas
+
+Could have a web archive spidering mode: starting from a seed, fetch multiple
+captures (different captures), then extract outlinks from those, up to some
+number of hops. An example application would be links to research group
+webpages or author homepages, and to try to extract PDF links from CVs, etc.
+
diff --git a/proposals/brainstorm/2021-debug_web_interface.md b/proposals/brainstorm/2021-debug_web_interface.md
new file mode 100644
index 0000000..442b439
--- /dev/null
+++ b/proposals/brainstorm/2021-debug_web_interface.md
@@ -0,0 +1,9 @@
+
+status: brainstorm idea
+
+Simple internal-only web interface to help debug ingest issues.
+
+- paste a hash, URL, or identifier and get a display of "everything we know" about it
+- enter a URL/SURT prefix and get aggregate stats (?)
+- enter a domain/host/prefix and get recent attempts/results
+- pre-computed periodic reports on ingest pipeline (?)
diff --git a/proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md b/proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md
new file mode 100644
index 0000000..b3ad447
--- /dev/null
+++ b/proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md
@@ -0,0 +1,36 @@
+
+status: brainstorming
+
+We continue to see issues with SPNv2-based crawling. Would like to have an
+option to switch to higher-throughput heritrix3-based crawling.
+
+SPNv2 path would stick around at least for save-paper-now style ingest.
+
+
+## Sketch
+
+Ingest requests are created continuously by fatcat, with daily spikes.
+
+Ingest workers run mostly in "bulk" mode, aka they don't make SPNv2 calls.
+`no-capture` responses are recorded in sandcrawler SQL database.
+
+Periodically (daily?), a script queries for new no-capture results, filtered to
+the most recent period. These are processed a bit into a URL list, then
+converted to a heritrix frontier, and sent to crawlers. This could either be an
+h3 instance (?), or simple `scp` to a running crawl directory.
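+
+A sketch of what that periodic query script might do (this assumes the
+`terminal_url` convention from the `20201012_no_capture` proposal and the
+existing `ingest_file_result` table; the exact filtering would need tuning):
+
+    import psycopg2
+
+    db = psycopg2.connect("dbname=sandcrawler")
+    with db.cursor() as cur:
+        cur.execute("""
+            SELECT DISTINCT terminal_url
+            FROM ingest_file_result
+            WHERE status = 'no-capture'
+              AND updated > now() - interval '1 day'
+        """)
+        # dump a plain URL list, to be turned into a Heritrix frontier
+        with open("no_capture_seeds.txt", "w") as f:
+            for (url,) in cur:
+                f.write(url + "\n")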
+
+The crawler crawls, with usual landing page config, and draintasker runs.
+
+TODO: can we have draintasker/heritrix set a maximum WARC life? Like 6 hours?
+or, target a smaller draintasker item size, so they get updated more frequently
+
+Another SQL script dumps ingest requests from the *previous* period, and
+re-submits them for bulk-style ingest (by workers).
+
+The end result would be things getting crawled and updated within a couple
+days.
+
+
+## Sketch 2
+
+Upload URL list to petabox item, wait for heritrix derive to run (!)