Diffstat (limited to 'proposals')
-rw-r--r--  proposals/2018_original_sandcrawler_rfc.md | 180
-rw-r--r--  proposals/2019_ingest.md | 6
-rw-r--r--  proposals/20200129_pdf_ingest.md | 10
-rw-r--r--  proposals/20200207_pdftrio.md | 5
-rw-r--r--  proposals/20200211_nsq.md | 79
-rw-r--r--  proposals/20201012_no_capture.md | 39
-rw-r--r--  proposals/20201026_html_ingest.md | 129
-rw-r--r--  proposals/20201103_xml_ingest.md | 64
-rw-r--r--  proposals/2020_pdf_meta_thumbnails.md | 328
-rw-r--r--  proposals/2020_seaweed_s3.md | 426
-rw-r--r--  proposals/2021-04-22_crossref_db.md | 86
-rw-r--r--  proposals/2021-09-09_component_ingest.md | 114
-rw-r--r--  proposals/2021-09-09_fileset_ingest.md | 343
-rw-r--r--  proposals/2021-09-13_src_ingest.md | 53
-rw-r--r--  proposals/2021-09-21_spn_accounts.md | 14
-rw-r--r--  proposals/2021-10-28_grobid_refs.md | 125
-rw-r--r--  proposals/2021-12-09_trawling.md | 180
-rw-r--r--  proposals/brainstorm/2021-debug_web_interface.md | 9
-rw-r--r--  proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md | 36
19 files changed, 2217 insertions, 9 deletions
diff --git a/proposals/2018_original_sandcrawler_rfc.md b/proposals/2018_original_sandcrawler_rfc.md
new file mode 100644
index 0000000..ecf7ab8
--- /dev/null
+++ b/proposals/2018_original_sandcrawler_rfc.md
@@ -0,0 +1,180 @@
+
+**Title:** Journal Archiving Pipeline
+
+**Author:** Bryan Newbold <bnewbold@archive.org>
+
+**Date:** March 2018
+
+**Status:** work-in-progress
+
+This is an RFC-style technical proposal for a journal crawling, archiving,
+extracting, resolving, and cataloging pipeline.
+
+Design work funded by a Mellon Foundation grant in 2018.
+
+## Overview
+
+Let's start with data stores first:
+
+- crawled original fulltext (PDF, JATS, HTML) ends up in petabox/global-wayback
+- file-level extracted fulltext and metadata is stored in HBase, with the hash
+ of the original file as the key
+- cleaned metadata is stored in a "catalog" relational (SQL) database (probably
+ PostgreSQL or some hip scalable NewSQL thing compatible with Postgres or
+ MariaDB)
+
+**Resources:** back-of-the-envelope, around 100 TB petabox storage total (for
+100 million PDF files); 10-20 TB HBase table total. Can start small.
+
+
+All "system" (aka, pipeline) state (eg, "what work has been done") is ephemeral
+and is rederived relatively easily (but might be cached for performance).
+
+The overall "top-down", metadata-driven cycle is:
+
+1. Partners and public sources provide metadata (for catalog) and seed lists
+ (for crawlers)
+2. Crawlers pull in fulltext and HTTP/HTML metadata from the public web
+3. Extractors parse raw fulltext files (PDFs) and store structured metadata (in
+ HBase)
+4. Data Mungers match extracted metadata (from HBase) against the catalog, or
+ create new records if none found.
+
+In the "bottom up" cycle, batch jobs run as map/reduce jobs against the
+catalog, HBase, global wayback, and partner metadata datasets to identify
+potential new public or already-archived content to process, and pushes tasks
+to the crawlers, extractors, and mungers.
+
+## Partner Metadata
+
+Periodic Luigi scripts run on a regular VM to pull in metadata from partners.
+All metadata is saved to either petabox (for public stuff) or HDFS (for
+restricted). Scripts process/munge the data and push directly to the catalog
+(for trusted/authoritative sources like Crossref, ISSN, PubMed, DOAJ); others
+extract seedlists and push to the crawlers.
+
+**Resources:** 1 VM (could be a devbox), with a large attached disk (spinning
+probably ok)
+
+## Crawling
+
+All fulltext content comes in from the public web via crawling, and all crawled
+content ends up in global wayback.
+
+One or more VMs serve as perpetual crawlers, with multiple active ("perpetual")
+Heritrix crawls operating with differing configuration. These could be
+orchestrated (like h3), or just have the crawl jobs cut off and restarted every
+year or so.
+
+In a starter configuration, there would be two crawl queues. One would target
+direct PDF links, landing pages, author homepages, DOI redirects, etc. It would
+process HTML and look for PDF outlinks, but wouldn't crawl recursively.
+
+HBase is used for de-dupe, with records (pointers) stored in WARCs.
+
+A second config would take seeds as entire journal websites, and would crawl
+continuously.
+
+Other components of the system "push" tasks to the crawlers by copying schedule
+files into the crawl action directories.
+
+WARCs would be uploaded into petabox via draintasker as usual, and CDX
+derivation would be left to the derive process. Other processes are notified of
+"new crawl content" being available when they see new unprocessed CDX files in
+items from specific collections. draintasker could be configured to "cut" new
+items every 24 hours at most to ensure this pipeline moves along regularly, or
+we could come up with other hacks to get lower "latency" at this stage.
+
+**Resources:** 1-2 crawler VMs, each with a large attached disk (spinning)
+
+### De-Dupe Efficiency
+
+We would certainly feed CDX info from all bulk journal crawling into HBase
+before any additional large crawling, to get that level of de-dupe.
+
+Whether all GWB PDFs should be de-duped against is a policy question: is
+there something special about the journal-specific crawls that makes it worth
+having second copies? Eg, if we had previously domain crawled and access is
+restricted, we then wouldn't be allowed to provide researcher access to those
+files... on the other hand, we could extract for researchers given that we
+"refound" the content at a new URL?
+
+Only fulltext files (PDFs) would be de-duped against (by content), so we'd be
+recrawling lots of HTML. Presumably this is a fraction of crawl data size; what
+fraction?
+
+Watermarked files would be refreshed repeatedly from the same PDF, and even
+extracted/processed repeatedly (because the hash would be different). This is
+hard to de-dupe/skip, because we would want to catch "content drift" (changes
+in files).
+
+## Extractors
+
+Off-the-shelf PDF extraction software runs on high-CPU VM nodes (probably
+GROBID running on 1-2 data nodes, which have 30+ CPU cores and plenty of RAM
+and network throughput).
+
+A hadoop streaming job (written in python) takes a CDX file as task input. It
+filters for only PDFs, and then checks each line against HBase to see if it has
+already been extracted. If it hasn't, the script downloads directly from
+petabox using the full CDX info (bypassing wayback, which would be a
+bottleneck). It optionally runs any "quick check" scripts to see if the PDF
+should be skipped ("definitely not a scholarly work"), then if it looks OK,
+submits the file over HTTP to the GROBID worker pool for extraction. The
+results are pushed to HBase, and a short status line written to Hadoop. The
+overall Hadoop job has a reduce phase that generates a human-meaningful report
+of job status (eg, number of corrupt files) for monitoring.
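+
+A rough sketch of such a streaming mapper (the HBase lookup and petabox fetch
+helpers here are hypothetical; the GROBID HTTP endpoint is the standard
+`processFulltextDocument` API):
+
+    import sys
+    import requests
+
+    GROBID_URL = "http://localhost:8070"  # assumption: address of the GROBID worker pool
+
+    def process_cdx_line(line):
+        # 11-column CDX: urlkey, timestamp, url, mimetype, status, sha1, ..., offset, warc
+        fields = line.split()
+        if len(fields) < 11 or fields[3] != "application/pdf":
+            return "skip-not-pdf"
+        sha1_b32 = fields[5]
+        if already_extracted(sha1_b32):          # hypothetical HBase lookup
+            return "skip-already-extracted"
+        pdf_bytes = fetch_from_petabox(fields)   # hypothetical direct (non-wayback) fetch
+        resp = requests.post(
+            GROBID_URL + "/api/processFulltextDocument",
+            files={"input": pdf_bytes},
+            timeout=180.0,
+        )
+        save_to_hbase(sha1_b32, resp.status_code, resp.text)  # hypothetical
+        return "grobid-" + str(resp.status_code)
+
+    if __name__ == "__main__":
+        for line in sys.stdin:
+            # hadoop streaming: emit one short status line per CDX input line
+            print(process_cdx_line(line.strip()))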
+
+A side job as part of extracting can "score" the extracted metadata to flag
+problems with GROBID, to be used as potential training data for improvement.
+
+**Resources:** 1-2 datanode VMs; hadoop cluster time. Needed up-front for
+backlog processing; less CPU needed over time.
+
+## Matchers
+
+The matcher runs as a "scan" HBase map/reduce job over new (unprocessed) HBase
+rows. It pulls just the basic metadata (title, author, identifiers, abstract)
+and calls the catalog API to identify potential match candidates. If no match
+is found, and the metadata "look good" based on some filters (to remove, eg,
+spam), works are inserted into the catalog (eg, for those works that don't have
+globally available identifiers or other metadata; "long tail" and legacy
+content).
+
+**Resources:** Hadoop cluster time
+
+## Catalog
+
+The catalog is a versioned relational database. All scripts interact with an
+API server (instead of connecting directly to the database). It should be
+reliable and low-latency for simple reads, so it can be relied on to provide a
+public-facing API and have public web interfaces built on top. This is in
+contrast to Hadoop, which for the most part could go down with no public-facing
+impact (other than fulltext API queries). The catalog does not contain
+copyrightable material, but it does contain strong (verified) links to fulltext
+content. Policy gets implemented here if necessary.
+
+A global "changelog" (append-only log) is used in the catalog to record every
+change, allowing for easier replication (internal or external, to partners). As
+little as possible is implemented in the catalog itself; instead helper and
+cleanup bots use the API to propose and verify edits, similar to the wikidata
+and git data models.
+
+Public APIs and any front-end services are built on the catalog. Elasticsearch
+(for metadata or fulltext search) could build on top of the catalog.
+
+**Resources:** Unknown, but estimate 1+ TB of SSD storage each on 2 or more
+database machines
+
+## Machine Learning and "Bottom Up"
+
+TBD.
+
+## Logistics
+
+Ansible is used to deploy all components. Luigi is used as a task scheduler for
+batch jobs, with cron to initiate periodic tasks. Errors and actionable
+problems are aggregated in Sentry.
+
+Logging, metrics, and other debugging and monitoring are TBD.
+
diff --git a/proposals/2019_ingest.md b/proposals/2019_ingest.md
index c649809..768784f 100644
--- a/proposals/2019_ingest.md
+++ b/proposals/2019_ingest.md
@@ -1,5 +1,5 @@
-status: work-in-progress
+status: deployed
This document proposes structure and systems for ingesting (crawling) paper
PDFs and other content as part of sandcrawler.
@@ -84,7 +84,7 @@ HTML? Or both? Let's just recrawl.
*IngestRequest*
- `ingest_type`: required, one of `pdf`, `xml`, `html`, `dataset`. For
backwards compatibility, `file` should be interpreted as `pdf`. `pdf` and
- `xml` return file ingest respose; `html` and `dataset` not implemented but
+ `xml` return file ingest response; `html` and `dataset` not implemented but
would be webcapture (wayback) and fileset (archive.org item or wayback?).
In the future: `epub`, `video`, `git`, etc.
- `base_url`: required, where to start crawl process
@@ -258,7 +258,7 @@ and hacks to crawl publicly available papers. Related existing work includes
[unpaywall's crawler][unpaywall_crawl], LOCKSS extraction code, dissem.in's
efforts, zotero's bibliography extractor, etc. The "memento tracer" work is
also similar. Many of these are even in python! It would be great to reduce
-duplicated work and maintenance. An analagous system in the wild is youtube-dl
+duplicated work and maintenance. An analogous system in the wild is youtube-dl
for downloading video from many sources.
[unpaywall_crawl]: https://github.com/ourresearch/oadoi/blob/master/webpage.py
diff --git a/proposals/20200129_pdf_ingest.md b/proposals/20200129_pdf_ingest.md
index 9469217..157607e 100644
--- a/proposals/20200129_pdf_ingest.md
+++ b/proposals/20200129_pdf_ingest.md
@@ -1,5 +1,5 @@
-status: planned
+status: deployed
2020q1 Fulltext PDF Ingest Plan
===================================
@@ -27,7 +27,7 @@ There are a few million papers in fatacat which:
2. are known OA, usually because publication is Gold OA
3. don't have any fulltext PDF in fatcat
-As a detail, some of these "known OA" journals actually have embargos (aka,
+As a detail, some of these "known OA" journals actually have embargoes (aka,
they aren't true Gold OA). In particular, those marked via EZB OA "color", and
recent pubmed central ids.
@@ -104,7 +104,7 @@ Actions:
update ingest result table with status.
- fetch new MAG and unpaywall seedlists, transform to ingest requests, persist
into ingest request table. use SQL to dump only the *new* URLs (not seen in
- previous dumps) using the created timestamp, outputing new bulk ingest
+ previous dumps) using the created timestamp, outputting new bulk ingest
request lists. if possible, de-dupe between these two. then start bulk
heritrix crawls over these two long lists. Probably sharded over several
machines. Could also run serially (first one, then the other, with
@@ -133,7 +133,7 @@ We have run GROBID+glutton over basically all of these PDFs. We should be able
to do a SQL query to select PDFs that:
- have at least one known CDX row
-- GROBID processed successfuly and glutton matched to a fatcat release
+- GROBID processed successfully and glutton matched to a fatcat release
- do not have an existing fatcat file (based on sha1hex)
- output GROBID metadata, `file_meta`, and one or more CDX rows
@@ -161,7 +161,7 @@ Coding Tasks:
Actions:
- update `fatcat_file` sandcrawler table
-- check how many PDFs this might ammount to. both by uniq SHA1 and uniq
+- check how many PDFs this might amount to. both by uniq SHA1 and uniq
`fatcat_release` matches
- do some manual random QA verification to check that this method results in
quality content in fatcat
diff --git a/proposals/20200207_pdftrio.md b/proposals/20200207_pdftrio.md
index 31a2db6..6f6443f 100644
--- a/proposals/20200207_pdftrio.md
+++ b/proposals/20200207_pdftrio.md
@@ -1,5 +1,8 @@
-status: in progress
+status: deployed
+
+NOTE: while this has been used in production, as of December 2022 the results
+are not used much in practice, and we don't score every PDF that comes along.
PDF Trio (ML Classification)
==============================
diff --git a/proposals/20200211_nsq.md b/proposals/20200211_nsq.md
new file mode 100644
index 0000000..6aa885b
--- /dev/null
+++ b/proposals/20200211_nsq.md
@@ -0,0 +1,79 @@
+
+status: planned
+
+In short, Kafka is not working well as a job task scheduler, and I want to try
+NSQ as a medium-term solution to that problem.
+
+
+## Motivation
+
+Thinking of setting up NSQ to use for scheduling distributed work, to replace
+kafka for some topics. For example, "regrobid" requests where we enqueue
+millions of, basically, CDX lines and want to process them on dozens of cores
+or multiple machines; or file ingest backfill. Results would still go to kafka
+(to persist), and pipelines like DOI harvest -> import -> elasticsearch would
+still be kafka.
+
+The pain point with kafka is having dozens of workers on tasks that take more
+than a couple seconds per task. We could keep tweaking kafka and writing weird
+consumer group things to handle this, but I think it will never work very well.
+NSQ supports re-queues with delay (eg, on failure, defer to re-process later),
+allows many workers to connect and leave with no disruption, messages don't
+have to be processed in order, and has a very simple enqueue API (HTTP POST).
+
+The slowish tasks we have now are file ingest (wayback and/or SPNv2 +
+GROBID) and re-GROBID. In the near future we will also have an ML backlog to go
+through.
+
+Throughput isn't much of a concern as tasks take 10+ seconds each.
+
+
+## Specific Plan
+
+Continue publishing ingest requests to Kafka topic. Have a new persist worker
+consume from this topic and push to request table (but not result table) using
+`ON CONFLICT DO NOTHING`. Have a new single-process kafka consumer pull from
+the topic and push to NSQ. This consumer monitors NSQ and doesn't push too many
+requests (eg, 1k maximum). NSQ could potentially even run as in-memory mode.
+New worker/pusher class that acts as an NSQ client, possibly with parallelism.
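+
+A minimal sketch of the kafka-to-NSQ pusher described above, assuming
+confluent-kafka for consumption and nsqd's HTTP `/pub` and `/stats` endpoints
+(host names, topic names, and the depth threshold are placeholders):
+
+    import time
+    import requests
+    from confluent_kafka import Consumer
+
+    NSQD_HTTP = "http://nsqd-host:4151"   # placeholder nsqd HTTP address
+    NSQ_TOPIC = "ingest-file-requests"    # placeholder NSQ topic name
+    MAX_DEPTH = 1000                      # don't push if ~1k requests already queued
+
+    def nsq_depth(topic):
+        # sum queued messages for our topic; exact stats JSON shape varies by nsqd version
+        stats = requests.get(NSQD_HTTP + "/stats", params={"format": "json"}).json()
+        for t in stats.get("topics", []):
+            if t.get("topic_name") == topic:
+                return t.get("depth", 0)
+        return 0
+
+    consumer = Consumer({
+        "bootstrap.servers": "kafka-broker:9092",  # placeholder
+        "group.id": "nsq-pusher",
+        "enable.auto.commit": False,
+    })
+    consumer.subscribe(["sandcrawler-prod.ingest-file-requests"])  # placeholder topic
+
+    while True:
+        if nsq_depth(NSQ_TOPIC) >= MAX_DEPTH:
+            time.sleep(5.0)
+            continue
+        msg = consumer.poll(1.0)
+        if msg is None or msg.error():
+            continue
+        # enqueue is just an HTTP POST of the raw JSON message body
+        requests.post(NSQD_HTTP + "/pub", params={"topic": NSQ_TOPIC}, data=msg.value())
+        consumer.commit(msg)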
+
+*Clean* NSQ shutdown/restart always persists data locally to disk.
+
+Unclean shutdown (eg, power failure) would mean NSQ might have lost state.
+Because we are persisting requests to sandcrawler-db, cleanup is simple:
+re-enqueue all requests from the past N days with null result or result older
+than M days.
+
+Still need multiple kafka and NSQ topics to have priority queues (eg, bulk,
+platform-specific).
+
+To start, have a single static NSQ host; don't need nsqlookupd. Could use
+wbgrp-svc506 (datanode VM with SSD, lots of CPU and RAM).
+
+To move hosts, simply restart the kafka pusher pointing at the new NSQ host.
+When the old host's queue is empty, restart the workers to consume from the new
+host, and destroy the old NSQ host.
+
+
+## Alternatives
+
+Workarounds I've done to date have been using the `grobid_tool.py` or
+`ingest_tool.py` JSON input modes to pipe JSON task files (millions of lines)
+through GNU/parallel. I guess GNU/parallel's distributed mode is also an option
+here.
+
+Other things that could be used:
+
+**celery**: popular, many features. need to run separate redis, no disk persistence (?)
+
+**disque**: need to run redis, no disk persistence (?) <https://github.com/antirez/disque>
+
+**gearman**: <http://gearman.org/> no disk persistence (?)
+
+
+## Old Notes
+
+TBD whether we would want to switch ingest requests from fatcat -> sandcrawler
+over, and have the continuous ingests run out of NSQ, or keep using kafka for
+that. Currently we can only do up to 10x parallelism or so with SPNv2, so that
+isn't a scaling pain point.
diff --git a/proposals/20201012_no_capture.md b/proposals/20201012_no_capture.md
new file mode 100644
index 0000000..7f6a1f5
--- /dev/null
+++ b/proposals/20201012_no_capture.md
@@ -0,0 +1,39 @@
+
+status: work-in-progress
+
+NOTE: as of December 2022, bnewbold can't remember if this was fully
+implemented or not.
+
+Storing no-capture missing URLs in `terminal_url`
+=================================================
+
+Currently, when the bulk-mode ingest code terminates with a `no-capture`
+status, the missing URL (which is not in GWB CDX) is not stored in
+sandcrawler-db. This proposed change is to include it in the existing
+`terminal_url` database column, with the `terminal_status_code` and
+`terminal_dt` columns empty.
+
+The implementation is rather simple:
+
+- CDX lookup code path should save the *actual* final missing URL (`next_url`
+ after redirects) in the result object's `terminal_url` field
+- ensure that this field gets passed through all the way to the database on the
+ `no-capture` code path
+
+This change does change the semantics of the `terminal_url` field somewhat, and
+could break existing assumptions, so it is being documented in this proposal
+document.
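+
+A hedged sketch of the change on the `no-capture` code path (names here
+approximate the real ingest code and result schema):
+
+    def no_capture_result(request, next_url):
+        # next_url: the final URL (after following redirects) not found in GWB CDX
+        return {
+            "status": "no-capture",
+            "hit": False,
+            "request": request,
+            "terminal": {
+                # proposed change: record the missing URL here ...
+                "terminal_url": next_url,
+                # ... while leaving these columns/fields empty
+                "terminal_dt": None,
+                "terminal_status_code": None,
+            },
+        }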
+
+
+## Alternatives
+
+The current status quo is to store the missing URL as the last element in the
+"hops" field of the JSON structure. We could keep this and have a convoluted
+pipeline that would read from the Kafka feed and extract them, but this would
+be messy. Eg, re-ingesting would not update the old kafka messages, so we would
+need some accounting of consumer group offsets after which missing URLs are
+truly missing.
+
+We could add a new `missing_url` database column and field to the JSON schema,
+for this specific use case. This seems like unnecessary extra work.
+
diff --git a/proposals/20201026_html_ingest.md b/proposals/20201026_html_ingest.md
new file mode 100644
index 0000000..785471b
--- /dev/null
+++ b/proposals/20201026_html_ingest.md
@@ -0,0 +1,129 @@
+
+status: deployed
+
+HTML Ingest Pipeline
+========================
+
+Basic goal: given an ingest request of type 'html', output an object (JSON)
+which could be imported into fatcat.
+
+Should work with things like (scholarly) blog posts, micropubs, registrations,
+protocols. Doesn't need to work with everything to start. "Platform" sites
+(like youtube, figshare, etc) will probably be a different ingest worker.
+
+A current unknown is the expected size of this metadata, both in number of
+documents and amount of metadata per document.
+
+Example HTML articles to start testing:
+
+- complex distill article: <https://distill.pub/2020/bayesian-optimization/>
+- old HTML journal: <http://web.archive.org/web/20081120141926fw_/http://www.mundanebehavior.org/issues/v5n1/rosen.htm>
+- NIH pub: <https://www.nlm.nih.gov/pubs/techbull/ja02/ja02_locatorplus_merge.html>
+- first mondays (OJS): <https://firstmonday.org/ojs/index.php/fm/article/view/10274/9729>
+- d-lib: <http://www.dlib.org/dlib/july17/williams/07williams.html>
+
+
+## Ingest Process
+
+Follow base URL to terminal document, which is assumed to be a status=200 HTML document.
+
+Verify that terminal document is fulltext. Extract both metadata and fulltext.
+
+Extract the list of sub-resources. Filter out unwanted ones (eg favicon,
+analytics, other unnecessary resources) and apply a sanity limit. Convert to
+fully qualified URLs. For each sub-resource, fetch down to the terminal
+resource and compute hashes/metadata.
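+
+A rough sketch of the sub-resource step (BeautifulSoup here is just for
+illustration; the real worker may use a different parser, and the filter list
+is an example):
+
+    import hashlib
+    from urllib.parse import urljoin
+
+    import requests
+    from bs4 import BeautifulSoup
+
+    RESOURCE_LIMIT = 200  # sanity limit on sub-resources per document
+    SKIP_SUBSTRINGS = ("favicon", "google-analytics", "gtag", "doubleclick")  # example filter
+
+    def extract_subresources(terminal_url, html_body):
+        soup = BeautifulSoup(html_body, "html.parser")
+        urls = set()
+        for tag, attr in (("img", "src"), ("script", "src"), ("link", "href")):
+            for node in soup.find_all(tag):
+                val = node.get(attr)
+                if not val:
+                    continue
+                full = urljoin(terminal_url, val)  # convert to fully qualified URL
+                if not any(s in full for s in SKIP_SUBSTRINGS):
+                    urls.add(full)
+        return sorted(urls)[:RESOURCE_LIMIT]
+
+    def fetch_and_hash(url):
+        # fetch down to the terminal resource and compute hashes/metadata
+        resp = requests.get(url, timeout=30.0)
+        return {
+            "url": url,
+            "terminal_url": resp.url,
+            "status_code": resp.status_code,
+            "sha1hex": hashlib.sha1(resp.content).hexdigest(),
+            "size": len(resp.content),
+            "mimetype": resp.headers.get("content-type"),
+        }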
+
+Open questions:
+
+- will probably want to parallelize sub-resource fetching. async?
+- behavior when fetching a sub-resource fails
+
+
+## Ingest Result Schema
+
+JSON should be basically compatible with existing `ingest_file_result` objects,
+with some new sub-objects.
+
+Overall object (`IngestWebResult`):
+
+- `status`: str
+- `hit`: bool
+- `error_message`: optional, if an error
+- `hops`: optional, array of URLs
+- `cdx`: optional; single CDX row of primary HTML document
+- `terminal`: optional; same as ingest result
+ - `terminal_url`
+ - `terminal_dt`
+ - `terminal_status_code`
+ - `terminal_sha1hex`
+- `request`: optional but usually present; ingest request object, verbatim
+- `file_meta`: optional; file metadata about primary HTML document
+- `html_biblio`: optional; extracted biblio metadata from primary HTML document
+- `scope`: optional; detected/guessed scope (fulltext, etc)
+- `html_resources`: optional; array of sub-resources. primary HTML is not included
+- `html_body`: optional; just the status code and some metadata is passed through;
+  actual document would go through a different Kafka topic
+ - `status`: str
+ - `agent`: str, eg "trafilatura/0.4"
+ - `tei_xml`: optional, str
+ - `word_count`: optional, str
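+
+A minimal Python dataclass sketch of the schema above (names mirror the list;
+exact types in the real code may differ):
+
+    from dataclasses import dataclass
+    from typing import Any, Dict, List, Optional
+
+    @dataclass
+    class HtmlBody:
+        status: str
+        agent: Optional[str] = None          # eg "trafilatura/0.4"
+        tei_xml: Optional[str] = None
+        word_count: Optional[int] = None
+
+    @dataclass
+    class IngestWebResult:
+        status: str
+        hit: bool
+        error_message: Optional[str] = None
+        hops: Optional[List[str]] = None
+        cdx: Optional[Dict[str, Any]] = None        # single CDX row of primary HTML document
+        terminal: Optional[Dict[str, Any]] = None   # terminal_url, terminal_dt, etc
+        request: Optional[Dict[str, Any]] = None    # ingest request object, verbatim
+        file_meta: Optional[Dict[str, Any]] = None
+        html_biblio: Optional[Dict[str, Any]] = None
+        scope: Optional[str] = None                 # detected/guessed scope
+        html_resources: Optional[List[Dict[str, Any]]] = None
+        html_body: Optional[HtmlBody] = None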
+
+
+## New SQL Tables
+
+`html_meta`
+ sha1hex (primary key)
+ updated (of SQL row)
+ status
+ scope
+ has_teixml
+ has_thumbnail
+ word_count (from teixml fulltext)
+ biblio (JSON)
+ resources (JSON)
+
+Also writes to `ingest_file_result`, `file_meta`, and `cdx`, all only for the base HTML document.
+
+Note: needed to enable postgrest access to this table (for scholar worker).
+
+
+## Fatcat API Wants
+
+Would be nice to have lookup by SURT+timestamp, and/or by sha1hex of terminal base file.
+
+`hide` option for cdx rows; also for fileset equivalent.
+
+
+## New Workers
+
+Could reuse existing worker, have code branch depending on type of ingest.
+
+ingest file worker
+ => same as existing worker, because could be calling SPN
+
+persist result
+ => same as existing worker; adds persisting various HTML metadata
+
+persist html text
+ => talks to seaweedfs
+
+
+## New Kafka Topics
+
+HTML ingest result topic (webcapture-ish)
+
+sandcrawler-ENV.html-teixml
+ JSON wrapping TEI-XML (same as other fulltext topics)
+ key compaction and content compression enabled
+
+JSON schema:
+
+- `key` and `sha1hex`: str; used as kafka key
+- `status`: str
+- `tei_xml`: str, optional
+- `word_count`: int, optional
+
+## New S3/SeaweedFS Content
+
+`sandcrawler` bucket, `html` folder, `.tei.xml` suffix.
+
diff --git a/proposals/20201103_xml_ingest.md b/proposals/20201103_xml_ingest.md
new file mode 100644
index 0000000..34e00b0
--- /dev/null
+++ b/proposals/20201103_xml_ingest.md
@@ -0,0 +1,64 @@
+
+status: deployed
+
+XML Fulltext Ingest
+====================
+
+This document details changes to include XML fulltext ingest in the same way
+that we currently ingest PDF fulltext.
+
+Currently this will just fetch the single XML document, which is often lacking
+figures, tables, and other required files.
+
+## Text Encoding
+
+Because we would like to treat XML as a string in a couple contexts, but XML
+can have multiple encodings (indicated in an XML header), we are in a bit of a
+bind. Simply parsing into unicode and then re-encoding as UTF-8 could result in
+a header/content mismatch. Any form of re-encoding will change the hash of the
+document. For recording in fatcat, the file metadata will be passed through.
+For storing in Kafka and blob store (for downstream analysis), we will parse
+the raw XML document (as "bytes") with an XML parser, then re-output with UTF-8
+encoding. The hash of the *original* XML file will be used as the key for
+referring to this document. This is unintuitive, but similar to what we are
+doing with PDF and HTML documents (extracting in a useful format, but keeping
+the original document's hash as a key).
+
+Unclear if we need to do this re-encode process for XML documents already in
+UTF-8 encoding.
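+
+A sketch of the parse/re-encode step (using lxml here; the actual parser choice
+is an implementation detail):
+
+    import hashlib
+    from lxml import etree
+
+    def reencode_xml(raw_xml: bytes):
+        # the key is always the hash of the *original* bytes, not the re-encoded output
+        sha1hex = hashlib.sha1(raw_xml).hexdigest()
+        root = etree.fromstring(raw_xml)  # parser respects the declared encoding
+        utf8_xml = etree.tostring(root, encoding="UTF-8", xml_declaration=True)
+        return sha1hex, utf8_xml.decode("utf-8")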
+
+## Ingest Worker
+
+Could either re-use HTML metadata extractor to fetch XML fulltext links, or
+fork that code off to a separate method, like the PDF fulltext URL extractor.
+
+Hopefully can re-use almost all of the PDF pipeline code, by making that ingest
+worker class more generic and subclassing it.
+
+Result objects are treated the same as PDF ingest results: the result object
+has context about status, and if successful, file metadata and CDX row of the
+terminal object.
+
+TODO: should it be assumed that XML fulltext will end up in S3 bucket? or
+should there be an `xml_meta` SQL table tracking this, like we have for PDFs
+and HTML?
+
+TODO: should we detect and specify the XML schema better? Eg, indicate if JATS.
+
+
+## Persist Pipeline
+
+### Kafka Topic
+
+sandcrawler-ENV.xml-doc
+ similar to other fulltext topics; JSON wrapping the XML
+ key compaction, content compression
+
+### S3/SeaweedFS
+
+`sandcrawler` bucket, `xml` folder. Extension could depend on sub-type of XML?
+
+### Persist Worker
+
+New S3-only worker that pulls from kafka topic and pushes to S3. Works
+basically the same as PDF persist in S3-only mode, or like pdf-text worker.
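+
+A minimal sketch of such a worker, assuming confluent-kafka and the minio
+client (endpoint, credentials, topic name, and the JSON field holding the XML
+body are all placeholders/assumptions):
+
+    import io
+    import json
+    from confluent_kafka import Consumer
+    from minio import Minio
+
+    # placeholder seaweedfs S3 endpoint and credentials
+    s3 = Minio("seaweedfs-host:8333", access_key="KEY", secret_key="SECRET", secure=False)
+
+    consumer = Consumer({
+        "bootstrap.servers": "kafka-broker:9092",  # placeholder
+        "group.id": "persist-xml-doc",
+    })
+    consumer.subscribe(["sandcrawler-prod.xml-doc"])
+
+    while True:
+        msg = consumer.poll(1.0)
+        if msg is None or msg.error():
+            continue
+        record = json.loads(msg.value())
+        xml_body = record.get("jats_xml") or record.get("xml")  # field name is an assumption
+        if not xml_body:
+            continue
+        blob = xml_body.encode("utf-8")
+        # `sandcrawler` bucket, `xml` folder; extension choice is still open per above
+        path = "xml/{}.xml".format(record["sha1hex"])
+        s3.put_object("sandcrawler", path, io.BytesIO(blob), len(blob),
+                      content_type="application/xml")
+        consumer.commit(msg)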
diff --git a/proposals/2020_pdf_meta_thumbnails.md b/proposals/2020_pdf_meta_thumbnails.md
new file mode 100644
index 0000000..141ece8
--- /dev/null
+++ b/proposals/2020_pdf_meta_thumbnails.md
@@ -0,0 +1,328 @@
+
+status: deployed
+
+New PDF derivatives: thumbnails, metadata, raw text
+===================================================
+
+To support scholar.archive.org (fulltext search) and other downstream uses of
+fatcat, want to extract from many PDFs:
+
+- pdf structured metadata
+- thumbnail images
+- raw extracted text
+
+A single worker should extract all of these fields and publish into two kafka
+streams. Separate persist workers consume from the streams and push into SQL
+and/or seaweedfs.
+
+Additionally, this extraction should happen automatically for newly-crawled
+PDFs as part of the ingest pipeline. When possible, checks should be run
+against the existing SQL table to avoid duplication of processing.
+
+
+## PDF Metadata and Text
+
+Kafka topic (name: `sandcrawler-ENV.pdf-text`; 12x partitions; gzip
+compression) JSON schema:
+
+ sha1hex (string; used as key)
+ status (string)
+ text (string)
+ page0_thumbnail (boolean)
+ meta_xml (string)
+ pdf_info (object)
+ pdf_extra (object)
+ word_count
+ file_meta (object)
+ source (object)
+
+For the SQL table we should have columns for metadata fields that are *always*
+saved, and put a subset of other interesting fields in a JSON blob. We don't
+need all metadata fields in SQL. Full metadata/info will always be available in
+Kafka, and we don't want SQL table size to explode. Schema:
+
+ CREATE TABLE IF NOT EXISTS pdf_meta (
+ sha1hex TEXT PRIMARY KEY CHECK (octet_length(sha1hex) = 40),
+ updated TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL,
+ status TEXT CHECK (octet_length(status) >= 1) NOT NULL,
+ has_page0_thumbnail BOOLEAN NOT NULL,
+ page_count INT CHECK (page_count >= 0),
+ word_count INT CHECK (word_count >= 0),
+ page0_height REAL CHECK (page0_height >= 0),
+ page0_width REAL CHECK (page0_width >= 0),
+ permanent_id TEXT CHECK (octet_length(permanent_id) >= 1),
+ pdf_created TIMESTAMP WITH TIME ZONE,
+ pdf_version TEXT CHECK (octet_length(pdf_version) >= 1),
+ metadata JSONB
+ -- maybe some analysis of available fields?
+ -- metadata JSON fields:
+ -- title
+ -- subject
+ -- author
+ -- creator
+ -- producer
+ -- CrossMarkDomains
+ -- doi
+ -- form
+ -- encrypted
+ );
+
+
+## Thumbnail Images
+
+Kafka Schema is raw image bytes as message body; sha1sum of PDF as the key. No
+compression, 12x partitions.
+
+Kafka topic name is `sandcrawler-ENV.pdf-thumbnail-SIZE-TYPE` (eg,
+`sandcrawler-qa.pdf-thumbnail-180px-jpg`). Thus, the topic name contains the
+"metadata" of thumbnail size/shape.
+
+Have decided to use JPEG thumbnails, 180px wide (and max 300px high, though
+width restriction is almost always the limiting factor). This size matches that
+used on archive.org, and is slightly larger than the thumbnails currently used
+on scholar.archive.org prototype. We intend to tweak the scholar.archive.org
+CSS to use the full/raw thumbnail image at max desktop size. At this size it
+would be difficult (though maybe not impossible?) to extract text (other than
+large-font titles).
+
+
+### Implementation
+
+We use the `poppler` CPP library (via a python wrapper) to extract and convert everything.
+
+Some example usage of the `python-poppler` library:
+
+ import poppler
+ from PIL import Image
+
+ pdf = poppler.load_from_file("/home/bnewbold/10.1038@s41551-020-0534-9.pdf")
+ pdf.pdf_id
+ page = pdf.create_page(0)
+ page.page_rect().width
+
+ renderer = poppler.PageRenderer()
+ full_page = renderer.render_page(page)
+ img = Image.frombuffer("RGBA", (full_page.width, full_page.height), full_page.data, 'raw', "RGBA")
+ img.thumbnail((180,300), Image.BICUBIC)
+ img.save("something.jpg")
+
+## Deployment and Infrastructure
+
+Deployment will involve:
+
+- sandcrawler DB SQL table
+  => guesstimate size 100 GByte for hundreds of millions of PDFs
+- postgrest/SQL access to new table for internal HTTP API hits
+- seaweedfs raw text folder
+ => reuse existing bucket with GROBID XML; same access restrictions on content
+- seaweedfs thumbnail bucket
+ => new bucket for this world-public content
+- public nginx access to seaweed thumbnail bucket
+- extraction work queue kafka topic
+ => same schema/semantics as ungrobided
+- text/metadata kafka topic
+- thumbnail kafka topic
+- text/metadata persist worker(s)
+ => from kafka; metadata to SQL database; text to seaweedfs blob store
+- thumbnail persist worker
+ => from kafka to seaweedfs blob store
+- pdf extraction worker pool
+ => very similar to GROBID worker pool
+- ansible roles for all of the above
+
+Plan for processing/catchup is:
+
+- test with COVID-19 PDF corpus
+- run extraction on all current fatcat files available via IA
+- integrate with ingest pipeline for all new files
+- run a batch catchup job over all GROBID-parsed files with no pdf meta
+ extracted, on basis of SQL table query
+
+## Appendix: Thumbnail Size and Format Experimentation
+
+Using 190 PDFs from `/data/pdfs/random_crawl/files` on my laptop to test.
+
+TODO: actually, 4x images failed to convert with pdftocairo; this throws off
+"mean" sizes by a small amount.
+
+ time ls | parallel -j1 pdftocairo -singlefile -scale-to 200 -png {} /tmp/test-png/{}.png
+ real 0m29.314s
+ user 0m26.794s
+ sys 0m2.484s
+ => missing: 4
+ => min: 0.8k
+ => max: 57K
+ => mean: 16.4K
+ => total: 3120K
+
+ time ls | parallel -j1 pdftocairo -singlefile -scale-to 200 -jpeg {} /tmp/test-jpeg/{}.jpg
+ real 0m26.289s
+ user 0m24.022s
+ sys 0m2.490s
+ => missing: 4
+ => min: 1.2K
+ => max: 13K
+ => mean: 8.02k
+ => total: 1524K
+
+ time ls | parallel -j1 pdftocairo -singlefile -scale-to 200 -jpeg -jpegopt optimize=y,quality=80 {} /tmp/test-jpeg2/{}.jpg
+ real 0m27.401s
+ user 0m24.941s
+ sys 0m2.519s
+ => missing: 4
+ => min: 577
+ => max: 14K
+ => mean:
+ => total: 1540K
+
+ time ls | parallel -j1 convert -resize 200x200 {}[0] /tmp/magick-png/{}.png
+ => missing: 4
+ real 1m19.399s
+ user 1m17.150s
+ sys 0m6.322s
+ => min: 1.1K
+ => max: 325K
+ => mean:
+ => total: 8476K
+
+ time ls | parallel -j1 convert -resize 200x200 {}[0] /tmp/magick-jpeg/{}.jpg
+ real 1m21.766s
+ user 1m17.040s
+ sys 0m7.155s
+ => total: 3484K
+
+NOTE: the following `pdf_thumbnail.py` images are somewhat smaller than the above
+jpg and pngs (max 180px wide, not 200px wide)
+
+ time ls | parallel -j1 ~/code/sandcrawler/python/scripts/pdf_thumbnail.py {} /tmp/python-png/{}.png
+ real 0m48.198s
+ user 0m42.997s
+ sys 0m4.509s
+ => missing: 2; 2x additional stub images
+ => total: 5904K
+
+ time ls | parallel -j1 ~/code/sandcrawler/python/scripts/pdf_thumbnail.py {} /tmp/python-jpg/{}.jpg
+ real 0m45.252s
+ user 0m41.232s
+ sys 0m4.273s
+ => min: 1.4K
+ => max: 16K
+ => mean: ~9.3KByte
+ => total: 1772K
+
+ time ls | parallel -j1 ~/code/sandcrawler/python/scripts/pdf_thumbnail.py {} /tmp/python-jpg-360/{}.jpg
+ real 0m48.639s
+ user 0m44.121s
+ sys 0m4.568s
+ => mean: ~28k
+ => total: 5364K (3x of 180px batch)
+
+ quality=95
+ time ls | parallel -j1 ~/code/sandcrawler/python/scripts/pdf_thumbnail.py {} /tmp/python-jpg2-360/{}.jpg
+ real 0m49.407s
+ user 0m44.607s
+ sys 0m4.869s
+ => total: 9812K
+
+ quality=95
+ time ls | parallel -j1 ~/code/sandcrawler/python/scripts/pdf_thumbnail.py {} /tmp/python-jpg2-180/{}.jpg
+ real 0m45.901s
+ user 0m41.486s
+ sys 0m4.591s
+ => mean: 16.4K
+ => total: 3116K
+
+At the 180px size, the difference between default and quality=95 seems
+indistinguishable visually to me, but is more than a doubling of file size.
+Also tried at 300px and seems near-indistinguishable there as well.
+
+At a mean of 10 Kbytes per file:
+
+ 10 million -> 100 GBytes
+ 100 million -> 1 Tbyte
+
+Older COVID-19 thumbnails were about 400px wide:
+
+ pdftocairo -png -singlefile -scale-to-x 400 -scale-to-y -1
+
+Display on scholar-qa.archive.org is about 135x181px
+
+archive.org does 180px wide
+
+Unclear if we should try to do double resolution for high DPI screens (eg,
+apple "retina").
+
+Using same size as archive.org probably makes the most sense: max 180px wide,
+preserve aspect ratio. And jpeg improvement seems worth it.
+
+#### Merlijn notes
+
+From work on optimizing microfilm thumbnail images:
+
+ When possible, generate a thumbnail that fits well on the screen of the
+ user. Always creating a large thumbnail will result in the browsers
+    downscaling them, leading to fuzzy text. If it’s not possible, then pick
+    the resolution you’d want to support (1.5x or 2x scaling) and
+ create thumbnails of that size, but also apply the other recommendations
+ below - especially a sharpening filter.
+
+ Use bicubic or lanczos interpolation. Bilinear and nearest neighbour are
+ not OK.
+
+ For text, consider applying a sharpening filter. Not a strong one, but some
+ sharpening can definitely help.
+
+
+## Appendix: PDF Info Fields
+
+From `pdfinfo` manpage:
+
+    The 'Info' dictionary contains the following values:
+
+ title
+ subject
+ keywords
+ author
+ creator
+ producer
+ creation date
+ modification date
+
+ In addition, the following information is printed:
+
+ tagged (yes/no)
+ form (AcroForm / XFA / none)
+ javascript (yes/no)
+ page count
+ encrypted flag (yes/no)
+ print and copy permissions (if encrypted)
+ page size
+ file size
+ linearized (yes/no)
+ PDF version
+ metadata (only if requested)
+
+For an example file, the output looks like:
+
+ Title: A mountable toilet system for personalized health monitoring via the analysis of excreta
+ Subject: Nature Biomedical Engineering, doi:10.1038/s41551-020-0534-9
+ Keywords:
+ Author: Seung-min Park
+ Creator: Springer
+ CreationDate: Thu Mar 26 01:26:57 2020 PDT
+ ModDate: Thu Mar 26 01:28:06 2020 PDT
+ Tagged: no
+ UserProperties: no
+ Suspects: no
+ Form: AcroForm
+ JavaScript: no
+ Pages: 14
+ Encrypted: no
+ Page size: 595.276 x 790.866 pts
+ Page rot: 0
+ File size: 6104749 bytes
+ Optimized: yes
+ PDF version: 1.4
+
+For context on the `pdf_id` fields ("original" and "updated"), read:
+<https://web.hypothes.is/blog/synchronizing-annotations-between-local-and-remote-pdfs/>
diff --git a/proposals/2020_seaweed_s3.md b/proposals/2020_seaweed_s3.md
new file mode 100644
index 0000000..677393b
--- /dev/null
+++ b/proposals/2020_seaweed_s3.md
@@ -0,0 +1,426 @@
+# Notes on seaweedfs
+
+> 2020-04-28, martin@archive.org
+
+Currently (04/2020) [minio](https://github.com/minio/minio) is used to store
+output from PDF analysis for [fatcat](https://fatcat.wiki) (e.g. from
+[grobid](https://grobid.readthedocs.io/en/latest/)). The file checksum (sha1)
+serves as key, values are blobs of XML or JSON.
+
+Problem: minio inserts slowed down after inserting 80M or more objects.
+
+Summary: I did four test runs, three failed, one (testrun-4) succeeded.
+
+* [testrun-4](https://git.archive.org/webgroup/sandcrawler/-/blob/master/proposals/2020_seaweed_s3.md#testrun-4)
+
+So far, in a non-distributed mode, the project looks usable. Added 200M objects
+(about 550G) in 6 days. Full CPU load, 400M RAM usage, constant insert times.
+
+----
+
+Details (03/2020) / @bnewbold, slack
+
+> the sandcrawler XML data store (currently on aitio) is grinding to a halt, I
+> think because despite tuning minio+ext4+hdd just doesn't work. current at 2.6
+> TiB of data (each document compressed with snappy) and 87,403,183 objects.
+
+> this doesn't impact ingest processing (because content is queued and archived
+> in kafka), but does impact processing and analysis
+
+> it is possible that the other load on aitio is making this worse, but I did
+> an experiment with dumping to a 16 TB disk that slowed way down after about
+> 50 million files also. some people on the internet said to just not worry
+> about these huge file counts on modern filesystems, but i've debugged a bit
+> and I think it is a bad idea after all
+
+Possible solutions
+
+* putting content in fake WARCs and trying to do something like CDX
+* deploy CEPH object store (or swift, or any other off-the-shelf object store)
+* try putting the files in postgres tables, mongodb, cassandra, etc: these are
+ not designed for hundreds of millions of ~50 KByte XML documents (5 - 500
+ KByte range)
+* try to find or adapt an open source tool like Haystack, Facebook's solution
+ to this engineering problem. eg:
+ https://engineering.linkedin.com/blog/2016/05/introducing-and-open-sourcing-ambry---linkedins-new-distributed-
+
+----
+
+The following are notes gathered during a few test runs of seaweedfs in 04/2020
+on wbgrp-svc170.us.archive.org (4 core E5-2620 v4, 4GB RAM).
+
+----
+
+## Setup
+
+There are frequent [releases](https://github.com/chrislusf/seaweedfs/releases)
+but for the test, we used a build off master branch.
+
+Directions for configuring AWS CLI for seaweedfs:
+[https://github.com/chrislusf/seaweedfs/wiki/AWS-CLI-with-SeaweedFS](https://github.com/chrislusf/seaweedfs/wiki/AWS-CLI-with-SeaweedFS).
+
+### Build the binary
+
+Using development version (requires a [Go installation](https://golang.org/dl/)).
+
+```
+$ git clone git@github.com:chrislusf/seaweedfs.git # 11f5a6d9
+$ cd seaweedfs
+$ make
+$ ls -lah weed/weed
+-rwxr-xr-x 1 tir tir 55M Apr 17 16:57 weed
+
+$ git rev-parse HEAD
+11f5a6d91346e5f3cbf3b46e0a660e231c5c2998
+
+$ sha1sum weed/weed
+a7f8f0b49e6183da06fc2d1411c7a0714a2cc96b
+```
+
+A single, 55M binary emerges after a few seconds. The binary contains
+subcommands to run different parts of seaweed, e.g. master or volume servers,
+filer and commands for maintenance tasks, like backup and compaction.
+
+To *deploy*, just copy this binary to the destination.
+
+### Quickstart with S3
+
+Assuming `weed` binary is in PATH.
+
+Start a master and volume server (over /tmp, most likely) and the S3 API with a single command:
+
+```
+$ weed server -s3
+...
+Start Seaweed Master 30GB 1.74 at 0.0.0.0:9333
+...
+Store started on dir: /tmp with 0 volumes max 7
+Store started on dir: /tmp with 0 ec shards
+Volume server start with seed master nodes: [localhost:9333]
+...
+Start Seaweed S3 API Server 30GB 1.74 at http port 8333
+...
+```
+
+Install the [AWS
+CLI](https://github.com/chrislusf/seaweedfs/wiki/AWS-CLI-with-SeaweedFS).
+Create a bucket.
+
+```
+$ aws --endpoint-url http://localhost:8333 s3 mb s3://sandcrawler-dev
+make_bucket: sandcrawler-dev
+```
+
+List buckets.
+
+```
+$ aws --endpoint-url http://localhost:8333 s3 ls
+2020-04-17 17:44:39 sandcrawler-dev
+```
+
+Create a dummy file.
+
+```
+$ echo "blob" > 12340d9a4a4f710ecf03b127051814385e83ff08.tei.xml
+```
+
+Upload.
+
+```
+$ aws --endpoint-url http://localhost:8333 s3 cp 12340d9a4a4f710ecf03b127051814385e83ff08.tei.xml s3://sandcrawler-dev
+upload: ./12340d9a4a4f710ecf03b127051814385e83ff08.tei.xml to s3://sandcrawler-dev/12340d9a4a4f710ecf03b127051814385e83ff08.tei.xml
+```
+
+List.
+
+```
+$ aws --endpoint-url http://localhost:8333 s3 ls s3://sandcrawler-dev
+2020-04-17 17:50:35 5 12340d9a4a4f710ecf03b127051814385e83ff08.tei.xml
+```
+
+Stream to stdout.
+
+```
+$ aws --endpoint-url http://localhost:8333 s3 cp s3://sandcrawler-dev/12340d9a4a4f710ecf03b127051814385e83ff08.tei.xml -
+blob
+```
+
+Drop the bucket.
+
+```
+$ aws --endpoint-url http://localhost:8333 s3 rm --recursive s3://sandcrawler-dev
+```
+
+### Builtin benchmark
+
+The project comes with a builtin benchmark command.
+
+```
+$ weed benchmark
+```
+
+I encountered an error like
+[#181](https://github.com/chrislusf/seaweedfs/issues/181), "no free volume
+left", when trying to start the benchmark after the S3 ops. A restart (or a
+restart with `-volume.max 100`) helped.
+
+```
+$ weed server -s3 -volume.max 100
+```
+
+### Listing volumes
+
+```
+$ weed shell
+> volume.list
+Topology volume:15/112757 active:8 free:112742 remote:0 volumeSizeLimit:100 MB
+ DataCenter DefaultDataCenter volume:15/112757 active:8 free:112742 remote:0
+ Rack DefaultRack volume:15/112757 active:8 free:112742 remote:0
+ DataNode localhost:8080 volume:15/112757 active:8 free:112742 remote:0
+ volume id:1 size:105328040 collection:"test" file_count:33933 version:3 modified_at_second:1587215730
+ volume id:2 size:106268552 collection:"test" file_count:34236 version:3 modified_at_second:1587215730
+ volume id:3 size:106290280 collection:"test" file_count:34243 version:3 modified_at_second:1587215730
+ volume id:4 size:105815368 collection:"test" file_count:34090 version:3 modified_at_second:1587215730
+ volume id:5 size:105660168 collection:"test" file_count:34040 version:3 modified_at_second:1587215730
+ volume id:6 size:106296488 collection:"test" file_count:34245 version:3 modified_at_second:1587215730
+ volume id:7 size:105753288 collection:"test" file_count:34070 version:3 modified_at_second:1587215730
+ volume id:8 size:7746408 file_count:12 version:3 modified_at_second:1587215764
+ volume id:9 size:10438760 collection:"test" file_count:3363 version:3 modified_at_second:1587215788
+ volume id:10 size:10240104 collection:"test" file_count:3299 version:3 modified_at_second:1587215788
+ volume id:11 size:10258728 collection:"test" file_count:3305 version:3 modified_at_second:1587215788
+ volume id:12 size:10240104 collection:"test" file_count:3299 version:3 modified_at_second:1587215788
+ volume id:13 size:10112840 collection:"test" file_count:3258 version:3 modified_at_second:1587215788
+ volume id:14 size:10190440 collection:"test" file_count:3283 version:3 modified_at_second:1587215788
+ volume id:15 size:10112840 collection:"test" file_count:3258 version:3 modified_at_second:1587215788
+ DataNode localhost:8080 total size:820752408 file_count:261934
+ Rack DefaultRack total size:820752408 file_count:261934
+ DataCenter DefaultDataCenter total size:820752408 file_count:261934
+total size:820752408 file_count:261934
+```
+
+### Custom S3 benchmark
+
+To simulate the use case of S3 for 100-500M small files (grobid xml, pdftotext,
+...), I created a synthetic benchmark.
+
+* [https://gist.github.com/miku/6f3fee974ba82083325c2f24c912b47b](https://gist.github.com/miku/6f3fee974ba82083325c2f24c912b47b)
+
+We just try to fill up the datastore with millions of 5k blobs.
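+
+A condensed sketch of that benchmark (the gist linked above has the full
+script; this just mirrors the boto3 calls and worker layout visible in the
+logs and tracebacks below):
+
+```
+import argparse
+import multiprocessing
+import random
+import string
+
+import boto3
+
+def insert_keys(worker_id, n, endpoint, bucket):
+    s3 = boto3.resource("s3", endpoint_url=endpoint,
+                        aws_access_key_id="any", aws_secret_access_key="any")
+    # one fixed ~5k payload per worker is enough to fill the datastore
+    data = "".join(random.choice(string.ascii_lowercase) for _ in range(5000)).encode()
+    for i in range(worker_id * n, (worker_id + 1) * n):
+        s3.Bucket(bucket).put_object(Key="k{}".format(i), Body=data)
+        if i % 10000 == 0:
+            print("worker-{} @{}".format(worker_id, i))
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("-n", type=int, default=1000000, help="objects per worker")
+    parser.add_argument("-w", type=int, default=5, help="number of worker processes")
+    args = parser.parse_args()
+    procs = [multiprocessing.Process(target=insert_keys,
+                                     args=(w, args.n, "http://localhost:8333", "test"))
+             for w in range(args.w)]
+    for p in procs:
+        p.start()
+    for p in procs:
+        p.join()
+```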
+
+----
+
+### testrun-1
+
+Small set, just to run. Status: done. Learned that the default in-memory volume
+index grows too quickly for the 4GB RAM machine.
+
+```
+$ weed server -dir /tmp/martin-seaweedfs-testrun-1 -s3 -volume.max 512 -master.volumeSizeLimitMB 100
+```
+
+* https://github.com/chrislusf/seaweedfs/issues/498 -- RAM
+* at 10M files, we already consume ~1G
+
+```
+-volume.index string
+ Choose [memory|leveldb|leveldbMedium|leveldbLarge] mode for memory~performance balance. (default "memory")
+```
+
+### testrun-2
+
+200M 5k objects, in-memory volume index. Status: done. Observed: After 18M
+objects the 512 100MB volumes are exhausted and seaweedfs will not accept any
+new data.
+
+```
+$ weed server -dir /tmp/martin-seaweedfs-testrun-2 -s3 -volume.max 512 -master.volumeSizeLimitMB 100
+...
+I0418 12:01:43 1622 volume_loading.go:104] loading index /tmp/martin-seaweedfs-testrun-2/test_511.idx to memory
+I0418 12:01:43 1622 store.go:122] add volume 511
+I0418 12:01:43 1622 volume_layout.go:243] Volume 511 becomes writable
+I0418 12:01:43 1622 volume_growth.go:224] Created Volume 511 on topo:DefaultDataCenter:DefaultRack:localhost:8080
+I0418 12:01:43 1622 master_grpc_server.go:158] master send to master@[::1]:45084: url:"localhost:8080" public_url:"localhost:8080" new_vids:511
+I0418 12:01:43 1622 master_grpc_server.go:158] master send to filer@::1:18888: url:"localhost:8080" public_url:"localhost:8080" new_vids:511
+I0418 12:01:43 1622 store.go:118] In dir /tmp/martin-seaweedfs-testrun-2 adds volume:512 collection:test replicaPlacement:000 ttl:
+I0418 12:01:43 1622 volume_loading.go:104] loading index /tmp/martin-seaweedfs-testrun-2/test_512.idx to memory
+I0418 12:01:43 1622 store.go:122] add volume 512
+I0418 12:01:43 1622 volume_layout.go:243] Volume 512 becomes writable
+I0418 12:01:43 1622 master_grpc_server.go:158] master send to master@[::1]:45084: url:"localhost:8080" public_url:"localhost:8080" new_vids:512
+I0418 12:01:43 1622 master_grpc_server.go:158] master send to filer@::1:18888: url:"localhost:8080" public_url:"localhost:8080" new_vids:512
+I0418 12:01:43 1622 volume_growth.go:224] Created Volume 512 on topo:DefaultDataCenter:DefaultRack:localhost:8080
+I0418 12:01:43 1622 node.go:82] topo failed to pick 1 from 0 node candidates
+I0418 12:01:43 1622 volume_growth.go:88] create 7 volume, created 2: No enough data node found!
+I0418 12:04:30 1622 volume_layout.go:231] Volume 511 becomes unwritable
+I0418 12:04:30 1622 volume_layout.go:231] Volume 512 becomes unwritable
+E0418 12:04:30 1622 filer_server_handlers_write.go:69] failing to assign a file id: rpc error: code = Unknown desc = No free volumes left!
+I0418 12:04:30 1622 filer_server_handlers_write.go:120] fail to allocate volume for /buckets/test/k43731970, collection:test, datacenter:
+E0418 12:04:30 1622 filer_server_handlers_write.go:69] failing to assign a file id: rpc error: code = Unknown desc = No free volumes left!
+E0418 12:04:30 1622 filer_server_handlers_write.go:69] failing to assign a file id: rpc error: code = Unknown desc = No free volumes left!
+E0418 12:04:30 1622 filer_server_handlers_write.go:69] failing to assign a file id: rpc error: code = Unknown desc = No free volumes left!
+E0418 12:04:30 1622 filer_server_handlers_write.go:69] failing to assign a file id: rpc error: code = Unknown desc = No free volumes left!
+I0418 12:04:30 1622 masterclient.go:88] filer failed to receive from localhost:9333: rpc error: code = Unavailable desc = transport is closing
+I0418 12:04:30 1622 master_grpc_server.go:276] - client filer@::1:18888
+```
+
+Inserted about 18M docs, then:
+
+```
+worker-0 @3720000 45475.13 81.80
+worker-1 @3730000 45525.00 81.93
+worker-3 @3720000 45525.76 81.71
+worker-4 @3720000 45527.22 81.71
+Process Process-1:
+Traceback (most recent call last):
+ File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
+ self.run()
+ File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
+ self._target(*self._args, **self._kwargs)
+ File "s3test.py", line 42, in insert_keys
+ s3.Bucket(bucket).put_object(Key=key, Body=data)
+ File "/home/martin/.virtualenvs/6f3fee974ba82083325c2f24c912b47b/lib/python3.5/site-packages/boto3/resources/factory.py", line 520, in do_action
+ response = action(self, *args, **kwargs)
+ File "/home/martin/.virtualenvs/6f3fee974ba82083325c2f24c912b47b/lib/python3.5/site-packages/boto3/resources/action.py", line 83, in __call__
+ response = getattr(parent.meta.client, operation_name)(**params)
+ File "/home/martin/.virtualenvs/6f3fee974ba82083325c2f24c912b47b/lib/python3.5/site-packages/botocore/client.py", line 316, in _api_call
+ return self._make_api_call(operation_name, kwargs)
+ File "/home/martin/.virtualenvs/6f3fee974ba82083325c2f24c912b47b/lib/python3.5/site-packages/botocore/client.py", line 626, in _make_api_call
+ raise error_class(parsed_response, operation_name)
+botocore.exceptions.ClientError: An error occurred (InternalError) when calling the PutObject operation (reached max retries: 4): We encountered an internal error, please try again.
+
+real 759m30.034s
+user 1962m47.487s
+sys 105m21.113s
+```
+
+Sustained 400 S3 puts/s, RAM usage 41% of a 4G machine. 56G on disk.
+
+> No free volumes left! Failed to allocate bucket for /buckets/test/k163721819
+
+### testrun-3
+
+* use leveldb, leveldbLarge
+* try "auto" volumes
+* Status: done. Observed: rapid memory usage increase.
+
+```
+$ weed server -dir /tmp/martin-seaweedfs-testrun-3 -s3 -volume.max 0 -volume.index=leveldbLarge -filer=false -master.volumeSizeLimitMB 100
+```
+
+Observations: memory usage grows rapidly, soon at 15%.
+
+Note-to-self: [https://github.com/chrislusf/seaweedfs/wiki/Optimization](https://github.com/chrislusf/seaweedfs/wiki/Optimization)
+
+### testrun-4
+
+The default volume size is 30G (and cannot be more at the moment), and RAM
+usage grows quickly with the number of volumes. Therefore, keep the default
+volume size, do not limit the number of volumes (`-volume.max 0`), and do not
+use the in-memory index (use leveldb instead).
+
+Status: done, 200M object upload via Python script successfully in about 6 days,
+memory usage was at a moderate 400M (~10% of RAM). Relatively constant
+performance at about 400 `PutObject` requests/s (over 5 threads, each thread
+was around 80 requests/s; then testing with 4 threads, each thread got to
+around 100 requests/s).
+
+```
+$ weed server -dir /tmp/martin-seaweedfs-testrun-4 -s3 -volume.max 0 -volume.index=leveldb
+```
+
+The test script command was (40M files per worker, 5 workers).
+
+```
+$ time python s3test.py -n 40000000 -w 5 2> s3test.4.log
+...
+
+real 8454m33.695s
+user 21318m23.094s
+sys 1128m32.293s
+```
+
+The test script adds keys from `k0...k199999999`.
+
+```
+$ aws --endpoint-url http://localhost:8333 s3 ls s3://test | head -20
+2020-04-19 09:27:13 5000 k0
+2020-04-19 09:27:13 5000 k1
+2020-04-19 09:27:13 5000 k10
+2020-04-19 09:27:15 5000 k100
+2020-04-19 09:27:26 5000 k1000
+2020-04-19 09:29:15 5000 k10000
+2020-04-19 09:47:49 5000 k100000
+2020-04-19 12:54:03 5000 k1000000
+2020-04-20 20:14:10 5000 k10000000
+2020-04-22 07:33:46 5000 k100000000
+2020-04-22 07:33:46 5000 k100000001
+2020-04-22 07:33:46 5000 k100000002
+2020-04-22 07:33:46 5000 k100000003
+2020-04-22 07:33:46 5000 k100000004
+2020-04-22 07:33:46 5000 k100000005
+2020-04-22 07:33:46 5000 k100000006
+2020-04-22 07:33:46 5000 k100000007
+2020-04-22 07:33:46 5000 k100000008
+2020-04-22 07:33:46 5000 k100000009
+2020-04-20 20:14:10 5000 k10000001
+```
+
+Glance at stats.
+
+```
+$ du -hs /tmp/martin-seaweedfs-testrun-4
+596G /tmp/martin-seaweedfs-testrun-4
+
+$ find . /tmp/martin-seaweedfs-testrun-4 | wc -l
+5104
+
+$ ps --pid $(pidof weed) -o pid,tid,class,stat,vsz,rss,comm
+ PID TID CLS STAT VSZ RSS COMMAND
+32194 32194 TS Sl+ 1966964 491644 weed
+
+$ ls -1 /proc/$(pidof weed)/fd | wc -l
+192
+
+$ free -m
+ total used free shared buff/cache available
+Mem: 3944 534 324 39 3086 3423
+Swap: 4094 27 4067
+```
+
+### Note on restart
+
+When stopping (CTRL-C) and restarting `weed`, it will take about 10 seconds to
+get the S3 API server back up, and another minute or two until seaweedfs has
+inspected all existing volumes and indices.
+
+In that gap, requests to S3 will look like internal server errors.
+
+```
+$ aws --endpoint-url http://localhost:8333 s3 cp s3://test/k100 -
+download failed: s3://test/k100 to - An error occurred (500) when calling the
+GetObject operation (reached max retries: 4): Internal Server Error
+```
+
+### Read benchmark
+
+Reading via the command-line `aws` client is a bit slow at first sight (3-5s).
+
+```
+$ time aws --endpoint-url http://localhost:8333 s3 cp s3://test/k123456789 -
+ppbhjgzkrrgwagmjsuwhqcwqzmefybeopqz [...]
+
+real 0m5.839s
+user 0m0.898s
+sys 0m0.293s
+```
+
+#### Single process random reads
+
+* via [s3read.go](https://gist.github.com/miku/6f3fee974ba82083325c2f24c912b47b#file-s3read-go)
+
+Running 1000 random reads takes 49s.
+
+#### Concurrent random reads
+
+* 80000 requests with 8 parallel processes: 7m41.973968488s, so about 170 objects/s
+* seen up to 760 keys/s reads for 8 workers
+* weed will utilize all cores, so more cpus could result in higher read throughput
+* RAM usage can increase (seen up to 20% of 4G RAM), then decrease (GC) back to 5%, depending on query load
diff --git a/proposals/2021-04-22_crossref_db.md b/proposals/2021-04-22_crossref_db.md
new file mode 100644
index 0000000..1d4c3f8
--- /dev/null
+++ b/proposals/2021-04-22_crossref_db.md
@@ -0,0 +1,86 @@
+
+status: deployed
+
+Crossref DOI Metadata in Sandcrawler DB
+=======================================
+
+Proposal is to have a local copy of Crossref API metadata records in
+sandcrawler DB, accessible by simple key lookup via postgrest.
+
+Initial goal is to include these in scholar work "bundles" (along with
+fulltext, etc), in particular as part of reference extraction pipeline. Around
+late 2020, many additional references became available via Crossref records,
+and have not been imported (updated) into fatcat. Reference storage in fatcat
+API is a scaling problem we would like to put off, so injecting content in this
+way is desirable.
+
+To start, working with a bulk dump made available by Crossref. In the future,
+might persist the daily feed so that we have a continuously up-to-date copy.
+
+Another application of Crossref-in-bundles is to identify overall scale of
+changes since initial Crossref metadata import.
+
+
+## Sandcrawler DB Schema
+
+The "updated" field in this case refers to the upstream timestamp, not the
+sandcrawler database update time.
+
+ CREATE TABLE IF NOT EXISTS crossref (
+ doi TEXT NOT NULL CHECK (octet_length(doi) >= 4 AND doi = LOWER(doi)),
+ indexed TIMESTAMP WITH TIME ZONE NOT NULL,
+ record JSON NOT NULL,
+ PRIMARY KEY(doi)
+ );
+
+For postgrest access, may need to also:
+
+ GRANT SELECT ON public.crossref TO web_anon;
+
+## SQL Backfill Command
+
+For an example file:
+
+ cat sample.json \
+ | jq -rc '[(.DOI | ascii_downcase), .indexed."date-time", (. | tostring)] | @tsv' \
+ | psql sandcrawler -c "COPY crossref (doi, indexed, record) FROM STDIN (DELIMITER E'\t');"
+
+For a full snapshot:
+
+ zcat crossref_public_data_file_2021_01.json.gz \
+ | pv -l \
+ | jq -rc '[(.DOI | ascii_downcase), .indexed."date-time", (. | tostring)] | @tsv' \
+ | psql sandcrawler -c "COPY crossref (doi, indexed, record) FROM STDIN (DELIMITER E'\t');"
+
+jq is the bottleneck (100% of a single CPU core).
+
+## Kafka Worker
+
+Pulls from the fatcat crossref ingest Kafka feed and persists into the crossref
+table.
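+
+A sketch of that worker, assuming confluent-kafka and psycopg2 (topic name and
+connection string are placeholders; the upsert mirrors the table schema and the
+jq field extraction above):
+
+    import json
+    import psycopg2
+    from confluent_kafka import Consumer
+
+    UPSERT_SQL = """
+        INSERT INTO crossref (doi, indexed, record)
+        VALUES (%s, %s, %s)
+        ON CONFLICT (doi) DO UPDATE
+        SET indexed = EXCLUDED.indexed, record = EXCLUDED.record
+    """
+
+    db = psycopg2.connect("dbname=sandcrawler")  # placeholder connection string
+    consumer = Consumer({
+        "bootstrap.servers": "kafka-broker:9092",  # placeholder
+        "group.id": "persist-crossref",
+    })
+    consumer.subscribe(["fatcat-prod.api-crossref"])  # placeholder topic name
+
+    while True:
+        msg = consumer.poll(1.0)
+        if msg is None or msg.error():
+            continue
+        record = json.loads(msg.value())
+        with db.cursor() as cur:
+            cur.execute(UPSERT_SQL, (record["DOI"].lower(),
+                                     record["indexed"]["date-time"],
+                                     json.dumps(record)))
+        db.commit()
+        consumer.commit(msg)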
+
+## SQL Table Disk Utilization
+
+An example backfill from early 2021, with about 120 million Crossref DOI
+records.
+
+Starting database size (with ingest running):
+
+ Filesystem Size Used Avail Use% Mounted on
+ /dev/vdb1 1.7T 896G 818G 53% /1
+
+ Size: 475.14G
+
+Ingest SQL command took:
+
+ 120M 15:06:08 [2.22k/s]
+ COPY 120684688
+
+After database size:
+
+ Filesystem Size Used Avail Use% Mounted on
+ /dev/vdb1 1.7T 1.2T 498G 71% /1
+
+ Size: 794.88G
+
+So about 320 GByte of disk.
diff --git a/proposals/2021-09-09_component_ingest.md b/proposals/2021-09-09_component_ingest.md
new file mode 100644
index 0000000..09dee4f
--- /dev/null
+++ b/proposals/2021-09-09_component_ingest.md
@@ -0,0 +1,114 @@
+
+File Ingest Mode: 'component'
+=============================
+
+A new ingest type for downloading individual files which are a subset of a
+complete work.
+
+Some publishers now assign DOIs to individual figures, supplements, and other
+"components" of an over release or document.
+
+Initial mimetypes to allow:
+
+- image/jpeg
+- image/tiff
+- image/png
+- image/gif
+- audio/mpeg
+- video/mp4
+- video/mpeg
+- text/plain
+- text/csv
+- application/json
+- application/xml
+- application/pdf
+- application/gzip
+- application/x-bzip
+- application/x-bzip2
+- application/zip
+- application/x-rar
+- application/x-7z-compressed
+- application/x-tar
+- application/vnd.ms-powerpoint
+- application/vnd.ms-excel
+- application/msword
+- application/vnd.openxmlformats-officedocument.wordprocessingml.document
+- application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
+
+Intentionally not supporting:
+
+- text/html
+
+
+## Fatcat Changes
+
+In the file importer, allow the additional mimetypes for 'component' ingest.
+
+
+## Ingest Changes
+
+Allow additional terminal mimetypes for 'component' crawls.
+
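+A sketch of what the allow-list check might look like on the ingest side
+(variable and function names here are illustrative, not the actual sandcrawler
+code):
+
+    # subset of the mimetype list above; trimmed for brevity
+    COMPONENT_MIMETYPES = {
+        "image/jpeg", "image/png", "image/tiff", "image/gif",
+        "audio/mpeg", "video/mp4", "video/mpeg",
+        "text/plain", "text/csv",
+        "application/json", "application/xml", "application/pdf",
+        "application/gzip", "application/zip", "application/x-tar",
+    }
+
+    def terminal_mimetype_allowed(ingest_type, mimetype):
+        # 'component' crawls accept the broad list; text/html is intentionally excluded
+        if ingest_type == "component":
+            return mimetype in COMPONENT_MIMETYPES
+        # other ingest types keep their existing, narrower allow-lists
+        return mimetype == "application/pdf"  # placeholder for existing behavior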
+
+## Examples
+
+Hundreds of thousands: <https://fatcat.wiki/release/search?q=type%3Acomponent+in_ia%3Afalse>
+
+#### ACS Supplement File
+
+<https://doi.org/10.1021/acscatal.0c02627.s002>
+
+Redirects directly to .zip in browser. SPN is blocked by cookie check.
+
+#### Frontiers .docx Supplement
+
+<https://doi.org/10.3389/fpls.2019.01642.s001>
+
+Redirects to full article page. There is a pop-up for figshare, seems hard to process.
+
+#### Figshare Single File
+
+<https://doi.org/10.6084/m9.figshare.13646972.v1>
+
+As 'component' type in fatcat.
+
+Redirects to a landing page. Dataset ingest seems more appropriate for this entire domain.
+
+#### PeerJ supplement file
+
+<https://doi.org/10.7717/peerj.10257/supp-7>
+
+PeerJ is hard because it redirects to a single HTML page, which has links to
+supplements in the HTML. Perhaps a custom extractor will work.
+
+#### eLife
+
+<https://doi.org/10.7554/elife.38407.010>
+
+The current crawl mechanism seemingly makes it impossible to extract a
+specific supplement on its own, as opposed to the document as a whole.
+
+#### Zookeys
+
+<https://doi.org/10.3897/zookeys.895.38576.figure53>
+
+These are extractable.
+
+#### OECD PDF Supplement
+
+<https://doi.org/10.1787/f08c6324-en>
+<https://www.oecd-ilibrary.org/trade/imports-of-services-billions-of-us-dollars_f08c6324-en>
+
+Has an Excel (.xls) link, great, but then paywall.
+
+#### Direct File Link
+
+<https://doi.org/10.1787/888934207500>
+
+This one is also OECD, but is a simple direct download.
+
+#### Protein Data Bank (PDB) Entry
+
+<https://doi.org/10.2210/pdb6ls2/pdb>
+
+Multiple files; dataset/fileset more appropriate for these.
diff --git a/proposals/2021-09-09_fileset_ingest.md b/proposals/2021-09-09_fileset_ingest.md
new file mode 100644
index 0000000..65c9ccf
--- /dev/null
+++ b/proposals/2021-09-09_fileset_ingest.md
@@ -0,0 +1,343 @@
+
+status: implemented
+
+Fileset Ingest Pipeline (for Datasets)
+======================================
+
+Sandcrawler currently has ingest support for individual files saved as `file`
+entities in fatcat (xml and pdf ingest types) and HTML files with
+sub-components saved as `webcapture` entities in fatcat (html ingest type).
+
+This document describes extensions to this ingest system to flexibly support
+groups of files, which may be represented in fatcat as `fileset` entities. The
+main new ingest type is `dataset`.
+
+Compared to the existing ingest process, there are two major complications with
+datasets:
+
+- the ingest process often requires more than parsing HTML files, and will be
+ specific to individual platforms and host software packages
+- the storage backend and fatcat entity type is flexible: a dataset might be
+ represented by a single file, multiple files combined in to a single .zip
+ file, or multiple separate files; the data may get archived in wayback or in
+ an archive.org item
+
+The new concepts of "strategy" and "platform" are introduced to accommodate
+these complications.
+
+
+## Ingest Strategies
+
+The ingest strategy describes the fatcat entity type that will be output; the
+storage backend used; and whether an enclosing file format is used. The
+strategy to use cannot be determined until the number and size of files are
+known. It is a function of file count, total file size, and publication
+platform.
+
+Strategy names are compact strings with the format
+`{storage_backend}-{fatcat_entity}`. A `-bundled` suffix after a `fileset`
+entity type indicates that metadata about multiple files is retained, but that
+in the storage backend only a single enclosing file (eg, `.zip`) will be
+stored.
+
+The supported strategies are:
+
+- `web-file`: single file of any type, stored in wayback, represented as fatcat `file`
+- `web-fileset`: multiple files of any type, stored in wayback, represented as fatcat `fileset`
+- `web-fileset-bundled`: single bundle file, stored in wayback, represented as fatcat `fileset`
+- `archiveorg-file`: single file of any type, stored in archive.org item, represented as fatcat `file`
+- `archiveorg-fileset`: multiple files of any type, stored in archive.org item, represented as fatcat `fileset`
+- `archiveorg-fileset-bundled`: single bundle file, stored in archive.org item, represented as fatcat `fileset`
+
+"Bundle" or "enclosing" files are things like .zip or .tar.gz. Not all .zip
+files are handled as bundles! Only when the transfer from the hosting platform
+is via a "download all as .zip" (or similar) do we consider a zipfile a
+"bundle" and index the interior files as a fileset.
+
+The term "bundle file" is used over "archive file" or "container file" to
+prevent confusion with the other use of those terms in the context of fatcat
+(container entities; archive; Internet Archive as an organization).
+
+The motivation for supporting both `web` and `archiveorg` is that `web` is
+somewhat simpler for small files, but `archiveorg` is better for larger groups
+of files (say more than 20) and larger total size (say more than 1 GByte total,
+or 128 MByte for any one file).
+
+The motivation for supporting "bundled" filesets is that there is only a single
+file to archive.
+
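+A sketch of how these size and count thresholds might drive strategy selection
+(the cutoffs follow the numbers above; platform-specific overrides and the
+"bundled" variants are omitted, and names are illustrative):
+
+    MAX_WEB_FILE_COUNT = 20
+    MAX_WEB_TOTAL_SIZE = 1024 * 1024 * 1024     # ~1 GByte total
+    MAX_WEB_SINGLE_SIZE = 128 * 1024 * 1024     # 128 MByte for any one file
+
+    def pick_strategy(file_count, total_size, largest_file_size):
+        if file_count == 1:
+            backend = "web" if total_size <= MAX_WEB_SINGLE_SIZE else "archiveorg"
+            return backend + "-file"
+        small_enough = (
+            file_count <= MAX_WEB_FILE_COUNT
+            and total_size <= MAX_WEB_TOTAL_SIZE
+            and largest_file_size <= MAX_WEB_SINGLE_SIZE
+        )
+        return "web-fileset" if small_enough else "archiveorg-fileset"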
+
+## Ingest Pseudocode
+
+1. Determine `platform`, which may involve resolving redirects and crawling a landing page.
+
+ a. currently we always crawl the ingest `base_url`, capturing a platform landing page
+ b. we don't currently handle the case of `base_url` leading to a non-HTML
+ terminal resource. the `component` ingest type does handle this
+
+2. Use platform-specific methods to fetch manifest metadata and decide on an `ingest_strategy`.
+
+ a. depending on platform, may include access URLs for multiple strategies
+      (eg, a URL for each file and a bundle URL), metadata about the item
+      (eg, for archive.org item upload), etc
+
+3. Use strategy-specific methods to archive all files in platform manifest, and verify manifest metadata.
+
+4. Summarize status and return structured result metadata.
+
+ a. if the strategy was `web-file` or `archiveorg-file`, potentially submit an
+ `ingest_file_result` object down the file ingest pipeline (Kafka topic and
+ later persist and fatcat import workers), with `dataset-file` ingest
+ type (or `{ingest_type}-file` more generally).
+
+New python types:
+
+ FilesetManifestFile
+ path: str
+ size: Optional[int]
+ md5: Optional[str]
+ sha1: Optional[str]
+ sha256: Optional[str]
+ mimetype: Optional[str]
+ extra: Optional[Dict[str, Any]]
+
+ status: Optional[str]
+ platform_url: Optional[str]
+ terminal_url: Optional[str]
+ terminal_dt: Optional[str]
+
+ FilesetPlatformItem
+ platform_name: str
+ platform_status: str
+ platform_domain: Optional[str]
+ platform_id: Optional[str]
+ manifest: Optional[List[FilesetManifestFile]]
+ archiveorg_item_name: Optional[str]
+ archiveorg_item_meta
+ web_base_url
+ web_bundle_url
+
+ ArchiveStrategyResult
+ ingest_strategy: str
+ status: str
+ manifest: List[FilesetManifestFile]
+ file_file_meta: Optional[dict]
+ file_terminal: Optional[dict]
+ file_cdx: Optional[dict]
+ bundle_file_meta: Optional[dict]
+ bundle_terminal: Optional[dict]
+ bundle_cdx: Optional[dict]
+ bundle_archiveorg_path: Optional[dict]
+
+New python APIs/classes:
+
+ FilesetPlatformHelper
+ match_request(request, resource, html_biblio) -> bool
+ does the request and landing page metadata indicate a match for this platform?
+ process_request(request, resource, html_biblio) -> FilesetPlatformItem
+ do API requests, parsing, etc to fetch metadata and access URLs for this fileset/dataset. platform-specific
+ chose_strategy(item: FilesetPlatformItem) -> IngestStrategy
+ select an archive strategy for the given fileset/dataset
+
+ FilesetIngestStrategy
+ check_existing(item: FilesetPlatformItem) -> Optional[ArchiveStrategyResult]
+ check the given backend for an existing capture/archive; if found, return result
+ process(item: FilesetPlatformItem) -> ArchiveStrategyResult
+ perform an actual archival capture
+
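+Tying these together, the fileset ingest worker would do roughly the following
+(a sketch only; `PLATFORM_HELPERS` and `STRATEGY_HELPERS` are assumed
+registries, and error handling, result assembly, and the single-file hand-off
+are omitted):
+
+    def process_fileset_request(request, resource, html_biblio):
+        # find the first platform helper that recognizes this landing page
+        helper = next(
+            (h for h in PLATFORM_HELPERS
+             if h.match_request(request, resource, html_biblio)),
+            None,
+        )
+        if helper is None:
+            return {"status": "no-platform-match"}
+
+        item = helper.process_request(request, resource, html_biblio)
+        strategy = helper.chose_strategy(item)   # spelling follows the API listing above
+        archiver = STRATEGY_HELPERS[strategy]
+
+        # re-use an existing capture if one is found, otherwise archive now
+        result = archiver.check_existing(item) or archiver.process(item)
+        return result
+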
+## Limits and Failure Modes
+
+- `too-large-size`: total size of the fileset is too large for archiving.
+ initial limit is 64 GBytes, controlled by `max_total_size` parameter.
+- `too-many-files`: number of files (and thus file-level metadata) is too
+ large. initial limit is 200, controlled by `max_file_count` parameter.
+- `platform-scope / FilesetPlatformScopeError`: for when `base_url` leads to a
+ valid platform, which could be found via API or parsing, but has the wrong
+ scope. Eg, tried to fetch a dataset, but got a DOI which represents all
+ versions of the dataset, not a specific version.
+- `platform-restricted`/`PlatformRestrictedError`: for, eg, embargoes
+- `platform-404`: got to a landing page, and seemed like in-scope, but no
+ platform record found anyways
+
+
+## New Sandcrawler Code and Worker
+
+ sandcrawler-ingest-fileset-worker@{1..6} (or up to 1..12 later)
+
+Worker consumes from ingest request topic, produces to fileset ingest results,
+and optionally produces to file ingest results.
+
+ sandcrawler-persist-ingest-fileset-worker@1
+
+Simply writes fileset ingest rows to SQL.
+
+
+## New Fatcat Worker and Code Changes
+
+ fatcat-import-ingest-fileset-worker
+
+This importer is modeled on the existing file and webcapture import workers.
+It filters for `success` status with a strategy of `*-fileset*`.
+
+Existing `fatcat-import-ingest-file-worker` should be updated to allow
+`dataset` single-file imports, with largely same behavior and semantics as
+current importer (`component` mode).
+
+Existing fatcat transforms, and possibly even elasticsearch schemas, should be
+updated to include fileset status and `in_ia` flag for dataset type releases.
+
+Existing entity updates worker submits `dataset` type ingests to ingest request
+topic.
+
+
+## Ingest Result Schema
+
+Common with file results, and mostly relating to landing page HTML:
+
+ hit: bool
+ status: str
+ success
+ success-existing
+ success-file (for `web-file` or `archiveorg-file` only)
+ request: object
+ terminal: object
+ file_meta: object
+ cdx: object
+ revisit_cdx: object
+ html_biblio: object
+
+Additional fileset-specific fields:
+
+ manifest: list of objects
+ platform_name: str
+ platform_domain: str
+ platform_id: str
+ platform_base_url: str
+ ingest_strategy: str
+ archiveorg_item_name: str (optional, only for `archiveorg-*` strategies)
+ file_count: int
+ total_size: int
+ fileset_bundle (optional, only for `*-fileset-bundle` strategy)
+ file_meta
+ cdx
+ revisit_cdx
+ terminal
+ archiveorg_bundle_path
+ fileset_file (optional, only for `*-file` strategy)
+ file_meta
+ terminal
+ cdx
+ revisit_cdx
+
+If the strategy was `web-file` or `archiveorg-file` and the status is
+`success-file`, then an ingest file result will also be published to
+`sandcrawler-ENV.ingest-file-results`, using the same ingest type and fields as
+regular ingest.
+
+
+All fileset ingest results get published to the ingest-fileset-results topic.
+
+Existing sandcrawler persist workers also subscribe to this topic and persist
+status and landing page terminal info to tables just like with file ingest.
+GROBID, HTML, and other metadata is not persisted in this path.
+
+If the ingest strategy was a single file (`*-file`), then an ingest file result is
+also published to the ingest-file-results topic, with the `fileset_file`
+metadata, and ingest type `dataset-file`. This should only happen on success
+condition.
+
+
+## New SQL Tables
+
+Note that this table *complements* `ingest_file_result`, doesn't replace it.
+`ingest_file_result` could more accurately be called `ingest_result`.
+
+ CREATE TABLE IF NOT EXISTS ingest_fileset_platform (
+ ingest_type TEXT NOT NULL CHECK (octet_length(ingest_type) >= 1),
+ base_url TEXT NOT NULL CHECK (octet_length(base_url) >= 1),
+ updated TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL,
+ hit BOOLEAN NOT NULL,
+ status TEXT CHECK (octet_length(status) >= 1),
+
+ platform_name TEXT NOT NULL CHECK (octet_length(platform_name) >= 1),
+ platform_domain TEXT NOT NULL CHECK (octet_length(platform_domain) >= 1),
+ platform_id TEXT NOT NULL CHECK (octet_length(platform_id) >= 1),
+ ingest_strategy TEXT CHECK (octet_length(ingest_strategy) >= 1),
+ total_size BIGINT,
+ file_count BIGINT,
+ archiveorg_item_name TEXT CHECK (octet_length(archiveorg_item_name) >= 1),
+
+ archiveorg_item_bundle_path TEXT CHECK (octet_length(archiveorg_item_bundle_path) >= 1),
+ web_bundle_url TEXT CHECK (octet_length(web_bundle_url) >= 1),
+ web_bundle_dt TEXT CHECK (octet_length(web_bundle_dt) = 14),
+
+ manifest JSONB,
+ -- list, similar to fatcat fileset manifest, plus extra:
+ -- status (str)
+ -- path (str)
+ -- size (int)
+ -- md5 (str)
+ -- sha1 (str)
+ -- sha256 (str)
+ -- mimetype (str)
+ -- extra (dict)
+ -- platform_url (str)
+ -- terminal_url (str)
+ -- terminal_dt (str)
+
+ PRIMARY KEY (ingest_type, base_url)
+ );
+ CREATE INDEX ingest_fileset_platform_name_domain_id_idx ON ingest_fileset_platform(platform_name, platform_domain, platform_id);
+
+Persist worker should only insert in to this table if `platform_name` is
+identified.
+
+## New Kafka Topic
+
+ sandcrawler-ENV.ingest-fileset-results 6x, no retention limit
+
+
+## Implementation Plan
+
+First implement ingest worker, including platform and strategy helpers, and
+test those as simple stdin/stdout CLI tools in sandcrawler repo to validate
+this proposal.
+
+Second implement fatcat importer and test locally and/or in QA.
+
+Lastly implement infrastructure, automation, and other "glue":
+
+- SQL schema
+- persist worker
+
+
+## Design Note: Single-File Datasets
+
+Should datasets and other groups of files which only contain a single file get
+imported as a fatcat `file` or `fileset`? This can be broken down further as
+documents (single PDF) vs other individual files.
+
+Advantages of `file`:
+
+- handles case of article PDFs being marked as dataset accidentally
+- `file` entities get de-duplicated with simple lookup (eg, on `sha1`)
+- conceptually simpler if individual files are `file` entities
+- easier to download individual files
+
+Advantages of `fileset`:
+
+- conceptually simpler if all `dataset` entities have `fileset` form factor
+- code path is simpler: one fewer strategy, and less complexity of sending
+ files down separate import path
+- metadata about platform is retained
+- would require no modification of existing fatcat file importer
+- fatcat import of archive.org-hosted `file` entities is not actually implemented yet?
+
+Decision is to do individual files. Fatcat fileset import worker should reject
+single-file (and empty) manifest filesets. Fatcat file import worker should
+accept all mimetypes for `dataset-file` (similar to `component`).
+
+
+## Example Entities
+
+See `notes/dataset_examples.txt`
diff --git a/proposals/2021-09-13_src_ingest.md b/proposals/2021-09-13_src_ingest.md
new file mode 100644
index 0000000..470827a
--- /dev/null
+++ b/proposals/2021-09-13_src_ingest.md
@@ -0,0 +1,53 @@
+
+File Ingest Mode: 'src'
+=======================
+
+Ingest type for "source" of works in document form. For example, tarballs of
+LaTeX source and figures, as published on arxiv.org and Pubmed Central.
+
+For now, presumption is that this would be a single file (`file` entity in
+fatcat).
+
+Initial mimetypes to allow:
+
+- text/x-tex
+- application/xml
+- application/gzip
+- application/x-bzip
+- application/x-bzip2
+- application/zip
+- application/x-tar
+- application/msword
+- application/vnd.openxmlformats-officedocument.wordprocessingml.document
+
+
+## Fatcat Changes
+
+In the file importer, allow the additional mimetypes for 'src' ingest.
+
+Might keep ingest disabled on the fatcat side, at least initially. Eg, until
+there is some scope of "file scope", or other ways of treating 'src' tarballs
+separate from PDFs or other fulltext formats.
+
+
+## Ingest Changes
+
+Allow additional terminal mimetypes for 'src' crawls.
+
+
+## Examples
+
+ arxiv:2109.00954v1
+ fatcat:release_akzp2lgqjbcbhpoeoitsj5k5hy
+ https://arxiv.org/format/2109.00954v1
+ https://arxiv.org/e-print/2109.00954v1
+
+ arxiv:1912.03397v2
+ https://arxiv.org/format/1912.03397v2
+ https://arxiv.org/e-print/1912.03397v2
+ NOT: https://arxiv.org/pdf/1912.03397v2
+
+ pmcid:PMC3767916
+ https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/08/03/PMC3767916.tar.gz
+
+For PMC, will need to use one of the .csv file lists to get the digit prefixes.
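+
+For example, something like the following could build a PMCID-to-URL mapping
+from one of those lists (the file name and column headers are assumptions and
+should be checked against the actual FTP listing):
+
+    import csv
+
+    PMC_BASE = "https://ftp.ncbi.nlm.nih.gov/pub/pmc/"
+
+    def load_pmc_paths(csv_path):
+        # assumed columns: "File" (relative tarball path, including the digit
+        # prefixes) and "Accession ID" (the PMCID)
+        paths = {}
+        with open(csv_path) as f:
+            for row in csv.DictReader(f):
+                paths[row["Accession ID"]] = PMC_BASE + row["File"]
+        return paths
+
+    # paths = load_pmc_paths("oa_file_list.csv")
+    # paths["PMC3767916"] -> ".../oa_package/08/03/PMC3767916.tar.gz"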
diff --git a/proposals/2021-09-21_spn_accounts.md b/proposals/2021-09-21_spn_accounts.md
new file mode 100644
index 0000000..e41c162
--- /dev/null
+++ b/proposals/2021-09-21_spn_accounts.md
@@ -0,0 +1,14 @@
+
+Formalization of SPNv2 API requests from fatcat/sandcrawler
+
+Create two new system accounts, one for regular/daily ingest requests, one for
+priority requests (save-paper-now or as a flag with things like fatcat-ingest;
+"interactive"). These accounts should have @archive.org emails. Request the
+daily one to have the same rate limit as the current bnewbold@archive.org
+account; the priority one can have less.
+
+Create new ingest kafka queues from scratch, one for priority and one for
+regular. Choose sizes carefully: probably keep 24x for the regular queue and do
+6x or so (small) for the priority queue.
+
+Deploy new priority workers; reconfigure/deploy broadly.
diff --git a/proposals/2021-10-28_grobid_refs.md b/proposals/2021-10-28_grobid_refs.md
new file mode 100644
index 0000000..1fc79b6
--- /dev/null
+++ b/proposals/2021-10-28_grobid_refs.md
@@ -0,0 +1,125 @@
+
+GROBID References in Sandcrawler DB
+===================================
+
+Want to start processing "unstructured" raw references coming from upstream
+metadata sources (distinct from upstream fulltext sources, like PDFs or JATS
+XML), and save the results in sandcrawler DB. From there, they will get pulled
+in to fatcat-scholar "intermediate bundles" and included in reference exports.
+
+The initial use case for this is to parse "unstructured" references deposited
+in Crossref, and include them in refcat.
+
+
+## Schema and Semantics
+
+The output JSON/dict schema for parsed references follows that of
+`grobid_tei_xml` version 0.1.x, for the `GrobidBiblio` field. The
+`unstructured` field that was parsed is included in the output, though it may
+not be byte-for-byte exact (see below). One notable change from the past (eg,
+older GROBID-parsed references) is that author `name` is now `full_name`. New
+fields include `editors` (same schema as `authors`), `book_title`, and
+`series_title`.
+
+The overall output schema matches that of the `grobid_refs` SQL table:
+
+ source: string, lower-case. eg 'crossref'
+ source_id: string, eg '10.1145/3366650.3366668'
+    source_ts: optional timestamp (full ISO datetime with timezone, eg `Z`
+        suffix) which identifies the version of the upstream metadata
+ refs_json: JSON, list of `GrobidBiblio` JSON objects
+
+References are re-processed on a per-article (or per-release) basis. All the
+references for an article are handled as a batch and output as a batch. If
+there are no upstream references, a row with `refs_json` as an empty list may be
+returned.
+
+Not all upstream references get re-parsed, even if an 'unstructured' field is
+available. If 'unstructured' is not available, no row is ever output. For
+example, if a reference includes `unstructured` (raw citation string), but also
+has structured metadata for authors, title, year, and journal name, we might
+not re-parse the `unstructured` string. Whether to re-parse is evaluated on a
+per-reference basis. This behavior may change over time.
+
+`unstructured` strings may be pre-processed before being submitted to GROBID.
+This is because many sources have systemic encoding issues. GROBID itself may
+also do some modification of the input citation string before returning it in
+the output. This means the `unstructured` string is not a reliable way to map
+between specific upstream references and parsed references. Instead, the `id`
+field (str) of `GrobidBiblio` gets set to any upstream "key" or "index"
+identifier used to track individual references. If there is only a numeric
+index, the `id` is that number as a string.
+
+The `key` or `id` may need to be woven back in to the ref objects manually,
+because GROBID `processCitationList` takes just a list of raw strings, with no
+attached reference-level key or id.
+
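+A sketch of that re-weaving step (`parse_citation_list` here is a stand-in for
+the actual GROBID client round-trip, not a specific library function):
+
+    def parse_refs_with_keys(raw_refs):
+        # raw_refs: list of (key, unstructured) pairs from the upstream record
+        keys = [k for k, _ in raw_refs]
+        unstructured = [u for _, u in raw_refs]
+
+        # stand-in: POST to GROBID processCitationList and parse the TEI-XML
+        # response into one GrobidBiblio-style dict per input string
+        parsed = parse_citation_list(unstructured)
+
+        # GROBID does not retain per-reference keys, so re-attach them here
+        for key, biblio in zip(keys, parsed):
+            biblio["id"] = str(key)
+        return parsed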
+
+## New SQL Table and View
+
+We may want to do re-parsing of references from sources other than `crossref`,
+so there is a generic `grobid_refs` table. But it is also common to fetch both
+the crossref metadata and any re-parsed references together, so as a convenience
+there is a PostgreSQL view (virtual table) that includes both a crossref
+metadata record and parsed citations, if available. If downstream code cares a
+lot about having the refs and record be in sync, the `source_ts` field on
+`grobid_refs` can be matched against the `indexed` column of `crossref` (or the
+`.indexed.date-time` JSON field in the record itself).
+
+Remember that DOIs should always be lower-cased before querying, inserting,
+comparing, etc.
+
+ CREATE TABLE IF NOT EXISTS grobid_refs (
+ source TEXT NOT NULL CHECK (octet_length(source) >= 1),
+ source_id TEXT NOT NULL CHECK (octet_length(source_id) >= 1),
+ source_ts TIMESTAMP WITH TIME ZONE,
+ updated TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL,
+ refs_json JSON NOT NULL,
+ PRIMARY KEY(source, source_id)
+ );
+
+ CREATE OR REPLACE VIEW crossref_with_refs (doi, indexed, record, source_ts, refs_json) AS
+ SELECT
+ crossref.doi as doi,
+ crossref.indexed as indexed,
+ crossref.record as record,
+ grobid_refs.source_ts as source_ts,
+ grobid_refs.refs_json as refs_json
+ FROM crossref
+ LEFT JOIN grobid_refs ON
+ grobid_refs.source_id = crossref.doi
+ AND grobid_refs.source = 'crossref';
+
+Both `grobid_refs` and `crossref_with_refs` will be exposed through postgrest.
+
+
+## New Workers / Tools
+
+For simplicity, to start, a single worker will consume from
+`fatcat-prod.api-crossref`, process citations with GROBID (if necessary), and
+insert into both the `crossref` and `grobid_refs` tables. This worker will run
+locally on the machine with sandcrawler-db.
+
+Another tool will support taking large chunks of Crossref JSON (as lines),
+filtering them, processing with GROBID, and printing JSON to stdout in the
+`grobid_refs` JSON schema.
+
+
+## Task Examples
+
+Command to process crossref records with refs tool:
+
+ cat crossref_sample.json \
+ | parallel -j5 --linebuffer --round-robin --pipe ./grobid_tool.py parse-crossref-refs - \
+ | pv -l \
+ > crossref_sample.parsed.json
+
+ # => 10.0k 0:00:27 [ 368 /s]
+
+Load directly in to postgres (after tables have been created):
+
+ cat crossref_sample.parsed.json \
+ | jq -rc '[.source, .source_id, .source_ts, (.refs_json | tostring)] | @tsv' \
+ | psql sandcrawler -c "COPY grobid_refs (source, source_id, source_ts, refs_json) FROM STDIN (DELIMITER E'\t');"
+
+ # => COPY 9999
diff --git a/proposals/2021-12-09_trawling.md b/proposals/2021-12-09_trawling.md
new file mode 100644
index 0000000..33b6b4c
--- /dev/null
+++ b/proposals/2021-12-09_trawling.md
@@ -0,0 +1,180 @@
+
+status: work-in-progress
+
+NOTE: as of December 2022, the implementation of these features has not been
+merged to the main branch. Development stalled in December 2021.
+
+Trawling for Unstructured Scholarly Web Content
+===============================================
+
+## Background and Motivation
+
+A long-term goal for sandcrawler has been the ability to pick through
+unstructured web archive content (or even non-web collections), identify
+potential in-scope research outputs, extract metadata for those outputs, and
+merge the content into a catalog (fatcat).
+
+This process requires integration of many existing tools (HTML and PDF
+extraction; fuzzy bibliographic metadata matching; machine learning to identify
+in-scope content; etc), as well as high-level curation, targeting, and
+evaluation by human operators. The goal is to augment and improve the
+productivity of human operators as much as possible.
+
+This process will be similar to "ingest", which is where we start with a
+specific URL and have some additional context about the expected result (eg,
+content type, external identifier). Some differences with trawling are that we
+start with a collection or context (instead of a single URL); have little or
+no context about the content we are looking for; and may even be creating a new
+catalog entry, as opposed to matching to a known existing entry.
+
+
+## Architecture
+
+The core operation is to take a resource and run a flowchart of processing
+steps on it, resulting in an overall status and possible related metadata. The
+common case is that the resource is a PDF or HTML coming from wayback (with
+contextual metadata about the capture), but we should be flexible to supporting
+more content types in the future, and should try to support plain files with no
+context as well.
+
+Some relatively simple wrapper code handles fetching resources and summarizing
+status/counts.
+
+Outside of the scope of sandcrawler, new fatcat code (importer or similar) will
+be needed to handle trawl results. It will probably make sense to pre-filter
+(with `jq` or `rg`) before passing results to fatcat.
+
+At this stage, trawl workers will probably be run manually. Some successful
+outputs (like GROBID, HTML metadata) would be written to existing kafka topics
+to be persisted, but there would not be any specific `trawl` SQL tables or
+automation.
+
+It will probably be helpful to have some kind of wrapper script that can run
+sandcrawler trawl processes, then filter and pipe the output into fatcat
+importer, all from a single invocation, while reporting results.
+
+TODO:
+- for HTML imports, do we fetch the full webcapture stuff and return that?
+
+
+## Methods of Operation
+
+### `cdx_file`
+
+An existing CDX file is provided on-disk locally.
+
+### `cdx_api`
+
+Simplified variants: `cdx_domain`, `cdx_surt`
+
+Uses CDX API to download records matching the configured filters, then processes the file.
+
+Saves the CDX file intermediate result somewhere locally (working or tmp
+directory), with timestamp in the path, to make re-trying with `cdx_file` fast
+and easy.
+
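+A sketch of the CDX API fetch (parameter names follow the public wayback CDX
+server API; the filter values shown are just examples):
+
+    import requests
+
+    CDX_API = "https://web.archive.org/cdx/search/cdx"
+
+    def fetch_cdx(url_pattern, match_type="domain", mimetype="application/pdf"):
+        # returns raw CDX lines, which get written to a local file for re-use
+        resp = requests.get(CDX_API, params={
+            "url": url_pattern,
+            "matchType": match_type,            # exact, prefix, host, or domain
+            "filter": ["mimetype:" + mimetype, "statuscode:200"],
+            "limit": 10000,                     # illustrative page size
+        })
+        resp.raise_for_status()
+        return resp.text.splitlines()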
+
+### `archiveorg_web_collection`
+
+Uses `cdx_collection.py` (or similar) to fetch a full CDX list by iterating
+over the collection's items, then processes it.
+
+Saves the CDX file intermediate result somewhere locally (working or tmp
+directory), with timestamp in the path, to make re-trying with `cdx_file` fast
+and easy.
+
+### Others
+
+- `archiveorg_file_collection`: fetch file list via archive.org metadata, then processes each
+
+## Schema
+
+Per-resource results:
+
+ hit (bool)
+ indicates whether resource seems in scope and was processed successfully
+ (roughly, status 'success', and
+ status (str)
+ success: fetched resource, ran processing, pa
+ skip-cdx: filtered before even fetching resource
+ skip-resource: filtered after fetching resource
+ wayback-error (etc): problem fetching
+ content_scope (str)
+ filtered-{filtertype}
+ article (etc)
+ landing-page
+ resource_type (str)
+ pdf, html
+ file_meta{}
+ cdx{}
+ revisit_cdx{}
+
+ # below are resource_type specific
+ grobid
+ pdf_meta
+ pdf_trio
+ html_biblio
+ (other heuristics and ML)
+
+High-level request:
+
+ trawl_method: str
+ cdx_file_path
+ default_filters: bool
+ resource_filters[]
+ scope: str
+ surt_prefix, domain, host, mimetype, size, datetime, resource_type, http_status
+ value: any
+ values[]: any
+ min: any
+ max: any
+ biblio_context{}: set of expected/default values
+ container_id
+ release_type
+ release_stage
+ url_rel
+
+High-level summary / results:
+
+ status
+ request{}: the entire request object
+ counts
+ total_resources
+ status{}
+ content_scope{}
+ resource_type{}
+
+## Example Corpuses
+
+All PDFs (`application/pdf`) in web.archive.org from before the year 2000.
+Starting point would be a CDX list.
+
+Spidering crawls starting from a set of OA journal homepage URLs.
+
+Archive-It partner collections from research universities, particularly of
+their own .edu domains. Starting point would be an archive.org collection, from
+which WARC files or CDX lists can be accessed.
+
+General archive.org PDF collections, such as
+[ERIC](https://archive.org/details/ericarchive) or
+[Document Cloud](https://archive.org/details/documentcloud).
+
+Specific Journal or Publisher URL patterns. Starting point could be a domain,
+hostname, SURT prefix, and/or URL regex.
+
+Heuristic patterns over full web.archive.org CDX index. For example, .edu
+domains with user directories and a `.pdf` in the file path ("tilde" username
+pattern).
+
+Random samples of entire Wayback corpus. For example, random samples filtered
+by date, content type, TLD, etc. This would be true "trawling" over the entire
+corpus.
+
+
+## Other Ideas
+
+Could have a web archive spidering mode: starting from a seed, fetch multiple
+captures (different captures), then extract outlinks from those, up to some
+number of hops. An example application would be links to research group
+webpages or author homepages, and to try to extract PDF links from CVs, etc.
+
diff --git a/proposals/brainstorm/2021-debug_web_interface.md b/proposals/brainstorm/2021-debug_web_interface.md
new file mode 100644
index 0000000..442b439
--- /dev/null
+++ b/proposals/brainstorm/2021-debug_web_interface.md
@@ -0,0 +1,9 @@
+
+status: brainstorm idea
+
+Simple internal-only web interface to help debug ingest issues.
+
+- paste a hash, URL, or identifier and get a display of "everything we know" about it
+- enter a URL/SURT prefix and get aggregate stats (?)
+- enter a domain/host/prefix and get recent attempts/results
+- pre-computed periodic reports on ingest pipeline (?)
diff --git a/proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md b/proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md
new file mode 100644
index 0000000..b3ad447
--- /dev/null
+++ b/proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md
@@ -0,0 +1,36 @@
+
+status: brainstorming
+
+We continue to see issues with SPNv2-based crawling. Would like to have an
+option to switch to higher-throughput heritrix3-based crawling.
+
+SPNv2 path would stick around at least for save-paper-now style ingest.
+
+
+## Sketch
+
+Ingest requests are created continuously by fatcat, with daily spikes.
+
+Ingest workers run mostly in "bulk" mode, aka they don't make SPNv2 calls.
+`no-capture` responses are recorded in sandcrawler SQL database.
+
+Periodically (daily?), a script queries for new no-capture results, filtered to
+the most recent period. These are processed into a URL list, then
+converted to a heritrix frontier, and sent to crawlers. This could either be an
+h3 instance (?), or simple `scp` to a running crawl directory.
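+
+A sketch of that periodic query (column names are assumed from the other
+ingest tables in these proposals, and should be checked against the actual
+`ingest_file_result` schema):
+
+    import psycopg2
+
+    def dump_recent_no_capture(out_path):
+        db = psycopg2.connect("dbname=sandcrawler")
+        cur = db.cursor()
+        cur.execute("""
+            SELECT DISTINCT base_url
+            FROM ingest_file_result
+            WHERE status = 'no-capture'
+              AND updated > now() - interval '1 day'
+        """)
+        with open(out_path, "w") as f:
+            for (url,) in cur:
+                f.write(url + "\n")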
+
+The crawler crawls, with usual landing page config, and draintasker runs.
+
+TODO: can we have draintasker/heritrix set a maximum WARC life? Like 6 hours?
+or, target a smaller draintasker item size, so they get updated more frequently
+
+Another SQL script dumps ingest requests from the *previous* period, and
+re-submits them for bulk-style ingest (by workers).
+
+The end result would be things getting crawled and updated within a couple
+days.
+
+
+## Sketch 2
+
+Upload URL list to petabox item, wait for heritrix derive to run (!)