diff options
Diffstat (limited to 'proposals')
-rw-r--r-- | proposals/2018_original_sandcrawler_rfc.md | 180 | ||||
-rw-r--r-- | proposals/2019_ingest.md | 6 | ||||
-rw-r--r-- | proposals/20200129_pdf_ingest.md | 10 | ||||
-rw-r--r-- | proposals/20200207_pdftrio.md | 5 | ||||
-rw-r--r-- | proposals/20200211_nsq.md | 79 | ||||
-rw-r--r-- | proposals/20201012_no_capture.md | 39 | ||||
-rw-r--r-- | proposals/20201026_html_ingest.md | 129 | ||||
-rw-r--r-- | proposals/20201103_xml_ingest.md | 64 | ||||
-rw-r--r-- | proposals/2020_pdf_meta_thumbnails.md | 328 | ||||
-rw-r--r-- | proposals/2020_seaweed_s3.md | 426 | ||||
-rw-r--r-- | proposals/2021-04-22_crossref_db.md | 86 | ||||
-rw-r--r-- | proposals/2021-09-09_component_ingest.md | 114 | ||||
-rw-r--r-- | proposals/2021-09-09_fileset_ingest.md | 343 | ||||
-rw-r--r-- | proposals/2021-09-13_src_ingest.md | 53 | ||||
-rw-r--r-- | proposals/2021-09-21_spn_accounts.md | 14 | ||||
-rw-r--r-- | proposals/2021-10-28_grobid_refs.md | 125 | ||||
-rw-r--r-- | proposals/2021-12-09_trawling.md | 180 | ||||
-rw-r--r-- | proposals/brainstorm/2021-debug_web_interface.md | 9 | ||||
-rw-r--r-- | proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md | 36 |
19 files changed, 2217 insertions, 9 deletions
diff --git a/proposals/2018_original_sandcrawler_rfc.md b/proposals/2018_original_sandcrawler_rfc.md new file mode 100644 index 0000000..ecf7ab8 --- /dev/null +++ b/proposals/2018_original_sandcrawler_rfc.md @@ -0,0 +1,180 @@ + +**Title:** Journal Archiving Pipeline + +**Author:** Bryan Newbold <bnewbold@archive.org> + +**Date:** March 2018 + +**Status:** work-in-progress + +This is an RFC-style technical proposal for a journal crawling, archiving, +extracting, resolving, and cataloging pipeline. + +Design work funded by a Mellon Foundation grant in 2018. + +## Overview + +Let's start with data stores first: + +- crawled original fulltext (PDF, JATS, HTML) ends up in petabox/global-wayback +- file-level extracted fulltext and metadata is stored in HBase, with the hash + of the original file as the key +- cleaned metadata is stored in a "catalog" relational (SQL) database (probably + PostgreSQL or some hip scalable NewSQL thing compatible with Postgres or + MariaDB) + +**Resources:** back-of-the-envelope, around 100 TB petabox storage total (for +100 million PDF files); 10-20 TB HBase table total. Can start small. + + +All "system" (aka, pipeline) state (eg, "what work has been done") is ephemeral +and is rederived relatively easily (but might be cached for performance). + +The overall "top-down", metadata-driven cycle is: + +1. Partners and public sources provide metadata (for catalog) and seed lists + (for crawlers) +2. Crawlers pull in fulltext and HTTP/HTML metadata from the public web +3. Extractors parse raw fulltext files (PDFs) and store structured metadata (in + HBase) +4. Data Mungers match extracted metadata (from HBase) against the catalog, or + create new records if none found. + +In the "bottom up" cycle, batch jobs run as map/reduce jobs against the +catalog, HBase, global wayback, and partner metadata datasets to identify +potential new public or already-archived content to process, and pushes tasks +to the crawlers, extractors, and mungers. 
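A minimal sketch of the HBase keying scheme described above (hash of the original file as row key); the column-family names here (`file:`, `grobid:`) are hypothetical, not the actual schema:

```python
import hashlib


def hbase_row(pdf_bytes, tei_xml):
    # row key is the hash of the original file, so re-crawled copies of the
    # same PDF de-dupe to a single row
    key = hashlib.sha1(pdf_bytes).hexdigest()
    # column families/qualifiers below are illustrative assumptions
    row = {
        "file:size": str(len(pdf_bytes)),
        "grobid:tei_xml": tei_xml,
    }
    return key, row
```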
+ +## Partner Metadata + +Periodic Luigi scripts run on a regular VM to pull in metadata from partners. +All metadata is saved to either petabox (for public stuff) or HDFS (for +restricted). Scripts process/munge the data and push directly to the catalog +(for trusted/authoritative sources like Crossref, ISSN, PubMed, DOAJ); others +extract seedlists and push to the crawlers. + +**Resources:** 1 VM (could be a devbox), with a large attached disk (spinning +probably ok) + +## Crawling + +All fulltext content comes in from the public web via crawling, and all crawled +content ends up in global wayback. + +One or more VMs serve as perpetual crawlers, with multiple active ("perpetual") +Heritrix crawls operating with differing configuration. These could be +orchestrated (like h3), or just have the crawl jobs cut off and restarted every +year or so. + +In a starter configuration, there would be two crawl queues. One would target +direct PDF links, landing pages, author homepages, DOI redirects, etc. It would +process HTML and look for PDF outlinks, but wouldn't crawl recursively. + +HBase is used for de-dupe, with records (pointers) stored in WARCs. + +A second config would take seeds as entire journal websites, and would crawl +continuously. + +Other components of the system "push" tasks to the crawlers by copying schedule +files into the crawl action directories. + +WARCs would be uploaded into petabox via draintasker as usual, and CDX +derivation would be left to the derive process. Other processes are notified of +"new crawl content" being available when they see new unprocessed CDX files in +items from specific collections. draintasker could be configured to "cut" new +items every 24 hours at most to ensure this pipeline moves along regularly, or +we could come up with other hacks to get lower "latency" at this stage. 
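Pushing tasks "by copying schedule files into the crawl action directories" could look like the following sketch; the action-directory path and the `.seeds` filename convention are assumptions about the Heritrix job layout:

```shell
# Sketch: atomically deliver a seed list into a Heritrix job's action directory.
push_seeds() {
    action_dir="$1"
    seeds_file="$2"
    # copy under a hidden temp name first, so the crawler never picks up a
    # partially-written file, then rename into place
    tmp="$action_dir/.incoming-$$"
    cp "$seeds_file" "$tmp"
    mv "$tmp" "$action_dir/$(date +%s)-new.seeds"
}
```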
+ +**Resources:** 1-2 crawler VMs, each with a large attached disk (spinning) + +### De-Dupe Efficiency + +We would certainly feed CDX info from all bulk journal crawling into HBase +before any additional large crawling, to get that level of de-dupe. + +As to whether all GWB PDFs should be de-duped against is a policy question: is +there something special about the journal-specific crawls that makes it worth +having second copies? Eg, if we had previously domain crawled and access is +restricted, we then wouldn't be allowed to provide researcher access to those +files... on the other hand, we could extract for researchers given that we +"refound" the content at a new URL? + +Only fulltext files (PDFs) would be de-duped against (by content), so we'd be +recrawling lots of HTML. Presumably this is a fraction of crawl data size; what +fraction? + +Watermarked files would be refreshed repeatedly from the same PDF, and even +extracted/processed repeatedly (because the hash would be different). This is +hard to de-dupe/skip, because we would want to catch "content drift" (changes +in files). + +## Extractors + +Off-the-shelf PDF extraction software runs on high-CPU VM nodes (probably +GROBID running on 1-2 data nodes, which have 30+ CPU cores and plenty of RAM +and network throughput). + +A hadoop streaming job (written in python) takes a CDX file as task input. It +filters for only PDFs, and then checks each line against HBase to see if it has +already been extracted. If it hasn't, the script downloads directly from +petabox using the full CDX info (bypassing wayback, which would be a +bottleneck). It optionally runs any "quick check" scripts to see if the PDF +should be skipped ("definitely not a scholarly work"), then, if it looks OK, +submits the file over HTTP to the GROBID worker pool for extraction. The +results are pushed to HBase, and a short status line written to Hadoop. 
The +overall Hadoop job has a reduce phase that generates a human-meaningful report +of job status (eg, number of corrupt files) for monitoring. + +A side job as part of extracting can "score" the extracted metadata to flag +problems with GROBID, to be used as potential training data for improvement. + +**Resources:** 1-2 datanode VMs; hadoop cluster time. Needed up-front for +backlog processing; less CPU needed over time. + +## Matchers + +The matcher runs as a "scan" HBase map/reduce job over new (unprocessed) HBase +rows. It pulls just the basic metadata (title, author, identifiers, abstract) +and calls the catalog API to identify potential match candidates. If no match +is found, and the metadata "looks good" based on some filters (to remove, eg, +spam), works are inserted into the catalog (eg, for those works that don't have +globally available identifiers or other metadata; "long tail" and legacy +content). + +**Resources:** Hadoop cluster time + +## Catalog + +The catalog is a versioned relational database. All scripts interact with an +API server (instead of connecting directly to the database). It should be +reliable and low-latency for simple reads, so it can be relied on to provide a +public-facing API and have public web interfaces built on top. This is in +contrast to Hadoop, which for the most part could go down with no public-facing +impact (other than fulltext API queries). The catalog does not contain +copyrightable material, but it does contain strong (verified) links to fulltext +content. Policy gets implemented here if necessary. + +A global "changelog" (append-only log) is used in the catalog to record every +change, allowing for easier replication (internal or external, to partners). As +little as possible is implemented in the catalog itself; instead helper and +cleanup bots use the API to propose and verify edits, similar to the wikidata +and git data models. + +Public APIs and any front-end services are built on the catalog. 
Elasticsearch +(for metadata or fulltext search) could build on top of the catalog. + +**Resources:** Unknown, but estimate 1+ TB of SSD storage each on 2 or more +database machines + +## Machine Learning and "Bottom Up" + +TBD. + +## Logistics + +Ansible is used to deploy all components. Luigi is used as a task scheduler for +batch jobs, with cron to initiate periodic tasks. Errors and actionable +problems are aggregated in Sentry. + +Logging, metrics, and other debugging and monitoring are TBD. + diff --git a/proposals/2019_ingest.md b/proposals/2019_ingest.md index c649809..768784f 100644 --- a/proposals/2019_ingest.md +++ b/proposals/2019_ingest.md @@ -1,5 +1,5 @@ -status: work-in-progress +status: deployed This document proposes structure and systems for ingesting (crawling) paper PDFs and other content as part of sandcrawler. @@ -84,7 +84,7 @@ HTML? Or both? Let's just recrawl. *IngestRequest* - `ingest_type`: required, one of `pdf`, `xml`, `html`, `dataset`. For backwards compatibility, `file` should be interpreted as `pdf`. `pdf` and - `xml` return file ingest respose; `html` and `dataset` not implemented but + `xml` return file ingest response; `html` and `dataset` not implemented but would be webcapture (wayback) and fileset (archive.org item or wayback?). In the future: `epub`, `video`, `git`, etc. - `base_url`: required, where to start crawl process @@ -258,7 +258,7 @@ and hacks to crawl publicly available papers. Related existing work includes [unpaywall's crawler][unpaywall_crawl], LOCKSS extraction code, dissem.in's efforts, zotero's bibliography extractor, etc. The "memento tracer" work is also similar. Many of these are even in python! It would be great to reduce -duplicated work and maintenance. An analagous system in the wild is youtube-dl +duplicated work and maintenance. An analogous system in the wild is youtube-dl for downloading video from many sources. 
[unpaywall_crawl]: https://github.com/ourresearch/oadoi/blob/master/webpage.py diff --git a/proposals/20200129_pdf_ingest.md b/proposals/20200129_pdf_ingest.md index 9469217..157607e 100644 --- a/proposals/20200129_pdf_ingest.md +++ b/proposals/20200129_pdf_ingest.md @@ -1,5 +1,5 @@ -status: planned +status: deployed 2020q1 Fulltext PDF Ingest Plan =================================== @@ -27,7 +27,7 @@ There are a few million papers in fatacat which: 2. are known OA, usually because publication is Gold OA 3. don't have any fulltext PDF in fatcat -As a detail, some of these "known OA" journals actually have embargos (aka, +As a detail, some of these "known OA" journals actually have embargoes (aka, they aren't true Gold OA). In particular, those marked via EZB OA "color", and recent pubmed central ids. @@ -104,7 +104,7 @@ Actions: update ingest result table with status. - fetch new MAG and unpaywall seedlists, transform to ingest requests, persist into ingest request table. use SQL to dump only the *new* URLs (not seen in - previous dumps) using the created timestamp, outputing new bulk ingest + previous dumps) using the created timestamp, outputting new bulk ingest request lists. if possible, de-dupe between these two. then start bulk heritrix crawls over these two long lists. Probably sharded over several machines. Could also run serially (first one, then the other, with @@ -133,7 +133,7 @@ We have run GROBID+glutton over basically all of these PDFs. We should be able to do a SQL query to select PDFs that: - have at least one known CDX row -- GROBID processed successfuly and glutton matched to a fatcat release +- GROBID processed successfully and glutton matched to a fatcat release - do not have an existing fatcat file (based on sha1hex) - output GROBID metadata, `file_meta`, and one or more CDX rows @@ -161,7 +161,7 @@ Coding Tasks: Actions: - update `fatcat_file` sandcrawler table -- check how many PDFs this might ammount to. 
both by uniq SHA1 and uniq +- check how many PDFs this might amount to. both by uniq SHA1 and uniq `fatcat_release` matches - do some manual random QA verification to check that this method results in quality content in fatcat diff --git a/proposals/20200207_pdftrio.md b/proposals/20200207_pdftrio.md index 31a2db6..6f6443f 100644 --- a/proposals/20200207_pdftrio.md +++ b/proposals/20200207_pdftrio.md @@ -1,5 +1,8 @@ -status: in progress +status: deployed + +NOTE: while this has been used in production, as of December 2022 the results +are not used much in practice, and we don't score every PDF that comes along PDF Trio (ML Classification) ============================== diff --git a/proposals/20200211_nsq.md b/proposals/20200211_nsq.md new file mode 100644 index 0000000..6aa885b --- /dev/null +++ b/proposals/20200211_nsq.md @@ -0,0 +1,79 @@ + +status: planned + +In short, Kafka is not working well as a job task scheduler, and I want to try +NSQ as a medium-term solution to that problem. + + +## Motivation + +Thinking of setting up NSQ to use for scheduling distributed work, to replace +kafka for some topics. for example, "regrobid" requests where we enqueue +millions of, basically, CDX lines, and want to process on dozens of cores or +multiple machines. or file ingest backfill. results would still go to kafka (to +persist), and pipelines like DOI harvest -> import -> elasticsearch would still +be kafka + +The pain point with kafka is having dozens of workers on tasks that take more +than a couple seconds per task. we could keep tweaking kafka and writing weird +consumer group things to handle this, but I think it will never work very well. +NSQ supports re-queues with delay (eg, on failure, defer to re-process later), +allows many workers to connect and leave with no disruption, messages don't +have to be processed in order, and has a very simple enqueue API (HTTP POST). + +The slowish tasks we have now are file ingest (wayback and/or SPNv2 + +GROBID) and re-GROBID. 
In the near future will also have ML backlog to go +through. + +Throughput isn't much of a concern as tasks take 10+ seconds each. + + +## Specific Plan + +Continue publishing ingest requests to Kafka topic. Have a new persist worker +consume from this topic and push to request table (but not result table) using +`ON CONFLICT DO NOTHING`. Have a new single-process kafka consumer pull from +the topic and push to NSQ. This consumer monitors NSQ and doesn't push too many +requests (eg, 1k maximum). NSQ could potentially even run as in-memory mode. +New worker/pusher class that acts as an NSQ client, possibly with parallelism. + +*Clean* NSQ shutdown/restart always persists data locally to disk. + +Unclean shutdown (eg, power failure) would mean NSQ might have lost state. +Because we are persisting requests to sandcrawler-db, cleanup is simple: +re-enqueue all requests from the past N days with null result or result older +than M days. + +Still need multiple kafka and NSQ topics to have priority queues (eg, bulk, +platform-specific). + +To start, have a single static NSQ host; don't need nsqlookupd. Could use +wbgrp-svc506 (datanode VM with SSD, lots of CPU and RAM). + +To move hosts, simply restart the kafka pusher pointing at the new NSQ host. +When the old host's queue is empty, restart the workers to consume from the new +host, and destroy the old NSQ host. + + +## Alternatives + +Work arounds i've done to date have been using the `grobid_tool.py` or +`ingest_tool.py` JSON input modes to pipe JSON task files (millions of lines) +through GNU/parallel. I guess GNU/parallel's distributed mode is also an option +here. + +Other things that could be used: + +**celery**: popular, many features. need to run separate redis, no disk persistence (?) + +**disque**: need to run redis, no disk persistence (?) <https://github.com/antirez/disque> + +**gearman**: <http://gearman.org/> no disk persistence (?) 
+ + +## Old Notes + +TBD if we would want to switch ingest requests from fatcat -> sandcrawler over, +and have the continuous ingests run out of NSQ, or keep using kafka for that. +Currently we can only do up to 10x parallelism or so with SPNv2, so that isn't a +scaling pain point diff --git a/proposals/20201012_no_capture.md b/proposals/20201012_no_capture.md new file mode 100644 index 0000000..7f6a1f5 --- /dev/null +++ b/proposals/20201012_no_capture.md @@ -0,0 +1,39 @@ + +status: work-in-progress + +NOTE: as of December 2022, bnewbold can't remember if this was fully +implemented or not. + +Storing no-capture missing URLs in `terminal_url` +================================================= + +Currently, when the bulk-mode ingest code terminates with a `no-capture` +status, the missing URL (which is not in GWB CDX) is not stored in +sandcrawler-db. This proposed change is to include it in the existing +`terminal_url` database column, with the `terminal_status_code` and +`terminal_dt` columns empty. + +The implementation is rather simple: + +- CDX lookup code path should save the *actual* final missing URL (`next_url` + after redirects) in the result object's `terminal_url` field +- ensure that this field gets passed through all the way to the database on the + `no-capture` code path + +This does change the semantics of the `terminal_url` field somewhat, and +could break existing assumptions, so it is being documented in this proposal +document. + + +## Alternatives + +The current status quo is to store the missing URL as the last element in the +"hops" field of the JSON structure. We could keep this and have a convoluted +pipeline that would read from the Kafka feed and extract them, but this would +be messy. Eg, re-ingesting would not update the old kafka messages, so we would +need some accounting of consumer group offsets after which missing URLs are +truly missing. 
+ +We could add a new `missing_url` database column and field to the JSON schema, +for this specific use case. This seems like unnecessary extra work. + diff --git a/proposals/20201026_html_ingest.md b/proposals/20201026_html_ingest.md new file mode 100644 index 0000000..785471b --- /dev/null +++ b/proposals/20201026_html_ingest.md @@ -0,0 +1,129 @@ + +status: deployed + +HTML Ingest Pipeline +======================== + +Basic goal: given an ingest request of type 'html', output an object (JSON) +which could be imported into fatcat. + +Should work with things like (scholarly) blog posts, micropubs, registrations, +protocols. Doesn't need to work with everything to start. "Platform" sites +(like youtube, figshare, etc) will probably be a different ingest worker. + +A current unknown is what the expected size of this metadata is. Both in number +of documents and amount of metadata per document. + +Example HTML articles to start testing: + +- complex distill article: <https://distill.pub/2020/bayesian-optimization/> +- old HTML journal: <http://web.archive.org/web/20081120141926fw_/http://www.mundanebehavior.org/issues/v5n1/rosen.htm> +- NIH pub: <https://www.nlm.nih.gov/pubs/techbull/ja02/ja02_locatorplus_merge.html> +- first mondays (OJS): <https://firstmonday.org/ojs/index.php/fm/article/view/10274/9729> +- d-lib: <http://www.dlib.org/dlib/july17/williams/07williams.html> + + +## Ingest Process + +Follow base URL to terminal document, which is assumed to be a status=200 HTML document. + +Verify that terminal document is fulltext. Extract both metadata and fulltext. + +Extract list of sub-resources. Filter out unwanted (eg favicon, analytics, +unnecessary), apply a sanity limit. Convert to fully qualified URLs. For each +sub-resource, fetch down to the terminal resource, and compute hashes/metadata. + +Open questions: + +- will probably want to parallelize sub-resource fetching. async? 
+- behavior when fetching a sub-resource fails + + +## Ingest Result Schema + +JSON should be basically compatible with existing `ingest_file_result` objects, +with some new sub-objects. + +Overall object (`IngestWebResult`): + +- `status`: str +- `hit`: bool +- `error_message`: optional, if an error +- `hops`: optional, array of URLs +- `cdx`: optional; single CDX row of primary HTML document +- `terminal`: optional; same as ingest result + - `terminal_url` + - `terminal_dt` + - `terminal_status_code` + - `terminal_sha1hex` +- `request`: optional but usually present; ingest request object, verbatim +- `file_meta`: optional; file metadata about primary HTML document +- `html_biblio`: optional; extracted biblio metadata from primary HTML document +- `scope`: optional; detected/guessed scope (fulltext, etc) +- `html_resources`: optional; array of sub-resources. primary HTML is not included +- `html_body`: optional; just the status code and some metadata is passed through; + actual document would go through a different KafkaTopic + - `status`: str + - `agent`: str, eg "trafilatura/0.4" + - `tei_xml`: optional, str + - `word_count`: optional, int + + +## New SQL Tables + +`html_meta` + sha1hex (primary key) + updated (of SQL row) + status + scope + has_teixml + has_thumbnail + word_count (from teixml fulltext) + biblio (JSON) + resources (JSON) + +Also writes to `ingest_file_result`, `file_meta`, and `cdx`, all only for the base HTML document. + +Note: needed to enable postgrest access to this table (for scholar worker). + + +## Fatcat API Wants + +Would be nice to have lookup by SURT+timestamp, and/or by sha1hex of terminal base file. + +`hide` option for cdx rows; also for fileset equivalent. + + +## New Workers + +Could reuse existing worker, have code branch depending on type of ingest. 
+ +ingest file worker + => same as existing worker, because it could be calling SPN + +persist result + => same as existing worker; adds persisting various HTML metadata + +persist html text + => talks to seaweedfs + + +## New Kafka Topics + +HTML ingest result topic (webcapture-ish) + +sandcrawler-ENV.html-teixml + JSON wrapping TEI-XML (same as other fulltext topics) + key compaction and content compression enabled + +JSON schema: + +- `key` and `sha1hex`: str; used as kafka key +- `status`: str +- `tei_xml`: str, optional +- `word_count`: int, optional + +## New S3/SeaweedFS Content + +`sandcrawler` bucket, `html` folder, `.tei.xml` suffix. + diff --git a/proposals/20201103_xml_ingest.md b/proposals/20201103_xml_ingest.md new file mode 100644 index 0000000..34e00b0 --- /dev/null +++ b/proposals/20201103_xml_ingest.md @@ -0,0 +1,64 @@ + +status: deployed + +XML Fulltext Ingest +==================== + +This document details changes to include XML fulltext ingest in the same way +that we currently ingest PDF fulltext. + +Currently this will just fetch the single XML document, which is often lacking +figures, tables, and other required files. + +## Text Encoding + +Because we would like to treat XML as a string in a couple contexts, but XML +can have multiple encodings (indicated in an XML header), we are in a bit of a +bind. Simply parsing into unicode and then re-encoding as UTF-8 could result in +a header/content mismatch. Any form of re-encoding will change the hash of the +document. For recording in fatcat, the file metadata will be passed through. +For storing in Kafka and blob store (for downstream analysis), we will parse +the raw XML document (as "bytes") with an XML parser, then re-output with UTF-8 +encoding. The hash of the *original* XML file will be used as the key for +referring to this document. 
This is unintuitive, but similar to what we are +doing with PDF and HTML documents (extracting in a useful format, but keeping +the original document's hash as a key). + +Unclear if we need to do this re-encode process for XML documents already in +UTF-8 encoding. + +## Ingest Worker + +Could either re-use HTML metadata extractor to fetch XML fulltext links, or +fork that code off to a separate method, like the PDF fulltext URL extractor. + +Hopefully can re-use almost all of the PDF pipeline code, by making that ingest +worker class more generic and subclassing it. + +Result objects are treated the same as PDF ingest results: the result object +has context about status, and if successful, file metadata and CDX row of the +terminal object. + +TODO: should it be assumed that XML fulltext will end up in S3 bucket? or +should there be an `xml_meta` SQL table tracking this, like we have for PDFs +and HTML? + +TODO: should we detect and specify the XML schema better? Eg, indicate if JATS. + + +## Persist Pipeline + +### Kafka Topic + +sandcrawler-ENV.xml-doc + similar to other fulltext topics; JSON wrapping the XML + key compaction, content compression + +### S3/SeaweedFS + +`sandcrawler` bucket, `xml` folder. Extension could depend on sub-type of XML? + +### Persist Worker + +New S3-only worker that pulls from kafka topic and pushes to S3. Works +basically the same as PDF persist in S3-only mode, or like pdf-text worker. 
diff --git a/proposals/2020_pdf_meta_thumbnails.md b/proposals/2020_pdf_meta_thumbnails.md new file mode 100644 index 0000000..141ece8 --- /dev/null +++ b/proposals/2020_pdf_meta_thumbnails.md @@ -0,0 +1,328 @@ + +status: deployed + +New PDF derivatives: thumbnails, metadata, raw text +=================================================== + +To support scholar.archive.org (fulltext search) and other downstream uses of +fatcat, want to extract from many PDFs: + +- pdf structured metadata +- thumbnail images +- raw extracted text + +A single worker should extract all of these fields, and publish into two kafka +streams. Separate persist workers consume from the streams and push into SQL +and/or seaweedfs. + +Additionally, this extraction should happen automatically for newly-crawled +PDFs as part of the ingest pipeline. When possible, checks should be run +against the existing SQL table to avoid duplication of processing. + + +## PDF Metadata and Text + +Kafka topic (name: `sandcrawler-ENV.pdf-text`; 12x partitions; gzip +compression) JSON schema: + + sha1hex (string; used as key) + status (string) + text (string) + page0_thumbnail (boolean) + meta_xml (string) + pdf_info (object) + pdf_extra (object) + word_count + file_meta (object) + source (object) + +For the SQL table we should have columns for metadata fields that are *always* +saved, and put a subset of other interesting fields in a JSON blob. We don't +need all metadata fields in SQL. Full metadata/info will always be available in +Kafka, and we don't want SQL table size to explode. 
Schema: + + CREATE TABLE IF NOT EXISTS pdf_meta ( + sha1hex TEXT PRIMARY KEY CHECK (octet_length(sha1hex) = 40), + updated TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL, + status TEXT CHECK (octet_length(status) >= 1) NOT NULL, + has_page0_thumbnail BOOLEAN NOT NULL, + page_count INT CHECK (page_count >= 0), + word_count INT CHECK (word_count >= 0), + page0_height REAL CHECK (page0_height >= 0), + page0_width REAL CHECK (page0_width >= 0), + permanent_id TEXT CHECK (octet_length(permanent_id) >= 1), + pdf_created TIMESTAMP WITH TIME ZONE, + pdf_version TEXT CHECK (octet_length(pdf_version) >= 1), + metadata JSONB + -- maybe some analysis of available fields? + -- metadata JSON fields: + -- title + -- subject + -- author + -- creator + -- producer + -- CrossMarkDomains + -- doi + -- form + -- encrypted + ); + + +## Thumbnail Images + +Kafka schema is raw image bytes as message body; sha1sum of PDF as the key. No +compression, 12x partitions. + +Kafka topic name is `sandcrawler-ENV.pdf-thumbnail-SIZE-TYPE` (eg, +`sandcrawler-qa.pdf-thumbnail-180px-jpg`). Thus, topic name contains the +"metadata" of thumbnail size/shape. + +Have decided to use JPEG thumbnails, 180px wide (and max 300px high, though +width restriction is almost always the limiting factor). This size matches that +used on archive.org, and is slightly larger than the thumbnails currently used +on scholar.archive.org prototype. We intend to tweak the scholar.archive.org +CSS to use the full/raw thumbnail image at max desktop size. At this size it +would be difficult (though maybe not impossible?) to extract text (other than +large-font titles). + + +### Implementation + +We use the `poppler` CPP library (wrapper for python) to extract and convert everything. 
+ +Some example usage of the `python-poppler` library: + + import poppler + from PIL import Image + + pdf = poppler.load_from_file("/home/bnewbold/10.1038@s41551-020-0534-9.pdf") + pdf.pdf_id + page = pdf.create_page(0) + page.page_rect().width + + renderer = poppler.PageRenderer() + full_page = renderer.render_page(page) + img = Image.frombuffer("RGBA", (full_page.width, full_page.height), full_page.data, 'raw', "RGBA") + img.thumbnail((180,300), Image.BICUBIC) + img.save("something.jpg") + +## Deployment and Infrastructure + +Deployment will involve: + +- sandcrawler DB SQL table + => guesstimate size 100 GByte for hundreds of PDFs +- postgrest/SQL access to new table for internal HTTP API hits +- seaweedfs raw text folder + => reuse existing bucket with GROBID XML; same access restrictions on content +- seaweedfs thumbnail bucket + => new bucket for this world-public content +- public nginx access to seaweed thumbnail bucket +- extraction work queue kafka topic + => same schema/semantics as ungrobided +- text/metadata kafka topic +- thumbnail kafka topic +- text/metadata persist worker(s) + => from kafka; metadata to SQL database; text to seaweedfs blob store +- thumbnail persist worker + => from kafka to seaweedfs blob store +- pdf extraction worker pool + => very similar to GROBID worker pool +- ansible roles for all of the above + +Plan for processing/catchup is: + +- test with COVID-19 PDF corpus +- run extraction on all current fatcat files available via IA +- integrate with ingest pipeline for all new files +- run a batch catchup job over all GROBID-parsed files with no pdf meta + extracted, on basis of SQL table query + +## Appendix: Thumbnail Size and Format Experimentation + +Using 190 PDFs from `/data/pdfs/random_crawl/files` on my laptop to test. + +TODO: actually, 4x images failed to convert with pdftocairo; this throws off +"mean" sizes by a small amount. 
+ + time ls | parallel -j1 pdftocairo -singlefile -scale-to 200 -png {} /tmp/test-png/{}.png + real 0m29.314s + user 0m26.794s + sys 0m2.484s + => missing: 4 + => min: 0.8k + => max: 57K + => mean: 16.4K + => total: 3120K + + time ls | parallel -j1 pdftocairo -singlefile -scale-to 200 -jpeg {} /tmp/test-jpeg/{}.jpg + real 0m26.289s + user 0m24.022s + sys 0m2.490s + => missing: 4 + => min: 1.2K + => max: 13K + => mean: 8.02k + => total: 1524K + + time ls | parallel -j1 pdftocairo -singlefile -scale-to 200 -jpeg -jpegopt optimize=y,quality=80 {} /tmp/test-jpeg2/{}.jpg + real 0m27.401s + user 0m24.941s + sys 0m2.519s + => missing: 4 + => min: 577 + => max: 14K + => mean: + => total: 1540K + + time ls | parallel -j1 convert -resize 200x200 {}[0] /tmp/magick-png/{}.png + => missing: 4 + real 1m19.399s + user 1m17.150s + sys 0m6.322s + => min: 1.1K + => max: 325K + => mean: + => total: 8476K + + time ls | parallel -j1 convert -resize 200x200 {}[0] /tmp/magick-jpeg/{}.jpg + real 1m21.766s + user 1m17.040s + sys 0m7.155s + => total: 3484K + +NOTE: the following `pdf_thumbnail.py` images are somewhat smaller than the above +jpg and pngs (max 180px wide, not 200px wide) + + time ls | parallel -j1 ~/code/sandcrawler/python/scripts/pdf_thumbnail.py {} /tmp/python-png/{}.png + real 0m48.198s + user 0m42.997s + sys 0m4.509s + => missing: 2; 2x additional stub images + => total: 5904K + + time ls | parallel -j1 ~/code/sandcrawler/python/scripts/pdf_thumbnail.py {} /tmp/python-jpg/{}.jpg + real 0m45.252s + user 0m41.232s + sys 0m4.273s + => min: 1.4K + => max: 16K + => mean: ~9.3KByte + => total: 1772K + + time ls | parallel -j1 ~/code/sandcrawler/python/scripts/pdf_thumbnail.py {} /tmp/python-jpg-360/{}.jpg + real 0m48.639s + user 0m44.121s + sys 0m4.568s + => mean: ~28k + => total: 5364K (3x of 180px batch) + + quality=95 + time ls | parallel -j1 ~/code/sandcrawler/python/scripts/pdf_thumbnail.py {} /tmp/python-jpg2-360/{}.jpg + real 0m49.407s + user 0m44.607s + sys 0m4.869s + 
    => total: 9812K

    quality=95
    time ls | parallel -j1 ~/code/sandcrawler/python/scripts/pdf_thumbnail.py {} /tmp/python-jpg2-180/{}.jpg
    real    0m45.901s
    user    0m41.486s
    sys     0m4.591s
    => mean: 16.4K
    => total: 3116K

At the 180px size, the difference between default and quality=95 seems
indistinguishable visually to me, but is more than a doubling of file size.
Also tried at 300px and seems near-indistinguishable there as well.

At a mean of 10 Kbytes per file:

    10 million -> 100 GBytes
    100 million -> 1 Tbyte

Older COVID-19 thumbnails were about 400px wide:

    pdftocairo -png -singlefile -scale-to-x 400 -scale-to-y -1

Display on scholar-qa.archive.org is about 135x181px

archive.org does 180px wide

Unclear if we should try to do double resolution for high DPI screens (eg,
apple "retina").

Using same size as archive.org probably makes the most sense: max 180px wide,
preserve aspect ratio. And the jpeg improvement seems worth it.

#### Merlijn notes

From work on optimizing microfilm thumbnail images:

    When possible, generate a thumbnail that fits well on the screen of the
    user. Always creating a large thumbnail will result in the browsers
    downscaling them, leading to fuzzy text. If that's not possible, then pick
    the resolution you'd want to support (1.5x or 2x scaling) and create
    thumbnails of that size, but also apply the other recommendations below -
    especially a sharpening filter.

    Use bicubic or lanczos interpolation. Bilinear and nearest neighbour are
    not OK.

    For text, consider applying a sharpening filter. Not a strong one, but some
    sharpening can definitely help.
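A minimal PIL sketch of those recommendations (LANCZOS downscale plus a light unsharp mask); the filter parameters here are untuned guesses, not recommendations from the notes:

```python
from PIL import Image, ImageFilter

def sharp_thumbnail(img, max_size=(180, 300)):
    """Downscale with a high-quality filter, then lightly sharpen (helps text)."""
    # Image.Resampling exists on Pillow >= 9.1; older versions use module constants
    resample = getattr(Image, "Resampling", Image).LANCZOS
    thumb = img.convert("RGB")
    thumb.thumbnail(max_size, resample)
    return thumb.filter(ImageFilter.UnsharpMask(radius=1.0, percent=60, threshold=2))

# Example with a blank stand-in for a rendered PDF page (595x790 pts):
page = Image.new("RGB", (595, 790), "white")
thumb = sharp_thumbnail(page)
thumb.save("/tmp/thumb.jpg", "JPEG", quality=75, optimize=True)
```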
+ + +## Appendix: PDF Info Fields + +From `pdfinfo` manpage: + + The ´Info' dictionary contains the following values: + + title + subject + keywords + author + creator + producer + creation date + modification date + + In addition, the following information is printed: + + tagged (yes/no) + form (AcroForm / XFA / none) + javascript (yes/no) + page count + encrypted flag (yes/no) + print and copy permissions (if encrypted) + page size + file size + linearized (yes/no) + PDF version + metadata (only if requested) + +For an example file, the output looks like: + + Title: A mountable toilet system for personalized health monitoring via the analysis of excreta + Subject: Nature Biomedical Engineering, doi:10.1038/s41551-020-0534-9 + Keywords: + Author: Seung-min Park + Creator: Springer + CreationDate: Thu Mar 26 01:26:57 2020 PDT + ModDate: Thu Mar 26 01:28:06 2020 PDT + Tagged: no + UserProperties: no + Suspects: no + Form: AcroForm + JavaScript: no + Pages: 14 + Encrypted: no + Page size: 595.276 x 790.866 pts + Page rot: 0 + File size: 6104749 bytes + Optimized: yes + PDF version: 1.4 + +For context on the `pdf_id` fields ("original" and "updated"), read: +<https://web.hypothes.is/blog/synchronizing-annotations-between-local-and-remote-pdfs/> diff --git a/proposals/2020_seaweed_s3.md b/proposals/2020_seaweed_s3.md new file mode 100644 index 0000000..677393b --- /dev/null +++ b/proposals/2020_seaweed_s3.md @@ -0,0 +1,426 @@ +# Notes on seaweedfs + +> 2020-04-28, martin@archive.org + +Currently (04/2020) [minio](https://github.com/minio/minio) is used to store +output from PDF analysis for [fatcat](https://fatcat.wiki) (e.g. from +[grobid](https://grobid.readthedocs.io/en/latest/)). The file checksum (sha1) +serves as key, values are blobs of XML or JSON. + +Problem: minio inserts slowed down after inserting 80M or more objects. + +Summary: I did four test runs, three failed, one (testrun-4) succeeded. 
+ +* [testrun-4](https://git.archive.org/webgroup/sandcrawler/-/blob/master/proposals/2020_seaweed_s3.md#testrun-4) + +So far, in a non-distributed mode, the project looks usable. Added 200M objects +(about 550G) in 6 days. Full CPU load, 400M RAM usage, constant insert times. + +---- + +Details (03/2020) / @bnewbold, slack + +> the sandcrawler XML data store (currently on aitio) is grinding to a halt, I +> think because despite tuning minio+ext4+hdd just doesn't work. current at 2.6 +> TiB of data (each document compressed with snappy) and 87,403,183 objects. + +> this doesn't impact ingest processing (because content is queued and archived +> in kafka), but does impact processing and analysis + +> it is possible that the other load on aitio is making this worse, but I did +> an experiment with dumping to a 16 TB disk that slowed way down after about +> 50 million files also. some people on the internet said to just not worry +> about these huge file counts on modern filesystems, but i've debugged a bit +> and I think it is a bad idea after all + +Possible solutions + +* putting content in fake WARCs and trying to do something like CDX +* deploy CEPH object store (or swift, or any other off-the-shelf object store) +* try putting the files in postgres tables, mongodb, cassandra, etc: these are + not designed for hundreds of millions of ~50 KByte XML documents (5 - 500 + KByte range) +* try to find or adapt an open source tool like Haystack, Facebook's solution + to this engineering problem. eg: + https://engineering.linkedin.com/blog/2016/05/introducing-and-open-sourcing-ambry---linkedins-new-distributed- + +---- + +The following are notes gathered during a few test runs of seaweedfs in 04/2020 +on wbgrp-svc170.us.archive.org (4 core E5-2620 v4, 4GB RAM). + +---- + +## Setup + +There are frequent [releases](https://github.com/chrislusf/seaweedfs/releases) +but for the test, we used a build off master branch. 
Directions for configuring AWS CLI for seaweedfs:
[https://github.com/chrislusf/seaweedfs/wiki/AWS-CLI-with-SeaweedFS](https://github.com/chrislusf/seaweedfs/wiki/AWS-CLI-with-SeaweedFS).

### Build the binary

Using the development version (requires a [Go installation](https://golang.org/dl/)).

```
$ git clone git@github.com:chrislusf/seaweedfs.git # 11f5a6d9
$ cd seaweedfs
$ make
$ ls -lah weed/weed
-rwxr-xr-x 1 tir tir 55M Apr 17 16:57 weed

$ git rev-parse HEAD
11f5a6d91346e5f3cbf3b46e0a660e231c5c2998

$ sha1sum weed/weed
a7f8f0b49e6183da06fc2d1411c7a0714a2cc96b
```

A single, 55M binary emerges after a few seconds. The binary contains
subcommands to run the different parts of seaweed, e.g. master or volume
servers, the filer, and commands for maintenance tasks, like backup and
compaction.

To *deploy*, just copy this binary to the destination.

### Quickstart with S3

Assuming the `weed` binary is in PATH.

Start a master and volume server (over /tmp, most likely) and the S3 API with a single command:

```
$ weed server -s3
...
Start Seaweed Master 30GB 1.74 at 0.0.0.0:9333
...
Store started on dir: /tmp with 0 volumes max 7
Store started on dir: /tmp with 0 ec shards
Volume server start with seed master nodes: [localhost:9333]
...
Start Seaweed S3 API Server 30GB 1.74 at http port 8333
...
```

Install the [AWS
CLI](https://github.com/chrislusf/seaweedfs/wiki/AWS-CLI-with-SeaweedFS).
Create a bucket.

```
$ aws --endpoint-url http://localhost:8333 s3 mb s3://sandcrawler-dev
make_bucket: sandcrawler-dev
```

List buckets.

```
$ aws --endpoint-url http://localhost:8333 s3 ls
2020-04-17 17:44:39 sandcrawler-dev
```

Create a dummy file.

```
$ echo "blob" > 12340d9a4a4f710ecf03b127051814385e83ff08.tei.xml
```

Upload.
+ +``` +$ aws --endpoint-url http://localhost:8333 s3 cp 12340d9a4a4f710ecf03b127051814385e83ff08.tei.xml s3://sandcrawler-dev +upload: ./12340d9a4a4f710ecf03b127051814385e83ff08.tei.xml to s3://sandcrawler-dev/12340d9a4a4f710ecf03b127051814385e83ff08.tei.xml +``` + +List. + +``` +$ aws --endpoint-url http://localhost:8333 s3 ls s3://sandcrawler-dev +2020-04-17 17:50:35 5 12340d9a4a4f710ecf03b127051814385e83ff08.tei.xml +``` + +Stream to stdout. + +``` +$ aws --endpoint-url http://localhost:8333 s3 cp s3://sandcrawler-dev/12340d9a4a4f710ecf03b127051814385e83ff08.tei.xml - +blob +``` + +Drop the bucket. + +``` +$ aws --endpoint-url http://localhost:8333 s3 rm --recursive s3://sandcrawler-dev +``` + +### Builtin benchmark + +The project comes with a builtin benchmark command. + +``` +$ weed benchmark +``` + +I encountered an error like +[#181](https://github.com/chrislusf/seaweedfs/issues/181), "no free volume +left" - when trying to start the benchmark after the S3 ops. A restart or a restart with `-volume.max 100` helped. 
+ +``` +$ weed server -s3 -volume.max 100 +``` + +### Listing volumes + +``` +$ weed shell +> volume.list +Topology volume:15/112757 active:8 free:112742 remote:0 volumeSizeLimit:100 MB + DataCenter DefaultDataCenter volume:15/112757 active:8 free:112742 remote:0 + Rack DefaultRack volume:15/112757 active:8 free:112742 remote:0 + DataNode localhost:8080 volume:15/112757 active:8 free:112742 remote:0 + volume id:1 size:105328040 collection:"test" file_count:33933 version:3 modified_at_second:1587215730 + volume id:2 size:106268552 collection:"test" file_count:34236 version:3 modified_at_second:1587215730 + volume id:3 size:106290280 collection:"test" file_count:34243 version:3 modified_at_second:1587215730 + volume id:4 size:105815368 collection:"test" file_count:34090 version:3 modified_at_second:1587215730 + volume id:5 size:105660168 collection:"test" file_count:34040 version:3 modified_at_second:1587215730 + volume id:6 size:106296488 collection:"test" file_count:34245 version:3 modified_at_second:1587215730 + volume id:7 size:105753288 collection:"test" file_count:34070 version:3 modified_at_second:1587215730 + volume id:8 size:7746408 file_count:12 version:3 modified_at_second:1587215764 + volume id:9 size:10438760 collection:"test" file_count:3363 version:3 modified_at_second:1587215788 + volume id:10 size:10240104 collection:"test" file_count:3299 version:3 modified_at_second:1587215788 + volume id:11 size:10258728 collection:"test" file_count:3305 version:3 modified_at_second:1587215788 + volume id:12 size:10240104 collection:"test" file_count:3299 version:3 modified_at_second:1587215788 + volume id:13 size:10112840 collection:"test" file_count:3258 version:3 modified_at_second:1587215788 + volume id:14 size:10190440 collection:"test" file_count:3283 version:3 modified_at_second:1587215788 + volume id:15 size:10112840 collection:"test" file_count:3258 version:3 modified_at_second:1587215788 + DataNode localhost:8080 total size:820752408 file_count:261934 + 
Rack DefaultRack total size:820752408 file_count:261934 + DataCenter DefaultDataCenter total size:820752408 file_count:261934 +total size:820752408 file_count:261934 +``` + +### Custom S3 benchmark + +To simulate the use case of S3 for 100-500M small files (grobid xml, pdftotext, +...), I created a synthetic benchmark. + +* [https://gist.github.com/miku/6f3fee974ba82083325c2f24c912b47b](https://gist.github.com/miku/6f3fee974ba82083325c2f24c912b47b) + +We just try to fill up the datastore with millions of 5k blobs. + +---- + +### testrun-1 + +Small set, just to run. Status: done. Learned that the default in-memory volume +index grows too quickly for the 4GB RAM machine. + +``` +$ weed server -dir /tmp/martin-seaweedfs-testrun-1 -s3 -volume.max 512 -master.volumeSizeLimitMB 100 +``` + +* https://github.com/chrislusf/seaweedfs/issues/498 -- RAM +* at 10M files, we already consume ~1G + +``` +-volume.index string + Choose [memory|leveldb|leveldbMedium|leveldbLarge] mode for memory~performance balance. (default "memory") +``` + +### testrun-2 + +200M 5k objects, in-memory volume index. Status: done. Observed: After 18M +objects the 512 100MB volumes are exhausted and seaweedfs will not accept any +new data. + +``` +$ weed server -dir /tmp/martin-seaweedfs-testrun-2 -s3 -volume.max 512 -master.volumeSizeLimitMB 100 +... 
+I0418 12:01:43 1622 volume_loading.go:104] loading index /tmp/martin-seaweedfs-testrun-2/test_511.idx to memory +I0418 12:01:43 1622 store.go:122] add volume 511 +I0418 12:01:43 1622 volume_layout.go:243] Volume 511 becomes writable +I0418 12:01:43 1622 volume_growth.go:224] Created Volume 511 on topo:DefaultDataCenter:DefaultRack:localhost:8080 +I0418 12:01:43 1622 master_grpc_server.go:158] master send to master@[::1]:45084: url:"localhost:8080" public_url:"localhost:8080" new_vids:511 +I0418 12:01:43 1622 master_grpc_server.go:158] master send to filer@::1:18888: url:"localhost:8080" public_url:"localhost:8080" new_vids:511 +I0418 12:01:43 1622 store.go:118] In dir /tmp/martin-seaweedfs-testrun-2 adds volume:512 collection:test replicaPlacement:000 ttl: +I0418 12:01:43 1622 volume_loading.go:104] loading index /tmp/martin-seaweedfs-testrun-2/test_512.idx to memory +I0418 12:01:43 1622 store.go:122] add volume 512 +I0418 12:01:43 1622 volume_layout.go:243] Volume 512 becomes writable +I0418 12:01:43 1622 master_grpc_server.go:158] master send to master@[::1]:45084: url:"localhost:8080" public_url:"localhost:8080" new_vids:512 +I0418 12:01:43 1622 master_grpc_server.go:158] master send to filer@::1:18888: url:"localhost:8080" public_url:"localhost:8080" new_vids:512 +I0418 12:01:43 1622 volume_growth.go:224] Created Volume 512 on topo:DefaultDataCenter:DefaultRack:localhost:8080 +I0418 12:01:43 1622 node.go:82] topo failed to pick 1 from 0 node candidates +I0418 12:01:43 1622 volume_growth.go:88] create 7 volume, created 2: No enough data node found! +I0418 12:04:30 1622 volume_layout.go:231] Volume 511 becomes unwritable +I0418 12:04:30 1622 volume_layout.go:231] Volume 512 becomes unwritable +E0418 12:04:30 1622 filer_server_handlers_write.go:69] failing to assign a file id: rpc error: code = Unknown desc = No free volumes left! 
+I0418 12:04:30 1622 filer_server_handlers_write.go:120] fail to allocate volume for /buckets/test/k43731970, collection:test, datacenter: +E0418 12:04:30 1622 filer_server_handlers_write.go:69] failing to assign a file id: rpc error: code = Unknown desc = No free volumes left! +E0418 12:04:30 1622 filer_server_handlers_write.go:69] failing to assign a file id: rpc error: code = Unknown desc = No free volumes left! +E0418 12:04:30 1622 filer_server_handlers_write.go:69] failing to assign a file id: rpc error: code = Unknown desc = No free volumes left! +E0418 12:04:30 1622 filer_server_handlers_write.go:69] failing to assign a file id: rpc error: code = Unknown desc = No free volumes left! +I0418 12:04:30 1622 masterclient.go:88] filer failed to receive from localhost:9333: rpc error: code = Unavailable desc = transport is closing +I0418 12:04:30 1622 master_grpc_server.go:276] - client filer@::1:18888 +``` + +Inserted about 18M docs, then: + +``` +worker-0 @3720000 45475.13 81.80 +worker-1 @3730000 45525.00 81.93 +worker-3 @3720000 45525.76 81.71 +worker-4 @3720000 45527.22 81.71 +Process Process-1: +Traceback (most recent call last): + File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap + self.run() + File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run + self._target(*self._args, **self._kwargs) + File "s3test.py", line 42, in insert_keys + s3.Bucket(bucket).put_object(Key=key, Body=data) + File "/home/martin/.virtualenvs/6f3fee974ba82083325c2f24c912b47b/lib/python3.5/site-packages/boto3/resources/factory.py", line 520, in do_action + response = action(self, *args, **kwargs) + File "/home/martin/.virtualenvs/6f3fee974ba82083325c2f24c912b47b/lib/python3.5/site-packages/boto3/resources/action.py", line 83, in __call__ + response = getattr(parent.meta.client, operation_name)(**params) + File "/home/martin/.virtualenvs/6f3fee974ba82083325c2f24c912b47b/lib/python3.5/site-packages/botocore/client.py", line 316, in _api_call + 
    return self._make_api_call(operation_name, kwargs)
  File "/home/martin/.virtualenvs/6f3fee974ba82083325c2f24c912b47b/lib/python3.5/site-packages/botocore/client.py", line 626, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (InternalError) when calling the PutObject operation (reached max retries: 4): We encountered an internal error, please try again.

real    759m30.034s
user    1962m47.487s
sys     105m21.113s
```

Sustained 400 S3 puts/s, RAM usage 41% of a 4G machine. 56G on disk.

> No free volumes left! Failed to allocate bucket for /buckets/test/k163721819

### testrun-3

* use leveldb, leveldbLarge
* try "auto" volumes
* Status: done. Observed: rapid memory usage increase.

```
$ weed server -dir /tmp/martin-seaweedfs-testrun-3 -s3 -volume.max 0 -volume.index=leveldbLarge -filer=false -master.volumeSizeLimitMB 100
```

Observations: memory usage grows rapidly, soon at 15%.

Note-to-self: [https://github.com/chrislusf/seaweedfs/wiki/Optimization](https://github.com/chrislusf/seaweedfs/wiki/Optimization)

### testrun-4

The default volume size is 30G (and cannot be more at the moment), and RAM
grows significantly with the number of volumes. Therefore, keep the default
volume size, do not limit the number of volumes (`-volume.max 0`), and use a
leveldb index rather than the in-memory index.

Status: done; 200M objects uploaded via Python script successfully in about 6
days, memory usage was at a moderate 400M (~10% of RAM). Relatively constant
performance at about 400 `PutObject` requests/s (over 5 threads, each thread
was around 80 requests/s; then testing with 4 threads, each thread got to
around 100 requests/s).

```
$ weed server -dir /tmp/martin-seaweedfs-testrun-4 -s3 -volume.max 0 -volume.index=leveldb
```

The test script command was (40M files per worker, 5 workers):

```
$ time python s3test.py -n 40000000 -w 5 2> s3test.4.log
...
+ +real 8454m33.695s +user 21318m23.094s +sys 1128m32.293s +``` + +The test script adds keys from `k0...k199999999`. + +``` +$ aws --endpoint-url http://localhost:8333 s3 ls s3://test | head -20 +2020-04-19 09:27:13 5000 k0 +2020-04-19 09:27:13 5000 k1 +2020-04-19 09:27:13 5000 k10 +2020-04-19 09:27:15 5000 k100 +2020-04-19 09:27:26 5000 k1000 +2020-04-19 09:29:15 5000 k10000 +2020-04-19 09:47:49 5000 k100000 +2020-04-19 12:54:03 5000 k1000000 +2020-04-20 20:14:10 5000 k10000000 +2020-04-22 07:33:46 5000 k100000000 +2020-04-22 07:33:46 5000 k100000001 +2020-04-22 07:33:46 5000 k100000002 +2020-04-22 07:33:46 5000 k100000003 +2020-04-22 07:33:46 5000 k100000004 +2020-04-22 07:33:46 5000 k100000005 +2020-04-22 07:33:46 5000 k100000006 +2020-04-22 07:33:46 5000 k100000007 +2020-04-22 07:33:46 5000 k100000008 +2020-04-22 07:33:46 5000 k100000009 +2020-04-20 20:14:10 5000 k10000001 +``` + +Glance at stats. + +``` +$ du -hs /tmp/martin-seaweedfs-testrun-4 +596G /tmp/martin-seaweedfs-testrun-4 + +$ find . /tmp/martin-seaweedfs-testrun-4 | wc -l +5104 + +$ ps --pid $(pidof weed) -o pid,tid,class,stat,vsz,rss,comm + PID TID CLS STAT VSZ RSS COMMAND +32194 32194 TS Sl+ 1966964 491644 weed + +$ ls -1 /proc/$(pidof weed)/fd | wc -l +192 + +$ free -m + total used free shared buff/cache available +Mem: 3944 534 324 39 3086 3423 +Swap: 4094 27 4067 +``` + +### Note on restart + +When stopping (CTRL-C) and restarting `weed` it will take about 10 seconds to +get the S3 API server back up, but another minute or two, until seaweedfs +inspects all existing volumes and indices. + +In that gap, requests to S3 will look like internal server errors. + +``` +$ aws --endpoint-url http://localhost:8333 s3 cp s3://test/k100 - +download failed: s3://test/k100 to - An error occurred (500) when calling the +GetObject operation (reached max retries: 4): Internal Server Error +``` + +### Read benchmark + +Reading via command line `aws` client is a bit slow at first sight (3-5s). 
```
$ time aws --endpoint-url http://localhost:8333 s3 cp s3://test/k123456789 -
ppbhjgzkrrgwagmjsuwhqcwqzmefybeopqz [...]

real    0m5.839s
user    0m0.898s
sys     0m0.293s
```

#### Single process random reads

* via [s3read.go](https://gist.github.com/miku/6f3fee974ba82083325c2f24c912b47b#file-s3read-go)

Running 1000 random reads takes 49s.

#### Concurrent random reads

* 80000 requests with 8 parallel processes: 7m41.973968488s, so about 170 objects/s
* seen up to 760 keys/s reads for 8 workers
* weed will utilize all cores, so more cpus could result in higher read throughput
* RAM usage can increase (seen up to 20% of 4G RAM), then decrease (GC) back to 5%, depending on query load
diff --git a/proposals/2021-04-22_crossref_db.md b/proposals/2021-04-22_crossref_db.md
new file mode 100644
index 0000000..1d4c3f8
--- /dev/null
+++ b/proposals/2021-04-22_crossref_db.md
@@ -0,0 +1,86 @@

status: deployed

Crossref DOI Metadata in Sandcrawler DB
=======================================

Proposal is to have a local copy of Crossref API metadata records in
sandcrawler DB, accessible by simple key lookup via postgrest.

Initial goal is to include these in scholar work "bundles" (along with
fulltext, etc), in particular as part of the reference extraction pipeline.
Around late 2020, many additional references became available via Crossref
records, and have not been imported (updated) into fatcat. Reference storage in
the fatcat API is a scaling problem we would like to put off, so injecting
content in this way is desirable.

To start, working with a bulk dump made available by Crossref. In the future,
we might persist the daily feed so that we have a continuously up-to-date copy.

Another application of Crossref-in-bundles is to identify the overall scale of
changes since the initial Crossref metadata import.


## Sandcrawler DB Schema

The "updated" field in this case refers to the upstream timestamp, not the
sandcrawler database update time.
    CREATE TABLE IF NOT EXISTS crossref (
        doi TEXT NOT NULL CHECK (octet_length(doi) >= 4 AND doi = LOWER(doi)),
        indexed TIMESTAMP WITH TIME ZONE NOT NULL,
        record JSON NOT NULL,
        PRIMARY KEY(doi)
    );

For postgrest access, may need to also:

    GRANT SELECT ON public.crossref TO web_anon;

## SQL Backfill Command

For an example file:

    cat sample.json \
        | jq -rc '[(.DOI | ascii_downcase), .indexed."date-time", (. | tostring)] | @tsv' \
        | psql sandcrawler -c "COPY crossref (doi, indexed, record) FROM STDIN (DELIMITER E'\t');"

For a full snapshot:

    zcat crossref_public_data_file_2021_01.json.gz \
        | pv -l \
        | jq -rc '[(.DOI | ascii_downcase), .indexed."date-time", (. | tostring)] | @tsv' \
        | psql sandcrawler -c "COPY crossref (doi, indexed, record) FROM STDIN (DELIMITER E'\t');"

jq is the bottleneck (100% of a single CPU core).

## Kafka Worker

Pulls from the fatcat crossref ingest Kafka feed and persists into the crossref
table.

## SQL Table Disk Utilization

An example backfill from early 2021, with about 120 million Crossref DOI
records.

Starting database size (with ingest running):

    Filesystem      Size  Used Avail Use% Mounted on
    /dev/vdb1       1.7T  896G  818G  53% /1

    Size: 475.14G

Ingest SQL command took:

    120M 15:06:08 [2.22k/s]
    COPY 120684688

After database size:

    Filesystem      Size  Used Avail Use% Mounted on
    /dev/vdb1       1.7T  1.2T  498G  71% /1

    Size: 794.88G

So about 320 GByte of disk.
diff --git a/proposals/2021-09-09_component_ingest.md b/proposals/2021-09-09_component_ingest.md
new file mode 100644
index 0000000..09dee4f
--- /dev/null
+++ b/proposals/2021-09-09_component_ingest.md
@@ -0,0 +1,114 @@

File Ingest Mode: 'component'
=============================

A new ingest type for downloading individual files which are a subset of a
complete work.

Some publishers now assign DOIs to individual figures, supplements, and other
"components" of an overall release or document.
Initial mimetypes to allow:

- image/jpeg
- image/tiff
- image/png
- image/gif
- audio/mpeg
- video/mp4
- video/mpeg
- text/plain
- text/csv
- application/json
- application/xml
- application/pdf
- application/gzip
- application/x-bzip
- application/x-bzip2
- application/zip
- application/x-rar
- application/x-7z-compressed
- application/x-tar
- application/vnd.ms-powerpoint
- application/vnd.ms-excel
- application/msword
- application/vnd.openxmlformats-officedocument.wordprocessingml.document
- application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

Intentionally not supporting:

- text/html


## Fatcat Changes

In the file importer, allow the additional mimetypes for 'component' ingest.


## Ingest Changes

Allow additional terminal mimetypes for 'component' crawls.


## Examples

Hundreds of thousands: <https://fatcat.wiki/release/search?q=type%3Acomponent+in_ia%3Afalse>

#### ACS Supplement File

<https://doi.org/10.1021/acscatal.0c02627.s002>

Redirects directly to a .zip in the browser. SPN is blocked by a cookie check.

#### Frontiers .docx Supplement

<https://doi.org/10.3389/fpls.2019.01642.s001>

Redirects to the full article page. There is a pop-up for figshare; seems hard to process.

#### Figshare Single File

<https://doi.org/10.6084/m9.figshare.13646972.v1>

As 'component' type in fatcat.

Redirects to a landing page. Dataset ingest seems more appropriate for this entire domain.

#### PeerJ supplement file

<https://doi.org/10.7717/peerj.10257/supp-7>

PeerJ is hard because it redirects to a single HTML page, which has links to
supplements in the HTML. Perhaps a custom extractor will work.

#### eLife

<https://doi.org/10.7554/elife.38407.010>

The current crawl mechanism makes it seemingly impossible to extract a specific
supplement from the document as a whole.

#### Zookeys

<https://doi.org/10.3897/zookeys.895.38576.figure53>

These are extract-able.
#### OECD PDF Supplement

<https://doi.org/10.1787/f08c6324-en>
<https://www.oecd-ilibrary.org/trade/imports-of-services-billions-of-us-dollars_f08c6324-en>

Has an Excel (.xls) link, great, but then a paywall.

#### Direct File Link

<https://doi.org/10.1787/888934207500>

This one is also OECD, but is a simple direct download.

#### Protein Data Bank (PDB) Entry

<https://doi.org/10.2210/pdb6ls2/pdb>

Multiple files; dataset/fileset more appropriate for these.
diff --git a/proposals/2021-09-09_fileset_ingest.md b/proposals/2021-09-09_fileset_ingest.md
new file mode 100644
index 0000000..65c9ccf
--- /dev/null
+++ b/proposals/2021-09-09_fileset_ingest.md
@@ -0,0 +1,343 @@

status: implemented

Fileset Ingest Pipeline (for Datasets)
======================================

Sandcrawler currently has ingest support for individual files saved as `file`
entities in fatcat (xml and pdf ingest types) and HTML files with
sub-components saved as `webcapture` entities in fatcat (html ingest type).

This document describes extensions to this ingest system to flexibly support
groups of files, which may be represented in fatcat as `fileset` entities. The
main new ingest type is `dataset`.

Compared to the existing ingest process, there are two major complications with
datasets:

- the ingest process often requires more than parsing HTML files, and will be
  specific to individual platforms and host software packages
- the storage backend and fatcat entity type is flexible: a dataset might be
  represented by a single file, multiple files combined into a single .zip
  file, or multiple separate files; the data may get archived in wayback or in
  an archive.org item

The new concepts of "strategy" and "platform" are introduced to accommodate
these complications.


## Ingest Strategies

The ingest strategy describes the fatcat entity type that will be output; the
storage backend used; and whether an enclosing file format is used.
The strategy to use cannot be determined until the number and size of files
are known. It is a function of file count, total file size, and publication
platform.

Strategy names are compact strings with the format
`{storage_backend}-{fatcat_entity}`. A `-bundled` suffix after a `fileset`
entity type indicates that metadata about multiple files is retained, but that
in the storage backend only a single enclosing file (eg, `.zip`) will be
stored.

The supported strategies are:

- `web-file`: single file of any type, stored in wayback, represented as fatcat `file`
- `web-fileset`: multiple files of any type, stored in wayback, represented as fatcat `fileset`
- `web-fileset-bundled`: single bundle file, stored in wayback, represented as fatcat `fileset`
- `archiveorg-file`: single file of any type, stored in archive.org item, represented as fatcat `file`
- `archiveorg-fileset`: multiple files of any type, stored in archive.org item, represented as fatcat `fileset`
- `archiveorg-fileset-bundled`: single bundle file, stored in archive.org item, represented as fatcat `fileset`

"Bundle" or "enclosing" files are things like .zip or .tar.gz. Not all .zip
files are handled as bundles! Only when the transfer from the hosting platform
is via a "download all as .zip" (or similar) do we consider a zipfile a
"bundle" and index the interior files as a fileset.

The term "bundle file" is used over "archive file" or "container file" to
prevent confusion with the other uses of those terms in the context of fatcat
(container entities; archive; Internet Archive as an organization).

The motivation for supporting both `web` and `archiveorg` is that `web` is
somewhat simpler for small files, but `archiveorg` is better for larger groups
of files (say, more than 20) and larger total size (say, more than 1 GByte
total, or 128 MByte for any one file).

The motivation for supporting "bundled" filesets is that there is only a single
file to archive.
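As a sketch, the count/size heuristics above might translate into a strategy chooser like the following. The constants mirror the rough numbers in the text; the names are hypothetical, and platform-specific overrides (the third input mentioned above) are omitted:

```python
MAX_WEB_FILE_COUNT = 20                  # rough thresholds from the text
MAX_WEB_TOTAL_SIZE = 1024 ** 3           # ~1 GByte total
MAX_WEB_SINGLE_SIZE = 128 * 1024 ** 2    # ~128 MByte for any one file

def choose_strategy(file_sizes):
    """Pick an ingest strategy name from a list of file sizes (in bytes)."""
    total = sum(file_sizes)
    if len(file_sizes) == 1:
        backend = "web" if total <= MAX_WEB_SINGLE_SIZE else "archiveorg"
        return backend + "-file"
    small_enough = (
        len(file_sizes) <= MAX_WEB_FILE_COUNT
        and total <= MAX_WEB_TOTAL_SIZE
        and max(file_sizes) <= MAX_WEB_SINGLE_SIZE
    )
    return "web-fileset" if small_enough else "archiveorg-fileset"
```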
+ + +## Ingest Pseudocode + +1. Determine `platform`, which may involve resolving redirects and crawling a landing page. + + a. currently we always crawl the ingest `base_url`, capturing a platform landing page + b. we don't currently handle the case of `base_url` leading to a non-HTML + terminal resource. the `component` ingest type does handle this + +2. Use platform-specific methods to fetch manifest metadata and decide on an `ingest_strategy`. + + a. depending on platform, may include access URLs for multiple strategies + (eg, URL for each file and a bundle URL), metadata about the item for, eg, + archive.org item upload, etc + +3. Use strategy-specific methods to archive all files in platform manifest, and verify manifest metadata. + +4. Summarize status and return structured result metadata. + + a. if the strategy was `web-file` or `archiveorg-file`, potentially submit an + `ingest_file_result` object down the file ingest pipeline (Kafka topic and + later persist and fatcat import workers), with `dataset-file` ingest + type (or `{ingest_type}-file` more generally). 
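The steps above could be wired together roughly as follows. The stub platform helper and the return shape are illustrative only; the real helper and strategy interfaces are spelled out in the API listing below (note that `chose_strategy` is the spelling used there):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class PlatformItem:
    """Minimal stand-in for the platform item passed between steps."""
    platform_name: str
    manifest: List[Dict[str, Any]] = field(default_factory=list)

class StubPlatformHelper:
    """Fake platform-specific helper, standing in for steps 1 and 2."""
    def match_request(self, request, resource, html_biblio):
        return "example.org" in request["base_url"]

    def process_request(self, request, resource, html_biblio):
        return PlatformItem("stub", [{"path": "data.csv", "size": 1234}])

    def chose_strategy(self, item):
        return "web-file" if len(item.manifest) == 1 else "web-fileset"

def ingest_fileset(request, helpers):
    resource, html_biblio = None, None  # step 1: would come from crawling base_url
    helper = next(h for h in helpers if h.match_request(request, resource, html_biblio))
    item = helper.process_request(request, resource, html_biblio)  # step 2
    strategy = helper.chose_strategy(item)
    # step 3 would archive each file in item.manifest via the chosen strategy
    return {"status": "success", "ingest_strategy": strategy,      # step 4
            "manifest": item.manifest}

result = ingest_fileset({"base_url": "https://example.org/dataset/1"},
                        [StubPlatformHelper()])
```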
+ +New python types: + + FilesetManifestFile + path: str + size: Optional[int] + md5: Optional[str] + sha1: Optional[str] + sha256: Optional[str] + mimetype: Optional[str] + extra: Optional[Dict[str, Any]] + + status: Optional[str] + platform_url: Optional[str] + terminal_url: Optional[str] + terminal_dt: Optional[str] + + FilesetPlatformItem + platform_name: str + platform_status: str + platform_domain: Optional[str] + platform_id: Optional[str] + manifest: Optional[List[FilesetManifestFile]] + archiveorg_item_name: Optional[str] + archiveorg_item_meta + web_base_url + web_bundle_url + + ArchiveStrategyResult + ingest_strategy: str + status: str + manifest: List[FilesetManifestFile] + file_file_meta: Optional[dict] + file_terminal: Optional[dict] + file_cdx: Optional[dict] + bundle_file_meta: Optional[dict] + bundle_terminal: Optional[dict] + bundle_cdx: Optional[dict] + bundle_archiveorg_path: Optional[dict] + +New python APIs/classes: + + FilesetPlatformHelper + match_request(request, resource, html_biblio) -> bool + does the request and landing page metadata indicate a match for this platform? + process_request(request, resource, html_biblio) -> FilesetPlatformItem + do API requests, parsing, etc to fetch metadata and access URLs for this fileset/dataset. platform-specific + chose_strategy(item: FilesetPlatformItem) -> IngestStrategy + select an archive strategy for the given fileset/dataset + + FilesetIngestStrategy + check_existing(item: FilesetPlatformItem) -> Optional[ArchiveStrategyResult] + check the given backend for an existing capture/archive; if found, return result + process(item: FilesetPlatformItem) -> ArchiveStrategyResult + perform an actual archival capture + +## Limits and Failure Modes + +- `too-large-size`: total size of the fileset is too large for archiving. + initial limit is 64 GBytes, controlled by `max_total_size` parameter. +- `too-many-files`: number of files (and thus file-level metadata) is too + large. 
initial limit is 200, controlled by `max_file_count` parameter. +- `platform-scope / FilesetPlatformScopeError`: for when `base_url` leads to a + valid platform, which could be found via API or parsing, but has the wrong + scope. Eg, tried to fetch a dataset, but got a DOI which represents all + versions of the dataset, not a specific version. +- `platform-restricted`/`PlatformRestrictedError`: for, eg, embargoes +- `platform-404`: got to a landing page, and seemed like in-scope, but no + platform record found anyways + + +## New Sandcrawler Code and Worker + + sandcrawler-ingest-fileset-worker@{1..6} (or up to 1..12 later) + +Worker consumes from ingest request topic, produces to fileset ingest results, +and optionally produces to file ingest results. + + sandcrawler-persist-ingest-fileset-worker@1 + +Simply writes fileset ingest rows to SQL. + + +## New Fatcat Worker and Code Changes + + fatcat-import-ingest-fileset-worker + +This importer is modeled on file and web worker. Filters for `success` with +strategy of `*-fileset*`. + +Existing `fatcat-import-ingest-file-worker` should be updated to allow +`dataset` single-file imports, with largely same behavior and semantics as +current importer (`component` mode). + +Existing fatcat transforms, and possibly even elasticsearch schemas, should be +updated to include fileset status and `in_ia` flag for dataset type releases. + +Existing entity updates worker submits `dataset` type ingests to ingest request +topic. 
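The `too-large-size` and `too-many-files` limits described earlier come down to a simple pre-flight pass over the manifest. A minimal sketch, with the default values from this proposal (the function name is hypothetical):

```python
def check_fileset_limits(manifest, max_total_size=64 * 1024**3, max_file_count=200):
    """Pre-flight check of a fileset manifest against archiving limits.

    `manifest` is a list of dicts with an optional 'size' key (bytes).
    Returns a failure status string, or None if the fileset looks archivable.
    """
    if len(manifest) > max_file_count:
        return "too-many-files"
    total_size = sum(f.get("size") or 0 for f in manifest)
    if total_size > max_total_size:
        return "too-large-size"
    return None
```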
+ + +## Ingest Result Schema + +Common with file results, and mostly relating to landing page HTML: + + hit: bool + status: str + success + success-existing + success-file (for `web-file` or `archiveorg-file` only) + request: object + terminal: object + file_meta: object + cdx: object + revisit_cdx: object + html_biblio: object + +Additional fileset-specific fields: + + manifest: list of objects + platform_name: str + platform_domain: str + platform_id: str + platform_base_url: str + ingest_strategy: str + archiveorg_item_name: str (optional, only for `archiveorg-*` strategies) + file_count: int + total_size: int + fileset_bundle (optional, only for `*-fileset-bundle` strategy) + file_meta + cdx + revisit_cdx + terminal + archiveorg_bundle_path + fileset_file (optional, only for `*-file` strategy) + file_meta + terminal + cdx + revisit_cdx + +If the strategy was `web-file` or `archiveorg-file` and the status is +`success-file`, then an ingest file result will also be published to +`sandcrawler-ENV.ingest-file-results`, using the same ingest type and fields as +regular ingest. + + +All fileset ingest results get published to ingest-fileset-result. + +Existing sandcrawler persist workers also subscribe to this topic and persist +status and landing page terminal info to tables just like with file ingest. +GROBID, HTML, and other metadata is not persisted in this path. + +If the ingest strategy was a single file (`*-file`), then an ingest file is +also published to the ingest-file-result topic, with the `fileset_file` +metadata, and ingest type `dataset-file`. This should only happen on success +condition. + + +## New SQL Tables + +Note that this table *complements* `ingest_file_result`, doesn't replace it. +`ingest_file_result` could more accurately be called `ingest_result`. 
+ + CREATE TABLE IF NOT EXISTS ingest_fileset_platform ( + ingest_type TEXT NOT NULL CHECK (octet_length(ingest_type) >= 1), + base_url TEXT NOT NULL CHECK (octet_length(base_url) >= 1), + updated TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL, + hit BOOLEAN NOT NULL, + status TEXT CHECK (octet_length(status) >= 1), + + platform_name TEXT NOT NULL CHECK (octet_length(platform_name) >= 1), + platform_domain TEXT NOT NULL CHECK (octet_length(platform_domain) >= 1), + platform_id TEXT NOT NULL CHECK (octet_length(platform_id) >= 1), + ingest_strategy TEXT CHECK (octet_length(ingest_strategy) >= 1), + total_size BIGINT, + file_count BIGINT, + archiveorg_item_name TEXT CHECK (octet_length(archiveorg_item_name) >= 1), + + archiveorg_item_bundle_path TEXT CHECK (octet_length(archiveorg_item_bundle_path) >= 1), + web_bundle_url TEXT CHECK (octet_length(web_bundle_url) >= 1), + web_bundle_dt TEXT CHECK (octet_length(web_bundle_dt) = 14), + + manifest JSONB, + -- list, similar to fatcat fileset manifest, plus extra: + -- status (str) + -- path (str) + -- size (int) + -- md5 (str) + -- sha1 (str) + -- sha256 (str) + -- mimetype (str) + -- extra (dict) + -- platform_url (str) + -- terminal_url (str) + -- terminal_dt (str) + + PRIMARY KEY (ingest_type, base_url) + ); + CREATE INDEX ingest_fileset_platform_name_domain_id_idx ON ingest_fileset_platform(platform_name, platform_domain, platform_id); + +Persist worker should only insert in to this table if `platform_name` is +identified. + +## New Kafka Topic + + sandcrawler-ENV.ingest-fileset-results 6x, no retention limit + + +## Implementation Plan + +First implement ingest worker, including platform and strategy helpers, and +test those as simple stdin/stdout CLI tools in sandcrawler repo to validate +this proposal. + +Second implement fatcat importer and test locally and/or in QA. 
Lastly implement infrastructure, automation, and other "glue":

- SQL schema
- persist worker


## Design Note: Single-File Datasets

Should datasets and other groups of files which only contain a single file get
imported as a fatcat `file` or `fileset`? This can be broken down further as
documents (single PDF) vs other individual files.

Advantages of `file`:

- handles case of article PDFs being marked as dataset accidentally
- `file` entities get de-duplicated with simple lookup (eg, on `sha1`)
- conceptually simpler if individual files are `file` entity
- easier to download individual files

Advantages of `fileset`:

- conceptually simpler if all `dataset` entities have `fileset` form factor
- code path is simpler: one fewer strategy, and less complexity of sending
  files down separate import path
- metadata about platform is retained
- would require no modification of existing fatcat file importer
- fatcat import of archive.org `file` is not actually implemented yet?

Decision is to do individual files. Fatcat fileset import worker should reject
single-file (and empty) manifest filesets. Fatcat file import worker should
accept all mimetypes for `dataset-file` (similar to `component`).


## Example Entities

See `notes/dataset_examples.txt`
diff --git a/proposals/2021-09-13_src_ingest.md b/proposals/2021-09-13_src_ingest.md
new file mode 100644
index 0000000..470827a
--- /dev/null
+++ b/proposals/2021-09-13_src_ingest.md
@@ -0,0 +1,53 @@

File Ingest Mode: 'src'
=======================

Ingest type for "source" of works in document form. For example, tarballs of
LaTeX source and figures, as published on arxiv.org and Pubmed Central.

For now, presumption is that this would be a single file (`file` entity in
fatcat).
Initial mimetypes to allow:

- text/x-tex
- application/xml
- application/gzip
- application/x-bzip
- application/x-bzip2
- application/zip
- application/x-tar
- application/msword
- application/vnd.openxmlformats-officedocument.wordprocessingml.document


## Fatcat Changes

In the file importer, allow the additional mimetypes for 'src' ingest.

Might keep ingest disabled on the fatcat side, at least initially. Eg, until
there is some notion of "file scope", or other ways of treating 'src' tarballs
separate from PDFs or other fulltext formats.


## Ingest Changes

Allow additional terminal mimetypes for 'src' crawls.


## Examples

    arxiv:2109.00954v1
    fatcat:release_akzp2lgqjbcbhpoeoitsj5k5hy
    https://arxiv.org/format/2109.00954v1
    https://arxiv.org/e-print/2109.00954v1

    arxiv:1912.03397v2
    https://arxiv.org/format/1912.03397v2
    https://arxiv.org/e-print/1912.03397v2
    NOT: https://arxiv.org/pdf/1912.03397v2

    pmcid:PMC3767916
    https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/08/03/PMC3767916.tar.gz

For PMC, will need to use one of the .csv file lists to get the digit prefixes.
diff --git a/proposals/2021-09-21_spn_accounts.md b/proposals/2021-09-21_spn_accounts.md
new file mode 100644
index 0000000..e41c162
--- /dev/null
+++ b/proposals/2021-09-21_spn_accounts.md
@@ -0,0 +1,14 @@

Formalization of SPNv2 API requests from fatcat/sandcrawler

Create two new system accounts, one for regular/daily ingest requests, one for
priority requests (save-paper-now or as a flag with things like fatcat-ingest;
"interactive"). These accounts should have @archive.org emails. Request the
daily one to have the same rate limit as the current bnewbold@archive.org
account; the priority queue can have less.

Create new ingest kafka queues from scratch, one for priority and one for
regular. Choose sizes carefully, probably keep 24x for the regular and do 6x or
so (small) for priority queue.
Deploy new priority workers; reconfigure/deploy broadly.
diff --git a/proposals/2021-10-28_grobid_refs.md b/proposals/2021-10-28_grobid_refs.md
new file mode 100644
index 0000000..1fc79b6
--- /dev/null
+++ b/proposals/2021-10-28_grobid_refs.md
@@ -0,0 +1,125 @@

GROBID References in Sandcrawler DB
===================================

Want to start processing "unstructured" raw references coming from upstream
metadata sources (distinct from upstream fulltext sources, like PDFs or JATS
XML), and save the results in sandcrawler DB. From there, they will get pulled
in to fatcat-scholar "intermediate bundles" and included in reference exports.

The initial use case for this is to parse "unstructured" references deposited
in Crossref, and include them in refcat.

## Schema and Semantics

The output JSON/dict schema for parsed references follows that of
`grobid_tei_xml` version 0.1.x, for the `GrobidBiblio` field. The
`unstructured` field that was parsed is included in the output, though it may
not be byte-for-byte exact (see below). One notable change from the past (eg,
older GROBID-parsed references) is that author `name` is now `full_name`. New
fields include `editors` (same schema as `authors`), `book_title`, and
`series_title`.

The overall output schema matches that of the `grobid_refs` SQL table:

    source: string, lower-case. eg 'crossref'
    source_id: string, eg '10.1145/3366650.3366668'
    source_ts: optional timestamp (full ISO datetime with timezone, eg `Z`
        suffix) which identifies the version of upstream metadata
    refs_json: JSON, list of `GrobidBiblio` JSON objects

References are re-processed on a per-article (or per-release) basis. All the
references for an article are handled as a batch and output as a batch. If
there are no upstream references, a row with `refs_json` as an empty list may
still be returned.

Not all upstream references get re-parsed, even if an 'unstructured' field is
available.
If 'unstructured' is not available, no row is ever output. For
example, if a reference includes `unstructured` (raw citation string), but also
has structured metadata for authors, title, year, and journal name, we might
not re-parse the `unstructured` string. Whether to re-parse is evaluated on a
per-reference basis. This behavior may change over time.

`unstructured` strings may be pre-processed before being submitted to GROBID.
This is because many sources have systemic encoding issues. GROBID itself may
also do some modification of the input citation string before returning it in
the output. This means the `unstructured` string is not a reliable way to map
between specific upstream references and parsed references. Instead, the `id`
field (str) of `GrobidBiblio` gets set to any upstream "key" or "index"
identifier used to track individual references. If there is only a numeric
index, the `id` is that number as a string.

The `key` or `id` may need to be woven back in to the ref objects manually,
because GROBID `processCitationList` takes just a list of raw strings, with no
attached reference-level key or id.


## New SQL Table and View

We may want to do re-parsing of references from sources other than `crossref`,
so there is a generic `grobid_refs` table. But it is also common to fetch both
the crossref metadata and any re-parsed references together, so as a
convenience there is a PostgreSQL view (virtual table) that includes both a
crossref metadata record and parsed citations, if available. If downstream code
cares a lot about having the refs and record be in sync, the `source_ts` field
on `grobid_refs` can be matched against the `indexed` column of `crossref` (or
the `.indexed.date-time` JSON field in the record itself).

Remember that DOIs should always be lower-cased before querying, inserting,
comparing, etc.
    CREATE TABLE IF NOT EXISTS grobid_refs (
        source TEXT NOT NULL CHECK (octet_length(source) >= 1),
        source_id TEXT NOT NULL CHECK (octet_length(source_id) >= 1),
        source_ts TIMESTAMP WITH TIME ZONE,
        updated TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL,
        refs_json JSON NOT NULL,
        PRIMARY KEY(source, source_id)
    );

    CREATE OR REPLACE VIEW crossref_with_refs (doi, indexed, record, source_ts, refs_json) AS
    SELECT
        crossref.doi as doi,
        crossref.indexed as indexed,
        crossref.record as record,
        grobid_refs.source_ts as source_ts,
        grobid_refs.refs_json as refs_json
    FROM crossref
    LEFT JOIN grobid_refs ON
        grobid_refs.source_id = crossref.doi
        AND grobid_refs.source = 'crossref';

Both `grobid_refs` and `crossref_with_refs` will be exposed through postgrest.


## New Workers / Tools

For simplicity, to start, a single worker will consume from
`fatcat-prod.api-crossref`, process citations with GROBID (if necessary), and
insert to both `crossref` and `grobid_refs` tables. This worker will run
locally on the machine with sandcrawler-db.

Another tool will support taking large chunks of Crossref JSON (as lines),
filtering them, processing with GROBID, and printing JSON to stdout, in the
`grobid_refs` JSON schema.
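Because `processCitationList` only sees raw strings, re-attaching the upstream `key`/`index` identifiers is positional. A sketch of that "weaving" step (function name hypothetical; the parsed refs are stand-in dicts for `GrobidBiblio` objects):

```python
def attach_ref_keys(upstream_refs, parsed_refs):
    """Re-attach upstream 'key'/'index' identifiers to parsed references.

    GROBID's processCitationList only sees raw strings, so list order is
    the only link back to the upstream reference objects.
    """
    assert len(upstream_refs) == len(parsed_refs)
    out = []
    for upstream, parsed in zip(upstream_refs, parsed_refs):
        parsed = dict(parsed)
        # prefer an explicit key; fall back to the numeric index as a string
        key = upstream.get("key")
        parsed["id"] = key if key is not None else str(upstream["index"])
        out.append(parsed)
    return out
```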
## Task Examples

Command to process crossref records with refs tool:

    cat crossref_sample.json \
        | parallel -j5 --linebuffer --round-robin --pipe ./grobid_tool.py parse-crossref-refs - \
        | pv -l \
        > crossref_sample.parsed.json

    # => 10.0k 0:00:27 [ 368 /s]

Load directly in to postgres (after tables have been created):

    cat crossref_sample.parsed.json \
        | jq -rc '[.source, .source_id, .source_ts, (.refs_json | tostring)] | @tsv' \
        | psql sandcrawler -c "COPY grobid_refs (source, source_id, source_ts, refs_json) FROM STDIN (DELIMITER E'\t');"

    # => COPY 9999
diff --git a/proposals/2021-12-09_trawling.md b/proposals/2021-12-09_trawling.md
new file mode 100644
index 0000000..33b6b4c
--- /dev/null
+++ b/proposals/2021-12-09_trawling.md
@@ -0,0 +1,180 @@

status: work-in-progress

NOTE: as of December 2022, the implementation of these features hasn't been
merged to the main branch. Development stalled in December 2021.

Trawling for Unstructured Scholarly Web Content
===============================================

## Background and Motivation

A long-term goal for sandcrawler has been the ability to pick through
unstructured web archive content (or even non-web collections), identify
potential in-scope research outputs, extract metadata for those outputs, and
merge the content in to a catalog (fatcat).

This process requires integration of many existing tools (HTML and PDF
extraction; fuzzy bibliographic metadata matching; machine learning to identify
in-scope content; etc), as well as high-level curation, targeting, and
evaluation by human operators. The goal is to augment and improve the
productivity of human operators as much as possible.

This process will be similar to "ingest", which is where we start with a
specific URL and have some additional context about the expected result (eg,
content type, external identifier).
Some differences with trawling are that we
start with a collection or context (instead of a single URL); have little or
no context about the content we are looking for; and may even be creating a new
catalog entry, as opposed to matching to a known existing entry.


## Architecture

The core operation is to take a resource and run a flowchart of processing
steps on it, resulting in an overall status and possible related metadata. The
common case is that the resource is a PDF or HTML coming from wayback (with
contextual metadata about the capture), but we should be flexible to supporting
more content types in the future, and should try to support plain files with no
context as well.

Some relatively simple wrapper code handles fetching resources and summarizing
status/counts.

Outside of the scope of sandcrawler, new fatcat code (importer or similar) will
be needed to handle trawl results. It will probably make sense to pre-filter
(with `jq` or `rg`) before passing results to fatcat.

At this stage, trawl workers will probably be run manually. Some successful
outputs (like GROBID, HTML metadata) would be written to existing kafka topics
to be persisted, but there would not be any specific `trawl` SQL tables or
automation.

It will probably be helpful to have some kind of wrapper script that can run
sandcrawler trawl processes, then filter and pipe the output into a fatcat
importer, all from a single invocation, while reporting results.

TODO:
- for HTML imports, do we fetch the full webcapture stuff and return that?


## Methods of Operation

### `cdx_file`

An existing CDX file is provided on-disk locally.

### `cdx_api`

Simplified variants: `cdx_domain`, `cdx_surt`

Uses CDX API to download records matching the configured filters, then
processes the file.

Saves the CDX file intermediate result somewhere locally (working or tmp
directory), with timestamp in the path, to make re-trying with `cdx_file` fast
and easy.
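As a sketch, the `cdx_api` mode might build its query like the following. The query parameters follow the public wayback CDX server API (`url`, `matchType`, repeated `filter`, `to`); the specific filter set and function name are illustrative.

```python
from urllib.parse import urlencode

CDX_API = "https://web.archive.org/cdx/search/cdx"


def cdx_api_url(url_prefix, mimetype="application/pdf", to_dt=None):
    """Build a CDX server query URL for one trawl configuration.

    The response would be saved to a local, timestamped file so the run
    can be repeated in `cdx_file` mode.
    """
    params = [
        ("url", url_prefix),
        ("matchType", "prefix"),
        ("filter", f"mimetype:{mimetype}"),
        ("filter", "statuscode:200"),
        ("output", "json"),
    ]
    if to_dt:
        # eg, to_dt="1999" for captures before the year 2000
        params.append(("to", to_dt))
    return CDX_API + "?" + urlencode(params)
```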
### `archiveorg_web_collection`

Uses `cdx_collection.py` (or similar) to fetch a full CDX list by iterating
over the collection, then processes it.

Saves the CDX file intermediate result somewhere locally (working or tmp
directory), with timestamp in the path, to make re-trying with `cdx_file` fast
and easy.

### Others

- `archiveorg_file_collection`: fetch file list via archive.org metadata, then
  processes each

## Schema

Per-resource results:

    hit (bool)
        indicates whether resource seems in scope and was processed successfully
        (roughly, status 'success', and
    status (str)
        success: fetched resource, ran processing, pa
        skip-cdx: filtered before even fetching resource
        skip-resource: filtered after fetching resource
        wayback-error (etc): problem fetching
    content_scope (str)
        filtered-{filtertype}
        article (etc)
        landing-page
    resource_type (str)
        pdf, html
    file_meta{}
    cdx{}
    revisit_cdx{}

    # below are resource_type specific
    grobid
    pdf_meta
    pdf_trio
    html_biblio
    (other heuristics and ML)

High-level request:

    trawl_method: str
    cdx_file_path
    default_filters: bool
    resource_filters[]
        scope: str
            surt_prefix, domain, host, mimetype, size, datetime, resource_type, http_status
        value: any
        values[]: any
        min: any
        max: any
    biblio_context{}: set of expected/default values
        container_id
        release_type
        release_stage
        url_rel

High-level summary / results:

    status
    request{}: the entire request object
    counts
        total_resources
        status{}
        content_scope{}
        resource_type{}

## Example Corpuses

All PDFs (`application/pdf`) in web.archive.org from before the year 2000.
Starting point would be a CDX list.

Spidering crawls starting from a set of OA journal homepage URLs.

Archive-It partner collections from research universities, particularly of
their own .edu domains. Starting point would be an archive.org collection, from
which WARC files or CDX lists can be accessed.
General archive.org PDF collections, such as
[ERIC](https://archive.org/details/ericarchive) or
[Document Cloud](https://archive.org/details/documentcloud).

Specific Journal or Publisher URL patterns. Starting point could be a domain,
hostname, SURT prefix, and/or URL regex.

Heuristic patterns over full web.archive.org CDX index. For example, .edu
domains with user directories and a `.pdf` in the file path ("tilde" username
pattern).

Random samples of entire Wayback corpus. For example, random samples filtered
by date, content type, TLD, etc. This would be true "trawling" over the entire
corpus.


## Other Ideas

Could have a web archive spidering mode: starting from a seed, fetch multiple
captures (different captures), then extract outlinks from those, up to some
number of hops. An example application would be links to research group
webpages or author homepages, and to try to extract PDF links from CVs, etc.

diff --git a/proposals/brainstorm/2021-debug_web_interface.md b/proposals/brainstorm/2021-debug_web_interface.md
new file mode 100644
index 0000000..442b439
--- /dev/null
+++ b/proposals/brainstorm/2021-debug_web_interface.md
@@ -0,0 +1,9 @@

status: brainstorm idea

Simple internal-only web interface to help debug ingest issues.

- paste a hash, URL, or identifier and get a display of "everything we know" about it
- enter a URL/SURT prefix and get aggregate stats (?)
- enter a domain/host/prefix and get recent attempts/results
- pre-computed periodic reports on ingest pipeline (?)
diff --git a/proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md b/proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md
new file mode 100644
index 0000000..b3ad447
--- /dev/null
+++ b/proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md
@@ -0,0 +1,36 @@

status: brainstorming

We continue to see issues with SPNv2-based crawling.
Would like to have an
option to switch to higher-throughput heritrix-based crawling.

SPNv2 path would stick around at least for save-paper-now style ingest.


## Sketch

Ingest requests are created continuously by fatcat, with daily spikes.

Ingest workers run mostly in "bulk" mode, aka they don't make SPNv2 calls.
`no-capture` responses are recorded in the sandcrawler SQL database.

Periodically (daily?), a script queries for new no-capture results, filtered to
the most recent period. These are processed a bit in to a URL list, then
converted to a heritrix frontier, and sent to crawlers. This could either be an
h3 instance (?), or simple `scp` to a running crawl directory.

The crawler crawls, with usual landing page config, and draintasker runs.

TODO: can we have draintasker/heritrix set a maximum WARC life? Like 6 hours?
or, target a smaller draintasker item size, so they get updated more frequently

Another SQL script dumps ingest requests from the *previous* period, and
re-submits them for bulk-style ingest (by workers).

The end result would be things getting crawled and updated within a couple
days.


## Sketch 2

Upload URL list to petabox item, wait for heritrix derive to run (!)
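The periodic no-capture query in the first sketch could start out as simple as the following, against the existing `ingest_file_result` table (the one-day interval and the de-duplication are illustrative):

```sql
-- daily dump of URLs that bulk ingest workers could not find in wayback
SELECT DISTINCT base_url
FROM ingest_file_result
WHERE status = 'no-capture'
  AND updated >= NOW() - INTERVAL '1 day';
```

The output would then be filtered and converted to a heritrix frontier.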