Diffstat (limited to 'proposals')
-rw-r--r-- | proposals/2018_original_sandcrawler_rfc.md | 180
-rw-r--r-- | proposals/2019_ingest.md | 6
-rw-r--r-- | proposals/20200129_pdf_ingest.md | 10
-rw-r--r-- | proposals/20200207_pdftrio.md | 5
-rw-r--r-- | proposals/20201012_no_capture.md | 39
-rw-r--r-- | proposals/20201026_html_ingest.md | 129
-rw-r--r-- | proposals/20201103_xml_ingest.md | 64
-rw-r--r-- | proposals/2020_pdf_meta_thumbnails.md | 4
-rw-r--r-- | proposals/2020_seaweed_s3.md | 22
-rw-r--r-- | proposals/2021-04-22_crossref_db.md | 86
-rw-r--r-- | proposals/2021-09-09_component_ingest.md | 114
-rw-r--r-- | proposals/2021-09-09_fileset_ingest.md | 343
-rw-r--r-- | proposals/2021-09-13_src_ingest.md | 53
-rw-r--r-- | proposals/2021-09-21_spn_accounts.md | 14
-rw-r--r-- | proposals/2021-10-28_grobid_refs.md | 125
-rw-r--r-- | proposals/2021-12-09_trawling.md | 180
-rw-r--r-- | proposals/brainstorm/2021-debug_web_interface.md | 9
-rw-r--r-- | proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md | 36 |
18 files changed, 1398 insertions, 21 deletions
diff --git a/proposals/2018_original_sandcrawler_rfc.md b/proposals/2018_original_sandcrawler_rfc.md new file mode 100644 index 0000000..ecf7ab8 --- /dev/null +++ b/proposals/2018_original_sandcrawler_rfc.md @@ -0,0 +1,180 @@ + +**Title:** Journal Archiving Pipeline + +**Author:** Bryan Newbold <bnewbold@archive.org> + +**Date:** March 2018 + +**Status:** work-in-progress + +This is an RFC-style technical proposal for a journal crawling, archiving, +extracting, resolving, and cataloging pipeline. + +Design work funded by a Mellon Foundation grant in 2018. + +## Overview + +Let's start with data stores first: + +- crawled original fulltext (PDF, JATS, HTML) ends up in petabox/global-wayback +- file-level extracted fulltext and metadata is stored in HBase, with the hash + of the original file as the key +- cleaned metadata is stored in a "catalog" relational (SQL) database (probably + PostgreSQL or some hip scalable NewSQL thing compatible with Postgres or + MariaDB) + +**Resources:** back-of-the-envelope, around 100 TB petabox storage total (for +100 million PDF files); 10-20 TB HBase table total. Can start small. + + +All "system" (aka, pipeline) state (eg, "what work has been done") is ephemeral +and is rederived relatively easily (but might be cached for performance). + +The overall "top-down", metadata-driven cycle is: + +1. Partners and public sources provide metadata (for catalog) and seed lists + (for crawlers) +2. Crawlers pull in fulltext and HTTP/HTML metadata from the public web +3. Extractors parse raw fulltext files (PDFs) and store structured metadata (in + HBase) +4. Data Mungers match extracted metadata (from HBase) against the catalog, or + create new records if none found. + +In the "bottom up" cycle, batch jobs run as map/reduce jobs against the +catalog, HBase, global wayback, and partner metadata datasets to identify +potential new public or already-archived content to process, and pushes tasks +to the crawlers, extractors, and mungers. + +## Partner Metadata + +Periodic Luigi scripts run on a regular VM to pull in metadata from partners. +All metadata is saved to either petabox (for public stuff) or HDFS (for +restricted). Scripts process/munge the data and push directly to the catalog +(for trusted/authoritative sources like Crossref, ISSN, PubMed, DOAJ); others +extract seedlists and push to the crawlers ( + +**Resources:** 1 VM (could be a devbox), with a large attached disk (spinning +probably ok) + +## Crawling + +All fulltext content comes in from the public web via crawling, and all crawled +content ends up in global wayback. + +One or more VMs serve as perpetual crawlers, with multiple active ("perpetual") +Heritrix crawls operating with differing configuration. These could be +orchestrated (like h3), or just have the crawl jobs cut off and restarted every +year or so. + +In a starter configuration, there would be two crawl queues. One would target +direct PDF links, landing pages, author homepages, DOI redirects, etc. It would +process HTML and look for PDF outlinks, but wouldn't crawl recursively. + +HBase is used for de-dupe, with records (pointers) stored in WARCs. + +A second config would take seeds as entire journal websites, and would crawl +continuously. + +Other components of the system "push" tasks to the crawlers by copying schedule +files into the crawl action directories. + +WARCs would be uploaded into petabox via draintasker as usual, and CDX +derivation would be left to the derive process. 
Other processes are notified of +"new crawl content" being available when they see new unprocessed CDX files in +items from specific collections. draintasker could be configured to "cut" new +items every 24 hours at most to ensure this pipeline moves along regularly, or +we could come up with other hacks to get lower "latency" at this stage. + +**Resources:** 1-2 crawler VMs, each with a large attached disk (spinning) + +### De-Dupe Efficiency + +We would certainly feed CDX info from all bulk journal crawling into HBase +before any additional large crawling, to get that level of de-dupe. + +As to whether all GWB PDFs should be de-dupe against is a policy question: is +there something special about the journal-specific crawls that makes it worth +having second copies? Eg, if we had previously domain crawled and access is +restricted, we then wouldn't be allowed to provide researcher access to those +files... on the other hand, we could extract for researchers given that we +"refound" the content at a new URL? + +Only fulltext files (PDFs) would be de-duped against (by content), so we'd be +recrawling lots of HTML. Presumably this is a fraction of crawl data size; what +fraction? + +Watermarked files would be refreshed repeatedly from the same PDF, and even +extracted/processed repeatedly (because the hash would be different). This is +hard to de-dupe/skip, because we would want to catch "content drift" (changes +in files). + +## Extractors + +Off-the-shelf PDF extraction software runs on high-CPU VM nodes (probably +GROBID running on 1-2 data nodes, which have 30+ CPU cores and plenty of RAM +and network throughput). + +A hadoop streaming job (written in python) takes a CDX file as task input. It +filters for only PDFs, and then checks each line against HBase to see if it has +already been extracted. If it hasn't, the script downloads directly from +petabox using the full CDX info (bypassing wayback, which would be a +bottleneck). It optionally runs any "quick check" scripts to see if the PDF +should be skipped ("definitely not a scholarly work"), then if it looks Ok +submits the file over HTTP to the GROBID worker pool for extraction. The +results are pushed to HBase, and a short status line written to Hadoop. The +overall Hadoop job has a reduce phase that generates a human-meaningful report +of job status (eg, number of corrupt files) for monitoring. + +A side job as part of extracting can "score" the extracted metadata to flag +problems with GROBID, to be used as potential training data for improvement. + +**Resources:** 1-2 datanode VMs; hadoop cluster time. Needed up-front for +backlog processing; less CPU needed over time. + +## Matchers + +The matcher runs as a "scan" HBase map/reduce job over new (unprocessed) HBasej +rows. It pulls just the basic metadata (title, author, identifiers, abstract) +and calls the catalog API to identify potential match candidates. If no match +is found, and the metadata "look good" based on some filters (to remove, eg, +spam), works are inserted into the catalog (eg, for those works that don't have +globally available identifiers or other metadata; "long tail" and legacy +content). + +**Resources:** Hadoop cluster time + +## Catalog + +The catalog is a versioned relational database. All scripts interact with an +API server (instead of connecting directly to the database). It should be +reliable and low-latency for simple reads, so it can be relied on to provide a +public-facing API and have public web interfaces built on top. 
This is in +contrast to Hadoop, which for the most part could go down with no public-facing +impact (other than fulltext API queries). The catalog does not contain +copywritable material, but it does contain strong (verified) links to fulltext +content. Policy gets implemented here if necessary. + +A global "changelog" (append-only log) is used in the catalog to record every +change, allowing for easier replication (internal or external, to partners). As +little as possible is implemented in the catalog itself; instead helper and +cleanup bots use the API to propose and verify edits, similar to the wikidata +and git data models. + +Public APIs and any front-end services are built on the catalog. Elasticsearch +(for metadata or fulltext search) could build on top of the catalog. + +**Resources:** Unknown, but estimate 1+ TB of SSD storage each on 2 or more +database machines + +## Machine Learning and "Bottom Up" + +TBD. + +## Logistics + +Ansible is used to deploy all components. Luigi is used as a task scheduler for +batch jobs, with cron to initiate periodic tasks. Errors and actionable +problems are aggregated in Sentry. + +Logging, metrics, and other debugging and monitoring are TBD. + diff --git a/proposals/2019_ingest.md b/proposals/2019_ingest.md index c649809..768784f 100644 --- a/proposals/2019_ingest.md +++ b/proposals/2019_ingest.md @@ -1,5 +1,5 @@ -status: work-in-progress +status: deployed This document proposes structure and systems for ingesting (crawling) paper PDFs and other content as part of sandcrawler. @@ -84,7 +84,7 @@ HTML? Or both? Let's just recrawl. *IngestRequest* - `ingest_type`: required, one of `pdf`, `xml`, `html`, `dataset`. For backwards compatibility, `file` should be interpreted as `pdf`. `pdf` and - `xml` return file ingest respose; `html` and `dataset` not implemented but + `xml` return file ingest response; `html` and `dataset` not implemented but would be webcapture (wayback) and fileset (archive.org item or wayback?). In the future: `epub`, `video`, `git`, etc. - `base_url`: required, where to start crawl process @@ -258,7 +258,7 @@ and hacks to crawl publicly available papers. Related existing work includes [unpaywall's crawler][unpaywall_crawl], LOCKSS extraction code, dissem.in's efforts, zotero's bibliography extractor, etc. The "memento tracer" work is also similar. Many of these are even in python! It would be great to reduce -duplicated work and maintenance. An analagous system in the wild is youtube-dl +duplicated work and maintenance. An analogous system in the wild is youtube-dl for downloading video from many sources. [unpaywall_crawl]: https://github.com/ourresearch/oadoi/blob/master/webpage.py diff --git a/proposals/20200129_pdf_ingest.md b/proposals/20200129_pdf_ingest.md index 9469217..157607e 100644 --- a/proposals/20200129_pdf_ingest.md +++ b/proposals/20200129_pdf_ingest.md @@ -1,5 +1,5 @@ -status: planned +status: deployed 2020q1 Fulltext PDF Ingest Plan =================================== @@ -27,7 +27,7 @@ There are a few million papers in fatacat which: 2. are known OA, usually because publication is Gold OA 3. don't have any fulltext PDF in fatcat -As a detail, some of these "known OA" journals actually have embargos (aka, +As a detail, some of these "known OA" journals actually have embargoes (aka, they aren't true Gold OA). In particular, those marked via EZB OA "color", and recent pubmed central ids. @@ -104,7 +104,7 @@ Actions: update ingest result table with status. 
- fetch new MAG and unpaywall seedlists, transform to ingest requests, persist into ingest request table. use SQL to dump only the *new* URLs (not seen in - previous dumps) using the created timestamp, outputing new bulk ingest + previous dumps) using the created timestamp, outputting new bulk ingest request lists. if possible, de-dupe between these two. then start bulk heritrix crawls over these two long lists. Probably sharded over several machines. Could also run serially (first one, then the other, with @@ -133,7 +133,7 @@ We have run GROBID+glutton over basically all of these PDFs. We should be able to do a SQL query to select PDFs that: - have at least one known CDX row -- GROBID processed successfuly and glutton matched to a fatcat release +- GROBID processed successfully and glutton matched to a fatcat release - do not have an existing fatcat file (based on sha1hex) - output GROBID metadata, `file_meta`, and one or more CDX rows @@ -161,7 +161,7 @@ Coding Tasks: Actions: - update `fatcat_file` sandcrawler table -- check how many PDFs this might ammount to. both by uniq SHA1 and uniq +- check how many PDFs this might amount to. both by uniq SHA1 and uniq `fatcat_release` matches - do some manual random QA verification to check that this method results in quality content in fatcat diff --git a/proposals/20200207_pdftrio.md b/proposals/20200207_pdftrio.md index 31a2db6..6f6443f 100644 --- a/proposals/20200207_pdftrio.md +++ b/proposals/20200207_pdftrio.md @@ -1,5 +1,8 @@ -status: in progress +status: deployed + +NOTE: while this has been used in production, as of December 2022 the results +are not used much in practice, and we don't score every PDF that comes along PDF Trio (ML Classification) ============================== diff --git a/proposals/20201012_no_capture.md b/proposals/20201012_no_capture.md new file mode 100644 index 0000000..7f6a1f5 --- /dev/null +++ b/proposals/20201012_no_capture.md @@ -0,0 +1,39 @@ + +status: work-in-progress + +NOTE: as of December 2022, bnewbold can't remember if this was fully +implemented or not. + +Storing no-capture missing URLs in `terminal_url` +================================================= + +Currently, when the bulk-mode ingest code terminates with a `no-capture` +status, the missing URL (which is not in GWB CDX) is not stored in +sandcrawler-db. This proposed change is to include it in the existing +`terminal_url` database column, with the `terminal_status_code` and +`terminal_dt` columns empty. + +The implementation is rather simple: + +- CDX lookup code path should save the *actual* final missing URL (`next_url` + after redirects) in the result object's `terminal_url` field +- ensure that this field gets passed through all the way to the database on the + `no-capture` code path + +This change does change the semantics of the `terminal_url` field somewhat, and +could break existing assumptions, so it is being documented in this proposal +document. + + +## Alternatives + +The current status quo is to store the missing URL as the last element in the +"hops" field of the JSON structure. We could keep this and have a convoluted +pipeline that would read from the Kafka feed and extract them, but this would +be messy. Eg, re-ingesting would not update the old kafka messages, so we could +need some accounting of consumer group offsets after which missing URLs are +truly missing. + +We could add a new `missing_url` database column and field to the JSON schema, +for this specific use case. This seems like unnecessary extra work. 
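To illustrate the proposed change, a minimal sketch (a hypothetical helper, not the actual sandcrawler code path) of the result object shape on the `no-capture` path, with only `terminal_url` populated:

```
def no_capture_result(next_url):
    # Minimal sketch: the *actual* final missing URL (next_url, after
    # following redirects) is carried in terminal_url, while terminal_dt and
    # terminal_status_code stay empty, since there was no capture to describe.
    return {
        "status": "no-capture",
        "hit": False,
        "terminal": {
            "terminal_url": next_url,
            "terminal_dt": None,
            "terminal_status_code": None,
        },
    }
```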
+ diff --git a/proposals/20201026_html_ingest.md b/proposals/20201026_html_ingest.md new file mode 100644 index 0000000..785471b --- /dev/null +++ b/proposals/20201026_html_ingest.md @@ -0,0 +1,129 @@ + +status: deployed + +HTML Ingest Pipeline +======================== + +Basic goal: given an ingest request of type 'html', output an object (JSON) +which could be imported into fatcat. + +Should work with things like (scholarly) blog posts, micropubs, registrations, +protocols. Doesn't need to work with everything to start. "Platform" sites +(like youtube, figshare, etc) will probably be a different ingest worker. + +A current unknown is what the expected size of this metadata is. Both in number +of documents and amount of metadata per document. + +Example HTML articles to start testing: + +- complex distill article: <https://distill.pub/2020/bayesian-optimization/> +- old HTML journal: <http://web.archive.org/web/20081120141926fw_/http://www.mundanebehavior.org/issues/v5n1/rosen.htm> +- NIH pub: <https://www.nlm.nih.gov/pubs/techbull/ja02/ja02_locatorplus_merge.html> +- first mondays (OJS): <https://firstmonday.org/ojs/index.php/fm/article/view/10274/9729> +- d-lib: <http://www.dlib.org/dlib/july17/williams/07williams.html> + + +## Ingest Process + +Follow base URL to terminal document, which is assumed to be a status=200 HTML document. + +Verify that terminal document is fulltext. Extract both metadata and fulltext. + +Extract list of sub-resources. Filter out unwanted (eg favicon, analytics, +unnecessary), apply a sanity limit. Convert to fully qualified URLs. For each +sub-resource, fetch down to the terminal resource, and compute hashes/metadata. + +Open questions: + +- will probably want to parallelize sub-resource fetching. async? +- behavior when failure fetching sub-resources + + +## Ingest Result Schema + +JSON should be basically compatible with existing `ingest_file_result` objects, +with some new sub-objects. + +Overall object (`IngestWebResult`): + +- `status`: str +- `hit`: bool +- `error_message`: optional, if an error +- `hops`: optional, array of URLs +- `cdx`: optional; single CDX row of primary HTML document +- `terminal`: optional; same as ingest result + - `terminal_url` + - `terminal_dt` + - `terminal_status_code` + - `terminal_sha1hex` +- `request`: optional but usually present; ingest request object, verbatim +- `file_meta`: optional; file metadata about primary HTML document +- `html_biblio`: optional; extracted biblio metadata from primary HTML document +- `scope`: optional; detected/guessed scope (fulltext, etc) +- `html_resources`: optional; array of sub-resources. primary HTML is not included +- `html_body`: optional; just the status code and some metadata is passed through; + actual document would go through a different KafkaTopic + - `status`: str + - `agent`: str, eg "trafilatura/0.4" + - `tei_xml`: optional, str + - `word_count`: optional, str + + +## New SQL Tables + +`html_meta` + sha1hex (primary key) + updated (of SQL row) + status + scope + has_teixml + has_thumbnail + word_count (from teixml fulltext) + biblio (JSON) + resources (JSON) + +Also writes to `ingest_file_result`, `file_meta`, and `cdx`, all only for the base HTML document. + +Note: needed to enable postgrest access to this table (for scholar worker). + + +## Fatcat API Wants + +Would be nice to have lookup by SURT+timestamp, and/or by sha1hex of terminal base file. + +`hide` option for cdx rows; also for fileset equivalent. 
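For reference, a rough Python sketch of the `IngestWebResult` object with the fields listed above (field names come from this proposal; the exact types and defaults are assumptions, not the actual sandcrawler class):

```
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IngestWebResult:
    status: str
    hit: bool
    error_message: Optional[str] = None
    hops: Optional[List[str]] = None
    cdx: Optional[dict] = None  # single CDX row of primary HTML document
    terminal: Optional[dict] = None  # terminal_url, terminal_dt, terminal_status_code, terminal_sha1hex
    request: Optional[dict] = None  # ingest request object, verbatim
    file_meta: Optional[dict] = None
    html_biblio: Optional[dict] = None
    scope: Optional[str] = None  # detected/guessed scope (fulltext, etc)
    html_resources: Optional[List[dict]] = None  # sub-resources; primary HTML not included
    html_body: Optional[dict] = None  # status, agent, tei_xml, word_count
```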
+ + +## New Workers + +Could reuse existing worker, have code branch depending on type of ingest. + +ingest file worker + => same as existing worker, because could be calling SPN + +persist result + => same as existing worker; adds persisting various HTML metadata + +persist html text + => talks to seaweedfs + + +## New Kafka Topics + +HTML ingest result topic (webcapture-ish) + +sandcrawler-ENV.html-teixml + JSON wrapping TEI-XML (same as other fulltext topics) + key compaction and content compression enabled + +JSON schema: + +- `key` and `sha1hex`: str; used as kafka key +- `status`: str +- `tei_xml`: str, optional +- `word_count`: int, optional + +## New S3/SeaweedFS Content + +`sandcrawler` bucket, `html` folder, `.tei.xml` suffix. + diff --git a/proposals/20201103_xml_ingest.md b/proposals/20201103_xml_ingest.md new file mode 100644 index 0000000..34e00b0 --- /dev/null +++ b/proposals/20201103_xml_ingest.md @@ -0,0 +1,64 @@ + +status: deployed + +XML Fulltext Ingest +==================== + +This document details changes to include XML fulltext ingest in the same way +that we currently ingest PDF fulltext. + +Currently this will just fetch the single XML document, which is often lacking +figures, tables, and other required files. + +## Text Encoding + +Because we would like to treat XML as a string in a couple contexts, but XML +can have multiple encodings (indicated in an XML header), we are in a bit of a +bind. Simply parsing into unicode and then re-encoding as UTF-8 could result in +a header/content mismatch. Any form of re-encoding will change the hash of the +document. For recording in fatcat, the file metadata will be passed through. +For storing in Kafka and blob store (for downstream analysis), we will parse +the raw XML document (as "bytes") with an XML parser, then re-output with UTF-8 +encoding. The hash of the *original* XML file will be used as the key for +referring to this document. This is unintuitive, but similar to what we are +doing with PDF and HTML documents (extracting in a useful format, but keeping +the original document's hash as a key). + +Unclear if we need to do this re-encode process for XML documents already in +UTF-8 encoding. + +## Ingest Worker + +Could either re-use HTML metadata extractor to fetch XML fulltext links, or +fork that code off to a separate method, like the PDF fulltext URL extractor. + +Hopefully can re-use almost all of the PDF pipeline code, by making that ingest +worker class more generic and subclassing it. + +Result objects are treated the same as PDF ingest results: the result object +has context about status, and if successful, file metadata and CDX row of the +terminal object. + +TODO: should it be assumed that XML fulltext will end up in S3 bucket? or +should there be an `xml_meta` SQL table tracking this, like we have for PDFs +and HTML? + +TODO: should we detect and specify the XML schema better? Eg, indicate if JATS. + + +## Persist Pipeline + +### Kafka Topic + +sandcrawler-ENV.xml-doc + similar to other fulltext topics; JSON wrapping the XML + key compaction, content compression + +### S3/SeaweedFS + +`sandcrawler` bucket, `xml` folder. Extension could depend on sub-type of XML? + +### Persist Worker + +New S3-only worker that pulls from kafka topic and pushes to S3. Works +basically the same as PDF persist in S3-only mode, or like pdf-text worker. 
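As a sketch of the re-encoding approach described under "Text Encoding" (standard library only; the production worker may differ, eg in choice of XML parser):

```
import hashlib
import xml.etree.ElementTree as ET
from typing import Tuple


def reencode_xml(raw: bytes) -> Tuple[str, bytes]:
    # Key by the sha1 of the *original* bytes, matching PDF/HTML handling
    sha1hex = hashlib.sha1(raw).hexdigest()
    # Parse the raw bytes (respecting the declared encoding), then
    # re-serialize as UTF-8 for the Kafka topic and blob store
    root = ET.fromstring(raw)
    utf8_xml = ET.tostring(root, encoding="utf-8")
    return sha1hex, utf8_xml
```

Note that this kind of re-serialization is lossy for comments and DOCTYPE declarations, which is another reason to keep the original file's hash as the canonical key.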
diff --git a/proposals/2020_pdf_meta_thumbnails.md b/proposals/2020_pdf_meta_thumbnails.md index 793d6b5..141ece8 100644 --- a/proposals/2020_pdf_meta_thumbnails.md +++ b/proposals/2020_pdf_meta_thumbnails.md @@ -1,5 +1,5 @@ -status: work-in-progress +status: deployed New PDF derivatives: thumbnails, metadata, raw text =================================================== @@ -133,7 +133,7 @@ Deployment will involve: Plan for processing/catchup is: - test with COVID-19 PDF corpus -- run extraction on all current fatcat files avaiable via IA +- run extraction on all current fatcat files available via IA - integrate with ingest pipeline for all new files - run a batch catchup job over all GROBID-parsed files with no pdf meta extracted, on basis of SQL table query diff --git a/proposals/2020_seaweed_s3.md b/proposals/2020_seaweed_s3.md index 9473cb7..677393b 100644 --- a/proposals/2020_seaweed_s3.md +++ b/proposals/2020_seaweed_s3.md @@ -11,7 +11,7 @@ Problem: minio inserts slowed down after inserting 80M or more objects. Summary: I did four test runs, three failed, one (testrun-4) succeeded. -* [testrun-4](https://git.archive.org/webgroup/sandcrawler/-/blob/martin-seaweed-s3/proposals/2020_seaweed_s3.md#testrun-4) +* [testrun-4](https://git.archive.org/webgroup/sandcrawler/-/blob/master/proposals/2020_seaweed_s3.md#testrun-4) So far, in a non-distributed mode, the project looks usable. Added 200M objects (about 550G) in 6 days. Full CPU load, 400M RAM usage, constant insert times. @@ -54,9 +54,9 @@ on wbgrp-svc170.us.archive.org (4 core E5-2620 v4, 4GB RAM). ## Setup There are frequent [releases](https://github.com/chrislusf/seaweedfs/releases) -but for the test, we used a build off the master branch. +but for the test, we used a build off master branch. -Directions from configuring AWS CLI for seaweedfs: +Directions for configuring AWS CLI for seaweedfs: [https://github.com/chrislusf/seaweedfs/wiki/AWS-CLI-with-SeaweedFS](https://github.com/chrislusf/seaweedfs/wiki/AWS-CLI-with-SeaweedFS). ### Build the binary @@ -79,7 +79,7 @@ a7f8f0b49e6183da06fc2d1411c7a0714a2cc96b A single, 55M binary emerges after a few seconds. The binary contains subcommands to run different parts of seaweed, e.g. master or volume servers, -filer and commands for maintenance tasks, like backup and compact. +filer and commands for maintenance tasks, like backup and compaction. To *deploy*, just copy this binary to the destination. @@ -199,8 +199,8 @@ total size:820752408 file_count:261934 ### Custom S3 benchmark -To simulate the use case of S3 use case for 100-500M small files (grobid xml, -pdftotext, ...), I created a synthetic benchmark. +To simulate the use case of S3 for 100-500M small files (grobid xml, pdftotext, +...), I created a synthetic benchmark. * [https://gist.github.com/miku/6f3fee974ba82083325c2f24c912b47b](https://gist.github.com/miku/6f3fee974ba82083325c2f24c912b47b) @@ -210,8 +210,8 @@ We just try to fill up the datastore with millions of 5k blobs. ### testrun-1 -Small set, just to run. Status: done. Learned that the default in memory volume -index grows too quickly for the 4GB machine. +Small set, just to run. Status: done. Learned that the default in-memory volume +index grows too quickly for the 4GB RAM machine. ``` $ weed server -dir /tmp/martin-seaweedfs-testrun-1 -s3 -volume.max 512 -master.volumeSizeLimitMB 100 @@ -299,7 +299,7 @@ Sustained 400 S3 puts/s, RAM usage 41% of a 4G machine. 56G on disk. * use leveldb, leveldbLarge * try "auto" volumes -* Status: done. Observed: rapid memory usage. 
+* Status: done. Observed: rapid memory usage increase. ``` $ weed server -dir /tmp/martin-seaweedfs-testrun-3 -s3 -volume.max 0 -volume.index=leveldbLarge -filer=false -master.volumeSizeLimitMB 100 @@ -316,7 +316,7 @@ grows very much with the number of volumes. Therefore, keep default volume size and do not limit number of volumes `-volume.max 0` and do not use in-memory index (rather leveldb) -Status: done, 200M object upload via Python script sucessfully in about 6 days, +Status: done, 200M object upload via Python script successfully in about 6 days, memory usage was at a moderate 400M (~10% of RAM). Relatively constant performance at about 400 `PutObject` requests/s (over 5 threads, each thread was around 80 requests/s; then testing with 4 threads, each thread got to @@ -414,6 +414,8 @@ sys 0m0.293s #### Single process random reads +* via [s3read.go](https://gist.github.com/miku/6f3fee974ba82083325c2f24c912b47b#file-s3read-go) + Running 1000 random reads takes 49s. #### Concurrent random reads diff --git a/proposals/2021-04-22_crossref_db.md b/proposals/2021-04-22_crossref_db.md new file mode 100644 index 0000000..1d4c3f8 --- /dev/null +++ b/proposals/2021-04-22_crossref_db.md @@ -0,0 +1,86 @@ + +status: deployed + +Crossref DOI Metadata in Sandcrawler DB +======================================= + +Proposal is to have a local copy of Crossref API metadata records in +sandcrawler DB, accessible by simple key lookup via postgrest. + +Initial goal is to include these in scholar work "bundles" (along with +fulltext, etc), in particular as part of reference extraction pipeline. Around +late 2020, many additional references became available via Crossref records, +and have not been imported (updated) into fatcat. Reference storage in fatcat +API is a scaling problem we would like to put off, so injecting content in this +way is desirable. + +To start, working with a bulk dump made available by Crossref. In the future, +might persist the daily feed to that we have a continuously up-to-date copy. + +Another application of Crossref-in-bundles is to identify overall scale of +changes since initial Crossref metadata import. + + +## Sandcrawler DB Schema + +The "updated" field in this case refers to the upstream timestamp, not the +sandcrawler database update time. + + CREATE TABLE IF NOT EXISTS crossref ( + doi TEXT NOT NULL CHECK (octet_length(doi) >= 4 AND doi = LOWER(doi)), + indexed TIMESTAMP WITH TIME ZONE NOT NULL, + record JSON NOT NULL, + PRIMARY KEY(doi) + ); + +For postgrest access, may need to also: + + GRANT SELECT ON public.crossref TO web_anon; + +## SQL Backfill Command + +For an example file: + + cat sample.json \ + | jq -rc '[(.DOI | ascii_downcase), .indexed."date-time", (. | tostring)] | @tsv' \ + | psql sandcrawler -c "COPY crossref (doi, indexed, record) FROM STDIN (DELIMITER E'\t');" + +For a full snapshot: + + zcat crossref_public_data_file_2021_01.json.gz \ + | pv -l \ + | jq -rc '[(.DOI | ascii_downcase), .indexed."date-time", (. | tostring)] | @tsv' \ + | psql sandcrawler -c "COPY crossref (doi, indexed, record) FROM STDIN (DELIMITER E'\t');" + +jq is the bottleneck (100% of a single CPU core). + +## Kafka Worker + +Pulls from the fatcat crossref ingest Kafka feed and persists into the crossref +table. + +## SQL Table Disk Utilization + +An example backfill from early 2021, with about 120 million Crossref DOI +records. 
+ +Starting database size (with ingest running): + + Filesystem Size Used Avail Use% Mounted on + /dev/vdb1 1.7T 896G 818G 53% /1 + + Size: 475.14G + +Ingest SQL command took: + + 120M 15:06:08 [2.22k/s] + COPY 120684688 + +After database size: + + Filesystem Size Used Avail Use% Mounted on + /dev/vdb1 1.7T 1.2T 498G 71% /1 + + Size: 794.88G + +So about 320 GByte of disk. diff --git a/proposals/2021-09-09_component_ingest.md b/proposals/2021-09-09_component_ingest.md new file mode 100644 index 0000000..09dee4f --- /dev/null +++ b/proposals/2021-09-09_component_ingest.md @@ -0,0 +1,114 @@ + +File Ingest Mode: 'component' +============================= + +A new ingest type for downloading individual files which are a subset of a +complete work. + +Some publishers now assign DOIs to individual figures, supplements, and other +"components" of an over release or document. + +Initial mimetypes to allow: + +- image/jpeg +- image/tiff +- image/png +- image/gif +- audio/mpeg +- video/mp4 +- video/mpeg +- text/plain +- text/csv +- application/json +- application/xml +- application/pdf +- application/gzip +- application/x-bzip +- application/x-bzip2 +- application/zip +- application/x-rar +- application/x-7z-compressed +- application/x-tar +- application/vnd.ms-powerpoint +- application/vnd.ms-excel +- application/msword +- application/vnd.openxmlformats-officedocument.wordprocessingml.document +- application/vnd.openxmlformats-officedocument.spreadsheetml.sheet + +Intentionally not supporting: + +- text/html + + +## Fatcat Changes + +In the file importer, allow the additional mimetypes for 'component' ingest. + + +## Ingest Changes + +Allow additional terminal mimetypes for 'component' crawls. + + +## Examples + +Hundreds of thousands: <https://fatcat.wiki/release/search?q=type%3Acomponent+in_ia%3Afalse> + +#### ACS Supplement File + +<https://doi.org/10.1021/acscatal.0c02627.s002> + +Redirects directly to .zip in browser. SPN is blocked by cookie check. + +#### Frontiers .docx Supplement + +<https://doi.org/10.3389/fpls.2019.01642.s001> + +Redirects to full article page. There is a pop-up for figshare, seems hard to process. + +#### Figshare Single FIle + +<https://doi.org/10.6084/m9.figshare.13646972.v1> + +As 'component' type in fatcat. + +Redirects to a landing page. Dataset ingest seems more appropriate for this entire domain. + +#### PeerJ supplement file + +<https://doi.org/10.7717/peerj.10257/supp-7> + +PeerJ is hard because it redirects to a single HTML page, which has links to +supplements in the HTML. Perhaps a custom extractor will work. + +#### eLife + +<https://doi.org/10.7554/elife.38407.010> + +The current crawl mechanism makes it seemingly impossible to extract a specific +supplement from the document as a whole. + +#### Zookeys + +<https://doi.org/10.3897/zookeys.895.38576.figure53> + +These are extract-able. + +#### OECD PDF Supplement + +<https://doi.org/10.1787/f08c6324-en> +<https://www.oecd-ilibrary.org/trade/imports-of-services-billions-of-us-dollars_f08c6324-en> + +Has an Excel (.xls) link, great, but then paywall. + +#### Direct File Link + +<https://doi.org/10.1787/888934207500> + +This one is also OECD, but is a simple direct download. + +#### Protein Data Base (PDB) Entry + +<https://doi.org/10.2210/pdb6ls2/pdb> + +Multiple files; dataset/fileset more appropriate for these. 
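A sketch of the corresponding allow-list check (mimetypes copied from the list above; the function name is hypothetical):

```
COMPONENT_MIMETYPES = {
    "image/jpeg", "image/tiff", "image/png", "image/gif",
    "audio/mpeg", "video/mp4", "video/mpeg",
    "text/plain", "text/csv",
    "application/json", "application/xml", "application/pdf",
    "application/gzip", "application/x-bzip", "application/x-bzip2",
    "application/zip", "application/x-rar", "application/x-7z-compressed",
    "application/x-tar",
    "application/vnd.ms-powerpoint", "application/vnd.ms-excel",
    "application/msword",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def component_mimetype_allowed(mimetype: str) -> bool:
    # text/html is intentionally excluded for 'component' ingest
    return mimetype in COMPONENT_MIMETYPES
```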
diff --git a/proposals/2021-09-09_fileset_ingest.md b/proposals/2021-09-09_fileset_ingest.md new file mode 100644 index 0000000..65c9ccf --- /dev/null +++ b/proposals/2021-09-09_fileset_ingest.md @@ -0,0 +1,343 @@ + +status: implemented + +Fileset Ingest Pipeline (for Datasets) +====================================== + +Sandcrawler currently has ingest support for individual files saved as `file` +entities in fatcat (xml and pdf ingest types) and HTML files with +sub-components saved as `webcapture` entities in fatcat (html ingest type). + +This document describes extensions to this ingest system to flexibly support +groups of files, which may be represented in fatcat as `fileset` entities. The +main new ingest type is `dataset`. + +Compared to the existing ingest process, there are two major complications with +datasets: + +- the ingest process often requires more than parsing HTML files, and will be + specific to individual platforms and host software packages +- the storage backend and fatcat entity type is flexible: a dataset might be + represented by a single file, multiple files combined in to a single .zip + file, or multiple separate files; the data may get archived in wayback or in + an archive.org item + +The new concepts of "strategy" and "platform" are introduced to accommodate +these complications. + + +## Ingest Strategies + +The ingest strategy describes the fatcat entity type that will be output; the +storage backend used; and whether an enclosing file format is used. The +strategy to use can not be determined until the number and size of files is +known. It is a function of file count, total file size, and publication +platform. + +Strategy names are compact strings with the format +`{storage_backend}-{fatcat_entity}`. A `-bundled` suffix after a `fileset` +entity type indicates that metadata about multiple files is retained, but that +in the storage backend only a single enclosing file (eg, `.zip`) will be +stored. + +The supported strategies are: + +- `web-file`: single file of any type, stored in wayback, represented as fatcat `file` +- `web-fileset`: multiple files of any type, stored in wayback, represented as fatcat `fileset` +- `web-fileset-bundled`: single bundle file, stored in wayback, represented as fatcat `fileset` +- `archiveorg-file`: single file of any type, stored in archive.org item, represented as fatcat `file` +- `archiveorg-fileset`: multiple files of any type, stored in archive.org item, represented as fatcat `fileset` +- `archiveorg-fileset-bundled`: single bundle file, stored in archive.org item, represented as fatcat `fileset` + +"Bundle" or "enclosing" files are things like .zip or .tar.gz. Not all .zip +files are handled as bundles! Only when the transfer from the hosting platform +is via a "download all as .zip" (or similar) do we consider a zipfile a +"bundle" and index the interior files as a fileset. + +The term "bundle file" is used over "archive file" or "container file" to +prevent confusion with the other use of those terms in the context of fatcat +(container entities; archive; Internet Archive as an organization). + +The motivation for supporting both `web` and `archiveorg` is that `web` is +somewhat simpler for small files, but `archiveorg` is better for larger groups +of files (say more than 20) and larger total size (say more than 1 GByte total, +or 128 MByte for any one file). + +The motivation for supporting "bundled" filesets is that there is only a single +file to archive. + + +## Ingest Pseudocode + +1. 
Determine `platform`, which may involve resolving redirects and crawling a landing page. + + a. currently we always crawl the ingest `base_url`, capturing a platform landing page + b. we don't currently handle the case of `base_url` leading to a non-HTML + terminal resource. the `component` ingest type does handle this + +2. Use platform-specific methods to fetch manifest metadata and decide on an `ingest_strategy`. + + a. depending on platform, may include access URLs for multiple strategies + (eg, URL for each file and a bundle URL), metadata about the item for, eg, + archive.org item upload, etc + +3. Use strategy-specific methods to archive all files in platform manifest, and verify manifest metadata. + +4. Summarize status and return structured result metadata. + + a. if the strategy was `web-file` or `archiveorg-file`, potentially submit an + `ingest_file_result` object down the file ingest pipeline (Kafka topic and + later persist and fatcat import workers), with `dataset-file` ingest + type (or `{ingest_type}-file` more generally). + +New python types: + + FilesetManifestFile + path: str + size: Optional[int] + md5: Optional[str] + sha1: Optional[str] + sha256: Optional[str] + mimetype: Optional[str] + extra: Optional[Dict[str, Any]] + + status: Optional[str] + platform_url: Optional[str] + terminal_url: Optional[str] + terminal_dt: Optional[str] + + FilesetPlatformItem + platform_name: str + platform_status: str + platform_domain: Optional[str] + platform_id: Optional[str] + manifest: Optional[List[FilesetManifestFile]] + archiveorg_item_name: Optional[str] + archiveorg_item_meta + web_base_url + web_bundle_url + + ArchiveStrategyResult + ingest_strategy: str + status: str + manifest: List[FilesetManifestFile] + file_file_meta: Optional[dict] + file_terminal: Optional[dict] + file_cdx: Optional[dict] + bundle_file_meta: Optional[dict] + bundle_terminal: Optional[dict] + bundle_cdx: Optional[dict] + bundle_archiveorg_path: Optional[dict] + +New python APIs/classes: + + FilesetPlatformHelper + match_request(request, resource, html_biblio) -> bool + does the request and landing page metadata indicate a match for this platform? + process_request(request, resource, html_biblio) -> FilesetPlatformItem + do API requests, parsing, etc to fetch metadata and access URLs for this fileset/dataset. platform-specific + chose_strategy(item: FilesetPlatformItem) -> IngestStrategy + select an archive strategy for the given fileset/dataset + + FilesetIngestStrategy + check_existing(item: FilesetPlatformItem) -> Optional[ArchiveStrategyResult] + check the given backend for an existing capture/archive; if found, return result + process(item: FilesetPlatformItem) -> ArchiveStrategyResult + perform an actual archival capture + +## Limits and Failure Modes + +- `too-large-size`: total size of the fileset is too large for archiving. + initial limit is 64 GBytes, controlled by `max_total_size` parameter. +- `too-many-files`: number of files (and thus file-level metadata) is too + large. initial limit is 200, controlled by `max_file_count` parameter. +- `platform-scope / FilesetPlatformScopeError`: for when `base_url` leads to a + valid platform, which could be found via API or parsing, but has the wrong + scope. Eg, tried to fetch a dataset, but got a DOI which represents all + versions of the dataset, not a specific version. 
+- `platform-restricted`/`PlatformRestrictedError`: for, eg, embargoes +- `platform-404`: got to a landing page, and seemed like in-scope, but no + platform record found anyways + + +## New Sandcrawler Code and Worker + + sandcrawler-ingest-fileset-worker@{1..6} (or up to 1..12 later) + +Worker consumes from ingest request topic, produces to fileset ingest results, +and optionally produces to file ingest results. + + sandcrawler-persist-ingest-fileset-worker@1 + +Simply writes fileset ingest rows to SQL. + + +## New Fatcat Worker and Code Changes + + fatcat-import-ingest-fileset-worker + +This importer is modeled on file and web worker. Filters for `success` with +strategy of `*-fileset*`. + +Existing `fatcat-import-ingest-file-worker` should be updated to allow +`dataset` single-file imports, with largely same behavior and semantics as +current importer (`component` mode). + +Existing fatcat transforms, and possibly even elasticsearch schemas, should be +updated to include fileset status and `in_ia` flag for dataset type releases. + +Existing entity updates worker submits `dataset` type ingests to ingest request +topic. + + +## Ingest Result Schema + +Common with file results, and mostly relating to landing page HTML: + + hit: bool + status: str + success + success-existing + success-file (for `web-file` or `archiveorg-file` only) + request: object + terminal: object + file_meta: object + cdx: object + revisit_cdx: object + html_biblio: object + +Additional fileset-specific fields: + + manifest: list of objects + platform_name: str + platform_domain: str + platform_id: str + platform_base_url: str + ingest_strategy: str + archiveorg_item_name: str (optional, only for `archiveorg-*` strategies) + file_count: int + total_size: int + fileset_bundle (optional, only for `*-fileset-bundle` strategy) + file_meta + cdx + revisit_cdx + terminal + archiveorg_bundle_path + fileset_file (optional, only for `*-file` strategy) + file_meta + terminal + cdx + revisit_cdx + +If the strategy was `web-file` or `archiveorg-file` and the status is +`success-file`, then an ingest file result will also be published to +`sandcrawler-ENV.ingest-file-results`, using the same ingest type and fields as +regular ingest. + + +All fileset ingest results get published to ingest-fileset-result. + +Existing sandcrawler persist workers also subscribe to this topic and persist +status and landing page terminal info to tables just like with file ingest. +GROBID, HTML, and other metadata is not persisted in this path. + +If the ingest strategy was a single file (`*-file`), then an ingest file is +also published to the ingest-file-result topic, with the `fileset_file` +metadata, and ingest type `dataset-file`. This should only happen on success +condition. + + +## New SQL Tables + +Note that this table *complements* `ingest_file_result`, doesn't replace it. +`ingest_file_result` could more accurately be called `ingest_result`. 
+ + CREATE TABLE IF NOT EXISTS ingest_fileset_platform ( + ingest_type TEXT NOT NULL CHECK (octet_length(ingest_type) >= 1), + base_url TEXT NOT NULL CHECK (octet_length(base_url) >= 1), + updated TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL, + hit BOOLEAN NOT NULL, + status TEXT CHECK (octet_length(status) >= 1), + + platform_name TEXT NOT NULL CHECK (octet_length(platform_name) >= 1), + platform_domain TEXT NOT NULL CHECK (octet_length(platform_domain) >= 1), + platform_id TEXT NOT NULL CHECK (octet_length(platform_id) >= 1), + ingest_strategy TEXT CHECK (octet_length(ingest_strategy) >= 1), + total_size BIGINT, + file_count BIGINT, + archiveorg_item_name TEXT CHECK (octet_length(archiveorg_item_name) >= 1), + + archiveorg_item_bundle_path TEXT CHECK (octet_length(archiveorg_item_bundle_path) >= 1), + web_bundle_url TEXT CHECK (octet_length(web_bundle_url) >= 1), + web_bundle_dt TEXT CHECK (octet_length(web_bundle_dt) = 14), + + manifest JSONB, + -- list, similar to fatcat fileset manifest, plus extra: + -- status (str) + -- path (str) + -- size (int) + -- md5 (str) + -- sha1 (str) + -- sha256 (str) + -- mimetype (str) + -- extra (dict) + -- platform_url (str) + -- terminal_url (str) + -- terminal_dt (str) + + PRIMARY KEY (ingest_type, base_url) + ); + CREATE INDEX ingest_fileset_platform_name_domain_id_idx ON ingest_fileset_platform(platform_name, platform_domain, platform_id); + +Persist worker should only insert in to this table if `platform_name` is +identified. + +## New Kafka Topic + + sandcrawler-ENV.ingest-fileset-results 6x, no retention limit + + +## Implementation Plan + +First implement ingest worker, including platform and strategy helpers, and +test those as simple stdin/stdout CLI tools in sandcrawler repo to validate +this proposal. + +Second implement fatcat importer and test locally and/or in QA. + +Lastly implement infrastructure, automation, and other "glue": + +- SQL schema +- persist worker + + +## Design Note: Single-File Datasets + +Should datasets and other groups of files which only contain a single file get +imported as a fatcat `file` or `fileset`? This can be broken down further as +documents (single PDF) vs other individual files. + +Advantages of `file`: + +- handles case of article PDFs being marked as dataset accidentally +- `file` entities get de-duplicated with simple lookup (eg, on `sha1`) +- conceptually simpler if individual files are `file` entity +- easier to download individual files + +Advantages of `fileset`: + +- conceptually simpler if all `dataset` entities have `fileset` form factor +- code path is simpler: one fewer strategy, and less complexity of sending + files down separate import path +- metadata about platform is retained +- would require no modification of existing fatcat file importer +- fatcat import of archive.org of `file` is not actually implemented yet? + +Decision is to do individual files. Fatcat fileset import worker should reject +single-file (and empty) manifest filesets. Fatcat file import worker should +accept all mimetypes for `dataset-file` (similar to `component`). + + +## Example Entities + +See `notes/dataset_examples.txt` diff --git a/proposals/2021-09-13_src_ingest.md b/proposals/2021-09-13_src_ingest.md new file mode 100644 index 0000000..470827a --- /dev/null +++ b/proposals/2021-09-13_src_ingest.md @@ -0,0 +1,53 @@ + +File Ingest Mode: 'src' +======================= + +Ingest type for "source" of works in document form. 
For example, tarballs of +LaTeX source and figures, as published on arxiv.org and Pubmed Central. + +For now, presumption is that this would be a single file (`file` entity in +fatcat). + +Initial mimetypes to allow: + +- text/x-tex +- application/xml +- application/gzip +- application/x-bzip +- application/x-bzip2 +- application/zip +- application/x-tar +- application/msword +- application/vnd.openxmlformats-officedocument.wordprocessingml.document + + +## Fatcat Changes + +In the file importer, allow the additional mimetypes for 'src' ingest. + +Might keep ingest disabled on the fatcat side, at least initially. Eg, until +there is some scope of "file scope", or other ways of treating 'src' tarballs +separate from PDFs or other fulltext formats. + + +## Ingest Changes + +Allow additional terminal mimetypes for 'src' crawls. + + +## Examples + + arxiv:2109.00954v1 + fatcat:release_akzp2lgqjbcbhpoeoitsj5k5hy + https://arxiv.org/format/2109.00954v1 + https://arxiv.org/e-print/2109.00954v1 + + arxiv:1912.03397v2 + https://arxiv.org/format/1912.03397v2 + https://arxiv.org/e-print/1912.03397v2 + NOT: https://arxiv.org/pdf/1912.03397v2 + + pmcid:PMC3767916 + https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/08/03/PMC3767916.tar.gz + +For PMC, will need to use one of the .csv file lists to get the digit prefixes. diff --git a/proposals/2021-09-21_spn_accounts.md b/proposals/2021-09-21_spn_accounts.md new file mode 100644 index 0000000..e41c162 --- /dev/null +++ b/proposals/2021-09-21_spn_accounts.md @@ -0,0 +1,14 @@ + +Formalization of SPNv2 API requests from fatcat/sandcrawler + +Create two new system accounts, one for regular/daily ingest requests, one for +priority requests (save-paper-now or as a flag with things like fatcat-ingest; +"interactive"). These accounts should have @archive.org emails. Request the +daily one to have the current rate limit as bnewbold@archive.org account; the +priority queue can have less. + +Create new ingest kafka queues from scratch, one for priority and one for +regular. Chose sizes carefully, probably keep 24x for the regular and do 6x or +so (small) for priority queue. + +Deploy new priority workers; reconfigure/deploy broadly. diff --git a/proposals/2021-10-28_grobid_refs.md b/proposals/2021-10-28_grobid_refs.md new file mode 100644 index 0000000..1fc79b6 --- /dev/null +++ b/proposals/2021-10-28_grobid_refs.md @@ -0,0 +1,125 @@ + +GROBID References in Sandcrawler DB +=================================== + +Want to start processing "unstructured" raw references coming from upstream +metadata sources (distinct from upstream fulltext sources, like PDFs or JATS +XML), and save the results in sandcrawler DB. From there, they will get pulled +in to fatcat-scholar "intermediate bundles" and included in reference exports. + +The initial use case for this is to parse "unstructured" references deposited +in Crossref, and include them in refcat. + + +## Schema and Semantics + +The output JSON/dict schema for parsed references follows that of +`grobid_tei_xml` version 0.1.x, for the `GrobidBiblio` field. The +`unstructured` field that was parsed is included in the output, though it may +not be byte-for-byte exact (see below). One notable change from the past (eg, +older GROBID-parsed references) is that author `name` is now `full_name`. New +fields include `editors` (same schema as `authors`), `book_title`, and +`series_title`. + +The overall output schema matches that of the `grobid_refs` SQL table: + + source: string, lower-case. 
eg 'crossref' + source_id: string, eg '10.1145/3366650.3366668' + source_ts: optional timestamp (full ISO datetime with timezone (eg, `Z` + suffix), which identifies version of upstream metadata + refs_json: JSON, list of `GrobidBiblio` JSON objects + +References are re-processed on a per-article (or per-release) basis. All the +references for an article are handled as a batch and output as a batch. If +there are no upstream references, row with `ref_json` as empty list may be +returned. + +Not all upstream references get re-parsed, even if an 'unstructured' field is +available. If 'unstructured' is not available, no row is ever output. For +example, if a reference includes `unstructured` (raw citation string), but also +has structured metadata for authors, title, year, and journal name, we might +not re-parse the `unstructured` string. Whether to re-parse is evaulated on a +per-reference basis. This behavior may change over time. + +`unstructured` strings may be pre-processed before being submitted to GROBID. +This is because many sources have systemic encoding issues. GROBID itself may +also do some modification of the input citation string before returning it in +the output. This means the `unstructured` string is not a reliable way to map +between specific upstream references and parsed references. Instead, the `id` +field (str) of `GrobidBiblio` gets set to any upstream "key" or "index" +identifier used to track individual references. If there is only a numeric +index, the `id` is that number as a string. + +The `key` or `id` may need to be woven back in to the ref objects manually, +because GROBID `processCitationList` takes just a list of raw strings, with no +attached reference-level key or id. + + +## New SQL Table and View + +We may want to do re-parsing of references from sources other than `crossref`, +so there is a generic `grobid_refs` table. But it is also common to fetch both +the crossref metadata and any re-parsed references together, so as a convenience +there is a PostgreSQL view (virtual table) that includes both a crossref +metadata record and parsed citations, if available. If downstream code cares a +lot about having the refs and record be in sync, the `source_ts` field on +`grobid_refs` can be matched against the `indexed` column of `crossref` (or the +`.indexed.date-time` JSON field in the record itself). + +Remember that DOIs should always be lower-cased before querying, inserting, +comparing, etc. + + CREATE TABLE IF NOT EXISTS grobid_refs ( + source TEXT NOT NULL CHECK (octet_length(source) >= 1), + source_id TEXT NOT NULL CHECK (octet_length(source_id) >= 1), + source_ts TIMESTAMP WITH TIME ZONE, + updated TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL, + refs_json JSON NOT NULL, + PRIMARY KEY(source, source_id) + ); + + CREATE OR REPLACE VIEW crossref_with_refs (doi, indexed, record, source_ts, refs_json) AS + SELECT + crossref.doi as doi, + crossref.indexed as indexed, + crossref.record as record, + grobid_refs.source_ts as source_ts, + grobid_refs.refs_json as refs_json + FROM crossref + LEFT JOIN grobid_refs ON + grobid_refs.source_id = crossref.doi + AND grobid_refs.source = 'crossref'; + +Both `grobid_refs` and `crossref_with_refs` will be exposed through postgrest. + + +## New Workers / Tools + +For simplicity, to start, a single worker with consume from +`fatcat-prod.api-crossref`, process citations with GROBID (if necessary), and +insert to both `crossref` and `grobid_refs` tables. This worker will run +locally on the machine with sandcrawler-db. 
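As a rough illustration of re-attaching reference keys after parsing (since `processCitationList` only takes raw strings), something along these lines; the `grobid_client` call and field handling are assumptions, not the actual implementation:

```
def parse_crossref_refs(record, grobid_client):
    # Select only references that carry an 'unstructured' field; whether to
    # re-parse refs that also have structured metadata is decided per-ref
    # (simplified away here)
    unstructured = [r for r in record.get("reference", []) if r.get("unstructured")]
    parsed = grobid_client.parse_citation_list([r["unstructured"] for r in unstructured])
    assert len(parsed) == len(unstructured)
    for i, (upstream, biblio) in enumerate(zip(unstructured, parsed)):
        # weave the upstream 'key' (or the numeric index, as a string) back
        # in as the GrobidBiblio 'id'
        biblio["id"] = upstream.get("key") or str(i)
    return parsed
```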
+ +Another tool will support taking large chunks of Crossref JSON (as lines), +filter them, process with GROBID, and print JSON to stdout, in the +`grobid_refs` JSON schema. + + +## Task Examples + +Command to process crossref records with refs tool: + + cat crossref_sample.json \ + | parallel -j5 --linebuffer --round-robin --pipe ./grobid_tool.py parse-crossref-refs - \ + | pv -l \ + > crossref_sample.parsed.json + + # => 10.0k 0:00:27 [ 368 /s] + +Load directly in to postgres (after tables have been created): + + cat crossref_sample.parsed.json \ + | jq -rc '[.source, .source_id, .source_ts, (.refs_json | tostring)] | @tsv' \ + | psql sandcrawler -c "COPY grobid_refs (source, source_id, source_ts, refs_json) FROM STDIN (DELIMITER E'\t');" + + # => COPY 9999 diff --git a/proposals/2021-12-09_trawling.md b/proposals/2021-12-09_trawling.md new file mode 100644 index 0000000..33b6b4c --- /dev/null +++ b/proposals/2021-12-09_trawling.md @@ -0,0 +1,180 @@ + +status: work-in-progress + +NOTE: as of December 2022, the implementation on these features haven't been +merged to the main branch. Development stalled in December 2021. + +Trawling for Unstructured Scholarly Web Content +=============================================== + +## Background and Motivation + +A long-term goal for sandcrawler has been the ability to pick through +unstructured web archive content (or even non-web collection), identify +potential in-scope research outputs, extract metadata for those outputs, and +merge the content in to a catalog (fatcat). + +This process requires integration of many existing tools (HTML and PDF +extraction; fuzzy bibliographic metadata matching; machine learning to identify +in-scope content; etc), as well as high-level curration, targetting, and +evaluation by human operators. The goal is to augment and improve the +productivity of human operators as much as possible. + +This process will be similar to "ingest", which is where we start with a +specific URL and have some additional context about the expected result (eg, +content type, exernal identifier). Some differences with trawling are that we +are start with a collection or context (instead of single URL); have little or +no context about the content we are looking for; and may even be creating a new +catalog entry, as opposed to matching to a known existing entry. + + +## Architecture + +The core operation is to take a resource and run a flowchart of processing +steps on it, resulting in an overall status and possible related metadata. The +common case is that the resource is a PDF or HTML coming from wayback (with +contextual metadata about the capture), but we should be flexible to supporting +more content types in the future, and should try to support plain files with no +context as well. + +Some relatively simple wrapper code handles fetching resources and summarizing +status/counts. + +Outside of the scope of sandcrawler, new fatcat code (importer or similar) will +be needed to handle trawl results. It will probably make sense to pre-filter +(with `jq` or `rg`) before passing results to fatcat. + +At this stage, trawl workers will probably be run manually. Some successful +outputs (like GROBID, HTML metadata) would be written to existing kafka topics +to be persisted, but there would not be any specific `trawl` SQL tables or +automation. 
+ +It will probably be helpful to have some kind of wrapper script that can run +sandcrawler trawl processes, then filter and pipe the output into fatcat +importer, all from a single invocation, while reporting results. + +TODO: +- for HTML imports, do we fetch the full webcapture stuff and return that? + + +## Methods of Operation + +### `cdx_file` + +An existing CDX file is provided on-disk locally. + +### `cdx_api` + +Simplified variants: `cdx_domain`, `cdx_surt` + +Uses CDX API to download records matching the configured filters, then processes the file. + +Saves the CDX file intermediate result somewhere locally (working or tmp +directory), with timestamp in the path, to make re-trying with `cdx_file` fast +and easy. + + +### `archiveorg_web_collection` + +Uses `cdx_collection.py` (or similar) to fetch a full CDX list, by iterating over +then process it. + +Saves the CDX file intermediate result somewhere locally (working or tmp +directory), with timestamp in the path, to make re-trying with `cdx_file` fast +and easy. + +### Others + +- `archiveorg_file_collection`: fetch file list via archive.org metadata, then processes each + +## Schema + +Per-resource results: + + hit (bool) + indicates whether resource seems in scope and was processed successfully + (roughly, status 'success', and + status (str) + success: fetched resource, ran processing, pa + skip-cdx: filtered before even fetching resource + skip-resource: filtered after fetching resource + wayback-error (etc): problem fetching + content_scope (str) + filtered-{filtertype} + article (etc) + landing-page + resource_type (str) + pdf, html + file_meta{} + cdx{} + revisit_cdx{} + + # below are resource_type specific + grobid + pdf_meta + pdf_trio + html_biblio + (other heuristics and ML) + +High-level request: + + trawl_method: str + cdx_file_path + default_filters: bool + resource_filters[] + scope: str + surt_prefix, domain, host, mimetype, size, datetime, resource_type, http_status + value: any + values[]: any + min: any + max: any + biblio_context{}: set of expected/default values + container_id + release_type + release_stage + url_rel + +High-level summary / results: + + status + request{}: the entire request object + counts + total_resources + status{} + content_scope{} + resource_type{} + +## Example Corpuses + +All PDFs (`application/pdf`) in web.archive.org from before the year 2000. +Starting point would be a CDX list. + +Spidering crawls starting from a set of OA journal homepage URLs. + +Archive-It partner collections from research universities, particularly of +their own .edu domains. Starting point would be an archive.org collection, from +which WARC files or CDX lists can be accessed. + +General archive.org PDF collections, such as +[ERIC](https://archive.org/details/ericarchive) or +[Document Cloud](https://archive.org/details/documentcloud). + +Specific Journal or Publisher URL patterns. Starting point could be a domain, +hostname, SURT prefix, and/or URL regex. + +Heuristic patterns over full web.archive.org CDX index. For example, .edu +domains with user directories and a `.pdf` in the file path ("tilde" username +pattern). + +Random samples of entire Wayback corpus. For example, random samples filtered +by date, content type, TLD, etc. This would be true "trawling" over the entire +corpus. + + +## Other Ideas + +Could have a web archive spidering mode: starting from a seed, fetch multiple +captures (different captures), then extract outlinks from those, up to some +number of hops. 
An example application would be links to research group +webpages or author homepages, and to try to extract PDF links from CVs, etc. + diff --git a/proposals/brainstorm/2021-debug_web_interface.md b/proposals/brainstorm/2021-debug_web_interface.md new file mode 100644 index 0000000..442b439 --- /dev/null +++ b/proposals/brainstorm/2021-debug_web_interface.md @@ -0,0 +1,9 @@ + +status: brainstorm idea + +Simple internal-only web interface to help debug ingest issues. + +- paste a hash, URL, or identifier and get a display of "everything we know" about it +- enter a URL/SURT prefix and get aggregate stats (?) +- enter a domain/host/prefix and get recent attempts/results +- pre-computed periodic reports on ingest pipeline (?) diff --git a/proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md b/proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md new file mode 100644 index 0000000..b3ad447 --- /dev/null +++ b/proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md @@ -0,0 +1,36 @@ + +status: brainstorming + +We continue to see issues with SPNv2-based crawling. Would like to have an +option to switch to higher-throughput heritrix3-based crawling. + +SPNv2 path would stick around at least for save-paper-now style ingest. + + +## Sketch + +Ingest requests are created continuously by fatcat, with daily spikes. + +Ingest workers run mostly in "bulk" mode, aka they don't make SPNv2 calls. +`no-capture` responses are recorded in the sandcrawler SQL database. + +Periodically (daily?), a script queries for new no-capture results, filtered to +the most recent period. These are processed a bit in to a URL list, then +converted to a heritrix frontier, and sent to crawlers. This could either be an +h3 instance (?), or simple `scp` to a running crawl directory. + +The crawler crawls, with the usual landing page config, and draintasker runs. + +TODO: can we have draintasker/heritrix set a maximum WARC life? Like 6 hours? +Or, target a smaller draintasker item size, so they get updated more frequently. + +Another SQL script dumps ingest requests from the *previous* period, and +re-submits them for bulk-style ingest (by workers). + +The end result would be things getting crawled and updated within a couple +days. + + +## Sketch 2 + +Upload URL list to petabox item, wait for heritrix derive to run (!)
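To make the periodic query step in the first sketch concrete, a rough sketch of the kind of dump script implied (the `ingest_file_result` table and `terminal_url` column come from earlier proposals; the `updated` filter, interval, and output format are assumptions):

```
import datetime

import psycopg2


def dump_no_capture_urls(db_dsn, days=1, out_path="frontier_urls.txt"):
    # Dump recent `no-capture` terminal URLs as a plain URL list, which would
    # then be converted to a heritrix frontier and shipped to the crawlers
    since = datetime.datetime.utcnow() - datetime.timedelta(days=days)
    conn = psycopg2.connect(db_dsn)
    with conn.cursor() as cur, open(out_path, "w") as out:
        cur.execute(
            """
            SELECT DISTINCT terminal_url
            FROM ingest_file_result
            WHERE status = 'no-capture'
              AND updated >= %s
              AND terminal_url IS NOT NULL
            """,
            (since,),
        )
        for (url,) in cur:
            out.write(url + "\n")
    conn.close()
```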