From 5495d2ba4c92cf3ea3f1c31efe9ca670f6900047 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Mon, 13 Sep 2021 19:33:08 -0700
Subject: ingest: basic 'component' and 'src' support

---
 proposals/2021-09-09_component_ingest.md | 114 +++++++++++++++++++++++++++++++
 1 file changed, 114 insertions(+)
 create mode 100644 proposals/2021-09-09_component_ingest.md

(limited to 'proposals/2021-09-09_component_ingest.md')

diff --git a/proposals/2021-09-09_component_ingest.md b/proposals/2021-09-09_component_ingest.md
new file mode 100644
index 0000000..09dee4f
--- /dev/null
+++ b/proposals/2021-09-09_component_ingest.md
@@ -0,0 +1,114 @@
+
+File Ingest Mode: 'component'
+=============================
+
+A new ingest type for downloading individual files which are a subset of a
+complete work.
+
+Some publishers now assign DOIs to individual figures, supplements, and other
+"components" of an overall release or document.
+
+Initial mimetypes to allow:
+
+- image/jpeg
+- image/tiff
+- image/png
+- image/gif
+- audio/mpeg
+- video/mp4
+- video/mpeg
+- text/plain
+- text/csv
+- application/json
+- application/xml
+- application/pdf
+- application/gzip
+- application/x-bzip
+- application/x-bzip2
+- application/zip
+- application/x-rar
+- application/x-7z-compressed
+- application/x-tar
+- application/vnd.ms-powerpoint
+- application/vnd.ms-excel
+- application/msword
+- application/vnd.openxmlformats-officedocument.wordprocessingml.document
+- application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
+
+Intentionally not supporting:
+
+- text/html
+
+
+## Fatcat Changes
+
+In the file importer, allow the additional mimetypes for 'component' ingest.
+
+
+## Ingest Changes
+
+Allow additional terminal mimetypes for 'component' crawls.
+
+
+## Examples
+
+Hundreds of thousands:
+
+#### ACS Supplement File
+
+
+
+Redirects directly to .zip in browser. SPN is blocked by cookie check.
+
+#### Frontiers .docx Supplement
+
+
+
+Redirects to full article page. There is a pop-up for figshare, which seems hard to process.
+
+#### Figshare Single File
+
+
+
+As 'component' type in fatcat.
+
+Redirects to a landing page. Dataset ingest seems more appropriate for this entire domain.
+
+#### PeerJ supplement file
+
+
+
+PeerJ is hard because it redirects to a single HTML page, which has links to
+supplements in the HTML. Perhaps a custom extractor will work.
+
+#### eLife
+
+
+
+The current crawl mechanism makes it seemingly impossible to extract a specific
+supplement from the document as a whole.
+
+#### Zookeys
+
+
+
+These are extractable.
+
+#### OECD PDF Supplement
+
+
+
+
+Has an Excel (.xls) link, which is great, but it then hits a paywall.
+
+#### Direct File Link
+
+
+
+This one is also OECD, but is a simple direct download.
+
+#### Protein Data Bank (PDB) Entry
+
+
+
+Multiple files; dataset/fileset ingest is more appropriate for these.
-- 
cgit v1.2.3
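
The "Initial mimetypes to allow" list in the proposal above could be enforced with a check along the following lines. This is only a minimal sketch: the names `COMPONENT_MIMETYPES` and `is_allowed_component_mimetype` are hypothetical and are not taken from the fatcat or ingest code, and stripping parameters such as `; charset=utf-8` before comparing is an assumption about how mimetypes arrive, not something the proposal specifies.

```python
# Hypothetical sketch: these names are illustrative and do not come from
# the fatcat importer or the ingest worker.
COMPONENT_MIMETYPES = frozenset([
    "image/jpeg",
    "image/tiff",
    "image/png",
    "image/gif",
    "audio/mpeg",
    "video/mp4",
    "video/mpeg",
    "text/plain",
    "text/csv",
    "application/json",
    "application/xml",
    "application/pdf",
    "application/gzip",
    "application/x-bzip",
    "application/x-bzip2",
    "application/zip",
    "application/x-rar",
    "application/x-7z-compressed",
    "application/x-tar",
    "application/vnd.ms-powerpoint",
    "application/vnd.ms-excel",
    "application/msword",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
])


def is_allowed_component_mimetype(mimetype: str) -> bool:
    """Return True if the mimetype is on the proposed 'component' allowlist."""
    # Assumption: drop any parameters (e.g. "; charset=utf-8") and
    # normalize case before comparing against the allowlist.
    base = mimetype.split(";", 1)[0].strip().lower()
    return base in COMPONENT_MIMETYPES
```

Because `text/html` is intentionally left off the list, a landing page returned in place of the file would be rejected by this check rather than stored as a component.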