author | Bryan Newbold <bnewbold@archive.org> | 2021-09-13 19:33:08 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2021-10-04 13:02:08 -0700 |
commit | 5495d2ba4c92cf3ea3f1c31efe9ca670f6900047 (patch) | |
tree | 3c8c37cdbd03fb31aabce0397d894baa81fcd899 /proposals | |
parent | a613cf2fa66e59c412b9de15e487ab5d3431bb51 (diff) | |
ingest: basic 'component' and 'src' support
Diffstat (limited to 'proposals')
-rw-r--r-- | proposals/2021-09-09_component_ingest.md | 114 |
-rw-r--r-- | proposals/2021-09-13_src_ingest.md | 53 |
2 files changed, 167 insertions, 0 deletions
diff --git a/proposals/2021-09-09_component_ingest.md b/proposals/2021-09-09_component_ingest.md
new file mode 100644
index 0000000..09dee4f
--- /dev/null
+++ b/proposals/2021-09-09_component_ingest.md
@@ -0,0 +1,114 @@
+
+File Ingest Mode: 'component'
+=============================
+
+A new ingest type for downloading individual files which are a subset of a
+complete work.
+
+Some publishers now assign DOIs to individual figures, supplements, and other
+"components" of an overall release or document.
+
+Initial mimetypes to allow:
+
+- image/jpeg
+- image/tiff
+- image/png
+- image/gif
+- audio/mpeg
+- video/mp4
+- video/mpeg
+- text/plain
+- text/csv
+- application/json
+- application/xml
+- application/pdf
+- application/gzip
+- application/x-bzip
+- application/x-bzip2
+- application/zip
+- application/x-rar
+- application/x-7z-compressed
+- application/x-tar
+- application/vnd.ms-powerpoint
+- application/vnd.ms-excel
+- application/msword
+- application/vnd.openxmlformats-officedocument.wordprocessingml.document
+- application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
+
+Intentionally not supporting:
+
+- text/html
+
+
+## Fatcat Changes
+
+In the file importer, allow the additional mimetypes for 'component' ingest.
+
+
+## Ingest Changes
+
+Allow additional terminal mimetypes for 'component' crawls.
+
+
+## Examples
+
+Hundreds of thousands: <https://fatcat.wiki/release/search?q=type%3Acomponent+in_ia%3Afalse>
+
+#### ACS Supplement File
+
+<https://doi.org/10.1021/acscatal.0c02627.s002>
+
+Redirects directly to .zip in browser. SPN is blocked by cookie check.
+
+#### Frontiers .docx Supplement
+
+<https://doi.org/10.3389/fpls.2019.01642.s001>
+
+Redirects to the full article page. There is a pop-up for figshare, which seems hard to process.
+
+#### Figshare Single File
+
+<https://doi.org/10.6084/m9.figshare.13646972.v1>
+
+As 'component' type in fatcat.
+
+Redirects to a landing page. Dataset ingest seems more appropriate for this entire domain.
+
+#### PeerJ supplement file
+
+<https://doi.org/10.7717/peerj.10257/supp-7>
+
+PeerJ is hard because it redirects to a single HTML page, which has links to
+supplements in the HTML. Perhaps a custom extractor will work.
+
+#### eLife
+
+<https://doi.org/10.7554/elife.38407.010>
+
+The current crawl mechanism makes it seemingly impossible to extract a specific
+supplement from the document as a whole.
+
+#### Zookeys
+
+<https://doi.org/10.3897/zookeys.895.38576.figure53>
+
+These are extractable.
+
+#### OECD PDF Supplement
+
+<https://doi.org/10.1787/f08c6324-en>
+<https://www.oecd-ilibrary.org/trade/imports-of-services-billions-of-us-dollars_f08c6324-en>
+
+Has an Excel (.xls) link, but it leads to a paywall.
+
+#### Direct File Link
+
+<https://doi.org/10.1787/888934207500>
+
+This one is also OECD, but is a simple direct download.
+
+#### Protein Data Bank (PDB) Entry
+
+<https://doi.org/10.2210/pdb6ls2/pdb>
+
+Multiple files; dataset/fileset more appropriate for these.
diff --git a/proposals/2021-09-13_src_ingest.md b/proposals/2021-09-13_src_ingest.md
new file mode 100644
index 0000000..470827a
--- /dev/null
+++ b/proposals/2021-09-13_src_ingest.md
@@ -0,0 +1,53 @@
+
+File Ingest Mode: 'src'
+=======================
+
+Ingest type for "source" of works in document form. For example, tarballs of
+LaTeX source and figures, as published on arxiv.org and Pubmed Central.
+
+For now, the presumption is that this would be a single file (`file` entity in
+fatcat).
+
+Initial mimetypes to allow:
+
+- text/x-tex
+- application/xml
+- application/gzip
+- application/x-bzip
+- application/x-bzip2
+- application/zip
+- application/x-tar
+- application/msword
+- application/vnd.openxmlformats-officedocument.wordprocessingml.document
+
+
+## Fatcat Changes
+
+In the file importer, allow the additional mimetypes for 'src' ingest.
+
+Might keep ingest disabled on the fatcat side, at least initially. E.g., until
+there is some notion of "file scope", or other ways of treating 'src' tarballs
+separate from PDFs or other fulltext formats.
+
+
+## Ingest Changes
+
+Allow additional terminal mimetypes for 'src' crawls.
+
+
+## Examples
+
+    arxiv:2109.00954v1
+    fatcat:release_akzp2lgqjbcbhpoeoitsj5k5hy
+    https://arxiv.org/format/2109.00954v1
+    https://arxiv.org/e-print/2109.00954v1
+
+    arxiv:1912.03397v2
+    https://arxiv.org/format/1912.03397v2
+    https://arxiv.org/e-print/1912.03397v2
+    NOT: https://arxiv.org/pdf/1912.03397v2
+
+    pmcid:PMC3767916
+    https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/08/03/PMC3767916.tar.gz
+
+For PMC, will need to use one of the .csv file lists to get the digit prefixes.
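
Both proposals come down to "allow additional terminal mimetypes" for a given ingest type. A minimal sketch of such a check follows, assuming a per-type allowlist lookup. The mimetype sets are copied from the two proposals; the names `ALLOWED_BY_INGEST_TYPE` and `is_allowed_terminal_mimetype` are hypothetical and are not taken from the sandcrawler or fatcat code.

```python
# Allowlists copied from the 'component' and 'src' proposals.
# All identifiers below are illustrative, not the project's actual names.
COMPONENT_MIMETYPES = frozenset({
    "image/jpeg", "image/tiff", "image/png", "image/gif",
    "audio/mpeg", "video/mp4", "video/mpeg",
    "text/plain", "text/csv",
    "application/json", "application/xml", "application/pdf",
    "application/gzip", "application/x-bzip", "application/x-bzip2",
    "application/zip", "application/x-rar", "application/x-7z-compressed",
    "application/x-tar",
    "application/vnd.ms-powerpoint", "application/vnd.ms-excel",
    "application/msword",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
})

SRC_MIMETYPES = frozenset({
    "text/x-tex", "application/xml",
    "application/gzip", "application/x-bzip", "application/x-bzip2",
    "application/zip", "application/x-tar",
    "application/msword",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
})

ALLOWED_BY_INGEST_TYPE = {
    "component": COMPONENT_MIMETYPES,
    "src": SRC_MIMETYPES,
}


def is_allowed_terminal_mimetype(ingest_type: str, mimetype: str) -> bool:
    """Return True if `mimetype` is on the allowlist for `ingest_type`."""
    allowed = ALLOWED_BY_INGEST_TYPE.get(ingest_type)
    if allowed is None:
        # Ingest types not covered by these proposals are rejected here.
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    base = mimetype.split(";", 1)[0].strip().lower()
    return base in allowed
```

With this sketch, `text/html` is rejected for 'component' because it is intentionally absent from the allowlist, and `application/pdf` is rejected for 'src'.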
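
The note about PMC file lists can be sketched as a lookup table from PMCID to package URL. This is only an illustration under stated assumptions: the column names `File` and `Accession ID` are assumed defaults and should be checked against the actual .csv file list, and `build_pmc_package_index` is a hypothetical helper. The one sample row reuses the path from the example above.

```python
import csv
import io
from typing import Dict, Iterable

PMC_FTP_BASE = "https://ftp.ncbi.nlm.nih.gov/pub/pmc/"


def build_pmc_package_index(
    csv_lines: Iterable[str],
    path_column: str = "File",           # assumed column name
    pmcid_column: str = "Accession ID",  # assumed column name
) -> Dict[str, str]:
    """Map each PMCID to the full URL of its package tarball."""
    index: Dict[str, str] = {}
    for row in csv.DictReader(csv_lines):
        # The path column carries the digit prefixes, e.g. "oa_package/08/03/...".
        index[row[pmcid_column]] = PMC_FTP_BASE + row[path_column]
    return index


# Tiny in-memory stand-in for a file list, using the path from the example.
sample = io.StringIO(
    "File,Accession ID\n"
    "oa_package/08/03/PMC3767916.tar.gz,PMC3767916\n"
)
index = build_pmc_package_index(sample)
print(index["PMC3767916"])
```

A dictionary keeps the lookup simple; for a full file list, a database table or on-disk key/value store may be a better fit.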