File Ingest Mode: 'component'
=============================

A new ingest type for downloading individual files which are a subset of a
complete work.

Some publishers now assign DOIs to individual figures, supplements, and other
"components" of an overall release or document.

Initial mimetypes to allow:

- image/jpeg
- image/tiff
- image/png
- image/gif
- audio/mpeg
- video/mp4
- video/mpeg
- text/plain
- text/csv
- application/json
- application/xml
- application/pdf
- application/gzip
- application/x-bzip
- application/x-bzip2
- application/zip
- application/x-rar
- application/x-7z-compressed
- application/x-tar
- application/vnd.ms-powerpoint
- application/vnd.ms-excel
- application/msword
- application/vnd.openxmlformats-officedocument.wordprocessingml.document
- application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

Intentionally not supporting:

- text/html


## Fatcat Changes

In the file importer, allow the additional mimetypes for 'component' ingest.


## Ingest Changes

Allow additional terminal mimetypes for 'component' crawls.


## Examples

Hundreds of thousands:

#### ACS Supplement File

Redirects directly to .zip in browser. SPN is blocked by cookie check.

#### Frontiers .docx Supplement

Redirects to full article page. There is a pop-up for figshare, which seems
hard to process.

#### Figshare Single File

As 'component' type in fatcat.

Redirects to a landing page. Dataset ingest seems more appropriate for this
entire domain.

#### PeerJ Supplement File

PeerJ is hard because it redirects to a single HTML page, which has links to
supplements in the HTML. Perhaps a custom extractor will work.

#### eLife

The current crawl mechanism makes it seemingly impossible to extract a specific
supplement from the document as a whole.

#### Zookeys

These are extractable.

#### OECD PDF Supplement

Has an Excel (.xls) link, which is great, but it then hits a paywall.

#### Direct File Link

This one is also OECD, but is a simple direct download.

#### Protein Data Bank (PDB) Entry

Multiple files; dataset/fileset ingest is more appropriate for these.
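
## Mimetype Allowlist Sketch

The allowlist described under "Initial mimetypes to allow" could be checked along the following lines. This is only a minimal sketch, not the actual importer or ingest code: the names `COMPONENT_MIMETYPES` and `is_allowed_component_mimetype` are hypothetical, and stripping parameters such as `; charset=utf-8` before comparing is an assumption about how mimetypes arrive.

```python
# Hypothetical allowlist for 'component' ingest, copied from the list in this
# proposal. text/html is intentionally absent.
COMPONENT_MIMETYPES = frozenset([
    "image/jpeg",
    "image/tiff",
    "image/png",
    "image/gif",
    "audio/mpeg",
    "video/mp4",
    "video/mpeg",
    "text/plain",
    "text/csv",
    "application/json",
    "application/xml",
    "application/pdf",
    "application/gzip",
    "application/x-bzip",
    "application/x-bzip2",
    "application/zip",
    "application/x-rar",
    "application/x-7z-compressed",
    "application/x-tar",
    "application/vnd.ms-powerpoint",
    "application/vnd.ms-excel",
    "application/msword",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
])


def is_allowed_component_mimetype(mimetype: str) -> bool:
    """Return True if `mimetype` is on the 'component' allowlist.

    Assumption: any parameters (e.g. "; charset=utf-8") are ignored and the
    comparison is case-insensitive.
    """
    base = mimetype.split(";", 1)[0].strip().lower()
    return base in COMPONENT_MIMETYPES
```

A terminal response with mimetype `application/zip` would pass this check, while an HTML landing page (`text/html`) would not, which matches the intent of rejecting landing pages for 'component' crawls.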