diff options
Diffstat (limited to 'proposals/2021-09-13_src_ingest.md')
-rw-r--r-- | proposals/2021-09-13_src_ingest.md | 53 |
1 files changed, 53 insertions, 0 deletions
diff --git a/proposals/2021-09-13_src_ingest.md b/proposals/2021-09-13_src_ingest.md new file mode 100644 index 0000000..470827a --- /dev/null +++ b/proposals/2021-09-13_src_ingest.md @@ -0,0 +1,53 @@ + +File Ingest Mode: 'src' +======================= + +Ingest type for "source" of works in document form. For example, tarballs of +LaTeX source and figures, as published on arxiv.org and Pubmed Central. + +For now, presumption is that this would be a single file (`file` entity in +fatcat). + +Initial mimetypes to allow: + +- text/x-tex +- application/xml +- application/gzip +- application/x-bzip +- application/x-bzip2 +- application/zip +- application/x-tar +- application/msword +- application/vnd.openxmlformats-officedocument.wordprocessingml.document + + +## Fatcat Changes + +In the file importer, allow the additional mimetypes for 'src' ingest. + +Might keep ingest disabled on the fatcat side, at least initially. Eg, until +there is some scope of "file scope", or other ways of treating 'src' tarballs +separate from PDFs or other fulltext formats. + + +## Ingest Changes + +Allow additional terminal mimetypes for 'src' crawls. + + +## Examples + + arxiv:2109.00954v1 + fatcat:release_akzp2lgqjbcbhpoeoitsj5k5hy + https://arxiv.org/format/2109.00954v1 + https://arxiv.org/e-print/2109.00954v1 + + arxiv:1912.03397v2 + https://arxiv.org/format/1912.03397v2 + https://arxiv.org/e-print/1912.03397v2 + NOT: https://arxiv.org/pdf/1912.03397v2 + + pmcid:PMC3767916 + https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/08/03/PMC3767916.tar.gz + +For PMC, will need to use one of the .csv file lists to get the digit prefixes. |