author Bryan Newbold <bnewbold@archive.org> 2021-09-13 19:33:08 -0700
committer Bryan Newbold <bnewbold@archive.org> 2021-10-04 13:02:08 -0700
commit 5495d2ba4c92cf3ea3f1c31efe9ca670f6900047
tree 3c8c37cdbd03fb31aabce0397d894baa81fcd899 /proposals/2021-09-13_src_ingest.md
parent a613cf2fa66e59c412b9de15e487ab5d3431bb51
ingest: basic 'component' and 'src' support
Diffstat (limited to 'proposals/2021-09-13_src_ingest.md')
-rw-r--r-- proposals/2021-09-13_src_ingest.md | 53
1 files changed, 53 insertions, 0 deletions
diff --git a/proposals/2021-09-13_src_ingest.md b/proposals/2021-09-13_src_ingest.md
new file mode 100644
index 0000000..470827a
--- /dev/null
+++ b/proposals/2021-09-13_src_ingest.md
@@ -0,0 +1,53 @@
+
+File Ingest Mode: 'src'
+=======================
+
+Ingest type for "source" of works in document form. For example, tarballs of
+LaTeX source and figures, as published on arxiv.org and Pubmed Central.
+
+For now, the presumption is that this would be a single file (`file` entity in
+fatcat).
+
+Initial mimetypes to allow:
+
+- text/x-tex
+- application/xml
+- application/gzip
+- application/x-bzip
+- application/x-bzip2
+- application/zip
+- application/x-tar
+- application/msword
+- application/vnd.openxmlformats-officedocument.wordprocessingml.document
+
+
+## Fatcat Changes
+
+In the file importer, allow the additional mimetypes for 'src' ingest.
+
+Ingest might stay disabled on the fatcat side, at least initially: eg, until
+there is some notion of "file scope", or another way of treating 'src' tarballs
+separately from PDFs or other fulltext formats.
+
+
+## Ingest Changes
+
+Allow additional terminal mimetypes for 'src' crawls.
+
+
+## Examples
+
+ arxiv:2109.00954v1
+ fatcat:release_akzp2lgqjbcbhpoeoitsj5k5hy
+ https://arxiv.org/format/2109.00954v1
+ https://arxiv.org/e-print/2109.00954v1
+
+ arxiv:1912.03397v2
+ https://arxiv.org/format/1912.03397v2
+ https://arxiv.org/e-print/1912.03397v2
+ NOT: https://arxiv.org/pdf/1912.03397v2
+
+ pmcid:PMC3767916
+ https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/08/03/PMC3767916.tar.gz
+
+For PMC, we will need one of the .csv file lists to look up the digit prefixes (eg, `08/03`).
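
Review note: the mimetype allowlist and the example URLs in the proposal can be sketched roughly as below. This is only an illustration under stated assumptions, not sandcrawler's or fatcat's actual code. The function names (`is_allowed_src_mimetype`, `arxiv_src_url`, `load_pmc_paths`, `pmc_src_url`) are hypothetical, and the CSV column names `File` and `Accession ID` are an assumption about the PMC file list that should be checked against the header of the real .csv file.

```python
import csv
import io
from typing import Dict, Optional

# Mimetypes the proposal lists as initially allowed for 'src' ingest.
SRC_MIMETYPES = frozenset([
    "text/x-tex",
    "application/xml",
    "application/gzip",
    "application/x-bzip",
    "application/x-bzip2",
    "application/zip",
    "application/x-tar",
    "application/msword",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
])

# Base URL taken from the PMC example in the proposal.
PMC_FTP_BASE = "https://ftp.ncbi.nlm.nih.gov/pub/pmc/"


def is_allowed_src_mimetype(mimetype: str) -> bool:
    """Return True if a terminal mimetype is on the proposed 'src' allowlist."""
    # Drop any parameters (eg, "; charset=binary") and normalize case.
    base = mimetype.split(";", 1)[0].strip().lower()
    return base in SRC_MIMETYPES


def arxiv_src_url(arxiv_id: str) -> str:
    """Build the e-print (source) URL for a versioned arXiv id."""
    return f"https://arxiv.org/e-print/{arxiv_id}"


def load_pmc_paths(csv_text: str) -> Dict[str, str]:
    """Map PMCID to its relative package path, which carries the prefixes.

    ASSUMPTION: the file list has 'File' and 'Accession ID' columns.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["Accession ID"]: row["File"] for row in reader}


def pmc_src_url(pmcid: str, paths: Dict[str, str]) -> Optional[str]:
    """Return the package URL for a PMCID, or None if it is not in the list."""
    rel_path = paths.get(pmcid)
    return PMC_FTP_BASE + rel_path if rel_path else None
```

Keeping the PMC lookup as a plain dict built once from the file list avoids guessing the prefix directories from the PMCID itself, which the proposal indicates is not possible without the .csv.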