path: root/proposals/2021-09-13_src_ingest.md

File Ingest Mode: 'src'
=======================

Ingest type for the "source" of works in document form. For example, tarballs
of LaTeX source and figures, as published on arxiv.org and PubMed Central.

For now, the presumption is that each source would be a single file (`file`
entity in fatcat).

Initial mimetypes to allow:

- text/x-tex
- application/xml
- application/gzip
- application/x-bzip
- application/x-bzip2
- application/zip
- application/x-tar
- application/msword
- application/vnd.openxmlformats-officedocument.wordprocessingml.document
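
As a minimal sketch of how the allowlist above might be checked, the following
helper normalizes a reported mimetype and tests it against the set. The names
`SRC_MIMETYPES` and `is_allowed_src_mimetype` are hypothetical, for
illustration only, and are not taken from the fatcat or ingest codebases.

```python
# Allowlist copied from the mimetype list above.
SRC_MIMETYPES = frozenset({
    "text/x-tex",
    "application/xml",
    "application/gzip",
    "application/x-bzip",
    "application/x-bzip2",
    "application/zip",
    "application/x-tar",
    "application/msword",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
})


def is_allowed_src_mimetype(mimetype: str) -> bool:
    """Return True if `mimetype` is acceptable for a 'src' ingest file.

    Hypothetical helper: strips any parameters (eg, "; charset=utf-8")
    and lowercases the value before the lookup.
    """
    base = mimetype.split(";", 1)[0].strip().lower()
    return base in SRC_MIMETYPES
```

With this sketch, `application/pdf` would be rejected, which matches the intent
of keeping 'src' files separate from PDF fulltext.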


## Fatcat Changes

In the file importer, allow the additional mimetypes for 'src' ingest.

Might keep ingest disabled on the fatcat side, at least initially. Eg, until
there is some notion of "file scope", or other ways of treating 'src' tarballs
separately from PDFs or other fulltext formats.


## Ingest Changes

Allow additional terminal mimetypes for 'src' crawls.


## Examples

    arxiv:2109.00954v1
    fatcat:release_akzp2lgqjbcbhpoeoitsj5k5hy
    https://arxiv.org/format/2109.00954v1
    https://arxiv.org/e-print/2109.00954v1

    arxiv:1912.03397v2
    https://arxiv.org/format/1912.03397v2
    https://arxiv.org/e-print/1912.03397v2
    NOT: https://arxiv.org/pdf/1912.03397v2

    pmcid:PMC3767916
    https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/08/03/PMC3767916.tar.gz

For PMC, the package URL cannot be derived from the PMCID alone: will need to
use one of the .csv file lists to look up the directory prefixes (eg, `08/03`
in the URL above).
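
A rough sketch of the URL construction for the examples above. The arxiv
pattern follows the `e-print` URLs listed. For PMC, the sketch assumes the
.csv file list has a header row with `File` and `Accession ID` columns, where
`File` holds a relative path like `oa_package/08/03/PMC3767916.tar.gz`; that
layout is an assumption and should be checked against the actual list. All
function names here are hypothetical.

```python
import csv
from typing import Dict, Iterable

# Base URL taken from the PMC example above.
PMC_FTP_BASE = "https://ftp.ncbi.nlm.nih.gov/pub/pmc/"


def arxiv_src_url(arxiv_id: str) -> str:
    """Build the source ('e-print') URL for a versioned arxiv identifier."""
    return f"https://arxiv.org/e-print/{arxiv_id}"


def load_pmc_package_paths(lines: Iterable[str]) -> Dict[str, str]:
    """Map PMCID -> relative package path from a .csv file list.

    Assumes 'File' and 'Accession ID' columns (not confirmed here).
    """
    paths = {}
    for row in csv.DictReader(lines):
        paths[row["Accession ID"]] = row["File"]
    return paths


def pmc_src_url(pmcid: str, paths: Dict[str, str]) -> str:
    """Look up the prefixed package path for a PMCID and build the full URL."""
    return PMC_FTP_BASE + paths[pmcid]


# Illustrative two-column input using the PMC example from this document.
sample_list = [
    "File,Accession ID",
    "oa_package/08/03/PMC3767916.tar.gz,PMC3767916",
]
pmc_paths = load_pmc_package_paths(sample_list)
```

In practice the list would be read from a downloaded file (eg, by passing an
open file handle as `lines`) rather than from an in-memory sample.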