proposals/2020-05-11_overview.md




There can be multiple releases for each work:

- required: most canonical published version ("version of record", what would be cited)
    => or, most updated?
- optional: most openly accessible version
- optional: updated version
    => errata, corrected version, or retraction
- optional: fulltext indexed version
    => might not be the most updated, or not accessible
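
A rough sketch of how these roles could be picked for one work. The `Release` fields (`stage`, `year`, `is_open`, `has_fulltext`) and the stage names are assumptions made for illustration, not the fatcat schema, and the tie-breaking rules (newest first, prefer the canonical release when it qualifies) are just one possible choice.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical stage names that count as an "updated version"
UPDATE_STAGES = {"updated", "retracted"}


@dataclass
class Release:
    ident: str
    stage: str              # hypothetical, e.g. "published", "submitted", "updated"
    year: int
    is_open: bool = False   # an openly accessible copy exists
    has_fulltext: bool = False  # fulltext is available for indexing


def pick_releases(releases: List[Release]) -> Dict[str, Optional[Release]]:
    """Pick one release per role; only "canonical" is guaranteed to be set."""
    if not releases:
        raise ValueError("a work needs at least one release")
    newest_first = sorted(releases, key=lambda r: r.year, reverse=True)
    published = [r for r in newest_first if r.stage == "published"]
    # Fall back to the newest release if nothing is marked as published
    canonical = published[0] if published else newest_first[0]

    def first(pred) -> Optional[Release]:
        # Prefer the canonical release when it qualifies, else the newest match
        if pred(canonical):
            return canonical
        return next((r for r in newest_first if pred(r)), None)

    return {
        "canonical": canonical,
        "open": first(lambda r: r.is_open),
        "updated": next(
            (r for r in newest_first if r.stage in UPDATE_STAGES), None
        ),
        "fulltext": first(lambda r: r.has_fulltext),
    }
```

Note that the "fulltext" pick can differ from both "canonical" and "updated", which matches the caveat above that the indexed version might not be the most updated one.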


## Initial Plan

Index all fatcat works in the catalog.

Always link to a born-digital copy if one is accessible.

Always link to a SIM microfilm copy if one is available.

Use the best available fulltext for search. If structured, like TEI-XML, index
the body text separately from abstracts and references.
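
As a minimal sketch of the separate-fields idea, the function below splits a TEI-XML document into abstract, body, and reference text using only the Python standard library. It assumes a layout with `teiHeader/profileDesc/abstract`, `text/body`, and `biblStruct` entries under `text/back`; the output field names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def _text(elem: Optional[ET.Element]) -> str:
    """Flatten an element to whitespace-normalized plain text."""
    if elem is None:
        return ""
    # Joining with spaces keeps adjacent blocks apart, at the cost of
    # sometimes splitting words that span inline markup.
    return " ".join(" ".join(elem.itertext()).split())


def tei_to_fields(tei_xml: str) -> Dict[str, object]:
    """Return abstract, body, and references as separate index fields."""
    root = ET.fromstring(tei_xml)
    abstract = root.find(".//tei:profileDesc/tei:abstract", TEI_NS)
    body = root.find(".//tei:text/tei:body", TEI_NS)
    refs = root.findall(".//tei:text/tei:back//tei:biblStruct", TEI_NS)
    return {
        "abstract": _text(abstract),
        "body": _text(body),
        "references": [_text(ref) for ref in refs],
    }
```

Keeping the three fields apart lets the search index weight or filter them independently, for example so that a term appearing only in a reference list does not rank the same as one in the body.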


## Other Ideas

Do fulltext indexing at the granularity of pages, or some other segments of
text within articles (paragraphs, chapters, sections).
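
One way this could look, assuming fulltext arrives as a list of per-page strings and that paragraphs are separated by blank lines. The `doc_id` format and field names are hypothetical.

```python
import re
from typing import Dict, Iterator, List


def segment_docs(work_id: str, pages: List[str]) -> Iterator[Dict[str, object]]:
    """Yield one search document per paragraph, tagged with its page number."""
    for page_num, page_text in enumerate(pages, start=1):
        # Treat one or more blank lines as a paragraph break
        chunks = (chunk.strip() for chunk in re.split(r"\n\s*\n", page_text))
        paragraphs = [chunk for chunk in chunks if chunk]
        for para_num, para in enumerate(paragraphs, start=1):
            yield {
                "doc_id": f"{work_id}/p{page_num}/{para_num}",
                "work_id": work_id,
                "page": page_num,
                # Collapse line wraps inside the paragraph
                "text": " ".join(para.split()),
            }
```

Each segment keeps the `work_id`, so hits can still be grouped back to the work level at query time.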

Fatcat already has all of Crossref, PubMed, arXiv, and several other
authoritative metadata sources. But today we are missing a good chunk of
content, particularly from institutional repositories and CS conferences (which
don't use identifiers). We also don't have good affiliation or citation count
coverage, and abstract coverage is mixed/poor.

Could use the Microsoft Academic Graph (MAG) metadata corpus (or similar) to
bootstrap better metadata coverage.
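
A very rough sketch of what gap-filling from an external corpus might look like, assuming both sides are plain dicts and that records can be joined on a DOI. The field names are invented for illustration and do not reflect the MAG or fatcat schemas. Content without identifiers would need some fuzzy matching step that is not shown here.

```python
from typing import Dict, Iterable, List, Tuple

EMPTY = (None, "", [])


def fill_gaps(
    primary: Dict,
    secondary: Dict,
    fields: Tuple[str, ...] = ("abstract", "affiliations", "citation_count"),
) -> Dict:
    """Copy fields from secondary only where primary has no value."""
    merged = dict(primary)
    for name in fields:
        if merged.get(name) in EMPTY and secondary.get(name) not in EMPTY:
            merged[name] = secondary[name]
    return merged


def bootstrap(catalog: Iterable[Dict], external: Iterable[Dict]) -> List[Dict]:
    """Enrich catalog records with external metadata matched by DOI."""
    by_doi = {rec["doi"].lower(): rec for rec in external if rec.get("doi")}
    enriched = []
    for rec in catalog:
        doi = (rec.get("doi") or "").lower()
        match = by_doi.get(doi) if doi else None
        enriched.append(fill_gaps(rec, match) if match else rec)
    return enriched
```

Existing catalog values always win in this sketch, so the external corpus only ever adds coverage and never overwrites authoritative metadata.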