Title: Bibliographic Metadata Dumps
Author: bnewbold
Date: 2017-06-07
Tags: tech, archive, scholar
Status: draft

# TODO:
# - does BASE link to fulltext PDFs? is that helpful?
# - can we actually get academia.edu and researchgate.net papers? maybe?

I've recently been lucky enough to start working on a new big project at the
[Internet Archive][]: collecting, indexing, and expanding access to research
publications and datasets in the open world. This is perhaps *the* original
goal of networked information technology, and thanks to a decade of hard
work by the Open Access movement, it feels like momentum
[is building][nature-elsevier] towards this one small piece of "universal
access to all knowledge".

[Internet Archive]: https://archive.org
[nature-elsevier]: http://www.nature.com/news/scientists-in-germany-peru-and-taiwan-to-lose-access-to-elsevier-journals-1.21223

<div class="sidebar">
<img src="/static/fig/ia_logo.png" width="150px" alt="internet archive logo" />
</div>

This is a snapshot-in-time look at "what's already out there" regarding indexes
of scholarly papers and books (aka, "things that get cited"). There are a ton
of resources out there, and many of them are just re-aggregating or building on
top of each other.

Here's a table of index-only resources for papers. These are databases or
corpuses of metadata that might include links/URLs to full text, but don't seem
to host fulltext copies themselves:

<table>
 <tr>
   <th>What
   <th>Record Count (millions)
   <th>Notes
 <tr>
   <td>Total digital English language papers
   <td>114
   <td>estimated[0], 2014
 <tr>
   <td>Total open access
   <td>27
   <td>estimated[0], 2014. Meaning "available somewhere"? MS academic had 35
       million.
 <tr>
   <td>Number of DOIs
   <td>143
   <td>Global; includes non-journals. 
 <tr>
   <td>CrossRef DOIs
   <td>88
   <td>Primary registrar for journals/papers in the Western world
 <tr>
   <td>BASE Search
   <td>109
   <td>Data from OAI-PMH
 <tr>
   <td>Google Scholar
   <td>100
   <td>"records", not URLs
 <tr>
   <td>Web of Science
   <td>90
   <td>proprietary; 1 billion citation graph
 <tr>
   <td>Scopus
   <td>55
   <td>proprietary/Elsevier
 <tr>
   <td>PubMed
   <td>26
   <td>Only half (13mil) have abstract or link to fulltext
 <tr>
   <td>CORE
   <td>24
   <td>
 <tr>
   <td>Semantic Scholar
   <td>10 to 20
   <td>Sometimes mirror fulltext?
 <tr>
   <td>OpenCitations
   <td>5
   <td>Paper entries; Spring 2017
 <tr>
   <td>dblp
   <td>3.7
   <td>computer science bibliography; Spring 2017
</table>

A big open question to me is how many pre-digital scholarly articles there are
which have not been digitized or assigned DOI numbers. Eg, how good is JSTOR
coverage? I'm unsure how to even compute this number.

And here are full-text collections of papers (which also include metadata):

<table>
 <tr>
   <th>What
   <th>Fulltext Count (millions)
   <th>Notes
 <tr>
   <td>Sci-Hub/scimag
   <td>62
   <td>one-file-per-DOI, 2017
 <tr>
   <td>CiteSeerX
   <td>6
   <td>(2010; presumably many more now?). Crawled from the web
 <tr>
   <td>CORE
   <td>4
   <td>Extracted fulltext, not PDF? Complete "gold" OA?
 <tr>
   <td>PubMed Central
   <td>4
   <td>Open Access. 2017
 <tr>
   <td>OSF Preprints (COS)
   <td>2
   <td>2017
 <tr>
   <td>Internet Archive
   <td>1.5
   <td>"Clean" mirrored items in Journal collections; we probably have far more
 <tr>
   <td>arxiv.org
   <td>1.2
   <td>physics+math; articles, not files; 2017
 <tr>
   <td>JSTOR Total
   <td>10
   <td>mostly locked down. includes books, grey lit
 <tr>
   <td>JSTOR Early Articles
   <td>0.5
   <td>open access subset
 <tr>
   <td>biorxiv.org
   <td>0.01
   <td>2017
</table>

Numbers aside, here are the useful resources to build on top of:

**CrossRef** is the primary **DOI** registrar in the Western (English-speaking)
world. They are a non-profit, one of only a dozen or so DOI registrars; almost
all scholarly publishers go through them. They provide some basic metadata
(title, authors, publication), and have excellent data access: bulk datasets, a
query API, and a streaming update API. This is a good, authoritative foundation
for building indexes. China, Korea, and Japan have their own DOI registries,
and published datasets end up in DataCite instead of CrossRef. Other holes in
DOI coverage are "grey literature" (unpublished or informally published
documents, like government reports or technical memos), documents pre-2000 with
absentee publishers, and books (only a small fraction of books/chapters have
DOIs).
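
As a rough sketch of what working with this metadata looks like, here's how one
might pull basic fields out of a CrossRef "works" record. The API serves JSON at
`https://api.crossref.org/works/<doi>`; the response below is a trimmed,
hypothetical example of the real shape, so field contents here are made up:

```python
import json

# Sketch of extracting basic fields from a CrossRef "works" record.
# The JSON below is a trimmed, hypothetical example of the response
# shape served at https://api.crossref.org/works/<doi>.
SAMPLE_RESPONSE = """
{
  "status": "ok",
  "message": {
    "DOI": "10.1000/example.doi",
    "title": ["An Example Paper"],
    "author": [{"given": "Ada", "family": "Lovelace"}],
    "container-title": ["Journal of Examples"]
  }
}
"""

def parse_crossref_work(raw):
    """Pull title, authors, and venue out of a CrossRef works response."""
    msg = json.loads(raw)["message"]
    return {
        "doi": msg.get("DOI"),
        "title": (msg.get("title") or [None])[0],
        "authors": [" ".join(filter(None, (a.get("given"), a.get("family"))))
                    for a in msg.get("author", [])],
        "journal": (msg.get("container-title") or [None])[0],
    }

record = parse_crossref_work(SAMPLE_RESPONSE)
print(record["title"])  # An Example Paper
```

A real harvester would page through the bulk datasets or the query API rather
than fetching DOIs one at a time.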

Publishers and repositories seem to be pretty good about providing **OAI-PMH**
API access to their metadata and records (and sometimes fulltext). Directories
make it possible to look up thousands of API endpoints. **BASE** seems to be
the best aggregation of all this metadata, and some projects build on top of
BASE (eg, oaDOI). **CORE** finds all of its fulltext this way. It's not
clear if BASE is a good place to pull bulk metadata from; they seem to re-index
from scratch occasionally. **oaDOI** and **dissem.in** are services that
provide an API and search interface over metadata and point to Open Access
copies of the results.
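
For a flavor of the protocol, here's a minimal sketch of parsing one page of an
OAI-PMH `ListRecords` response carrying Dublin Core (`oai_dc`) metadata. The
record below is made up; a real harvester would fetch pages over HTTP and
follow `resumptionToken` until exhausted:

```python
import xml.etree.ElementTree as ET

# Sketch of parsing one page of an OAI-PMH ListRecords response with
# Dublin Core (oai_dc) metadata. Namespaces are from the OAI-PMH spec;
# the record itself is hypothetical.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"

LISTRECORDS_XML = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
 <ListRecords>
  <record>
   <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
     <dc:title>An Example Preprint</dc:title>
     <dc:creator>Grace Hopper</dc:creator>
    </oai_dc:dc>
   </metadata>
  </record>
 </ListRecords>
</OAI-PMH>
"""

def list_titles(xml_text):
    """Collect dc:title values from every record in a ListRecords page."""
    root = ET.fromstring(xml_text)
    return [t.text
            for rec in root.iter(OAI_NS + "record")
            for t in rec.iter(DC_NS + "title")]

print(list_titles(LISTRECORDS_XML))  # ['An Example Preprint']
```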

**PubMed** (index) and **PubMed Central** (fulltext) are large and well
maintained. There are PubMed records and identifiers ("PMID") going far back in
history, though only for medical texts (coverage outside of medicine/biology is
increasing, but only very recently). Annual and daily database dumps are
available, making this a good resource to pull from.
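
As a sketch of what processing those dumps might involve, here's a pass over a
trimmed, hypothetical record in the bulk XML format, flagging whether each
article carries an abstract:

```python
import xml.etree.ElementTree as ET

# Sketch of scanning records from the PubMed bulk XML dumps and flagging
# which ones carry an abstract (only about half of PMIDs do). The record
# below is a trimmed, hypothetical example of the dump format.
PUBMED_XML = """
<PubmedArticleSet>
 <PubmedArticle>
  <MedlineCitation>
   <PMID>12345678</PMID>
   <Article>
    <ArticleTitle>An Example Trial</ArticleTitle>
    <Abstract><AbstractText>Some findings.</AbstractText></Abstract>
   </Article>
  </MedlineCitation>
 </PubmedArticle>
</PubmedArticleSet>
"""

def summarize(xml_text):
    """Return (pmid, title, has_abstract) for each article in a dump."""
    root = ET.fromstring(xml_text)
    return [(art.findtext(".//PMID"),
             art.findtext(".//ArticleTitle"),
             art.find(".//Abstract") is not None)
            for art in root.iter("PubmedArticle")]

print(summarize(PUBMED_XML))  # [('12345678', 'An Example Trial', True)]
```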

**CiteSeerX** has been crawling the web for PDFs for a long time. Other than
**Google Scholar** and maybe the **Internet Archive** I think they do the most
serious paper crawling, though many folks do smaller or one-off crawls. They
are academic/non-profit and are willing to share metadata and their collected
papers; their systems are documented and open-source. Metadata and citations
are extracted from the PDFs themselves. They have collaborated with Microsoft
Research and the Allen Institute; I suspect they provided most or all of the
content for **Semantic Scholar** and **Microsoft Academic Knowledge** (the
latter now defunct). NB: there are some interesting per-domain crawl statistics
[available](http://csxcrawlweb01.ist.psu.edu//), though half-broken.

It's worth noting that there is probably a lot of redundancy between
**pre-prints** and the final published papers, even though semantically most
people would consider them versions or editions of the same paper, not totally
distinct works. This might inflate both the record counts and the DOI counts.
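
Detecting that kind of redundancy usually comes down to fuzzy matching on
metadata. Here's a minimal, hypothetical sketch assuming title normalization is
good enough as a grouping key; a production system would also compare authors,
years, and identifiers before merging records:

```python
import re
import unicodedata

# Hypothetical sketch of collapsing preprint/published duplicates by
# normalizing titles aggressively and grouping on the result.
def normalize_title(title):
    """Reduce a title to a lowercase alphanumeric grouping key."""
    t = unicodedata.normalize("NFKD", title).lower()
    t = re.sub(r"[^a-z0-9 ]", "", t)      # strip punctuation/accents
    return re.sub(r" +", " ", t).strip()  # collapse runs of spaces

def group_versions(titles):
    """Bucket titles that normalize to the same key."""
    groups = {}
    for t in titles:
        groups.setdefault(normalize_title(t), []).append(t)
    return groups

g = group_versions(["Deep Learning: A Survey", "Deep learning -- a survey"])
print(len(g))  # 1
```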

A large number of other resources are not listed because they are very
subject-specific or relatively small. They may or may not be worth pursuing,
depending on how redundant they are with the larger resources. Eg, CogPrints
(cognitive science, ~thousands of fulltext items), MathSciNet (proprietary math
bibliography), ERIC (educational resources and grey lit), paperity.org (similar
to CORE), etc.

*Note: We don't do a very good job promoting it, but as of June 2017 The
Internet Archive is hiring! In particular we're looking for an all-around web
designer and a project manager for an existing 5 person python-web-app team.
Check out those and more on our
[jobs page](https://archive.org/about/jobs.php)*

[0]: "The Number of Scholarly Documents on the Public Web", PLoS One, 2014,
Khabsa and Giles. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0093949