Title: Bibliographic Metadata Dumps
Author: bnewbold
Date: 2017-06-07
Tags: tech, archive, scholar
Status: draft
# TODO:
# - does BASE link to fulltext PDFs? is that helpful?
# - can we actually get academia.edu and researchgate.net papers? maybe?
I've recently been lucky enough to start working on a new big project at the
[Internet Archive][]: collecting, indexing, and expanding access to research
publications and datasets in the open world. This is perhaps *the* original
goal of networked information technology, and thanks to a decade of hard
work by the Open Access movement, it feels like momentum
[is building][nature-elsevier] towards this one small piece of "universal
access to all knowledge".
[Internet Archive]: https://archive.org
[nature-elsevier]: http://www.nature.com/news/scientists-in-germany-peru-and-taiwan-to-lose-access-to-elsevier-journals-1.21223
This is a snapshot-in-time look at "what's already out there" regarding indexes
of scholarly papers and books (aka, "things that get cited"). There are a ton
of resources out there, and many of them are just re-aggregating or building on
top of each other.
Here's a table of index-only resources for papers. These are databases or
corpuses of metadata that might include links/URLs to full text, but don't seem
to host fulltext copies themselves:
| What | Record Count (millions) | Notes |
|------|-------------------------|-------|
| Total digital English-language papers | 114 | estimated[0], 2014 |
| Total open access | 27 | estimated[0], 2014. Meaning "available somewhere"? MS Academic had 35 million. |
| Number of DOIs | 143 | global; includes non-journals |
| CrossRef DOIs | 88 | primary registrar for journals/papers in the western world |
| BASE Search | 109 | data from OAI-PMH |
| Google Scholar | 100 | "records", not URLs |
| Web of Science | 90 | proprietary; 1 billion citation graph |
| Scopus | 55 | proprietary/Elsevier |
| PubMed | 26 | only half (13 million) have an abstract or link to fulltext |
| CORE | 24 | |
| Semantic Scholar | 10 to 20 | sometimes mirrors fulltext? |
| OpenCitations | 5 | paper entries; Spring 2017 |
| dblp | 3.7 | computer science bibliography; Spring 2017 |
A big open question to me is how many pre-digital scholarly articles there are
which have not been digitized or assigned DOI numbers. Eg, how good is JSTOR
coverage? I'm unsure how to even compute this number.
And here are full-text collections of papers (which also include metadata):
| What | Fulltext Count (millions) | Notes |
|------|---------------------------|-------|
| Sci-Hub/scimag | 62 | one file per DOI; 2017 |
| CiteSeerX | 6 | 2010; presumably many more now? Crawled from the web |
| CORE | 4 | extracted fulltext, not PDF? Complete "gold" OA? |
| PubMed Central | 4 | Open Access; 2017 |
| OSF Preprints (COS) | 2 | 2017 |
| Internet Archive | 1.5 | "clean" mirrored items in journal collections; we probably have far more |
| arxiv.org | 1.2 | physics + math; articles, not files; 2017 |
| JSTOR Total | 10 | mostly locked down; includes books, grey lit |
| JSTOR Early Articles | 0.5 | open access subset |
| biorxiv.org | 0.01 | 2017 |
Numbers aside, here are the useful resources to build on top of:
**CrossRef** is the primary **DOI** registrar in the western
(English-speaking) world. They are a non-profit, one of only a dozen or so DOI
registrars; almost
all scholarly publishers go through them. They provide some basic metadata
(title, authors, publication), and have excellent data access: bulk datasets, a
query API, and a streaming update API. This is a good, authoritative foundation
for building indexes. China, Korea, and Japan have their own DOI registries,
and published datasets end up in DataCite instead of CrossRef. Other holes in
DOI coverage are "grey literature" (unpublished or informally published
documents, like government reports or technical memos), documents pre-2000 with
absentee publishers, and books (only a small fraction of books/chapters have
DOIs).
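As a sketch of what that data access looks like: the public CrossRef REST API serves per-DOI records as JSON. The code below assumes the `api.crossref.org/works` endpoint and an abbreviated, hand-written sample response (the real records carry many more fields), so treat it as illustrative rather than a complete client:

```python
import json
from urllib.parse import quote

CROSSREF_API = "https://api.crossref.org/works"

def work_url(doi):
    """Build the CrossRef REST API URL for a single DOI record."""
    return f"{CROSSREF_API}/{quote(doi)}"

def parse_work(response_text):
    """Pull basic metadata out of a CrossRef /works response body."""
    msg = json.loads(response_text)["message"]
    return {
        "title": (msg.get("title") or [None])[0],
        "authors": [f"{a.get('given', '')} {a.get('family', '')}".strip()
                    for a in msg.get("author", [])],
        "container": (msg.get("container-title") or [None])[0],
    }

# Abbreviated sample response, in the shape the API returns:
sample = json.dumps({"message": {
    "title": ["The Number of Scholarly Documents on the Public Web"],
    "author": [{"given": "Madian", "family": "Khabsa"},
               {"given": "C. Lee", "family": "Giles"}],
    "container-title": ["PLoS ONE"],
}})

print(work_url("10.1371/journal.pone.0093949"))
print(parse_work(sample))
```

For bulk work you would use their dataset dumps or cursor-based paging rather than one request per DOI.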
Publishers and repositories seem to be pretty good about providing **OAI-PMH**
API access to their metadata and records (and sometimes fulltext). Directories
make it possible to look up thousands of API endpoints. **BASE** seems to be
the best aggregation of all this metadata, and some projects build on top of
BASE (eg, oaDOI). **CORE** finds all of its fulltext this way. It's not
clear if BASE is a good place to pull bulk metadata from; they seem to re-index
from scratch occasionally. **oaDOI** and **dissem.in** are services that
provide an API and search interface over metadata and point to Open Access
copies of the results.
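The protocol itself is just HTTP plus XML, with `resumptionToken` paging for large result sets. A minimal harvester sketch, assuming a standard Dublin Core (`oai_dc`) ListRecords response (the sample document below is hand-written, not from any particular endpoint):

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def harvest_page(xml_text):
    """Parse one page of an OAI-PMH ListRecords response. Returns
    (records, resumption_token); pass the token back in the next
    request's resumptionToken= parameter to fetch the next page."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "id": rec.findtext(".//oai:identifier", namespaces=NS),
            "title": rec.findtext(".//dc:title", namespaces=NS),
        })
    token = root.findtext(".//oai:resumptionToken", namespaces=NS)
    return records, token

sample = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1234</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Paper</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page2token</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token = harvest_page(sample)
print(records, token)
```

An empty or missing `resumptionToken` signals the last page of the result set.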
**PubMed** (index) and **PubMed Central** (fulltext) are large and well
maintained. There are PubMed records and identifiers ("PMID") going far back in
history, though only for medical texts (there is increasing contemporary
coverage outside of medicine/biology, but only very recently). Annual and daily
database dumps are available, so it's a good resource to pull from.
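The bulk dumps are the right tool for indexing, but for one-off lookups records can also be pulled through NCBI's E-utilities. A minimal URL-builder sketch, assuming the standard `efetch.fcgi` endpoint (the PMID here is a placeholder, not a real citation):

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def efetch_url(pmids, db="pubmed", retmode="xml"):
    """Build an NCBI E-utilities efetch URL for a batch of PMIDs."""
    params = urlencode({
        "db": db,
        "id": ",".join(str(p) for p in pmids),
        "retmode": retmode,
    })
    return f"{EUTILS}/efetch.fcgi?{params}"

print(efetch_url([12345]))
```

NCBI asks that clients batch IDs and rate-limit requests; anything at corpus scale should start from the baseline dumps instead.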
**CiteSeerX** has been crawling the web for PDFs for a long time. Other than
**Google Scholar** and maybe the **Internet Archive** I think they do the most
serious paper crawling, though many folks do smaller or one-off crawls. They
are academic/non-profit and are willing to share metadata and their collected
papers; their systems are documented and open-source. Metadata and citations
are extracted from the PDFs themselves. They have collaborated with Microsoft
Research and the Allen Institute; I suspect they provided most or all content
for **Semantic Scholar** and **Microsoft Academic Knowledge** (the latter now
defunct). NB: there are some interesting per-domain crawl statistics
[available](http://csxcrawlweb01.ist.psu.edu//), though half-broken.
It's worth noting that there is probably a lot of redundancy between
**pre-prints** and the final published papers, even though semantically most
people would consider them versions or editions of the same paper, not totally
distinct works. This might inflate both the record counts and the DOI counts.
A large number of other resources are not listed because they are very
subject-specific or relatively small. They may or may not be worth pursuing,
depending on how redundant they are with the larger resources. Eg, CogPrints
(cognitive science, ~thousands of fulltext), MathSciNet (proprietary math
bibliography), ERIC (educational resources and grey lit), paperity.org (similar
to CORE), etc.
*Note: We don't do a very good job promoting it, but as of June 2017 the
Internet Archive is hiring! In particular we're looking for an all-around web
designer and a project manager for an existing 5 person python-web-app team.
Check out those and more on our
[jobs page](https://archive.org/about/jobs.php)*
[0]: "The Number of Scholarly Documents on the Public Web", PLoS One, 1994,
Khabsa and Giles. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0093949