Title: Bibliographic Metadata Dumps
Author: bnewbold
Date: 2017-06-07
Tags: tech, archive, scholar
Status: draft

# TODO:
# - does BASE link to fulltext PDFs? is that helpful?
# - can we actually get academia.edu and researchgate.net papers? maybe?

I've recently been lucky enough to start working on a new big project at the
[Internet Archive][]: collecting, indexing, and expanding access to research
publications and datasets in the open world. This is perhaps *the* original
goal of networked information technology, and thanks to a decade of hard work
by the Open Access movement it feels like momentum [is building][nature-elsevier]
towards this one small piece of "universal access to all knowledge".

[Internet Archive]: https://archive.org
[nature-elsevier]: http://www.nature.com/news/scientists-in-germany-peru-and-taiwan-to-lose-access-to-elsevier-journals-1.21223

This is a snapshot-in-time look at "what's already out there" regarding indexes of scholarly papers and books (aka, "things that get cited"). There are a ton of resources out there, and many of them are just re-aggregating or building on top of each other.

Here's a table of index-only resources for papers. These are databases or corpora of metadata that might include links/URLs to fulltext, but don't seem to host fulltext copies themselves:
| What | Record Count (millions) | Notes |
|------|-------------------------|-------|
| Total digital English-language papers | 114 | estimated[0], 2014 |
| Total open access | 27 | estimated[0], 2014. Meaning "available somewhere"? MS Academic had 35 million. |
| Number of DOIs | 143 | global; includes non-journal content |
| CrossRef DOIs | 88 | primary registrar for journals/papers in the western world |
| BASE Search | 109 | data from OAI-PMH |
| Google Scholar | 100 | "records", not URLs |
| Web of Science | 90 | proprietary; 1-billion-link citation graph |
| Scopus | 55 | proprietary (Elsevier) |
| PubMed | 26 | only half (13 million) have an abstract or link to fulltext |
| CORE | 24 | |
| Semantic Scholar | 10 to 20 | sometimes mirrors fulltext? |
| OpenCitations | 5 | paper entries; Spring 2017 |
| dblp | 3.7 | computer science bibliography; Spring 2017 |
A big open question for me is how many pre-digital scholarly articles exist that have never been digitized or assigned DOIs. E.g., how good is JSTOR's coverage? I'm not sure how to even estimate this number.

And here are fulltext collections of papers (which also include metadata):
| What | Fulltext Count (millions) | Notes |
|------|---------------------------|-------|
| Sci-Hub/scimag | 62 | one file per DOI; 2017 |
| CiteSeerX | 6 | 2010 figure; presumably many more now? Crawled from the web |
| CORE | 4 | extracted fulltext, not PDF? Complete "gold" OA? |
| PubMed Central | 4 | Open Access; 2017 |
| OSF Preprints (COS) | 2 | 2017 |
| Internet Archive | 1.5 | "clean" mirrored items in journal collections; we probably have far more |
| arxiv.org | 1.2 | physics+math; articles, not files; 2017 |
| JSTOR Total | 10 | mostly locked down; includes books, grey literature |
| JSTOR Early Articles | 0.5 | open access subset |
| biorxiv.org | 0.01 | 2017 |
Numbers aside, here are the useful resources to build on top of:

**CrossRef** is the primary **DOI** registrar in the western (English-speaking) world. They are a non-profit, one of only a dozen or so DOI registrars; almost all scholarly publishers register through them. They provide basic metadata (title, authors, publication), and have excellent data access: bulk datasets, a query API, and a streaming update API (a minimal query sketch is included below). This is a good, authoritative foundation for building indexes. China, Korea, and Japan have their own DOI registries, and published datasets end up in DataCite instead of CrossRef. Other holes in DOI coverage are "grey literature" (unpublished or informally published documents, like government reports or technical memos), pre-2000 documents with absentee publishers, and books (only a small fraction of books/chapters have DOIs).

Publishers and repositories seem to be pretty good about providing **OAI-PMH** API access to their metadata and records (and sometimes fulltext); a minimal harvesting sketch is also included below. Directories make it possible to look up thousands of API endpoints. **BASE** seems to be the best aggregation of all this metadata, and some projects build on top of BASE (eg, oaDOI). **CORE** finds all of its fulltext this way. It's not clear if BASE is a good place to pull bulk metadata from; they seem to re-index from scratch occasionally.

**oaDOI** and **dissem.in** are services that provide an API and search interface over metadata and point to Open Access copies of the results.

**PubMed** (index) and **PubMed Central** (fulltext) are large and well maintained. There are PubMed records and identifiers ("PMIDs") going far back in history, though only for medical texts (coverage outside of medicine/biology is increasing, but only very recently). Annual and daily database dumps are available, making it a good resource to pull from (see the parsing sketch below).

**CiteSeerX** has been crawling the web for PDFs for a long time. Other than **Google Scholar** and maybe the **Internet Archive**, I think they do the most serious paper crawling, though many folks do smaller or one-off crawls. They are academic/non-profit and are willing to share metadata and their collected papers; their systems are documented and open-source. Metadata and citations are extracted from the PDFs themselves. They have collaborated with Microsoft Research and the Allen Institute; I suspect they provided most or all content for **Semantic Scholar** and **Microsoft Academic Knowledge** (the latter now defunct). NB: there are some interesting per-domain crawl statistics [available](http://csxcrawlweb01.ist.psu.edu//), though half-broken.

It's worth noting that there is probably a lot of redundancy between **pre-prints** and the final published papers, even though semantically most people would consider them versions or editions of the same paper, not totally distinct works. This might inflate both the record counts and the DOI counts.

A large number of other resources are not listed because they are very subject-specific or relatively small. They may or may not be worth pursuing, depending on how redundant they are with the larger resources. Eg, CogPrints (cognitive science, ~thousands of fulltext documents), MathSciNet (proprietary math bibliography), ERIC (educational resources and grey literature), paperity.org (similar to CORE), etc.
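For illustration, here's a minimal sketch of looking up a single DOI through CrossRef's public REST query API (`api.crossref.org`). The response-field handling reflects my reading of the API as of mid-2017, so treat those details as assumptions:

```python
# Minimal sketch: fetch CrossRef metadata for one DOI via the public REST
# API. Assumes the requests library; response fields may change over time.
import requests

def crossref_lookup(doi):
    resp = requests.get("https://api.crossref.org/works/{}".format(doi))
    resp.raise_for_status()
    msg = resp.json()["message"]
    return {
        "title": msg.get("title", [None])[0],
        "container": msg.get("container-title", [None])[0],
        "authors": ["{} {}".format(a.get("given", ""), a.get("family", ""))
                    for a in msg.get("author", [])],
    }

print(crossref_lookup("10.1371/journal.pone.0093949"))
```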
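And here's a minimal OAI-PMH harvesting sketch: it pages through `ListRecords` responses and follows resumption tokens, which is the core of what harvesters like BASE and CORE do at scale. I'm using arXiv's endpoint as the example; any endpoint from the directories should behave the same way:

```python
# Minimal OAI-PMH harvesting sketch: page through ListRecords responses,
# following resumptionTokens. Uses only the standard library.
import urllib.request
import urllib.parse
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def harvest(endpoint, metadata_prefix="oai_dc"):
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        url = endpoint + "?" + urllib.parse.urlencode(params)
        tree = ET.parse(urllib.request.urlopen(url))
        for record in tree.iter(OAI + "record"):
            title = record.find(".//" + DC + "title")
            yield title.text if title is not None else None
        # A non-empty resumptionToken means more pages remain; per the spec,
        # the token replaces all other request arguments
        token = tree.find(".//" + OAI + "resumptionToken")
        if token is None or not token.text:
            break
        params = {"verb": "ListRecords", "resumptionToken": token.text}

# Example usage (substitute any OAI-PMH base URL):
for i, title in enumerate(harvest("http://export.arxiv.org/oai2")):
    print(title)
    if i >= 4:
        break
```

A production harvester would also need flow control (HTTP 503 with Retry-After, which arXiv in particular uses for rate limiting) and incremental `from`/`until` date-range requests.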
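Finally, a sketch of streaming records out of one of those PubMed dump files. The filename is a placeholder and the element names are from my understanding of the PubMed XML DTD, so verify both against the current baseline before relying on this:

```python
# Minimal sketch: stream PMIDs and titles out of a gzipped PubMed dump.
# Filename is a placeholder; element names assume the PubMed XML DTD
# (PubmedArticleSet/PubmedArticle/MedlineCitation) -- check the current DTD.
import gzip
import xml.etree.ElementTree as ET

def iter_citations(path):
    with gzip.open(path, "rb") as f:
        # iterparse processes multi-gigabyte dumps without loading the
        # whole tree into memory
        for _, elem in ET.iterparse(f, events=("end",)):
            if elem.tag == "MedlineCitation":
                pmid = elem.findtext("PMID")
                title = elem.findtext("Article/ArticleTitle")
                yield pmid, title
                elem.clear()  # free element contents as we go

for pmid, title in iter_citations("pubmed_baseline_sample.xml.gz"):
    print(pmid, title)
```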
*Note: We don't do a very good job promoting it, but as of June 2017 the Internet Archive is hiring! In particular, we're looking for an all-around web designer and a project manager for an existing 5-person python-web-app team. Check out those roles and more on our [jobs page](https://archive.org/about/jobs.php).*

[0]: "The Number of Scholarly Documents on the Public Web", Khabsa and Giles, PLoS ONE, 2014. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0093949