1 files changed, 209 insertions, 0 deletions
diff --git a/posts/metadata_collections.md b/posts/metadata_collections.md
new file mode 100644
index 0000000..d7f8713
--- /dev/null
+++ b/posts/metadata_collections.md
@@ -0,0 +1,209 @@
+Title: Bibliographic Metadata Dumps
+Author: bnewbold
+Date: 2017-06-07
+Tags: tech, archive, scholar
+Status: draft
+
+# TODO:
+# - does BASE link to fulltext PDFs? is that helpful?
+# - can we actually get academia.edu and researchgate.net papers? maybe?
+
+I've recently been lucky enough to start working on a new big project at the
+[Internet Archive][]: collecting, indexing, and expanding access to research
+publications and datasets in the open world. This is perhaps *the* original
+goal of networked information technology, and thanks to a decade of hard
+work by the Open Access movement it feels like inertia
+[is building][nature-elsevier] towards this one small piece of "universal
+access to all knowledge".
+
+[Internet Archive]: https://archive.org
+[nature-elsevier]: http://www.nature.com/news/scientists-in-germany-peru-and-taiwan-to-lose-access-to-elsevier-journals-1.21223
+
+<div class="sidebar">
+<img src="/static/fig/ia_logo.png" width="150px" alt="internet archive logo" />
+</div>
+
+This is a snapshot-in-time look at "what's already out there" regarding indexes
+of scholarly papers and books (aka, "things that get cited"). There are a ton
+of resources out there, and many of them are just re-aggregating or building on
+top of each other.
+
+Here's a table of index-only resources for papers. These are databases or
+corpuses of metadata that might include links/URLs to full text, but don't seem
+to host fulltext copies themselves:
+
+<table>
+ <tr>
+   <th>What
+   <th>Record Count (millions)
+   <th>Notes
+ <tr>
+   <td>Total digital English language papers
+   <td>114
+   <td>estimated[0], 2014
+ <tr>
+   <td>Total open access
+   <td>27
+   <td>estimated[0], 2014. Meaning "available somewhere"? MS academic had 35
+       million.
+ <tr>
+   <td>Number of DOIs
+   <td>143
+   <td>Global; includes non-journals. 
+ <tr>
+   <td>CrossRef DOIs
+   <td>88
+   <td>Primary registrar for journals/papers in the western world
+ <tr>
+   <td>BASE Search
+   <td>109
+   <td>Data from OAI-PMH
+ <tr>
+   <td>Google Scholar
+   <td>100
+   <td>"records", not URLs
+ <tr>
+   <td>Web of Science
+   <td>90
+   <td>proprietary; 1 billion citation graph
+ <tr>
+   <td>Scopus
+   <td>55
+   <td>proprietary/Elsevier
+ <tr>
+   <td>PubMed
+   <td>26
+   <td>Only half (13mil) have abstract or link to fulltext
+ <tr>
+   <td>CORE
+   <td>24
+   <td>
+ <tr>
+   <td>Semantic Scholar
+   <td>10 to 20
+   <td>Sometimes mirror fulltext?
+ <tr>
+   <td>OpenCitations
+   <td>5
+   <td>Paper entries; Spring 2017
+ <tr>
+   <td>dblp
+   <td>3.7
+   <td>computer science bibliography; Spring 2017
+</table>
+
+A big open question to me is how many pre-digital scholarly articles there are
+which have not been digitized or assigned DOI numbers. Eg, how good is JSTOR
+coverage? I'm unsure how to even compute this number.
+
+And here are full-text collections of papers (which also include metadata):
+
+<table>
+ <tr>
+   <th>What
+   <th>Fulltext Count (millions)
+   <th>Notes
+ <tr>
+   <td>Sci-Hub/scimag
+   <td>62
+   <td>one-file-per-DOI, 2017
+ <tr>
+   <td>CiteSeerX
+   <td>6
+   <td>(2010; presumably many more now?). Crawled from the web
+ <tr>
+   <td>CORE
+   <td>4
+   <td>Extracted fulltext, not PDF? Complete "gold" OA?
+ <tr>
+   <td>PubMed Central
+   <td>4
+   <td>Open Access. 2017
+ <tr>
+   <td>OSF Preprints (COS)
+   <td>2
+   <td>2017
+ <tr>
+   <td>Internet Archive
+   <td>1.5
+   <td>"Clean" mirrored items in Journal collections; we probably have far more
+ <tr>
+   <td>arxiv.org
+   <td>1.2
+   <td>physics+math. articles, not files, 2017
+ <tr>
+   <td>JSTOR Total
+   <td>10
+   <td>mostly locked down. includes books, grey lit
+ <tr>
+   <td>JSTOR Early Articles
+   <td>0.5
+   <td>open access subset
+ <tr>
+   <td>biorxiv.org
+   <td>0.01
+   <td>2017
+</table>
+
+Numbers aside, here are the useful resources to build on top of:
+
+**CrossRef** is the primary **DOI** registrar in the western (English-speaking)
+world. They are a non-profit, one of only a dozen or so DOI registrars; almost
+all scholarly publishers go through them. They provide some basic metadata
+(title, authors, publication), and have excellent data access: bulk datasets, a
+query API, and a streaming update API. This is a good, authoritative foundation
+for building indexes. China, Korea, and Japan have their own DOI registries,
+and published datasets end up in DataCite instead of CrossRef. Other holes in
+DOI coverage are "grey literature" (unpublished or informally published
+documents, like government reports or technical memos), documents pre-2000 with
+absentee publishers, and books (only a small fraction of books/chapters have
+DOIs).
+
+Publishers and repositories seem to be pretty good about providing **OAI-PMH**
+API access to their metadata and records (and sometimes fulltext). Directories
+make it possible to look up thousands of API endpoints. **BASE** seems to be
+the best aggregation of all this metadata, and some projects build on top of
+BASE (eg, oaDOI). **CORE** finds all of its fulltext this way. It's not
+clear if BASE is a good place to pull bulk metadata from; they seem to re-index
+from scratch occasionally. **oaDOI** and **dissem.in** are services that
+provide an API and search interface over metadata and point to Open Access
+copies of the results.
+
+**PubMed** (index) and **PubMed Central** (fulltext) are large and well
+maintained. There are PubMed records and identifiers ("PMID") going far back in
+history, though only for medical texts (there is increasing contemporary
+coverage outside of medicine/biology, but only very recently). Annual and daily
+database dumps are available, so this is a good resource to pull from.
+
+**CiteSeerX** has been crawling the web for PDFs for a long time. Other than
+**Google Scholar** and maybe the **Internet Archive** I think they do the most
+serious paper crawling, though many folks do smaller or one-off crawls. They
+are academic/non-profit and are willing to share metadata and their collected
+papers; their systems are documented and open-source. Metadata and citations
+are extracted from the PDFs themselves. They have collaborated with Microsoft
+Research and the Allen Institute; I suspect they provided most or all content
+for **Semantic Scholar** and **Microsoft Academic Knowledge** (the latter now
+defunct). NB: there are some interesting per-domain crawl statistics
+[available](http://csxcrawlweb01.ist.psu.edu//), though half-broken.
+
+It's worth noting that there is probably a lot of redundancy between
+**pre-prints** and the final published papers, even though semantically most
+people would consider them versions or editions of the same paper, not totally
+distinct works. This might inflate both the record counts and the DOI counts.
+
+A large number of other resources are not listed because they are very
+subject-specific or relatively small. They may or may not be worth pursuing,
+depending on how redundant they are with the larger resources. Eg, CogPrints
+(cognitive science, ~thousands of fulltext), MathSciNet (proprietary math
+bibliography), ERIC (educational resources and grey lit), paperity.org (similar
+to CORE), etc.
+
+*Note: We don't do a very good job promoting it, but as of June 2017 The
+Internet Archive is hiring! In particular we're looking for an all-around web
+designer and a project manager for an existing 5-person Python web-app team.
+Check out those and more on our
+[jobs page](https://archive.org/about/jobs.php)*
+
+[0]: "The Number of Scholarly Documents on the Public Web", PLoS One, 2014,
+Khabsa and Giles. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0093949
+