Title: Bibliographic Metadata Dumps
Author: bnewbold
Date: 2017-06-07
Tags: tech, archive, scholar
Status: draft

# TODO:
# - does BASE link to fulltext PDFs? is that helpful?
# - can we actually get academia.edu and researchgate.net papers? maybe?

I've recently been lucky enough to start working on a big new project at the
[Internet Archive][]: collecting, indexing, and expanding access to research
publications and datasets in the open world. This is perhaps *the* original
goal of networked information technology, and thanks to a decade of hard work
by the Open Access movement it feels like inertia
[is building][nature-elsevier] towards this one small piece of "universal
access to all knowledge".

[Internet Archive]: https://archive.org
[nature-elsevier]: http://www.nature.com/news/scientists-in-germany-peru-and-taiwan-to-lose-access-to-elsevier-journals-1.21223

<div class="sidebar">
<img src="/static/fig/ia_logo.png" width="150px" alt="internet archive logo" />
</div>

This is a snapshot-in-time look at "what's already out there" regarding indexes
of scholarly papers and books (aka, "things that get cited"). There are a ton
of resources out there, and many of them are just re-aggregating or building on
top of each other.

Here's a table of index-only resources for papers. These are databases or
corpuses of metadata that might include links/URLs to full text, but don't seem
to host fulltext copies themselves:

<table>
  <tr>
    <th>What
    <th>Record Count (millions)
    <th>Notes
  <tr>
    <td>Total digital English-language papers
    <td>114
    <td>estimated[0], 2014
  <tr>
    <td>Total open access
    <td>27
    <td>estimated[0], 2014. Meaning "available somewhere"? MS Academic had 35
    million.
  <tr>
    <td>Number of DOIs
    <td>143
    <td>Global; includes non-journals.
  <tr>
    <td>CrossRef DOIs
    <td>88
    <td>Primary registrar for journals/papers in the western world
  <tr>
    <td>BASE Search
    <td>109
    <td>Data from OAI-PMH
  <tr>
    <td>Google Scholar
    <td>100
    <td>"records", not URLs
  <tr>
    <td>Web of Science
    <td>90
    <td>proprietary; 1 billion citation graph
  <tr>
    <td>Scopus
    <td>55
    <td>proprietary/Elsevier
  <tr>
    <td>PubMed
    <td>26
    <td>Only half (13mil) have an abstract or link to fulltext
  <tr>
    <td>CORE
    <td>24
    <td>
  <tr>
    <td>Semantic Scholar
    <td>10 to 20
    <td>Sometimes mirrors fulltext?
  <tr>
    <td>OpenCitations
    <td>5
    <td>Paper entries; Spring 2017
  <tr>
    <td>dblp
    <td>3.7
    <td>computer science bibliography; Spring 2017
</table>

A big open question to me is how many pre-digital scholarly articles there are
which have not been digitized or assigned DOI numbers. Eg, how good is JSTOR
coverage? I'm unsure how to even compute this number.

And here are full-text collections of papers (which also include metadata):

<table>
  <tr>
    <th>What
    <th>Fulltext Count (millions)
    <th>Notes
  <tr>
    <td>Sci-Hub/scimag
    <td>62
    <td>one-file-per-DOI, 2017
  <tr>
    <td>CiteSeerX
    <td>6
    <td>2010; presumably many more now. Crawled from the web
  <tr>
    <td>CORE
    <td>4
    <td>Extracted fulltext, not PDF? Complete "gold" OA?
  <tr>
    <td>PubMed Central
    <td>4
    <td>Open Access; 2017
  <tr>
    <td>OSF Preprints (COS)
    <td>2
    <td>2017
  <tr>
    <td>Internet Archive
    <td>1.5
    <td>"Clean" mirrored items in journal collections; we probably have far more
  <tr>
    <td>arxiv.org
    <td>1.2
    <td>physics+math; articles, not files; 2017
  <tr>
    <td>JSTOR Total
    <td>10
    <td>mostly locked down;
    includes books, grey lit
  <tr>
    <td>JSTOR Early Articles
    <td>0.5
    <td>open access subset
  <tr>
    <td>biorxiv.org
    <td>0.01
    <td>2017
</table>

Numbers aside, here are the useful resources to build on top of:

**CrossRef** is the primary **DOI** registrar in the western (English-speaking)
world. They are a non-profit, one of only a dozen or so DOI registrars; almost
all scholarly publishers go through them. They provide some basic metadata
(title, authors, publication), and have excellent data access: bulk datasets, a
query API, and a streaming update API. This is a good, authoritative foundation
for building indexes. China, Korea, and Japan have their own DOI registries,
and published datasets end up in DataCite instead of CrossRef. Other holes in
DOI coverage are "grey literature" (unpublished or informally published
documents, like government reports or technical memos), documents pre-2000 with
absentee publishers, and books (only a small fraction of books/chapters have
DOIs).

Publishers and repositories seem to be pretty good about providing **OAI-PMH**
API access to their metadata and records (and sometimes fulltext). Directories
make it possible to look up thousands of API endpoints. **BASE** seems to be
the best aggregation of all this metadata, and some projects build on top of
BASE (eg, oaDOI). **CORE** finds all of its fulltext this way. It's not clear
if BASE is a good place to pull bulk metadata from; they seem to re-index from
scratch occasionally. **oaDOI** and **dissem.in** are services that provide an
API and search interface over metadata and point to Open Access copies of the
results.

**PubMed** (index) and **PubMed Central** (fulltext) are large and well
maintained. There are PubMed records and identifiers ("PMID") going far back in
history, though only for medical texts (there is increasing contemporary
coverage beyond medicine/biology, but only very recently).
Annual and daily
database dumps are available, so it's a good resource to pull from.

**CiteSeerX** has been crawling the web for PDFs for a long time. Other than
**Google Scholar** and maybe the **Internet Archive**, I think they do the most
serious paper crawling, though many folks do smaller or one-off crawls. They
are academic/non-profit and are willing to share metadata and their collected
papers; their systems are documented and open-source. Metadata and citations
are extracted from the PDFs themselves. They have collaborated with Microsoft
Research and the Allen Institute; I suspect they provided most or all content
for **Semantic Scholar** and **Microsoft Academic Knowledge** (the latter now
defunct). NB: there are some interesting per-domain crawl statistics
[available](http://csxcrawlweb01.ist.psu.edu//), though half-broken.

It's worth noting that there is probably a lot of redundancy between
**pre-prints** and the final published papers, even though semantically most
people would consider them versions or editions of the same paper, not totally
distinct works. This might inflate both the record counts and the DOI counts.

A large number of other resources are not listed because they are very
subject-specific or relatively small. They may or may not be worth pursuing,
depending on how redundant they are with the larger resources. Eg, CogPrints
(cognitive science, ~thousands of fulltext), MathSciNet (proprietary math
bibliography), ERIC (educational resources and grey lit), paperity.org (similar
to CORE), etc.

*Note: We don't do a very good job promoting it, but as of June 2017 The
Internet Archive is hiring! In particular we're looking for an all-around web
designer and a project manager for an existing 5-person python-web-app team.
Check out those and more on our
[jobs page](https://archive.org/about/jobs.php)*

[0]: "The Number of Scholarly Documents on the Public Web", PLoS One, 2014,
Khabsa and Giles.
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0093949
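
As a concrete taste of the CrossRef query API mentioned above: works can be
looked up by DOI over plain HTTP. A minimal sketch in Python, assuming the
public `api.crossref.org/works/{doi}` endpoint; the response envelope here
("message", "title" as a list) follows the real API, but the embedded sample
record is illustrative rather than fetched:

```python
import json
import urllib.parse

def crossref_work_url(doi):
    # Build a CrossRef REST API lookup URL for a single DOI,
    # percent-encoding the DOI (including its slash).
    return "https://api.crossref.org/works/" + urllib.parse.quote(doi, safe="")

def summarize_work(response):
    # Pull out the fields most useful for building an index:
    # DOI, first title, and (family, given) author name pairs.
    msg = response["message"]
    authors = [(a.get("family", ""), a.get("given", ""))
               for a in msg.get("author", [])]
    return {
        "doi": msg["DOI"],
        "title": msg["title"][0] if msg.get("title") else None,
        "authors": authors,
    }

# Illustrative sample shaped like a CrossRef /works response.
sample = json.loads("""
{"status": "ok",
 "message": {"DOI": "10.1371/journal.pone.0093949",
             "title": ["The Number of Scholarly Documents on the Public Web"],
             "author": [{"family": "Khabsa", "given": "Madian"},
                        {"family": "Giles", "given": "C. Lee"}]}}
""")

print(crossref_work_url("10.1371/journal.pone.0093949"))
print(summarize_work(sample)["title"])
```

In practice you would fetch the URL, check `status == "ok"`, and feed the
summarized records into whatever index you're building.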
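
Similarly, the OAI-PMH harvesting that BASE and CORE rely on is just HTTP GET
plus XML. A sketch of building a `ListRecords` request and pulling Dublin Core
titles out of a response; the `verb`, `metadataPrefix`, `resumptionToken`
parameters and namespaces are standard OAI-PMH, but the endpoint URL and the
embedded sample response are made up for illustration:

```python
import urllib.parse
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

def list_records_url(endpoint, metadata_prefix="oai_dc", resumption_token=None):
    # Build a ListRecords request against an OAI-PMH endpoint (endpoints
    # themselves come from repository directories).
    params = {"verb": "ListRecords"}
    if resumption_token:
        # On follow-up pages, the resumptionToken replaces other arguments.
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    return endpoint + "?" + urllib.parse.urlencode(params)

def dc_titles(xml_text):
    # Extract all Dublin Core <dc:title> values from a response body.
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter("{%s}title" % DC_NS)]

# Illustrative sample shaped like an OAI-PMH ListRecords response.
sample = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><metadata>
      <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                 xmlns:dc="http://purl.org/dc/elements/1.1/">
        <dc:title>An Example Preprint</dc:title>
      </oai_dc:dc>
    </metadata></record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("https://example.org/oai"))
print(dc_titles(sample))
```

A harvester loops on this: fetch a page, extract records, and follow the
`resumptionToken` until the repository stops returning one.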