From cee4a5393132cc1bfb4733281fd5c46392a9586f Mon Sep 17 00:00:00 2001
From: Bryan Newbold <bnewbold@archive.org>
Date: Sun, 14 Apr 2019 18:16:55 -0700
Subject: (OLD) rename some drafts

---
 posts/biblio-metadata-collections.md | 209 +++++++++++++++++++++++++++++++++++
 posts/merkle-design.md               | 107 ++++++++++++++++++
 posts/merkle_design.md               | 107 ------------------
 posts/metadata_collections.md        | 209 -----------------------------------
 4 files changed, 316 insertions(+), 316 deletions(-)
 create mode 100644 posts/biblio-metadata-collections.md
 create mode 100644 posts/merkle-design.md
 delete mode 100644 posts/merkle_design.md
 delete mode 100644 posts/metadata_collections.md
diff --git a/posts/biblio-metadata-collections.md b/posts/biblio-metadata-collections.md
new file mode 100644
index 0000000..d7f8713
--- /dev/null
+++ b/posts/biblio-metadata-collections.md
@@ -0,0 +1,209 @@
+Title: Bibliographic Metadata Dumps
+Author: bnewbold
+Date: 2017-06-07
+Tags: tech, archive, scholar
+Status: draft
+
+# TODO:
+# - does BASE link to fulltext PDFs? is that helpful?
+# - can we actually get academia.edu and researchgate.net papers? maybe?
+
+I've recently been lucky enough to start working on a new big project at the
+[Internet Archive][]: collecting, indexing, and expanding access to research
+publications and datasets in the open world. This is perhaps *the* original
+goal of networked information technology, and thanks to a decade of hard
+work by the Open Access movement it feels like inertia
+[is building][nature-elsevier] towards this one small piece of "universal
+access to all knowledge".
+
+[Internet Archive]: https://archive.org
+[nature-elsevier]: http://www.nature.com/news/scientists-in-germany-peru-and-taiwan-to-lose-access-to-elsevier-journals-1.21223
+
+<div class="sidebar">
+<img src="/static/fig/ia_logo.png" width="150px" alt="internet archive logo" />
+</div>
+
+This is a snapshot-in-time look at "what's already out there" regarding indexes
+of scholarly papers and books (aka, "things that get cited"). There are a ton
+of resources out there, and many of them are just re-aggregating or building on
+top of each other.
+
+Here's a table of index-only resources for papers. These are databases or
+corpuses of metadata that might include links/URLs to full text, but don't seem
+to host fulltext copies themselves:
+
+<table>
+ <tr>
+   <th>What
+   <th>Record Count (millions)
+   <th>Notes
+ <tr>
+   <td>Total digital English language papers
+   <td>114
+   <td>estimated[0], 2014
+ <tr>
+   <td>Total open access
+   <td>27
+   <td>estimated[0], 2014. Meaning "available somewhere"? MS academic had 35
+       million.
+ <tr>
+   <td>Number of DOIs
+   <td>143
+   <td>Global; includes non-journals. 
+ <tr>
+   <td>CrossRef DOIs
+   <td>88
+   <td>Primary registrar for journals/papers in western world
+ <tr>
+   <td>BASE Search
+   <td>109
+   <td>Data from OAI-PMH
+ <tr>
+   <td>Google Scholar
+   <td>100
+   <td>"records", not URLs
+ <tr>
+   <td>Web of Science
+   <td>90
+   <td>proprietary; 1 billion citation graph
+ <tr>
+   <td>Scopus
+   <td>55
+   <td>proprietary/Elsevier
+ <tr>
+   <td>PubMed
+   <td>26
+   <td>Only half (13mil) have abstract or link to fulltext
+ <tr>
+   <td>CORE
+   <td>24
+   <td>
+ <tr>
+   <td>Semantic Scholar
+   <td>10 to 20
+   <td>Sometimes mirror fulltext?
+ <tr>
+   <td>OpenCitations
+   <td>5
+   <td>Paper entries; Spring 2017
+ <tr>
+   <td>dblp
+   <td>3.7
+   <td>computer science bibliography; Spring 2017
+</table>
+
+A big open question to me is how many pre-digital scholarly articles there are
+which have not been digitized or assigned DOI numbers. Eg, how good is JSTOR
+coverage? I'm unsure how to even compute this number.
+
+And here are full-text collections of papers (which also include metadata):
+
+<table>
+ <tr>
+   <th>What
+   <th>Fulltext Count (millions)
+   <th>Notes
+ <tr>
+   <td>Sci-Hub/scimag
+   <td>62
+   <td>one-file-per-DOI, 2017
+ <tr>
+   <td>CiteSeerX
+   <td>6
+   <td>(2010; presumably many more now?). Crawled from the web
+ <tr>
+   <td>CORE
+   <td>4
+   <td>Extracted fulltext, not PDF? Complete "gold" OA?
+ <tr>
+   <td>PubMed Central
+   <td>4
+   <td>Open Access. 2017
+ <tr>
+   <td>OSF Preprints (COS)
+   <td>2
+   <td>2017
+ <tr>
+   <td>Internet Archive
+   <td>1.5
+   <td>"Clean" mirrored items in Journal collections; we probably have far more
+ <tr>
+   <td>arxiv.org
+   <td>1.2
+   <td>physics+math. articles, not files, 2017
+ <tr>
+   <td>JSTOR Total
+   <td>10
+   <td>mostly locked down. includes books, grey lit
+ <tr>
+   <td>JSTOR Early Articles
+   <td>0.5
+   <td>open access subset
+ <tr>
+   <td>biorxiv.org
+   <td>0.01
+   <td>2017
+</table>
+
+Numbers aside, here are the useful resources to build on top of:
+
+**CrossRef** is the primary **DOI** registrar in the western (English-speaking)
+world. They are a non-profit, one of only a dozen or so DOI registrars; almost
+all scholarly publishers go through them. They provide some basic metadata
+(title, authors, publication), and have excellent data access: bulk datasets, a
+query API, and a streaming update API. This is a good, authoritative foundation
+for building indexes. China, Korea, and Japan have their own DOI registries,
+and published datasets end up in DataCite instead of CrossRef. Other holes in
+DOI coverage are "grey literature" (unpublished or informally published
+documents, like government reports or technical memos), documents pre-2000 with
+absentee publishers, and books (only a small fraction of books/chapters have
+DOIs).
+
+Publishers and repositories seem to be pretty good about providing **OAI-PMH**
+API access to their metadata and records (and sometimes fulltext). Directories
+make it possible to look up thousands of API endpoints. **BASE** seems to be
+the best aggregation of all this metadata, and some projects build on top of
+BASE (eg, oaDOI). **CORE** finds all of its fulltext this way. It's not
+clear if BASE is a good place to pull bulk metadata from; they seem to re-index
+from scratch occasionally. **oaDOI** and **dissem.in** are services that
+provide an API and search interface over metadata and point to Open Access
+copies of the results.
+
+**PubMed** (index) and **PubMed Central** (fulltext) are large and well
+maintained. There are PubMed records and identifiers ("PMID") going far back in
+history, though only for medical texts (there is increasing contemporary
+coverage outside of medicine/biology, but only very recently). Annual and daily
+database dumps are available, so a good resource to pull from.
+
+**CiteSeerX** has been crawling the web for PDFs for a long time. Other than
+**Google Scholar** and maybe the **Internet Archive** I think they do the most
+serious paper crawling, though many folks do smaller or one-off crawls. They
+are academic/non-profit and are willing to share metadata and their collected
+papers; their systems are documented and open-source. Metadata and citations
+are extracted from PDFs themselves. They have collaborated with Microsoft
+Research and the Allen Institute; I suspect they provided most or all content
+for **Semantic Scholar** and **Microsoft Academic Knowledge** (the latter now
+defunct). NB: there are some interesting per-domain crawl statistics
+[available](http://csxcrawlweb01.ist.psu.edu//), though half-broken.
+
+It's worth noting that there is probably a lot of redundancy between
+**pre-prints** and the final published papers, even though semantically most
+people would consider them versions or editions of the same paper, not totally
+distinct works. This might inflate both the record counts and the DOI counts.
+
+A large number of other resources are not listed because they are very
+subject-specific or relatively small. They may or may not be worth pursuing,
+depending on how redundant they are with the larger resources. Eg, CogPrints
+(cognitive science, ~thousands of fulltext), MathSciNet (proprietary math
+bibliography), ERIC (educational resources and grey lit), paperity.org (similar
+to CORE), etc.
+
+*Note: We don't do a very good job promoting it, but as of June 2017 The
+Internet Archive is hiring! In particular we're looking for an all-around web
+designer and a project manager for an existing 5 person python-web-app team.
+Check out those and more on our
+[jobs page](https://archive.org/about/jobs.php)*
+
+[0]: "The Number of Scholarly Documents on the Public Web", PLoS One, 2014,
+Khabsa and Giles. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0093949
+
diff --git a/posts/merkle-design.md b/posts/merkle-design.md
new file mode 100644
index 0000000..dc1e271
--- /dev/null
+++ b/posts/merkle-design.md
@@ -0,0 +1,107 @@
+Title: Design Considerations for Merkle-Tree Storage Systems
+Author: bnewbold
+Date: 2018-06-10
+Tags: tech, dweb
+Status: draft
+
+## Features we Want
+
+Four semantic properties we might want from a universal
+content-addressable storage system:
+
+**1. Deterministic file-level addressability, enabling file-level efficiency
+on-wire and at rest.** If every distinct file can be identified by only a
+single, reproducible name, then discovery, indexing, and de-duplication are made
+easier.  If the same file can end up with different names, then that file might
+be transferred or stored separately by default; this creates pressure for the
+application layer to support the concept of "many identifiers for the same
+file", and requires additional coordination at scale.
+
+**2. Simple chunk-level de-duplication between files and versions.** This means
+that if you have two large files with only a few bytes *changed* between them,
+you don't need both full copies, only data proportional to the difference. A
+special case is when an existing file is copied and modified; in this case the
+system can track the history and change. This is distinct from adding two files
+at the same time and detecting that there is a large degree of overlap.
+
+**3. Offset-independent, chunk-level de-duplication between files.** Stronger
+than #2, this method is efficient even if the difference between two files is
+one of inserting (or deleting) an *offset* of bytes; the challenge of
+detecting that two files are identical except for an offset is harder than that
+of identifying the identical bytes at the same locations.
+
+**4. File-level Interoperability with legacy and future systems.** Can the
+system be used as a transparent "layer" by other systems? Eg, can a thin proxy
+be implemented on top of existing file-systems and blob stores? Can thin
+file-system and blob-store gateways be layered on top of the storage system? A
+common source of friction here is when generic, off-the-shelf full-file hashes
+like SHA-1, SHA-256, or BLAKE2b are not used in a common manner.
+
+This last one doesn't matter if you are planning on Total World Domination with
+no need for future upgrades.
+
+## Existing Implementations
+
+git nails #1 (at the cost of not having an upgrade path for the hash function).
+It contains implementation work-arounds for #2 and #3: an internal compression
+format allows storing and transmitting only the diffs between two versions of a
+file, instead of the full files themselves. This isn't baked in to the
+structure of the system though, and doesn't always work (in particular, seems
+to get skipped for large files). By using SHA-1, it gets very close to #4, but
+prepends a header (object type and file length) to the file's contents
+before hashing, so the git address of a blob does not match the usual SHA-1 of
+the file.
+
+The Dat protocol provides a weak version of #2, but no existing implementation
+actually implements any form of de-duplication, even at the full-file level.
+Eg, if you delete a file from a Dat archive and then re-add it later, the file
+contents are duplicated in the content feed, even though the standard would
+allow pointing back to the previous copy.
+
+IPFS has a weak version of #1: the file digest is deterministic if the same
+software version and configuration is used, 
+
+## Challenges in Implementing These Features
+
+Features #1 and #3 seem very difficult to reconcile. A frequent trick to
+compress deltas between files is to take history into account, but using
+history makes the resulting hash (name) history dependent. Robust,
+deterministic, content-aware hashing is supposed to enable both features at the
+same time, which is exciting, but seems to have been abandoned by all existing
+implementations because it's too slow.
+
+## Tangled Hierarchies
+
+git and other versioned storage systems are like catnip to programmers:
+folks love to think about re-inventing "everything" on top of such a system. I
+think this is because git supplies specific semantic features people love,
+while being deeply entangled with files and file systems. Computer engineering
+is All About Files, and git is both made out of files (look in .git; it's
+simple files and directories all the way down!) and accommodating of files.
+
+Consider:
+
+- on UNIX systems, a block storage device is a fixed size bytestream; a big
+  file, if you will. File systems on top of this are like an archive file
+  format (eg, tar, zip).
+- disk partitioning schemes (like GPT) and volume managers (like LVM) are
+  basically the same thing as file archive formats (like .tar)
+- a hypercore feed (which Dat is built upon) is a single long append-only
+  bytestream: a growing file, if you will, and hyperdrive is a file system (or
+  file format) on top of that.
+
+There's a tangled hierarchy here, in the same way that (at least on UNIX), one
+can create any variation of:
+
+- a file...
+- in an archive file-format (like .zip)...
+- stored in a file-system (like ext4 or ISO)...
+- serialized into a binary file...
+- on another file system (perhaps NTFS)...
+- residing in a partition...
+- on a block device.
+
+If we had a super-duper merkle-tree mechanism for storing files, and a
+consistent way of serializing it to a single file, we could write it directly to our
+disk block devices, back up and synchronize file systems efficiently, etc.
+
diff --git a/posts/merkle_design.md b/posts/merkle_design.md
deleted file mode 100644
index dc1e271..0000000
--- a/posts/merkle_design.md
+++ /dev/null
@@ -1,107 +0,0 @@
-Title: Design Considerations for Merkle-Tree Storage Systems
-Author: bnewbold
-Date: 2018-06-10
-Tags: tech, dweb
-Status: draft
-
-## Features we Want
-
-Four semantic properties we might want from a universal
-content-addressiblestorage system:
-
-**1. Deterministic file-level addressability, enabling file-level efficiency
-on-wire and at rest.** If every distinct file can be identified by only a
-single, reproducible name, then discovery, indexing, and de-duplicaiton is made
-easier.  If the same file can end up with different names, then that file might
-be transfered or stored separately by default; this creates pressure for the
-application layer to support the concept of "many identifiers for the same
-file", and requires additional coordination at scale.
-
-**2. Simple chunk-level de-duplication between files and versions.** This means
-that if you have two large files with only a few bytes *changed* between them,
-you don't need both full copies, only data proportional to the difference. A
-special case is when an existing file is copied and modified; in this case the
-system can track the history and change. This is distinct from adding two files
-at the same time and detecting that there is a large degree of overlap.
-
-**3. Offset-independent, chunk-level du-duplication between files.** Stronger
-than #2, this method is efficient even if the different between two files is
-one of inserting (or deleting) and *offset* of bytes; the challenge of
-detecting that two files are identical except for an offset is harder than that
-of identifying the identical bytes at the same locations.
-
-**4. File-level Interoperability with legacy and future systems.** Can the
-system be used as a transparent "layer" by other systems? Eg, can a thin proxy
-be implemented on top of existing file-systems and blob stores? Can thin
-file-system and blob-store gateways be layered on top of the storage system? A
-common source of friction here is when generic, off-the-shelf full-file hashes
-like SHA-1, SHA-256, or BLAKE2b are not used in a common manner.
-
-This last one doesn't matter if you are planning on Total World Domination with
-no need for future upgrades.
-
-## Existing Implementations
-
-git nails #1 (at the cost of not having an upgrade path for the hash function).
-It contains implementation work-arounds for #2 and #3: an internal compression
-format allows storing and transmitting only the diffs between two versions of a
-file, instead of the file files themselves. This isn't baked in to the
-structure of the system though, and doesn't always work (in particular, seems
-to get skipped for large files). By using SHA-1, it gets very close to #4, but
-decided to prepend the length of a file to the file's contents themselves
-before hashing, so the git address of a blob does not match the usual SHA-1 of
-the file.
-
-The Dat protocol provides a weak version of #2, but no existing implementation
-actually implements any form of de-duplication, even at the full-file level.
-Eg, if you delete a file from a Dat archive and then re-add it later, the file
-contents are duplicated in the content feed, even though the standard would
-allow pointing back to the previous copy.
-
-IPFS has a weak version of #1: the file digest is deterministic if the same
-software version and configuration is used, 
-
-## Challenges in Implementing These Features
-
-Features #1 and #3 seem very difficult to reconcile. A frequent trick to
-compress deltas between files is to take history into account, but using
-history makes the resulting hash (name) history dependent. Robust,
-deterministic, content-aware hashing is supposed enable both features at the
-same time, which is exciting, but seems to have been abandoned by all existing
-implementations because it's too slow.
-
-## Tangled Hierarchies
-
-git and other versioned storage systems are like catnip to programmers:
-folks love to think about re-inventing "everything" on top of such a system. I
-think this is because git supplies specific semantic features people love,
-while being deeply entangled with files and file systems. Computer engingeering
-is All About Files, and git is both made out of files (look in .git; it's
-simple files and directories all the way down!) and accomodating files.
-
-Consider:
-
-- on UNIX systems, a block storage device is a fixed size bytestream; a big
-  file, if you will. File systems on top of this are like an archive file
-  format (eg, tar, zip).
-- disk partitioning schemes (like GPT) and volume managers (like LVM) are
-  basically the same thing as file archive formats (like .tar)
-- a hypercore feed (which Dat is built upon) is a single long append-only
-  bytestream: a growing file, if you will, and hyperdrive is a file system (or
-  file format) on top of that.
-
-There's a tangled hierarchy here, in the same way that (at least on UNIX), one
-can create any variation of:
-
-- a file...
-- in an archive file-format (like .zip)...
-- stored in a file-system (like ext4 or ISO)...
-- serialized into a binary file...
-- on another file system (perhaps NTFS)...
-- residing in a partition...
-- on a block device.
-
-If we had a super-duper merkle-tree mechanism for storing files, and a
-consistent way of serializing it to a single file, we write it directly to our
-disk block devices, backup and synchronize file systems efficiently, etc.
-
diff --git a/posts/metadata_collections.md b/posts/metadata_collections.md
deleted file mode 100644
index d7f8713..0000000
--- a/posts/metadata_collections.md
+++ /dev/null
@@ -1,209 +0,0 @@
-Title: Bibliographic Metadata Dumps
-Author: bnewbold
-Date: 2017-06-07
-Tags: tech, archive, scholar
-Status: draft
-
-# TODO:
-# - does BASE link to fulltext PDFs? is that helpful?
-# - can we actually get academia.edu and researchgate.net papers? maybe?
-
-I've recently been lucky enough to start working on a new big project at the
-[Internet Archive][]: collecting, indexing, and expanding access to research
-publications and datasets in the open world. This is perhaps *the* original
-goal of networked information technology, and thanks to a decade of hard
-work by the Open Access movement it feels like intertia
-[is building][nature-elsevier] towards this one small piece of "universal
-access to all knowledge".
-
-[Internet Archive]: https://archive.org
-[nature-elsevier]: http://www.nature.com/news/scientists-in-germany-peru-and-taiwan-to-lose-access-to-elsevier-journals-1.21223
-
-<div class="sidebar">
-<img src="/static/fig/ia_logo.png" width="150px" alt="internet archive logo" />
-</div>
-
-This is a snapshot-in-time look at "what's already out there" regarding indexes
-of scholarly papers and books (aka, "things that get cited"). There are a ton
-of resources out there, and many of them are just re-aggregating or building on
-top of each other.
-
-Here's a table of index-only resources for papers. These are databases or
-corpuses of metadata that might include links/URLs to full text, but don't seem
-to host fulltext copies themselves:
-
-<table>
- <tr>
-   <th>What
-   <th>Record Count (millions)
-   <th>Notes
- <tr>
-   <td>Total digital English language papers
-   <td>114
-   <td>estimated[0], 2014
- <tr>
-   <td>Total open access
-   <td>27
-   <td>estimated[0], 2014. Meaning "available somewhere"? MS academic had 35
-       million.
- <tr>
-   <td>Number of DOIs
-   <td>143
-   <td>Global; includes non-journals. 
- <tr>
-   <td>CrossRef DOIs
-   <td>88
-   <td>Primary registrar for journals/paper in western world
- <tr>
-   <td>BASE Search
-   <td>109
-   <td>Data from OAI-PMH
- <tr>
-   <td>Google Scholar
-   <td>100
-   <td>"records", not URLs
- <tr>
-   <td>Web of Science
-   <td>90
-   <td>proprietary; 1 billion citation graph
- <tr>
-   <td>Scopus
-   <td>55
-   <td>proprietary/Elsevier
- <tr>
-   <td>PubMed
-   <td>26
-   <td>Only half (13mil) have abstract or link to fulltext
- <tr>
-   <td>CORE
-   <td>24
-   <td>
- <tr>
-   <td>Semantic Scholar
-   <td>10 to 20
-   <td>Sometimes mirror fulltext?
- <tr>
-   <td>OpenCitations
-   <td>5
-   <td>Paper entries; Spring 2017
- <tr>
-   <td>dblp
-   <td>3.7
-   <td>computer science bibliography; Spring 2017
-</table>
-
-A big open question to me is how many pre-digital scholarly articles there are
-which have not been digitized or assigned DOI numbers. Eg, how good is JSTOR
-coverage? I'm unsure how to even compute this number.
-
-And here are full-text collections of papers (which also include metadata):
-
-<table>
- <tr>
-   <th>What
-   <th>Fulltext Count (millions)
-   <th>Notes
- <tr>
-   <td>Sci-Hub/scimag
-   <td>62
-   <td>one-file-per-DOI, 2017
- <tr>
-   <td>CiteSeerX
-   <td>6
-   <td>(2010; presumably many more now?). Crawled from the web
- <tr>
-   <td>CORE
-   <td>4
-   <td>Extracted fulltext, not PDF? Complete "gold" OA?
- <tr>
-   <td>PubMed Central
-   <td>4
-   <td>Open Access. 2017
- <tr>
-   <td>OSF Preprints (COS)
-   <td>2
-   <td>2017
- <tr>
-   <td>Internet Archive
-   <td>1.5
-   <td>"Clean" mirrored items in Journal collections; we probably have far more
- <tr>
-   <td>arxiv.org
-   <td>1.2
-   <td>physics+math. articles, not files, 2017
- <tr>
-   <td>JSTOR Total
-   <td>10
-   <td>mostly locked down. includes books, grey lit
- <tr>
-   <td>JSTOR Early Articles
-   <td>0.5
-   <td>open access subset
- <tr>
-   <td>biorxiv.org
-   <td>0.01
-   <td>2017
-</table>
-
-Numbers aside, here are the useful resources to build on top of:
-
-**CrossRef** is the primary **DOI** registrar in the western (english speaking
-world). They are a non-profit, one of only a dozen or so DOI registrars; almost
-all scholarly publishers go through them. They provide some basic metadata
-(title, authors, publication), and have excellent data access: bulk datasets, a
-query API, and a streaming update API. This is a good, authoritative foundation
-for building indexes. China, Korea, and Japan have their own DOI registries,
-and published datasets end up in DataCite instead of CrossRef. Other holes in
-DOI coverage are "grey literature" (unpublished or informally published
-documents, like government reports or technical memos), documents pre-2000 with
-absentee publishers, and books (only a small fraction of books/chapters have
-DOIs).
-
-Publishers and repositories seem to be pretty good about providing **OAI-PMH**
-API access to their metadata and records (and sometimes fulltext). Directories
-make it possible to look up thousands of API endpoints. **BASE** seems to be
-the best aggregation of all this metadata, and some projects build on top of
-BASE (eg, oaDOI). **CORE** finds all of it's fulltext this way. It's not
-clear if BASE is a good place to pull bulk metadata from; they seem to re-index
-from scratch occasionally. **oaDOI** and **dissem.in** are services that
-provide an API and search interface over metadata and point to Open Access
-copies of the results.
-
-**PubMed** (index) and **PubMed Central** (fulltext) are large and well
-maintained. There are Pubmed records and identifiers ("PMID") going far back in
-history, though only for medical texts (there is increasing contemporary
-coversage out of medicine/biology, but only very recently). Annual and daily
-database dumps are available, so a good resource to pull from.
-
-**CiteSeerX** has been crawling the web for PDFs for a long time. Other than
-**Google Scholar** and maybe the **Internet Archive** I think they do the most
-serious paper crawling, though many folks do smaller or one-off crawls. They
-are academic/non-profit and are willing to share metadata and their collected
-papers; their systems are documented and open-source. Metadata and citations
-are extracted from PDFs themselves. They have collaborated with the Microsoft
-Research and the Allen Institute; I suspect they provided most or all content
-for **Semantic Scholar** and **Microsoft Academic Knowledge** (the later now
-defunct). NB: there are some interesting per-domain crawl statistics
-[available](http://csxcrawlweb01.ist.psu.edu//), though half-broken.
-
-It's worth noting that there is probably a lot of redundancy between
-**pre-prints** and the final published papers, even though semantically most
-people would consider them versions or editions of the same paper, not totally
-distinct works. This might inflate both the record counts and the DOI counts.
-
-A large number of other resources are not listed because they are very
-subject-specific or relatively small. They may or may not be worth pursuing,
-depending on how redundant they are with the larger resources. Eg, CogPrints
-(cognative science, ~thousands of fulltext), MathSciNet (proprietary math
-bibliogrpahy, ERIC (educational resources and grey lit), paperity.org (similar
-to CORE), etc.
-
-*Note: We don't do a very good job promoting it, but as of June 2017 The
-Internet Archive is hiring! In particular we're looking for an all-around web
-designer and a project manager for an existing 5 person python-web-app team.
-Check out those and more on our
-[jobs page](https://archive.org/about/jobs.php)*
-
-[0]: "The Number of Scholarly Documents on the Public Web", PLoS One, 1994,
-Khabsa and Giles. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0093949
-
-- 
cgit v1.2.3

