From cee4a5393132cc1bfb4733281fd5c46392a9586f Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Sun, 14 Apr 2019 18:16:55 -0700
Subject: (OLD) rename some drafts

---
 posts/biblio-metadata-collections.md | 209 +++++++++++++++++++++++++++++++++++
 posts/merkle-design.md | 107 ++++++++++++++++++
 posts/merkle_design.md | 107 ------------------
 posts/metadata_collections.md | 209 -----------------------------------
 4 files changed, 316 insertions(+), 316 deletions(-)
 create mode 100644 posts/biblio-metadata-collections.md
 create mode 100644 posts/merkle-design.md
 delete mode 100644 posts/merkle_design.md
 delete mode 100644 posts/metadata_collections.md

diff --git a/posts/biblio-metadata-collections.md b/posts/biblio-metadata-collections.md
new file mode 100644
index 0000000..d7f8713
--- /dev/null
+++ b/posts/biblio-metadata-collections.md
@@ -0,0 +1,209 @@
+Title: Bibliographic Metadata Dumps
+Author: bnewbold
+Date: 2017-06-07
+Tags: tech, archive, scholar
+Status: draft
+
+# TODO:
+# - does BASE link to fulltext PDFs? is that helpful?
+# - can we actually get academia.edu and researchgate.net papers? maybe?
+
+I've recently been lucky enough to start working on a new big project at the
+[Internet Archive][]: collecting, indexing, and expanding access to research
+publications and datasets in the open world. This is perhaps *the* original
+goal of networked information technology, and thanks to a decade of hard
+work by the Open Access movement it feels like inertia
+[is building][nature-elsevier] towards this one small piece of "universal
+access to all knowledge".
+
+[Internet Archive]: https://archive.org
+[nature-elsevier]: http://www.nature.com/news/scientists-in-germany-peru-and-taiwan-to-lose-access-to-elsevier-journals-1.21223
+
+This is a snapshot-in-time look at "what's already out there" regarding indexes
+of scholarly papers and books (aka, "things that get cited"). There are a ton
+of resources out there, and many of them are just re-aggregating or building on
+top of each other.
+
+Here's a table of index-only resources for papers. These are databases or
+corpuses of metadata that might include links/URLs to full text, but don't seem
+to host fulltext copies themselves:
+
+| What | Record Count (millions) | Notes |
+|------|-------------------------|-------|
+| Total digital English language papers | 114 | estimated[0], 2014 |
+| Total open access | 27 | estimated[0], 2014. Meaning "available somewhere"? MS Academic had 35 million. |
+| Number of DOIs | 143 | Global; includes non-journals. |
+| CrossRef DOIs | 88 | Primary registrar for journals/papers in the western world |
+| BASE Search | 109 | Data from OAI-PMH |
+| Google Scholar | 100 | "records", not URLs |
+| Web of Science | 90 | proprietary; 1-billion-citation graph |
+| Scopus | 55 | proprietary/Elsevier |
+| PubMed | 26 | Only half (13 million) have abstract or link to fulltext |
+| CORE | 24 | |
+| Semantic Scholar | 10 to 20 | Sometimes mirror fulltext? |
+| OpenCitations | 5 | Paper entries; Spring 2017 |
+| dblp | 3.7 | computer science bibliography; Spring 2017 |
+
+A big open question to me is how many pre-digital scholarly articles there are
+which have not been digitized or assigned DOI numbers. Eg, how good is JSTOR
+coverage? I'm unsure how to even compute this number.
+
+And here are full-text collections of papers (which also include metadata):
+
+| What | Fulltext Count (millions) | Notes |
+|------|---------------------------|-------|
+| Sci-Hub/scimag | 62 | one-file-per-DOI, 2017 |
+| CiteSeerX | 6 | (2010; presumably many more now?). Crawled from the web |
+| CORE | 4 | Extracted fulltext, not PDF? Complete "gold" OA? |
+| PubMed Central | 4 | Open Access. 2017 |
+| OSF Preprints (COS) | 2 | 2017 |
+| Internet Archive | 1.5 | "Clean" mirrored items in Journal collections; we probably have far more |
+| arxiv.org | 1.2 | physics+math. articles, not files, 2017 |
+| JSTOR Total | 10 | mostly locked down. includes books, grey lit |
+| JSTOR Early Articles | 0.5 | open access subset |
+| biorxiv.org | 0.01 | 2017 |
+
+Numbers aside, here are the useful resources to build on top of:
+
+**CrossRef** is the primary **DOI** registrar in the western (English-speaking)
+world. They are a non-profit, one of only a dozen or so DOI registrars; almost
+all scholarly publishers go through them. They provide some basic metadata
+(title, authors, publication), and have excellent data access: bulk datasets, a
+query API, and a streaming update API. This is a good, authoritative foundation
+for building indexes. China, Korea, and Japan have their own DOI registries,
+and published datasets end up in DataCite instead of CrossRef. Other holes in
+DOI coverage are "grey literature" (unpublished or informally published
+documents, like government reports or technical memos), documents pre-2000 with
+absentee publishers, and books (only a small fraction of books/chapters have
+DOIs).
+
+Publishers and repositories seem to be pretty good about providing **OAI-PMH**
+API access to their metadata and records (and sometimes fulltext). Directories
+make it possible to look up thousands of API endpoints. **BASE** seems to be
+the best aggregation of all this metadata, and some projects build on top of
+BASE (eg, oaDOI). **CORE** finds all of its fulltext this way. It's not
+clear if BASE is a good place to pull bulk metadata from; they seem to re-index
+from scratch occasionally. **oaDOI** and **dissem.in** are services that
+provide an API and search interface over metadata and point to Open Access
+copies of the results.
+
+**PubMed** (index) and **PubMed Central** (fulltext) are large and well
+maintained. There are PubMed records and identifiers ("PMID") going far back in
+history, though only for medical texts (there is increasing contemporary
+coverage outside of medicine/biology, but only very recently). Annual and daily
+database dumps are available, so it is a good resource to pull from.
+
+**CiteSeerX** has been crawling the web for PDFs for a long time. Other than
+**Google Scholar** and maybe the **Internet Archive** I think they do the most
+serious paper crawling, though many folks do smaller or one-off crawls. They
+are academic/non-profit and are willing to share metadata and their collected
+papers; their systems are documented and open-source. Metadata and citations
+are extracted from PDFs themselves. They have collaborated with Microsoft
+Research and the Allen Institute; I suspect they provided most or all content
+for **Semantic Scholar** and **Microsoft Academic Knowledge** (the latter now
+defunct). NB: there are some interesting per-domain crawl statistics
+[available](http://csxcrawlweb01.ist.psu.edu//), though half-broken.
+
+It's worth noting that there is probably a lot of redundancy between
+**pre-prints** and the final published papers, even though semantically most
+people would consider them versions or editions of the same paper, not totally
+distinct works. This might inflate both the record counts and the DOI counts.
+
+A large number of other resources are not listed because they are very
+subject-specific or relatively small. They may or may not be worth pursuing,
+depending on how redundant they are with the larger resources. Eg, CogPrints
+(cognitive science, ~thousands of fulltext), MathSciNet (proprietary math
+bibliography), ERIC (educational resources and grey lit), paperity.org (similar
+to CORE), etc.
+
+*Note: We don't do a very good job promoting it, but as of June 2017 the
+Internet Archive is hiring! In particular we're looking for an all-around web
+designer and a project manager for an existing 5-person python-web-app team.
+Check out those and more on our
+[jobs page](https://archive.org/about/jobs.php)*
+
+[0]: "The Number of Scholarly Documents on the Public Web", PLoS One, 2014,
+Khabsa and Giles. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0093949
+
diff --git a/posts/merkle-design.md b/posts/merkle-design.md
new file mode 100644
index 0000000..dc1e271
--- /dev/null
+++ b/posts/merkle-design.md
@@ -0,0 +1,107 @@
+Title: Design Considerations for Merkle-Tree Storage Systems
+Author: bnewbold
+Date: 2018-06-10
+Tags: tech, dweb
+Status: draft
+
+## Features we Want
+
+Four semantic properties we might want from a universal
+content-addressable storage system:
+
+**1. Deterministic file-level addressability, enabling file-level efficiency
+on-wire and at rest.** If every distinct file can be identified by only a
+single, reproducible name, then discovery, indexing, and de-duplication are made
+easier. If the same file can end up with different names, then that file might
+be transferred or stored separately by default; this creates pressure for the
+application layer to support the concept of "many identifiers for the same
+file", and requires additional coordination at scale.
+
+**2. Simple chunk-level de-duplication between files and versions.** This means
+that if you have two large files with only a few bytes *changed* between them,
+you don't need both full copies, only data proportional to the difference. A
+special case is when an existing file is copied and modified; in this case the
+system can track the history and change. This is distinct from adding two files
+at the same time and detecting that there is a large degree of overlap.
+
+**3. Offset-independent, chunk-level de-duplication between files.** Stronger
+than #2, this method is efficient even if the difference between two files is
+one of inserting (or deleting) an *offset* of bytes; the challenge of
+detecting that two files are identical except for an offset is harder than that
+of identifying the identical bytes at the same locations.
+
+**4. File-level interoperability with legacy and future systems.** Can the
+system be used as a transparent "layer" by other systems? Eg, can a thin proxy
+be implemented on top of existing file-systems and blob stores? Can thin
+file-system and blob-store gateways be layered on top of the storage system? A
+common source of friction here is when generic, off-the-shelf full-file hashes
+like SHA-1, SHA-256, or BLAKE2b are not used in a common manner.
+
+This last one doesn't matter if you are planning on Total World Domination with
+no need for future upgrades.
+
+## Existing Implementations
+
+git nails #1 (at the cost of not having an upgrade path for the hash function).
+It contains implementation work-arounds for #2 and #3: an internal compression
+format allows storing and transmitting only the diffs between two versions of a
+file, instead of the full files themselves. This isn't baked in to the
+structure of the system though, and doesn't always work (in particular, seems
+to get skipped for large files). By using SHA-1, it gets very close to #4, but
+git prepends a short header (the object type and the length of the file) to the
+file's contents before hashing, so the git address of a blob does not match the
+usual SHA-1 of the file.
+
+The Dat protocol provides a weak version of #2, but no existing implementation
+actually implements any form of de-duplication, even at the full-file level.
+Eg, if you delete a file from a Dat archive and then re-add it later, the file
+contents are duplicated in the content feed, even though the standard would
+allow pointing back to the previous copy.
+
+IPFS has a weak version of #1: the file digest is deterministic if the same
+software version and configuration is used,
+
+## Challenges in Implementing These Features
+
+Features #1 and #3 seem very difficult to reconcile. A frequent trick to
+compress deltas between files is to take history into account, but using
+history makes the resulting hash (name) history-dependent. Robust,
+deterministic, content-aware hashing is supposed to enable both features at the
+same time, which is exciting, but seems to have been abandoned by all existing
+implementations because it's too slow.
+
+## Tangled Hierarchies
+
+git and other versioned storage systems are like catnip to programmers:
+folks love to think about re-inventing "everything" on top of such a system. I
+think this is because git supplies specific semantic features people love,
+while being deeply entangled with files and file systems. Computer engineering
+is All About Files, and git is both made out of files (look in .git; it's
+simple files and directories all the way down!) and accommodating files.
+
+Consider:
+
+- on UNIX systems, a block storage device is a fixed size bytestream; a big
+  file, if you will. File systems on top of this are like an archive file
+  format (eg, tar, zip).
+- disk partitioning schemes (like GPT) and volume managers (like LVM) are
+  basically the same thing as file archive formats (like .tar)
+- a hypercore feed (which Dat is built upon) is a single long append-only
+  bytestream: a growing file, if you will, and hyperdrive is a file system (or
+  file format) on top of that.
+
+There's a tangled hierarchy here, in the same way that (at least on UNIX), one
+can create any variation of:
+
+- a file...
+- in an archive file-format (like .zip)...
+- stored in a file-system (like ext4 or ISO)...
+- serialized into a binary file...
+- on another file system (perhaps NTFS)...
+- residing in a partition...
+- on a block device.
+
+If we had a super-duper merkle-tree mechanism for storing files, and a
+consistent way of serializing it to a single file, we could write it directly to
+our disk block devices, back up and synchronize file systems efficiently, etc.
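The gap between features #2 and #3 in the draft above can be made concrete with a small sketch. It is only an illustration under stated assumptions, not how git, Dat, or IPFS actually chunk data: the function names, the 16-byte window, and the 6-bit boundary mask are invented for this example, and each window is re-hashed with SHA-256 instead of using a real rolling hash (such as a Rabin fingerprint), which keeps the code short but slow. The sketch compares fixed-size chunking with content-defined chunking when one byte is inserted at the front of a file.

```python
import hashlib
import random


def fixed_chunks(data: bytes, size: int = 64) -> list[bytes]:
    """Split at fixed offsets: chunk boundaries depend only on position."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data: bytes, window: int = 16, mask: int = 0x3F) -> list[bytes]:
    """Split wherever a hash of the preceding `window` bytes has its low bits zero.

    Boundaries depend only on nearby content, not on absolute offsets.
    The parameters are hypothetical, and re-hashing every window is
    O(n * window); it stands in for a proper rolling hash.
    """
    chunks = []
    start = 0
    for end in range(window, len(data) + 1):
        digest = hashlib.sha256(data[end - window:end]).digest()
        if int.from_bytes(digest[:4], "big") & mask == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks


def shared_chunk_count(a: list[bytes], b: list[bytes]) -> int:
    """Number of distinct chunks (by SHA-256 digest) present in both lists."""
    digests_a = {hashlib.sha256(c).digest() for c in a}
    digests_b = {hashlib.sha256(c).digest() for c in b}
    return len(digests_a & digests_b)


def demo(length: int = 4096, seed: int = 0) -> dict[str, int]:
    """Compare both schemes on a pseudo-random file and a copy shifted by one byte."""
    rng = random.Random(seed)
    original = bytes(rng.getrandbits(8) for _ in range(length))
    shifted = b"\x00" + original  # same content, offset by one inserted byte
    cdc = content_defined_chunks(original)
    return {
        "fixed_shared": shared_chunk_count(fixed_chunks(original), fixed_chunks(shifted)),
        "cdc_total": len({hashlib.sha256(c).digest() for c in cdc}),
        "cdc_shared": shared_chunk_count(cdc, content_defined_chunks(shifted)),
    }


if __name__ == "__main__":
    print(demo())
```

With content-defined boundaries, every chunk after the first boundary reappears unchanged in the shifted copy, whereas the fixed-size chunks all move. The sketch also hints at the tension with #1: any tree hash built over these chunks would depend on the chunking parameters as well as on the file contents.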
+
diff --git a/posts/merkle_design.md b/posts/merkle_design.md
deleted file mode 100644
index dc1e271..0000000
--- a/posts/merkle_design.md
+++ /dev/null
@@ -1,107 +0,0 @@
-Title: Design Considerations for Merkle-Tree Storage Systems
-Author: bnewbold
-Date: 2018-06-10
-Tags: tech, dweb
-Status: draft
-
-## Features we Want
-
-Four semantic properties we might want from a universal
-content-addressable storage system:
-
-**1. Deterministic file-level addressability, enabling file-level efficiency
-on-wire and at rest.** If every distinct file can be identified by only a
-single, reproducible name, then discovery, indexing, and de-duplication are made
-easier. If the same file can end up with different names, then that file might
-be transferred or stored separately by default; this creates pressure for the
-application layer to support the concept of "many identifiers for the same
-file", and requires additional coordination at scale.
-
-**2. Simple chunk-level de-duplication between files and versions.** This means
-that if you have two large files with only a few bytes *changed* between them,
-you don't need both full copies, only data proportional to the difference. A
-special case is when an existing file is copied and modified; in this case the
-system can track the history and change. This is distinct from adding two files
-at the same time and detecting that there is a large degree of overlap.
-
-**3. Offset-independent, chunk-level de-duplication between files.** Stronger
-than #2, this method is efficient even if the difference between two files is
-one of inserting (or deleting) an *offset* of bytes; the challenge of
-detecting that two files are identical except for an offset is harder than that
-of identifying the identical bytes at the same locations.
-
-**4. File-level interoperability with legacy and future systems.** Can the
-system be used as a transparent "layer" by other systems? Eg, can a thin proxy
-be implemented on top of existing file-systems and blob stores? Can thin
-file-system and blob-store gateways be layered on top of the storage system? A
-common source of friction here is when generic, off-the-shelf full-file hashes
-like SHA-1, SHA-256, or BLAKE2b are not used in a common manner.
-
-This last one doesn't matter if you are planning on Total World Domination with
-no need for future upgrades.
-
-## Existing Implementations
-
-git nails #1 (at the cost of not having an upgrade path for the hash function).
-It contains implementation work-arounds for #2 and #3: an internal compression
-format allows storing and transmitting only the diffs between two versions of a
-file, instead of the full files themselves. This isn't baked in to the
-structure of the system though, and doesn't always work (in particular, seems
-to get skipped for large files). By using SHA-1, it gets very close to #4, but
-git prepends a short header (the object type and the length of the file) to the
-file's contents before hashing, so the git address of a blob does not match the
-usual SHA-1 of the file.
-
-The Dat protocol provides a weak version of #2, but no existing implementation
-actually implements any form of de-duplication, even at the full-file level.
-Eg, if you delete a file from a Dat archive and then re-add it later, the file
-contents are duplicated in the content feed, even though the standard would
-allow pointing back to the previous copy.
-
-IPFS has a weak version of #1: the file digest is deterministic if the same
-software version and configuration is used,
-
-## Challenges in Implementing These Features
-
-Features #1 and #3 seem very difficult to reconcile. A frequent trick to
-compress deltas between files is to take history into account, but using
-history makes the resulting hash (name) history-dependent. Robust,
-deterministic, content-aware hashing is supposed to enable both features at the
-same time, which is exciting, but seems to have been abandoned by all existing
-implementations because it's too slow.
-
-## Tangled Hierarchies
-
-git and other versioned storage systems are like catnip to programmers:
-folks love to think about re-inventing "everything" on top of such a system. I
-think this is because git supplies specific semantic features people love,
-while being deeply entangled with files and file systems. Computer engineering
-is All About Files, and git is both made out of files (look in .git; it's
-simple files and directories all the way down!) and accommodating files.
-
-Consider:
-
-- on UNIX systems, a block storage device is a fixed size bytestream; a big
-  file, if you will. File systems on top of this are like an archive file
-  format (eg, tar, zip).
-- disk partitioning schemes (like GPT) and volume managers (like LVM) are
-  basically the same thing as file archive formats (like .tar)
-- a hypercore feed (which Dat is built upon) is a single long append-only
-  bytestream: a growing file, if you will, and hyperdrive is a file system (or
-  file format) on top of that.
-
-There's a tangled hierarchy here, in the same way that (at least on UNIX), one
-can create any variation of:
-
-- a file...
-- in an archive file-format (like .zip)...
-- stored in a file-system (like ext4 or ISO)...
-- serialized into a binary file...
-- on another file system (perhaps NTFS)...
-- residing in a partition...
-- on a block device.
-
-If we had a super-duper merkle-tree mechanism for storing files, and a
-consistent way of serializing it to a single file, we could write it directly to
-our disk block devices, back up and synchronize file systems efficiently, etc.
-
diff --git a/posts/metadata_collections.md b/posts/metadata_collections.md
deleted file mode 100644
index d7f8713..0000000
--- a/posts/metadata_collections.md
+++ /dev/null
@@ -1,209 +0,0 @@
-Title: Bibliographic Metadata Dumps
-Author: bnewbold
-Date: 2017-06-07
-Tags: tech, archive, scholar
-Status: draft
-
-# TODO:
-# - does BASE link to fulltext PDFs? is that helpful?
-# - can we actually get academia.edu and researchgate.net papers? maybe?
-
-I've recently been lucky enough to start working on a new big project at the
-[Internet Archive][]: collecting, indexing, and expanding access to research
-publications and datasets in the open world. This is perhaps *the* original
-goal of networked information technology, and thanks to a decade of hard
-work by the Open Access movement it feels like inertia
-[is building][nature-elsevier] towards this one small piece of "universal
-access to all knowledge".
-
-[Internet Archive]: https://archive.org
-[nature-elsevier]: http://www.nature.com/news/scientists-in-germany-peru-and-taiwan-to-lose-access-to-elsevier-journals-1.21223
-
-This is a snapshot-in-time look at "what's already out there" regarding indexes
-of scholarly papers and books (aka, "things that get cited"). There are a ton
-of resources out there, and many of them are just re-aggregating or building on
-top of each other.
-
-Here's a table of index-only resources for papers. These are databases or
-corpuses of metadata that might include links/URLs to full text, but don't seem
-to host fulltext copies themselves:
-
-| What | Record Count (millions) | Notes |
-|------|-------------------------|-------|
-| Total digital English language papers | 114 | estimated[0], 2014 |
-| Total open access | 27 | estimated[0], 2014. Meaning "available somewhere"? MS Academic had 35 million. |
-| Number of DOIs | 143 | Global; includes non-journals. |
-| CrossRef DOIs | 88 | Primary registrar for journals/papers in the western world |
-| BASE Search | 109 | Data from OAI-PMH |
-| Google Scholar | 100 | "records", not URLs |
-| Web of Science | 90 | proprietary; 1-billion-citation graph |
-| Scopus | 55 | proprietary/Elsevier |
-| PubMed | 26 | Only half (13 million) have abstract or link to fulltext |
-| CORE | 24 | |
-| Semantic Scholar | 10 to 20 | Sometimes mirror fulltext? |
-| OpenCitations | 5 | Paper entries; Spring 2017 |
-| dblp | 3.7 | computer science bibliography; Spring 2017 |
-
-A big open question to me is how many pre-digital scholarly articles there are
-which have not been digitized or assigned DOI numbers. Eg, how good is JSTOR
-coverage? I'm unsure how to even compute this number.
-
-And here are full-text collections of papers (which also include metadata):
-
-| What | Fulltext Count (millions) | Notes |
-|------|---------------------------|-------|
-| Sci-Hub/scimag | 62 | one-file-per-DOI, 2017 |
-| CiteSeerX | 6 | (2010; presumably many more now?). Crawled from the web |
-| CORE | 4 | Extracted fulltext, not PDF? Complete "gold" OA? |
-| PubMed Central | 4 | Open Access. 2017 |
-| OSF Preprints (COS) | 2 | 2017 |
-| Internet Archive | 1.5 | "Clean" mirrored items in Journal collections; we probably have far more |
-| arxiv.org | 1.2 | physics+math. articles, not files, 2017 |
-| JSTOR Total | 10 | mostly locked down. includes books, grey lit |
-| JSTOR Early Articles | 0.5 | open access subset |
-| biorxiv.org | 0.01 | 2017 |
-
-Numbers aside, here are the useful resources to build on top of:
-
-**CrossRef** is the primary **DOI** registrar in the western (English-speaking)
-world. They are a non-profit, one of only a dozen or so DOI registrars; almost
-all scholarly publishers go through them. They provide some basic metadata
-(title, authors, publication), and have excellent data access: bulk datasets, a
-query API, and a streaming update API. This is a good, authoritative foundation
-for building indexes. China, Korea, and Japan have their own DOI registries,
-and published datasets end up in DataCite instead of CrossRef. Other holes in
-DOI coverage are "grey literature" (unpublished or informally published
-documents, like government reports or technical memos), documents pre-2000 with
-absentee publishers, and books (only a small fraction of books/chapters have
-DOIs).
-
-Publishers and repositories seem to be pretty good about providing **OAI-PMH**
-API access to their metadata and records (and sometimes fulltext). Directories
-make it possible to look up thousands of API endpoints. **BASE** seems to be
-the best aggregation of all this metadata, and some projects build on top of
-BASE (eg, oaDOI). **CORE** finds all of its fulltext this way. It's not
-clear if BASE is a good place to pull bulk metadata from; they seem to re-index
-from scratch occasionally. **oaDOI** and **dissem.in** are services that
-provide an API and search interface over metadata and point to Open Access
-copies of the results.
-
-**PubMed** (index) and **PubMed Central** (fulltext) are large and well
-maintained. There are PubMed records and identifiers ("PMID") going far back in
-history, though only for medical texts (there is increasing contemporary
-coverage outside of medicine/biology, but only very recently). Annual and daily
-database dumps are available, so it is a good resource to pull from.
-
-**CiteSeerX** has been crawling the web for PDFs for a long time. Other than
-**Google Scholar** and maybe the **Internet Archive** I think they do the most
-serious paper crawling, though many folks do smaller or one-off crawls. They
-are academic/non-profit and are willing to share metadata and their collected
-papers; their systems are documented and open-source. Metadata and citations
-are extracted from PDFs themselves. They have collaborated with Microsoft
-Research and the Allen Institute; I suspect they provided most or all content
-for **Semantic Scholar** and **Microsoft Academic Knowledge** (the latter now
-defunct). NB: there are some interesting per-domain crawl statistics
-[available](http://csxcrawlweb01.ist.psu.edu//), though half-broken.
-
-It's worth noting that there is probably a lot of redundancy between
-**pre-prints** and the final published papers, even though semantically most
-people would consider them versions or editions of the same paper, not totally
-distinct works. This might inflate both the record counts and the DOI counts.
-
-A large number of other resources are not listed because they are very
-subject-specific or relatively small. They may or may not be worth pursuing,
-depending on how redundant they are with the larger resources. Eg, CogPrints
-(cognitive science, ~thousands of fulltext), MathSciNet (proprietary math
-bibliography), ERIC (educational resources and grey lit), paperity.org (similar
-to CORE), etc.
-
-*Note: We don't do a very good job promoting it, but as of June 2017 the
-Internet Archive is hiring! In particular we're looking for an all-around web
-designer and a project manager for an existing 5-person python-web-app team.
-Check out those and more on our
-[jobs page](https://archive.org/about/jobs.php)*
-
-[0]: "The Number of Scholarly Documents on the Public Web", PLoS One, 2014,
-Khabsa and Giles. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0093949
-
-- 
cgit v1.2.3