From 5d67946807fb9b6878915735b1e0e1938eb7c02a Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Tue, 7 May 2019 17:33:43 -0700
Subject: WIP metadata corpus changelog

---
 notes/CHANGELOG_corpus_prod.md | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)
 create mode 100644 notes/CHANGELOG_corpus_prod.md

(limited to 'notes')

diff --git a/notes/CHANGELOG_corpus_prod.md b/notes/CHANGELOG_corpus_prod.md
new file mode 100644
index 00000000..b4435afb
--- /dev/null
+++ b/notes/CHANGELOG_corpus_prod.md
@@ -0,0 +1,41 @@
+
+# Fatcat Production Import CHANGELOG
+
+This file tracks major content (metadata) imports to the Fatcat production
+database (at https://fatcat.wiki). It complements the code CHANGELOG file.
+
+In general, changes that impact more than 50k entities will get logged here;
+this file should probably get merged into the guide at some point.
+
+This file should not turn in to a TODO list!
+
+## 2019-04
+
+Imported files (matched to releases by DOI) from Semantic Scholar
+(`DIRECT-OA-CRAWL-2019` crawl).
+
+    Arabesque importer
+    crawl-bot
+    `s2_doi.sqlite`
+    TODO: archive.org link
+    TODO: rough count
+    TODO: date
+
+Imported files (matched to releases by DOI) from pre-1923/pre-1909 items uploaded
+by a user to archive.org.
+
+    Matched importer
+    internetarchive-bot (TODO:)
+    TODO: archive.org link
+    TODO: counts
+    TODO: date
+
+Imported files (matched to releases by DOI) from CORE.ac.uk
+(`DIRECT-OA-CRAWL-2019` crawl).
+
+Imported files (matched to releases by DOI) from the public web (including many
+repositories) from the `UNPAYWALL` 2018 crawl.
+
+## 2019-02
+
+Bootstrapped!
-- 
cgit v1.2.3