author Bryan Newbold <bnewbold@robocracy.org> 2021-11-29 14:34:02 -0800
committer Bryan Newbold <bnewbold@robocracy.org> 2021-11-29 14:34:02 -0800
commit c32154f2875a7fb9aac727013e1475cdd811e180
tree f0e061498a101fa824995fb6ec9f91e7e44257e1 /extra/bulk_edits/CHANGELOG.md
parent c5ea2dba358624f4c14da0a1a988ae14d0edfd59
move notes/bulk_edits/ to extra/bulk_edits/
Diffstat (limited to 'extra/bulk_edits/CHANGELOG.md')
-rw-r--r-- extra/bulk_edits/CHANGELOG.md 131
1 files changed, 131 insertions, 0 deletions
diff --git a/extra/bulk_edits/CHANGELOG.md b/extra/bulk_edits/CHANGELOG.md
new file mode 100644
index 00000000..6156721c
--- /dev/null
+++ b/extra/bulk_edits/CHANGELOG.md
@@ -0,0 +1,131 @@
+
+# Fatcat Production Import CHANGELOG
+
+This file tracks major content (metadata) imports to the Fatcat production
+database (at https://fatcat.wiki). It complements the code CHANGELOG file.
+
+In general, changes that impact more than 50k entities will get logged here;
+this file should probably get merged into the guide at some point.
+
+This file should not turn into a TODO list!
+
+
+## 2021-11
+
+Ran a series of cleanups. See background and prep notes in `notes/cleanups/`
+and specific final commands in this directory. Quick summary:
+
+- more than 9.5 million file entities had truncated timestamps in wayback URLs,
+ and were fixed with the full timestamps. there is still a small fraction
+ (0.5%) which was identified but not corrected in this first pass
+- over 140k release entities with non-lowercase DOIs were updated to use the
+ lowercase DOI. all DOIs in current release entities are now lowercase (at
+ least, no ASCII uppercase characters were found)
+- over 220k file entities with an incorrect release relation, due to an
+ import-time code bug, were fixed. a couple hundred questionable cases remain,
+ but are all mismatched due to DOI slash/double-slash issues and will not be
+ fixed in an automated way.
+- de-duplicated a few thousand file entities, on the basis of SHA-1 hash
+- updated file metadata for around 160k file entities (a couple hundred
+ thousand remain with partial metadata)
+
+
+## 2021-06
+
+Created new containers via chocula pipeline. Did not update any existing
+chocula entities.
+
+Ran DOAJ import manually, yielding almost 130k new release entities.
+
+Ran dblp import manually, resulting in about 17k new release entities, as well
+as 108 new containers. Note that 146k releases were not inserted due to
+`skip-dblp-container-missing` and 203k due to `exists-fuzzy`.
+
+## 2020-12
+
+Updated ORCIDs from 2020 dump. About 2.4 million new `creator` entities.
+
+Imported DOAJ article metadata from a 2020-11 dump. Crawled and imported
+several hundred thousand file entities matched by DOAJ identifier. Updated
+journal metadata using chocula tool (before the release ingest). Filtered out
+fuzzy-matching papers before importing.
+
+Imported dblp from a 2020 snapshot, both containers (primarily for conferences
+lacking an ISSN) and release entities (primarily conference papers). Filtered
+out fuzzy-matching papers before importing.
+
+## 2020-03
+
+Started harvesting both Arxiv and Pubmed metadata daily and importing to
+fatcat. Did backfill imports for both sources.
+
+JALC DOI registry update from 2019 dump.
+
+## 2020-01
+
+Imported around 2,500 new containers (journals, by ISSN-L) from chocula
+analysis script.
+
+Imported DOIs from Datacite (around 16 million, plus or minus a couple
+million).
+
+Imported new release entities from 2020 Pubmed/MEDLINE baseline. This import
+included only new Pubmed works cataloged in 2019 (up until December or so).
+Only a few hundred thousand new release entities.
+
+Daily "ingest" (crawling) pipeline running.
+
+## 2019-12
+
+Started continuous harvesting of Datacite DOI metadata; first date harvested was
+`2019-12-13`. No importer running yet.
+
+Imported about 3.3m new ORCID identifiers from 2019 bulk dump (after converting
+from XML to JSON): <https://archive.org/details/orcid-dump-2019>
+
+Inserted about 154k new arxiv release entities. Still no automatic daily
+harvesting.
+
+"Save Paper Now" importer running. This bot only *submits* editgroups for
+review; it doesn't auto-accept them.
+
+## 2019-11
+
+Daily ingest of fulltext for OA releases now enabled. New file entities created
+and merged automatically.
+
+## 2019-10
+
+Inserted 1.45m new release entities from Crossref which had been missed during
+a previous gap in continuous metadata harvesting.
+
+Updated 304,308 file entities to remove broken
+"https://web.archive.org/web/None/*" URLs.
+
+## 2019-09
+
+Created and updated metadata for tens of thousands of containers, using
+"chocula" pipeline.
+
+## 2019-08
+
+Merged/fixed roughly 100 container entities with invalid ISSN-L numbers (eg,
+invalid ISSN checksum).
+
+## 2019-04
+
+Imported files (matched to releases by DOI) from Semantic Scholar
+(`DIRECT-OA-CRAWL-2019` crawl).
+
+Imported files (matched to releases by DOI) from pre-1923/pre-1909 items uploaded
+by a user to archive.org.
+
+Imported files (matched to releases by DOI) from CORE.ac.uk
+(`DIRECT-OA-CRAWL-2019` crawl).
+
+Imported files (matched to releases by DOI) from the public web (including many
+repositories) from the `UNPAYWALL` 2018 crawl.
+
+## 2019-02
+
+Bootstrapped!
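
The 2019-08 entry refers to container entities whose ISSN-L had an invalid ISSN checksum. As an illustrative sketch only (this is not the code used for that cleanup, and the function names `issn_check_digit` and `is_valid_issn` are hypothetical), the standard ISSN check digit can be verified as follows: the first seven digits are weighted 8 down to 2, summed, and the check digit is chosen so the total is divisible by 11, with `X` standing for 10.

```python
import re

# Assumed input shape for this sketch: "NNNN-NNNC", where C is a digit or "X".
ISSN_PATTERN = re.compile(r"^\d{4}-\d{3}[\dX]$")


def issn_check_digit(first_seven: str) -> str:
    """Compute the ISSN check character for the first seven digits."""
    # Weights run 8, 7, 6, 5, 4, 3, 2 across the seven digits.
    total = sum(int(digit) * weight
                for digit, weight in zip(first_seven, range(8, 1, -1)))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)


def is_valid_issn(issn: str) -> bool:
    """Return True if the string is well-formed and its check digit matches."""
    issn = issn.strip().upper()
    if not ISSN_PATTERN.match(issn):
        return False
    digits = issn.replace("-", "")
    return issn_check_digit(digits[:7]) == digits[7]
```

A check like this only confirms that an identifier is internally consistent; it does not confirm that the ISSN is actually registered or that it is the linking ISSN (ISSN-L) for a given journal.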