author Bryan Newbold <bnewbold@robocracy.org> 2021-11-29 14:34:02 -0800
committer Bryan Newbold <bnewbold@robocracy.org> 2021-11-29 14:34:02 -0800
commit c32154f2875a7fb9aac727013e1475cdd811e180
tree f0e061498a101fa824995fb6ec9f91e7e44257e1 /notes/bulk_edits/CHANGELOG.md
parent c5ea2dba358624f4c14da0a1a988ae14d0edfd59
move notes/bulk_edits/ to extra/bulk_edits/
Diffstat (limited to 'notes/bulk_edits/CHANGELOG.md')
-rw-r--r-- notes/bulk_edits/CHANGELOG.md 131
1 files changed, 0 insertions, 131 deletions
diff --git a/notes/bulk_edits/CHANGELOG.md b/notes/bulk_edits/CHANGELOG.md
deleted file mode 100644
index 6156721c..00000000
--- a/notes/bulk_edits/CHANGELOG.md
+++ /dev/null
@@ -1,131 +0,0 @@
-
-# Fatcat Production Import CHANGELOG
-
-This file tracks major content (metadata) imports to the Fatcat production
-database (at https://fatcat.wiki). It complements the code CHANGELOG file.
-
-In general, changes that impact more than 50k entities will get logged here;
-this file should probably get merged into the guide at some point.
-
-This file should not turn into a TODO list!
-
-
-## 2021-11
-
-Ran a series of cleanups. See background and prep notes in `notes/cleanups/`
-and specific final commands in this directory. Quick summary:
-
-- more than 9.5 million file entities had wayback URLs with truncated
- timestamps, and were fixed with the full timestamps. a small fraction
- (0.5%) remain, identified but not corrected in this first pass
-- over 140k release entities with non-lowercase DOIs were updated with
- lowercase DOIs. all DOIs in current release entities are now lowercase (at
- least, no ASCII uppercase characters found)
-- over 220k file entities with incorrect release relation, due to an
- import-time code bug, were fixed. a couple hundred questionable cases remain,
- but are all mismatched due to DOI slash/double-slash issues and will not be
- fixed in an automated way.
-- de-duplicated a few thousand file entities, on the basis of SHA-1 hash
-- updated file metadata for around 160k file entities (a couple hundred
- thousand remain with partial metadata)
-
-
-## 2021-06
-
-Created new containers via chocula pipeline. Did not update any existing
-chocula entities.
-
-Ran DOAJ import manually, yielding almost 130k new release entities.
-
-Ran dblp import manually, resulting in about 17k new release entities, as well
-as 108 new containers. Note that 146k releases were not inserted due to
-`skip-dblp-container-missing` and 203k due to `exists-fuzzy`.
-
-## 2020-12
-
-Updated ORCIDs from 2020 dump. About 2.4 million new `creator` entities.
-
-Imported DOAJ article metadata from a 2020-11 dump. Crawled and imported
-several hundred thousand file entities matched by DOAJ identifier. Updated
-journal metadata using the chocula tool (before the release ingest). Filtered out
-fuzzy-matching papers before importing.
-
-Imported dblp from a 2020 snapshot, both containers (primarily for conferences
-lacking an ISSN) and release entities (primarily conference papers). Filtered
-out fuzzy-matching papers before importing.
-
-## 2020-03
-
-Started harvesting both Arxiv and Pubmed metadata daily and importing to
-fatcat. Did backfill imports for both sources.
-
-JALC DOI registry update from 2019 dump.
-
-## 2020-01
-
-Imported around 2,500 new containers (journals, by ISSN-L) from chocula
-analysis script.
-
-Imported DOIs from Datacite (around 16 million, plus or minus a couple
-million).
-
-Imported new release entities from 2020 Pubmed/MEDLINE baseline. This import
-included only new Pubmed works cataloged in 2019 (up until December or so).
-Only a few hundred thousand new release entities.
-
-Daily "ingest" (crawling) pipeline running.
-
-## 2019-12
-
-Started continuous harvesting of Datacite DOI metadata; first date harvested was
-`2019-12-13`. No importer running yet.
-
-Imported about 3.3m new ORCID identifiers from 2019 bulk dump (after converting
-from XML to JSON): <https://archive.org/details/orcid-dump-2019>
-
-Inserted about 154k new arxiv release entities. Still no automatic daily
-harvesting.
-
-"Save Paper Now" importer running. This bot only *submits* editgroups for
-review; it doesn't auto-accept them.
-
-## 2019-11
-
-Daily ingest of fulltext for OA releases now enabled. New file entities created
-and merged automatically.
-
-## 2019-10
-
-Inserted 1.45m new release entities from Crossref which had been missed during
-a previous gap in continuous metadata harvesting.
-
-Updated 304,308 file entities to remove broken
-"https://web.archive.org/web/None/*" URLs.
-
-## 2019-09
-
-Created and updated metadata for tens of thousands of containers, using
-"chocula" pipeline.
-
-## 2019-08
-
-Merged/fixed roughly 100 container entities with invalid ISSN-L numbers (e.g.,
-invalid ISSN checksum).
-
-## 2019-04
-
-Imported files (matched to releases by DOI) from Semantic Scholar
-(`DIRECT-OA-CRAWL-2019` crawl).
-
-Imported files (matched to releases by DOI) from pre-1923/pre-1909 items uploaded
-by a user to archive.org.
-
-Imported files (matched to releases by DOI) from CORE.ac.uk
-(`DIRECT-OA-CRAWL-2019` crawl).
-
-Imported files (matched to releases by DOI) from the public web (including many
-repositories) from the `UNPAYWALL` 2018 crawl.
-
-## 2019-02
-
-Bootstrapped!
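
Several of the cleanups recorded in the deleted changelog (lowercasing DOIs, removing `https://web.archive.org/web/None/*` URLs, and catching ISSNs with a bad checksum) come down to small per-record checks. The following is a minimal sketch of what such checks might look like. It is not the code Fatcat actually ran: the function names and the constant are hypothetical, chosen only for illustration, and the ISSN check is the standard modulo-11 check-digit rule.

```python
import string

# Prefix of the broken wayback URLs mentioned in the 2019-10 entry.
# The constant name is hypothetical.
WAYBACK_BROKEN_PREFIX = "https://web.archive.org/web/None/"


def lowercase_doi(doi: str) -> str:
    """Lowercase ASCII A-Z only, leaving every other character untouched."""
    return "".join(c.lower() if c in string.ascii_uppercase else c for c in doi)


def drop_broken_wayback_urls(urls: list[str]) -> list[str]:
    """Return the URLs that do not start with the broken 'None' timestamp prefix."""
    return [u for u in urls if not u.startswith(WAYBACK_BROKEN_PREFIX)]


def is_valid_issn(issn: str) -> bool:
    """Check an ISSN (with or without hyphen) against its modulo-11 check digit."""
    compact = issn.replace("-", "").upper()
    if len(compact) != 8:
        return False
    if not all(c in string.digits for c in compact[:7]):
        return False
    # Weights 8 down to 2 are applied to the first seven digits.
    total = sum(int(d) * w for d, w in zip(compact[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return compact[7] == expected
```

For example, `is_valid_issn("0378-5955")` passes the checksum while `is_valid_issn("0378-5954")` does not. A real bulk cleanup would also need to fetch, edit, and resubmit entities through the catalog API, which this sketch leaves out.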