diff options
author | Bryan Newbold <bnewbold@robocracy.org> | 2021-11-29 14:34:02 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@robocracy.org> | 2021-11-29 14:34:02 -0800 |
commit | c32154f2875a7fb9aac727013e1475cdd811e180 (patch) | |
tree | f0e061498a101fa824995fb6ec9f91e7e44257e1 /extra/bulk_edits/CHANGELOG.md | |
parent | c5ea2dba358624f4c14da0a1a988ae14d0edfd59 (diff) | |
download | fatcat-c32154f2875a7fb9aac727013e1475cdd811e180.tar.gz fatcat-c32154f2875a7fb9aac727013e1475cdd811e180.zip |
move notes/bulk_edits/ to extra/bulk_edits/
Diffstat (limited to 'extra/bulk_edits/CHANGELOG.md')
-rw-r--r-- | extra/bulk_edits/CHANGELOG.md | 131 |
1 files changed, 131 insertions, 0 deletions
diff --git a/extra/bulk_edits/CHANGELOG.md b/extra/bulk_edits/CHANGELOG.md new file mode 100644 index 00000000..6156721c --- /dev/null +++ b/extra/bulk_edits/CHANGELOG.md @@ -0,0 +1,131 @@ + +# Fatcat Production Import CHANGELOG + +This file tracks major content (metadata) imports to the Fatcat production +database (at https://fatcat.wiki). It complements the code CHANGELOG file. + +In general, changes that impact more than 50k entities will get logged here; +this file should probably get merged into the guide at some point. + +This file should not turn in to a TODO list! + + +## 2021-11 + +Ran a series of cleanups. See background and prep notes in `notes/cleanups/` +and specific final commands in this directory. Quick summary: + +- more than 9.5 million file entities had truncated timestamps wayback URLs, + and were fixed with the full timestamps. there are still a small fraction + (0.5%) which were identified but not corrected in this first pass +- over 140k release entities with non-lowercase DOIs were updated with + lowercase DOI. all DOIs in current release entities now lowercase (at least, + no ASCII uppercase characters found) +- over 220k file entities with incorrect release relation, due to an + import-time code bug, were fixed. a couple hundred questionable cases remain, + but are all mismatched due to DOI slash/double-slash issues and will not be + fixed in an automated way. +- de-uplicated a few thousand file entities, on the basis of SHA-1 hash +- updated file metadata for around 160k file entities (a couple hundred + thousand remain with partial metadata) + + +## 2021-06 + +Created new containers via chocula pipeline. Did not update any existing +chocula entities. + +Ran DOAJ import manually, yielding almost 130k new release entities. + +Ran dblp import manually, resulting in about 17k new release entities, as well +as 108 new containers. Note that 146k releases were not inserted due to +`skip-dblp-container-missing` and 203k due to `exists-fuzzy`. + +## 2020-12 + +Updated ORCIDs from 2020 dump. About 2.4 million new `creator` entities. + +Imported DOAJ article metadata from a 2020-11 dump. Crawled and imported +several hundred thousand file entities matched by DOAJ identifier. Updated +journal metadata using chocula took (before the release ingest). Filtered out +fuzzy-matching papers before importing. + +Imported dblp from a 2020 snapshot, both containers (primarily for conferences +lacking an ISSN) and release entities (primarily conference papers). Filtered +out fuzzy-matching papers before importing. + +## 2020-03 + +Started harvesting both Arxiv and Pubmed metadata daily and importing to +fatcat. Did backfill imports for both sources. + +JALC DOI registry update from 2019 dump. + +## 2020-01 + +Imported around 2,500 new containers (journals, by ISSN-L) from chocula +analysis script. + +Imported DOIs from Datacite (around 16 million, plus or minus a couple +million). + +Imported new release entities from 2020 Pubmed/MEDLINE baseline. This import +included only new Pubmed works cataloged in 2019 (up until December or so). +Only a few hundred thousand new release entities. + +Daily "ingest" (crawling) pipeline running. + +## 2019-12 + +Started continuous harvesting Datacite DOI metadata; first date harvested was +`2019-12-13`. No importer running yet. + +Imported about 3.3m new ORCID identifiers from 2019 bulk dump (after converting +from XML to JSON): <https://archive.org/details/orcid-dump-2019> + +Inserted about 154k new arxiv release entities. Still no automatic daily +harvesting. + +"Save Paper Now" importer running. This bot only *submits* editgroups for +review, doesn't auto-accept them. + +## 2019-11 + +Daily ingest of fulltext for OA releases now enabled. New file entities created +and merged automatically. + +## 2019-10 + +Inserted 1.45m new release entities from Crossref which had been missed during +a previous gap in continuous metadata harvesting. + +Updated 304,308 file entities to remove broken +"https://web.archive.org/web/None/*" URLs. + +## 2019-09 + +Created and updated metadata for tens of thousands of containers, using +"chocula" pipeline. + +## 2019-08 + +Merged/fixed roughly 100 container entities with invalid ISSN-L numbers (eg, +invalid ISSN checksum). + +## 2019-04 + +Imported files (matched to releases by DOI) from Semantic Scholar +(`DIRECT-OA-CRAWL-2019` crawl). + +Imported files (matched to releases by DOI) from pre-1923/pre-1909 items uploaded +by a user to archive.org. + +Imported files (matched to releases by DOI) from CORE.ac.uk +(`DIRECT-OA-CRAWL-2019` crawl). + +Imported files (matched to releases by DOI) from the public web (including many +repositories) from the `UNPAYWALL` 2018 crawl. + +## 2019-02 + +Bootstrapped! |