Diffstat (limited to 'notes/bulk_edits')
-rw-r--r--  notes/bulk_edits/2020-10-08_chocula.md |  44
-rw-r--r--  notes/bulk_edits/2020-12-01_orcid.md   |  55
-rw-r--r--  notes/bulk_edits/2020-12-14_doaj.md    | 105
-rw-r--r--  notes/bulk_edits/CHANGELOG.md          |   6
4 files changed, 209 insertions(+), 1 deletion(-)
diff --git a/notes/bulk_edits/2020-10-08_chocula.md b/notes/bulk_edits/2020-10-08_chocula.md
new file mode 100644
index 00000000..d60b6842
--- /dev/null
+++ b/notes/bulk_edits/2020-10-08_chocula.md
@@ -0,0 +1,44 @@

Another update of journal metadata, in this case due to expanding "Keepers"
coverage to PKP PLN, HathiTrust, Scholar's Portal, and Cariniana.

Using `journal-metadata-bot` and the `chocula.2020-10-08.json` export.

## QA Testing

    shuf -n1000 /srv/fatcat/datasets/chocula.2020-10-08.json | ./fatcat_import.py chocula --do-updates -
    Counter({'total': 1000, 'exists': 640, 'exists-skip-update': 532, 'update': 348, 'exists-not-found': 108, 'insert': 12, 'skip': 0})

Expecting roughly a 1/3 update rate. Most of these seem to be true updates
(e.g., adding `kbart` metadata). A smaller fraction just update the DOAJ
timestamp, or don't update any metadata at all.

    head -n500 /srv/fatcat/datasets/chocula.2020-10-08.json | ./fatcat_import.py chocula --do-updates -
    Counter({'total': 500, 'exists': 372, 'exists-skip-update': 328, 'update': 121, 'exists-not-found': 44, 'insert': 7, 'skip': 0})

    head -n500 /srv/fatcat/datasets/chocula.2020-10-08.json | ./fatcat_import.py chocula --do-updates -
    Counter({'total': 500, 'exists': 481, 'exists-skip-update': 430, 'exists-not-found': 44, 'update': 19, 'exists-by-issnl': 7, 'skip': 0, 'insert': 0})

Made some changes in `27fe31d5ffcac700c30b2b10d56685ef0fa4f3a8` which seem to
have removed the spurious null updates, while retaining DOAJ date-only updates.

Also, as a small nit: occasionally `kbart` metadata gets added with no year
spans. This seems to be common with Cariniana. Presumably this happens when
only volume info is available, with no year spans. Seems like a valuable thing
to include as a flag anyway.
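As a rough way to quantify that nit, a quick script could count, per keeper,
the kbart entries that have empty year spans. A minimal sketch, assuming each
export line is a JSON container record carrying fatcat-style
`extra.kbart.<keeper>.year_spans` metadata (layout assumed, not verified
against this export; script name hypothetical):

    import json
    import sys
    from collections import Counter

    # assumes chocula export rows carry fatcat-style extra.kbart.<keeper>.year_spans
    counts = Counter()
    for line in sys.stdin:
        record = json.loads(line)
        kbart = (record.get("extra") or {}).get("kbart") or {}
        for keeper, meta in kbart.items():
            if not (meta or {}).get("year_spans"):
                # kbart coverage claimed, but no year span info
                counts[keeper] += 1

    for keeper, n in counts.most_common():
        print(f"{keeper}\t{n}")

Usage would be something like
`cat /srv/fatcat/datasets/chocula.2020-10-08.json | python3 kbart_no_spans.py`.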
## Prod Import

Start small:

    head -n100 /srv/fatcat/datasets/chocula.2020-10-08.json | ./fatcat_import.py chocula --do-updates -
    => Counter({'total': 100, 'exists': 69, 'exists-skip-update': 68, 'update': 30, 'insert': 1, 'exists-by-issnl': 1, 'skip': 0})

Full batch:

    time cat /srv/fatcat/datasets/chocula.2020-10-08.json | ./fatcat_import.py chocula --do-updates -
    => Counter({'total': 167092, 'exists': 110594, 'exists-skip-update': 109852, 'update': 55274, 'insert': 1224, 'exists-by-issnl': 742, 'skip': 0})

    real 10m45.714s
    user 4m51.680s
    sys 0m12.236s

diff --git a/notes/bulk_edits/2020-12-01_orcid.md b/notes/bulk_edits/2020-12-01_orcid.md
new file mode 100644
index 00000000..b6883b17
--- /dev/null
+++ b/notes/bulk_edits/2020-12-01_orcid.md
@@ -0,0 +1,55 @@

Another annual ORCID dump, basically the same as last year (2019). Expecting
around 10 million total ORCIDs, compared to 7.3 million last year, so maybe
2.5 million new creator entities.

In particular, motivated to run this import before a potential dblp import
and/or creator creation run.

Files downloaded from:

- <https://orcid.figshare.com/articles/dataset/ORCID_Public_Data_File_2020/13066970>
- <https://archive.org/details/orcid-dump-2020>

## Prep

    wget https://github.com/ORCID/orcid-conversion-lib/raw/master/target/orcid-conversion-lib-0.0.2-full.jar

    java -jar orcid-conversion-lib-0.0.2-full.jar --tarball -i ORCID_2020_10_summaries.tar.gz -v v3_0rc1 -o ORCID_2020_10_summaries_json.tar.gz

    tar xvf ORCID_2020_10_summaries_json.tar.gz

    fd .json ORCID_2020_10_summaries/ | parallel cat {} | jq . -c | pv -l | gzip > ORCID_2020_10_summaries.json.gz

    zcat ORCID_2020_10_summaries.json.gz | shuf -n10000 | gzip > ORCID_2020_10_summaries.sample_10k.json.gz

    ia upload orcid-dump-2020 ORCID_2020_10_summaries_json.tar.gz ORCID_2020_10_summaries.sample_10k.json.gz

## Import

Fetch to prod machine:

    wget https://archive.org/download/orcid-dump-2020/ORCID_2020_10_summaries.json.gz
    wget https://archive.org/download/orcid-dump-2020/ORCID_2020_10_summaries.sample_10k.json.gz

Sample:

    export FATCAT_AUTH_WORKER_ORCID=[...]
    zcat /srv/fatcat/datasets/ORCID_2020_10_summaries.sample_10k.json.gz | ./fatcat_import.py orcid -
    => Counter({'total': 10000, 'exists': 7356, 'insert': 2465, 'skip': 179, 'update': 0})

Bulk import:

    export FATCAT_AUTH_WORKER_ORCID=[...]
    time zcat /srv/fatcat/datasets/ORCID_2020_10_summaries.json.gz | pv -l | parallel -j8 --round-robin --pipe ./fatcat_import.py orcid -
    => Counter({'total': 1208991, 'exists': 888696, 'insert': 299008, 'skip': 21287, 'update': 0})
    => (8x of the above, roughly)

    real 88m40.960s
    user 389m35.344s
    sys 23m18.396s

    Before: Size: 673.36G
    After: Size: 675.55G

diff --git a/notes/bulk_edits/2020-12-14_doaj.md b/notes/bulk_edits/2020-12-14_doaj.md
new file mode 100644
index 00000000..7e746082
--- /dev/null
+++ b/notes/bulk_edits/2020-12-14_doaj.md
@@ -0,0 +1,105 @@

## Earlier QA Testing (November 2020)

    export FATCAT_API_AUTH_TOKEN=... (FATCAT_AUTH_WORKER_DOAJ)

    # small test:
    zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13_all.json.gz | head | ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -

    # full run
    zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13_all.json.gz | pv -l | parallel -j12 --round-robin --pipe ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -

    before: 519.17G
    after: 542.08G

    5.45M 6:29:17 [ 233 /s]

    12x of:
    Counter({'total': 455504, 'insert': 394437, 'exists': 60615, 'skip': 452, 'skip-title': 452, 'update': 0})

    total: ~5,466,048
    insert: ~4,733,244
    exists: ~727,380

Initial imports (before crash) were like:

    Counter({'total': 9339, 'insert': 9330, 'skip': 9, 'skip-title': 9, 'update': 0, 'exists': 0})

Seems like there is a bug: existing releases not being found by DOI?
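One way to chase this down before re-running QA: spot-check whether DOIs from
the dump resolve through the release lookup API at all. A minimal sketch,
assuming the DOAJ dump's `bibjson.identifier` layout and the public
`/v0/release/lookup` endpoint (point it at the QA API host as appropriate;
script name hypothetical):

    import json
    import sys

    import requests

    # public lookup endpoint; swap in the QA API host when testing against QA
    LOOKUP = "https://api.fatcat.wiki/v0/release/lookup"

    found = missing = no_doi = 0
    for line in sys.stdin:
        record = json.loads(line)
        # assumes DOAJ bibjson.identifier entries like {"type": "doi", "id": "10...."}
        ids = record.get("bibjson", {}).get("identifier") or []
        dois = [i.get("id") for i in ids if (i.get("type") or "").lower() == "doi"]
        if not dois or not dois[0]:
            no_doi += 1
            continue
        resp = requests.get(LOOKUP, params={"doi": dois[0].lower()}, timeout=10)
        # 200 means an existing release; 404 means genuinely new
        if resp.status_code == 200:
            found += 1
        else:
            missing += 1

    print(f"found: {found}  missing: {missing}  no-doi: {no_doi}")

Run over a small sample, e.g.
`zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13_all.json.gz | shuf -n200 | python3 doi_lookup_spotcheck.py`.
If most sampled DOIs come back 200 but the importer still reports `'exists': 0`,
the problem is probably in the importer's DOI normalization rather than in
missing records.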
## Prod Container Metadata Update (chocula)

Generic update of container metadata using the chocula pipeline. Need to run
this before the DOAJ import to ensure we have all the containers already
updated.

Also updating the ISSN-L index at the same time. Using a 2020-11-19 metadata
snapshot, which was generated on 2020-12-07; more recent snapshots had small
upstream changes in some formats, so it wasn't trivial to run with a newer
snapshot.

    # git rev: 9f67c82ce8952bbe9a7a07b732830363c7865485

    # from laptop, then unzip on prod machine
    scp chocula_fatcat_export.2020-11-19.json.gz fatcat-prod1-vm:/srv/fatcat/datasets/

    # check ISSN-L symlink
    # ISSN-to-ISSN-L.txt -> 20201119.ISSN-to-ISSN-L.txt

    export FATCAT_AUTH_WORKER_JOURNAL_METADATA=...

    head -n200 /srv/fatcat/datasets/chocula_fatcat_export.2020-11-19.json | ./fatcat_import.py chocula -
    Counter({'total': 200, 'exists': 200, 'exists-by-issnl': 6, 'skip': 0, 'insert': 0, 'update': 0})

    head -n200 /srv/fatcat/datasets/chocula_fatcat_export.2020-11-19.json | ./fatcat_import.py chocula - --do-updates
    Counter({'total': 200, 'exists': 157, 'exists-skip-update': 151, 'update': 43, 'exists-by-issnl': 6, 'skip': 0, 'insert': 0})

Some of these are very minor updates, so going to do just creation (no
`--do-updates`) to start.

    time ./fatcat_import.py chocula /srv/fatcat/datasets/chocula_fatcat_export.2020-11-19.json
    Counter({'total': 168165, 'exists': 167497, 'exists-by-issnl': 2371, 'insert': 668, 'skip': 0, 'update': 0})

    real 5m37.081s
    user 3m1.648s
    sys 0m9.488s

TODO: tweak the chocula import script to not update on `extra.state` metadata.


## Release Metadata Bulk Import

This is the first production bulk import of DOAJ metadata!

    # git rev: 9f67c82ce8952bbe9a7a07b732830363c7865485
    # DB before: Size: 678.15G

    # ensure fatcatd is updated to have support for the DOAJ identifier

    # create new bot user
    ./target/release/fatcat-auth create-editor --admin --bot doaj-bot
    => mir5imb3v5ctxcaqnbstvmri2a

    ./target/release/fatcat-auth create-token mir5imb3v5ctxcaqnbstvmri2a
    => ...

    # download dataset
    wget https://archive.org/download/doaj_data_2020-11-13/doaj_article_data_2020-11-13.sample_10k.json.gz
    wget https://archive.org/download/doaj_data_2020-11-13/doaj_article_data_2020-11-13_all.json.gz

    export FATCAT_AUTH_WORKER_DOAJ=...

    # start small
    zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13.sample_10k.json.gz | head -n100 | ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
    => Counter({'total': 100, 'exists': 70, 'insert': 30, 'skip': 0, 'update': 0})

That is about what was expected, in terms of the fraction without a DOI.
However, 6 out of 10 (randomly checked) of the inserted releases seem to be
dupes, which feels too high. So going to pause this import until basic fuzzy
matching is ready from Martin's fuzzycat work, and will check against
elasticsearch before import. Will shuffle the entire file, import in a single
thread, and just skip importing if there is any fuzzy match (not try to
merge/update); a rough sketch of this skip-on-match filter appears at the end
of these notes. Expecting about 500k new releases after such filtering.

    # full run (TODO)
    zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13_all.json.gz | pv -l | parallel -j12 --round-robin --pipe ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -

diff --git a/notes/bulk_edits/CHANGELOG.md b/notes/bulk_edits/CHANGELOG.md
index be53d10c..bef25e84 100644
--- a/notes/bulk_edits/CHANGELOG.md
+++ b/notes/bulk_edits/CHANGELOG.md
@@ -9,12 +9,16 @@
 this file should probably get merged into the guide at some point.

 This file should not turn in to a TODO list!

+## 2020-12
+
+Updated ORCIDs from 2020 dump. About 2.4 million new `creator` entities.
+
 ## 2020-03

 Started harvesting both Arxiv and Pubmed metadata daily and importing to
 fatcat. Did backfill imports for both sources.

-JALC DOI register update from 2019 dump.
+JALC DOI registry update from 2019 dump.

 ## 2020-01
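Appendix, following up on the skip-on-fuzzy-match plan in the DOAJ release
import note above: a rough sketch of what the pre-import filter could look
like. This is not Martin's fuzzycat implementation; the elasticsearch host,
the `fatcat_release` index name, and the `title` field are all assumptions,
and it conservatively drops any record whose title gets a plausible hit rather
than attempting a merge:

    import json
    import sys

    import requests

    # index/field names are assumptions, not the fuzzycat implementation
    ES_URL = "http://localhost:9200/fatcat_release/_search"

    def has_fuzzy_match(title):
        query = {
            "size": 1,
            "query": {"match": {"title": {"query": title, "minimum_should_match": "90%"}}},
        }
        resp = requests.get(ES_URL, json=query, timeout=10)
        resp.raise_for_status()
        total = resp.json()["hits"]["total"]
        # elasticsearch 7.x wraps the count in an object; 6.x returns a bare int
        if isinstance(total, dict):
            total = total["value"]
        return total > 0

    # single-threaded filter: pass records through only if no fuzzy title match
    for line in sys.stdin:
        record = json.loads(line)
        title = (record.get("bibjson") or {}).get("title")
        if title and has_fuzzy_match(title):
            continue  # any match at all: skip, don't try to merge/update
        sys.stdout.write(line)

Usage would be along the lines of (filter script name hypothetical):

    zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13_all.json.gz | shuf | python3 doaj_fuzzy_filter.py | ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -

which matches the plan: shuffled input, a single import thread, and no merge
attempts.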