Diffstat (limited to 'notes/bulk_edits')
-rw-r--r--  notes/bulk_edits/2020-10-08_chocula.md |  44
-rw-r--r--  notes/bulk_edits/2020-12-01_orcid.md   |  55
-rw-r--r--  notes/bulk_edits/2020-12-14_doaj.md    | 105
-rw-r--r--  notes/bulk_edits/CHANGELOG.md          |   6
4 files changed, 209 insertions(+), 1 deletion(-)
diff --git a/notes/bulk_edits/2020-10-08_chocula.md b/notes/bulk_edits/2020-10-08_chocula.md
new file mode 100644
index 00000000..d60b6842
--- /dev/null
+++ b/notes/bulk_edits/2020-10-08_chocula.md
@@ -0,0 +1,44 @@

Another update of journal metadata, in this case due to expanding "Keepers"
coverage to PKP PLN, HathiTrust, Scholar's Portal, and Cariniana.

Using `journal-metadata-bot` and the `chocula.2020-10-08.json` export.

## QA Testing

    shuf -n1000 /srv/fatcat/datasets/chocula.2020-10-08.json | ./fatcat_import.py chocula --do-updates -
    Counter({'total': 1000, 'exists': 640, 'exists-skip-update': 532, 'update': 348, 'exists-not-found': 108, 'insert': 12, 'skip': 0})

Expecting roughly a 1/3 update rate. Most of these seem to be true updates
(e.g., adding `kbart` metadata). A smaller fraction just update the DOAJ
timestamp, or don't update any metadata at all.

    head -n500 /srv/fatcat/datasets/chocula.2020-10-08.json | ./fatcat_import.py chocula --do-updates -
    Counter({'total': 500, 'exists': 372, 'exists-skip-update': 328, 'update': 121, 'exists-not-found': 44, 'insert': 7, 'skip': 0})

    head -n500 /srv/fatcat/datasets/chocula.2020-10-08.json | ./fatcat_import.py chocula --do-updates -
    Counter({'total': 500, 'exists': 481, 'exists-skip-update': 430, 'exists-not-found': 44, 'update': 19, 'exists-by-issnl': 7, 'skip': 0, 'insert': 0})

Made some changes in `27fe31d5ffcac700c30b2b10d56685ef0fa4f3a8` which seem to
have removed the spurious null updates, while retaining DOAJ date-only updates.

Also, as a small nit: occasionally `kbart` metadata gets added with no year
spans. This seems to be common with Cariniana. Presumably this happens when
only volume info is available, with no year spans. Seems like a valuable thing
to include as a flag anyway.
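As a rough way to quantify that nit, a quick script could count, per keeper,
the kbart entries that have empty year spans. A minimal sketch, assuming each
export line is a JSON container record carrying fatcat-style
`extra.kbart.<keeper>.year_spans` metadata (layout assumed, not verified
against this export; script name hypothetical):

    import json
    import sys
    from collections import Counter

    # assumes chocula export rows carry fatcat-style extra.kbart.<keeper>.year_spans
    counts = Counter()
    for line in sys.stdin:
        record = json.loads(line)
        kbart = (record.get("extra") or {}).get("kbart") or {}
        for keeper, meta in kbart.items():
            if not (meta or {}).get("year_spans"):
                # kbart coverage claimed, but no year span info
                counts[keeper] += 1

    for keeper, n in counts.most_common():
        print(f"{keeper}\t{n}")

Usage would be something like
`cat /srv/fatcat/datasets/chocula.2020-10-08.json | python3 kbart_no_spans.py`.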
## Prod Import

Start small:

    head -n100 /srv/fatcat/datasets/chocula.2020-10-08.json | ./fatcat_import.py chocula --do-updates -
    => Counter({'total': 100, 'exists': 69, 'exists-skip-update': 68, 'update': 30, 'insert': 1, 'exists-by-issnl': 1, 'skip': 0})

Full batch:

    time cat /srv/fatcat/datasets/chocula.2020-10-08.json | ./fatcat_import.py chocula --do-updates -
    => Counter({'total': 167092, 'exists': 110594, 'exists-skip-update': 109852, 'update': 55274, 'insert': 1224, 'exists-by-issnl': 742, 'skip': 0})

    real 10m45.714s
    user 4m51.680s
    sys 0m12.236s

diff --git a/notes/bulk_edits/2020-12-01_orcid.md b/notes/bulk_edits/2020-12-01_orcid.md
new file mode 100644
index 00000000..b6883b17
--- /dev/null
+++ b/notes/bulk_edits/2020-12-01_orcid.md
@@ -0,0 +1,55 @@

Another annual ORCID dump, basically the same as last year (2019). Expecting
around 10 million total ORCIDs, compared to 7.3 million last year, so maybe
2.5 million new creator entities.

In particular, motivated to run this import before a potential dblp import
and/or creator creation run.

Files downloaded from:

- <https://orcid.figshare.com/articles/dataset/ORCID_Public_Data_File_2020/13066970>
- <https://archive.org/details/orcid-dump-2020>

## Prep

    wget https://github.com/ORCID/orcid-conversion-lib/raw/master/target/orcid-conversion-lib-0.0.2-full.jar

    java -jar orcid-conversion-lib-0.0.2-full.jar --tarball -i ORCID_2020_10_summaries.tar.gz -v v3_0rc1 -o ORCID_2020_10_summaries_json.tar.gz

    tar xvf ORCID_2020_10_summaries_json.tar.gz

    fd .json ORCID_2020_10_summaries/ | parallel cat {} | jq . -c | pv -l | gzip > ORCID_2020_10_summaries.json.gz

    zcat ORCID_2020_10_summaries.json.gz | shuf -n10000 | gzip > ORCID_2020_10_summaries.sample_10k.json.gz

    ia upload orcid-dump-2020 ORCID_2020_10_summaries_json.tar.gz ORCID_2020_10_summaries.sample_10k.json.gz

## Import

Fetch to prod machine:

    wget https://archive.org/download/orcid-dump-2020/ORCID_2020_10_summaries.json.gz
    wget https://archive.org/download/orcid-dump-2020/ORCID_2020_10_summaries.sample_10k.json.gz

Sample:

    export FATCAT_AUTH_WORKER_ORCID=[...]
    zcat /srv/fatcat/datasets/ORCID_2020_10_summaries.sample_10k.json.gz | ./fatcat_import.py orcid -
    => Counter({'total': 10000, 'exists': 7356, 'insert': 2465, 'skip': 179, 'update': 0})

Bulk import:

    export FATCAT_AUTH_WORKER_ORCID=[...]
    time zcat /srv/fatcat/datasets/ORCID_2020_10_summaries.json.gz | pv -l | parallel -j8 --round-robin --pipe ./fatcat_import.py orcid -
    => Counter({'total': 1208991, 'exists': 888696, 'insert': 299008, 'skip': 21287, 'update': 0})
    => (8x of the above, roughly)

    real 88m40.960s
    user 389m35.344s
    sys 23m18.396s

    Before: Size: 673.36G
    After: Size: 675.55G

diff --git a/notes/bulk_edits/2020-12-14_doaj.md b/notes/bulk_edits/2020-12-14_doaj.md
new file mode 100644
index 00000000..7e746082
--- /dev/null
+++ b/notes/bulk_edits/2020-12-14_doaj.md
@@ -0,0 +1,105 @@

## Earlier QA Testing (November 2020)

    export FATCAT_API_AUTH_TOKEN=... (FATCAT_AUTH_WORKER_DOAJ)

    # small test:
    zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13_all.json.gz | head | ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -

    # full run
    zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13_all.json.gz | pv -l | parallel -j12 --round-robin --pipe ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -

    before: 519.17G
    after: 542.08G

    5.45M 6:29:17 [ 233 /s]

    12x of:
    Counter({'total': 455504, 'insert': 394437, 'exists': 60615, 'skip': 452, 'skip-title': 452, 'update': 0})

    total: ~5,466,048
    insert: ~4,733,244
    exists: ~727,380

Initial imports (before crash) were like:

    Counter({'total': 9339, 'insert': 9330, 'skip': 9, 'skip-title': 9, 'update': 0, 'exists': 0})

Seems like there is a bug: existing releases not being found by DOI?
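One way to chase this down before re-running QA: spot-check whether DOIs from
the dump resolve through the release lookup API at all. A minimal sketch,
assuming the DOAJ dump's `bibjson.identifier` layout and the public
`/v0/release/lookup` endpoint (point it at the QA API host as appropriate;
script name hypothetical):

    import json
    import sys

    import requests

    # public lookup endpoint; swap in the QA API host when testing against QA
    LOOKUP = "https://api.fatcat.wiki/v0/release/lookup"

    found = missing = no_doi = 0
    for line in sys.stdin:
        record = json.loads(line)
        # assumes DOAJ bibjson.identifier entries like {"type": "doi", "id": "10...."}
        ids = record.get("bibjson", {}).get("identifier") or []
        dois = [i.get("id") for i in ids if (i.get("type") or "").lower() == "doi"]
        if not dois or not dois[0]:
            no_doi += 1
            continue
        resp = requests.get(LOOKUP, params={"doi": dois[0].lower()}, timeout=10)
        # 200 means an existing release; 404 means genuinely new
        if resp.status_code == 200:
            found += 1
        else:
            missing += 1

    print(f"found: {found}  missing: {missing}  no-doi: {no_doi}")

Run over a small sample, e.g.
`zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13_all.json.gz | shuf -n200 | python3 doi_lookup_spotcheck.py`.
If most sampled DOIs come back 200 but the importer still reports `'exists': 0`,
the problem is probably in the importer's DOI normalization rather than in
missing records.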
## Prod Container Metadata Update (chocula)

Generic update of container metadata using the chocula pipeline. Need to run
this before the DOAJ import to ensure we have all the containers already
updated.

Also updating the ISSN-L index at the same time. Using a 2020-11-19 metadata
snapshot, which was generated on 2020-12-07; more recent snapshots had small
upstream changes in some formats, so it wasn't trivial to run with a newer
snapshot.

    # git rev: 9f67c82ce8952bbe9a7a07b732830363c7865485

    # from laptop, then unzip on prod machine
    scp chocula_fatcat_export.2020-11-19.json.gz fatcat-prod1-vm:/srv/fatcat/datasets/

    # check ISSN-L symlink
    # ISSN-to-ISSN-L.txt -> 20201119.ISSN-to-ISSN-L.txt

    export FATCAT_AUTH_WORKER_JOURNAL_METADATA=...

    head -n200 /srv/fatcat/datasets/chocula_fatcat_export.2020-11-19.json | ./fatcat_import.py chocula -
    Counter({'total': 200, 'exists': 200, 'exists-by-issnl': 6, 'skip': 0, 'insert': 0, 'update': 0})

    head -n200 /srv/fatcat/datasets/chocula_fatcat_export.2020-11-19.json | ./fatcat_import.py chocula - --do-updates
    Counter({'total': 200, 'exists': 157, 'exists-skip-update': 151, 'update': 43, 'exists-by-issnl': 6, 'skip': 0, 'insert': 0})

Some of these are very minor updates, so going to do just creation (no
`--do-updates`) to start.

    time ./fatcat_import.py chocula /srv/fatcat/datasets/chocula_fatcat_export.2020-11-19.json
    Counter({'total': 168165, 'exists': 167497, 'exists-by-issnl': 2371, 'insert': 668, 'skip': 0, 'update': 0})

    real 5m37.081s
    user 3m1.648s
    sys 0m9.488s

TODO: tweak the chocula import script to not update on `extra.state` metadata.


## Release Metadata Bulk Import

This is the first production bulk import of DOAJ metadata!

    # git rev: 9f67c82ce8952bbe9a7a07b732830363c7865485
    # DB before: Size: 678.15G

    # ensure fatcatd is updated to have support for the DOAJ identifier

    # create new bot user
    ./target/release/fatcat-auth create-editor --admin --bot doaj-bot
    => mir5imb3v5ctxcaqnbstvmri2a

    ./target/release/fatcat-auth create-token mir5imb3v5ctxcaqnbstvmri2a
    => ...

    # download dataset
    wget https://archive.org/download/doaj_data_2020-11-13/doaj_article_data_2020-11-13.sample_10k.json.gz
    wget https://archive.org/download/doaj_data_2020-11-13/doaj_article_data_2020-11-13_all.json.gz

    export FATCAT_AUTH_WORKER_DOAJ=...

    # start small
    zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13.sample_10k.json.gz | head -n100 | ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
    => Counter({'total': 100, 'exists': 70, 'insert': 30, 'skip': 0, 'update': 0})

That is about what was expected, in terms of the fraction without a DOI.
However, 6 out of 10 (randomly checked) of the inserted releases seem to be
dupes, which feels too high. So going to pause this import until basic fuzzy
matching is ready from Martin's fuzzycat work, and will check against
elasticsearch before import. Will shuffle the entire file, import in a single
thread, and just skip importing if there is any fuzzy match (not try to
merge/update); a rough sketch of this skip-on-match filter appears at the end
of these notes. Expecting about 500k new releases after such filtering.

    # full run (TODO)
    zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13_all.json.gz | pv -l | parallel -j12 --round-robin --pipe ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -

diff --git a/notes/bulk_edits/CHANGELOG.md b/notes/bulk_edits/CHANGELOG.md
index be53d10c..bef25e84 100644
--- a/notes/bulk_edits/CHANGELOG.md
+++ b/notes/bulk_edits/CHANGELOG.md
@@ -9,12 +9,16 @@
 this file should probably get merged into the guide at some point.

 This file should not turn in to a TODO list!

+## 2020-12
+
+Updated ORCIDs from 2020 dump. About 2.4 million new `creator` entities.
+
 ## 2020-03

 Started harvesting both Arxiv and Pubmed metadata daily and importing to
 fatcat. Did backfill imports for both sources.

-JALC DOI register update from 2019 dump.
+JALC DOI registry update from 2019 dump.

 ## 2020-01
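Appendix, following up on the skip-on-fuzzy-match plan in the DOAJ release
import note above: a rough sketch of what the pre-import filter could look
like. This is not Martin's fuzzycat implementation; the elasticsearch host,
the `fatcat_release` index name, and the `title` field are all assumptions,
and it conservatively drops any record whose title gets a plausible hit rather
than attempting a merge:

    import json
    import sys

    import requests

    # index/field names are assumptions, not the fuzzycat implementation
    ES_URL = "http://localhost:9200/fatcat_release/_search"

    def has_fuzzy_match(title):
        query = {
            "size": 1,
            "query": {"match": {"title": {"query": title, "minimum_should_match": "90%"}}},
        }
        resp = requests.get(ES_URL, json=query, timeout=10)
        resp.raise_for_status()
        total = resp.json()["hits"]["total"]
        # elasticsearch 7.x wraps the count in an object; 6.x returns a bare int
        if isinstance(total, dict):
            total = total["value"]
        return total > 0

    # single-threaded filter: pass records through only if no fuzzy title match
    for line in sys.stdin:
        record = json.loads(line)
        title = (record.get("bibjson") or {}).get("title")
        if title and has_fuzzy_match(title):
            continue  # any match at all: skip, don't try to merge/update
        sys.stdout.write(line)

Usage would be along the lines of (filter script name hypothetical):

    zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13_all.json.gz | shuf | python3 doaj_fuzzy_filter.py | ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -

which matches the plan: shuffled input, a single import thread, and no merge
attempts.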