diff options
author | Bryan Newbold <bnewbold@robocracy.org> | 2021-06-02 23:00:55 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@robocracy.org> | 2021-06-02 23:00:55 -0700 |
commit | cfe08553682bbd790c6b13e5c4cdcf55c3b6a994 (patch) | |
tree | a4157c66176f56daa02cf38436981deb5d4dbdf9 /notes | |
parent | 8522cfe919402f74cb73ac8c1b8181fb8177adcd (diff) | |
download | fatcat-cfe08553682bbd790c6b13e5c4cdcf55c3b6a994.tar.gz fatcat-cfe08553682bbd790c6b13e5c4cdcf55c3b6a994.zip |
DOAJ bulk import notes, and update bulk edit changelog
Diffstat (limited to 'notes')
-rw-r--r-- | notes/bulk_edits/2021-05-28_doaj.md | 80 | ||||
-rw-r--r-- | notes/bulk_edits/CHANGELOG.md | 9 |
2 files changed, 89 insertions, 0 deletions
diff --git a/notes/bulk_edits/2021-05-28_doaj.md b/notes/bulk_edits/2021-05-28_doaj.md new file mode 100644 index 00000000..e5925eeb --- /dev/null +++ b/notes/bulk_edits/2021-05-28_doaj.md @@ -0,0 +1,80 @@ + +Note: running 2021-05-28 pacific time, but 2021-05-29 UTC. + +First downloaded bulk metadata from doaj.org and uploaded into archive.org item +as a snapshot. + +## Journal Metadata Import + +Before doing article import, want to ensure journals all exist. + +Use chocula pipeline, and to be simple/conservative don't update any +containers, just create if they don't already exist. + +Run the usual chocula source update, copy to data dir, update sources.toml, +etc. Didn't bother with updating container counts or homepage status. Then +export container schema: + + python -m chocula export_fatcat | gzip > chocula_fatcat_export.2021-05-28.json.gz + +Upload to fatcat prod machine, unzip, and then import to prod: + + export FATCAT_AUTH_WORKER_JOURNAL_METADATA=[...] + ./fatcat_import.py chocula /srv/fatcat/datasets/chocula_fatcat_export.2021-05-28.json + => Counter({'total': 175837, 'exists': 170358, 'insert': 5479, 'exists-by-issnl': 5232, 'skip': 0, 'update': 0}) + +That is a healthy batch of new records! + +## Article Import + +Transform all the articles into a single JSON file: + + cat doaj_article_data_*/article_batch*.json | jq .[] -c | pv -l | gzip > doaj_article_data_2021-05-25_all.json.gz + => 6.1M 0:18:45 [5.42k/s] + + zcat doaj_article_data_2021-05-25_all.json.gz | shuf -n10000 > doaj_article_data_2021-05-25_sample_10k.json + +Also upload this `_all` file to archive.org item. + +Ready to import! Start with sample: + + export FATCAT_AUTH_WORKER_DOAJ=... + zcat /srv/fatcat/tasks/202105_doaj/doaj_article_data_2021-05-25_sample_10k.json.gz | ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt - + => Counter({'total': 10000, 'exists': 8743, 'exists-fuzzy': 1044, 'insert': 197, 'skip': 14, 'skip-title': 14, 'skip-doaj-id-mismatch': 2, 'update': 0}) + +Then the full import, in parallel, shuffled (because we shuffled last time): + + zcat /srv/fatcat/tasks/202105_doaj/doaj_article_data_2021-05-25_all.json.gz | shuf | pv -l | parallel -j12 --round-robin --pipe ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt - + => Counter({'total': 512351, 'exists': 449996, 'exists-fuzzy': 50858, 'insert': 10826, 'skip': 551, 'skip-title': 551, 'skip-doaj-id-mismatch': 120, 'update': 0}) + => extrapolating, about 129,912 new release entities. 2.1% insert rate + +NOTE: large number of warnings like: + + UserWarning: unexpected DOAJ ext_id match after lookup failed doaj=5d2ebb760ad24ce68ec8079bc82c8d78 ident=dvtl2xpn4nespfrj6gad6mrk44 + +## Extra Article Imports + +Manually disabled fuzzy matching with patch: + + diff --git a/python/fatcat_import.py b/python/fatcat_import.py + index 1dcfec2..cb787cb 100755 + --- a/python/fatcat_import.py + +++ b/python/fatcat_import.py + @@ -260,6 +260,7 @@ def run_doaj_article(args): + args.issn_map_file, + edit_batch_size=args.batch_size, + do_updates=args.do_updates, + + do_fuzzy_match=False, + ) + if args.kafka_mode: + KafkaJsonPusher( + +Filtered out some specific articles: + + zcat doaj_article_data_2021-05-25_all.json.gz | rg 1665-1596 | pv -l > doaj_article_data_2021-05-25_voces.json + => 154 0:02:05 [1.22 /s] + +And used this for some imports: + + cat /srv/fatcat/tasks/202105_doaj/doaj_article_data_2021-05-25_voces.json | ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt - + diff --git a/notes/bulk_edits/CHANGELOG.md b/notes/bulk_edits/CHANGELOG.md index c5f133f8..a6660dc9 100644 --- a/notes/bulk_edits/CHANGELOG.md +++ b/notes/bulk_edits/CHANGELOG.md @@ -9,6 +9,15 @@ this file should probably get merged into the guide at some point. This file should not turn in to a TODO list! +## 2021-06 + +Created new containers via chocula pipeline. Did not update any existing +chocula entities. + +Ran DOAJ import manually, yielding almost 130k new release entities. + +Running dblp import manually. + ## 2020-12 Updated ORCIDs from 2020 dump. About 2.4 million new `creator` entities. |