Note: running 2021-05-28 pacific time, but 2021-05-29 UTC.

First downloaded bulk metadata from doaj.org and uploaded it into an
archive.org item as a snapshot.

## Journal Metadata Import

Before doing the article import, want to ensure the journals all exist.

Use the chocula pipeline, and to be simple/conservative don't update any
existing containers, just create those that don't already exist.
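The create-only policy can be sketched roughly as follows (hypothetical helper names, not the actual chocula importer code):

```python
from collections import Counter

# Hypothetical sketch of a create-only import policy: never touch
# existing containers, only insert ones that don't resolve yet.
def import_container(record, existing_by_issnl, counts):
    issnl = record["issnl"]
    if issnl in existing_by_issnl:
        counts["exists"] += 1
        return
    existing_by_issnl[issnl] = record  # stands in for a create API call
    counts["insert"] += 1

existing = {"1234-5678": {"issnl": "1234-5678"}}
counts = Counter()
for rec in [{"issnl": "1234-5678"}, {"issnl": "2049-3630"}]:
    import_container(rec, existing, counts)
print(dict(counts))  # {'exists': 1, 'insert': 1}
```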

Run the usual chocula source update, copy to the data dir, update
sources.toml, etc. Didn't bother with updating container counts or homepage
status. Then export the container schema:

    python -m chocula export_fatcat | gzip > chocula_fatcat_export.2021-05-28.json.gz

Upload to the fatcat prod machine, unzip, and then import to prod:

    export FATCAT_AUTH_WORKER_JOURNAL_METADATA=[...]
    ./fatcat_import.py chocula /srv/fatcat/datasets/chocula_fatcat_export.2021-05-28.json
    => Counter({'total': 175837, 'exists': 170358, 'insert': 5479, 'exists-by-issnl': 5232, 'skip': 0, 'update': 0})

That is a healthy batch of new records!

## Article Import

Transform all the articles into a single JSON file:

    cat doaj_article_data_*/article_batch*.json | jq .[] -c | pv -l | gzip > doaj_article_data_2021-05-25_all.json.gz
    => 6.1M 0:18:45 [5.42k/s]

    zcat doaj_article_data_2021-05-25_all.json.gz | shuf -n10000 | gzip > doaj_article_data_2021-05-25_sample_10k.json.gz

Also upload this `_all` file to the archive.org item.
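The `jq .[] -c` step above just flattens each batch file (a JSON array of article objects) into newline-delimited compact JSON; roughly equivalent in Python:

```python
import json

# Roughly what `jq .[] -c` does in the pipeline above: each batch holds
# a JSON array of article objects; emit one compact JSON object per
# output line (newline-delimited JSON).
def flatten_batches(batch_texts):
    lines = []
    for text in batch_texts:
        for obj in json.loads(text):
            lines.append(json.dumps(obj, separators=(",", ":")))
    return "\n".join(lines)

print(flatten_batches(['[{"id": 1}, {"id": 2}]', '[{"id": 3}]']))
```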

Ready to import! Start with the sample:

    export FATCAT_AUTH_WORKER_DOAJ=...
    zcat /srv/fatcat/tasks/202105_doaj/doaj_article_data_2021-05-25_sample_10k.json.gz | ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
    => Counter({'total': 10000, 'exists': 8743, 'exists-fuzzy': 1044, 'insert': 197, 'skip': 14, 'skip-title': 14, 'skip-doaj-id-mismatch': 2, 'update': 0})

Then the full import, in parallel, shuffled (because we shuffled last time):

    zcat /srv/fatcat/tasks/202105_doaj/doaj_article_data_2021-05-25_all.json.gz | shuf | pv -l | parallel -j12 --round-robin --pipe ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
    => Counter({'total': 512351, 'exists': 449996, 'exists-fuzzy': 50858, 'insert': 10826, 'skip': 551, 'skip-title': 551, 'skip-doaj-id-mismatch': 120, 'update': 0})
    => counts above are from a single worker; extrapolating across the 12 parallel workers, about 129,912 new release entities, a 2.1% insert rate
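The extrapolation above is just the single worker's counts scaled by the 12 parallel workers:

```python
# Each of the 12 parallel workers reports its own Counter; the numbers
# above are from one worker, so scale up to estimate the whole run.
workers = 12
inserts_per_worker = 10826
total_per_worker = 512351

total_inserts = inserts_per_worker * workers
insert_rate = inserts_per_worker / total_per_worker
print(total_inserts)                 # 129912
print(round(100 * insert_rate, 1))   # 2.1
```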

NOTE: large number of warnings like:

    UserWarning: unexpected DOAJ ext_id match after lookup failed doaj=5d2ebb760ad24ce68ec8079bc82c8d78 ident=dvtl2xpn4nespfrj6gad6mrk44
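The warning seems to fire when the direct lookup by DOAJ article id fails, but a fallback lookup (e.g. by DOI) finds a release that already carries a DOAJ ext_id. A rough sketch of that kind of consistency check (hypothetical names and data shape, not the actual importer code):

```python
import warnings

# Hypothetical sketch: direct lookup by DOAJ article id failed, but a
# fallback lookup found a release that already has a DOAJ ext_id set --
# flag the inconsistency and skip rather than risk a bad merge.
def check_ext_id(doaj_id, release):
    existing = release.get("ext_ids", {}).get("doaj")
    if existing:
        warnings.warn(
            f"unexpected DOAJ ext_id match after lookup failed "
            f"doaj={doaj_id} ident={release['ident']}"
        )
        return False
    return True
```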

## Extra Article Imports

Manually disabled fuzzy matching with a local patch:

    diff --git a/python/fatcat_import.py b/python/fatcat_import.py
    index 1dcfec2..cb787cb 100755
    --- a/python/fatcat_import.py
    +++ b/python/fatcat_import.py
    @@ -260,6 +260,7 @@ def run_doaj_article(args):
             args.issn_map_file,
             edit_batch_size=args.batch_size,
             do_updates=args.do_updates,
    +        do_fuzzy_match=False,
         )
         if args.kafka_mode:
             KafkaJsonPusher(

Filtered for some specific articles (those matching ISSN 1665-1596):

    zcat doaj_article_data_2021-05-25_all.json.gz | rg 1665-1596 | pv -l > doaj_article_data_2021-05-25_voces.json
    => 154 0:02:05 [1.22 /s]
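The `rg 1665-1596` filter is a plain substring match over the NDJSON lines; the same selection in Python:

```python
# Substring filter over NDJSON lines, like `rg 1665-1596` above: keep
# any article record whose serialized JSON mentions the target ISSN.
def filter_lines(lines, needle="1665-1596"):
    return [line for line in lines if needle in line]

lines = [
    '{"issns": ["1665-1596"], "title": "a"}',
    '{"issns": ["2049-3630"], "title": "b"}',
]
print(len(filter_lines(lines)))  # 1
```

Note this matches the ISSN anywhere in the record (title, abstract, etc.), not just the ISSN field, which is fine for a quick one-off selection.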

And used this file for some imports:

    cat /srv/fatcat/tasks/202105_doaj/doaj_article_data_2021-05-25_voces.json | ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -