author Bryan Newbold <bnewbold@robocracy.org> 2022-03-22 13:18:21 -0700
committer Bryan Newbold <bnewbold@robocracy.org> 2022-03-22 13:18:21 -0700
commit 51d754c3fd0cbabb2e81195e2cf70384ed36dad8
tree 5df20724fd87e4c8f21d7a5aed5f67a7caddab89 /extra
parent 91e4cedb00d1d2a5003f331880290a6e600ee6b5
document recent bulk metadata edits/imports
Diffstat (limited to 'extra')
-rw-r--r-- extra/bulk_edits/2022-03-08_chocula.md | 31
-rw-r--r-- extra/bulk_edits/2022-03-08_doaj.md | 23
-rw-r--r-- extra/bulk_edits/CHANGELOG.md | 8
3 files changed, 62 insertions, 0 deletions
diff --git a/extra/bulk_edits/2022-03-08_chocula.md b/extra/bulk_edits/2022-03-08_chocula.md
new file mode 100644
index 00000000..1877a236
--- /dev/null
+++ b/extra/bulk_edits/2022-03-08_chocula.md
@@ -0,0 +1,31 @@
+
+Periodic import of chocula metadata updates.
+
+## Prod Import
+
+    date
+    # Wed Mar 9 02:13:55 UTC 2022
+
+    git log -n1
+    # commit 72e3825893ae614fcd6c6ae8a513745bfefe36b2
+
+    export FATCAT_AUTH_WORKER_JOURNAL_METADATA=[...]
+    head -n100 /srv/fatcat/datasets/chocula_fatcat_export.2022-03-08.json | ./fatcat_import.py chocula --do-updates -
+    # Counter({'total': 100, 'exists': 85, 'exists-skip-update': 85, 'update': 14, 'insert': 1, 'skip': 0})
+
+Some of these are just "as of" date updates on DOAJ metadata, but most are
+"good". Lots of KBART holding dates incremented by a year (to include 2022).
+
+    time cat /srv/fatcat/datasets/chocula_fatcat_export.2022-03-08.json | ./fatcat_import.py chocula --do-updates -
+
+
+    Counter({'total': 184950, 'exists': 151925, 'exists-skip-update': 151655, 'update': 29953, 'insert': 3072
+    , 'exists-by-issnl': 270, 'skip': 0})
+
+    real 11m7.011s
+    user 4m48.705s
+    sys 0m16.761s
+
+Great!
+
+Now update stats, following `extra/container_count_update/README.md`.
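The counts reported for the full chocula run can be sanity-checked with a few lines of Python. This is only a sketch: the numbers are copied from the `Counter` output above, and the way the keys relate to each other (four top-level outcomes that partition `total`, and two sub-counts under `exists`) is an assumption inferred from the figures, not something the importer documents here. The helper names are hypothetical.

```python
from collections import Counter

# Final counts copied from the full chocula import run in the notes above.
counts = Counter({
    "total": 184950,
    "exists": 151925,
    "exists-skip-update": 151655,
    "update": 29953,
    "insert": 3072,
    "exists-by-issnl": 270,
    "skip": 0,
})


def top_level_sum(c):
    # Assumption: these four outcomes partition 'total'.
    return c["exists"] + c["update"] + c["insert"] + c["skip"]


def exists_breakdown_sum(c):
    # Assumption: these two keys are sub-counts of 'exists'.
    return c["exists-skip-update"] + c["exists-by-issnl"]


print("top-level outcomes add up to total:", top_level_sum(counts) == counts["total"])
print("exists sub-counts add up to exists:", exists_breakdown_sum(counts) == counts["exists"])
```

A check like this is cheap to run before moving on to the stats update, since a mismatch would suggest records that were dropped without being counted.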
diff --git a/extra/bulk_edits/2022-03-08_doaj.md b/extra/bulk_edits/2022-03-08_doaj.md
new file mode 100644
index 00000000..fc6438d5
--- /dev/null
+++ b/extra/bulk_edits/2022-03-08_doaj.md
@@ -0,0 +1,23 @@
+
+Simple periodic update of DOAJ article-level metadata.
+
+    cat doaj_article_data_*/article_batch*.json | jq .[] -c | pv -l | gzip > doaj_article_data_2021-05-25_all.json.gz
+    => 6.1M 0:18:45 [5.42k/s]
+    => 7.26M 0:30:45 [3.94k/s]
+
+    export FATCAT_AUTH_WORKER_DOAJ=...
+    cat /srv/fatcat/tasks/doaj_article_data_2022-03-07_sample_10k.json | ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
+    # Counter({'total': 10000, 'exists': 8827, 'exists-fuzzy': 944, 'insert': 219, 'skip': 8, 'skip-title': 8, 'skip-doaj-id-mismatch': 2, 'update': 0})
+
+    zcat /srv/fatcat/tasks/doaj_article_data_2022-03-07_all.json.gz | shuf | pv -l | parallel -j12 --round-robin --pipe ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
+
+The above used too much CPU and caused a brief outage. CPU use was very high
+for just the python import processes, for whatever reason. Turned parallelism
+down from `-j12` to `-j6`, dropped the `shuf` step, and tried again:
+
+    zcat /srv/fatcat/tasks/doaj_article_data_2022-03-07_all.json.gz | pv -l | parallel -j6 --round-robin --pipe ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
+    # multiple counts of:
+    # Counter({'total': 1196313, 'exists': 1055412, 'exists-fuzzy': 111490, 'insert': 27835, 'skip': 1280, 'skip-title': 1280, 'skip-doaj-id-mismatch': 296, 'update': 0})
+    # estimated only 167,010 new entities
+
+Then did a follow-up sandcrawler ingest, see notes in that repository.
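The "167,010 new entities" estimate in the DOAJ notes appears to be the per-worker `insert` count multiplied by the number of `parallel` workers. The sketch below reproduces that arithmetic; it assumes each of the six workers reported roughly the same counts (the notes only say "multiple counts of"), and that `skip` and `skip-doaj-id-mismatch` are counted separately, which is inferred from the totals rather than stated. Variable and function names are hypothetical.

```python
from collections import Counter

WORKERS = 6  # from `parallel -j6` in the second run

# One representative per-worker result, copied from the notes above.
per_worker = Counter({
    "total": 1196313,
    "exists": 1055412,
    "exists-fuzzy": 111490,
    "insert": 27835,
    "skip": 1280,
    "skip-title": 1280,
    "skip-doaj-id-mismatch": 296,
    "update": 0,
})


def outcome_sum(c):
    # Assumption: these outcomes partition 'total'; 'skip-title' is treated
    # as a sub-count of 'skip', so it is not added again.
    return (c["exists"] + c["exists-fuzzy"] + c["insert"]
            + c["skip"] + c["skip-doaj-id-mismatch"] + c["update"])


# Assumption: every worker inserted about as many entities as this one.
estimated_inserts = WORKERS * per_worker["insert"]

print("outcomes add up to total:", outcome_sum(per_worker) == per_worker["total"])
print("estimated new entities:", estimated_inserts)
```

Because `--round-robin` spreads records evenly across workers, scaling one worker's count is a reasonable approximation, but the exact figure would come from summing every worker's output.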
diff --git a/extra/bulk_edits/CHANGELOG.md b/extra/bulk_edits/CHANGELOG.md
index 278dc1d8..b6bfcb96 100644
--- a/extra/bulk_edits/CHANGELOG.md
+++ b/extra/bulk_edits/CHANGELOG.md
@@ -9,6 +9,14 @@ this file should probably get merged into the guide at some point.
This file should not turn in to a TODO list!
+## 2022-03
+
+Ran a journal-level metadata update, using chocula.
+
+Ran a DOAJ article-level metadata import, yielding a couple hundred thousand
+new release entities. Crawling and bulk ingest of HTML and PDF fulltext for
+these articles also started.
+
## 2022-02
- removed `container_id` linkage for some Datacite DOI releases which are