author Bryan Newbold <bnewbold@robocracy.org> 2020-07-01 16:34:37 -0700
committer Bryan Newbold <bnewbold@robocracy.org> 2020-07-01 16:34:37 -0700
commit f53ada2addef33a0096af079281ad81143339136
tree fab8feb8da23d4685c3496fb62635e9b61888871 /notes
parent 274da5d3994e9f1a4ddabf2d3ddba06c5db1aa73
JALC bulk edit notes from 2020-03-23
Diffstat (limited to 'notes')
-rw-r--r-- notes/bulk_edits/2020-03-23_jalc.md | 23
1 files changed, 23 insertions, 0 deletions
diff --git a/notes/bulk_edits/2020-03-23_jalc.md b/notes/bulk_edits/2020-03-23_jalc.md
new file mode 100644
index 00000000..d63c3759
--- /dev/null
+++ b/notes/bulk_edits/2020-03-23_jalc.md
@@ -0,0 +1,23 @@
+
+2019-10-01 JaLC metadata snapshot: <https://archive.org/download/jalc-bulk-metadata-2019>
+
+Extracted .rdf file instead of piping it through zcat.
+
+Use correct bot:
+
+ export FATCAT_AUTH_WORKER_JALC=blah
+
+Start small (first 100 lines, then 100 random lines), then do a random batch (10k), all single-threaded, to pre-create containers:
+
+ head -n100 /srv/fatcat/datasets/JALC-LOD-20191001.rdf | ./fatcat_import.py --batch-size 100 jalc - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+ shuf -n100 /srv/fatcat/datasets/JALC-LOD-20191001.rdf | ./fatcat_import.py --batch-size 100 jalc - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+ shuf -n10000 /srv/fatcat/datasets/JALC-LOD-20191001.rdf | ./fatcat_import.py --batch-size 100 jalc - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+
+Repeating the sample runs still seemed to add lots of individual containers, so
+just going to run the full import single-threaded to avoid duplicate container creation:
+
+ cat /srv/fatcat/datasets/JALC-LOD-20191001.rdf | ./fatcat_import.py --batch-size 100 jalc - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+ => Counter({'total': 8419745, 'exists': 6480683, 'insert': 1934082, 'skip': 4980, 'inserted.container': 134, 'update': 0})
+
+Had a bit fewer than 4,568,120 `doi_registrar:jalc` releases before this
+import, and 6,502,202 after (both counts from a `doi_registrar:jalc` query).
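
The figures in these notes can be cross-checked with a little arithmetic. The sketch below is not part of the commit. It copies the counter values and release counts from the notes and assumes that every input record lands in exactly one of the `exists`, `insert`, `skip`, or `update` buckets, which the importer output suggests but the notes do not state. The variable names are illustrative only.

```python
from collections import Counter

# Counts copied from the importer output recorded in the notes.
counts = Counter({
    "total": 8419745,
    "exists": 6480683,
    "insert": 1934082,
    "skip": 4980,
    "inserted.container": 134,
    "update": 0,
})

# Assumption: each input record is counted in exactly one of these buckets.
outcomes = ("exists", "insert", "skip", "update")
accounted = sum(counts[key] for key in outcomes)
unaccounted = counts["total"] - accounted

# Release counts reported for the `doi_registrar:jalc` query.
releases_after = 6502202
implied_before = releases_after - counts["insert"]

print(f"unaccounted records: {unaccounted}")    # 0
print(f"implied count before: {implied_before}")  # 4568120
```

Under that assumption the buckets account for every record, and the "after" count minus the inserts gives 4,568,120, the same figure the notes quote as the upper bound for the "before" count.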