author Bryan Newbold <bnewbold@robocracy.org> 2021-11-29 14:34:02 -0800
committer Bryan Newbold <bnewbold@robocracy.org> 2021-11-29 14:34:02 -0800
commit c32154f2875a7fb9aac727013e1475cdd811e180
tree f0e061498a101fa824995fb6ec9f91e7e44257e1 /notes/bulk_edits/2020-03-23_jalc.md
parent c5ea2dba358624f4c14da0a1a988ae14d0edfd59
move notes/bulk_edits/ to extra/bulk_edits/
Diffstat (limited to 'notes/bulk_edits/2020-03-23_jalc.md')
-rw-r--r-- notes/bulk_edits/2020-03-23_jalc.md | 23
1 files changed, 0 insertions, 23 deletions
diff --git a/notes/bulk_edits/2020-03-23_jalc.md b/notes/bulk_edits/2020-03-23_jalc.md
deleted file mode 100644
index d63c3759..00000000
--- a/notes/bulk_edits/2020-03-23_jalc.md
+++ /dev/null
@@ -1,23 +0,0 @@
-
-2019-10-01 JaLC metadata snapshot: <https://archive.org/download/jalc-bulk-metadata-2019>
-
-Extracted .rdf file instead of piping it through zcat.
-
-Use correct bot:
-
- export FATCAT_AUTH_WORKER_JALC=blah
-
-Start small; do a random bunch (10k) single-threaded to pre-create containers:
-
- head -n100 /srv/fatcat/datasets/JALC-LOD-20191001.rdf | ./fatcat_import.py --batch-size 100 jalc - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
- shuf -n100 /srv/fatcat/datasets/JALC-LOD-20191001.rdf | ./fatcat_import.py --batch-size 100 jalc - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
- shuf -n10000 /srv/fatcat/datasets/JALC-LOD-20191001.rdf | ./fatcat_import.py --batch-size 100 jalc - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
-
-Repeated sample runs still seemed to add lots of individual containers, so
-just going to import everything single-threaded to avoid duplicate container creation:
-
- cat /srv/fatcat/datasets/JALC-LOD-20191001.rdf | ./fatcat_import.py --batch-size 100 jalc - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
- => Counter({'total': 8419745, 'exists': 6480683, 'insert': 1934082, 'skip': 4980, 'inserted.container': 134, 'update': 0})
-
-Had a bit fewer than 4,568,120 "doi_registrar:jalc" releases before this
-import, 6,502,202 after (based on `doi_registrar:jalc` query).
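
The figures in the removed note can be cross-checked with a little arithmetic. The sketch below is not part of the commit; it copies the numbers from the `Counter` line and the before/after release counts. It assumes that every input record ends in exactly one of `exists`, `insert`, `skip`, or `update`, and that `inserted.container` is a separate tally of containers rather than releases. Both assumptions are inferred from the numbers, not confirmed by the note.

```python
from collections import Counter

# Figures copied verbatim from the import notes in the diff above.
counts = Counter({
    "total": 8419745,
    "exists": 6480683,
    "insert": 1934082,
    "skip": 4980,
    "inserted.container": 134,
    "update": 0,
})

# Assumption: each input record lands in exactly one of these outcomes;
# "inserted.container" is treated as a separate tally and left out.
outcomes = ("exists", "insert", "skip", "update")
accounted = sum(counts[key] for key in outcomes)

# Release counts quoted in the notes; "before" is described there as an
# upper bound ("a bit fewer than").
releases_before = 4568120
releases_after = 6502202
growth = releases_after - releases_before

print("outcomes sum to total:", accounted == counts["total"])
print("growth equals inserts:", growth == counts["insert"])
```

Both checks hold for the quoted numbers: the four outcomes sum to `total`, and the difference between the two release counts equals the `insert` count.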