aboutsummaryrefslogtreecommitdiffstats
path: root/extra/bulk_edits/2022-07-12_jalc.md
diff options
context:
space:
mode:
authorBryan Newbold <bnewbold@robocracy.org>2022-07-20 18:12:56 -0700
committerBryan Newbold <bnewbold@robocracy.org>2022-07-20 18:12:56 -0700
commit35b19455b8014a513a9c31ca5795b2611bad06ce (patch)
treef1d974288b759054cda6ef39668ead543dfbe6d3 /extra/bulk_edits/2022-07-12_jalc.md
parent582495f66e5e08b6e257360097807711e53008d4 (diff)
downloadfatcat-35b19455b8014a513a9c31ca5795b2611bad06ce.tar.gz
fatcat-35b19455b8014a513a9c31ca5795b2611bad06ce.zip
bulk edit/import notes
Diffstat (limited to 'extra/bulk_edits/2022-07-12_jalc.md')
-rw-r--r--extra/bulk_edits/2022-07-12_jalc.md47
1 files changed, 47 insertions, 0 deletions
diff --git a/extra/bulk_edits/2022-07-12_jalc.md b/extra/bulk_edits/2022-07-12_jalc.md
new file mode 100644
index 00000000..d9f09fee
--- /dev/null
+++ b/extra/bulk_edits/2022-07-12_jalc.md
@@ -0,0 +1,47 @@
+
+Import of a 2022-04 JALC DOI metadata snapshot.
+
+Note that we had downloaded a prior 2021-04 snapshot, but don't seem to have
+ever imported it.
+
+## Download and Archive
+
+URL for bulk snapshot is available at the bottom of this page: <https://form.jst.go.jp/enquetes/jalcmetadatadl_1703>
+
+More info: <http://japanlinkcenter.org/top/service/service_data.html>
+
+ wget 'https://japanlinkcenter.org/lod/JALC-LOD-20220401.gz?jalcmetadatadl_1703'
+ wget 'http://japanlinkcenter.org/top/doc/JaLC_LOD_format.pdf'
+ wget 'http://japanlinkcenter.org/top/doc/JaLC_LOD_sample.pdf'
+
+ mv 'JALC-LOD-20220401.gz?jalcmetadatadl_1703' JALC-LOD-20220401.gz
+
+ ia upload jalc-bulk-metadata-2022-04 -m collection:ia_biblio_metadata jalc_logo.png JALC-LOD-20220401.gz JaLC_LOD_format.pdf JaLC_LOD_sample.pdf
+
+## Import
+
+As of 2022-07-19, 6,502,202 release hits for `doi_registrar:jalc`.
+
+Re-download the file:
+
+ cd /srv/fatcat/datasets
+ wget 'https://archive.org/download/jalc-bulk-metadata-2022-04/JALC-LOD-20220401.gz'
+ gunzip JALC-LOD-20220401.gz
+ cd /srv/fatcat/src/python
+
+ wc -l /srv/fatcat/datasets/JALC-LOD-20220401
+ 9525225
+
+Start with some samples:
+
+ export FATCAT_AUTH_WORKER_JALC=[...]
+ shuf -n100 /srv/fatcat/datasets/JALC-LOD-20220401 | ./fatcat_import.py --batch-size 100 jalc - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+ # Counter({'total': 100, 'exists': 89, 'insert': 11, 'skip': 0, 'update': 0})
+
+Full import (single threaded):
+
+ cat /srv/fatcat/datasets/JALC-LOD-20220401 | pv -l | ./fatcat_import.py --batch-size 100 jalc - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+ # 9.53M 22:26:06 [ 117 /s]
+ # Counter({'total': 9510096, 'exists': 8589731, 'insert': 915032, 'skip': 5333, 'inserted.container': 119, 'update': 0})
+
+Wow, almost a million new releases! 7,417,245 results for `doi_registrar:jalc`.