author Bryan Newbold <bnewbold@robocracy.org> 2022-07-20 18:12:56 -0700
committer Bryan Newbold <bnewbold@robocracy.org> 2022-07-20 18:12:56 -0700
commit 35b19455b8014a513a9c31ca5795b2611bad06ce
tree f1d974288b759054cda6ef39668ead543dfbe6d3
parent 582495f66e5e08b6e257360097807711e53008d4
bulk edit/import notes
-rw-r--r-- extra/bulk_edits/2022-07-12_jalc.md | 47
-rw-r--r-- extra/bulk_edits/2022-07-19_doaj.md | 78
-rw-r--r-- extra/bulk_edits/CHANGELOG.md       | 11
3 files changed, 136 insertions, 0 deletions
diff --git a/extra/bulk_edits/2022-07-12_jalc.md b/extra/bulk_edits/2022-07-12_jalc.md
new file mode 100644
index 00000000..d9f09fee
--- /dev/null
+++ b/extra/bulk_edits/2022-07-12_jalc.md
@@ -0,0 +1,47 @@
+
+Import of a 2022-04 JALC DOI metadata snapshot.
+
+Note that we had downloaded a prior 2021-04 snapshot, but don't seem to have
+ever imported it.
+
+## Download and Archive
+
+URL for bulk snapshot is available at the bottom of this page: <https://form.jst.go.jp/enquetes/jalcmetadatadl_1703>
+
+More info: <http://japanlinkcenter.org/top/service/service_data.html>
+
+    wget 'https://japanlinkcenter.org/lod/JALC-LOD-20220401.gz?jalcmetadatadl_1703'
+    wget 'http://japanlinkcenter.org/top/doc/JaLC_LOD_format.pdf'
+    wget 'http://japanlinkcenter.org/top/doc/JaLC_LOD_sample.pdf'
+
+    mv 'JALC-LOD-20220401.gz?jalcmetadatadl_1703' JALC-LOD-20220401.gz
+
+    ia upload jalc-bulk-metadata-2022-04 -m collection:ia_biblio_metadata jalc_logo.png JALC-LOD-20220401.gz JaLC_LOD_format.pdf JaLC_LOD_sample.pdf
+
+## Import
+
+As of 2022-07-19, 6,502,202 release hits for `doi_registrar:jalc`.
+
+Re-download the file:
+
+    cd /srv/fatcat/datasets
+    wget 'https://archive.org/download/jalc-bulk-metadata-2022-04/JALC-LOD-20220401.gz'
+    gunzip JALC-LOD-20220401.gz
+    cd /srv/fatcat/src/python
+
+    wc -l /srv/fatcat/datasets/JALC-LOD-20220401
+    9525225
+
+Start with some samples:
+
+    export FATCAT_AUTH_WORKER_JALC=[...]
+    shuf -n100 /srv/fatcat/datasets/JALC-LOD-20220401 | ./fatcat_import.py --batch-size 100 jalc - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+    # Counter({'total': 100, 'exists': 89, 'insert': 11, 'skip': 0, 'update': 0})
+
+Full import (single threaded):
+
+    cat /srv/fatcat/datasets/JALC-LOD-20220401 | pv -l | ./fatcat_import.py --batch-size 100 jalc - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+    # 9.53M 22:26:06 [ 117 /s]
+    # Counter({'total': 9510096, 'exists': 8589731, 'insert': 915032, 'skip': 5333, 'inserted.container': 119, 'update': 0})
+
+Wow, almost a million new releases! 7,417,245 results for `doi_registrar:jalc`, up from 6,502,202 before the import.
diff --git a/extra/bulk_edits/2022-07-19_doaj.md b/extra/bulk_edits/2022-07-19_doaj.md
new file mode 100644
index 00000000..d25f2dda
--- /dev/null
+++ b/extra/bulk_edits/2022-07-19_doaj.md
@@ -0,0 +1,78 @@
+
+Doing a batch import of DOAJ articles. Will need to do another one of these
+soon after setting up daily (OAI-PMH feed) ingest.
+
+## Prep
+
+    wget https://doaj.org/csv
+    wget https://doaj.org/public-data-dump/journal
+    wget https://doaj.org/public-data-dump/article
+
+    mv csv journalcsv__doaj_20220719_2135_utf8.csv
+    mv journal doaj_journal_data_2022-07-19.tar.gz
+    mv article doaj_article_data_2022-07-19.tar.gz
+
+    ia upload doaj_data_2022-07-19 -m collection:ia_biblio_metadata ../logo_cropped.jpg journalcsv__doaj_20220719_2135_utf8.csv doaj_journal_data_2022-07-19.tar.gz doaj_article_data_2022-07-19.tar.gz
+
+    tar xvf doaj_journal_data_2022-07-19.tar.gz
+    cat doaj_journal_data_*/journal_batch_*.json | jq -c '.[]' | pv -l | gzip > doaj_journal_data_2022-07-19_all.json.gz
+
+    tar xvf doaj_article_data_2022-07-19.tar.gz
+    cat doaj_article_data_*/article_batch*.json | jq -c '.[]' | pv -l | gzip > doaj_article_data_2022-07-19_all.json.gz
+
+    ia upload doaj_data_2022-07-19 doaj_journal_data_2022-07-19_all.json.gz doaj_article_data_2022-07-19_all.json.gz
+
+On fatcat machine:
+
+    cd /srv/fatcat/datasets
+    wget https://archive.org/download/doaj_data_2022-07-19/doaj_article_data_2022-07-19_all.json.gz
+
+## Prod Article Import
+
+    git rev: 582495f66e5e08b6e257360097807711e53008d4
+    (includes DOAJ container-id required patch)
+
+    date: Tue Jul 19 22:46:42 UTC 2022
+
+    `doaj_id:*`: 1,335,195 hits
+
+Start with sample:
+
+    zcat /srv/fatcat/datasets/doaj_article_data_2022-07-19_all.json.gz | shuf -n1000 > /srv/fatcat/datasets/doaj_article_data_2022-07-19_sample.json
+
+    export FATCAT_AUTH_WORKER_DOAJ=[...]
+    cat /srv/fatcat/datasets/doaj_article_data_2022-07-19_sample.json | pv -l | ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
+    # Counter({'total': 1000, 'exists': 895, 'exists-fuzzy': 93, 'insert': 9, 'skip': 3, 'skip-no-container': 3, 'update': 0})
+
+Pretty few inserts: only 9 of the 1000 sampled records were new.
+
+Full ingest:
+
+    export FATCAT_AUTH_WORKER_DOAJ=[...]
+    zcat /srv/fatcat/datasets/doaj_article_data_2022-07-19_all.json.gz | pv -l | parallel -j6 --round-robin --pipe ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
+    # Counter({'total': 1282908, 'exists': 1145439, 'exists-fuzzy': 117120, 'insert': 16357, 'skip': 3831, 'skip-no-container': 2641, 'skip-title': 1190, 'skip-doaj-id-mismatch': 161, 'update': 0})
+
+The counter above is per worker; times 6 workers (`-j6`), around 100k releases added.
+
+Got a bunch of warnings like:
+
+    /1/srv/fatcat/src/python/fatcat_tools/importers/doaj_article.py:233: UserWarning: unexpected DOAJ ext_id match after lookup failed doaj=fcdb7a7a9729403d8d99a21f6970dd1d ident=wesvmjwihvblzayfmrvvgr4ulm
+      warnings.warn(warn_str)
+    /1/srv/fatcat/src/python/fatcat_tools/importers/doaj_article.py:233: UserWarning: unexpected DOAJ ext_id match after lookup failed doaj=1455dfe24583480883dbbb293a4bc0c6 ident=lfw57esesjbotms3grvvods5dq
+      warnings.warn(warn_str)
+    /1/srv/fatcat/src/python/fatcat_tools/importers/doaj_article.py:233: UserWarning: unexpected DOAJ ext_id match after lookup failed doaj=88fa65a33c8e484091fc76f4cda59c25 ident=22abqt5qe5e7ngjd5fkyvzyc4q
+      warnings.warn(warn_str)
+    /1/srv/fatcat/src/python/fatcat_tools/importers/doaj_article.py:233: UserWarning: unexpected DOAJ ext_id match after lookup failed doaj=eb7b03dc3dc340cea36891a68a50cce7 ident=ljedohlfyzdkxebgpcswjtd77q
+      warnings.warn(warn_str)
+    /1/srv/fatcat/src/python/fatcat_tools/importers/doaj_article.py:233: UserWarning: unexpected DOAJ ext_id match after lookup failed doaj=519617147ce248ea88d45ab098342153 ident=a63bqkttrbhyxavfr7li2w2xf4
+
+Should investigate!
+
+Also, noticed that DOAJ importer is hitting `api.fatcat.wiki`, not the public
+API endpoint. Guessing this is via fuzzycat.
+
+1,434,266 results for `doaj_id:*`.
+
+Then did a follow-up sandcrawler ingest, see notes in that repository. Note
+that newer ingest can crawl doaj.org, bypassing the sandcrawler SQL load, but
+the direct crawling is probably still faster.
diff --git a/extra/bulk_edits/CHANGELOG.md b/extra/bulk_edits/CHANGELOG.md
index f7b9e536..732cbb2f 100644
--- a/extra/bulk_edits/CHANGELOG.md
+++ b/extra/bulk_edits/CHANGELOG.md
@@ -16,6 +16,17 @@ Ran a journal-level metadata update, using chocula.
Cleaned up just under 500 releases with missing `container_id` from an older
DOAJ article import.
+Imported roughly 100k releases from DOAJ, new since 2022-04.
+
+Imported roughly 2.7 million new ORCID `creator` entities, using the 2021 dump
+(first update since 2020 dump).
+
+Imported almost 1 million new DOI release entities from JALC, first update in
+more than a year.
+
+Imported at least 400 new dblp containers, and an unknown number of new dblp
+release entities.
+
## 2022-04