Diffstat (limited to 'extra/bulk_edits'):

 extra/bulk_edits/2022-03-08_chocula.md                           |  31 +
 extra/bulk_edits/2022-03-08_doaj.md                              |  23 +
 extra/bulk_edits/2022-04-07_initial_datasets.md                  |  22 +
 extra/bulk_edits/2022-04-20_isiarticles.md                       |  39 +
 extra/bulk_edits/2022-07-06_chocula.md                           |  25 +
 extra/bulk_edits/2022-07-12_cleanup_doaj_missing_container_id.md |  38 +
 extra/bulk_edits/2022-07-12_jalc.md                              |  47 +
 extra/bulk_edits/2022-07-12_orcid.md                             |  64 +
 extra/bulk_edits/2022-07-13_dblp.md                              | 114 +
 extra/bulk_edits/2022-07-19_doaj.md                              |  78 +
 extra/bulk_edits/2022-07-29_chocula.md                           |  47 +
 extra/bulk_edits/CHANGELOG.md                                    |  42 +
 12 files changed, 570 insertions(+), 0 deletions(-)
diff --git a/extra/bulk_edits/2022-03-08_chocula.md b/extra/bulk_edits/2022-03-08_chocula.md
new file mode 100644
index 00000000..1877a236
--- /dev/null
+++ b/extra/bulk_edits/2022-03-08_chocula.md
@@ -0,0 +1,31 @@
+
+Periodic import of chocula metadata updates.
+
+## Prod Import
+
+ date
+ # Wed Mar 9 02:13:55 UTC 2022
+
+ git log -n1
+ # commit 72e3825893ae614fcd6c6ae8a513745bfefe36b2
+
+ export FATCAT_AUTH_WORKER_JOURNAL_METADATA=[...]
+ head -n100 /srv/fatcat/datasets/chocula_fatcat_export.2022-03-08.json | ./fatcat_import.py chocula --do-updates -
+ # Counter({'total': 100, 'exists': 85, 'exists-skip-update': 85, 'update': 14, 'insert': 1, 'skip': 0})
+
+Some of these are just "as of" date updates on DOAJ metadata, but most are
+substantive. Lots of KBART holding dates incremented by a year (to include 2022).
+
+ time cat /srv/fatcat/datasets/chocula_fatcat_export.2022-03-08.json | ./fatcat_import.py chocula --do-updates -
+
+    # Counter({'total': 184950, 'exists': 151925, 'exists-skip-update': 151655, 'update': 29953, 'insert': 3072, 'exists-by-issnl': 270, 'skip': 0})
+
+ real 11m7.011s
+ user 4m48.705s
+ sys 0m16.761s
+
+Great!
+
+Now update stats, following `extra/container_count_update/README.md`.
diff --git a/extra/bulk_edits/2022-03-08_doaj.md b/extra/bulk_edits/2022-03-08_doaj.md
new file mode 100644
index 00000000..fc6438d5
--- /dev/null
+++ b/extra/bulk_edits/2022-03-08_doaj.md
@@ -0,0 +1,23 @@
+
+Simple periodic update of DOAJ article-level metadata.
+
+    cat doaj_article_data_*/article_batch*.json | jq .[] -c | pv -l | gzip > doaj_article_data_2022-03-07_all.json.gz
+    => 7.26M 0:30:45 [3.94k/s]
+
+ export FATCAT_AUTH_WORKER_DOAJ=...
+ cat /srv/fatcat/tasks/doaj_article_data_2022-03-07_sample_10k.json | ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
+ # Counter({'total': 10000, 'exists': 8827, 'exists-fuzzy': 944, 'insert': 219, 'skip': 8, 'skip-title': 8, 'skip-doaj-id-mismatch': 2, 'update': 0})
+
+ zcat /srv/fatcat/tasks/doaj_article_data_2022-03-07_all.json.gz | shuf | pv -l | parallel -j12 --round-robin --pipe ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
+
+The above seemed to use too much CPU, and caused a brief outage. CPU use was
+very high for just the python import processes, for whatever reason. Turned
+down the parallelism and tried again:
+
+ zcat /srv/fatcat/tasks/doaj_article_data_2022-03-07_all.json.gz | pv -l | parallel -j6 --round-robin --pipe ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
+ # multiple counts of:
+ # Counter({'total': 1196313, 'exists': 1055412, 'exists-fuzzy': 111490, 'insert': 27835, 'skip': 1280, 'skip-title': 1280, 'skip-doaj-id-mismatch': 296, 'update': 0})
+ # estimated only 167,010 new entities
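+
+That estimate is one worker's 'insert' count times the six parallel workers:
+
+    echo $(( 6 * 27835 ))
+    # 167010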
+
+Then did a follow-up sandcrawler ingest, see notes in that repository.
diff --git a/extra/bulk_edits/2022-04-07_initial_datasets.md b/extra/bulk_edits/2022-04-07_initial_datasets.md
new file mode 100644
index 00000000..90827a38
--- /dev/null
+++ b/extra/bulk_edits/2022-04-07_initial_datasets.md
@@ -0,0 +1,22 @@
+
+Importing fileset and file entities from initial sandcrawler ingests.
+
+Git commit: `ede98644a89afd15d903061e0998dbd08851df6d`
+
+Filesets:
+
+ export FATCAT_AUTH_SANDCRAWLER=[...]
+ cat /tmp/ingest_dataset_combined_results.2022-04-04.partial.json \
+ | ./fatcat_import.py ingest-fileset-results -
+ # editgroup_5l47i7bscvfmpf4ddytauoekea
+ # Counter({'total': 195, 'skip': 176, 'skip-hit': 160, 'insert': 19, 'skip-single-file': 14, 'skip-partial-file-info': 2, 'update': 0, 'exists': 0})
+
+ cat /srv/fatcat/datasets/ingest_dataset_combined_results.2022-04-04.partial.json \
+ | ./fatcat_import.py ingest-fileset-file-results -
+ # editgroup_i2k2ucon7nap3gui3z7amuiug4
+ # Counter({'total': 195, 'skip': 184, 'skip-hit': 160, 'skip-status': 24, 'insert': 11, 'update': 0, 'exists': 0})
+
+Tried running again, to ensure that there are no duplicate inserts; that
+worked ('exists' counts instead of 'insert').
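+
+For a rough sense of why most rows were skipped, a histogram of result
+statuses (a sketch; assumes each ingest result row has a top-level `status`
+field):
+
+    cat /srv/fatcat/datasets/ingest_dataset_combined_results.2022-04-04.partial.json \
+        | jq -r .status | sort | uniq -c | sort -nr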
+
+Finally!
diff --git a/extra/bulk_edits/2022-04-20_isiarticles.md b/extra/bulk_edits/2022-04-20_isiarticles.md
new file mode 100644
index 00000000..b0177a46
--- /dev/null
+++ b/extra/bulk_edits/2022-04-20_isiarticles.md
@@ -0,0 +1,39 @@
+
+See metadata cleanups for context. Basically a few tens of thousands of
+sample/spam articles hosted on the domain isiarticles.com.
+
+## Prod Updates
+
+Start small:
+
+ export FATCAT_API_HOST=https://api.fatcat.wiki
+ export FATCAT_AUTH_WORKER_CLEANUP=[...]
+ export FATCAT_API_AUTH_TOKEN=$FATCAT_AUTH_WORKER_CLEANUP
+
+ fatcat-cli search file domain:isiarticles.com --entity-json -n0 \
+ | rg -v '"content_scope"' \
+ | rg 'isiarticles.com/' \
+ | head -n50 \
+ | pv -l \
+ | fatcat-cli batch update file release_ids= content_scope=sample --description 'Un-link and mark isiarticles PDFs as content_scope=sample' --auto-accept
+ # editgroup_ihx75kzsebgzfisgjrv67zew5e
+
+The full batch:
+
+ fatcat-cli search file domain:isiarticles.com --entity-json -n0 \
+ | rg -v '"content_scope"' \
+ | rg 'isiarticles.com/' \
+ | pv -l \
+ | fatcat-cli batch update file release_ids= content_scope=sample --description 'Un-link and mark isiarticles PDFs as content_scope=sample' --auto-accept
+
+And some more with ':80' in the URL:
+
+ fatcat-cli search file domain:isiarticles.com '!content_scope:*' --entity-json -n0 \
+ | rg -v '"content_scope"' \
+ | rg 'isiarticles.com:80/' \
+ | pv -l \
+ | fatcat-cli batch update file release_ids= content_scope=sample --description 'Un-link and mark isiarticles PDFs as content_scope=sample' --auto-accept
+
+Verify:
+
+ fatcat-cli search file domain:isiarticles.com '!content_scope:*' --count
+    # 0
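+
+Spot-check a single updated entity via the public API (`<ident>` below is a
+placeholder for one of the edited file idents):
+
+    curl -s https://api.fatcat.wiki/v0/file/<ident> \
+        | jq '{release_ids, content_scope}'
+    # expect empty release_ids and content_scope "sample"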
diff --git a/extra/bulk_edits/2022-07-06_chocula.md b/extra/bulk_edits/2022-07-06_chocula.md
new file mode 100644
index 00000000..86bf36fb
--- /dev/null
+++ b/extra/bulk_edits/2022-07-06_chocula.md
@@ -0,0 +1,25 @@
+
+Periodic import of chocula metadata updates.
+
+## Prod Import
+
+ date
+ # Wed Jul 6 23:29:47 UTC 2022
+
+ git log -n1
+ # aff3f40a5177dd6de4eee8ea7bca78df7a595bf3
+
+ export FATCAT_AUTH_WORKER_JOURNAL_METADATA=[...]
+ head -n100 /srv/fatcat/datasets/chocula_fatcat_export.2022-07-06.json | ./fatcat_import.py chocula --do-updates -
+ # Counter({'total': 100, 'exists': 86, 'exists-skip-update': 83, 'update': 13, 'exists-by-issnl': 3, 'insert': 1, 'skip': 0})
+
+Many updates are just KBART holding dates or DOAJ as-of dates, but that is fine
+and expected.
+
+ time cat /srv/fatcat/datasets/chocula_fatcat_export.2022-07-06.json | ./fatcat_import.py chocula --do-updates -
+ # Counter({'total': 187480, 'exists': 155943, 'exists-skip-update': 151171, 'update': 30437, 'exists-by-issnl': 4772, 'insert': 1100, 'skip': 0})
+ # real 10m28.081s
+ # user 4m37.447s
+ # sys 0m16.063s
+
+Now update stats, following `extra/container_count_update/README.md`.
diff --git a/extra/bulk_edits/2022-07-12_cleanup_doaj_missing_container_id.md b/extra/bulk_edits/2022-07-12_cleanup_doaj_missing_container_id.md
new file mode 100644
index 00000000..b17e799d
--- /dev/null
+++ b/extra/bulk_edits/2022-07-12_cleanup_doaj_missing_container_id.md
@@ -0,0 +1,38 @@
+
+There is a batch of about 480 releases with DOAJ identifiers but no container
+linkage. These seem to all be from the same actual container:
+
+ fatcat-cli search releases 'doaj_id:*' '!container_id:*' --count
+ # 486
+
+    fatcat-cli search releases 'doaj_id:*' '!container_id:*' --index-json -n 0 | jq .container_name
+ # Got 486 hits in 138ms
+ # "Revista de Sistemas, Cibernética e Informática"
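+
+To confirm they really do all share a single container name (same search,
+aggregated):
+
+    fatcat-cli search releases 'doaj_id:*' '!container_id:*' --index-json -n 0 \
+        | jq -r .container_name | sort | uniq -c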
+
+Edit pipeline:
+
+ export FATCAT_AUTH_WORKER_CLEANUP=[...]
+ export FATCAT_API_AUTH_TOKEN=$FATCAT_AUTH_WORKER_CLEANUP
+
+ # start small
+ fatcat-cli search releases 'doaj_id:*' '!container_id:*' 'journal:Cibernética' --entity-json --limit 50 \
+ | jq 'select(.container_id == null)' -c \
+ | rg 'Cibernética' \
+ | fatcat-cli batch update release container_id=ubwuhr4obzgr7aadszhurhef5m --description "Add container linkage for DOAJ articles with ISSN 1690-8627"
+ # editgroup_g2zrm3wkmneoldtqfxpbkaoeh4
+
+Looks good, merged.
+
+ # full auto
+ fatcat-cli search releases 'doaj_id:*' '!container_id:*' 'journal:Cibernética' --entity-json --limit 500 \
+ | jq 'select(.container_id == null)' -c \
+ | rg 'Cibernética' \
+ | fatcat-cli batch update release container_id=ubwuhr4obzgr7aadszhurhef5m --description "Add container linkage for DOAJ articles with ISSN 1690-8627" --auto-accept
+
+Verify:
+
+ fatcat-cli search releases 'doaj_id:*' '!container_id:*' --count
+ # 0
+
+Also planning to have the DOAJ article importer 'skip' such articles in the
+future, when there is no `container_id` match.
diff --git a/extra/bulk_edits/2022-07-12_jalc.md b/extra/bulk_edits/2022-07-12_jalc.md
new file mode 100644
index 00000000..d9f09fee
--- /dev/null
+++ b/extra/bulk_edits/2022-07-12_jalc.md
@@ -0,0 +1,47 @@
+
+Import of a 2022-04 JALC DOI metadata snapshot.
+
+Note that we had downloaded a prior 2021-04 snapshot, but don't seem to have
+ever imported it.
+
+## Download and Archive
+
+The URL for the bulk snapshot is available at the bottom of this page: <https://form.jst.go.jp/enquetes/jalcmetadatadl_1703>
+
+More info: <http://japanlinkcenter.org/top/service/service_data.html>
+
+ wget 'https://japanlinkcenter.org/lod/JALC-LOD-20220401.gz?jalcmetadatadl_1703'
+ wget 'http://japanlinkcenter.org/top/doc/JaLC_LOD_format.pdf'
+ wget 'http://japanlinkcenter.org/top/doc/JaLC_LOD_sample.pdf'
+
+ mv 'JALC-LOD-20220401.gz?jalcmetadatadl_1703' JALC-LOD-20220401.gz
+
+ ia upload jalc-bulk-metadata-2022-04 -m collection:ia_biblio_metadata jalc_logo.png JALC-LOD-20220401.gz JaLC_LOD_format.pdf JaLC_LOD_sample.pdf
+
+## Import
+
+As of 2022-07-19, 6,502,202 release hits for `doi_registrar:jalc`.
+
+Re-download the file:
+
+ cd /srv/fatcat/datasets
+ wget 'https://archive.org/download/jalc-bulk-metadata-2022-04/JALC-LOD-20220401.gz'
+ gunzip JALC-LOD-20220401.gz
+ cd /srv/fatcat/src/python
+
+ wc -l /srv/fatcat/datasets/JALC-LOD-20220401
+ 9525225
+
+Start with some samples:
+
+ export FATCAT_AUTH_WORKER_JALC=[...]
+ shuf -n100 /srv/fatcat/datasets/JALC-LOD-20220401 | ./fatcat_import.py --batch-size 100 jalc - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+ # Counter({'total': 100, 'exists': 89, 'insert': 11, 'skip': 0, 'update': 0})
+
+Full import (single threaded):
+
+ cat /srv/fatcat/datasets/JALC-LOD-20220401 | pv -l | ./fatcat_import.py --batch-size 100 jalc - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
+ # 9.53M 22:26:06 [ 117 /s]
+ # Counter({'total': 9510096, 'exists': 8589731, 'insert': 915032, 'skip': 5333, 'inserted.container': 119, 'update': 0})
+
+Wow, almost a million new releases! Now 7,417,245 results for `doi_registrar:jalc`.
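+
+The search-index delta lines up with the import counter:
+
+    echo $(( 7417245 - 6502202 ))
+    # 915043, close to the 915,032 'insert' count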
diff --git a/extra/bulk_edits/2022-07-12_orcid.md b/extra/bulk_edits/2022-07-12_orcid.md
new file mode 100644
index 00000000..760a16c8
--- /dev/null
+++ b/extra/bulk_edits/2022-07-12_orcid.md
@@ -0,0 +1,64 @@
+
+Annual ORCID import, using the 2021 public data file. Didn't do this last
+year, so this is a catch-up; will need to do another update later in 2022
+(presumably in November/December).
+
+Not sure how many records this year; the orcid.org website reports over 14
+million ORCIDs as of July 2022.
+
+Files downloaded from:
+
+- <https://info.orcid.org/orcids-2021-public-data-file-is-now-available>
+- <https://orcid.figshare.com/articles/dataset/ORCID_Public_Data_File_2021/16750535>
+- <https://archive.org/details/orcid-dump-2021>
+
+## Prep
+
+ ia upload orcid-dump-2021 -m collection:ia_biblio_metadata ORCID_2021_10_* orcid-logo.png
+
+ wget https://github.com/ORCID/orcid-conversion-lib/raw/master/target/orcid-conversion-lib-3.0.7-full.jar
+
+ java -jar orcid-conversion-lib-3.0.7-full.jar --tarball -i ORCID_2021_10_summaries.tar.gz -v v3_0 -o ORCID_2021_10_summaries_json.tar.gz
+
+ tar xvf ORCID_2021_10_summaries_json.tar.gz
+
+ fd .json ORCID_2021_10_summaries/ | parallel cat {} | jq . -c | pv -l | gzip > ORCID_2021_10_summaries.json.gz
+ # 12.6M 27:59:25 [ 125 /s]
+
+ zcat ORCID_2021_10_summaries.json.gz | shuf -n10000 | gzip > ORCID_2021_10_summaries.sample_10k.json.gz
+
+ ia upload orcid-dump-2021 ORCID_2021_10_summaries.json.gz ORCID_2021_10_summaries.sample_10k.json.gz
+
+## Import
+
+Fetch to prod machine:
+
+ wget https://archive.org/download/orcid-dump-2021/ORCID_2021_10_summaries.json.gz
+ wget https://archive.org/download/orcid-dump-2021/ORCID_2021_10_summaries.sample_10k.json.gz
+
+Sample:
+
+ export FATCAT_AUTH_WORKER_ORCID=[...]
+ zcat /srv/fatcat/datasets/ORCID_2021_10_summaries.sample_10k.json.gz | ./fatcat_import.py orcid -
+ # in 2020: Counter({'total': 10000, 'exists': 7356, 'insert': 2465, 'skip': 179, 'update': 0})
+ # this time: Counter({'total': 10000, 'exists': 7577, 'insert': 2191, 'skip': 232, 'update': 0})
+
+Bulk import:
+
+ export FATCAT_AUTH_WORKER_ORCID=[...]
+ time zcat /srv/fatcat/datasets/ORCID_2021_10_summaries.json.gz | pv -l | parallel -j8 --round-robin --pipe ./fatcat_import.py orcid -
+ 12.6M 1:24:04 [2.51k/s]
+ Counter({'total': 1574111, 'exists': 1185437, 'insert': 347039, 'skip': 41635, 'update': 0})
+ Counter({'total': 1583157, 'exists': 1193341, 'insert': 348187, 'skip': 41629, 'update': 0})
+ Counter({'total': 1584441, 'exists': 1193385, 'insert': 349424, 'skip': 41632, 'update': 0})
+ Counter({'total': 1575971, 'exists': 1187270, 'insert': 347190, 'skip': 41511, 'update': 0})
+ Counter({'total': 1577323, 'exists': 1188892, 'insert': 346759, 'skip': 41672, 'update': 0})
+ Counter({'total': 1586719, 'exists': 1195610, 'insert': 349115, 'skip': 41994, 'update': 0})
+ Counter({'total': 1578484, 'exists': 1189423, 'insert': 347276, 'skip': 41785, 'update': 0})
+ Counter({'total': 1578728, 'exists': 1190316, 'insert': 346445, 'skip': 41967, 'update': 0})
+
+ real 84m5.297s
+ user 436m26.428s
+ sys 41m36.959s
+
+Roughly 2.7 million new ORCIDs, great!
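+
+Summing the per-worker 'insert' counts:
+
+    echo $(( 347039 + 348187 + 349424 + 347190 + 346759 + 349115 + 347276 + 346445 ))
+    # 2781435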
diff --git a/extra/bulk_edits/2022-07-13_dblp.md b/extra/bulk_edits/2022-07-13_dblp.md
new file mode 100644
index 00000000..25405132
--- /dev/null
+++ b/extra/bulk_edits/2022-07-13_dblp.md
@@ -0,0 +1,114 @@
+
+## Prep
+
+Partial notes: downloaded the current `dblp.xml.gz`, then ran what looks like
+a parse-only pass over the full file (everything skipped, nothing inserted)
+and a container-metadata extraction (~7.48k, matching the container import
+below):
+
+    2022-07-13 05:24:33 (177 KB/s) - ‘dblp.xml.gz’ saved [715701831/715701831]
+
+    Counter({'total': 9186263, 'skip': 9186263, 'has-doi': 4960506, 'skip-key-type': 3037457, 'skip-arxiv-corr': 439104, 'skip-title': 1, 'insert': 0, 'update': 0, 'exists': 0})
+    5.71M 3:37:38 [ 437 /s]
+
+    7.48k 0:38:18 [3.25 /s]
+
+
+## Container Import
+
+Run 2022-07-15, after a database backup/snapshot.
+
+ export FATCAT_AUTH_WORKER_DBLP=[...]
+ ./fatcat_import.py dblp-container --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt --dblp-container-map-file ../extra/dblp/existing_dblp_containers.tsv --dblp-container-map-output ../extra/dblp/all_dblp_containers.tsv ../extra/dblp/dblp_container_meta.json
+ # Got 5310 existing dblp container mappings.
+ # Counter({'total': 7471, 'exists': 7130, 'insert': 341, 'skip': 0, 'update': 0})
+
+ wc -l existing_dblp_containers.tsv all_dblp_containers.tsv dblp_container_meta.json prefix_list.txt
+ 5310 existing_dblp_containers.tsv
+ 12782 all_dblp_containers.tsv
+ 7471 dblp_container_meta.json
+ 7476 prefix_list.txt
+
+
+## Release Import
+
+ export FATCAT_AUTH_WORKER_DBLP=[...]
+ ./fatcat_import.py dblp-release --dblp-container-map-file ../extra/dblp/all_dblp_containers.tsv ../extra/dblp/dblp.xml
+ # Got 7480 dblp container mappings.
+
+ /1/srv/fatcat/src/python/fatcat_tools/importers/dblp_release.py:358: UserWarning: unexpected dblp ext_id match after lookup failed dblp=conf/gg/X90 ident=gfvkxubvsfdede7ps4af3oa34q
+ warnings.warn(warn_str)
+ /1/srv/fatcat/src/python/fatcat_tools/importers/dblp_release.py:358: UserWarning: unexpected dblp ext_id match after lookup failed dblp=conf/visalg/X88 ident=lvfyrd3lvva3hjuaaokzyoscmm
+ warnings.warn(warn_str)
+ /1/srv/fatcat/src/python/fatcat_tools/importers/dblp_release.py:358: UserWarning: unexpected dblp ext_id match after lookup failed dblp=conf/msr/PerumaANMO22 ident=2grlescl2bcpvd5yoc4npad3bm
+ warnings.warn(warn_str)
+ /1/srv/fatcat/src/python/fatcat_tools/importers/dblp_release.py:358: UserWarning: unexpected dblp ext_id match after lookup failed dblp=conf/dagstuhl/Brodlie97 ident=l6nh222fpjdzfotchu7vfjh6qu
+ warnings.warn(warn_str)
+ /1/srv/fatcat/src/python/fatcat_tools/importers/dblp_release.py:358: UserWarning: unexpected dblp ext_id match after lookup failed dblp=series/gidiss/2018 ident=x6t7ze4z55enrlq2dnac4qqbve
+
+ Counter({'total': 9186263, 'exists': 5356574, 'has-doi': 4960506, 'skip': 3633039, 'skip-key-type': 3037457, 'skip-arxiv-corr': 439104, 'exists-fuzzy': 192376, 'skip-dblp-container-missing': 156477, 'insert': 4216, 'skip-arxiv': 53, 'skip-dblp-id-mismatch': 5, 'skip-title': 1, 'update': 0})
+
+NOTE: had to re-try in the middle, so these counts are not accurate overall.
+
+Seems like a large number of `skip-dblp-container-missing`. Maybe should have
+re-generated that file differently?
+
+After this import there are 2,217,670 releases with a dblp ID, and 478,983 with
+a dblp ID and no DOI.
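+
+These counts can be reproduced via release search (a sketch, using the same
+query style as elsewhere in these notes; assumes `dblp_id` is indexed):
+
+    fatcat-cli search releases 'dblp_id:*' --count
+    fatcat-cli search releases 'dblp_id:*' '!doi:*' --count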
+
+
+## Sandcrawler Seedlist Generation
+
+Almost none of the ~487k dblp releases with no DOI have an associated file.
+This implies that no ingest has happened yet, even though the fatcat importer
+does parse and filter the "fulltext" URLs out of dblp records.
+
+ cat dblp_releases_partial.json | pipenv run ./dblp2ingestrequest.py - | pv -l | gzip > dblp_sandcrawler_ingest_requests.json.gz
+ # 631k 0:02:39 [3.96k/s]
+
+ zcat dblp_sandcrawler_ingest_requests.json.gz | jq -r .base_url | cut -f3 -d/ | sort | uniq -c | sort -nr | head -n25
+ 43851 ceur-ws.org
+ 33638 aclanthology.org
+ 32077 aisel.aisnet.org
+ 31017 ieeexplore.ieee.org
+ 26426 dl.acm.org
+ 23817 hdl.handle.net
+ 22400 www.isca-speech.org
+ 20072 tel.archives-ouvertes.fr
+ 18609 www.aaai.org
+ 18244 eprint.iacr.org
+ 15720 ethos.bl.uk
+ 14727 nbn-resolving.org
+ 14470 proceedings.mlr.press
+ 14095 dl.gi.de
+ 12159 proceedings.neurips.cc
+ 10890 knowledge.amia.org
+ 10049 www.usenix.org
+ 9675 papers.nips.cc
+ 7541 subs.emis.de
+ 7396 openaccess.thecvf.com
+ 7345 mindmodeling.org
+ 6574 ojs.aaai.org
+ 5814 www.lrec-conf.org
+ 5773 search.ndltd.org
+ 5311 ijcai.org
+
+This is the first ingest, so let's do some sampling in the 'daily' queue:
+
+ zcat dblp_sandcrawler_ingest_requests.json.gz | shuf -n100 | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc350.us.archive.org -t sandcrawler-prod.ingest-file-requests-daily -p -1
+
+Looks like we can probably get away with doing these in the daily ingest queue,
+instead of bulk? Try a larger batch:
+
+ zcat dblp_sandcrawler_ingest_requests.json.gz | shuf -n10000 | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc350.us.archive.org -t sandcrawler-prod.ingest-file-requests-daily -p -1
+
+Nope, these are going to need bulk ingest and then follow-up crawling. Will
+run a heritrix crawl along with the JALC and DOAJ stuff.
+
+ zcat dblp_sandcrawler_ingest_requests.json.gz | rg -v "\\\\" | jq . -c | pv -l | kafkacat -P -b wbgrp-svc350.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+ # 631k 0:00:11 [54.0k/s]
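+
+The `rg -v "\\\\"` filter above drops requests whose lines contain a literal
+backslash (presumably odd escaping in URLs); to count how many get excluded:
+
+    zcat dblp_sandcrawler_ingest_requests.json.gz | rg -c "\\\\"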
+
+
+TODO:
+x python or jq transform of JSON objects
+x filter out german book/library URLs
+x ensure fatcat importer will actually import dblp matches
+x test with a small batch in daily or priority queue
+- enqueue all in bulk mode, even if processed before? many were probably
+  ingested previously via MAG or OAI-PMH
diff --git a/extra/bulk_edits/2022-07-19_doaj.md b/extra/bulk_edits/2022-07-19_doaj.md
new file mode 100644
index 00000000..d25f2dda
--- /dev/null
+++ b/extra/bulk_edits/2022-07-19_doaj.md
@@ -0,0 +1,78 @@
+
+Doing a batch import of DOAJ articles. Will need to do another one of these
+soon after setting up daily (OAI-PMH feed) ingest.
+
+## Prep
+
+ wget https://doaj.org/csv
+ wget https://doaj.org/public-data-dump/journal
+ wget https://doaj.org/public-data-dump/article
+
+ mv csv journalcsv__doaj_20220719_2135_utf8.csv
+ mv journal doaj_journal_data_2022-07-19.tar.gz
+ mv article doaj_article_data_2022-07-19.tar.gz
+
+ ia upload doaj_data_2022-07-19 -m collection:ia_biblio_metadata ../logo_cropped.jpg journalcsv__doaj_20220719_2135_utf8.csv doaj_journal_data_2022-07-19.tar.gz doaj_article_data_2022-07-19.tar.gz
+
+ tar xvf doaj_journal_data_2022-07-19.tar.gz
+ cat doaj_journal_data_*/journal_batch_*.json | jq .[] -c | pv -l | gzip > doaj_journal_data_2022-07-19_all.json.gz
+
+ tar xvf doaj_article_data_2022-07-19.tar.gz
+ cat doaj_article_data_*/article_batch*.json | jq .[] -c | pv -l | gzip > doaj_article_data_2022-07-19_all.json.gz
+
+ ia upload doaj_data_2022-07-19 doaj_journal_data_2022-07-19_all.json.gz doaj_article_data_2022-07-19_all.json.gz
+
+On fatcat machine:
+
+ cd /srv/fatcat/datasets
+ wget https://archive.org/download/doaj_data_2022-07-19/doaj_article_data_2022-07-19_all.json.gz
+
+## Prod Article Import
+
+ git rev: 582495f66e5e08b6e257360097807711e53008d4
+ (includes DOAJ container-id required patch)
+
+ date: Tue Jul 19 22:46:42 UTC 2022
+
+ `doaj_id:*`: 1,335,195 hits
+
+Start with sample:
+
+ zcat /srv/fatcat/datasets/doaj_article_data_2022-07-19_all.json.gz | shuf -n1000 > /srv/fatcat/datasets/doaj_article_data_2022-07-19_sample.json
+
+ export FATCAT_AUTH_WORKER_DOAJ=[...]
+ cat /srv/fatcat/datasets/doaj_article_data_2022-07-19_sample.json | pv -l | ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
+ # Counter({'total': 1000, 'exists': 895, 'exists-fuzzy': 93, 'insert': 9, 'skip': 3, 'skip-no-container': 3, 'update': 0})
+
+Relatively few new inserts.
+
+Full ingest:
+
+ export FATCAT_AUTH_WORKER_DOAJ=[...]
+ zcat /srv/fatcat/datasets/doaj_article_data_2022-07-19_all.json.gz | pv -l | parallel -j6 --round-robin --pipe ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
+ # Counter({'total': 1282908, 'exists': 1145439, 'exists-fuzzy': 117120, 'insert': 16357, 'skip': 3831, 'skip-no-container': 2641, 'skip-title': 1190, 'skip-doaj-id-mismatch': 161, 'update': 0})
+
+Times six parallel workers, around 100k releases added (6 x 16,357 = 98,142
+inserts).
+
+Got a bunch of:
+
+ /1/srv/fatcat/src/python/fatcat_tools/importers/doaj_article.py:233: UserWarning: unexpected DOAJ ext_id match after lookup failed doaj=fcdb7a7a9729403d8d99a21f6970dd1d ident=wesvmjwihvblzayfmrvvgr4ulm
+ warnings.warn(warn_str)
+ /1/srv/fatcat/src/python/fatcat_tools/importers/doaj_article.py:233: UserWarning: unexpected DOAJ ext_id match after lookup failed doaj=1455dfe24583480883dbbb293a4bc0c6 ident=lfw57esesjbotms3grvvods5dq
+ warnings.warn(warn_str)
+ /1/srv/fatcat/src/python/fatcat_tools/importers/doaj_article.py:233: UserWarning: unexpected DOAJ ext_id match after lookup failed doaj=88fa65a33c8e484091fc76f4cda59c25 ident=22abqt5qe5e7ngjd5fkyvzyc4q
+ warnings.warn(warn_str)
+ /1/srv/fatcat/src/python/fatcat_tools/importers/doaj_article.py:233: UserWarning: unexpected DOAJ ext_id match after lookup failed doaj=eb7b03dc3dc340cea36891a68a50cce7 ident=ljedohlfyzdkxebgpcswjtd77q
+ warnings.warn(warn_str)
+ /1/srv/fatcat/src/python/fatcat_tools/importers/doaj_article.py:233: UserWarning: unexpected DOAJ ext_id match after lookup failed doaj=519617147ce248ea88d45ab098342153 ident=a63bqkttrbhyxavfr7li2w2xf4
+
+Should investigate!
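+
+A starting point for that investigation: fetch one of the flagged releases
+and compare its stored DOAJ id against the incoming record, e.g.:
+
+    curl -s https://api.fatcat.wiki/v0/release/wesvmjwihvblzayfmrvvgr4ulm \
+        | jq '{title, doaj: .ext_ids.doaj}'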
+
+Also, noticed that the DOAJ importer is hitting `api.fatcat.wiki` (the
+public API endpoint), not the internal one. Guessing this is via fuzzycat.
+
+1,434,266 results for `doaj_id:*`.
+
+Then did a follow-up sandcrawler ingest, see notes in that repository. Note
+that newer ingest can crawl doaj.org, bypassing the sandcrawler SQL load, but
+the direct crawling is probably still faster.
diff --git a/extra/bulk_edits/2022-07-29_chocula.md b/extra/bulk_edits/2022-07-29_chocula.md
new file mode 100644
index 00000000..1f6f36ca
--- /dev/null
+++ b/extra/bulk_edits/2022-07-29_chocula.md
@@ -0,0 +1,47 @@
+
+Periodic import of chocula metadata updates.
+
+In particular, expecting a bunch of `publisher_type` updates.
+
+Going to explicitly skip DOAJ-only updates this time around. That is, if a
+container would be updated anyway, new DOAJ 'extra' metadata will pass
+through, but an entity won't be updated for that reason alone. This is to
+reduce churn based only on the `as-of` key. Should probably change this
+behavior next time around.
+
+## Prod Import
+
+ date
+ # Sat Jul 30 01:18:41 UTC 2022
+
+ git log -n1
+ # 5ecf72cbb488a9a50eb869ea55b4c2bfc1440731
+
+ diff --git a/python/fatcat_tools/importers/chocula.py b/python/fatcat_tools/importers/chocula.py
+ index 38802bcb..762c44dd 100644
+ --- a/python/fatcat_tools/importers/chocula.py
+ +++ b/python/fatcat_tools/importers/chocula.py
+ @@ -139,7 +139,7 @@ class ChoculaImporter(EntityImporter):
+ if ce.extra.get("publisher_type") and not ce.extra.get("publisher_type"):
+ # many older containers were missing this metadata
+ do_update = True
+ - for k in ("kbart", "ia", "doaj"):
+ + for k in ("kbart", "ia"):
+ # always update these fields if not equal (chocula override)
+ if ce.extra.get(k) and ce.extra[k] != existing.extra.get(k):
+ do_update = True
+
+ export FATCAT_AUTH_WORKER_JOURNAL_METADATA=[...]
+ shuf -n100 /srv/fatcat/datasets/chocula_fatcat_export.2022-07-30.json | ./fatcat_import.py chocula --do-updates -
+ # Counter({'total': 100, 'exists': 98, 'exists-skip-update': 98, 'update': 2, 'skip': 0, 'insert': 0})
+
+ shuf -n1000 /srv/fatcat/datasets/chocula_fatcat_export.2022-07-30.json | ./fatcat_import.py chocula --do-updates -
+ # Counter({'total': 1000, 'exists': 986, 'exists-skip-update': 986, 'update': 12, 'insert': 2, 'skip': 0})
+
+Huh, not seeing any `publisher_type` changes, which I was expecting more of.
+(Possibly because the `publisher_type` condition in the patch context above
+compares `ce.extra.get("publisher_type")` against itself, so it never fires;
+it presumably should check `existing.extra`.)
+
+ time cat /srv/fatcat/datasets/chocula_fatcat_export.2022-07-30.json | ./fatcat_import.py chocula --do-updates -
+ # Counter({'total': 188506, 'exists': 185808, 'exists-skip-update': 185806, 'update': 2495, 'insert': 203, 'exists-by-issnl': 2, 'skip': 0})
+
+Looking through the changelog, some updates did come through with
+`publisher_type` changes. Whew!
diff --git a/extra/bulk_edits/CHANGELOG.md b/extra/bulk_edits/CHANGELOG.md
index 278dc1d8..716c95d6 100644
--- a/extra/bulk_edits/CHANGELOG.md
+++ b/extra/bulk_edits/CHANGELOG.md
@@ -9,6 +9,48 @@ this file should probably get merged into the guide at some point.
This file should not turn in to a TODO list!
+## 2022-07
+
+Ran a journal-level metadata update, using chocula.
+
+Cleaned up just under 500 releases with missing `container_id` from an older
+DOAJ article import.
+
+Imported roughly 100k releases from DOAJ, new since 2022-04.
+
+Imported roughly 2.7 million new ORCID `creator` entities, using the 2021 dump
+(first update since the 2020 dump).
+
+Imported almost 1 million new DOI release entities from JALC, first update in
+more than a year.
+
+Imported at least 400 new dblp containers, and an unknown number of new dblp
+release entities.
+
+Cleaned up about a thousand containers with incorrect `publisher_type`, based
+on the current publisher name. Further updates will land after the next
+chocula import.
+
+Ran a second batch of journal-level metadata updates, from chocula, resulting
+in a couple thousand updated entities.
+
+
+## 2022-04
+
+Imported some initial fileset entities.
+
+Updated about 25k file entities from isiarticles.com, which are sample pages
+(spam for a translation service): removed release linkage and set
+`content_scope=sample` (similar to the springer "page one" case).
+
+## 2022-03
+
+Ran a journal-level metadata update, using chocula.
+
+Ran a DOAJ article-level metadata import, yielding a couple hundred thousand
+new release entities. Crawling and bulk ingest of HTML and PDF fulltext for
+these articles also started.
+
## 2022-02
- removed `container_id` linkage for some Datacite DOI releases which are