Diffstat (limited to 'extra/bulk_edits/2022-02-08_nci_cambridge_datasets.md')
-rw-r--r-- extra/bulk_edits/2022-02-08_nci_cambridge_datasets.md | 56
1 files changed, 56 insertions, 0 deletions
diff --git a/extra/bulk_edits/2022-02-08_nci_cambridge_datasets.md b/extra/bulk_edits/2022-02-08_nci_cambridge_datasets.md
new file mode 100644
index 00000000..3172f16f
--- /dev/null
+++ b/extra/bulk_edits/2022-02-08_nci_cambridge_datasets.md
@@ -0,0 +1,56 @@
+
+Spectra DSpace Instance Cleanups
+================================
+
+Basic query:
+
+    doi_prefix:10.14469
+
+There was a big spike of these in 2014, marked as `article`, but they should be
+`dataset` (or `entry`). On the order of 150k releases. In particular, this causes
+a weird bump in unarchived OA papers in coverage plots for the year 2014.
+
+This is technically a DSpace instance and might have various types of content
+in it, so we might want to narrow down the filter in some way, e.g. by title
+prefix, DOI pattern, etc.
+
+    fatcat-cli search releases doi_prefix:10.14469 type:article --count
+    196236
+
+    fatcat-cli search releases doi_prefix:10.14469 type:article 'title:NSC*' --count
+    158380
+
+    fatcat-cli search releases doi_prefix:10.14469 type:article 'title:NSC*' author:"Imperial College High Performance Computing Service" --count
+    158380
+
+That seems to nail it down pretty well; these only fall under 2014 and a bit in
+2015.
+
+Want to just mark these as `release_type:entry` (they are sort of datasets, but
+really it is all one big database and these are individual entries within
+that).
+
+Commands:
+
+    export FATCAT_AUTH_WORKER_CLEANUP=[...]
+    export FATCAT_API_AUTH_TOKEN=$FATCAT_AUTH_WORKER_CLEANUP
+
+    # start small
+    fatcat-cli search releases doi_prefix:10.14469 type:article 'title:NSC*' author:"Imperial College High Performance Computing Service" --entity-json --limit 50 \
+        | jq 'select(.release_type == "article")' -c \
+        | pv -l \
+        | fatcat-cli batch update release release_type=entry --description "Correct release_type for 'Revised Cambridge NCI database' entries"
+    # Got 158380 hits
+    # editgroup_mwuqpc5j3fhtjg5vxvr2xnitda
+
+Looks good, do the full batch (!):
+
+    fatcat-cli search releases doi_prefix:10.14469 type:article 'title:NSC*' author:"Imperial College High Performance Computing Service" --entity-json --limit 160000 \
+        | jq 'select(.release_type == "article")' -c \
+        | pv -l \
+        | fatcat-cli batch update release release_type=entry --description "Correct release_type for 'Revised Cambridge NCI database' entries" --auto-accept
+    # 158k 1:00:21 [43.7 /s]
+
+Off it goes!
+
+There are more patterns from this repository, but this is a good start.
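
The `jq 'select(.release_type == "article")'` step in the pipelines above acts as a guard: only entities still typed `article` reach the batch update. A minimal Python sketch of that guard follows, useful for a dry run over `--entity-json` output before sending anything to the API. The function name `select_articles`, the extra title-prefix check, and the sample records (including their placeholder `ident` values) are assumptions made for illustration; they are not part of `fatcat-cli` or of the original notes.

```python
import json
from typing import Iterable, Iterator


def select_articles(lines: Iterable[str], title_prefix: str = "NSC") -> Iterator[dict]:
    """Yield release entities that are still typed 'article'.

    Mirrors the jq guard, plus a hypothetical extra check that the title
    starts with `title_prefix` (the search query already filters on this).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the JSON-lines stream
        release = json.loads(line)
        if release.get("release_type") != "article":
            continue  # already corrected, or never an article
        if not (release.get("title") or "").startswith(title_prefix):
            continue  # title missing or outside the expected pattern
        yield release


# Hypothetical sample records; real entities carry many more fields.
SAMPLE_LINES = [
    '{"ident": "aaa", "release_type": "article", "title": "NSC 12345"}',
    '{"ident": "bbb", "release_type": "entry", "title": "NSC 67890"}',
    '{"ident": "ccc", "release_type": "article", "title": "Some other deposit"}',
]

selected = list(select_articles(SAMPLE_LINES))
print([release["ident"] for release in selected])
```

Counting what this yields for the 50-entity trial and comparing it with the `pv -l` line count is a cheap sanity check before running the full batch with `--auto-accept`.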