path: root/extra/bulk_edits/2022-02-08_nci_cambridge_datasets.md

Spectra DSpace Instance Cleanups
================================

Basic query:

    doi_prefix:10.14469

There was a big spike of these in 2014, marked as `article` when they should be
`dataset` (or `entry`). On the order of 150k releases. In particular, this
causes a weird bump in unarchived OA papers in coverage plots for the year 2014.

This is technically a DSpace instance and might have various types of content
in it, so we might want to narrow down the filter in some way, e.g., by title
prefix, DOI pattern, etc.
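
The same narrowing could also be double-checked client-side on the entity JSON, independent of the search index. A minimal sketch, assuming the release JSON carries `release_type`, `title`, and `ext_ids.doi` fields; `is_target` and `filter_lines` are hypothetical helper names, and the case-sensitive `NSC` prefix check is stricter than the `title:NSC*` search query:

```python
import json

DOI_PREFIX = "10.14469/"   # DOI prefix from the basic query
TITLE_PREFIX = "NSC"       # title prefix used to narrow the search


def is_target(release):
    """True if a release dict still looks like a mis-typed entry.

    Field names (release_type, title, ext_ids.doi) are assumptions about
    the entity JSON, not something these notes confirm.
    """
    doi = (release.get("ext_ids") or {}).get("doi") or ""
    title = release.get("title") or ""
    return (
        release.get("release_type") == "article"
        and doi.lower().startswith(DOI_PREFIX)
        and title.startswith(TITLE_PREFIX)
    )


def filter_lines(lines):
    """Yield only the JSON lines whose release passes is_target."""
    for line in lines:
        line = line.strip()
        if line and is_target(json.loads(line)):
            yield line
```

Fed with the output of a `--entity-json` search, `filter_lines` would drop anything that does not match all three conditions before it reaches a batch update.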

    fatcat-cli search releases doi_prefix:10.14469 type:article --count
    196236

    fatcat-cli search releases doi_prefix:10.14469 type:article 'title:NSC*' --count
    158380

    fatcat-cli search releases doi_prefix:10.14469 type:article 'title:NSC*' author:"Imperial College High Performance Computing Service" --count
    158380

That seems to nail it down pretty well; these only fall under 2014 and a bit in
2015.
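
One way to check the year distribution from a sample of search results is to tally a year field over the entity JSON. This is only a sketch: it assumes each line is one release JSON object with a `release_year` field, and `tally_years` is a hypothetical helper name.

```python
import json
from collections import Counter


def tally_years(lines):
    """Count releases per release_year from JSON-lines input.

    Releases without a release_year (an assumed field name) are
    counted under the key None.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        release = json.loads(line)
        counts[release.get("release_year")] += 1
    return counts
```

Calling `tally_years(sys.stdin)` from a small wrapper script, with the `--entity-json` search output piped in, would show whether anything falls outside 2014 and 2015.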

Want to just mark these as `release_type:entry` (they are sort of datasets, but
really it is all one big database and these are individual entries within
that).

Commands:

    export FATCAT_AUTH_WORKER_CLEANUP=[...]
    export FATCAT_API_AUTH_TOKEN=$FATCAT_AUTH_WORKER_CLEANUP

    # start small
    fatcat-cli search releases doi_prefix:10.14469 type:article 'title:NSC*' author:"Imperial College High Performance Computing Service" --entity-json --limit 50 \
        | jq 'select(.release_type == "article")' -c \
        | pv -l \
        | fatcat-cli batch update release release_type=entry --description "Correct release_type for 'Revised Cambridge NCI database' entries"
    # Got 158380 hits
    # editgroup_mwuqpc5j3fhtjg5vxvr2xnitda

Looks good, do the full batch (!):

    fatcat-cli search releases doi_prefix:10.14469 type:article 'title:NSC*' author:"Imperial College High Performance Computing Service" --entity-json --limit 160000 \
        | jq 'select(.release_type == "article")' -c \
        | pv -l \
        | fatcat-cli batch update release release_type=entry --description "Correct release_type for 'Revised Cambridge NCI database' entries" --auto-accept
    # 158k 1:00:21 [43.7 /s]

Off it goes!

There are more patterns from this repository, but this is a good start.
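
For sizing any follow-up batch, the figures above give a rough rate. A back-of-the-envelope sketch, assuming pv's "158k" corresponds to the 158380 search hits, and treating every remaining `type:article` release under this prefix as a candidate (an upper bound; some may be correctly typed):

```python
# Figures copied from the notes above.
total_article_hits = 196236          # doi_prefix:10.14469 type:article
updated = 158380                     # 'title:NSC*' subset (assumed equal to pv's "158k")
elapsed_s = 1 * 3600 + 0 * 60 + 21   # pv elapsed time 1:00:21

rate = updated / elapsed_s                  # releases per second
remaining = total_article_hits - updated    # article-typed releases not yet touched
eta_min = remaining / rate / 60             # minutes at the same rate

print(f"{rate:.1f} releases/s, {remaining} remaining, ~{eta_min:.0f} min at the same rate")
```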