path: root/extra/bulk_edits/2022-02-09_repo_dois_with_containerid.md

Some institutional repositories register DOIs for pre-prints using the metadata
of the version of record, including its ISSN. As a result, the release entities
get the `container_id` of the actual journal and show up in preservation
dashboards, etc.

## Columbia University

Here is an example search query, showing two works, both marked today as "PLoS Medicine":

    https://fatcat.wiki/release/search?q=%22Contraceptive+use+among+adolescent+and+young+women+in+North+and+South+Kivu%2C+Democratic+Republic+of+the+Congo%3A+A+cross-sectional+population-based+survey%22&generic=1

Some count queries:

    fatcat-cli search releases doi_prefix:10.7916 doi_registrar:datacite 'container_id:*' release_stage:published --count
    # 10870

    fatcat-cli search releases doi_prefix:10.7916 doi_registrar:datacite 'container_id:*' release_stage:published --entity-json -n0 \
        | rg '"Columbia University"' \
        | rg '"IsVariantFormOf"' \
        | pv -l \
        > /dev/null
    # 10.7k 0:09:39

So most of these match: roughly 10.7k of 10,870.
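
The two chained `rg` stages act as a logical AND: a line survives only if it contains both strings. A minimal sketch of that filter, using POSIX `grep -F` as a stand-in for `rg` so it runs without extra tools. The function name, the sample records, and their field names are hypothetical; real release entities are much larger.

```shell
# Keep only JSON lines that contain BOTH fixed strings (logical AND),
# mirroring the chained `rg` stages. grep -F matches the strings literally.
filter_columbia_variants() {
    grep -F '"Columbia University"' | grep -F '"IsVariantFormOf"'
}

# Hypothetical, heavily simplified records, one JSON object per line.
sample='{"ident":"a","org":"Columbia University","rel":"IsVariantFormOf"}
{"ident":"b","org":"Columbia University","rel":"IsSupplementTo"}
{"ident":"c","org":"Other Press","rel":"IsVariantFormOf"}'

# Only record "a" contains both strings.
printf '%s\n' "$sample" | filter_columbia_variants
```

Because the match is on raw text, either string appearing anywhere in the line counts, which is why checking the count before running the update is worthwhile.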

Let's update these to `release_stage=submitted` and an empty `container_id=`, which removes the container linkage.

    export FATCAT_AUTH_WORKER_CLEANUP=[...]
    export FATCAT_API_AUTH_TOKEN=$FATCAT_AUTH_WORKER_CLEANUP

    # start small
    fatcat-cli search releases doi_prefix:10.7916 doi_registrar:datacite 'container_id:*' release_stage:published --entity-json --limit 50 \
        | jq 'select(.container_id != null)' -c \
        | rg '"Columbia University"' \
        | rg '"IsVariantFormOf"' \
        | pv -l \
        | fatcat-cli batch update release release_stage=submitted container_id= --description "Remove container linkage for Columbia University repository deposits"
    # editgroup_grxwpieqvvenxfaxwojnud4lla

    # full auto
    fatcat-cli search releases doi_prefix:10.7916 doi_registrar:datacite 'container_id:*' release_stage:published --entity-json --limit 11000 \
        | jq 'select(.container_id != null)' -c \
        | rg '"Columbia University"' \
        | rg '"IsVariantFormOf"' \
        | pv -l \
        | fatcat-cli batch update release release_stage=submitted container_id= --description "Remove container linkage for Columbia University repository deposits" --auto-accept

Also created a patch for the fatcat datacite importer so that it does not link these in the future.

## "RWTH Publications"

    https://fatcat.wiki/release/search?q=%22Predicting+survival+from+colorectal+cancer+histology+slides+using+deep+learning%3A+A+retrospective+multicenter+study%22&generic=1

    doi_prefix:10.18154


    fatcat-cli search releases doi_prefix:10.18154 'container_id:*' --count
    # 11364

    fatcat-cli search releases doi_prefix:10.18154 'container_id:*' doi_registrar:datacite --count
    # 11364

    fatcat-cli search releases doi_prefix:10.18154 'container_id:*' doi_registrar:datacite affiliation:RWTH --count
    # 6257

    fatcat-cli search releases doi_prefix:10.18154 'container_id:*' doi_registrar:datacite --entity-json -n0 \
        | rg 'RWTH' \
        | rg '10.18154/rwth-' \
        | rg '"IsVariantFormOf"' \
        | pv -l \
        > /dev/null
    # many/all? at least 5k, cut off there

Ok, do updates:

    # start small
    fatcat-cli search releases doi_prefix:10.18154 'container_id:*' doi_registrar:datacite affiliation:RWTH --entity-json -n50 \
        | jq 'select(.container_id != null)' -c \
        | rg 'RWTH' \
        | rg '10.18154/rwth-' \
        | rg '"IsVariantFormOf"' \
        | fatcat-cli batch update release container_id= --description "Remove container linkage for RWTH repository deposits"
    # Got 6257 hits in 1087ms
    # editgroup_cb2vdn7npfg63muppawbhzrhjq

    # do the rest
    fatcat-cli search releases doi_prefix:10.18154 'container_id:*' doi_registrar:datacite affiliation:RWTH --entity-json -n12000 \
        | jq 'select(.container_id != null)' -c \
        | rg 'RWTH' \
        | rg '10.18154/rwth-' \
        | rg '"IsVariantFormOf"' \
        | pv -l \
        | fatcat-cli batch update release container_id= --description "Remove container linkage for RWTH repository deposits" --auto-accept
    # Got 6207 hits in 696ms
    # 6.00k 0:16:37 [6.01 /s]

After that pass there were still many mismatched DOIs, so the constraints are
relaxed below. This repository *does* contain a bunch of publications from RWTH
itself (books, conference series, etc.), so not everything should be updated.

    fatcat-cli search releases doi_prefix:10.18154 'container_id:*' doi_registrar:datacite '!journal:RWTH' '!container_id:m2cho7mmmbgxzdpfz7cmjgegbu' --count
    # 3946

    fatcat-cli search releases doi_prefix:10.18154 'container_id:*' doi_registrar:datacite '!journal:RWTH' '!container_id:m2cho7mmmbgxzdpfz7cmjgegbu' --entity-json -n6000 \
        | jq 'select(.container_id != null)' -c \
        | rg 'RWTH' \
        | rg '10.18154/rwth-20' \
        | rg '"IsVariantFormOf"' \
        | pv -l \
        | fatcat-cli batch update release container_id= --description "Remove container linkage for RWTH repository deposits" --auto-accept
    # Got 3946 hits in 77ms
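
One detail in the `rg` patterns above: an unescaped `.` is a regex wildcard, so `10.18154/rwth-20` also matches strings such as `10x18154/rwth-20`. This is unlikely to matter here because the search is already restricted by `doi_prefix`, but a stricter variant would escape the dot. A minimal sketch with POSIX `grep -E` standing in for `rg`; the function name and the sample DOIs are made up for illustration.

```shell
# Match the RWTH DOI form with the dot taken literally.
match_rwth_doi() {
    grep -E '10\.18154/rwth-20'
}

# Hypothetical inputs: only the first line has a literal "10.18154/rwth-20".
printf '%s\n' \
    '10.18154/rwth-2019-01234' \
    '10x18154/rwth-2019-01234' \
    '10.18154/rwth-conv-000001' \
    | match_rwth_doi
```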

Specifically, some more PLOS ones:

    fatcat-cli search releases doi_prefix:10.18154 'container_id:*' doi_registrar:datacite '!journal:RWTH' '!container_id:m2cho7mmmbgxzdpfz7cmjgegbu' journal:plos --count
    # 338

    fatcat-cli search releases doi_prefix:10.18154 'container_id:*' doi_registrar:datacite '!journal:RWTH' '!container_id:m2cho7mmmbgxzdpfz7cmjgegbu' 'journal:plos' --entity-json -n500 \
        | jq 'select(.container_id != null)' -c \
        | rg '10.18154/rwth-' \
        | pv -l \
        | fatcat-cli batch update release container_id= --description "Remove container linkage for RWTH repository deposits" --auto-accept
    # Got 338 hits in 33ms

## DESY Pre-Print Server (PUBDB)

    https://fatcat.wiki/release/search?q=%22viral+phosphatase+adaptor+that+promotes+herpes+simplex+virus+replication+and+spread%22+type%3Aarticle-journal+%21title%3Acorrection

    fatcat-cli search releases doi_prefix:10.3204 'container_id:*' doi_registrar:datacite publisher:DESY --count
    # 313

    fatcat-cli search releases doi_prefix:10.3204 'container_id:*' doi_registrar:datacite --count
    # 6679

    fatcat-cli search releases doi_prefix:10.3204 'container_id:*' doi_registrar:datacite --entity-json -n7000 \
        | jq 'select(.container_id != null)' -c \
        | rg '10.3204/(pubdb|phppubdb)-' \
        | rg '"IsVariantFormOf"' \
        | pv -l \
        > /dev/null
    # at least hundreds

    # start small
    fatcat-cli search releases doi_prefix:10.3204 'container_id:*' doi_registrar:datacite --entity-json -n50 \
        | jq 'select(.container_id != null)' -c \
        | rg '10.3204/(pubdb|phppubdb)-' \
        | rg '"IsVariantFormOf"' \
        | fatcat-cli batch update release container_id= --description "Remove container linkage for DESY repository deposits"
    # Got 6679 hits in 368ms
    # editgroup_vhcxvqjyinhxfplkoqjtprnxj4

    fatcat-cli search releases doi_prefix:10.3204 'container_id:*' doi_registrar:datacite --entity-json -n7000 \
        | jq 'select(.container_id != null)' -c \
        | rg '10.3204/(pubdb|phppubdb)-' \
        | rg '"IsVariantFormOf"' \
        | fatcat-cli batch update release container_id= --description "Remove container linkage for DESY repository deposits" --auto-accept


## Kluedo: Publication Server of University of Kaiserslautern

doi:10.26204/kluedo/6163

    fatcat-cli search releases doi_prefix:10.26204 'container_id:*' --count
    # 7

Whew, an easy one!

    fatcat-cli search releases doi_prefix:10.26204 'container_id:*' --entity-json -n50 \
        | jq 'select(.container_id != null)' -c \
        | rg '10.26204/kluedo/' \
        | fatcat-cli batch update release release_stage=submitted container_id= --description "Remove container linkage for 'Kluedo' repository deposits"
    # Got 7 hits in 20ms
    # editgroup_tmyyg4yl7vbg7mveyfcdxhptfu

## Universität Bayreuth

    doi:10.15495/epub_ubt_00005577 

    fatcat-cli search releases doi_prefix:10.15495 'container_id:*' --count
    # 554

Great, also not very large.

    # start small
    fatcat-cli search releases doi_prefix:10.15495 'container_id:*' --entity-json -n50 \
        | jq 'select(.container_id != null)' -c \
        | rg '10.15495/epub_ubt_' \
        | fatcat-cli batch update release container_id= --description "Remove container linkage for University of Bayreuth  repository deposits"
    # 554
    # editgroup_6oubgez7jrfabprdckijvijsa4

    fatcat-cli search releases doi_prefix:10.15495 'container_id:*' --entity-json -n600 \
        | jq 'select(.container_id != null)' -c \
        | rg '10.15495/epub_ubt_' \
        | fatcat-cli batch update release container_id= --description "Remove container linkage for University of Bayreuth  repository deposits" --auto-accept
    # did a variant with `publisher:Bayreuth`, which only matched a single release
    # Got 503 hits in 310ms

Could also have filtered on publisher "University of Bayreuth" in the post-fetch part of the pipeline (the `jq`/`rg` stages).
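
Before adding such a publisher filter, it can help to check how many of the DOI-matched lines it would keep. A rough sketch, assuming the publisher appears in each JSON line as the quoted string `"University of Bayreuth"`; the function name and sample records are hypothetical.

```shell
# Among lines matching the Bayreuth DOI pattern, count how many also
# contain the quoted publisher string. Prints "hits/total".
publisher_coverage() {
    awk '
        /10\.15495\/epub_ubt_/ {
            total++
            if (index($0, "\"University of Bayreuth\"")) hit++
        }
        END { printf "%d/%d\n", hit, total }
    '
}

# Hypothetical, simplified records, one JSON object per line.
printf '%s\n' \
    '{"doi":"10.15495/epub_ubt_00000001","publisher":"University of Bayreuth"}' \
    '{"doi":"10.15495/epub_ubt_00000002","publisher":"Some Journal Publisher"}' \
    '{"doi":"10.9999/other","publisher":"University of Bayreuth"}' \
    | publisher_coverage
```

A low ratio would suggest the publisher string is too strict for this repository, as the `publisher:Bayreuth` search variant already hinted.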