Spectra DSpace Instance Cleanups
================================

Basic query:

    doi_prefix:10.14469

There was a big spike of these in 2014, marked as `article`, but they should be
`dataset` (or `entry`). On the order of 150k releases. In particular, this
causes a weird bump in unarchived OA papers in coverage plots for the year 2014.

This is technically a DSpace instance and might have various types of content
in it, so we might want to narrow down the filter in some way, e.g., by title
prefix, DOI pattern, etc.

    fatcat-cli search releases doi_prefix:10.14469 type:article --count
    196236

    fatcat-cli search releases doi_prefix:10.14469 type:article 'title:NSC*' --count
    158380

    fatcat-cli search releases doi_prefix:10.14469 type:article 'title:NSC*' author:"Imperial College High Performance Computing Service" --count
    158380
That seems to nail it down pretty well; these only fall under 2014 and a bit in
2015.
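
As a quick sanity check on how much of the prefix this filter covers, the counts above can be compared with plain shell arithmetic. This is only a sketch: the numbers are copied by hand from the search output, and the variable names are made up for illustration.

```shell
# Counts copied from the search output above (not fetched live).
total_articles=196236   # doi_prefix:10.14469 type:article
nsc_articles=158380     # same query plus 'title:NSC*'

# Releases still typed as `article` that the NSC filter does not touch.
remaining=$((total_articles - nsc_articles))

# Integer percentage of article-typed releases covered by the filter.
percent=$((nsc_articles * 100 / total_articles))

echo "matched by filter: ${nsc_articles} (${percent}% of article-typed releases)"
echo "not matched:       ${remaining}"
```

By this arithmetic the filter covers about 80% of the article-typed releases under the prefix, leaving 37856 for later passes.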

The plan is to just mark these as `release_type:entry` (they are sort of
datasets, but really it is all one big database and these are individual
entries within that).

Commands:

    export FATCAT_AUTH_WORKER_CLEANUP=[...]
    export FATCAT_API_AUTH_TOKEN=$FATCAT_AUTH_WORKER_CLEANUP

    # start small
    fatcat-cli search releases doi_prefix:10.14469 type:article 'title:NSC*' author:"Imperial College High Performance Computing Service" --entity-json --limit 50 \
        | jq 'select(.release_type == "article")' -c \
        | pv -l \
        | fatcat-cli batch update release release_type=entry --description "Correct release_type for 'Revised Cambridge NCI database' entries"
    # Got 158380 hits
    # editgroup_mwuqpc5j3fhtjg5vxvr2xnitda

Looks good, do the full batch (!):

    fatcat-cli search releases doi_prefix:10.14469 type:article 'title:NSC*' author:"Imperial College High Performance Computing Service" --entity-json --limit 160000 \
        | jq 'select(.release_type == "article")' -c \
        | pv -l \
        | fatcat-cli batch update release release_type=entry --description "Correct release_type for 'Revised Cambridge NCI database' entries" --auto-accept
    # 158k 1:00:21 [43.7 /s]
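
The rate that `pv` reports can be cross-checked against the hit count. This is a rough sketch that assumes all 158380 hits from the search went through the pipe (`pv` only printed `158k`); the variable names are illustrative.

```shell
items=158380        # assumption: exact hit count from the earlier search
elapsed="1:00:21"   # H:MM:SS, as printed by pv

# Convert the elapsed time to seconds and divide to get updates per second.
rate=$(echo "$elapsed" | awk -F: -v n="$items" \
    '{ secs = $1 * 3600 + $2 * 60 + $3; printf "%.1f", n / secs }')

echo "${rate} updates/s"
```

Under that assumption the result is 43.7 updates per second, which agrees with the `[43.7 /s]` shown by `pv`.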

Off it goes!

There are more patterns from this repository, but this is a good start.