See `container_issnl_dedupe` cleanup notes for background and dataset prep.
This bulk edit updated over 2,000 container entities (as redirects). After a
couple of additional manual redirects, there are no duplicate containers left
in the index.
## Prod Run
    git log | head -n1
    # commit 929baa24c6eec96e303286f258f2be0949303266

    export FATCAT_AUTH_API_TOKEN=[...]
Start small, with the random sample, no releases or human edits allowed:

    cat /srv/fatcat/datasets/container_issnl_dupes.sample.json \
        | python -m fatcat_tools.mergers.containers --editgroup-description-override "Automated merging of duplicate container entities with the same ISSN-L" merge-containers -
    # Counter({'updated-entities': 307, 'skip-container-release-count': 251, 'lines': 100, 'merged': 99, 'skip': 1, 'skip-not-active-entity': 1, 'updated-total': 0})
Continue with the full batch with the 100 cut-off (original plan):

    cat /srv/fatcat/datasets/container_issnl_dupes.json \
        | python -m fatcat_tools.mergers.containers --editgroup-description-override "Automated merging of duplicate container entities with the same ISSN-L" merge-containers - --max-container-releases 100
    # Counter({'updated-entities': 2025, 'lines': 835, 'merged': 732, 'skip-container-release-count': 112, 'skip': 103, 'skip-not-active-entity': 101, 'skip-human-edited': 2, 'updated-total': 0})
Full output log:
skipping container_v2zsod52jnhfbb5tk3bujgvhn4: release count 189
skipping container_sk4nevyiwfgk3hwdy6h3rgk5ou: release count 212
skipping container_3sdew7wvcvdi3ancgeemu2qhtu: release count 175
skipping container_uffot534onejpohj3l4m36pyu4: release count 280
skipping container_zd7uysnu5nejxjmvmieehodd3a: release count 150
skipping container_lbskm3w74nhkvobfy3kefaqcxq: release count 388
skipping container_yf76l2b5d5hu3hlr45e5qm3tai: release count 271
skipping container_ykoitf76dbe5lbhkgozjay3krq: release count 405
skipping container_z2xwk5y4oredlg7oaxcsd266ie: release count 203
skipping container_2cxnpvvjkvfdxcz2nb3qkjucei: release count 464
skipping container_5yq3233lrfeclbiqa54ed5il6y: release count 138
skipping container_asoaslvd5nghlabgerd3rt7ffq: release count 398
skipping container_nvvtdqsmdvfilfuwsg44dl4gdy: release count 115
skipping container_gxkfhz3ywzglngcj46m2mfqeri: release count 333
skipping container_twvb6baapfaotnhhe6xynd7n6q: release count 249
skipping container_gufxbxnddfdpjag36erclh476i: release count 127
skipping container_qthwmftdgjc2fgrzx6w7py2mqe: release count 138
skipping container_dbpyvcd245ednlpc5f4eo33u4u: release count 1097
skipping container_c3h2ajl72fhf7bwzgsntpipd6u: release count 264
skipping container_vin6xe6ojndyzfl23rkzbldxym: release count 218
skipping container_eef7zpisarfcbdmgp3pyixchom: release count 175
skipping container_ls56ijz6dre7dfu5sbg32pi7pq: release count 456
skipping container_rftmo4qx6zfpxbuaellrlypqza: release count 326
skipping container_uhuemivuxffbdhdqae75bru7ie: release count 235
skipping container_4tyenz4zibemzpmodpefemlbia: release count 117
skipping container_zznalfkafzdo5p32z3qe5hflkm: release count 193
skipping container_3kio7nj2enbodc5mcfefalbbti: release count 181
skipping container_gxk755hay5bwvbllydcetcupqq: release count 119
skipping container_eixvqbgdprej3bm4bqouiqzp3e: release count 247
skipping container_rmmzgzkswjhitkwrv6zxgmji5i: release count 157
skipping container_pkq6njkgvrekham3qipbjd6czu: release count 140
skipping container_hkpd4a467bepho76ifji6jg4ue: release count 258
skipping container_dfxmy3gx5fe4ff6nr2tiemiyyi: release count 101
skipping container_eqqja4xhfzatrou3dkpjryrmaq: release count 114
skipping container_sku6jjds5ngwrhqwlxfq2jvgzu: release count 1135
skipping container_qyh3ncxjqzfmlajauwcsqqlb5q: release count 265
skipping container_syh4k3q3yjf7hcva3ficd2g4dy: release count 262
skipping container_ptmjhxoyfffjzbwkbe65fle4qm: release count 201
skipping container_3tzyaivyerexflyiy463i5q4x4: release count 466
skipping container_bbwsceba7rh2fgiyhbzspyykxe: release count 110
skipping container_e7yk3xwws5dnvk7osdoxpulv6m: release count 200
skipping container_hmyaeocgsze35p5vo5u6q4o7jm: release count 119
skipping container_ireoyurf5fcgfeem22ng7liel4: release count 313
skipping container_dkvv2nptnza3pmbiojxo2syom4: release count 274
skipping container_hyzwgzfumvaj7pqkrh65e5ev6q: release count 261
skipping container_kyfdgd22k5a53nm2jfjceklbly: release count 125
skipping container_avfetikkhnb53asxdkowshucce: release count 239
skipping container_7wrb5pc3tzcjfho3t74ab7ijjq: release count 164
skipping container_27wax7sd4vbkrfl3mbe227ui5u: release count 231
skipping container_4mwjbsjlq5c7la7dvmqbegtslq: release count 122
skipping container_udtqmaqkcvhbnjo3vhfpbubaie: release count 106
skipping container_y7hcehu5izdbpi65f4pxm6cqla: release count 170
skipping container_ddgokru3vvh3nglqkfzx5orlfe: release count 247
skipping container_qhtesw5bb5fhvpw7gymydznhne: release count 215
skipping container_62jks7u4vjbrpl4sycma7yzhru: release count 314
skipping container_knhzl5oq6resrghowrxocj52mq: release count 123
skipping container_yfgztqw72vhabfohcrr77v6b3a: release count 167
skipping container_i7v5yf24erezljqlu5dsvctulm: release count 174
skipping container_75jngrgwnbb4tn22c5qk3fxo4i: release count 198
skipping container_3zgq2ml4jjc5jh5w4mrx44qda4: release count 162
skipping container_xbdlayhr2zcajfsmgj7niiylqm: release count 120
skipping container_cpjsvpcsgrfy5ms5czaj4fcgbm: release count 228
skipping container_o44k5mmoxzdtnozysjun66slju: release count 179
skipping container_qph344zobjev7kmke2ozk33mwi: release count 241
skipping container_xkii4g5qnrdnleyhhwvd4uoz2i: release count 108
skipping container_vrdwdxnidbdiracilr3gr5v3cm: release count 204
skipping container_4gw5lyn3xna2jajzbxot6fifdq: release count 101
skipping container_t7sx4zbwyzhbjbzsvrazlwtbvy: release count 199
skipping container_xxnbdrcamrgfbhydxv6v3jy2fy: release count 101
skipping container_usm5yyjmcfgmjcu4lcy3f6sade: release count 115
skipping container_fqmqy4ws5fcf7axos7esy2fcce: release count 191
skipping container_wadmkiqbt5hhjfpm6lbn47agyu: release count 114
skipping container_ujc3wxewfraizaffzouy2mxaau: release count 136
skipping container_um6qxikn3nhlpoiovdrww5gp4q: release count 144
skipping container_bykun5gbb5cgnfo25j5jzm5iai: release count 320
skipping container_amyo2w2zgjafxbxvmlb5hfndoy: release count 109
skipping container_mfbbi2zyhrhvvjh45mvd4fz5qy: release count 119
skipping container_jftmuuvc2fa6xcu77zjsyrs5b4: release count 251
skipping container_dqrilnh3ord7fachdd4vaktvpe: release count 143
skipping container_oj4ahzx3mvcbrggtpa6sqnjtny: release count 160
skipping container_p452ltmuhbdnlaz6donkkr5une: release count 166
skipping container_dmsweozizzgqza4sd5byomyqve: release count 123
skipping container_xckj2u5pdrevxei43pgqe4c4sq: release count 119
skipping container_5qpk5qalazcmjooeruos7zck5i: release count 194
skipping container_eaqcxaog3jgjhjmpf5kdxrmmcm: release count 149
skipping container_33tn3bydy5dvtkeoe5falrto6q: release count 115
skipping container_i4t74quqcjdgplmeyga5payqhm: release count 118
skipping container_2olekyosjzer5ds3s7gc3zb624: release count 200
skipping container_yh4uvtwy7fa27ibgddi3qmewg4: release count 199
skipping container_qjidb4al4japfcbxce2dw627ni: release count 545
skipping container_y2cy2ajbwzbo3d6lwyseqig2ce: release count 191
skipping container_abehsncrarf4ffttto4toaayay: release count 440
skipping container_elmhdn55tbfgvod2oic6xlwr3e: release count 159
skipping container_ylzwxwnwpfdf3br2xgwobkqjwa: release count 239
skipping container_g3drm2f6mjazxc7vckimcycn5y: release count 116
skipping container_33f3tq7re5abtafbxdbzd6jpxi: release count 210
skipping container_jtx5kwkiwbcezbqk7gwhor53wm: release count 265
skipping container_rrikfc42xrg6vl7lnhlf26mtxm: release count 150
skipping container_pzqxtxvco5fhlj7k23bqw374yu: release count 106
skipping container_xde4qg5sh5djzdnawhou5dhb4u: release count 366
skipping container_omshcvxszffgdlspsehxlyksse: release count 165
skipping container_ekjk6clxtfgmbgvkkgixeaadfy: release count 224
skipping container_tyindzesqbg27nblbfwa6hptku: release count 190
skipping container_ho6rwlwmsnepri2amemptyvnhm: release count 134
skipping container_uqezfmzrnre5zdtspj6wphlvbm: release count 189
skipping container_6xslqgviybh7lkl2edk5ghtvum: release count 141
skipping container_3wgusadkfjhf3ejpn2nnn3v5ka: release count 307
skipping container_cbiclofvcbhmjcxubvkg5vyhxi: release count 1368
skipping container_ahvaawe3fvd3zb4mfiaado7wyy: human edited
skipping container_ejgmmh6ozvf7hczdpok2ab3fni: release count 102
skipping container_s3h3himubfaftbhcpfzgc354zq: release count 174
skipping container_wzabvii33zakjbf24dspb2tst4: release count 122
skipping container_kly2opavcjejxesqru5tib455q: release count 159
skipping container_zw7uyavzgzdtleeliwh5cnp23q: human edited
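
A log like the one above is easier to reason about once it is tallied by skip reason. The following is a minimal sketch, assuming the log lines have exactly the `skipping container_<ident>: <reason>` shape shown here; `summarize_skips` is a hypothetical helper, not part of `fatcat_tools`.

```python
import re
from collections import Counter

# Matches lines such as
#   "skipping container_<ident>: release count 189"
#   "skipping container_<ident>: human edited"
SKIP_RE = re.compile(r"^skipping container_(?P<ident>[a-z0-9]+): (?P<reason>.+)$")


def summarize_skips(lines):
    """Tally skip reasons and collect per-container release counts.

    Hypothetical helper for reading the merger log; lines that do not
    match the expected shape are ignored.
    """
    reasons = Counter()
    release_counts = {}
    for line in lines:
        m = SKIP_RE.match(line.strip())
        if not m:
            continue
        reason = m.group("reason")
        if reason.startswith("release count "):
            reasons["release-count"] += 1
            release_counts[m.group("ident")] = int(reason.rsplit(" ", 1)[1])
        else:
            reasons[reason] += 1
    return reasons, release_counts
```

If the log is complete, the `release-count` and `human edited` tallies should line up with `skip-container-release-count` and `skip-human-edited` in the `Counter` summary of the full run.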
Check progress (after partial ident dump and `fatcat_export`, just for containers):

    zcat container_export.json.gz \
        | jq '[.issnl, .ident] | @tsv' -r \
        | sort -S 4G \
        | uniq -D -w 9 \
        > issnl_ident.dupes.tsv

    wc -l issnl_ident.dupes.tsv
    # 4

    cut -f1 issnl_ident.dupes.tsv | uniq | wc -l
    # 2
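
The same duplicate check can be sketched in Python, which avoids relying on the fixed 9-character ISSN-L prefix that `uniq -w 9` compares. This is a rough equivalent, assuming one JSON entity per line with `issnl` and `ident` fields (as in the `jq` expression above); `find_issnl_dupes` is a hypothetical name.

```python
import json
from collections import defaultdict


def find_issnl_dupes(json_lines):
    """Group container idents by ISSN-L and return only groups with duplicates.

    Entities without an ISSN-L are skipped. Returns a dict mapping
    ISSN-L -> sorted list of idents, for ISSN-Ls seen more than once.
    """
    by_issnl = defaultdict(list)
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        entity = json.loads(line)
        issnl = entity.get("issnl")
        if issnl:
            by_issnl[issnl].append(entity["ident"])
    return {k: sorted(v) for k, v in by_issnl.items() if len(v) > 1}
```

With the export from above, this could be driven by opening the file with `gzip.open("container_export.json.gz", "rt")` and passing the file object in; the number of keys in the result corresponds to the `cut -f1 ... | uniq | wc -l` count.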
Huh, this doesn't seem right. What happened to all the `skip-container-release-count` cases, etc.?

Ah, the `skip-container-release-count` condition 1) can be triggered by the
primary entity itself and 2) only prints a message and bumps the counter; it
never actually skips the merge (!).
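
To make the intended behavior concrete, here is a minimal sketch of a release-count guard that actually short-circuits the merge. It is not the `fatcat_tools` implementation: `should_merge`, its arguments, and the choice to exempt the primary from the cut-off are all assumptions for illustration.

```python
from collections import Counter


def should_merge(release_counts, max_releases, counts, primary_id=None):
    """Return True only if every non-primary container is within the cut-off.

    release_counts: mapping of container ident -> release count (hypothetical
        input; the real merger would look this up itself).
    counts: a Counter that receives 'skip-container-release-count'.
    primary_id: ident exempted from the check (an assumption; the primary
        keeps its releases, so its size arguably should not block a merge).
    """
    for ident, n in release_counts.items():
        if ident == primary_id:
            continue
        if n > max_releases:
            counts["skip-container-release-count"] += 1
            print(f"skipping container_{ident}: release count {n}")
            # The important part: report the skip to the caller instead of
            # only logging it.
            return False
    return True
```

A caller would check the return value before creating any redirects, so that the skip counter and the actual behavior cannot drift apart.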
Well, yikes, but no indication anything has gone really wrong.
Let's manually look at the human-edited cases:

    # 2709-2143
    # zw7uyavzgzdtleeliwh5cnp23q is better primary
    {
      "duplicate_ids": [
        "osev5sprhfdwdkoyypihlspsme",
        "zw7uyavzgzdtleeliwh5cnp23q"
      ],
      "entity_type": "container",
      "evidence": {
        "extid": "2709-2143",
        "extid_type": "issnl"
      },
      "primary_id": null
    }

    # 2567-8833
    # ahvaawe3fvd3zb4mfiaado7wyy is better primary (no releases)
    {
      "duplicate_ids": [
        "ahvaawe3fvd3zb4mfiaado7wyy",
        "e7q6rq2ebbbp3m3xxccqxi3fnm"
      ],
      "entity_type": "container",
      "evidence": {
        "extid": "2567-8833",
        "extid_type": "issnl"
      },
      "primary_id": null
    }
In retrospect, I probably should have also enforced title fuzzy match in
automated merging, instead of just ISSN-L match.
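
A title check along those lines could look roughly like the sketch below. It uses the standard-library `difflib.SequenceMatcher` as a stand-in for whatever fuzzy matcher would really be used; the normalization rules, the `titles_match` name, and the `0.9` threshold are assumptions, not tuned values.

```python
import difflib
import unicodedata


def normalize_title(title):
    """Casefold, drop punctuation, and collapse whitespace (assumed rules)."""
    title = unicodedata.normalize("NFKC", title or "").casefold()
    cleaned = "".join(ch if ch.isalnum() else " " for ch in title)
    return " ".join(cleaned.split())


def titles_match(a, b, threshold=0.9):
    """Return True if two container titles are similar enough to auto-merge.

    Missing or empty titles never match, so such pairs would fall through
    to manual review rather than being merged on ISSN-L alone.
    """
    na, nb = normalize_title(a), normalize_title(b)
    if not na or not nb:
        return False
    return difflib.SequenceMatcher(None, na, nb).ratio() >= threshold
```

Requiring both the ISSN-L match and `titles_match(...)` before an automated merge would leave mismatched pairs for a human to look at.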