summaryrefslogtreecommitdiffstats
path: root/extra/bulk_edits/2021-11-29_container_issnl_dedupe.md
blob: 44a6226e24404a04d0c3b5b8c9537da5a2772c7a (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198

See `container_issnl_dedupe` cleanup notes for background and dataset prep.

This bulk edit updated over 2,000 container entities (redirects), and following
a couple additional manual redirects, there are no duplicate containers in the
index.

## Prod Run

    git log | head -n1
    # commit 929baa24c6eec96e303286f258f2be0949303266

    export FATCAT_AUTH_API_TOKEN=[...]

Start small, with the random sample, no releases or human edits allowed:

    cat /srv/fatcat/datasets/container_issnl_dupes.sample.json \
        | python -m fatcat_tools.mergers.containers --editgroup-description-override "Automated merging of duplicate container entities with the same ISSN-L" merge-containers -
    # Counter({'updated-entities': 307, 'skip-container-release-count': 251, 'lines': 100, 'merged': 99, 'skip': 1, 'skip-not-active-entity': 1, 'updated-total': 0})

Continue with the full batch with the 100 cut-off (original plan):

    cat /srv/fatcat/datasets/container_issnl_dupes.json \
        | python -m fatcat_tools.mergers.containers --editgroup-description-override "Automated merging of duplicate container entities with the same ISSN-L" merge-containers - --max-container-releases 100
    # Counter({'updated-entities': 2025, 'lines': 835, 'merged': 732, 'skip-container-release-count': 112, 'skip': 103, 'skip-not-active-entity': 101, 'skip-human-edited': 2, 'updated-total': 0})

Full output log:

    skipping container_v2zsod52jnhfbb5tk3bujgvhn4: release count 189
    skipping container_sk4nevyiwfgk3hwdy6h3rgk5ou: release count 212
    skipping container_3sdew7wvcvdi3ancgeemu2qhtu: release count 175
    skipping container_uffot534onejpohj3l4m36pyu4: release count 280
    skipping container_zd7uysnu5nejxjmvmieehodd3a: release count 150
    skipping container_lbskm3w74nhkvobfy3kefaqcxq: release count 388
    skipping container_yf76l2b5d5hu3hlr45e5qm3tai: release count 271
    skipping container_ykoitf76dbe5lbhkgozjay3krq: release count 405
    skipping container_z2xwk5y4oredlg7oaxcsd266ie: release count 203
    skipping container_2cxnpvvjkvfdxcz2nb3qkjucei: release count 464
    skipping container_5yq3233lrfeclbiqa54ed5il6y: release count 138
    skipping container_asoaslvd5nghlabgerd3rt7ffq: release count 398
    skipping container_nvvtdqsmdvfilfuwsg44dl4gdy: release count 115
    skipping container_gxkfhz3ywzglngcj46m2mfqeri: release count 333
    skipping container_twvb6baapfaotnhhe6xynd7n6q: release count 249
    skipping container_gufxbxnddfdpjag36erclh476i: release count 127
    skipping container_qthwmftdgjc2fgrzx6w7py2mqe: release count 138
    skipping container_dbpyvcd245ednlpc5f4eo33u4u: release count 1097
    skipping container_c3h2ajl72fhf7bwzgsntpipd6u: release count 264
    skipping container_vin6xe6ojndyzfl23rkzbldxym: release count 218
    skipping container_eef7zpisarfcbdmgp3pyixchom: release count 175
    skipping container_ls56ijz6dre7dfu5sbg32pi7pq: release count 456
    skipping container_rftmo4qx6zfpxbuaellrlypqza: release count 326
    skipping container_uhuemivuxffbdhdqae75bru7ie: release count 235
    skipping container_4tyenz4zibemzpmodpefemlbia: release count 117
    skipping container_zznalfkafzdo5p32z3qe5hflkm: release count 193
    skipping container_3kio7nj2enbodc5mcfefalbbti: release count 181
    skipping container_gxk755hay5bwvbllydcetcupqq: release count 119
    skipping container_eixvqbgdprej3bm4bqouiqzp3e: release count 247
    skipping container_rmmzgzkswjhitkwrv6zxgmji5i: release count 157
    skipping container_pkq6njkgvrekham3qipbjd6czu: release count 140
    skipping container_hkpd4a467bepho76ifji6jg4ue: release count 258
    skipping container_dfxmy3gx5fe4ff6nr2tiemiyyi: release count 101
    skipping container_eqqja4xhfzatrou3dkpjryrmaq: release count 114
    skipping container_sku6jjds5ngwrhqwlxfq2jvgzu: release count 1135
    skipping container_qyh3ncxjqzfmlajauwcsqqlb5q: release count 265
    skipping container_syh4k3q3yjf7hcva3ficd2g4dy: release count 262
    skipping container_ptmjhxoyfffjzbwkbe65fle4qm: release count 201
    skipping container_3tzyaivyerexflyiy463i5q4x4: release count 466
    skipping container_bbwsceba7rh2fgiyhbzspyykxe: release count 110
    skipping container_e7yk3xwws5dnvk7osdoxpulv6m: release count 200
    skipping container_hmyaeocgsze35p5vo5u6q4o7jm: release count 119
    skipping container_ireoyurf5fcgfeem22ng7liel4: release count 313
    skipping container_dkvv2nptnza3pmbiojxo2syom4: release count 274
    skipping container_hyzwgzfumvaj7pqkrh65e5ev6q: release count 261
    skipping container_kyfdgd22k5a53nm2jfjceklbly: release count 125
    skipping container_avfetikkhnb53asxdkowshucce: release count 239
    skipping container_7wrb5pc3tzcjfho3t74ab7ijjq: release count 164
    skipping container_27wax7sd4vbkrfl3mbe227ui5u: release count 231
    skipping container_4mwjbsjlq5c7la7dvmqbegtslq: release count 122
    skipping container_udtqmaqkcvhbnjo3vhfpbubaie: release count 106
    skipping container_y7hcehu5izdbpi65f4pxm6cqla: release count 170
    skipping container_ddgokru3vvh3nglqkfzx5orlfe: release count 247
    skipping container_qhtesw5bb5fhvpw7gymydznhne: release count 215
    skipping container_62jks7u4vjbrpl4sycma7yzhru: release count 314
    skipping container_knhzl5oq6resrghowrxocj52mq: release count 123
    skipping container_yfgztqw72vhabfohcrr77v6b3a: release count 167
    skipping container_i7v5yf24erezljqlu5dsvctulm: release count 174
    skipping container_75jngrgwnbb4tn22c5qk3fxo4i: release count 198
    skipping container_3zgq2ml4jjc5jh5w4mrx44qda4: release count 162
    skipping container_xbdlayhr2zcajfsmgj7niiylqm: release count 120
    skipping container_cpjsvpcsgrfy5ms5czaj4fcgbm: release count 228
    skipping container_o44k5mmoxzdtnozysjun66slju: release count 179
    skipping container_qph344zobjev7kmke2ozk33mwi: release count 241
    skipping container_xkii4g5qnrdnleyhhwvd4uoz2i: release count 108
    skipping container_vrdwdxnidbdiracilr3gr5v3cm: release count 204
    skipping container_4gw5lyn3xna2jajzbxot6fifdq: release count 101
    skipping container_t7sx4zbwyzhbjbzsvrazlwtbvy: release count 199
    skipping container_xxnbdrcamrgfbhydxv6v3jy2fy: release count 101
    skipping container_usm5yyjmcfgmjcu4lcy3f6sade: release count 115
    skipping container_fqmqy4ws5fcf7axos7esy2fcce: release count 191
    skipping container_wadmkiqbt5hhjfpm6lbn47agyu: release count 114
    skipping container_ujc3wxewfraizaffzouy2mxaau: release count 136
    skipping container_um6qxikn3nhlpoiovdrww5gp4q: release count 144
    skipping container_bykun5gbb5cgnfo25j5jzm5iai: release count 320
    skipping container_amyo2w2zgjafxbxvmlb5hfndoy: release count 109
    skipping container_mfbbi2zyhrhvvjh45mvd4fz5qy: release count 119
    skipping container_jftmuuvc2fa6xcu77zjsyrs5b4: release count 251
    skipping container_dqrilnh3ord7fachdd4vaktvpe: release count 143
    skipping container_oj4ahzx3mvcbrggtpa6sqnjtny: release count 160
    skipping container_p452ltmuhbdnlaz6donkkr5une: release count 166
    skipping container_dmsweozizzgqza4sd5byomyqve: release count 123
    skipping container_xckj2u5pdrevxei43pgqe4c4sq: release count 119
    skipping container_5qpk5qalazcmjooeruos7zck5i: release count 194
    skipping container_eaqcxaog3jgjhjmpf5kdxrmmcm: release count 149
    skipping container_33tn3bydy5dvtkeoe5falrto6q: release count 115
    skipping container_i4t74quqcjdgplmeyga5payqhm: release count 118
    skipping container_2olekyosjzer5ds3s7gc3zb624: release count 200
    skipping container_yh4uvtwy7fa27ibgddi3qmewg4: release count 199
    skipping container_qjidb4al4japfcbxce2dw627ni: release count 545
    skipping container_y2cy2ajbwzbo3d6lwyseqig2ce: release count 191
    skipping container_abehsncrarf4ffttto4toaayay: release count 440
    skipping container_elmhdn55tbfgvod2oic6xlwr3e: release count 159
    skipping container_ylzwxwnwpfdf3br2xgwobkqjwa: release count 239
    skipping container_g3drm2f6mjazxc7vckimcycn5y: release count 116
    skipping container_33f3tq7re5abtafbxdbzd6jpxi: release count 210
    skipping container_jtx5kwkiwbcezbqk7gwhor53wm: release count 265
    skipping container_rrikfc42xrg6vl7lnhlf26mtxm: release count 150
    skipping container_pzqxtxvco5fhlj7k23bqw374yu: release count 106
    skipping container_xde4qg5sh5djzdnawhou5dhb4u: release count 366
    skipping container_omshcvxszffgdlspsehxlyksse: release count 165
    skipping container_ekjk6clxtfgmbgvkkgixeaadfy: release count 224
    skipping container_tyindzesqbg27nblbfwa6hptku: release count 190
    skipping container_ho6rwlwmsnepri2amemptyvnhm: release count 134
    skipping container_uqezfmzrnre5zdtspj6wphlvbm: release count 189
    skipping container_6xslqgviybh7lkl2edk5ghtvum: release count 141
    skipping container_3wgusadkfjhf3ejpn2nnn3v5ka: release count 307
    skipping container_cbiclofvcbhmjcxubvkg5vyhxi: release count 1368
    skipping container_ahvaawe3fvd3zb4mfiaado7wyy: human edited
    skipping container_ejgmmh6ozvf7hczdpok2ab3fni: release count 102
    skipping container_s3h3himubfaftbhcpfzgc354zq: release count 174
    skipping container_wzabvii33zakjbf24dspb2tst4: release count 122
    skipping container_kly2opavcjejxesqru5tib455q: release count 159
    skipping container_zw7uyavzgzdtleeliwh5cnp23q: human edited

Check progress (after partial ident dump and `fatcat_export`, just for containers):

    zcat container_export.json.gz \
        | jq '[.issnl, .ident] | @tsv' -r \
        | sort -S 4G \
        | uniq -D -w 9 \
        > issnl_ident.dupes.tsv

    wc -l issnl_ident.dupes.tsv
    # 4

    cut -f1 issnl_ident.dupes.tsv | uniq | wc -l
    # 2

Huh, this doesn't seem right. What happened to all the `skip-container-release-count`, etc?

Ah, the `skip-container-release-count` condition 1) can include the primary
release and 2) is only printed, never does anything (!).

Well, yikes, but no indication anything has gone really wrong.

Let's manually look at the human-edited cases:

    # 2709-2143
    # zw7uyavzgzdtleeliwh5cnp23q is better primary
    {
      "duplicate_ids": [
        "osev5sprhfdwdkoyypihlspsme",
        "zw7uyavzgzdtleeliwh5cnp23q"
      ],
      "entity_type": "container",
      "evidence": {
        "extid": "2709-2143",
        "extid_type": "issnl"
      },
      "primary_id": null
    }

    # 2567-8833
    # ahvaawe3fvd3zb4mfiaado7wyy is better primary (no releases)
    {
      "duplicate_ids": [
        "ahvaawe3fvd3zb4mfiaado7wyy",
        "e7q6rq2ebbbp3m3xxccqxi3fnm"
      ],
      "entity_type": "container",
      "evidence": {
        "extid": "2567-8833",
        "extid_type": "issnl"
      },
      "primary_id": null
    }

In retrospect, I probably should have also enforced title fuzzy match in
automated merging, instead of just ISSN-L match.