aboutsummaryrefslogtreecommitdiffstats
path: root/extra/cleanups/file_release_ingest_bug.md
blob: d818905b0c6802209d1f92a5e1440b922c11cde3 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192

We want to find cases where the `file_edit` metadata for a file entity does not match the DOI of the release it is linked to.

## Background


Seems to mostly happen when the `link_source_id` DOI is not found during a
fatcat 'lookup', eg due to upstream DOI metadata not including a 'title' field
(required by fatcat schema). Many of these seem to be 'journal-issue' or
something like that. Presumably upstream (unpaywall) crawled and found some PDF
for the DOI and indexes that.

TODO: what was the specific code bug that caused this? time window of edits?
https://github.com/internetarchive/fatcat/commit/d8cf02dbdc6848be4de2710e54f4463d702b6e7c

TODO: how many empty-title Crossref DOIs are there? perhaps we should import these after all, in some situations? eg, pull `html_biblio`?
TODO: examples of other mismatches? edits from the same timespan with mismatching identifiers

## Queries

    file_edit
        ident_id
        rev_id
        extra_json->>'link_source_id'::text
    file_rev_release
        file_rev
        target_release_ident_id
    release_ident
        id
        rev_id
    release_rev
        id
        doi

Query some samples:

    SELECT file_edit.ident_id as file_ident, release_ident.id as release_ident, file_edit.extra_json->>'link_source_id' as file_edit_doi, release_rev.doi as release_doi
    FROM file_edit
        LEFT JOIN file_rev_release ON file_edit.rev_id = file_rev_release.file_rev
        LEFT JOIN release_ident ON file_rev_release.target_release_ident_id = release_ident.id
        LEFT JOIN release_rev ON release_rev.id = release_ident.rev_id
    WHERE
        file_edit.extra_json->>'ingest_request_source' = 'unpaywall'
        AND release_rev.doi != file_edit.extra_json->>'link_source_id'
    LIMIT 20;

                  file_ident              |            release_ident             |             file_edit_doi             |             release_doi             
    --------------------------------------+--------------------------------------+---------------------------------------+-------------------------------------
     8fe6eb2a-da3a-4e1c-b281-85fad3cae601 | 0a0cbabd-3f04-4296-94d5-b61c27fddee4 | 10.7167/2013/520930/dataset/3         | 10.7167/2013/520930
     cbb2d4e7-a9a7-41df-be1d-faf9dd552ee1 | d7b356c7-de6b-4885-b37d-ea603849a30a | 10.1525/elementa.124.t1               | 10.1525/elementa.124
     26242648-2a7b-42ea-b2ca-e4fba8f0a2c7 | d2647e07-fbb3-47f5-b06a-de2872bf84ec | 10.1023/a:1000301702170               | 10.1007/bfb0085374
     f774854a-3939-4aa9-bf73-7e0573908b88 | e19ce792-422d-4feb-92ba-dd8536a60618 | 10.32835/2223-5752.2018.17.14-21      | 10.31652/2412-1142-2018-50-151-156
     615ddf8c-2d91-4c1c-b04a-76b69f07b300 | e7af8377-9542-4711-9444-872f45521dc1 | 10.5644/sjm.10.1.00                   | 10.5644/sjm.10.1.01
     43fa1b53-eddb-4f31-96af-bc9bd3bb31b6 | e1ec5d8a-ff88-49f4-8540-73ee4906e119 | 10.1136/bjo.81.11.934                 | 10.1038/sj.bjc.6690790
     64b04cbc-ab3d-4cef-aff5-9d6b2532c47a | 43c387d1-e4c9-49f3-9995-552451a9a246 | 10.1023/a:1002015330987               | 10.1006/jnth.1998.2271
     31e6e1a8-8457-4c93-82d3-12a51c0dc1bb | a2ec4305-63d2-4164-8470-02bf5c4ba74c | 10.1186/1742-4690-2-s1-s84            | 10.1038/nm0703-847
     3461024a-1800-44a0-a7be-73f14bac99dd | 6849d39f-6bae-460b-8a9d-d6b86fbac2d3 | 10.17265/2328-2150/2014.11            | 10.1590/s0074-02762002000900009
     c2707e43-6a4b-4f5d-8fb4-babe16b42b4d | 82127faf-d125-4cd4-88b9-9a8618392019 | 10.1055/s-005-28993                   | 10.1055/s-0034-1380711
     0fe72304-1884-4dde-98c6-7cf3aec5bf31 | e3fe46cd-0205-42e4-8255-490e0eba38ea | 10.1787/5k4c1s0z2gs1-en               | 10.1787/9789282105931-2-en
     686a1508-b6c5-4060-9035-1dd8a4b85be1 | dcecd03d-e0f6-40ce-a376-7e64bd90bdca | 10.15406/jlprr.2.3                    | 10.15406/jlprr.2015.02.00039
     9403bc0e-2a50-4c8a-ab79-943bce20e583 | 3073dee2-c8f7-42ed-a727-12306d86fa35 | 10.1787/5jrtgzfl6g9w-en               | 10.1111/1468-2443.00004
     9a080a6f-aaaf-4f4b-acab-3eda2bfacc22 | af28a167-2ddc-43db-98ca-1cc29b863f9f | 10.13040/ijpsr.0975-8232.4(6).2094-05 | 10.1002/chin.201418290
     810f1a00-623f-4a5c-88e2-7240232ae8f9 | 639d4ddd-ef24-49dc-8d4e-ea626b0b85f8 | 10.13040/ijpsr.0975-8232.5(6).2216-24 | 10.1006/phrs.1997.0184
                                                                                   10.13040/IJPSR.0975-8232.5(6).2216-24
     72752c59-9532-467d-8b4f-fe4d994bcff5 | cc3db36c-27e9-45bf-96b0-56d1119b93d6 | 10.22201/facmed.2007865xe.2018.26     | 10.22201/facmed.2007865x.2018.26.01
     a0740fc2-a1db-4bc8-ac98-177ccebde24f | 0a7136ac-86ad-4636-9c2a-b56a6d5a3a27 | 10.24966/cmph-1978/vol6iss1           | 10.1093/milmed/146.4.283
     5b05df6a-7a0e-411b-a4e1-0081d7522673 | dc0cb585-c709-4990-a82e-c7c9b254fc74 | 10.15406/jig.3.1                      | 10.15406/jig.2016.03.00039
     011964dd-24a6-4d25-b9dd-921077f1947e | deabab6d-7381-4567-bf56-5294ab081e20 | 10.17265/1537-1506/2013.01            | 10.1109/icsssm.2016.7538527
     4d7efa59-30c7-4015-bea9-d7df171a97ed | a05449a0-34d7-4f45-9b92-bcf71db4918d | 10.14413/herj.2017.01.09              | 10.14413/herj.2017.01.09.
    (20 rows)

Total counts, broken DOIs, by source:

    SELECT file_edit.extra_json->>'ingest_request_source' as source, COUNT(*) as broken_files
    FROM file_edit
        LEFT JOIN file_rev_release ON file_edit.rev_id = file_rev_release.file_rev
        LEFT JOIN release_ident ON file_rev_release.target_release_ident_id = release_ident.id
        LEFT JOIN release_rev ON release_rev.id = release_ident.rev_id
    WHERE
        file_edit.extra_json->>'link_source_id' IS NOT NULL
        AND file_edit.extra_json->>'link_source_id' LIKE '10.%'
        AND release_rev.doi != file_edit.extra_json->>'link_source_id'
    GROUP BY file_edit.extra_json->>'ingest_request_source';


          source      | broken_files
    ------------------+--------------
     fatcat-changelog |          872
     unpaywall        |       227954
    (2 rows)

## Cleanup Pseudocode

Dump file idents from above SQL queries. Transform from UUID to fatcat ident.

For each file ident, fetch file entity, with releases expanded. Also fetch edit
history. Filter by:

- single release for file
- single edit in history, matching broken agent
- edit metadata indicates a DOI, which doesn't match release DOI (or release doesn't have DOI?)

If all that is the case, do a lookup by the correct DOI. If there is a match,
update the file entity to point to that release (and only that release).
Operate in batches.

Write cleanup bot, run in QA to test. Run SQL query again and iterate if some
files were not updated. Merge to master, run full dump and update in prod.

## Export Matches Needing Fix

    COPY (
        SELECT row_to_json(row.*) FROM (
            SELECT file_edit.ident_id as file_ident, release_ident.id as wrong_release_ident, file_edit.extra_json as edit_extra
            FROM file_edit
                LEFT JOIN file_rev_release ON file_edit.rev_id = file_rev_release.file_rev
                LEFT JOIN release_ident ON file_rev_release.target_release_ident_id = release_ident.id
                LEFT JOIN release_rev ON release_rev.id = release_ident.rev_id
            WHERE
                file_edit.extra_json->>'link_source_id' IS NOT NULL
                AND file_edit.extra_json->>'link_source_id' LIKE '10.%'
                AND release_rev.doi != file_edit.extra_json->>'link_source_id'
            -- LIMIT 20;
        ) as row
    ) TO '/srv/fatcat/snapshots/file_release_bugfix_20211105.unfiltered.json';
    # COPY 228,827

Then need to do a `sed` pass, and a `rg` pass, to filter out bad escapes:

    cat /srv/fatcat/snapshots/file_release_bugfix_20211105.unfiltered.json \
        | sed 's/\\"/\"/g' \
        | rg -v "\\\\" \
        | rg '^\{' \
        | pv -l \
        > /srv/fatcat/snapshots/file_release_bugfix_20211105.json
    # 228k 0:00:00 [ 667k/s]

    wc -l /srv/fatcat/snapshots/file_release_bugfix_20211105.json
    # 228,826

And create a sample file:

    shuf -n10000 /srv/fatcat/snapshots/file_release_bugfix_20211105.json > /srv/fatcat/snapshots/file_release_bugfix_20211105.10k_sample.json


## Testing in QA

    export FATCAT_AUTH_WORKER_CLEANUP=[...]

    head -n10 /srv/fatcat/snapshots/file_release_bugfix_20211105.10k_sample.json \
        | python -m fatcat_tools.cleanups.file_release_bugfix -
    # Counter({'total': 10, 'update': 10, 'skip': 0, 'insert': 0, 'exists': 0})

    file_wmjzuybuorfrjdgnr2d32vg5va: was wrong, now correct, no other files
    file_wgszcwehlffnzmoypr4l2yhvza: was correct, now correct, no other files. multiple articles in single PDF
    file_p4r5sbminzgrhn4yaiqyr7ahwi: was correct, now wrong (!!!)
        doi:10.1055/s-0036-1579844
        PDF says: 10.4103/0028-3886.158210
        unpaywall still has this incorrect linkage
    file_n5jtvrnodbfdbccl5d6hshhvw4: now stub, was wrong
        doi:10.19080/bboaj.2018.04.555640 release_suvtcm7hdbbr3fczcxxt4thgoi seems correct?
        doi:10.19080/bboaj.4.4 is the bad DOI, not in fatcat. is registered, as an entire issue
    file_kt5fv4d5lrbk7j3dxxgcm7hph4: was wrong, now correct
    file_jq4juugnynhsdkh3whkcjod46q: was wrong, now correct
    file_ef7w6y7k4jhjhgwmfai37yjjmm: was wrong, now correct
    file_ca2svhd6knfnff4dktuonp45du: was correct, now correct. complicated, multiple DOIs/copies/files of same work
    file_borst2aewvcwzhucnyth2vf3lm: was correct, now correct. complicated, multiple DOIs, single file of same work

Overall, seems like this might not be as much of an obvious improvement as
hoped! But still progress and more correct.

    head -n1000 /srv/fatcat/snapshots/file_release_bugfix_20211105.10k_sample.json \
        | python -m fatcat_tools.cleanups.file_release_bugfix -
    # Counter({'total': 1000, 'update': 929, 'skip-existing-history-updated': 58, 'skip-existing-fixed': 10, 'skip': 3, 'skip-link-source': 3, 'insert': 0, 'exists': 0})

Looking at `skip-link-source`, it is cases where `link_source` is 'doi' not
'fatcat-changelog'. Will update filter behavior, 'fatcat-changelog' is a
`ingest_request_source`.

Checking another 10 examples. They all seem to end up as correct matches.

Did a small update and running the whole batch:

    cat /srv/fatcat/snapshots/file_release_bugfix_20211105.10k_sample.json \
        | python -m fatcat_tools.cleanups.file_release_bugfix -
    # Counter({'total': 10000, 'update': 8499, 'skip-existing-fixed': 939, 'skip-existing-history-updated': 560, 'skip': 2, 'skip-wrong-release-is-ok': 2, 'insert': 0, 'exists': 0})

I think this is ready to go! Example with parallel:

    cat /srv/fatcat/snapshots/file_release_bugfix_20211105.10k_sample.json \
        | parallel -j8 --linebuffer --round-robin --pipe python -m fatcat_tools.cleanups.file_release_bugfix -