summaryrefslogtreecommitdiffstats
path: root/notes/cleanups/file_release_ingest_bug.md
blob: 8690157acc1c3f9defe38ad134f96220c7b0f3d7 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144

We want to find cases where the `file_edit` metadata for a file entity does not match the DOI of the release it is linked to.

## Background


Seems to mostly happen when the `link_source_id` DOI is not found during a
fatcat 'lookup', eg due to upstream DOI metadata not including a 'title' field
(required by fatcat schema). Many of these seem to be 'journal-issue' or
something like that. Presumably upstream (unpaywall) crawled and found some PDF
for the DOI and indexes that.

TODO: what was the specific code bug that caused this? time window of edits?
https://github.com/internetarchive/fatcat/commit/d8cf02dbdc6848be4de2710e54f4463d702b6e7c

TODO: how many empty-title Crossref DOIs are there? perhaps we should import these after all, in some situations? eg, pull `html_biblio`?
TODO: examples of other mismatches? edits from the same timespan with mismatching identifiers

## Queries

    file_edit
        ident_id
        rev_id
        extra_json->>'link_source_id'::text
    file_rev_release
        file_rev
        target_release_ident_id
    release_ident
        id
        rev_id
    release_rev
        id
        doi

Query some samples:

    SELECT file_edit.ident_id as file_ident, release_ident.id as release_ident, file_edit.extra_json->>'link_source_id' as file_edit_doi, release_rev.doi as release_doi
    FROM file_edit
        LEFT JOIN file_rev_release ON file_edit.rev_id = file_rev_release.file_rev
        LEFT JOIN release_ident ON file_rev_release.target_release_ident_id = release_ident.id
        LEFT JOIN release_rev ON release_rev.id = release_ident.rev_id
    WHERE
        file_edit.extra_json->>'ingest_request_source' = 'unpaywall'
        AND release_rev.doi != file_edit.extra_json->>'link_source_id'
    LIMIT 20;

                  file_ident              |            release_ident             |             file_edit_doi             |             release_doi             
    --------------------------------------+--------------------------------------+---------------------------------------+-------------------------------------
     8fe6eb2a-da3a-4e1c-b281-85fad3cae601 | 0a0cbabd-3f04-4296-94d5-b61c27fddee4 | 10.7167/2013/520930/dataset/3         | 10.7167/2013/520930
     cbb2d4e7-a9a7-41df-be1d-faf9dd552ee1 | d7b356c7-de6b-4885-b37d-ea603849a30a | 10.1525/elementa.124.t1               | 10.1525/elementa.124
     26242648-2a7b-42ea-b2ca-e4fba8f0a2c7 | d2647e07-fbb3-47f5-b06a-de2872bf84ec | 10.1023/a:1000301702170               | 10.1007/bfb0085374
     f774854a-3939-4aa9-bf73-7e0573908b88 | e19ce792-422d-4feb-92ba-dd8536a60618 | 10.32835/2223-5752.2018.17.14-21      | 10.31652/2412-1142-2018-50-151-156
     615ddf8c-2d91-4c1c-b04a-76b69f07b300 | e7af8377-9542-4711-9444-872f45521dc1 | 10.5644/sjm.10.1.00                   | 10.5644/sjm.10.1.01
     43fa1b53-eddb-4f31-96af-bc9bd3bb31b6 | e1ec5d8a-ff88-49f4-8540-73ee4906e119 | 10.1136/bjo.81.11.934                 | 10.1038/sj.bjc.6690790
     64b04cbc-ab3d-4cef-aff5-9d6b2532c47a | 43c387d1-e4c9-49f3-9995-552451a9a246 | 10.1023/a:1002015330987               | 10.1006/jnth.1998.2271
     31e6e1a8-8457-4c93-82d3-12a51c0dc1bb | a2ec4305-63d2-4164-8470-02bf5c4ba74c | 10.1186/1742-4690-2-s1-s84            | 10.1038/nm0703-847
     3461024a-1800-44a0-a7be-73f14bac99dd | 6849d39f-6bae-460b-8a9d-d6b86fbac2d3 | 10.17265/2328-2150/2014.11            | 10.1590/s0074-02762002000900009
     c2707e43-6a4b-4f5d-8fb4-babe16b42b4d | 82127faf-d125-4cd4-88b9-9a8618392019 | 10.1055/s-005-28993                   | 10.1055/s-0034-1380711
     0fe72304-1884-4dde-98c6-7cf3aec5bf31 | e3fe46cd-0205-42e4-8255-490e0eba38ea | 10.1787/5k4c1s0z2gs1-en               | 10.1787/9789282105931-2-en
     686a1508-b6c5-4060-9035-1dd8a4b85be1 | dcecd03d-e0f6-40ce-a376-7e64bd90bdca | 10.15406/jlprr.2.3                    | 10.15406/jlprr.2015.02.00039
     9403bc0e-2a50-4c8a-ab79-943bce20e583 | 3073dee2-c8f7-42ed-a727-12306d86fa35 | 10.1787/5jrtgzfl6g9w-en               | 10.1111/1468-2443.00004
     9a080a6f-aaaf-4f4b-acab-3eda2bfacc22 | af28a167-2ddc-43db-98ca-1cc29b863f9f | 10.13040/ijpsr.0975-8232.4(6).2094-05 | 10.1002/chin.201418290
     810f1a00-623f-4a5c-88e2-7240232ae8f9 | 639d4ddd-ef24-49dc-8d4e-ea626b0b85f8 | 10.13040/ijpsr.0975-8232.5(6).2216-24 | 10.1006/phrs.1997.0184
                                                                                   10.13040/IJPSR.0975-8232.5(6).2216-24
     72752c59-9532-467d-8b4f-fe4d994bcff5 | cc3db36c-27e9-45bf-96b0-56d1119b93d6 | 10.22201/facmed.2007865xe.2018.26     | 10.22201/facmed.2007865x.2018.26.01
     a0740fc2-a1db-4bc8-ac98-177ccebde24f | 0a7136ac-86ad-4636-9c2a-b56a6d5a3a27 | 10.24966/cmph-1978/vol6iss1           | 10.1093/milmed/146.4.283
     5b05df6a-7a0e-411b-a4e1-0081d7522673 | dc0cb585-c709-4990-a82e-c7c9b254fc74 | 10.15406/jig.3.1                      | 10.15406/jig.2016.03.00039
     011964dd-24a6-4d25-b9dd-921077f1947e | deabab6d-7381-4567-bf56-5294ab081e20 | 10.17265/1537-1506/2013.01            | 10.1109/icsssm.2016.7538527
     4d7efa59-30c7-4015-bea9-d7df171a97ed | a05449a0-34d7-4f45-9b92-bcf71db4918d | 10.14413/herj.2017.01.09              | 10.14413/herj.2017.01.09.
    (20 rows)

Total counts, broken DOIs, by source:

    SELECT file_edit.extra_json->>'ingest_request_source' as source, COUNT(*) as broken_files
    FROM file_edit
        LEFT JOIN file_rev_release ON file_edit.rev_id = file_rev_release.file_rev
        LEFT JOIN release_ident ON file_rev_release.target_release_ident_id = release_ident.id
        LEFT JOIN release_rev ON release_rev.id = release_ident.rev_id
    WHERE
        file_edit.extra_json->>'link_source_id' IS NOT NULL
        AND file_edit.extra_json->>'link_source_id' LIKE '10.%'
        AND release_rev.doi != file_edit.extra_json->>'link_source_id'
    GROUP BY file_edit.extra_json->>'ingest_request_source';


          source      | broken_files
    ------------------+--------------
     fatcat-changelog |          872
     unpaywall        |       227954
    (2 rows)

## Cleanup Pseudocode

Dump file idents from above SQL queries. Transform from UUID to fatcat ident.

For each file ident, fetch file entity, with releases expanded. Also fetch edit
history. Filter by:

- single release for file
- single edit in history, matching broken agent
- edit metadata indicates a DOI, which doesn't match release DOI (or release doesn't have DOI?)

If all that is the case, do a lookup by the correct DOI. If there is a match,
update the file entity to point to that release (and only that release).
Operate in batches.

Write cleanup bot, run in QA to test. Run SQL query again and iterate if some
files were not updated. Merge to master, run full dump and update in prod.

## Export Matches Needing Fix

    COPY (
        SELECT row_to_json(row.*) FROM (
            SELECT file_edit.ident_id as file_ident, release_ident.id as wrong_release_ident, file_edit.extra_json as edit_extra
            FROM file_edit
                LEFT JOIN file_rev_release ON file_edit.rev_id = file_rev_release.file_rev
                LEFT JOIN release_ident ON file_rev_release.target_release_ident_id = release_ident.id
                LEFT JOIN release_rev ON release_rev.id = release_ident.rev_id
            WHERE
                file_edit.extra_json->>'link_source_id' IS NOT NULL
                AND file_edit.extra_json->>'link_source_id' LIKE '10.%'
                AND release_rev.doi != file_edit.extra_json->>'link_source_id'
            -- LIMIT 20;
        ) as row
    ) TO '/srv/fatcat/snapshots/file_release_bugfix_20211105.unfiltered.json';
    # COPY 228,827

Then need to do a `sed` pass, and a `rg` pass, to filter out bad escapes:

    cat /srv/fatcat/snapshots/file_release_bugfix_20211105.unfiltered.json \
        | sed 's/\\"/\"/g' \
        | rg -v "\\\\" \
        | rg '^\{' \
        | pv -l \
        > /srv/fatcat/snapshots/file_release_bugfix_20211105.json
    # 228k 0:00:00 [ 667k/s]

    wc -l /srv/fatcat/snapshots/file_release_bugfix_20211105.json
    # 228,826

And create a sample file:

    shuf -n10000 /srv/fatcat/snapshots/file_release_bugfix_20211105.json > /srv/fatcat/snapshots/file_release_bugfix_20211105.10k_sample.json