1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
|
We want to find cases where the `file_edit` metadata for a file entity does not match the DOI of the release it is linked to.
## Background
Seems to mostly happen when the `link_source_id` DOI is not found during a
fatcat 'lookup', eg due to upstream DOI metadata not including a 'title' field
(required by fatcat schema). Many of these seem to be 'journal-issue' or
something like that. Presumably upstream (unpaywall) crawled and found some PDF
for the DOI and indexes that.
TODO: what was the specific code bug that caused this? time window of edits?
https://github.com/internetarchive/fatcat/commit/d8cf02dbdc6848be4de2710e54f4463d702b6e7c
TODO: how many empty-title Crossref DOIs are there? perhaps we should import these after all, in some situations? eg, pull `html_biblio`?
TODO: examples of other mismatches? edits from the same timespan with mismatching identifiers
## Queries
file_edit
ident_id
rev_id
extra_json->>'link_source_id'::text
file_rev_release
file_rev
target_release_ident_id
release_ident
id
rev_id
release_rev
id
doi
Query some samples:
SELECT file_edit.ident_id as file_ident, release_ident.id as release_ident, file_edit.extra_json->>'link_source_id' as file_edit_doi, release_rev.doi as release_doi
FROM file_edit
LEFT JOIN file_rev_release ON file_edit.rev_id = file_rev_release.file_rev
LEFT JOIN release_ident ON file_rev_release.target_release_ident_id = release_ident.id
LEFT JOIN release_rev ON release_rev.id = release_ident.rev_id
WHERE
file_edit.extra_json->>'ingest_request_source' = 'unpaywall'
AND release_rev.doi != file_edit.extra_json->>'link_source_id'
LIMIT 20;
file_ident | release_ident | file_edit_doi | release_doi
--------------------------------------+--------------------------------------+---------------------------------------+-------------------------------------
8fe6eb2a-da3a-4e1c-b281-85fad3cae601 | 0a0cbabd-3f04-4296-94d5-b61c27fddee4 | 10.7167/2013/520930/dataset/3 | 10.7167/2013/520930
cbb2d4e7-a9a7-41df-be1d-faf9dd552ee1 | d7b356c7-de6b-4885-b37d-ea603849a30a | 10.1525/elementa.124.t1 | 10.1525/elementa.124
26242648-2a7b-42ea-b2ca-e4fba8f0a2c7 | d2647e07-fbb3-47f5-b06a-de2872bf84ec | 10.1023/a:1000301702170 | 10.1007/bfb0085374
f774854a-3939-4aa9-bf73-7e0573908b88 | e19ce792-422d-4feb-92ba-dd8536a60618 | 10.32835/2223-5752.2018.17.14-21 | 10.31652/2412-1142-2018-50-151-156
615ddf8c-2d91-4c1c-b04a-76b69f07b300 | e7af8377-9542-4711-9444-872f45521dc1 | 10.5644/sjm.10.1.00 | 10.5644/sjm.10.1.01
43fa1b53-eddb-4f31-96af-bc9bd3bb31b6 | e1ec5d8a-ff88-49f4-8540-73ee4906e119 | 10.1136/bjo.81.11.934 | 10.1038/sj.bjc.6690790
64b04cbc-ab3d-4cef-aff5-9d6b2532c47a | 43c387d1-e4c9-49f3-9995-552451a9a246 | 10.1023/a:1002015330987 | 10.1006/jnth.1998.2271
31e6e1a8-8457-4c93-82d3-12a51c0dc1bb | a2ec4305-63d2-4164-8470-02bf5c4ba74c | 10.1186/1742-4690-2-s1-s84 | 10.1038/nm0703-847
3461024a-1800-44a0-a7be-73f14bac99dd | 6849d39f-6bae-460b-8a9d-d6b86fbac2d3 | 10.17265/2328-2150/2014.11 | 10.1590/s0074-02762002000900009
c2707e43-6a4b-4f5d-8fb4-babe16b42b4d | 82127faf-d125-4cd4-88b9-9a8618392019 | 10.1055/s-005-28993 | 10.1055/s-0034-1380711
0fe72304-1884-4dde-98c6-7cf3aec5bf31 | e3fe46cd-0205-42e4-8255-490e0eba38ea | 10.1787/5k4c1s0z2gs1-en | 10.1787/9789282105931-2-en
686a1508-b6c5-4060-9035-1dd8a4b85be1 | dcecd03d-e0f6-40ce-a376-7e64bd90bdca | 10.15406/jlprr.2.3 | 10.15406/jlprr.2015.02.00039
9403bc0e-2a50-4c8a-ab79-943bce20e583 | 3073dee2-c8f7-42ed-a727-12306d86fa35 | 10.1787/5jrtgzfl6g9w-en | 10.1111/1468-2443.00004
9a080a6f-aaaf-4f4b-acab-3eda2bfacc22 | af28a167-2ddc-43db-98ca-1cc29b863f9f | 10.13040/ijpsr.0975-8232.4(6).2094-05 | 10.1002/chin.201418290
810f1a00-623f-4a5c-88e2-7240232ae8f9 | 639d4ddd-ef24-49dc-8d4e-ea626b0b85f8 | 10.13040/ijpsr.0975-8232.5(6).2216-24 | 10.1006/phrs.1997.0184
10.13040/IJPSR.0975-8232.5(6).2216-24
72752c59-9532-467d-8b4f-fe4d994bcff5 | cc3db36c-27e9-45bf-96b0-56d1119b93d6 | 10.22201/facmed.2007865xe.2018.26 | 10.22201/facmed.2007865x.2018.26.01
a0740fc2-a1db-4bc8-ac98-177ccebde24f | 0a7136ac-86ad-4636-9c2a-b56a6d5a3a27 | 10.24966/cmph-1978/vol6iss1 | 10.1093/milmed/146.4.283
5b05df6a-7a0e-411b-a4e1-0081d7522673 | dc0cb585-c709-4990-a82e-c7c9b254fc74 | 10.15406/jig.3.1 | 10.15406/jig.2016.03.00039
011964dd-24a6-4d25-b9dd-921077f1947e | deabab6d-7381-4567-bf56-5294ab081e20 | 10.17265/1537-1506/2013.01 | 10.1109/icsssm.2016.7538527
4d7efa59-30c7-4015-bea9-d7df171a97ed | a05449a0-34d7-4f45-9b92-bcf71db4918d | 10.14413/herj.2017.01.09 | 10.14413/herj.2017.01.09.
(20 rows)
Total counts, broken DOIs, by source:
SELECT file_edit.extra_json->>'ingest_request_source' as source, COUNT(*) as broken_files
FROM file_edit
LEFT JOIN file_rev_release ON file_edit.rev_id = file_rev_release.file_rev
LEFT JOIN release_ident ON file_rev_release.target_release_ident_id = release_ident.id
LEFT JOIN release_rev ON release_rev.id = release_ident.rev_id
WHERE
file_edit.extra_json->>'link_source_id' IS NOT NULL
AND file_edit.extra_json->>'link_source_id' LIKE '10.%'
AND release_rev.doi != file_edit.extra_json->>'link_source_id'
GROUP BY file_edit.extra_json->>'ingest_request_source';
source | broken_files
------------------+--------------
fatcat-changelog | 872
unpaywall | 227954
(2 rows)
## Cleanup Pseudocode
Dump file idents from above SQL queries. Transform from UUID to fatcat ident.
For each file ident, fetch file entity, with releases expanded. Also fetch edit
history. Filter by:
- single release for file
- single edit in history, matching broken agent
- edit metadata indicates a DOI, which doesn't match release DOI (or release doesn't have DOI?)
If all that is the case, do a lookup by the correct DOI. If there is a match,
update the file entity to point to that release (and only that release).
Operate in batches.
Write cleanup bot, run in QA to test. Run SQL query again and iterate if some
files were not updated. Merge to master, run full dump and update in prod.
## Export Matches Needing Fix
COPY (
SELECT row_to_json(row.*) FROM (
SELECT file_edit.ident_id as file_ident, release_ident.id as wrong_release_ident, file_edit.extra_json as edit_extra
FROM file_edit
LEFT JOIN file_rev_release ON file_edit.rev_id = file_rev_release.file_rev
LEFT JOIN release_ident ON file_rev_release.target_release_ident_id = release_ident.id
LEFT JOIN release_rev ON release_rev.id = release_ident.rev_id
WHERE
file_edit.extra_json->>'link_source_id' IS NOT NULL
AND file_edit.extra_json->>'link_source_id' LIKE '10.%'
AND release_rev.doi != file_edit.extra_json->>'link_source_id'
-- LIMIT 20;
) as row
) TO '/srv/fatcat/snapshots/file_release_bugfix_20211105.unfiltered.json';
# COPY 228,827
Then need to do a `sed` pass, and a `rg` pass, to filter out bad escapes:
cat /srv/fatcat/snapshots/file_release_bugfix_20211105.unfiltered.json \
| sed 's/\\"/\"/g' \
| rg -v "\\\\" \
| rg '^\{' \
| pv -l \
> /srv/fatcat/snapshots/file_release_bugfix_20211105.json
# 228k 0:00:00 [ 667k/s]
wc -l /srv/fatcat/snapshots/file_release_bugfix_20211105.json
# 228,826
And create a sample file:
shuf -n10000 /srv/fatcat/snapshots/file_release_bugfix_20211105.json > /srv/fatcat/snapshots/file_release_bugfix_20211105.10k_sample.json
## Testing in QA
export FATCAT_AUTH_WORKER_CLEANUP=[...]
head -n10 /srv/fatcat/snapshots/file_release_bugfix_20211105.10k_sample.json \
| python -m fatcat_tools.cleanups.file_release_bugfix -
# Counter({'total': 10, 'update': 10, 'skip': 0, 'insert': 0, 'exists': 0})
file_wmjzuybuorfrjdgnr2d32vg5va: was wrong, now correct, no other files
file_wgszcwehlffnzmoypr4l2yhvza: was correct, now correct, no other files. multiple articles in single PDF
file_p4r5sbminzgrhn4yaiqyr7ahwi: was correct, now wrong (!!!)
doi:10.1055/s-0036-1579844
PDF says: 10.4103/0028-3886.158210
unpaywall still has this incorrect linkage
file_n5jtvrnodbfdbccl5d6hshhvw4: now stub, was wrong
doi:10.19080/bboaj.2018.04.555640 release_suvtcm7hdbbr3fczcxxt4thgoi seems correct?
doi:10.19080/bboaj.4.4 is the bad DOI, not in fatcat. is registered, as an entire issue
file_kt5fv4d5lrbk7j3dxxgcm7hph4: was wrong, now correct
file_jq4juugnynhsdkh3whkcjod46q: was wrong, now correct
file_ef7w6y7k4jhjhgwmfai37yjjmm: was wrong, now correct
file_ca2svhd6knfnff4dktuonp45du: was correct, now correct. complicated, multiple DOIs/copies/files of same work
file_borst2aewvcwzhucnyth2vf3lm: was correct, now correct. complicated, multiple DOIs, single file of same work
Overall, seems like this might not be as much of an obvious improvement as
hoped! But still progress and more correct.
head -n1000 /srv/fatcat/snapshots/file_release_bugfix_20211105.10k_sample.json \
| python -m fatcat_tools.cleanups.file_release_bugfix -
# Counter({'total': 1000, 'update': 929, 'skip-existing-history-updated': 58, 'skip-existing-fixed': 10, 'skip': 3, 'skip-link-source': 3, 'insert': 0, 'exists': 0})
Looking at `skip-link-source`, it is cases where `link_source` is 'doi' not
'fatcat-changelog'. Will update filter behavior, 'fatcat-changelog' is a
`ingest_request_source`.
Checking another 10 examples. They all seem to end up as correct matches.
Did a small update and running the whole batch:
cat /srv/fatcat/snapshots/file_release_bugfix_20211105.10k_sample.json \
| python -m fatcat_tools.cleanups.file_release_bugfix -
# Counter({'total': 10000, 'update': 8499, 'skip-existing-fixed': 939, 'skip-existing-history-updated': 560, 'skip': 2, 'skip-wrong-release-is-ok': 2, 'insert': 0, 'exists': 0})
I think this is ready to go! Example with parallel:
cat /srv/fatcat/snapshots/file_release_bugfix_20211105.10k_sample.json \
| parallel -j8 --linebuffer --round-robin --pipe python -m fatcat_tools.cleanups.file_release_bugfix -
|