diff options
author | Bryan Newbold <bnewbold@robocracy.org> | 2021-11-29 14:33:14 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@robocracy.org> | 2021-11-29 14:33:14 -0800 |
commit | c5ea2dba358624f4c14da0a1a988ae14d0edfd59 (patch) | |
tree | 7d3934e4922439402f882a374fe477906fd41aae /notes/cleanups/file_release_ingest_bug.md | |
parent | ec2809ef2ac51c992463839c1e3451927f5e1661 (diff) | |
download | fatcat-c5ea2dba358624f4c14da0a1a988ae14d0edfd59.tar.gz fatcat-c5ea2dba358624f4c14da0a1a988ae14d0edfd59.zip |
move 'cleanups' directory from notes to extra/
Diffstat (limited to 'notes/cleanups/file_release_ingest_bug.md')
-rw-r--r-- | notes/cleanups/file_release_ingest_bug.md | 192 |
1 files changed, 0 insertions, 192 deletions
diff --git a/notes/cleanups/file_release_ingest_bug.md b/notes/cleanups/file_release_ingest_bug.md deleted file mode 100644 index d818905b..00000000 --- a/notes/cleanups/file_release_ingest_bug.md +++ /dev/null @@ -1,192 +0,0 @@ - -We want to find cases where the `file_edit` metadata for a file entity does not match the DOI of the release it is linked to. - -## Background - - -Seems to mostly happen when the `link_source_id` DOI is not found during a -fatcat 'lookup', eg due to upstream DOI metadata not including a 'title' field -(required by fatcat schema). Many of these seem to be 'journal-issue' or -something like that. Presumably upstream (unpaywall) crawled and found some PDF -for the DOI and indexes that. - -TODO: what was the specific code bug that caused this? time window of edits? -https://github.com/internetarchive/fatcat/commit/d8cf02dbdc6848be4de2710e54f4463d702b6e7c - -TODO: how many empty-title Crossref DOIs are there? perhaps we should import these after all, in some situations? eg, pull `html_biblio`? -TODO: examples of other mismatches? edits from the same timespan with mismatching identifiers - -## Queries - - file_edit - ident_id - rev_id - extra_json->>'link_source_id'::text - file_rev_release - file_rev - target_release_ident_id - release_ident - id - rev_id - release_rev - id - doi - -Query some samples: - - SELECT file_edit.ident_id as file_ident, release_ident.id as release_ident, file_edit.extra_json->>'link_source_id' as file_edit_doi, release_rev.doi as release_doi - FROM file_edit - LEFT JOIN file_rev_release ON file_edit.rev_id = file_rev_release.file_rev - LEFT JOIN release_ident ON file_rev_release.target_release_ident_id = release_ident.id - LEFT JOIN release_rev ON release_rev.id = release_ident.rev_id - WHERE - file_edit.extra_json->>'ingest_request_source' = 'unpaywall' - AND release_rev.doi != file_edit.extra_json->>'link_source_id' - LIMIT 20; - - file_ident | release_ident | file_edit_doi | release_doi - --------------------------------------+--------------------------------------+---------------------------------------+------------------------------------- - 8fe6eb2a-da3a-4e1c-b281-85fad3cae601 | 0a0cbabd-3f04-4296-94d5-b61c27fddee4 | 10.7167/2013/520930/dataset/3 | 10.7167/2013/520930 - cbb2d4e7-a9a7-41df-be1d-faf9dd552ee1 | d7b356c7-de6b-4885-b37d-ea603849a30a | 10.1525/elementa.124.t1 | 10.1525/elementa.124 - 26242648-2a7b-42ea-b2ca-e4fba8f0a2c7 | d2647e07-fbb3-47f5-b06a-de2872bf84ec | 10.1023/a:1000301702170 | 10.1007/bfb0085374 - f774854a-3939-4aa9-bf73-7e0573908b88 | e19ce792-422d-4feb-92ba-dd8536a60618 | 10.32835/2223-5752.2018.17.14-21 | 10.31652/2412-1142-2018-50-151-156 - 615ddf8c-2d91-4c1c-b04a-76b69f07b300 | e7af8377-9542-4711-9444-872f45521dc1 | 10.5644/sjm.10.1.00 | 10.5644/sjm.10.1.01 - 43fa1b53-eddb-4f31-96af-bc9bd3bb31b6 | e1ec5d8a-ff88-49f4-8540-73ee4906e119 | 10.1136/bjo.81.11.934 | 10.1038/sj.bjc.6690790 - 64b04cbc-ab3d-4cef-aff5-9d6b2532c47a | 43c387d1-e4c9-49f3-9995-552451a9a246 | 10.1023/a:1002015330987 | 10.1006/jnth.1998.2271 - 31e6e1a8-8457-4c93-82d3-12a51c0dc1bb | a2ec4305-63d2-4164-8470-02bf5c4ba74c | 10.1186/1742-4690-2-s1-s84 | 10.1038/nm0703-847 - 3461024a-1800-44a0-a7be-73f14bac99dd | 6849d39f-6bae-460b-8a9d-d6b86fbac2d3 | 10.17265/2328-2150/2014.11 | 10.1590/s0074-02762002000900009 - c2707e43-6a4b-4f5d-8fb4-babe16b42b4d | 82127faf-d125-4cd4-88b9-9a8618392019 | 10.1055/s-005-28993 | 10.1055/s-0034-1380711 - 0fe72304-1884-4dde-98c6-7cf3aec5bf31 | e3fe46cd-0205-42e4-8255-490e0eba38ea | 10.1787/5k4c1s0z2gs1-en | 10.1787/9789282105931-2-en - 686a1508-b6c5-4060-9035-1dd8a4b85be1 | dcecd03d-e0f6-40ce-a376-7e64bd90bdca | 10.15406/jlprr.2.3 | 10.15406/jlprr.2015.02.00039 - 9403bc0e-2a50-4c8a-ab79-943bce20e583 | 3073dee2-c8f7-42ed-a727-12306d86fa35 | 10.1787/5jrtgzfl6g9w-en | 10.1111/1468-2443.00004 - 9a080a6f-aaaf-4f4b-acab-3eda2bfacc22 | af28a167-2ddc-43db-98ca-1cc29b863f9f | 10.13040/ijpsr.0975-8232.4(6).2094-05 | 10.1002/chin.201418290 - 810f1a00-623f-4a5c-88e2-7240232ae8f9 | 639d4ddd-ef24-49dc-8d4e-ea626b0b85f8 | 10.13040/ijpsr.0975-8232.5(6).2216-24 | 10.1006/phrs.1997.0184 - 10.13040/IJPSR.0975-8232.5(6).2216-24 - 72752c59-9532-467d-8b4f-fe4d994bcff5 | cc3db36c-27e9-45bf-96b0-56d1119b93d6 | 10.22201/facmed.2007865xe.2018.26 | 10.22201/facmed.2007865x.2018.26.01 - a0740fc2-a1db-4bc8-ac98-177ccebde24f | 0a7136ac-86ad-4636-9c2a-b56a6d5a3a27 | 10.24966/cmph-1978/vol6iss1 | 10.1093/milmed/146.4.283 - 5b05df6a-7a0e-411b-a4e1-0081d7522673 | dc0cb585-c709-4990-a82e-c7c9b254fc74 | 10.15406/jig.3.1 | 10.15406/jig.2016.03.00039 - 011964dd-24a6-4d25-b9dd-921077f1947e | deabab6d-7381-4567-bf56-5294ab081e20 | 10.17265/1537-1506/2013.01 | 10.1109/icsssm.2016.7538527 - 4d7efa59-30c7-4015-bea9-d7df171a97ed | a05449a0-34d7-4f45-9b92-bcf71db4918d | 10.14413/herj.2017.01.09 | 10.14413/herj.2017.01.09. - (20 rows) - -Total counts, broken DOIs, by source: - - SELECT file_edit.extra_json->>'ingest_request_source' as source, COUNT(*) as broken_files - FROM file_edit - LEFT JOIN file_rev_release ON file_edit.rev_id = file_rev_release.file_rev - LEFT JOIN release_ident ON file_rev_release.target_release_ident_id = release_ident.id - LEFT JOIN release_rev ON release_rev.id = release_ident.rev_id - WHERE - file_edit.extra_json->>'link_source_id' IS NOT NULL - AND file_edit.extra_json->>'link_source_id' LIKE '10.%' - AND release_rev.doi != file_edit.extra_json->>'link_source_id' - GROUP BY file_edit.extra_json->>'ingest_request_source'; - - - source | broken_files - ------------------+-------------- - fatcat-changelog | 872 - unpaywall | 227954 - (2 rows) - -## Cleanup Pseudocode - -Dump file idents from above SQL queries. Transform from UUID to fatcat ident. - -For each file ident, fetch file entity, with releases expanded. Also fetch edit -history. Filter by: - -- single release for file -- single edit in history, matching broken agent -- edit metadata indicates a DOI, which doesn't match release DOI (or release doesn't have DOI?) - -If all that is the case, do a lookup by the correct DOI. If there is a match, -update the file entity to point to that release (and only that release). -Operate in batches. - -Write cleanup bot, run in QA to test. Run SQL query again and iterate if some -files were not updated. Merge to master, run full dump and update in prod. - -## Export Matches Needing Fix - - COPY ( - SELECT row_to_json(row.*) FROM ( - SELECT file_edit.ident_id as file_ident, release_ident.id as wrong_release_ident, file_edit.extra_json as edit_extra - FROM file_edit - LEFT JOIN file_rev_release ON file_edit.rev_id = file_rev_release.file_rev - LEFT JOIN release_ident ON file_rev_release.target_release_ident_id = release_ident.id - LEFT JOIN release_rev ON release_rev.id = release_ident.rev_id - WHERE - file_edit.extra_json->>'link_source_id' IS NOT NULL - AND file_edit.extra_json->>'link_source_id' LIKE '10.%' - AND release_rev.doi != file_edit.extra_json->>'link_source_id' - -- LIMIT 20; - ) as row - ) TO '/srv/fatcat/snapshots/file_release_bugfix_20211105.unfiltered.json'; - # COPY 228,827 - -Then need to do a `sed` pass, and a `rg` pass, to filter out bad escapes: - - cat /srv/fatcat/snapshots/file_release_bugfix_20211105.unfiltered.json \ - | sed 's/\\"/\"/g' \ - | rg -v "\\\\" \ - | rg '^\{' \ - | pv -l \ - > /srv/fatcat/snapshots/file_release_bugfix_20211105.json - # 228k 0:00:00 [ 667k/s] - - wc -l /srv/fatcat/snapshots/file_release_bugfix_20211105.json - # 228,826 - -And create a sample file: - - shuf -n10000 /srv/fatcat/snapshots/file_release_bugfix_20211105.json > /srv/fatcat/snapshots/file_release_bugfix_20211105.10k_sample.json - - -## Testing in QA - - export FATCAT_AUTH_WORKER_CLEANUP=[...] - - head -n10 /srv/fatcat/snapshots/file_release_bugfix_20211105.10k_sample.json \ - | python -m fatcat_tools.cleanups.file_release_bugfix - - # Counter({'total': 10, 'update': 10, 'skip': 0, 'insert': 0, 'exists': 0}) - - file_wmjzuybuorfrjdgnr2d32vg5va: was wrong, now correct, no other files - file_wgszcwehlffnzmoypr4l2yhvza: was correct, now correct, no other files. multiple articles in single PDF - file_p4r5sbminzgrhn4yaiqyr7ahwi: was correct, now wrong (!!!) - doi:10.1055/s-0036-1579844 - PDF says: 10.4103/0028-3886.158210 - unpaywall still has this incorrect linkage - file_n5jtvrnodbfdbccl5d6hshhvw4: now stub, was wrong - doi:10.19080/bboaj.2018.04.555640 release_suvtcm7hdbbr3fczcxxt4thgoi seems correct? - doi:10.19080/bboaj.4.4 is the bad DOI, not in fatcat. is registered, as an entire issue - file_kt5fv4d5lrbk7j3dxxgcm7hph4: was wrong, now correct - file_jq4juugnynhsdkh3whkcjod46q: was wrong, now correct - file_ef7w6y7k4jhjhgwmfai37yjjmm: was wrong, now correct - file_ca2svhd6knfnff4dktuonp45du: was correct, now correct. complicated, multiple DOIs/copies/files of same work - file_borst2aewvcwzhucnyth2vf3lm: was correct, now correct. complicated, multiple DOIs, single file of same work - -Overall, seems like this might not be as much of an obvious improvement as -hoped! But still progress and more correct. - - head -n1000 /srv/fatcat/snapshots/file_release_bugfix_20211105.10k_sample.json \ - | python -m fatcat_tools.cleanups.file_release_bugfix - - # Counter({'total': 1000, 'update': 929, 'skip-existing-history-updated': 58, 'skip-existing-fixed': 10, 'skip': 3, 'skip-link-source': 3, 'insert': 0, 'exists': 0}) - -Looking at `skip-link-source`, it is cases where `link_source` is 'doi' not -'fatcat-changelog'. Will update filter behavior, 'fatcat-changelog' is a -`ingest_request_source`. - -Checking another 10 examples. They all seem to end up as correct matches. - -Did a small update and running the whole batch: - - cat /srv/fatcat/snapshots/file_release_bugfix_20211105.10k_sample.json \ - | python -m fatcat_tools.cleanups.file_release_bugfix - - # Counter({'total': 10000, 'update': 8499, 'skip-existing-fixed': 939, 'skip-existing-history-updated': 560, 'skip': 2, 'skip-wrong-release-is-ok': 2, 'insert': 0, 'exists': 0}) - -I think this is ready to go! Example with parallel: - - cat /srv/fatcat/snapshots/file_release_bugfix_20211105.10k_sample.json \ - | parallel -j8 --linebuffer --round-robin --pipe python -m fatcat_tools.cleanups.file_release_bugfix - - |