From 1b2cff693eeec25468d7dcf743408720a49859b9 Mon Sep 17 00:00:00 2001
From: Bryan Newbold <bnewbold@robocracy.org>
Date: Wed, 4 Jan 2023 19:40:57 -0800
Subject: commit cleanup TODO list

---
 extra/cleanups/TODO | 95 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 95 insertions(+)
 create mode 100644 extra/cleanups/TODO

diff --git a/extra/cleanups/TODO b/extra/cleanups/TODO
new file mode 100644
index 00000000..723628de
--- /dev/null
+++ b/extra/cleanups/TODO
@@ -0,0 +1,95 @@
+
+## Containers: bad publisher strings
+
+    fatcat-cli search container publisher:NULL --count
+    # 131
+    # update to empty string (?)
+
+## Releases: very long titles
+
+
+## Bad PDFs
+
+    https://fatcat.wiki/file/ypoyxwqw5zexbamwtdmpavjjbi
+    https://web.archive.org/web/20190305033128/http://pdfs.semanticscholar.org/ceb2/b47a7647c710cd8e2c1937395b5d4a3a0204.pdf
+    sha1:ceb2b47a7647c710cd8e2c1937395b5d4a3a0204
+    not actually even a PDF?
+
+Should query `file_meta` and/or `pdf_meta` in the sandcrawler DB against an
+updated `fatcat_file` table, look for mismatches, then remove/update the
+affected file entities on the fatcat side.
+
+
+## Partial PDFs
+
+Look into `ieeexplore.ieee.org` PDFs; may be partial?
+
+
+## Invalid DOIs
+
+We get a bunch of bogus DOIs from various sources, eg, PubMed and DOAJ metadata
+(and probably dblp).
+
+It is not hard to verify individual DOIs, but doing so at scale is a bit harder.
+
+We could start by identifying bogus DOIs from failed ingests in sandcrawler-db,
+then verifying and removing from fatcat. Need to ensure we aren't "looping" the
+DOIs on the fatcat side (eg, re-importing).
+
+Could also do random sampling across, eg, DOAJ containers, to identify
+publishers which don't register DOIs, then verify all DOIs from those publishers.
+
+Also need to handle deleted DOIs.
+
+
+## Likely Bogus Dates
+
+If a date is 1970-01-01 (UNIX timestamp zero), then set it to none
+
+
+## Forthcoming Articles
+
+These entities are created when the DOI is registered, but perhaps shouldn't be?
+
+    Forthcoming Article 2019   Astrophysical Journal Letters
+    doi:10.3847/2041-8213/ab0c96
+
+
+## File Slides
+
+Many PDFs in fatcat that are associated with "papers" actually seem to be slide decks.
+
+#### Sandcrawler SQL Exploration
+
+    SELECT *
+    FROM pdf_meta
+    LEFT JOIN fatcat_file
+        ON pdf_meta.sha1hex = fatcat_file.sha1hex
+    WHERE
+        status = 'success'
+        AND page0_height < page0_width  -- first page is landscape
+        AND fatcat_file.sha1hex IS NOT NULL
+    LIMIT 10;
+
+    SELECT COUNT(*)
+    FROM pdf_meta
+    LEFT JOIN fatcat_file
+        ON pdf_meta.sha1hex = fatcat_file.sha1hex
+    WHERE
+        status = 'success'
+        AND page0_height < page0_width
+        AND fatcat_file.sha1hex IS NOT NULL;
+    -- no LIMIT: COUNT(*) already returns a single row
+    # 199,126
+
+#### Low-Code Cleanup Idea
+
+1. do a SQL dump of file idents with this issue
+2. use `fatcat-cli` to fetch the file entities, with releases expanded
+3. use jq to filter to files with only one release associated
+4. use jq to filter to files where the single release is a paper (eg, "article-journal") and maybe also has a `container_id`
+5. use jq to modify the entities, setting `release_ids` to null/empty, and setting `file_scope`
+6. use `fatcat-cli` to update the file entities
+
+This should fix many, though not all, such cases.
+
-- 
cgit v1.2.3