author: Bryan Newbold <bnewbold@robocracy.org> 2019-10-08 15:56:53 -0700
committer: Bryan Newbold <bnewbold@robocracy.org> 2019-10-08 15:56:53 -0700
commit: 76db7f4048116a23c82bdd70bb11dd004e347e8e
tree: b273f321e4645121e2579de5d1478e7722bf629f /python/fatcat_tools/cleanups/NOTES.txt
parent: b3bba513a843029459823ce9a74cce9947bba339
new cleanup python tool/framework
Diffstat (limited to 'python/fatcat_tools/cleanups/NOTES.txt')
-rw-r--r-- python/fatcat_tools/cleanups/NOTES.txt | 24
1 file changed, 24 insertions, 0 deletions
diff --git a/python/fatcat_tools/cleanups/NOTES.txt b/python/fatcat_tools/cleanups/NOTES.txt
new file mode 100644
index 00000000..cdaed6b1
--- /dev/null
+++ b/python/fatcat_tools/cleanups/NOTES.txt
@@ -0,0 +1,24 @@
+
+design is to iterate over JSON list of full entities. perform transforms/fixes.
+if no changes, bail early. if changes, do a request to check that current rev
+of entity is same as processed, to prevent race conditions; if a match, do
+update (in import/merge batch style).
+
+should pre-filter entities piped in. also have a CLI mode to do a single
+entity; check+update code should be distinct from fix code.
+
+releases
+- extra.subtitle => subtitle
+- has pmid, type is journal-article, title like "Retraction:" => type is retraction
+- similar to above, title like "Retracted:" => status is retracted
+- longtail release year is bogus (like > 2030?) => remove release year
+
+files
+- URL has ://archive.org/ link with rel=repository => rel=archive
+- URL has ://web.archive.org/web/None/ link => delete URL
+- URL has short wayback date ("2017") and another url with that as prefix => delete URL
+- mimetype is bogus like (???) => clean mimetype
+
+container
+- extra.issnp = "NA" => delete key
+ => in general, issne or issnp not valid ISSNs -> delete key
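
The loop described in the notes (transform, bail early if nothing changed, re-check the current revision, then update) could look roughly like the sketch below. This is not the code from this commit. It assumes one JSON entity per input line and the field names ident, revision, extra, urls, url and rel. The names fix_release, fix_file, run_cleanup, get_current_rev and submit_update are hypothetical; the last two stand in for whatever API client and batch-update path the real tool uses. Only two of the listed rules are shown per entity type.

```python
import copy
import json
from typing import Callable, Iterable, Optional

Entity = dict


def fix_release(release: Entity) -> Optional[Entity]:
    """Apply 'extra.subtitle => subtitle'; return None if nothing changed."""
    fixed = copy.deepcopy(release)
    extra = fixed.get("extra") or {}
    if extra.get("subtitle") and not fixed.get("subtitle"):
        fixed["subtitle"] = extra.pop("subtitle")
        fixed["extra"] = extra
    return fixed if fixed != release else None


def fix_file(file_entity: Entity) -> Optional[Entity]:
    """Apply two of the URL rules for files; return None if nothing changed."""
    fixed = copy.deepcopy(file_entity)
    if fixed.get("urls"):
        kept = []
        for u in fixed["urls"]:
            if "://web.archive.org/web/None/" in u["url"]:
                continue  # broken wayback link => delete URL
            if "://archive.org/" in u["url"] and u.get("rel") == "repository":
                u["rel"] = "archive"  # rel=repository => rel=archive
            kept.append(u)
        fixed["urls"] = kept
    return fixed if fixed != file_entity else None


def run_cleanup(
    json_lines: Iterable[str],
    fix: Callable[[Entity], Optional[Entity]],
    get_current_rev: Callable[[str], Optional[str]],  # hypothetical API lookup
    submit_update: Callable[[Entity], None],          # hypothetical update call
) -> dict:
    """Fix code (the `fix` callable) stays separate from check+update code."""
    counts = {"total": 0, "unchanged": 0, "stale": 0, "updated": 0}
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        entity = json.loads(line)
        counts["total"] += 1
        fixed = fix(entity)
        if fixed is None:
            counts["unchanged"] += 1  # no changes: bail early
            continue
        # Guard against races: only update if the live revision still
        # matches the revision that was processed.
        if get_current_rev(entity["ident"]) != entity["revision"]:
            counts["stale"] += 1
            continue
        submit_update(fixed)
        counts["updated"] += 1
    return counts
```

Passing the fix function in as an argument keeps the check+update path identical for releases, files and containers, and a single-entity CLI mode could reuse run_cleanup with a one-line input.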