author | Bryan Newbold <bnewbold@robocracy.org> | 2019-10-08 15:56:53 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@robocracy.org> | 2019-10-08 15:56:53 -0700 |
commit | 76db7f4048116a23c82bdd70bb11dd004e347e8e (patch) | |
tree | b273f321e4645121e2579de5d1478e7722bf629f /python/fatcat_tools/cleanups/NOTES.txt | |
parent | b3bba513a843029459823ce9a74cce9947bba339 (diff) | |
new cleanup python tool/framework
Diffstat (limited to 'python/fatcat_tools/cleanups/NOTES.txt')
-rw-r--r-- | python/fatcat_tools/cleanups/NOTES.txt | 24 |
1 files changed, 24 insertions, 0 deletions
diff --git a/python/fatcat_tools/cleanups/NOTES.txt b/python/fatcat_tools/cleanups/NOTES.txt
new file mode 100644
index 00000000..cdaed6b1
--- /dev/null
+++ b/python/fatcat_tools/cleanups/NOTES.txt
@@ -0,0 +1,24 @@
+
+design is to iterate over JSON list of full entities. perform transforms/fixes.
+if no changes, bail early. if changes, do a request to check that current rev
+of entity is same as processed, to prevent race conditions; if a match, do
+update (in import/merge batch style).
+
+should pre-filter entities piped in. also have a CLI mode to do a single
+entity; check+update code should be distinct from fix code.
+
+releases
+- extra.subtitle => subtitle
+- has pmid, type is journal-article, title like "Retraction:" => type is retraction
+- similar to above, title like "Retracted:" => status is retracted
+- longtail release year is bogus (like > 2030?) => remove release year
+
+files
+- URL has ://archive.org/ link with rel=repository => rel=archive
+- URL has ://web.archive.org/web/None/ link => delete URL
+- URL has short wayback date ("2017") and another url with that as prefix => delete URL
+- mimetype is bogus like (???) => clean mimetype
+
+container
+- extra.issnp = "NA" => delete key
+    => in general, issne or issnp not valid ISSNs -> delete key
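The design described in NOTES.txt (fix, bail early if unchanged, re-check the current revision, then update) can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the field names (`ident`, `revision`, `release_type`, `release_year`, `ext_ids.pmid`), the helpers `iter_entities`, `fix_release`, and `cleanup`, and the injected `get_current_rev` / `update` callables are all assumptions made for the example. It also assumes one JSON entity per input line, reads "title like" as a prefix match, and applies the year rule to every release rather than only "longtail" ones.

```python
import copy
import json

# Cutoff suggested (with a question mark) in the notes; an assumption here.
MAX_PLAUSIBLE_YEAR = 2030


def iter_entities(lines):
    """Yield one entity dict per non-empty JSON line (assumed input format)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def fix_release(entity):
    """Fix step only: return a cleaned copy and never talk to the API."""
    fixed = copy.deepcopy(entity)
    extra = fixed.get("extra") or {}
    # extra.subtitle => subtitle
    if extra.get("subtitle") and not fixed.get("subtitle"):
        fixed["subtitle"] = extra.pop("subtitle")
    # has pmid, journal-article, title like "Retraction:" => type is retraction
    ext_ids = fixed.get("ext_ids") or {}
    title = fixed.get("title") or ""
    if (ext_ids.get("pmid")
            and fixed.get("release_type") == "journal-article"
            and title.startswith("Retraction:")):
        fixed["release_type"] = "retraction"
    # bogus release year => remove release year
    year = fixed.get("release_year")
    if isinstance(year, int) and year > MAX_PLAUSIBLE_YEAR:
        fixed.pop("release_year")
    return fixed


def cleanup(entities, fix, get_current_rev, update):
    """Check+update step, kept separate from the fix code as the notes ask.

    get_current_rev(ident) and update(entity) are hypothetical callables
    standing in for whatever API client is actually used.
    """
    stats = {"unchanged": 0, "stale": 0, "updated": 0}
    for entity in entities:
        fixed = fix(entity)
        if fixed == entity:
            stats["unchanged"] += 1  # no changes: bail early
            continue
        # Re-check the revision to avoid clobbering a concurrent edit.
        if get_current_rev(entity["ident"]) != entity["revision"]:
            stats["stale"] += 1
            continue
        update(fixed)
        stats["updated"] += 1
    return stats
```

Passing the fix function and the API calls in as arguments keeps the fix logic testable without a live server, and would let a single-entity CLI mode reuse `cleanup` with a one-element list. Batching the updates ("import/merge batch style") is left out of this sketch.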