author | Bryan Newbold <bnewbold@robocracy.org> | 2021-11-17 16:13:07 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@robocracy.org> | 2021-11-17 16:13:07 -0800 |
commit | 56a7306f1cebf5833238bc4d894261a050c8e3c9 (patch) | |
tree | f5de3654c3318631ae57d2d698f7e26d2b5d6c20 | |
parent | 0cec9c1b9dd009a3380a9548598b15094186c7ea (diff) | |
updated notes on possible cleanups
-rw-r--r-- | notes/cleanup_tasks.txt | 31 |
1 files changed, 27 insertions, 4 deletions
diff --git a/notes/cleanup_tasks.txt b/notes/cleanup_tasks.txt
index 43b52836..812d1c2e 100644
--- a/notes/cleanup_tasks.txt
+++ b/notes/cleanup_tasks.txt
@@ -1,25 +1,48 @@
-Cambridge Chemical Database (NCI)
+This is a list of relatively simple bibliographic metadata bugs, which have not
+been fixed yet. Some of these need fixes in importers, others might be one-time
+runs with a simple tool (even `fatcat-cli`).
+
+## Cambridge Chemical Database (NCI)
 
 doi_prefix:10.3406 release_type:article
 doi_prefix:10.14469 release_type:article
 
 193,346+ entities
-    should be 'dataset' not 'article'
+    should be 'dataset' or 'entry' or something, not 'article'
     datacite importer
 
-Frontiers
+## Frontiers
 
 Frontiers non-PDF abstracts, which have DOIs like `10.3389/conf.*`.
 
 Should crawl these, but `release_type` should be... `abstract`?
 
 There are at least 18,743 of these.
 
 Should be fixed in both crossref-bot, then a retro-active cleanup.
 
-Applied Physics Letters
+## Applied Physics Letters
 
 doi_prefix:10.2172 title:10.2172
 
 For 700+ entities, the title is the DOI number. They all seem to be "deleted"
 DOIs, and should be marked as stubs.
+
+## Far-Future Release Years
+
+If year is more than 20 years in the future (arbitrary cut-off), both the year
+and date should probably be cleared.
+
+
+
+
+
+--------
+
+The following may be more difficult
+
+## Antarctica: A Keystone in a Changing World
+
+container_adgy773dtra3xmrsynghcednqm
+homepage URL is wrong
+
+36k releases of unknown type and unknown publication stage.
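
The "Far-Future Release Years" note in the diff above describes a rule simple enough to sketch in Python. This is only an illustration, not code from the fatcat importers: it assumes a release is a plain dict with `release_year` and `release_date` keys, and the function name `clear_far_future_year` and the constant `MAX_YEARS_AHEAD` are hypothetical.

```python
import datetime

# Arbitrary cut-off taken from the note: more than 20 years in the future.
MAX_YEARS_AHEAD = 20


def clear_far_future_year(release, today=None):
    """Return a copy of `release` with year and date cleared if the year is implausible.

    Assumption: `release` is a plain dict with `release_year` (int or None)
    and `release_date` (ISO date string or None). This is a stand-in for a
    real release entity, not the actual fatcat API.
    """
    today = today or datetime.date.today()
    cleaned = dict(release)  # shallow copy, so the input is left untouched
    year = cleaned.get("release_year")
    if year is not None and year > today.year + MAX_YEARS_AHEAD:
        # Per the note, clear both fields rather than only the year.
        cleaned["release_year"] = None
        cleaned["release_date"] = None
    return cleaned
```

Passing `today` explicitly keeps the check reproducible. A year exactly 20 years ahead is kept, since the note says "more than 20 years".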