From 56a7306f1cebf5833238bc4d894261a050c8e3c9 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Wed, 17 Nov 2021 16:13:07 -0800
Subject: updated notes on possible cleanups

---
 notes/cleanup_tasks.txt | 31 +++++++++++++++++++++++++++----
 1 file changed, 27 insertions(+), 4 deletions(-)

diff --git a/notes/cleanup_tasks.txt b/notes/cleanup_tasks.txt
index 43b52836..812d1c2e 100644
--- a/notes/cleanup_tasks.txt
+++ b/notes/cleanup_tasks.txt
@@ -1,25 +1,48 @@
-Cambridge Chemical Database (NCI)
+This is a list of relatively simple bibliographic metadata bugs, which have not
+been fixed yet. Some of these need fixes in importers, others might be one-time
+runs with a simple tool (even `fatcat-cli`).
+
+## Cambridge Chemical Database (NCI)

 doi_prefix:10.3406 release_type:article
 doi_prefix:10.14469 release_type:article

 193,346+ entities

-should be 'dataset' not 'article'
+should be 'dataset' or 'entry' or something, not 'article'

 datacite importer

-Frontiers
+## Frontiers

 Frontiers non-PDF abstracts, which have DOIs like `10.3389/conf.*`.

 Should crawl these, but `release_type` should be... `abstract`? There are at
 least 18,743 of these. Should be fixed in both crossref-bot, then a
 retro-active cleanup.

-Applied Physics Letters
+## Applied Physics Letters

 doi_prefix:10.2172 title:10.2172

 For 700+ entities, the title is the DOI number. They all seem to be
 "deleted" DOIs, and should be marked as stubs.
+
+## Far-Future Release Years
+
+If year is more than 20 years in the future (arbitrary cut-off), both the year
+and date should probably be cleared.
+
+
+
+
+--------
+
+The following may be more difficult
+
+## Antarctica: A Keystone in a Changing World
+
+container_adgy773dtra3xmrsynghcednqm
+homepage URL is wrong
+
+36k releases of unknown type and unknown publication stage.
-- 
cgit v1.2.3
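
The "Far-Future Release Years" rule added in this patch could be sketched roughly as below. This is only an illustration, not code from the repository: the function name `clear_far_future_year`, the constant `MAX_YEARS_AHEAD`, and the plain-dict record with `release_year` and `release_date` keys are all assumptions made for the example, and the 20-year cut-off is the arbitrary value given in the notes.

```python
import datetime
from typing import Optional

# Cut-off taken from the notes: "more than 20 years in the future (arbitrary cut-off)".
MAX_YEARS_AHEAD = 20


def clear_far_future_year(release: dict, today: Optional[datetime.date] = None) -> dict:
    """Return a copy of `release` with year and date cleared if the year is too far ahead.

    `release` is a plain dict with hypothetical keys `release_year` (int or None)
    and `release_date` (ISO date string or None). It stands in for a real
    release entity and is not the fatcat API.
    """
    today = today or datetime.date.today()
    year = release.get("release_year")
    cleaned = dict(release)  # shallow copy, so the input record is left unchanged
    if year is not None and year > today.year + MAX_YEARS_AHEAD:
        # Per the notes, both fields are cleared together.
        cleaned["release_year"] = None
        cleaned["release_date"] = None
    return cleaned
```

Passing `today` explicitly keeps the check reproducible, for example when re-running a one-time cleanup against an older metadata dump. A year exactly 20 years ahead is kept, since the notes say "more than 20 years".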