author Bryan Newbold <bnewbold@robocracy.org> 2021-11-17 16:13:07 -0800
committer Bryan Newbold <bnewbold@robocracy.org> 2021-11-17 16:13:07 -0800
commit 56a7306f1cebf5833238bc4d894261a050c8e3c9 (patch)
tree f5de3654c3318631ae57d2d698f7e26d2b5d6c20
parent 0cec9c1b9dd009a3380a9548598b15094186c7ea (diff)
updated notes on possible cleanups
-rw-r--r-- notes/cleanup_tasks.txt 31
1 files changed, 27 insertions, 4 deletions
diff --git a/notes/cleanup_tasks.txt b/notes/cleanup_tasks.txt
index 43b52836..812d1c2e 100644
--- a/notes/cleanup_tasks.txt
+++ b/notes/cleanup_tasks.txt
@@ -1,25 +1,48 @@
-Cambridge Chemical Database (NCI)
+This is a list of relatively simple bibliographic metadata bugs, which have not
+been fixed yet. Some of these need fixes in importers, others might be one-time
+runs with a simple tool (even `fatcat-cli`).
+
+## Cambridge Chemical Database (NCI)
doi_prefix:10.3406 release_type:article
doi_prefix:10.14469 release_type:article
193,346+ entities
- should be 'dataset' not 'article'
+ should be 'dataset' or 'entry' or something, not 'article'
datacite importer
-Frontiers
+## Frontiers
Frontiers non-PDF abstracts, which have DOIs like `10.3389/conf.*`. Should
crawl these, but `release_type` should be... `abstract`? There are at least
18,743 of these. Should be fixed in both crossref-bot, then a retro-active
cleanup.
-Applied Physics Letters
+## Applied Physics Letters
doi_prefix:10.2172 title:10.2172
For 700+ entities, the title is the DOI number. They all seem to be
"deleted" DOIs, and should be marked as stubs.
+
+## Far-Future Release Years
+
+If year is more than 20 years in the future (arbitrary cut-off), both the year
+and date should probably be cleared.
+
+
+
+
+--------
+
+The following may be more difficult
+
+## Antarctica: A Keystone in a Changing World
+
+container_adgy773dtra3xmrsynghcednqm
+homepage URL is wrong
+
+36k releases of unknown type and unknown publication stage.
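
The "Far-Future Release Years" rule added in this change could be sketched roughly as below. This is only an illustration, not code from the fatcat repository: the function name `clear_far_future`, the constant `MAX_YEARS_AHEAD`, and the way the year and date are passed in are all assumptions, and a real cleanup would go through an importer or a tool such as `fatcat-cli`.

```python
from datetime import date
from typing import Optional, Tuple

# Arbitrary cut-off taken from the note: more than 20 years in the future.
MAX_YEARS_AHEAD = 20


def clear_far_future(
    release_year: Optional[int],
    release_date: Optional[date],
    today: Optional[date] = None,
) -> Tuple[Optional[int], Optional[date]]:
    """Return (year, date), clearing both if the year is implausibly far ahead.

    Hypothetical helper: the parameter names are illustrative and do not
    refer to any specific fatcat API.
    """
    today = today or date.today()
    if release_year is not None and release_year > today.year + MAX_YEARS_AHEAD:
        # Both fields are cleared together, as the note suggests.
        return None, None
    return release_year, release_date
```

Passing `today` explicitly keeps the check reproducible, for example when re-running a cleanup against an older metadata dump.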