From 5b6a0283aa030a34cef4b9b83d281690937cdfae Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Wed, 31 Jul 2019 21:57:47 -0700
Subject: commit TODO list

---
 TODO.md | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)
 create mode 100644 TODO.md

diff --git a/TODO.md b/TODO.md
new file mode 100644
index 0000000..cfbc4d5
--- /dev/null
+++ b/TODO.md
@@ -0,0 +1,37 @@
+
+## Chocula
+
+priorities:
+x fraction/which are pointing to wayback
+- coverage stats, particularly for longtail
+x wikidata linkage (prep for wikimania)
+- "still in print" flag
+- clean out invalid ISSN-L from fatcat
+- don't list dead URLs in fatcat
+- summary report of some of above
+- update all fatcat (wikidata QID, urls, fixed ISSN-L, etc)
+
+
+- public scopus list (?)
+- scrape/munge public clarivate dumps
+- import JURN into fatcat (one way or another)
+    => try to title match and get ISSN-L
+    => manual lookups for remainders?
+- dump json
+- "GOLD" importer (for scopus/WoS)
+- check that all fields actually getting imported reasonably
+- homepage crawl/status script
+
+- KBART imports (with JSON, so only a single row per slug)
+- imprint/publisher distinction (publisher is big group)
+- summary table should be superset of fatcat table
+- add timestamp columns to enable updates?
+- fatcat export (filters for changes to make, writes out as JSON)
+- update_url_status (needs re-write)
+- index -> directory
+- log out index issues (duplicate ISSN-L, etc) to a file
+- validate against GOLD OA list
+- decide what to do with JURN... match? fuzzy match? create missing fatcat?
+- lots of bogus ISSN-L, like 9999-9999 or 0000-0000. should both validate
+  check digit and require an ISSN-L to actually exist.
+
-- 
cgit v1.2.3
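
The last TODO item asks for ISSN-L check-digit validation. A minimal sketch of how that check might look is below. It assumes the standard ISSN scheme (weights 8 down to 2 over the first seven digits, modulo 11, with `X` standing for 10). The names `issn_check_digit` and `is_plausible_issn` are hypothetical and are not part of this commit or of chocula.

```python
import re

# Shape check only: four digits, a hyphen, three digits, then a digit or X.
ISSN_RE = re.compile(r"[0-9]{4}-[0-9]{3}[0-9X]")


def issn_check_digit(first_seven: str) -> str:
    """Compute the ISSN check character for the first seven digits."""
    # Weights run 8, 7, 6, 5, 4, 3, 2 across the seven digits.
    total = sum(int(d) * w for d, w in zip(first_seven, range(8, 1, -1)))
    value = (11 - total % 11) % 11
    return "X" if value == 10 else str(value)


def is_plausible_issn(raw: str) -> bool:
    """Return True if `raw` is well-formed and its check digit matches.

    This does NOT prove the ISSN is registered; it only checks syntax
    and the checksum.
    """
    issn = raw.strip().upper()
    if not ISSN_RE.fullmatch(issn):
        return False
    digits = issn.replace("-", "")
    return issn_check_digit(digits[:7]) == digits[7]
```

Under this scheme `9999-9999` is rejected (the expected check digit is 4), but `0000-0000` has a consistent check digit and passes. That is presumably why the item also calls for confirming that the ISSN-L actually exists, which needs a lookup against a registry or other authoritative list and is not sketched here.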