path: root/TODO.md
author: Bryan Newbold <bnewbold@archive.org> 2020-06-01 17:01:20 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2020-06-01 17:01:20 -0700
commit: 57db2db336c08031324e44b2d2880fbd4b6893c9
tree: f5ad462ab6b3e7d3ac7987049e8c604bd5ee9fbe /TODO.md
parent: 08867f9b8de576f0831e6bb9f7b88acddcc31dee
'everything' at least partially working
Diffstat (limited to 'TODO.md')
-rw-r--r-- TODO.md 48
1 files changed, 39 insertions, 9 deletions
diff --git a/TODO.md b/TODO.md
index 2d0c7e3..a6814a0 100644
--- a/TODO.md
+++ b/TODO.md
@@ -1,33 +1,66 @@
+2020-05-06
+x python3.7
+x type annotations / dataclasses
+x "update-sources"
+ => makefile
+- run "everything" successfully
+- "upload-sources"
+ => to archive.org, with datetime
+- "fetch-sources"
+ => all snapshots in a single ia item, with datetime
+- scielo journal metadata
+- kbart loading
+- "platform" column in database
+- rewrite README
+
+- flag to delete old table/rows when loading (?)
+- "loaders" not directories?
+- makefile
+- black
+- refactor most code into module directory
+- tests
+ => index process
+- update upstreams
+
+refactors:
+- "directory" command with directory as arg
+- "kbart" command with directory as arg
+- "load" command with directory as arg
+
+https://isaw.nyu.edu/publications/awol-index/
+
## Chocula
+- fully automated updates, luigi/gluish style
+ => downloads/uploads source metadata files
+ => outputs config file for chocula run
+ => runs chocula everything
+
priorities:
-x fraction/which are pointing to wayback
- coverage stats, particularly for longtail
-x wikidata linkage (prep for wikimania)
- "still in print" flag
- clean out invalid ISSN-L from fatcat
- don't list dead URLs in fatcat
- summary report of some of above
-- update all fatcat (wikidata QID, urls, fixed ISSN-L, etc)
- when updating fatcat:
if title is "blah, Proceedings of the", set type to proceedings and re-write title
if title like "Workshop on", set type
source improvements:
- entrez: "NLM Unique Id"
-- JUFO: finish
+- JURN: finish
- crossref: empty string identifiers?
+- scielo: https://scielo.org/en/journals/list-by-alphabetical-order/?export=csv
+- https://www.arc.gov.au/excellence-research-australia (journal list)
- public scopus list (?)
- scrape/munge public clarivate dumps
- import JURN into fatcat (one way or another)
=> try to title match and get ISSN-L
=> manual lookups for remainders?
-- dump json
- "GOLD" importer (for scopus/WoS)
- check that all fields actually getting imported reasonably
-- homepage crawl/status script
- could poll portal.issn.org like:
https://portal.issn.org/resource/ISSN/1561-7645?format=json
@@ -40,7 +73,4 @@ source improvements:
- update_url_status (needs re-write)
- log out index issues (duplicate ISSN-L, etc) to a file
- validate against GOLD OA list
-- decide what to do with JURN... match? fuzzy match? create missing fatcat?
-- lots of bogus ISSN-L, like 9999-9999 or 0000-0000. should both validate
- check digit and require an ISSN-L to actually exist.
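
The bogus ISSN-L item above asks for two checks: check-digit validation and confirming that the ISSN-L actually exists. A minimal sketch of the first check, using the standard ISSN modulus-11 scheme, might look like the following. The names `issn_check_digit` and `is_valid_issn` are hypothetical and are not taken from the chocula code base.

```python
import re

# Hypothetical helper names; not part of chocula itself.
ISSN_RE = re.compile(r"^[0-9]{4}-[0-9]{3}[0-9X]$")


def issn_check_digit(first_seven: str) -> str:
    """Compute the ISSN check character for the first seven digits."""
    # Weights run from 8 down to 2 across the seven digits.
    total = sum(int(d) * w for d, w in zip(first_seven, range(8, 1, -1)))
    remainder = total % 11
    if remainder == 0:
        return "0"
    check = 11 - remainder
    # A check value of 10 is written as "X".
    return "X" if check == 10 else str(check)


def is_valid_issn(issn: str) -> bool:
    """Return True if the string is NNNN-NNNC with a correct check digit."""
    issn = issn.strip().upper()
    if not ISSN_RE.match(issn):
        return False
    digits = issn.replace("-", "")
    return issn_check_digit(digits[:7]) == digits[7]


if __name__ == "__main__":
    for candidate in ["1561-7645", "9999-9999", "0000-0000"]:
        print(candidate, is_valid_issn(candidate))
```

With this sketch, 9999-9999 is rejected because its check digit would have to be 4, but 0000-0000 passes, since an all-zero sum yields a check digit of 0. That is why the note also asks for an existence check: the check digit alone cannot catch every placeholder value, so a lookup against a known ISSN-L list would still be needed.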