update TODO

author: Bryan Newbold <bnewbold@archive.org> 2020-06-11 19:50:01 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2020-06-11 19:50:01 -0700
commit: ef8f8d560ea64c1c02841f9b5097bb05f16c9d6f (patch)
tree: 5c4d45217e9049d58d2c02e88f66f31d327b8942
parent: bcc0e8e303a1a40fc32f7f9bb46c5e4b6d8cd71e (diff)
download: chocula-ef8f8d560ea64c1c02841f9b5097bb05f16c9d6f.tar.gz
chocula-ef8f8d560ea64c1c02841f9b5097bb05f16c9d6f.zip
1 files changed, 49 insertions, 61 deletions
diff --git a/TODO.md b/TODO.md
index a6814a0..8b4cdb9 100644
--- a/TODO.md
+++ b/TODO.md
@@ -1,76 +1,64 @@
 
-2020-05-06
-x python3.7
-x type annotations / dataclasses
-x "update-sources"
-    => makefile
-- run "everything" successfully
-- "upload-sources"
-    => to archive.org, with datetime
-- "fetch-sources"
-    => all snapshots in a single ia item, with datetime
-- scielo journal metadata
-- kbart loading
-- "platform" column in database
-- rewrite README
-
-- flag to delete old table/rows when loading (?)
-- "loaders" not directories?
-- makefile
-- black
-- refactor most code into module directory
-- tests
-    => index process
-- update upstreams
-
-refactors:
-- "directory" command with directory as arg
-- "kbart" command with directory as arg
-- "load" command with directory as arg
-
-https://isaw.nyu.edu/publications/awol-index/
-
-## Chocula
-
-- fully automated updates, luigi/gluish style
-    => downloads/uploads source metadata files
-    => outputs config file for chocula run
-    => runs chocula everything
-
 priorities:
 - coverage stats, particularly for longtail
 - "still in print" flag
 - clean out invalid ISSN-L from fatcat
 - don't list dead URLs in fatcat
-- summary report of some of above
-- when updating fatcat:
-    if title is "blah,  Proceedings of the", set type to proceedings and re-write title
-    if title like "Workshop on", set type
 
-source improvements:
-- entrez: "NLM Unique Id"
-- JURN: finish 
-- crossref: empty string identifiers?
-- scielo: https://scielo.org/en/journals/list-by-alphabetical-order/?export=csv
-- https://www.arc.gov.au/excellence-research-australia (journal list)
+## Sources
 
+- PKP OJS index
+    => mostly redundant with DOAJ?
+- dblp conferences/series
+    => no container-only metadata dump available?
+- MAG
+- vanished journals
+    => https://github.com/njahn82/vanished_journals
+    => https://isaw.nyu.edu/publications/awol-index/
+- sherpa/romeo refactor (no more updates)
+- entrez refactor (no more updates)
+- unpaywall journal-level classification
+    => ask for journal-level dump or do munging
+- SERP homepage munging
+- repositories (?)
+- jurn matches
+- datacite metadata (?)
+    => via munging
+- curated quality lists (eg, national libraries)
+    => https://www.arc.gov.au/excellence-research-australia
 - public scopus list (?)
 - scrape/munge public clarivate dumps
-- import JURN into fatcat (one way or another)
-    => try to title match and get ISSN-L
-    => manual lookups for remainders?
 - "GOLD" importer (for scopus/WoS)
+- ISSN metadata from portal.issn.org
+    scraping is done
+    only for ISSN-Ls from existing table
+    https://portal.issn.org/resource/ISSN/1561-7645?format=json
+    would require a good deal of munging (eg, MARC region -> ISO) (?)
+
+improvements:
+- entrez: "NLM Unique Id"
+- JURN: finish 
+- crossref: empty string identifiers?
+
+## Code / Behavior
+
+- black (syntax)
+- log out index issues (duplicate ISSN-L, etc) to a file
+- flag to delete old table/rows when loading (?)
+- fully automated updates, cron, luigi/gluish style
+    => downloads/uploads source metadata files
 - check that all fields actually getting imported reasonably
+- efficient fatcat export
+    => filters for changes to make
+    => not really necessary, fatcat importer already skips
 
-- could poll portal.issn.org like:
-    https://portal.issn.org/resource/ISSN/1561-7645?format=json
-    would require a good deal of munging (eg, MARC region -> ISO)
-- KBART imports (with JSON, so only a single row per slug)
+## Schema
+
+- `platform` column in database
+- `container_type` column in database
+    => munge this in various ways
+    => if title is "blah,  Proceedings of the", set type to proceedings and re-write title
+    => if title like "Workshop on", set type
 - imprint/publisher distinction (publisher is big group)
 - summary table should be superset of fatcat table
-- add timestamp columns to enable updates?
-- fatcat export (filters for changes to make, writes out as JSON)
-- update_url_status (needs re-write)
-- log out index issues (duplicate ISSN-L, etc) to a file
-- validate against GOLD OA list
-
+- `update_url_status` (needs re-write) (?)
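The `container_type` item under "Schema" in the new TODO.md describes two title-based munging rules. A minimal sketch of how they might look follows. The function name `guess_container_type`, the regular expressions, and the `WORKSHOP_TYPE` value are all assumptions for illustration; the TODO does not say which type a "Workshop on" title should get, and none of this is taken from the chocula codebase.

```python
import re
from typing import Optional, Tuple

# Assumption: the TODO only says "set type" for workshop titles, without
# naming the value, so this constant is a placeholder.
WORKSHOP_TYPE = "proceedings"

# Hypothetical patterns derived from the two informal examples in the TODO.
_TRAILING_PROCEEDINGS = re.compile(
    r"^(?P<rest>.+?),\s+Proceedings of the$", re.IGNORECASE
)
_WORKSHOP = re.compile(r"\bWorkshop on\b", re.IGNORECASE)


def guess_container_type(title: str) -> Tuple[str, Optional[str]]:
    """Return (possibly re-written title, guessed container_type or None)."""
    title = title.strip()

    # "blah,  Proceedings of the" -> "Proceedings of the blah"
    match = _TRAILING_PROCEEDINGS.match(title)
    if match:
        return f"Proceedings of the {match.group('rest')}", "proceedings"

    # Titles containing "Workshop on" keep their title but get a type.
    if _WORKSHOP.search(title):
        return title, WORKSHOP_TYPE

    # No rule applied: leave the title alone and the type unset.
    return title, None


if __name__ == "__main__":
    for example in ["blah,  Proceedings of the", "Workshop on Foo", "Nature"]:
        print(example, "->", guess_container_type(example))
```

Returning the title together with the guessed type keeps the rewrite and the classification in one step, so a caller would not need to re-match the title to find out whether it changed.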