author Bryan Newbold <bnewbold@archive.org> 2020-06-11 19:50:01 -0700
committer Bryan Newbold <bnewbold@archive.org> 2020-06-11 19:50:01 -0700
commit ef8f8d560ea64c1c02841f9b5097bb05f16c9d6f
tree 5c4d45217e9049d58d2c02e88f66f31d327b8942
parent bcc0e8e303a1a40fc32f7f9bb46c5e4b6d8cd71e
update TODO
-rw-r--r-- TODO.md 110
1 files changed, 49 insertions, 61 deletions
diff --git a/TODO.md b/TODO.md
index a6814a0..8b4cdb9 100644
--- a/TODO.md
+++ b/TODO.md
@@ -1,76 +1,64 @@
-2020-05-06
-x python3.7
-x type annotations / dataclasses
-x "update-sources"
- => makefile
-- run "everything" successfully
-- "upload-sources"
- => to archive.org, with datetime
-- "fetch-sources"
- => all snapshots in a single ia item, with datetime
-- scielo journal metadata
-- kbart loading
-- "platform" column in database
-- rewrite README
-
-- flag to delete old table/rows when loading (?)
-- "loaders" not directories?
-- makefile
-- black
-- refactor most code into module directory
-- tests
- => index process
-- update upstreams
-
-refactors:
-- "directory" command with directory as arg
-- "kbart" command with directory as arg
-- "load" command with directory as arg
-
-https://isaw.nyu.edu/publications/awol-index/
-
-## Chocula
-
-- fully automated updates, luigi/gluish style
- => downloads/uploads source metadata files
- => outputs config file for chocula run
- => runs chocula everything
-
priorities:
- coverage stats, particularly for longtail
- "still in print" flag
- clean out invalid ISSN-L from fatcat
- don't list dead URLs in fatcat
-- summary report of some of above
-- when updating fatcat:
- if title is "blah, Proceedings of the", set type to proceedings and re-write title
- if title like "Workshop on", set type
-source improvements:
-- entrez: "NLM Unique Id"
-- JURN: finish
-- crossref: empty string identifiers?
-- scielo: https://scielo.org/en/journals/list-by-alphabetical-order/?export=csv
-- https://www.arc.gov.au/excellence-research-australia (journal list)
+## Sources
+- PKP OJS index
+ => mostly redundant with DOAJ?
+- dblp conferences/series
+ => no container-only metadata dump available?
+- MAG
+- vanished journals
+ => https://github.com/njahn82/vanished_journals
+ => https://isaw.nyu.edu/publications/awol-index/
+- sherpa/romeo refactor (no more updates)
+- entrez refactor (no more updates)
+- unpaywall journal-level classification
+ => ask for journal-level dump or do munging
+- SERP homepage munging
+- repositories (?)
+- jurn matches
+- datacite metadata (?)
+ => via munging
+- curated quality lists (eg, national libraries)
+ => https://www.arc.gov.au/excellence-research-australia
- public scopus list (?)
- scrape/munge public clarivate dumps
-- import JURN into fatcat (one way or another)
- => try to title match and get ISSN-L
- => manual lookups for remainders?
- "GOLD" importer (for scopus/WoS)
+- ISSN metadata from portal.issn.org
+ scraping is done
+ only for ISSN-Ls from existing table
+ https://portal.issn.org/resource/ISSN/1561-7645?format=json
+ would require a good deal of munging (eg, MARC region -> ISO) (?)
+
+improvements:
+- entrez: "NLM Unique Id"
+- JURN: finish
+- crossref: empty string identifiers?
+
+## Code / Behavior
+
+- black (syntax)
+- log out index issues (duplicate ISSN-L, etc) to a file
+- flag to delete old table/rows when loading (?)
+- fully automated updates, cron, luigi/gluish style
+ => downloads/uploads source metadata files
- check that all fields actually getting imported reasonably
+- efficient fatcat export
+ => filters for changes to make
+ => not really necessary, fatcat importer already skips
-- could poll portal.issn.org like:
- https://portal.issn.org/resource/ISSN/1561-7645?format=json
- would require a good deal of munging (eg, MARC region -> ISO)
-- KBART imports (with JSON, so only a single row per slug)
+## Schema
+
+- `platform` column in database
+- `container_type` column in database
+ => munge this in various ways
+ => if title is "blah, Proceedings of the", set type to proceedings and re-write title
+ => if title like "Workshop on", set type
- imprint/publisher distinction (publisher is big group)
- summary table should be superset of fatcat table
-- add timestamp columns to enable updates?
-- fatcat export (filters for changes to make, writes out as JSON)
-- update_url_status (needs re-write)
-- log out index issues (duplicate ISSN-L, etc) to a file
-- validate against GOLD OA list
-
+- `update_url_status` (needs re-write) (?)
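
The `container_type` items in the Schema section describe two title-based rules. A minimal sketch of how they might look in Python follows. Everything here is an assumption for illustration: the function name `munge_container_type`, the regular expressions, and especially `WORKSHOP_TYPE`, since the TODO only says "set type" for workshop titles without naming a value. It is not the chocula implementation.

```python
import re
from typing import Optional, Tuple

# Assumption: the TODO does not say which type "Workshop on" titles get.
WORKSHOP_TYPE = "proceedings"

# Matches inverted titles such as "Some Society, Proceedings of the".
_TRAILING_PROCEEDINGS = re.compile(
    r"^(?P<name>.+),\s*Proceedings of the$", re.IGNORECASE
)
# Matches titles containing the phrase "Workshop on".
_WORKSHOP = re.compile(r"\bWorkshop on\b", re.IGNORECASE)


def munge_container_type(title: str) -> Tuple[str, Optional[str]]:
    """Return (possibly rewritten title, guessed container type or None).

    Hypothetical helper; the name and return shape are not from chocula.
    """
    title = title.strip()
    match = _TRAILING_PROCEEDINGS.match(title)
    if match:
        # Rewrite "X, Proceedings of the" as "Proceedings of the X".
        name = match.group("name").strip()
        return f"Proceedings of the {name}", "proceedings"
    if _WORKSHOP.search(title):
        # Title is left unchanged; only the type is set.
        return title, WORKSHOP_TYPE
    # No rule applied: keep the title and leave the type unset.
    return title, None
```

Returning `None` when no rule matches leaves room for the other sources of `container_type` that the TODO alludes to ("munge this in various ways") to fill in the value.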