From ef8f8d560ea64c1c02841f9b5097bb05f16c9d6f Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Thu, 11 Jun 2020 19:50:01 -0700
Subject: update TODO

---
 TODO.md | 110 +++++++++++++++++++++++++++++----------------------------------
 1 file changed, 49 insertions(+), 61 deletions(-)

diff --git a/TODO.md b/TODO.md
index a6814a0..8b4cdb9 100644
--- a/TODO.md
+++ b/TODO.md
@@ -1,76 +1,64 @@
-2020-05-06
-x python3.7
-x type annotations / dataclasses
-x "update-sources"
-    => makefile
-- run "everything" successfully
-- "upload-sources"
-    => to archive.org, with datetime
-- "fetch-sources"
-    => all snapshots in a single ia item, with datetime
-- scielo journal metadata
-- kbart loading
-- "platform" column in database
-- rewrite README
-
-- flag to delete old table/rows when loading (?)
-- "loaders" not directories?
-- makefile
-- black
-- refactor most code into module directory
-- tests
-    => index process
-- update upstreams
-
-refactors:
-- "directory" command with directory as arg
-- "kbart" command with directory as arg
-- "load" command with directory as arg
-
-https://isaw.nyu.edu/publications/awol-index/
-
-## Chocula
-
-- fully automated updates, luigi/gluish style
-    => downloads/uploads source metadata files
-    => outputs config file for chocula run
-    => runs chocula everything
-
 priorities:
 - coverage stats, particularly for longtail
 - "still in print" flag
 - clean out invalid ISSN-L from fatcat
 - don't list dead URLs in fatcat
-- summary report of some of above
-- when updating fatcat:
-    if title is "blah, Proceedings of the", set type to proceedings and re-write title
-    if title like "Workshop on", set type
-source improvements:
-- entrez: "NLM Unique Id"
-- JURN: finish
-- crossref: empty string identifiers?
-- scielo: https://scielo.org/en/journals/list-by-alphabetical-order/?export=csv
-- https://www.arc.gov.au/excellence-research-australia (journal list)
+## Sources
+- PKP OJS index
+    => mostly redundant with DOAJ?
+- dblp conferences/series
+    => no container-only metadata dump available?
+- MAG
+- vanished journals
+    => https://github.com/njahn82/vanished_journals
+    => https://isaw.nyu.edu/publications/awol-index/
+- sherpa/romeo refactor (no moreo updates)
+- entrez refactor (no moreo updates)
+- unpaywall journal-level classification
+    => ask for journal-level dump or do munging
+- SERP homepage munging
+- repositories (?)
+- jurn matches
+- datacite metadata (?)
+    => via munging
+- currated quality lists (eg, national libraries)
+    => https://www.arc.gov.au/excellence-research-australia
 - public scopus list (?)
 - scrape/munge public clarivate dumps
-- import JURN into fatcat (one way or another)
-    => try to title match and get ISSN-L
-    => manual lookups for remainders?
 - "GOLD" importer (for scopus/WoS)
+- ISSN metadata from portal.issn.org
+    scraping is done
+    only for ISSN-Ls from existing table
+    https://portal.issn.org/resource/ISSN/1561-7645?format=json
+    would require a good deal of munging (eg, MARC region -> ISO) (?)
+
+improvements:
+- entrez: "NLM Unique Id"
+- JURN: finish
+- crossref: empty string identifiers?
+
+## Code / Behavior
+
+- black (syntax)
+- log out index issues (duplicate ISSN-L, etc) to a file
+- flag to delete old table/rows when loading (?)
+- fully automated updates, cron, luigi/gluish style
+    => downloads/uploads source metadata files
 - check that all fields actually getting imported reasonably
+- efficient fatcat export
+    => filters for changes to make
+    => not really necessary, fatcat importer already skips
-- could poll portal.issn.org like:
-    https://portal.issn.org/resource/ISSN/1561-7645?format=json
-    would require a good deal of munging (eg, MARC region -> ISO)
-- KBART imports (with JSON, so only a single row per slug)
+## Schema
+
+- `platform` column in database
+- `container_type` column in database
+    => munge this in various ways
+    => if title is "blah, Proceedings of the", set type to proceedings and re-write title
+    => if title like "Workshop on", set type
 - imprint/publisher distinction (publisher is big group)
 - summary table should be superset of fatcat table
-- add timestamp columns to enable updates?
-- fatcat export (filters for changes to make, writes out as JSON)
-- update_url_status (needs re-write)
-- log out index issues (duplicate ISSN-L, etc) to a file
-- validate against GOLD OA list
-
+- `update_url_status` (needs re-write) (?)
-- 
cgit v1.2.3