priorities:
- coverage stats, particularly for longtail
- "still in print" flag
- clean out invalid ISSN-L from fatcat
- don't list dead URLs in fatcat
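
For the invalid ISSN-L cleanup, a minimal sketch of a format and check-digit test. It uses the standard ISSN mod-11 checksum (weights 8 down to 2, `X` standing for 10). The function names are hypothetical, and a passing checksum only shows the string is well-formed, not that it is a registered ISSN-L.

```python
import re

# Four digits, a hyphen, three digits, then a digit or X as check character.
ISSN_RE = re.compile(r"[0-9]{4}-[0-9]{3}[0-9X]")


def issn_check_char(first_seven: str) -> str:
    """Compute the ISSN check character for the first seven digits."""
    # Weights run 8, 7, ..., 2 across the seven digits.
    total = sum(int(d) * w for d, w in zip(first_seven, range(8, 1, -1)))
    value = (11 - total % 11) % 11
    return "X" if value == 10 else str(value)


def is_valid_issn(raw: str) -> bool:
    """True if `raw` looks like NNNN-NNNC and the check character matches."""
    issn = raw.strip().upper()
    if not ISSN_RE.fullmatch(issn):
        return False
    digits = issn.replace("-", "")
    return issn_check_char(digits[:7]) == digits[7]
```

Rows whose ISSN-L fails this test would be candidates for removal or manual review.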

## Sources

- PKP OJS index
    => mostly redundant with DOAJ?
- dblp conferences/series
    => no container-only metadata dump available?
- MAG
- vanished journals
    => https://github.com/njahn82/vanished_journals
    => https://isaw.nyu.edu/publications/awol-index/
- sherpa/romeo refactor (no more updates)
- entrez refactor (no more updates)
- unpaywall journal-level classification
    => ask for journal-level dump or do munging
- SERP homepage munging
- repositories (?)
- jurn matches
- datacite metadata (?)
    => via munging
- curated quality lists (eg, national libraries)
    => https://www.arc.gov.au/excellence-research-australia
- public scopus list (?)
- scrape/munge public clarivate dumps
- "GOLD" importer (for scopus/WoS)
- ISSN metadata from portal.issn.org
    => scraping is done
    => only for ISSN-Ls from existing table
    => https://portal.issn.org/resource/ISSN/1561-7645?format=json
    => would require a good deal of munging (eg, MARC region -> ISO) (?)
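
A rough sketch of the two pieces of the portal.issn.org item: building the per-ISSN JSON URL (pattern taken from the example URL above) and mapping a MARC country code to ISO 3166-1 alpha-2. The function names are hypothetical, the mapping table is a deliberately tiny illustrative subset, and the form in which the portal JSON exposes the region is an assumption that would need checking against real records.

```python
from typing import Optional
from urllib.parse import quote

# URL pattern copied from the example in this TODO.
PORTAL_TEMPLATE = "https://portal.issn.org/resource/ISSN/{issn}?format=json"

# Illustrative subset only; a real table would cover the full MARC country list.
MARC_TO_ISO = {
    "xxu": "US",  # United States
    "xxk": "GB",  # United Kingdom
    "gw": "DE",   # Germany
    "fr": "FR",   # France
}


def portal_json_url(issn_l: str) -> str:
    """Build the portal.issn.org JSON URL for one ISSN-L."""
    return PORTAL_TEMPLATE.format(issn=quote(issn_l.strip().upper()))


def marc_to_iso(marc_code: str) -> Optional[str]:
    """Map a MARC country code to ISO alpha-2, or None if not in the table."""
    return MARC_TO_ISO.get(marc_code.strip().lower())
```

Returning `None` for unknown codes keeps unmapped regions visible rather than guessing at them.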

improvements:
- entrez: "NLM Unique Id"
- JURN: finish 
- crossref: empty string identifiers?

## Code / Behavior

- black (syntax)
- log out index issues (duplicate ISSN-L, etc) to a file
- flag to delete old table/rows when loading (?)
- fully automated updates, cron, luigi/gluish style
    => downloads/uploads source metadata files
- check that all fields are actually getting imported reasonably
- efficient fatcat export
    => filters for changes to make
    => not really necessary, fatcat importer already skips
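
One possible shape for the "log index issues to a file" item, as a sketch only. It assumes each loaded row is a dict with hypothetical `issnl` and `source` keys; the real row format and issue types may differ. Duplicates are reported as JSON lines so the log is easy to grep or reload.

```python
import json
from collections import defaultdict


def find_duplicate_issnl(rows):
    """Group rows by ISSN-L and keep only groups with more than one row."""
    by_issnl = defaultdict(list)
    for row in rows:
        issnl = row.get("issnl")
        if issnl:  # rows without an ISSN-L are a separate kind of issue
            by_issnl[issnl].append(row)
    return {k: v for k, v in by_issnl.items() if len(v) > 1}


def issue_lines(duplicates):
    """Render each duplicate group as one JSON line, sorted by ISSN-L."""
    lines = []
    for issnl, rows in sorted(duplicates.items()):
        record = {
            "issue": "duplicate-issnl",
            "issnl": issnl,
            "count": len(rows),
            "sources": sorted({r.get("source", "unknown") for r in rows}),
        }
        lines.append(json.dumps(record, sort_keys=True))
    return lines


def write_issue_log(lines, path):
    """Write the rendered issue lines to `path`, one per line."""
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")
```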

## Schema

- `platform` column in database
- `container_type` column in database
    => munge this in various ways
    => if title is "blah,  Proceedings of the", set type to proceedings and re-write title
    => if title like "Workshop on", set type
- imprint/publisher distinction (publisher is big group)
- summary table should be superset of fatcat table
- `update_url_status` (needs re-write) (?)
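
The `container_type` title rules above could be sketched roughly like this. The function name and regexes are hypothetical; the note does not say which type a "Workshop on" title should get, so `WORKSHOP_TYPE` is an assumption to be adjusted.

```python
import re

# "blah, Proceedings of the" -> inverted title with the prefix moved to the end.
TRAILING_PROCEEDINGS = re.compile(
    r"^(?P<body>.+?),\s+(?P<prefix>Proceedings of the)\s*$", re.IGNORECASE
)
WORKSHOP = re.compile(r"\bWorkshop on\b", re.IGNORECASE)

# Assumption: the TODO does not name the type for workshop titles.
WORKSHOP_TYPE = "proceedings"


def munge_container(title):
    """Return (container_type, title); container_type is None if no rule matches."""
    title = " ".join(title.split())  # collapse runs of whitespace
    m = TRAILING_PROCEEDINGS.match(title)
    if m:
        # Re-write the inverted title into natural order.
        return "proceedings", f"{m.group('prefix')} {m.group('body')}"
    if WORKSHOP.search(title):
        return WORKSHOP_TYPE, title
    return None, title
```

Titles that match no rule pass through with only whitespace normalized, so the munging can be run repeatedly without further changes.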