From 92189ad99ae7f799377a0fcbb928e09ff1f82a79 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Thu, 24 Jan 2019 13:06:09 -0800
Subject: first-pass journal metadata munger

---
 extra/journal_metadata/README.md | 71 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)
 create mode 100644 extra/journal_metadata/README.md

diff --git a/extra/journal_metadata/README.md b/extra/journal_metadata/README.md
new file mode 100644
index 00000000..61dbc6b0
--- /dev/null
+++ b/extra/journal_metadata/README.md
@@ -0,0 +1,71 @@
+
+This folder contains scripts to merge journal metadata from multiple sources and
+provide a snapshot for bulk importing into fatcat.
+
+Specific bots will probably be needed to do continuous updates; that's out of
+scope for this first import.
+
+
+## Sources
+
+The `./data/fetch.sh` script will fetch mirrored snapshots of all these
+datasets.
+
+A few sources of normalization/mappings:
+
+- ISSN-L (from ISSN org)
+  - Original:
+  - Snapshot:
+- ISO 639-1 language codes: https://datahub.io/core/language-codes
+- ISO 3166-1 alpha-2 country codes
+
+In order of precedence (earlier sources override later ones):
+
+- NCBI Entrez (PubMed)
+  - Original:
+  - Snapshot:
+- DOAJ
+  - Original:
+  - Snapshot:
+- ROAD
+  - Original:
+  - Snapshot:
+- SHERPA/ROMEO
+  - Original: (requires reg)
+  - Mirror:
+  - Snapshot:
+- Norwegian Registry
+  - Original:
+  - Snapshot:
+- Wikidata (TODO: Journal-level not title-level)
+  - Original:
+  - Snapshot:
+- KBART reports: LOCKSS, CLOCKSS, Portico
+  - Original: (multiple, see README in IA item)
+  - Snapshot:
+- JSTOR
+  - Original:
+  - Snapshot:
+- Crossref title list (not DOIs)
+  - Original:
+  - Snapshot:
+- IA SIM Microfilm catalog
+  - Original:
+- IA homepage crawl attempts
+
+The SHERPA/ROMEO content comes from the list helpfully munged by moreo.info.
+
+The general approach is to build a huge Python dict in memory, keyed by
+ISSN-L, then write it out to disk as JSON. Then the journal-metadata importer
+takes a subset of fields and inserts them into fatcat. Lastly, the Elasticsearch
+transformer takes a subset/combination of
+
+## Python Helpers/Libraries
+
+- ftfy
+- pycountry
+
+Debian:
+
+    sudo apt install python3-pycountry
+    sudo pip3 install ftfy
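
The merge step the README describes (one in-memory dict keyed by ISSN-L, filled from sources in precedence order, then dumped as JSON) could look roughly like the sketch below. This is not the script from this commit: the source names, field names, ISSN-L values, and the `merge_sources` / `write_snapshot` helpers are all hypothetical, and it assumes each source has already been parsed into a dict of per-journal fields.

```python
import json

# Hypothetical parsed sources, listed from highest to lowest precedence.
# The ISSN-L keys and field names are illustrative only.
SOURCES = [
    ("entrez", {"1234-5678": {"name": "Journal of Examples", "country": "us"}}),
    ("doaj", {
        "1234-5678": {"name": "J. Examples", "languages": ["en"]},
        "8765-4321": {"name": "Example Letters"},
    }),
]


def merge_sources(sources):
    """Merge per-source records into one dict keyed by ISSN-L.

    Sources are visited in precedence order, so the first source that
    supplies a field wins and later sources only fill in missing fields.
    """
    merged = {}
    for source_name, records in sources:
        for issnl, fields in records.items():
            entry = merged.setdefault(issnl, {"issnl": issnl, "sources": []})
            entry["sources"].append(source_name)
            for key, value in fields.items():
                # setdefault leaves a value set by an earlier source untouched
                entry.setdefault(key, value)
    return merged


def write_snapshot(merged, path):
    """Write the merged dict to disk as a single JSON document."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(merged, f, indent=2, sort_keys=True)


if __name__ == "__main__":
    write_snapshot(merge_sources(SOURCES), "journal_metadata.json")
```

Iterating in precedence order and using `setdefault` means no later source can overwrite an earlier one, which matches the "first higher than later" rule without any explicit priority comparison.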