This folder contains scripts to merge journal metadata from multiple sources
and provide a snapshot for bulk importing into fatcat.

Specific bots will probably be needed to do continuous updates; that's out of
scope for this first import.

## Sources

The `./data/fetch.sh` script will fetch mirrored snapshots of all these
datasets.

A few sources of normalization/mappings:

- ISSN-L (from ISSN org)
    - Original:
    - Snapshot:
- ISO 639-1 language codes: https://datahub.io/core/language-codes
- ISO 3166-1 alpha-2 country codes

In order of precedence (earlier sources take priority over later ones):

- NCBI Entrez (Pubmed)
    - Original:
    - Snapshot:
- DOAJ
    - Original:
    - Snapshot:
- ROAD
    - Original:
    - Snapshot:
- SHERPA/ROMEO
    - Original: (requires reg)
    - Mirror:
    - Snapshot:
- Norwegian Registry
    - Original:
    - Snapshot:
- Wikidata (TODO: Journal-level not title-level)
    - Original:
    - Snapshot:
- KBART reports: LOCKSS, CLOCKSS, Portico
    - Original: (multiple, see README in IA item)
    - Snapshot:
- JSTOR
    - Original:
    - Snapshot:
- Crossref title list (not DOIs)
    - Original:
    - Snapshot:
- IA SIM Microfilm catalog
    - Original:
- IA homepage crawl attempts

The SHERPA/ROMEO content comes from the list helpfully munged by moreo.info.

The general approach is to build a huge Python dict in memory, keyed by ISSN-L,
then write it out to disk as JSON. The journal-metadata importer then takes a
subset of fields and inserts them into fatcat. Lastly, the elasticsearch
transformer takes a subset/combination of

## Python Helpers/Libraries

- ftfy
- pycountry

Debian:

    sudo apt install python3-pycountry
    sudo pip3 install ftfy
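
The merge approach described under Sources (one big dict keyed by ISSN-L, filled in order of source precedence, then dumped to JSON) can be sketched roughly as below. This is only an illustration, not the actual scripts in this folder: the function names `merge_sources` and `write_snapshot`, the source keys, and the record field names (`issnl`, `name`, and so on) are all assumptions made for the example.

```python
import json

# Hypothetical source keys, highest precedence first. The real scripts may
# name and order their inputs differently.
SOURCE_PRECEDENCE = ["entrez", "doaj", "road"]


def merge_sources(records_by_source, precedence=SOURCE_PRECEDENCE):
    """Merge per-source journal records into one dict keyed by ISSN-L.

    `records_by_source` maps a source key to a list of dicts. Each dict is
    assumed to carry an "issnl" field (an assumption for this sketch).
    """
    merged = {}
    for source in precedence:
        for record in records_by_source.get(source, []):
            issnl = record.get("issnl")
            if not issnl:
                # Without an ISSN-L there is nothing to key on, so skip it.
                continue
            entry = merged.setdefault(issnl, {"issnl": issnl})
            for key, value in record.items():
                if value is None or value == "":
                    continue
                # Sources are visited from highest to lowest precedence, so
                # a field that is already set is never overwritten.
                entry.setdefault(key, value)
    return merged


def write_snapshot(merged, path):
    """Write the merged dict to disk as a single JSON document."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(merged, f, indent=2, sort_keys=True, ensure_ascii=False)
```

With made-up input such as `{"doaj": [{"issnl": "1234-5678", "name": "B", "lang": "en"}], "entrez": [{"issnl": "1234-5678", "name": "A"}]}`, the merged entry keeps `name` from the higher-precedence `entrez` record and picks up `lang` from `doaj`, since `entrez` did not supply it.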