From 8f6afe1d0e355d14fd39212da407ed17f9651550 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Tue, 2 Jun 2020 18:43:00 -0700
Subject: re-write README

---
 README.md | 153 +++++++++++++++++++++++---------------------------------------
 1 file changed, 57 insertions(+), 96 deletions(-)

diff --git a/README.md b/README.md
index aec785d..f9a47b5 100644
--- a/README.md
+++ b/README.md
@@ -1,26 +1,37 @@
-Chocula is a python script to parse and merge journal-level metadata from
-various sources into a consistent sqlite3 database file for analysis.
+Chocula: Scholarly Journal Metadata Munging
+===========================================
+
+**Chocula** is a python tool for parsing and merging journal-level metadata
+from various sources into a sqlite3 database file for analysis. It is currently
+the main source of journal-level metadata for the [fatcat](https://fatcat.wiki)
+catalog of published papers.
 
 ## Quickstart
 
-You need `python3`, `pipenv`, and `sqlite3` installed.
+You need `python3.7`, `pipenv`, and `sqlite3` installed. Commands are run via
+`make`. If you don't have `python3.7` installed system-wide, try installing
+`pyenv`.
 
-First fetch datasets:
+Set up dependencies and fetch source metadata:
 
-    cd data
-    ./fetch.sh
-    cd ..
+    make deps fetch-sources
 
 Then re-generate entire sqlite3 database from scratch:
 
-    pipenv shell
-    ./chocula.py everything
+    make database
 
 Now you can explore the database; see `chocula_schema.sql` for the output
 schema.
 
     sqlite3 chocula.sqlite
 
+## Developing
+
+There is partial test coverage, and we verify python type annotations. Run the
+tests with:
+
+    make test
+
 ## History / Name
 
 This is the 3rd or 4th iteration of open access journal metadata munging as
@@ -54,91 +65,41 @@
 filters out "unknown" ISSN-Ls unless they are coming from existing fatcat
 entities.
 
-## Sources
-
-The `./data/fetch.sh` script will fetch mirrored snapshots of all these
-datasets.
-
-A few sources of normalization/mappings:
-
-- ISSN-L (from ISSN org)
-  - Original:
-  - Snapshot:
-- ISO 639-1 language codes: https://datahub.io/core/language-codes
-- ISO 3166-1 alpha-2 country codes
-
-In order of precedence (first higher than later):
-
-- NCBI Entrez (Pubmed)
-  - Original:
-  - Snapshot:
-- DOAJ
-  - Original:
-  - Snapshot:
-- ROAD
-  - Original:
-  - Snapshot:
-- SHERPA/ROMEO
-  - Original: (requires reg)
-  - Mirror:
-  - Snapshot:
-- Norwegian Registry
-  - Original:
-  - Snapshot:
-- Wikidata via SPARQL Query
-  - SPARQL:
-  - Snapshot:
-- KBART reports: LOCKSS, CLOCKSS, Portico
-  - Original: (multiple, see README in IA item)
-  - Snapshot:
-- JSTOR
-  - Original:
-- Crossref title list (not DOIs)
-  - Original:
-  - Snapshot:
-- OpenAPC Dataset
-  - Original:
-  - Snapshot:
-- EZB Metadata
-  - Snapshot:
-- IA SIM Microfilm catalog
-  - Original:
-- IA homepage crawl attempts
-
-The SHERPA/ROMEO content comes from the list helpfully munged by moreo.info.
-UPDATE: this site is now defunct (404).
-
-General form here is to build a huge python dict in memory, keyed by the
-ISSN-L, then write out to disk as JSON. Then the journal-metadata importer
-takes a subset of fields and inserts to fatcat. Lastly, the elasticsearch
-transformer takes a subset/combination of
-
-## Fatcat Container Counts
-
-Generate a list of ISSN-L identifiers, fetch each from fatcat web peudo-API, and write to JSON.
-
-    cat container_export.json | jq .issnl -r | sort -u > container_issnl.tsv
-    cat container_issnl.tsv | parallel -j10 curl -s 'https://fatcat.wiki/container/issnl/{}/stats.json' | jq -c . > container_stats.json
-
-Then load in to chocula and recaculate stats:
-
-    pipenv shell
-    ./chocula.py load_fatcat_stats container_stats.json
-    ./chocula.py summarize
-
-    # also, upload stats to an IA item, update fetch.sh and chocula.py variables
-
-## Journal Homepage Crawl Status
-
-The `check_issn_urls.py` script tries crawling journal homepages to see if they
-are "live" on the web. To regenerate these stats:
-
-    # assuming you have a fresh database
-    pipenv shell
-    ./chocula.py export_urls | shuf > urls_to_crawl.tsv
-    parallel -j10 --bar --pipepart -a urls_to_crawl.shuf.tsv ./check_issn_urls.py > url_status.json
-    ./chocula.py update_url_status url_status.json
-    ./chocula.py summarize
-
-Might also want to upload results at this point.
+## Source Metadata
+
+The `sources.toml` configuration file contains a canonical list of metadata
+files, the last time they were updated, and original URLs for mirrored files.
+The general workflow is that all metadata files are bundled into "source
+snapshots" and uploaded/downloaded from the Internet Archive (archive.org)
+together.
+
+There is some tooling (`make update-sources`) to automatically download fresh
+copies of some files. Others need to be fetched manually. In all cases, new
+files are not automatically integrated: they are added to a sub-folder of
+`./data/` and must be manually copied and `sources.toml` updated with the
+appropriate date before they will be used.
+
+Some sources of metadata were helpfully pre-parsed by the maintainer of
+moreo.info. Unfortunately this site is now defunct and the metadata
+is out of date.
+
+Adding new directories or KBART preservation providers is relatively easy, by
+creating new helpers in `chocula/directories/` and/or `chocula/kbart.py`.
+
+## Updating Homepage Status and Container Counts
+
+Run these commands from a fast connection; they will run with parallel
+processes. These hit only public URLs and API
+endpoints, but you would probably have the best luck running these from inside
+the Internet Archive cluster IP space:
+
+    make data/2020-06-03/homepage_status.json
+    make data/2020-06-03/container_status.json
+
+Then copy these files to `data/` (no sub-directory) and update the dates in
+`sources.toml`. Update the sqlite database with:
+
+    pipenv run python -m chocula load_fatcat_stats
+    pipenv run python -m chocula load_homepage_status
+    pipenv run python -m chocula summarize

--
cgit v1.2.3
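
As a supplement to the Quickstart in the patch above, here is a minimal sketch of
exploring the resulting database from Python's standard `sqlite3` module. The
`chocula.sqlite` filename comes from the Quickstart; the `journal` table and
`publisher_type` column names used in the second query are assumptions for
illustration only, so check `chocula_schema.sql` for the actual schema:

    import sqlite3

    # Open the database produced by `make database` (filename from the Quickstart).
    con = sqlite3.connect("chocula.sqlite")

    # List the tables actually present; chocula_schema.sql documents their columns.
    tables = [row[0] for row in con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    print(tables)

    # Hypothetical journal-level query: the 'journal' table and 'publisher_type'
    # column are assumptions here; adjust to the real schema.
    for publisher_type, count in con.execute(
            "SELECT publisher_type, COUNT(*) FROM journal "
            "GROUP BY publisher_type ORDER BY COUNT(*) DESC LIMIT 10"):
        print(publisher_type, count)

    con.close()

The same queries can be run directly in the `sqlite3 chocula.sqlite` shell shown
in the Quickstart.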