Chocula: Scholarly Journal Metadata Munging
Chocula is a python tool for parsing and merging journal-level metadata from various sources into a sqlite3 database file for analysis. It is currently the main source of journal-level metadata for the fatcat catalog of published papers.
Quickstart
You need python3.8, pipenv, and sqlite3 installed. Commands are run via make. If you don't have python3.8 installed system-wide, try installing with pyenv.
Set up dependencies and fetch source metadata:
make dep fetch-sources
Then re-generate the entire sqlite3 database from scratch:
make database
Now you can explore the database; see chocula_schema.sql for the output schema.
sqlite3 chocula.sqlite
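As an alternative to the sqlite3 shell, the generated file can be inspected with Python's standard library. This is only a minimal sketch: the helper name `list_tables` is hypothetical (not part of chocula), and it assumes nothing about the schema beyond what sqlite itself records in `sqlite_master`. See chocula_schema.sql for the actual tables and columns.

```python
import sqlite3
from typing import Dict


def list_tables(db_path: str) -> Dict[str, int]:
    """Return a mapping of table name to row count for a sqlite3 file.

    Hypothetical helper for illustration; not part of chocula.
    """
    conn = sqlite3.connect(db_path)
    try:
        # sqlite_master is sqlite's built-in catalog of schema objects.
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # Names come from the catalog itself, so double-quoting them is enough here.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()
```

Calling `list_tables("chocula.sqlite")` after `make database` should give a quick overview of which tables were populated.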
Developing
There is partial test coverage, and we verify python type annotations. Run the tests with:
make test
History / Name
This is the 3rd or 4th iteration of open access journal metadata munging as
part of the fatcat project; earlier attempts were crude ISSN spreadsheet
munging, then the oa-journals-analysis repo (Jupyter notebook and a web
interface), then the fatcat:extra/journal_metadata/ script for bootstrapping
fatcat container metadata. This repo started as the fatcat journal_metadata
directory and retains the git history of that folder.
The name "chocula" comes from a half-baked pun on Count Chocula... something something counting, serials, cereal. Read more about Count Chocula.
ISSN-L Munging
Unfortunately, there seem to be plenty of legitimate ISSNs that don't end up in the ISSN-L table. On the portal.issn.org public site, these are listed as:
"This provisional record has been produced before publication of the
resource. The published resource has not yet been checked by the ISSN
Network. It is only available to subscribing users."
For example:
- 2199-3246/2199-3254: Digital Experiences in Mathematics Education
Previously these were allowed through into fatcat, so some 2000+ such entries exist. That policy also let through at least 110 totally bogus ISSNs. Currently, chocula filters out "unknown" ISSN-Ls unless they come from existing fatcat entities.
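The filtering rule described above can be sketched as a small predicate. This is an illustration only, not chocula's actual implementation: the function name `keep_issnl`, the in-memory set `known_issnls`, and the `from_fatcat` flag are all assumptions made for the example.

```python
from typing import Set


def keep_issnl(issnl: str, known_issnls: Set[str], from_fatcat: bool) -> bool:
    """Decide whether a record's ISSN-L should be accepted.

    known_issnls: ISSN-Ls present in the ISSN-L table (hypothetical in-memory set).
    from_fatcat: True when the record comes from an existing fatcat entity.
    """
    if issnl in known_issnls:
        return True
    # Unknown ISSN-L: tolerated only for records that already exist in fatcat.
    return from_fatcat
```

Under this rule a provisional ISSN such as 2199-3246 would be dropped when it arrives from a new source, but kept if fatcat already has an entity for it.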
Source Metadata
The sources.toml configuration file contains a canonical list of metadata
files, the last time they were updated, and original URLs for mirrored files.
The general workflow is that all metadata files are bundled into "source
snapshots" and uploaded/downloaded from the Internet Archive (archive.org)
together.
There is some tooling (make update-sources) to automatically download fresh
copies of some files. Others need to be fetched manually. In all cases, new
files are not automatically integrated: they are added to a sub-folder of
./data/ and must be manually copied and sources.toml updated with the
appropriate date before they will be used.
Some sources of metadata were helpfully pre-parsed by the maintainer of https://moreo.info. Unfortunately this site is now defunct and the metadata is out of date.
Adding new directories or KBART preservation providers is relatively easy, by
creating new helpers in chocula/directories/ and/or chocula/kbart.py.
Updating Homepage Status and Container Counts
Run these commands from a fast connection; they will run with parallel processes. These hit only public URLs and API endpoints, but you would probably have the best luck running these from inside the Internet Archive cluster IP space:
make data/2020-06-03/homepage_status.json
make data/2020-06-03/container_stats.json
Then copy these files to data/ (no sub-directory) and update the dates in sources.toml. Update the sqlite database with:
pipenv run python -m chocula load_fatcat_stats
pipenv run python -m chocula load_homepage_status
pipenv run python -m chocula summarize
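The manual copy step that precedes these commands (moving freshly generated files from a dated sub-directory up into data/) could be scripted along the following lines. This is a hedged sketch: `promote_snapshot_files` is a hypothetical helper, not something chocula provides, and it deliberately leaves sources.toml untouched, so the dates there still need to be updated by hand.

```python
import shutil
from pathlib import Path
from typing import List


def promote_snapshot_files(data_dir: str, date: str, names: List[str]) -> List[Path]:
    """Copy the named files from data_dir/<date>/ into data_dir/ itself.

    Hypothetical helper for illustration; does not edit sources.toml.
    """
    src_dir = Path(data_dir) / date
    copied = []
    for name in names:
        src = src_dir / name
        if not src.is_file():
            # Fail loudly rather than silently skipping a missing file.
            raise FileNotFoundError(src)
        dest = Path(data_dir) / name
        shutil.copy2(src, dest)  # copy2 also preserves file timestamps
        copied.append(dest)
    return copied
```

For the example above, the call would be `promote_snapshot_files("data", "2020-06-03", ["homepage_status.json", "container_stats.json"])`.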