Chocula: Scholarly Journal Metadata Munging
===========================================

<div align="center">
<img src="extra/count_chocula.jpg">
</div>

**Chocula** is a Python tool for parsing and merging journal-level metadata
from various sources into a sqlite3 database file for analysis. It is currently
the main source of journal-level metadata for the [fatcat](https://fatcat.wiki)
catalog of published papers.

## Quickstart

You need `python3.8`, `pipenv`, and `sqlite3` installed. Commands are run via
`make`. If you don't have `python3.8` installed system-wide, try installing
with `pyenv`.

Set up dependencies and fetch source metadata:

    make dep fetch-sources

Then re-generate the entire sqlite3 database from scratch:

    make database

Now you can explore the database; see `chocula_schema.sql` for the output schema.

    sqlite3 chocula.sqlite
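
If you prefer to poke at the database from Python, here is a minimal sketch
using the standard-library `sqlite3` module. The `journal` table and column
names below are assumptions for illustration; check `chocula_schema.sql` for
the actual schema:

    import sqlite3

    con = sqlite3.connect("chocula.sqlite")

    # Assumed table/column names; see chocula_schema.sql for the real ones
    (count,) = con.execute("SELECT COUNT(*) FROM journal").fetchone()
    print(f"{count} journals in the database")

    for issnl, name in con.execute("SELECT issnl, name FROM journal LIMIT 5"):
        print(issnl, name)

    con.close()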

## Developing

There is partial test coverage, and we verify Python type annotations. Run the
tests with:

    make test

## History / Name

This is the 3rd or 4th iteration of open access journal metadata munging as
part of the fatcat project; earlier attempts were crude ISSN spreadsheet
munging, then the `oa-journals-analysis` repo (Jupyter notebook and a web
interface), then the `fatcat:extra/journal_metadata/` script for bootstrapping
fatcat container metadata. This repo started as the fatcat `journal_metadata`
directory and retains the git history of that folder.

The name "chocula" comes from a half-baked pun on Count Chocula... something
something counting, serials, cereal.
[Read more about Count Chocula](https://teamyacht.com/ernstchoukula.com/Ernst-Choukula.html).


## ISSN-L Munging

Unfortunately, there seem to be plenty of legitimate ISSNs that don't end up in
the ISSN-L table. On the portal.issn.org public site, these are listed as:

    "This provisional record has been produced before publication of the
    resource.  The published resource has not yet been checked by the ISSN
    Network.It is only available to subscribing users."

For example:

- 2199-3246/2199-3254: Digital Experiences in Mathematics Education

Previously these were allowed into fatcat, so some 2,000+ such entries exist;
this also let through at least 110 totally bogus ISSNs. Currently, chocula
filters out "unknown" ISSN-Ls unless they come from existing fatcat entities.
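
As an aside, many structurally bogus ISSNs can be caught with the standard
ISSN check digit. Here is a minimal sketch in Python; this shows the published
algorithm for illustration, not necessarily the exact validation chocula
performs:

    import re

    def issn_checksum_valid(issn: str) -> bool:
        """Check an ISSN's check digit (last character; 'X' means 10)."""
        if not re.fullmatch(r"\d{4}-\d{3}[0-9X]", issn):
            return False
        digits = issn.replace("-", "")
        # Weighted sum of the first seven digits, weights 8 down to 2
        total = sum(int(c) * w for c, w in zip(digits[:7], range(8, 1, -1)))
        check = (11 - total % 11) % 11
        return digits[7] == ("X" if check == 10 else str(check))

    # Both halves of the example ISSN pair above pass the checksum
    assert issn_checksum_valid("2199-3246")
    assert issn_checksum_valid("2199-3254")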


## Source Metadata

The `sources.toml` configuration file contains a canonical list of metadata
files, the last time they were updated, and original URLs for mirrored files.
The general workflow is that all metadata files are bundled into "source
snapshots" and uploaded to and downloaded from the Internet Archive
(archive.org) together.

There is some tooling (`make update-sources`) to automatically download fresh
copies of some files; others need to be fetched manually. In all cases, new
files are not automatically integrated: they are added to a sub-folder of
`./data/`, and must be manually copied into place and `sources.toml` updated
with the appropriate date, before they will be used.

Some sources of metadata were helpfully pre-parsed by the maintainer of
<https://moreo.info>. Unfortunately this site is now defunct and the metadata
is out of date.

Adding new directories or KBART preservation providers is relatively easy:
create new helpers in `chocula/directories/` and/or `chocula/kbart.py`.
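
As a rough, hypothetical sketch of the shape such a helper might take (the
class name, record fields, and any registration mechanism below are
illustrative assumptions, not the real chocula API; read the existing helpers
in `chocula/directories/` for the actual interface):

    from typing import Iterator

    # Hypothetical directory helper; real helpers follow whatever base
    # class and registration scheme chocula/directories/ defines
    class ExampleDirectoryLoader:
        source_slug = "example"  # assumed key referenced from sources.toml

        def parse_file(self, path: str) -> Iterator[dict]:
            # Parse one mirrored TSV file into per-journal records;
            # the field names yielded here are illustrative only
            with open(path) as f:
                for line in f:
                    issn, _, name = line.rstrip("\n").partition("\t")
                    if issn:
                        yield {"issn": issn, "name": name}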

## Updating Homepage Status and Container Counts

Run these commands from a machine with a fast connection; they use parallel
processes. They hit only public URLs and API endpoints, but you will probably
have the best luck running them from inside the Internet Archive cluster IP
space:

    make data/2020-06-03/homepage_status.json
    make data/2020-06-03/container_stats.json

Then copy these files to `data/` (no sub-directory) and update the dates in
`sources.toml`. Update the sqlite database with:

    pipenv run python -m chocula load_fatcat_stats
    pipenv run python -m chocula load_homepage_status
    pipenv run python -m chocula summarize