author | Bryan Newbold <bnewbold@archive.org> | 2019-07-31 17:24:40 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2019-07-31 17:24:40 -0700 |
commit | fc54908da2aad233e2943dd62fd8c0d93120011c | |
tree | f3e20e7b3b7f372bab4c9463b0927277e07f1b50 | |
parent | 601e23d816282e125c48b9224d0e47f45d06f9f8 | |
download | chocula-fc54908da2aad233e2943dd62fd8c0d93120011c.tar.gz chocula-fc54908da2aad233e2943dd62fd8c0d93120011c.zip |
README update
-rw-r--r-- | README.md | 56 |
1 files changed, 35 insertions, 21 deletions
@@ -1,12 +1,42 @@
-This folder contains scripts to merge journal metadat from multiple sources and
-provide a snapshot for bulk importing into fatcat.
+Chocula is a python script to parse and merge journal-level metadata from
+various sources into a consistent sqlite3 database file for analysis.

-Specific bots will probably be needed to do continous updates; that's out of
-scope for this first import.
+See `chocula_schema.sql` for the output schema.


+## History / Name

-## Sources
+This is the 3rd or 4th iteration of open access journal metadata munging as
+part of the fatcat project; earlier attempts were crude ISSN spreadsheet
+munging, then the `oa-journals-analysis` repo (Jupyter notebook and a web
+interface), then the `fatcat:extra/journal_metadata/` script for bootstrapping
+fatcat container metadata. This repo started as the fatcat `journal_metadata`
+directory and retains the git history of that folder.
+
+The name "chocula" comes from a half-baked pun on Count Chocula... something
+something counting, serials, cereal.
+[Read more about Count Chocula](https://teamyacht.com/ernstchoukula.com/Ernst-Choukula.html).
+
+## ISSN-L Munging
+
+Unfortunately, there seem to be plenty of legitimate ISSNs that don't end up in
+the ISSN-L table. On the portal.issn.org public site, these are listed as:
+
+    "This provisional record has been produced before publication of the
+    resource. The published resource has not yet been checked by the ISSN
+    Network.It is only available to subscribing users."
+
+For example:
+
+- 2199-3246/2199-3254: Digital Experiences in Mathematics Education
+
+Previously these were allowed through into fatcat, so some 2000+ entries exist.
+This allowed through at least 110 totally bogus ISSNs. Currently, chocula
+filters out "unknown" ISSN-Ls unless they are coming from existing fatcat
+entities.
+
+
+## Sources (out of date)

 The `./data/fetch.sh` script will fetch mirrored snapshots of all these datasets.

@@ -60,22 +90,6 @@ ISSN-L, then write out to disk as JSON. Then the journal-metadata importer
 takes a subset of fields and inserts to fatcat. Lastly, the elasticsearch
 transformer takes a subset/combination of

-## ISSN-L Munging
-
-Unfortunately, there seem to be plenty of legitimate ISSNs that don't end up in
-the ISSN-L table. On the portal.issn.org public site, these are listed as:
-
-    "This provisional record has been produced before publication of the
-    resource. The published resource has not yet been checked by the ISSN
-    Network.It is only available to subscribing users."
-
-For example:
-
-- 2199-3246/2199-3254: Digital Experiences in Mathematics Education
-
-Instead of just dropping these entirely, we're currently munging these by
-putting the electronic or print ISSN in the ISSN-L position.
-
 ## Python Helpers/Libraries

 - ftfy
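
The filtering rule described in the new "ISSN-L Munging" README section (drop "unknown" ISSN-Ls unless they come from existing fatcat entities) could be sketched roughly as below. This is only an illustration under stated assumptions, not chocula's actual code: the `Record` type, the `filter_unknown_issnl` function, the `source` labels, and the placeholder ISSN `1234-5678` are all hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple, Set


class Record(NamedTuple):
    """Hypothetical journal record; not chocula's real schema."""
    issnl: str   # candidate ISSN-L for the journal
    source: str  # where the record came from (labels are illustrative)


def filter_unknown_issnl(
    records: Iterable[Record],
    known_issnl: Set[str],
    trusted_source: str = "fatcat",  # assumed label for existing fatcat entities
) -> Iterator[Record]:
    """Yield records whose ISSN-L is in the known table, or which come
    from the trusted source even though the ISSN-L is unknown."""
    for rec in records:
        if rec.issnl in known_issnl or rec.source == trusted_source:
            yield rec


# "1234-5678" is a placeholder standing in for an entry in the ISSN-L table;
# 2199-3246 is the README's example of an ISSN missing from that table.
known = {"1234-5678"}
records = [
    Record("1234-5678", "doaj"),    # known ISSN-L: kept
    Record("2199-3246", "fatcat"),  # unknown, but from fatcat: kept
    Record("2199-3246", "doaj"),    # unknown, other source: dropped
]
kept = list(filter_unknown_issnl(records, known))
```

Passing the known ISSN-L values as a set keeps each membership check constant-time, which matters if the table is large.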