author    | Bryan Newbold <bnewbold@archive.org> | 2019-12-24 13:38:07 -0800
committer | Bryan Newbold <bnewbold@archive.org> | 2019-12-24 13:38:07 -0800
commit    | 46e4b69c28f6132e3ae08a2e6e5bbb065458de28 (patch)
tree      | bff65826873ed68adb997c445ce91c6b7e8df18b
parent    | 3232f9509404c75777f23d7272416d8de4a45789 (diff)
download  | chocula-46e4b69c28f6132e3ae08a2e6e5bbb065458de28.tar.gz
          | chocula-46e4b69c28f6132e3ae08a2e6e5bbb065458de28.zip
update README with better directions
-rw-r--r-- | README.md         | 57
-rw-r--r-- | README_chocula.md |  7
2 files changed, 48 insertions, 16 deletions
diff --git a/README.md b/README.md
@@ -2,7 +2,24 @@
 
 Chocula is a python script to parse and merge journal-level metadata from
 various sources into a consistent sqlite3 database file for analysis.
 
-See `chocula_schema.sql` for the output schema.
+## Quickstart
+
+You need `python3`, `pipenv`, and `sqlite3` installed.
+
+First, fetch the datasets:
+
+    cd data
+    ./fetch.sh
+    cd ..
+
+Then re-generate the entire sqlite3 database from scratch:
+
+    pipenv shell
+    ./chocula.py everything
+
+Now you can explore the database; see `chocula_schema.sql` for the output schema:
+
+    sqlite3 chocula.sqlite
 
 ## History / Name
@@ -17,6 +34,7 @@ The name "chocula" comes from a half-baked pun on Count Chocula... something
 something counting, serials, cereal. [Read more about Count
 Chocula](https://teamyacht.com/ernstchoukula.com/Ernst-Choukula.html).
 
+
 ## ISSN-L Munging
 
 Unfortunately, there seem to be plenty of legitimate ISSNs that don't end up in
@@ -36,7 +54,7 @@
 filters out "unknown" ISSN-Ls unless they are coming from existing fatcat
 entities.
 
-## Sources (out of date)
+## Sources
 
 The `./data/fetch.sh` script will fetch mirrored snapshots of all these
 datasets.
@@ -88,18 +106,39 @@ In order of precedence (first higher than later):
 - IA homepage crawl attempts
 
 The SHERPA/ROMEO content comes from the list helpfully munged by moreo.info.
+UPDATE: this site is now defunct (404).
 
 General form here is to build a huge python dict in memory, keyed by the
 ISSN-L, then write out to disk as JSON. Then the journal-metadata importer
 takes a subset of fields and inserts to fatcat. Lastly, the elasticsearch
 transformer takes a subset/combination of
 
-## Python Helpers/Libraries
-
-- ftfy
-- pycountry
-
-Debian:
-
-    sudo apt install python3-pycountry
-    sudo pip3 install ftfy
+## Fatcat Container Counts
+
+Generate a list of ISSN-L identifiers, fetch each from the fatcat web
+pseudo-API, and write to JSON:
+
+    cat container_export.json | jq .issnl -r | sort -u > container_issnl.tsv
+    cat container_issnl.tsv | parallel -j10 curl -s 'https://fatcat.wiki/container/issnl/{}/stats.json' > container_stats.json
+
+Then load into chocula and recalculate stats:
+
+    pipenv shell
+    ./chocula.py load_fatcat_stats container_stats.json
+    ./chocula.py summarize
+
+    # also, upload stats to an IA item, update fetch.sh and chocula.py variables
+
+## Journal Homepage Crawl Status
+
+The `check_issn_urls.py` script tries crawling journal homepages to see if they
+are "live" on the web. To regenerate these stats:
+
+    # assuming you have a fresh database
+    pipenv shell
+    ./chocula.py export_urls | shuf > urls_to_crawl.shuf.tsv
+    parallel -j10 --bar --pipepart -a urls_to_crawl.shuf.tsv ./check_issn_urls.py > url_status.json
+    ./chocula.py update_url_status url_status.json
+    ./chocula.py summarize
+
+You might also want to upload the results at this point.
diff --git a/README_chocula.md b/README_chocula.md
deleted file mode 100644
index 12d695e..0000000
--- a/README_chocula.md
+++ /dev/null
@@ -1,7 +0,0 @@
-
-## Fatcat Container Counts
-
-    cat container_export.json | jq .issnl -r | sort -u > container_issnl.tsv
-    cat container_issnl.tsv | parallel -j20 curl -s 'https://fatcat.wiki/container/issnl/{}/stats.json' > container_stats.json
-
-Takes... more than 5 minutes but less than an hour.
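The Quickstart above ends by opening the database with `sqlite3 chocula.sqlite`. For scripted exploration, here is a minimal sketch using Python's built-in `sqlite3` module; it only lists tables, so it works regardless of the exact schema defined in `chocula_schema.sql`:

    import sqlite3

    # Open the database produced by `./chocula.py everything`.
    conn = sqlite3.connect("chocula.sqlite")

    # List every table in the file; this query works on any sqlite3
    # database, regardless of schema.
    for (name,) in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"):
        print(name)

    conn.close()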
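The "general form" paragraph in the diff (build a huge python dict in memory, keyed by ISSN-L, then write out to disk as JSON) could look roughly like the sketch below. This illustrates the pattern only and is not the actual chocula code; the source names, record layout, and function names are assumptions:

    import json

    def merge_sources(sources):
        """Merge per-source journal records into one dict keyed by ISSN-L.

        `sources` maps a source name (e.g. "doaj") to a list of record
        dicts, each assumed to carry an "issnl" field.
        """
        journals = {}
        for source_name, records in sources.items():
            for record in records:
                issnl = record.get("issnl")
                if not issnl:
                    continue  # no ISSN-L to key on; skip the record
                entry = journals.setdefault(issnl, {"issnl": issnl})
                # keep each source's metadata side by side for later merging
                entry[source_name] = record
        return journals

    def write_json(journals, path="journals.json"):
        # write the whole in-memory dict out to disk as JSON
        with open(path, "w") as f:
            json.dump(journals, f, indent=2)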
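The homepage-crawl pipeline drives `check_issn_urls.py` with `parallel`, feeding it URLs and collecting one JSON status object per line. A stand-in sketch of that core check using only the standard library; this is not the real script's logic (which presumably handles redirects, error classification, and so on), and the TSV column layout is an assumption:

    import json
    import sys
    import urllib.request

    def check_url(url, timeout=30):
        """Fetch a homepage URL and report whether it looks 'live'."""
        try:
            req = urllib.request.Request(url, method="HEAD")
            resp = urllib.request.urlopen(req, timeout=timeout)
            return {"url": url, "status_code": resp.status,
                    "live": resp.status == 200}
        except Exception as err:
            return {"url": url, "status_code": None, "live": False,
                    "error": str(err)}

    if __name__ == "__main__":
        # Read URLs on stdin and emit one JSON status object per line,
        # so the script can be driven by `parallel --pipepart` as above.
        for line in sys.stdin:
            url = line.strip().split("\t")[-1]  # assumes URL is the last TSV column
            if url:
                print(json.dumps(check_url(url)))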