author    Bryan Newbold <bnewbold@archive.org>  2019-12-24 13:38:07 -0800
committer Bryan Newbold <bnewbold@archive.org>  2019-12-24 13:38:07 -0800
commit    46e4b69c28f6132e3ae08a2e6e5bbb065458de28 (patch)
tree      bff65826873ed68adb997c445ce91c6b7e8df18b
parent    3232f9509404c75777f23d7272416d8de4a45789 (diff)
update README with better directions
-rw-r--r--  README.md          57
-rw-r--r--  README_chocula.md   7
2 files changed, 48 insertions(+), 16 deletions(-)
diff --git a/README.md b/README.md
index 110a43c..f931ec7 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,24 @@
Chocula is a python script to parse and merge journal-level metadata from
various sources into a consistent sqlite3 database file for analysis.
-See `chocula_schema.sql` for the output schema.
+## Quickstart
+
+You need `python3`, `pipenv`, and `sqlite3` installed.
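+
+One way to get these on Debian/Ubuntu (package names may vary by distro; installing `pipenv` via pip is one option):
+
+ sudo apt install python3 python3-pip sqlite3
+ pip3 install pipenv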
+
+First fetch datasets:
+
+ cd data
+ ./fetch.sh
+ cd ..
+
+Then re-generate the entire sqlite3 database from scratch:
+
+ pipenv shell
+ ./chocula.py everything
+
+Now you can explore the database; see `chocula_schema.sql` for the output schema.
+
+ sqlite3 chocula.sqlite
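+
+For example, a quick sanity check (the `journal` table name is assumed from `chocula_schema.sql`; adjust if the schema differs):
+
+ sqlite3 chocula.sqlite 'SELECT COUNT(*) FROM journal;'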
## History / Name
@@ -17,6 +34,7 @@ The name "chocula" comes from a half-baked pun on Count Chocula... something
something counting, serials, cereal.
[Read more about Count Chocula](https://teamyacht.com/ernstchoukula.com/Ernst-Choukula.html).
+
## ISSN-L Munging
Unfortunately, there seem to be plenty of legitimate ISSNs that don't end up in
@@ -36,7 +54,7 @@ filters out "unknown" ISSN-Ls unless they are coming from existing fatcat
entities.
-## Sources (out of date)
+## Sources
The `./data/fetch.sh` script will fetch mirrored snapshots of all these
datasets.
@@ -88,18 +106,39 @@ In order of precedence (first higher than later):
- IA homepage crawl attempts
The SHERPA/ROMEO content comes from the list helpfully munged by moreo.info.
+UPDATE: this site is now defunct (404).
General form here is to build a huge python dict in memory, keyed by the
ISSN-L, then write it out to disk as JSON. Then the journal-metadata importer
takes a subset of fields and inserts them into fatcat. Lastly, the elasticsearch
transformer takes a subset/combination of these fields.
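+
+For intuition, the intermediate JSON is shaped roughly like this (keys and field names here are illustrative, not the exact schema):
+
+ {"1234-5678": {"name": "Example Journal", "publisher": "Example Press"}}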
-## Python Helpers/Libraries
+## Fatcat Container Counts
+
+Generate a list of ISSN-L identifiers, fetch stats for each from the fatcat web pseudo-API, and write the results to JSON:
+
+ cat container_export.json | jq .issnl -r | sort -u > container_issnl.tsv
+ cat container_issnl.tsv | parallel -j10 curl -s 'https://fatcat.wiki/container/issnl/{}/stats.json' > container_stats.json
+
+Then load into chocula and recalculate stats:
+
+ pipenv shell
+ ./chocula.py load_fatcat_stats container_stats.json
+ ./chocula.py summarize
+
+ # also, upload stats to an IA item, update fetch.sh and chocula.py variables
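+ # e.g., using the internetarchive CLI (the item identifier below is hypothetical):
+ # ia upload chocula-sources-snapshot container_stats.json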
+
+## Journal Homepage Crawl Status
+
+The `check_issn_urls.py` script tries crawling journal homepages to see if they
+are "live" on the web. To regenerate these stats:
+
+ # assuming you have a fresh database
+ pipenv shell
+ ./chocula.py export_urls | shuf > urls_to_crawl.tsv
+ parallel -j10 --bar --pipepart -a urls_to_crawl.tsv ./check_issn_urls.py > url_status.json
+ ./chocula.py update_url_status url_status.json
+ ./chocula.py summarize
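+
+To spot-check the output (each line of `url_status.json` should be a JSON object; exact field names may vary):
+
+ head -n5 url_status.json | jq .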
-- ftfy
-- pycountry
+You might also want to upload the results to an IA item at this point.
-Debian:
-
- sudo apt install python3-pycountry
- sudo pip3 install ftfy
diff --git a/README_chocula.md b/README_chocula.md
deleted file mode 100644
index 12d695e..0000000
--- a/README_chocula.md
+++ /dev/null
@@ -1,7 +0,0 @@
-
-## Fatcat Container Counts
-
- cat container_export.json | jq .issnl -r | sort -u > container_issnl.tsv
- cat container_issnl.tsv | parallel -j20 curl -s 'https://fatcat.wiki/container/issnl/{}/stats.json' > container_stats.json
-
-Takes... more than 5 minutes but less than an hour.