author    Bryan Newbold <bnewbold@archive.org>  2019-12-24 13:38:07 -0800
committer Bryan Newbold <bnewbold@archive.org>  2019-12-24 13:38:07 -0800
commit    46e4b69c28f6132e3ae08a2e6e5bbb065458de28 (patch)
tree      bff65826873ed68adb997c445ce91c6b7e8df18b /README.md
parent    3232f9509404c75777f23d7272416d8de4a45789 (diff)
update README with better directions
Diffstat (limited to 'README.md')
-rw-r--r--  README.md | 57 ++++++++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 48 insertions(+), 9 deletions(-)
diff --git a/README.md b/README.md
index 110a43c..f931ec7 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,24 @@
Chocula is a python script to parse and merge journal-level metadata from
various sources into a consistent sqlite3 database file for analysis.
-See `chocula_schema.sql` for the output schema.
+## Quickstart
+
+You need `python3`, `pipenv`, and `sqlite3` installed.
+
+First fetch datasets:
+
+ cd data
+ ./fetch.sh
+ cd ..
+
+Then re-generate the entire sqlite3 database from scratch:
+
+ pipenv shell
+ ./chocula.py everything
+
+Now you can explore the database; see `chocula_schema.sql` for the output schema.
+
+ sqlite3 chocula.sqlite
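+
+You can also poke at the database from Python directly. A minimal sketch, assuming a `journal` table with `issnl` and `name` columns (check `chocula_schema.sql` for the actual schema):
+
+ import sqlite3
+
+ # Open the freshly generated database file.
+ conn = sqlite3.connect("chocula.sqlite")
+ # NOTE: table and column names below are assumptions; see chocula_schema.sql.
+ count = conn.execute("SELECT COUNT(*) FROM journal").fetchone()[0]
+ print(f"{count} journals in database")
+ for issnl, name in conn.execute("SELECT issnl, name FROM journal LIMIT 5"):
+     print(issnl, name)
+ conn.close()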
## History / Name
@@ -17,6 +34,7 @@ The name "chocula" comes from a half-baked pun on Count Chocula... something
something counting, serials, cereal.
[Read more about Count Chocula](https://teamyacht.com/ernstchoukula.com/Ernst-Choukula.html).
+
## ISSN-L Munging
Unfortunately, there seem to be plenty of legitimate ISSNs that don't end up in
@@ -36,7 +54,7 @@ filters out "unknown" ISSN-Ls unless they are coming from existing fatcat
entities.
-## Sources (out of date)
+## Sources
The `./data/fetch.sh` script will fetch mirrored snapshots of all these
datasets.
@@ -88,18 +106,39 @@ In order of precedence (first higher than later):
- IA homepage crawl attempts
The SHERPA/ROMEO content comes from the list helpfully munged by moreo.info.
+UPDATE: this site is now defunct (404).
General form here is to build a huge python dict in memory, keyed by the
ISSN-L, then write out to disk as JSON. Then the journal-metadata importer
takes a subset of fields and inserts to fatcat. Lastly, the elasticsearch
transformer takes a subset/combination of
-## Python Helpers/Libraries
+## Fatcat Container Counts
+
+Generate a list of ISSN-L identifiers, fetch stats for each from the fatcat web pseudo-API, and write the results to JSON.
+
+ cat container_export.json | jq .issnl -r | sort -u > container_issnl.tsv
+ cat container_issnl.tsv | parallel -j10 curl -s 'https://fatcat.wiki/container/issnl/{}/stats.json' > container_stats.json
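+
+The same fetch for a single ISSN-L, as a Python sketch (the endpoint URL is taken from the curl command above; uses the `requests` library):
+
+ import json
+ import requests
+
+ def fetch_container_stats(issnl):
+     # Same pseudo-API endpoint as the curl command above.
+     url = f"https://fatcat.wiki/container/issnl/{issnl}/stats.json"
+     resp = requests.get(url, timeout=30)
+     resp.raise_for_status()
+     return resp.json()
+
+ print(json.dumps(fetch_container_stats("2167-8359")))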
+
+Then load into chocula and recalculate stats:
+
+ pipenv shell
+ ./chocula.py load_fatcat_stats container_stats.json
+ ./chocula.py summarize
+
+ # also, upload stats to an IA item, update fetch.sh and chocula.py variables
+
+## Journal Homepage Crawl Status
+
+The `check_issn_urls.py` script tries crawling journal homepages to see if they
+are "live" on the web. To regenerate these stats:
+
+ # assuming you have a fresh database
+ pipenv shell
+ ./chocula.py export_urls | shuf > urls_to_crawl.tsv
+ parallel -j10 --bar --pipepart -a urls_to_crawl.tsv ./check_issn_urls.py > url_status.json
+ ./chocula.py update_url_status url_status.json
+ ./chocula.py summarize
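+
+Conceptually, each URL check boils down to something like this sketch (the actual `check_issn_urls.py` records more detail, and the output field names here are hypothetical):
+
+ import json
+ import requests
+
+ def check_url(url):
+     # Follow redirects and record the terminal status; treat network
+     # errors as a dead homepage. Field names are hypothetical.
+     try:
+         resp = requests.get(url, timeout=30, allow_redirects=True)
+         return {"url": url, "status_code": resp.status_code, "terminal_url": resp.url}
+     except requests.RequestException as e:
+         return {"url": url, "status_code": None, "error": str(e)}
+
+ print(json.dumps(check_url("https://peerj.com/")))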
-- ftfy
-- pycountry
+You might also want to upload the results to an IA item at this point.
-Debian:
-
- sudo apt install python3-pycountry
- sudo pip3 install ftfy