From e34b9c285303cb1e0b98b9a7cc1f65c0c2b3c20c Mon Sep 17 00:00:00 2001
From: Martin Czygan <martin.czygan@gmail.com>
Date: Fri, 2 Apr 2021 14:52:34 +0200
Subject: update README

---
 README.md | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 532eda2..d76cb49 100644
--- a/README.md
+++ b/README.md
@@ -2,10 +2,11 @@
 
 Scholarly citation graph related code; maintained by
 [martin@archive.org](mailto:martin@archive.org); multiple subprojects to keep
-all relevant code close:
+all relevant code close.
 
-* python: mostly luigi tasks (using [shiv](https://github.com/linkedin/shiv) for single-file deployments)
-* skate: various Go command line tools (wrapped in a deb packaged)
+* python: mostly [luigi](https://github.com/spotify/luigi) tasks (using
+  [shiv](https://github.com/linkedin/shiv) for single-file deployments)
+* skate: various Go command line tools (packaged as deb)
 
 Context: [fatcat](https://fatcat.wiki), "Mellon Grant" (20/21).
 
@@ -20,6 +21,9 @@ We use informal, internal versioning, currently v2, next will be v3.
 > As of v2, we have linkage between fatcat release entities by doi, pmid, pmcid, arxiv.
 
 * [ ] URLs in corpus linked to best possible timestamp (GWB)
+
+> CDX API probably good for sampling; we'll need to tap into `/user/wmdata2/cdx-all-index/` (note: try pyspark)
+
 * [ ] Harvest all URLs in citation corpus (maybe do a sample first)
 
 > A seed-list (from refs; not from the full-text) is done; need to prepare a crawl and lookups in GWB.
-- 
cgit v1.2.3