Diffstat (limited to 'README.md')
-rw-r--r-- | README.md | 10 |
1 files changed, 7 insertions, 3 deletions
@@ -2,10 +2,11 @@

 Scholarly citation graph related code; maintained by
 [martin@archive.org](mailto:martin@archive.org); multiple subprojects to keep
-all relevant code close:
+all relevant code close.

-* python: mostly luigi tasks (using [shiv](https://github.com/linkedin/shiv) for single-file deployments)
-* skate: various Go command line tools (wrapped in a deb packaged)
+* python: mostly [luigi](https://github.com/spotify/luigi) tasks (using
+  [shiv](https://github.com/linkedin/shiv) for single-file deployments)
+* skate: various Go command line tools (packaged as deb)

 Context: [fatcat](https://fatcat.wiki), "Mellon Grant" (20/21).

@@ -20,6 +21,9 @@ We use informal, internal versioning, currently v2, next will be v3.
 > As of v2, we have linkage between fatcat release entities by doi, pmid, pmcid, arxiv.

 * [ ] URLs in corpus linked to best possible timestamp (GWB)
+
+> CDX API probably good for sampling; we'll need to tap into `/user/wmdata2/cdx-all-index/` - (note: try pyspark)
+
 * [ ] Harvest all URLs in citation corpus (maybe do a sample first)

 > A seed-list (from refs; not from the full-text) is done; need to prepare a crawl and lookups in GWB.
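
The added note suggests the CDX API for sampling. A minimal sketch of what that could look like, assuming the public Wayback CDX endpoint and its `url`, `output`, `limit`, `from` and `to` query parameters; the helper names `cdx_query_url` and `parse_cdx_json` are hypothetical and not part of this repository. The sketch only builds the query URL and parses a JSON response, so the HTTP request itself is left to the caller.

```python
import json
from urllib.parse import urlencode

# Assumption: the public CDX endpoint; the internal index path from the
# note (/user/wmdata2/cdx-all-index/) is not touched by this sketch.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(url, limit=5, from_ts=None, to_ts=None):
    """Build a CDX query URL for captures of `url` (hypothetical helper)."""
    params = {"url": url, "output": "json", "limit": limit}
    if from_ts:
        params["from"] = from_ts  # timestamp prefix, e.g. "2020"
    if to_ts:
        params["to"] = to_ts
    return CDX_ENDPOINT + "?" + urlencode(params)


def parse_cdx_json(body):
    """Turn a JSON CDX response into a list of dicts.

    With output=json the first row holds the field names and the
    remaining rows hold one capture each.
    """
    rows = json.loads(body)
    if not rows:
        return []
    header, records = rows[0], rows[1:]
    return [dict(zip(header, rec)) for rec in records]
```

For a sample of URLs from the seed list, one would call `cdx_query_url` per URL, fetch the result with any HTTP client, and pass the response body to `parse_cdx_json` to pick a capture timestamp.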