From 6628731b1531435ceb4151ed87cf483ee3134119 Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Fri, 30 Apr 2021 18:34:00 +0200
Subject: wip: update README

---
 skate/README.md | 84 ++++++++++++++++++++++++++++++---------------------------
 1 file changed, 44 insertions(+), 40 deletions(-)

diff --git a/skate/README.md b/skate/README.md
index 11f294b..8c05c67 100644
--- a/skate/README.md
+++ b/skate/README.md
@@ -1,35 +1,48 @@
 # skate
 
-This suite of command line tools have been written for various parts of the
-citation graph pipeline.
+This a small library and suite of command line tools related to generating a
+citation graph.
+
+## Why?
 
 Python was a bit too slow, even when parallelized, e.g. for generating clusters
 of similar documents or to do verification. An option for the future would be
 to resort to [Cython](https://cython.org/). Parts of
-[fuzzycat](https://git.archive.org/webgroup/fuzzycat) has been ported to Go for
-performance.
+[fuzzycat](https://git.archive.org/webgroup/fuzzycat) has been ported into this
+project for performance.
 
 ![](static/zipkey.png)
 
-## Tools
+## Core Utils
+
+* `skate-derive-key`, `skate-map`
+* `skate-cluster`
+* `skate-verify-*`
 
-### skate-wikipedia-doi
 
-TSV (page title, DOI, doc) from wikipedia refs.
+The `skate-derive-key` tool derives a key from release entity JSON documents.
 
 ```
-$ parquet-tools cat --json minimal_dataset.parquet | skate-wikipedia-doi
-Rational point 10.1515/crll.1988.386.32 {"type_of_citation" ...
-Cubic surface 10.2140/ant.2007.1.393 {"type_of_citation" ...
+$ skate-derive-key < release_entities.jsonlines > docs.tsv
+```
+
+Result will be a three column TSV (ident, key, doc).
+
 ```
+---- ident --------------- ---- key --------- ---- doc ----------
 
-### skate-bref-id
+4lzgf5wzljcptlebhyobccj7ru 2568diamagneticsus {"abstracts":[],...
+```
 
-Temporary helper to add a key to a biblioref document.
+After this step:
 
-### skate-cluster
+* sort by key, e.g. `LC_ALL=C sort -k2,2 -S 35% --parallel 6 --compress-program pzstd ...`
+* cluster, e.g. `skate-cluster ...`
 
-Converts a sorted key output into a jsonlines clusters.
+----
+
+The `skate-cluster` tool converts a sorted key output into a jsonlines
+clusters.
 
 For example, this:
 
@@ -42,46 +55,37 @@ would turn into (a single line containing all docs with the same key).
 
 A single line cluster is easier to parallelize (e.g. for verification, etc.).
 
-### skate-derive-key
+----
 
-skate-derive-key derives a key from release entity JSON documents.
+The `skate-verify-*` tools run various matching and verification algorithms.
 
-```
-$ skate-derive-key < release_entities.jsonlines > docs.tsv
-```
+## Extra
 
-Result will be a three column TSV (ident, key, doc).
+* skate-wikipedia-doi
 
-```
----- ident --------------- ---- key --------- ---- doc ----------
+> TSV (page title, DOI, doc) from wikipedia refs.
 
-4lzgf5wzljcptlebhyobccj7ru 2568diamagneticsus {"abstracts":[],...
+```
+$ parquet-tools cat --json minimal_dataset.parquet | skate-wikipedia-doi
+Rational point 10.1515/crll.1988.386.32 {"type_of_citation" ...
+Cubic surface 10.2140/ant.2007.1.393 {"type_of_citation" ...
 ```
 
-After this step:
-
-* sort by key, e.g. `LC_ALL=C sort -k2,2 -S 35% --parallel 6 --compress-program pzstd ...`
-* cluster, e.g. `skate-cluster ...`
-
-### skate-from-unstructured
-
-Takes a refs file and plucks out identifiers from unstructured field.
-
-### skate-ref-to-release
+* skate-bref-id
 
-Converts a ref document to a release. Part of first run, merging refs and releases.
+> Temporary helper to add a key to a biblioref document.
 
-### skate-to-doi
+* skate-from-unstructured
 
-Sanitize DOI in tabular file.
+> Takes a refs file and plucks out identifiers from unstructured field.
 
-### skate-verify
+* skate-ref-to-release
 
-Run various matching and verification algorithms.
+> Converts a ref document to a release. Part of first run, merging refs and releases.
 
-### skate-map
+* skate-to-doi
 
-A more generic version of derive key.
+> Sanitize DOI in tabular file.
 
 ## Misc
 
-- 
cgit v1.2.3