Diffstat (limited to 'skate')
-rw-r--r-- skate/README.md 84
1 files changed, 44 insertions, 40 deletions
diff --git a/skate/README.md b/skate/README.md
index 11f294b..8c05c67 100644
--- a/skate/README.md
+++ b/skate/README.md
@@ -1,35 +1,48 @@
# skate
-This suite of command line tools have been written for various parts of the
-citation graph pipeline.
+This is a small library and suite of command line tools for generating a
+citation graph.
+
+## Why?
Python was a bit too slow, even when parallelized, e.g. for generating clusters
of similar documents or to do verification. An option for the future would be
to resort to [Cython](https://cython.org/). Parts of
-[fuzzycat](https://git.archive.org/webgroup/fuzzycat) has been ported to Go for
-performance.
+[fuzzycat](https://git.archive.org/webgroup/fuzzycat) have been ported into this
+project for performance.
![](static/zipkey.png)
-## Tools
+## Core Utils
+
+* `skate-derive-key`, `skate-map`
+* `skate-cluster`
+* `skate-verify-*`
-### skate-wikipedia-doi
-TSV (page title, DOI, doc) from wikipedia refs.
+The `skate-derive-key` tool derives a key from release entity JSON documents.
```
-$ parquet-tools cat --json minimal_dataset.parquet | skate-wikipedia-doi
-Rational point 10.1515/crll.1988.386.32 {"type_of_citation" ...
-Cubic surface 10.2140/ant.2007.1.393 {"type_of_citation" ...
+$ skate-derive-key < release_entities.jsonlines > docs.tsv
+```
+
+Result will be a three column TSV (ident, key, doc).
+
```
+---- ident --------------- ---- key --------- ---- doc ----------
-### skate-bref-id
+4lzgf5wzljcptlebhyobccj7ru 2568diamagneticsus {"abstracts":[],...
+```
-Temporary helper to add a key to a biblioref document.
+After this step:
-### skate-cluster
+* sort by key, e.g. `LC_ALL=C sort -k2,2 -S 35% --parallel 6 --compress-program pzstd ...`
+* cluster, e.g. `skate-cluster ...`
-Converts a sorted key output into a jsonlines clusters.
+----
+
+The `skate-cluster` tool converts sorted key output into jsonlines
+clusters.
For example, this:
@@ -42,46 +55,37 @@ would turn into (a single line containing all docs with the same key).
A single line cluster is easier to parallelize (e.g. for verification, etc.).
-### skate-derive-key
+----
-skate-derive-key derives a key from release entity JSON documents.
+The `skate-verify-*` tools run various matching and verification algorithms.
-```
-$ skate-derive-key < release_entities.jsonlines > docs.tsv
-```
+## Extra
-Result will be a three column TSV (ident, key, doc).
+* skate-wikipedia-doi
-```
----- ident --------------- ---- key --------- ---- doc ----------
+> TSV (page title, DOI, doc) from wikipedia refs.
-4lzgf5wzljcptlebhyobccj7ru 2568diamagneticsus {"abstracts":[],...
+```
+$ parquet-tools cat --json minimal_dataset.parquet | skate-wikipedia-doi
+Rational point 10.1515/crll.1988.386.32 {"type_of_citation" ...
+Cubic surface 10.2140/ant.2007.1.393 {"type_of_citation" ...
```
-After this step:
-
-* sort by key, e.g. `LC_ALL=C sort -k2,2 -S 35% --parallel 6 --compress-program pzstd ...`
-* cluster, e.g. `skate-cluster ...`
-
-### skate-from-unstructured
-
-Takes a refs file and plucks out identifiers from unstructured field.
-
-### skate-ref-to-release
+* skate-bref-id
-Converts a ref document to a release. Part of first run, merging refs and releases.
+> Temporary helper to add a key to a biblioref document.
-### skate-to-doi
+* skate-from-unstructured
-Sanitize DOI in tabular file.
+> Takes a refs file and plucks out identifiers from the unstructured field.
-### skate-verify
+* skate-ref-to-release
-Run various matching and verification algorithms.
+> Converts a ref document to a release. Part of first run, merging refs and releases.
-### skate-map
+* skate-to-doi
-A more generic version of derive key.
+> Sanitizes DOIs in a tabular file.
## Misc