author Martin Czygan <martin.czygan@gmail.com> 2021-05-29 01:16:56 +0200
committer Martin Czygan <martin.czygan@gmail.com> 2021-05-29 01:16:56 +0200
commit 8e473663cd695bebea35105c7ac2201b82d09ae5
tree c1c96ae6afdd3c635b9cd4897983fd1550ec3741 /skate
parent cebc33d54a74fb0e0aaa0121c54ae78c97341b22
cleanup README
Diffstat (limited to 'skate')
-rw-r--r-- skate/README.md 44
1 files changed, 3 insertions, 41 deletions
diff --git a/skate/README.md b/skate/README.md
index 5501196..a055f57 100644
--- a/skate/README.md
+++ b/skate/README.md
@@ -40,48 +40,10 @@ WIP: ...
## Core Utils
-* `skate-derive-key`, will be: `skate-map`
-* `skate-cluster`
-* `skate-verify-*`
+* `skate-map`
+* `skate-reduce`
-The `skate-derive-key` tool derives a key from release entity JSON documents.
-
-```
-$ skate-derive-key < release_entities.jsonlines > docs.tsv
-```
-
-Result will be a three column TSV (ident, key, doc).
-
-```
----- ident --------------- ---- key --------- ---- doc ----------
-
-4lzgf5wzljcptlebhyobccj7ru 2568diamagneticsus {"abstracts":[],...
-```
-
-After this step:
-
-* sort by key, e.g. `LC_ALL=C sort -k2,2 -S 35% --parallel 6 --compress-program pzstd ...`
-* cluster, e.g. `skate-cluster ...`
-
-----
-
-The `skate-cluster` tool converts a sorted key output into a jsonlines
-clusters.
-
-For example, this:
-
- id123 somekey123 {"a":"b", ...}
- id391 somekey123 {"x":"y", ...}
-
-would turn into (a single line containing all docs with the same key).
-
- {"k": "somekey123", "v": [{"a":"b", ...},{"x":"y",...}]}
-
-A single line cluster is easier to parallelize (e.g. for verification, etc.).
-
-----
-
-The `skate-verify-*` tools run various matching and verification algorithms.
+The `skate-map` extract various keys from datasets, `skate-reduce` runs various matching and verification algorithms.
## Extra
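
Note on the removed text: the deleted `skate-cluster` description says that key-sorted TSV lines (ident, key, doc) are folded into one JSON line per key, of the form `{"k": ..., "v": [...]}`. A minimal sketch of that grouping step is given here in Python for illustration only. It is not the project's implementation, and the function name `cluster` and the sample rows are hypothetical. It assumes the input is already sorted by the key column, as the removed text requires.

```python
import itertools
import json


def cluster(lines):
    """Group key-sorted (ident, key, doc) TSV lines into one dict per key.

    Assumes the input is already sorted by the second column; rows with the
    same key that are not adjacent would end up in separate clusters.
    """
    # Split into at most three fields so tabs inside the JSON doc survive.
    rows = (line.rstrip("\n").split("\t", 2) for line in lines if line.strip())
    for key, group in itertools.groupby(rows, key=lambda row: row[1]):
        yield {"k": key, "v": [json.loads(doc) for _, _, doc in group]}


if __name__ == "__main__":
    # Hypothetical sample rows, modeled on the example in the removed text.
    sample = [
        'id123\tsomekey123\t{"a": "b"}\n',
        'id391\tsomekey123\t{"x": "y"}\n',
        'id500\totherkey\t{"c": "d"}\n',
    ]
    for c in cluster(sample):
        print(json.dumps(c))
```

Because each cluster is a single self-contained line, downstream verification can be distributed line by line, which is the parallelization benefit the removed text mentions.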