-rw-r--r-- skate/README.md 44
1 files changed, 3 insertions, 41 deletions
diff --git a/skate/README.md b/skate/README.md
index 5501196..a055f57 100644
--- a/skate/README.md
+++ b/skate/README.md
@@ -40,48 +40,10 @@ WIP: ...
## Core Utils
-* `skate-derive-key`, will be: `skate-map`
-* `skate-cluster`
-* `skate-verify-*`
+* `skate-map`
+* `skate-reduce`
-The `skate-derive-key` tool derives a key from release entity JSON documents.
-
-```
-$ skate-derive-key < release_entities.jsonlines > docs.tsv
-```
-
-Result will be a three column TSV (ident, key, doc).
-
-```
----- ident --------------- ---- key --------- ---- doc ----------
-
-4lzgf5wzljcptlebhyobccj7ru 2568diamagneticsus {"abstracts":[],...
-```
-
-After this step:
-
-* sort by key, e.g. `LC_ALL=C sort -k2,2 -S 35% --parallel 6 --compress-program pzstd ...`
-* cluster, e.g. `skate-cluster ...`
-
-----
-
-The `skate-cluster` tool converts a sorted key output into a jsonlines
-clusters.
-
-For example, this:
-
- id123 somekey123 {"a":"b", ...}
- id391 somekey123 {"x":"y", ...}
-
-would turn into (a single line containing all docs with the same key).
-
- {"k": "somekey123", "v": [{"a":"b", ...},{"x":"y",...}]}
-
-A single line cluster is easier to parallelize (e.g. for verification, etc.).
-
-----
-
-The `skate-verify-*` tools run various matching and verification algorithms.
+The `skate-map` tool extracts various keys from datasets; `skate-reduce` runs various matching and verification algorithms.
## Extra
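
For readers of this diff: the removed `skate-cluster` text describes turning key-sorted, three-column TSV (ident, key, doc) into one JSON line per key. A minimal Python sketch of that grouping step follows. It is not the project's actual implementation; the function name `cluster` is hypothetical, and it assumes tab-separated input that is already sorted by the key column.

```python
import itertools
import json


def cluster(lines):
    """Group key-sorted (ident, key, doc) TSV lines into one JSON line per key.

    Assumes the input is already sorted by the second column, as the
    removed README text requires (e.g. via `LC_ALL=C sort -k2,2`).
    """
    # Split into at most three fields so tabs inside the doc stay intact.
    rows = (line.rstrip("\n").split("\t", 2) for line in lines if line.strip())
    # groupby only merges adjacent rows, hence the sorted-input requirement.
    for key, group in itertools.groupby(rows, key=lambda row: row[1]):
        docs = [json.loads(doc) for _, _, doc in group]
        yield json.dumps({"k": key, "v": docs})


if __name__ == "__main__":
    # Illustrative sample rows, modeled on the README example.
    sample = [
        'id123\tsomekey123\t{"a": "b"}\n',
        'id391\tsomekey123\t{"x": "y"}\n',
        'id500\totherkey\t{"c": "d"}\n',
    ]
    for line in cluster(sample):
        print(line)
```

Because each cluster ends up on a single line, downstream steps such as verification can split the output by line and process clusters in parallel, which is the motivation the removed text gives.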