-rw-r--r-- | skate/README.md | 44 |
1 files changed, 3 insertions, 41 deletions
diff --git a/skate/README.md b/skate/README.md
index 5501196..a055f57 100644
--- a/skate/README.md
+++ b/skate/README.md
@@ -40,48 +40,10 @@ WIP: ...
 
 ## Core Utils
 
-* `skate-derive-key`, will be: `skate-map`
-* `skate-cluster`
-* `skate-verify-*`
+* `skate-map`
+* `skate-reduce`
 
-The `skate-derive-key` tool derives a key from release entity JSON documents.
-
-```
-$ skate-derive-key < release_entities.jsonlines > docs.tsv
-```
-
-Result will be a three column TSV (ident, key, doc).
-
-```
----- ident --------------- ---- key --------- ---- doc ----------
-
-4lzgf5wzljcptlebhyobccj7ru 2568diamagneticsus {"abstracts":[],...
-```
-
-After this step:
-
-* sort by key, e.g. `LC_ALL=C sort -k2,2 -S 35% --parallel 6 --compress-program pzstd ...`
-* cluster, e.g. `skate-cluster ...`
-
-----
-
-The `skate-cluster` tool converts a sorted key output into a jsonlines
-clusters.
-
-For example, this:
-
-    id123 somekey123 {"a":"b", ...}
-    id391 somekey123 {"x":"y", ...}
-
-would turn into (a single line containing all docs with the same key).
-
-    {"k": "somekey123", "v": [{"a":"b", ...},{"x":"y",...}]}
-
-A single line cluster is easier to parallelize (e.g. for verification, etc.).
-
-----
-
-The `skate-verify-*` tools run various matching and verification algorithms.
+The `skate-map` extract various keys from datasets, `skate-reduce` runs various matching and verification algorithms.
 
 ## Extra
 
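
For reviewers who want a concrete picture of the clustering step described in the removed README text, here is a minimal Python sketch. It is not the skate implementation: the function name `cluster_sorted_tsv` is hypothetical, and the sketch assumes the input is tab-separated (ident, key, doc), already sorted by the key column, with a complete JSON document in the third column.

```python
import itertools
import json


def cluster_sorted_tsv(lines):
    """Group (ident, key, doc) rows that share a key into one JSON line each.

    Assumption: rows are tab-separated and already sorted by key (column 2),
    so equal keys are adjacent and a single streaming pass is enough.
    """
    # Split into at most three fields so tabs inside the doc are preserved.
    rows = (line.rstrip("\n").split("\t", 2) for line in lines if line.strip())
    for key, group in itertools.groupby(rows, key=lambda row: row[1]):
        docs = [json.loads(row[2]) for row in group]
        # Mirrors the {"k": ..., "v": [...]} shape shown in the removed text.
        yield json.dumps({"k": key, "v": docs})


if __name__ == "__main__":
    sample = [
        'id123\tsomekey123\t{"a": "b"}\n',
        'id391\tsomekey123\t{"x": "y"}\n',
        'id777\totherkey\t{"c": "d"}\n',
    ]
    for cluster in cluster_sorted_tsv(sample):
        print(cluster)
```

Because `itertools.groupby` only merges adjacent rows, the sketch depends on the preceding sort by key; unsorted input would yield several partial clusters for the same key.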