1 files changed, 123 insertions, 0 deletions
diff --git a/skate/notes/misc.md b/skate/notes/misc.md
new file mode 100644
index 0000000..79ccd39
--- /dev/null
+++ b/skate/notes/misc.md
@@ -0,0 +1,123 @@
+## Transformation
+
+We take jsonlines as input and extract id and derive the key. The resulting
+file will be a TSV of the shape:
+
+```
+ID    KEY    DOC
+```
+
+The key will be sorted (optionally, but typical for the use case).
+
+## Why an extra command?
+
+We had a python program for this, which we parallelized with the great [GNU
+parallel](https://www.gnu.org/software/parallel/) - however, when sharding the
+input with parallel the program worked on each chunk; hence probably miss
+clusters (not a problem of parallel, but our code, but still).
+
+## Usage
+
+```
+$ skate-derive-key < release_entities.jsonl | sort -k2,2 | skate-cluster > cluster.jsonl
+```
+
+A few options:
+
+```
+$ skate-derive-key -h
+Usage of skate-derive-key:
+  -b int
+        batch size (default 50000)
+  -f string
+        key function name, other: title, tnorm, tnysi (default "tsand")
+  -verbose
+        show progress
+  -w int
+        number of workers (default 8)
+```
+
+Clusters are json lines;
+
+* a single string as key `k`
+* a list of documents as values `v`
+
+The reason to include the complete documents is performance - for simplicity
+and (typically) sequential reads, a "file" seems to be a good option.
+
+```json
+{
+  "k": "植字手引",
+  "v": [
+    {
+      "abstracts": [],
+      "refs": [],
+      "contribs": [
+        {
+          "index": 0,
+          "raw_name": "大久保, 猛雄",
+          "given_name": "大久保, 猛雄",
+          "role": "author"
+        }
+      ],
+      "language": "ja",
+      "publisher": "広島植字研究会",
+      "ext_ids": {
+        "doi": "10.11501/1189671"
+      },
+      "release_year": 1929,
+      "release_stage": "published",
+      "release_type": "article-journal",
+      "webcaptures": [],
+      "filesets": [],
+      "files": [],
+      "work_id": "aaaab7poljf25dg4322ebsgism",
+      "title": "植字手引",
+      "state": "active",
+      "ident": "bc5mykteevcy3masrst3zjqgwq",
+      "revision": "97846ea8-41e5-40aa-9d41-e8c4b45f67e4",
+      "extra": {
+        "jalc": {}
+      }
+    }
+  ]
+}
+```
+
+Options:
+
+```
+$ skate-cluster -h
+Usage of skate-cluster:
+  -d int
+        which column contains the doc (default 3)
+  -k int
+        which column contains the key (one based) (default 2)
+```
+
+## Performance notes
+
+* key extraction with parallel jsoniter at about 130MB/s
+* having pipes in Go, on the shell or not at all seems to make little difference
+* having a large sort buffer is key, then using pipes, the default is 1K
+
+Note: need to debug performance at some point; e.g.
+
+```
+$ zstdcat -T0 refs_titles.tsv.zst | TMPDIR=/fast/tmp LC_ALL=C sort -S20% | \
+    LC_ALL=C uniq -c | zstd -c9 > refs_titles_unique.tsv.zst
+```
+
+takes 46min, and we can iterate of 2-5M lines/s.
+
+## Misc
+
+The `skate-ref-to-release` command is a simple one-off schema converter (mostly
+decode and encode), which runs over ~1.7B docs in 81min - about 349794 docs/s.
+
+The `skate-verify` command is a port of the fuzzycat.verify (Python)
+implementation; it can run 70K verifications/s; e.g. when running over refs, we
+can verify 1M clusters in less than 3min (and a full 40M set in less than 2h;
+that's 25x faster than the Python/Parallel version).
+
+![](static/skate.png)