aboutsummaryrefslogtreecommitdiffstats
path: root/skate/notes/misc.md
diff options
context:
space:
mode:
Diffstat (limited to 'skate/notes/misc.md')
-rw-r--r--skate/notes/misc.md123
1 files changed, 123 insertions, 0 deletions
diff --git a/skate/notes/misc.md b/skate/notes/misc.md
new file mode 100644
index 0000000..79ccd39
--- /dev/null
+++ b/skate/notes/misc.md
@@ -0,0 +1,123 @@
+## Transformation
+
+We take jsonlines as input and extract id and derive the key. The resulting
+file will be a TSV of the shape:
+
+```
+ID KEY DOC
+```
+
+The key will be sorted (optionally, but typical for the use case).
+
+## Why an extra command?
+
+We had a python program for this, which we parallelized with the great [GNU
+parallel](https://www.gnu.org/software/parallel/) - however, when sharding the
+input with parallel the program worked on each chunk; hence probably miss
+clusters (not a problem of parallel, but our code, but still).
+
+## Usage
+
+```
+$ skate-derive-key < release_entities.jsonl | sort -k2,2 | skate-cluster > cluster.jsonl
+```
+
+A few options:
+
+```
+$ skate-derive-key -h
+Usage of skate-derive-key:
+ -b int
+ batch size (default 50000)
+ -f string
+ key function name, other: title, tnorm, tnysi (default "tsand")
+ -verbose
+ show progress
+ -w int
+ number of workers (default 8)
+```
+
+Clusters are json lines;
+
+* a single string as key `k`
+* a list of documents as values `v`
+
+The reason to include the complete documents is performance - for simplicity
+and (typically) sequential reads, a "file" seems to be a good option.
+
+```json
+{
+ "k": "植字手引",
+ "v": [
+ {
+ "abstracts": [],
+ "refs": [],
+ "contribs": [
+ {
+ "index": 0,
+ "raw_name": "大久保, 猛雄",
+ "given_name": "大久保, 猛雄",
+ "role": "author"
+ }
+ ],
+ "language": "ja",
+ "publisher": "広島植字研究会",
+ "ext_ids": {
+ "doi": "10.11501/1189671"
+ },
+ "release_year": 1929,
+ "release_stage": "published",
+ "release_type": "article-journal",
+ "webcaptures": [],
+ "filesets": [],
+ "files": [],
+ "work_id": "aaaab7poljf25dg4322ebsgism",
+ "title": "植字手引",
+ "state": "active",
+ "ident": "bc5mykteevcy3masrst3zjqgwq",
+ "revision": "97846ea8-41e5-40aa-9d41-e8c4b45f67e4",
+ "extra": {
+ "jalc": {}
+ }
+ }
+ ]
+}
+```
+
+Options:
+
+```
+$ skate-cluster -h
+Usage of skate-cluster:
+ -d int
+ which column contains the doc (default 3)
+ -k int
+ which column contains the key (one based) (default 2)
+```
+
+## Performance notes
+
+* key extraction with parallel jsoniter at about 130MB/s
+* having pipes in Go, on the shell or not at all seems to make little difference
+* having a large sort buffer is key, then using pipes, the default is 1K
+
+Note: need to debug performance at some point; e.g.
+
+```
+$ zstdcat -T0 refs_titles.tsv.zst | TMPDIR=/fast/tmp LC_ALL=C sort -S20% | \
+ LC_ALL=C uniq -c | zstd -c9 > refs_titles_unique.tsv.zst
+```
+
+takes 46min, and we can iterate of 2-5M lines/s.
+
+## Misc
+
+The `skate-ref-to-release` command is a simple one-off schema converter (mostly
+decode and encode), which runs over ~1.7B docs in 81min - about 349794 docs/s.
+
+The `skate-verify` command is a port of the fuzzycat.verify (Python)
+implementation; it can run 70K verifications/s; e.g. when running over refs, we
+can verify 1M clusters in less than 3min (and a full 40M set in less than 2h;
+that's 25x faster than the Python/Parallel version).
+
+![](static/skate.png)