From 09a7e8c9d013f13a1aa1ef4e9b7f397647b79967 Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Sun, 21 Mar 2021 01:17:38 +0100
Subject: initial import of skate

---
 skate/notes/misc.md | 123 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 123 insertions(+)
 create mode 100644 skate/notes/misc.md

(limited to 'skate/notes')

diff --git a/skate/notes/misc.md b/skate/notes/misc.md
new file mode 100644
index 0000000..79ccd39
--- /dev/null
+++ b/skate/notes/misc.md
@@ -0,0 +1,123 @@
+## Transformation
+
+We take JSON lines as input, extract the ID and derive the key. The resulting
+file will be a TSV of the shape:
+
+```
+ID    KEY    DOC
+```
+
+The output is then sorted by key (optional, but typical for the use case).
+
+## Why an extra command?
+
+We had a Python program for this, which we parallelized with the great [GNU
+parallel](https://www.gnu.org/software/parallel/). However, when sharding the
+input with parallel, the program worked on each chunk separately and hence
+probably missed clusters spanning chunks (a problem of our code, not of parallel).
+
+## Usage
+
+```
+$ skate-derive-key < release_entities.jsonl | sort -k2,2 | skate-cluster > cluster.jsonl
+```
+
+A few options:
+
+```
+$ skate-derive-key -h
+Usage of skate-derive-key:
+  -b int
+        batch size (default 50000)
+  -f string
+        key function name, other: title, tnorm, tnysi (default "tsand")
+  -verbose
+        show progress
+  -w int
+        number of workers (default 8)
+```
+
+Clusters are JSON lines, each with:
+
+* a single string as key `k`
+* a list of documents as values `v`
+
+The reason to include the complete documents is performance: for simplicity
+and (typically) sequential reads, a "file" seems to be a good option.
+
+```json
+{
+  "k": "植字手引",
+  "v": [
+    {
+      "abstracts": [],
+      "refs": [],
+      "contribs": [
+        {
+          "index": 0,
+          "raw_name": "大久保, 猛雄",
+          "given_name": "大久保, 猛雄",
+          "role": "author"
+        }
+      ],
+      "language": "ja",
+      "publisher": "広島植字研究会",
+      "ext_ids": {
+        "doi": "10.11501/1189671"
+      },
+      "release_year": 1929,
+      "release_stage": "published",
+      "release_type": "article-journal",
+      "webcaptures": [],
+      "filesets": [],
+      "files": [],
+      "work_id": "aaaab7poljf25dg4322ebsgism",
+      "title": "植字手引",
+      "state": "active",
+      "ident": "bc5mykteevcy3masrst3zjqgwq",
+      "revision": "97846ea8-41e5-40aa-9d41-e8c4b45f67e4",
+      "extra": {
+        "jalc": {}
+      }
+    }
+  ]
+}
+```
+
+Options:
+
+```
+$ skate-cluster -h
+Usage of skate-cluster:
+  -d int
+        which column contains the doc (default 3)
+  -k int
+        which column contains the key (one based) (default 2)
+```
+
+## Performance notes
+
+* key extraction with parallel jsoniter at about 130MB/s
+* having pipes in Go, on the shell or not at all seems to make little difference
+* having a large sort buffer is key, then using pipes, the default is 1K
+
+Note: need to debug performance at some point; e.g.
+
+```
+$ zstdcat -T0 refs_titles.tsv.zst | TMPDIR=/fast/tmp LC_ALL=C sort -S20% | \
+    LC_ALL=C uniq -c | zstd -c9 > refs_titles_unique.tsv.zst
+```
+
+takes 46min, and we can iterate over 2-5M lines/s.
+
+## Misc
+
+The `skate-ref-to-release` command is a simple one-off schema converter (mostly
+decode and encode), which runs over ~1.7B docs in 81min - about 349794 docs/s.
+
+The `skate-verify` command is a port of the fuzzycat.verify (Python)
+implementation; it can run 70K verifications/s; e.g. when running over refs, we
+can verify 1M clusters in less than 3min (and a full 40M set in less than 2h;
+that's 25x faster than the Python/Parallel version).
+
+![](static/skate.png)
--
cgit v1.2.3