author Martin Czygan <martin.czygan@gmail.com> 2020-10-18 20:25:53 +0200
committer Martin Czygan <martin.czygan@gmail.com> 2020-10-21 03:47:23 +0200
commit e33a0f359dd36284c31eb619c6eddd617ef3a779
tree ce1b240455c20673118e0ec9cbb3167f67a25980 /notes
parent 26aa121848d41860a398cac8b549531e5f21f03e
cluster variants
Diffstat (limited to 'notes')
-rw-r--r-- notes/Workflow.md 54
1 files changed, 54 insertions, 0 deletions
diff --git a/notes/Workflow.md b/notes/Workflow.md
new file mode 100644
index 0000000..abf0d76
--- /dev/null
+++ b/notes/Workflow.md
@@ -0,0 +1,54 @@
+# Workflow
+
+Split the problem in two: first find clusters, then examine the clusters (as
+proposed).
+
+## Finding clusters
+
+* group by raw exact title
+* group by lowercase title
+* group by slug title
+* group by ngram title and authors
+* group by ngram title (prefix, suffix) and authors
+* group by elasticsearch
+* group by DOI without vX suffix
+* group by soundex
+* group by a simhash over the record
+
+For performance, the feature (the grouping key) needs to be computed for each
+record in one pass; grouping then reduces to a sort on that key in a second pass.
+
+The output could be a TSV file, with method and then release identifiers.
+
+```
+rawt o3utonw5qzhddo7l4lmwptgeey nnpmnwln7be2zb5hd2qanq3r7q
+```
+
+Or jsonlines for a bit of structure.
+
+```
+{"m": "rawt", "c": ["o3utonw5qzhddo7l4lmwptgeey", "nnpmnwln7be2zb5hd2qanq3r7q"]}
+```
+
+```
+$ zstdcat -T0 release_export_expanded.json.zst | fuzzycat-cluster -g > clusters.json
+```
+
+### Performance considerations
+
+* faster JSON parsing: [orjson](https://github.com/ijl/orjson), [pysimdjson](https://github.com/TkTech/pysimdjson)
+
+
+## Examine cluster
+
+There will be several ways to examine a cluster as well.
+
+We need to fetch releases by identifier; this can be the full record or a
+partial record that has been cached somewhere.
+
+The input is then a list of releases and the output would be an equally sized or
+smaller cluster of releases which we assume represent the same record.
+
+Apart from that, there may be other relations, e.g. records that are not
+exactly the same thing but sit at some interval from each other, such as
+records that mostly differ in year?
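
A minimal sketch of the "Finding clusters" step from the notes above, covering only the first three grouping methods (raw, lowercase, and slug title). It is not the fuzzycat implementation. It assumes each release is a dict with `ident` and `title` fields; the method name `rawt` comes from the notes, while `lowt`, `slugt`, and all function names are hypothetical. Pass one computes the key per record, pass two sorts and groups adjacent equal keys, and each cluster is emitted in the jsonlines shape shown in the notes.

```python
import itertools
import json
import re


def raw_title(release):
    # Assumption: the title lives under "title"; missing titles become "".
    return release.get("title") or ""


def lower_title(release):
    return raw_title(release).lower()


def slug_title(release):
    # Crude slug: lowercase, keep only ASCII letters and digits.
    return re.sub(r"[^a-z0-9]+", "", lower_title(release))


# "rawt" is taken from the notes; "lowt" and "slugt" are made-up names.
KEY_FUNCS = {"rawt": raw_title, "lowt": lower_title, "slugt": slug_title}


def cluster(releases, method):
    """Yield {"m": method, "c": [idents]} for keys shared by 2+ releases."""
    key_func = KEY_FUNCS[method]
    # Pass 1: compute the grouping key for every record, skipping empty keys.
    pairs = [(key_func(r), r["ident"]) for r in releases]
    pairs = [(key, ident) for key, ident in pairs if key]
    # Pass 2: sort, so that grouping only has to compare adjacent keys.
    pairs.sort()
    for _, group in itertools.groupby(pairs, key=lambda pair: pair[0]):
        idents = [ident for _, ident in group]
        if len(idents) > 1:
            yield {"m": method, "c": idents}


def to_jsonlines(clusters):
    # One JSON document per line.
    return "\n".join(json.dumps(c) for c in clusters)


# Toy records for illustration only; not real release identifiers.
SAMPLE = [
    {"ident": "a1", "title": "Fuzzy Matching"},
    {"ident": "b2", "title": "fuzzy matching"},
    {"ident": "c3", "title": "Fuzzy-Matching!"},
    {"ident": "d4", "title": "Something else"},
]

if __name__ == "__main__":
    for method in KEY_FUNCS:
        out = to_jsonlines(cluster(SAMPLE, method))
        if out:
            print(out)
```

The sketch sorts in memory, which is only reasonable for small inputs; for a full release export the sort would more plausibly be done externally. Dropping single-member clusters is a choice made here, not something the notes specify.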