Diffstat (limited to 'skate/README.md')
-rw-r--r-- skate/README.md | 23
1 file changed, 23 insertions, 0 deletions
diff --git a/skate/README.md b/skate/README.md
new file mode 100644
index 0000000..2892190
--- /dev/null
+++ b/skate/README.md
@@ -0,0 +1,23 @@
+# skate
+
+Key extractors and zipping tools.
+
+Goal: make key extraction and comparisons fast for billions of records on a
+single machine to support deduplication work for [fatcat](https://fatcat.wiki)
+metadata.
+
+## Problem
+
+Handling a TB of JSON and billions of documents, especially for the following
+workflow:
+
+* deriving a key from a document
+* sorting documents by that key
+* clustering documents and verifying them within clusters
+
+The main use case is match candidate generation and verification for fuzzy
+matching, especially for building a citation graph dataset from
+[fatcat](https://fatcat.wiki).
+
+![](static/two_cluster_synopsis.png)
+
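
The three steps listed in the README's Problem section (derive a key, sort by it, cluster and verify) can be illustrated with a minimal sketch. This is not code from skate. The names `title_key`, `verify`, `clusters`, and `verified_pairs`, the `id`/`title`/`year` fields, and the matching rules are all hypothetical and exist only to show the shape of the pipeline on newline-delimited JSON.

```python
import itertools
import json
import re


def title_key(doc):
    """Hypothetical key: lowercased title with non-alphanumerics removed."""
    return re.sub(r"[^a-z0-9]", "", doc.get("title", "").lower())


def verify(a, b):
    """Hypothetical check: two candidates match if their years agree."""
    return a.get("year") == b.get("year")


def clusters(lines, key=title_key):
    """Derive a key per JSON line, sort by it, and group equal keys."""
    docs = sorted((json.loads(line) for line in lines), key=key)
    for k, group in itertools.groupby(docs, key=key):
        yield k, list(group)


def verified_pairs(lines):
    """Yield (id, id) pairs from the same cluster that pass verification."""
    for _, group in clusters(lines):
        for a, b in itertools.combinations(group, 2):
            if verify(a, b):
                yield a["id"], b["id"]
```

The in-memory `sorted` call is a stand-in only: at the scale of billions of records, the sort step would have to run out of core, which is why sorting and grouping are kept as separate stages here.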