Diffstat (limited to 'skate/README.md')
-rw-r--r-- | skate/README.md | 23 |
1 files changed, 23 insertions, 0 deletions
diff --git a/skate/README.md b/skate/README.md
new file mode 100644
index 0000000..2892190
--- /dev/null
+++ b/skate/README.md
@@ -0,0 +1,23 @@
+# skate
+
+Key extractors and zipping tools.
+
+Goal: make key extraction and comparisons fast for billions of records on a
+single machine to support deduplication work for [fatcat](https://fatcat.wiki)
+metadata.
+
+## Problem
+
+Handling a TB of JSON and billions of documents, especially for the following
+use case:
+
+* deriving a key from a document
+* sorting documents by that key
+* clustering and verifying documents in clusters
+
+The main use case is match candidate generation and verification for fuzzy
+matching, especially for building a citation graph dataset from
+[fatcat](https://fatcat.wiki).
+
+![](static/two_cluster_synopsis.png)
+
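
Note on the README's three steps (derive a key, sort by key, cluster and verify): the following is a minimal Python sketch of that pipeline, not code from this commit. Everything in it is an assumption made for illustration: the function names `title_key`, `cluster_by_key`, `verify` and `verified_pairs`, the document fields `id`, `title` and `year`, and the toy rule that two documents match when their years agree. The in-memory sort stands in for whatever external sort a TB-scale input would need.

```python
import itertools
import json
import re


def title_key(doc):
    """Hypothetical key: lowercased title with everything except a-z and 0-9 removed."""
    return re.sub(r"[^a-z0-9]", "", doc.get("title", "").lower())


def cluster_by_key(lines, key_func=title_key):
    """Derive a key per JSON line, sort by key, and group equal keys into clusters."""
    keyed = [(key_func(doc), doc) for doc in map(json.loads, lines)]
    # In-memory sort; a stand-in for an external sort on large inputs.
    keyed.sort(key=lambda pair: pair[0])
    for key, group in itertools.groupby(keyed, key=lambda pair: pair[0]):
        yield key, [doc for _, doc in group]


def verify(a, b):
    """Toy verification rule (an assumption): candidates match if the years agree."""
    return a.get("year") == b.get("year")


def verified_pairs(clusters):
    """Compare documents only within a cluster, never across clusters."""
    for key, docs in clusters:
        for a, b in itertools.combinations(docs, 2):
            if verify(a, b):
                yield key, a["id"], b["id"]
```

The point of the key is to limit comparisons: only documents that share a key become match candidates, so the pairwise verification cost depends on cluster sizes rather than on the size of the whole dataset.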