# skate

The skate suite of command line tools has been written for various parts of the citation graph pipeline.

## Tools

### skate-wikipedia-doi

TSV (page title, DOI, doc) from wikipedia refs.

```
$ parquet-tools cat --json minimal_dataset.parquet | skate-wikipedia-doi
Rational point    10.1515/crll.1988.386.32    {"type_of_citation" ...
Cubic surface     10.2140/ant.2007.1.393      {"type_of_citation" ...
```

### skate-bref-id

Temporary helper to add a key to a biblioref document.

### skate-cluster

Converts sorted key output into JSON lines clusters. For example, this:

```
id123    somekey123    {"a":"b", ...}
id391    somekey123    {"x":"y", ...}
```

would turn into a single line containing all docs with the same key:

```
{"k": "somekey123", "v": [{"a":"b", ...},{"x":"y",...}]}
```

A single line per cluster is easier to parallelize over (e.g. for verification, etc.).

### skate-derive-key

skate-derive-key derives a key from release entity JSON documents.

```
$ skate-derive-key < release_entities.jsonlines > docs.tsv
```

The result will be a three column TSV (ident, key, doc).

```
---- ident ---------------  ---- key ---------  ---- doc ----------

4lzgf5wzljcptlebhyobccj7ru  2568diamagneticsus  {"abstracts":[],...
```

After this step:

* sort by key, e.g. `LC_ALL=C sort -k2,2 -S 35% --parallel 6 --compress-program pzstd ...`
* cluster, e.g. `skate-cluster ...`

### skate-from-unstructured

### skate-ref-to-release

### skate-to-doi

### skate-verify

Goal: make key extraction and comparisons fast for billions of records on a single machine to support deduplication work for [fatcat](https://fatcat.wiki) metadata.

## Problem

Handling a TB of JSON and billions of documents, especially for the following use cases:

* deriving a key from a document
* sorting documents by (that) key
* clustering and verifying documents in clusters

The main use case is match candidate generation and verification for fuzzy matching, especially for building a citation graph dataset from [fatcat](https://fatcat.wiki).

![](static/two_cluster_synopsis.png)
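The clustering step that skate-cluster performs on key-sorted TSV input can be sketched as follows. This is an illustrative Python sketch only, not the actual implementation; the function name `clusters` and the exact output shape beyond the `{"k": ..., "v": [...]}` example above are assumptions.

```python
import itertools
import json
import sys


def clusters(lines):
    """Group a key-sorted (ident, key, doc) TSV stream into cluster docs.

    Input must already be sorted by the key column (column 2), as produced
    by the sort step described above; groupby only merges adjacent rows.
    """
    rows = (line.rstrip("\n").split("\t", 2) for line in lines)
    for key, group in itertools.groupby(rows, key=lambda row: row[1]):
        # One output object per key, holding every doc that shares it.
        yield {"k": key, "v": [json.loads(doc) for _, _, doc in group]}


if __name__ == "__main__":
    # Read TSV on stdin, write one JSON cluster per line on stdout.
    for cluster in clusters(sys.stdin):
        print(json.dumps(cluster))
```

Because each cluster is emitted as a single JSON line, downstream verification can split the stream on newlines and process clusters independently in parallel.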
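The key derivation step (skate-derive-key) can be pictured with a sketch like the one below. The normalization shown here (lowercase, keep only ASCII letters and digits from the title) is a hypothetical example scheme; skate-derive-key's actual key functions may differ, and `title_key` and `to_tsv` are illustrative names, not part of the tool.

```python
import json
import string


def title_key(doc):
    """Derive a crude clustering key from a release entity's title.

    Hypothetical normalization: lowercase the title and drop everything
    that is not an ASCII letter or digit.
    """
    title = doc.get("title") or ""
    allowed = set(string.ascii_lowercase + string.digits)
    return "".join(c for c in title.lower() if c in allowed)


def to_tsv(lines):
    """Turn release entity JSON lines into (ident, key, doc) TSV rows."""
    for line in lines:
        doc = json.loads(line)
        yield "\t".join((doc.get("ident", ""), title_key(doc), line.strip()))
```

Rows produced this way can then be fed to the sort and cluster steps listed above.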