# skate

This suite of command line tools has been written for various parts of the
citation graph pipeline. Python was a bit too slow, even when parallelized,
e.g. for generating clusters of similar documents or for verification. An
option for the future would be to resort to [Cython](https://cython.org/).
Parts of [fuzzycat](https://git.archive.org/webgroup/fuzzycat) have been
ported to Go for performance.

![](static/zipkey.png)

## Tools

### skate-wikipedia-doi

TSV (page title, DOI, doc) from wikipedia refs.

```
$ parquet-tools cat --json minimal_dataset.parquet | skate-wikipedia-doi
Rational point    10.1515/crll.1988.386.32    {"type_of_citation" ...
Cubic surface     10.2140/ant.2007.1.393      {"type_of_citation" ...
```

### skate-bref-id

Temporary helper to add a key to a biblioref document.

### skate-cluster

Converts sorted key output into jsonlines clusters. For example, this:

    id123    somekey123    {"a":"b", ...}
    id391    somekey123    {"x":"y", ...}

would turn into (a single line containing all docs with the same key):

    {"k": "somekey123", "v": [{"a":"b", ...},{"x":"y",...}]}

A single line per cluster is easier to parallelize (e.g. for verification,
etc.).

### skate-derive-key

skate-derive-key derives a key from release entity JSON documents.

```
$ skate-derive-key < release_entities.jsonlines > docs.tsv
```

The result will be a three-column TSV (ident, key, doc).

```
---- ident ---------------  ---- key ---------  ---- doc ----------

4lzgf5wzljcptlebhyobccj7ru  2568diamagneticsus  {"abstracts":[],...
```

After this step:

* sort by key, e.g. `LC_ALL=C sort -k2,2 -S 35% --parallel 6 --compress-program pzstd ...`
* cluster, e.g. `skate-cluster ...`

A sketch of the full pipeline is included at the end of this README.

### skate-from-unstructured

Takes a refs file and plucks out identifiers from the unstructured field.

### skate-ref-to-release

Converts a ref document to a release. Part of the first run, merging refs and
releases.

### skate-to-doi

Sanitizes DOIs in a tabular file.

### skate-verify

Runs various matching and verification algorithms.

### skate-map

A more generic version of skate-derive-key.

## Misc

Handling a TB of JSON and billions of documents, especially for the following
use case:

* deriving a key from a document,
* sorting documents by (that) key,
* clustering and verifying documents in clusters.

The main use case is match candidate generation and verification for fuzzy
matching, especially for building a citation graph dataset from
[fatcat](https://fatcat.wiki).

![](static/two_cluster_synopsis.png)
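
## Example pipeline

A minimal sketch of how the steps above fit together (derive keys, sort by
key, cluster). Filenames are placeholders, and the `skate-cluster` invocation
via stdin/stdout is an assumption — the README only shows `skate-cluster ...`,
so check the tool's help output for the actual flags.

```
$ # 1. Derive an (ident, key, doc) TSV from release entities.
$ skate-derive-key < release_entities.jsonlines > docs.tsv

$ # 2. Sort by the key column (column 2); C locale for stable byte order,
$ #    pzstd to compress temporary files.
$ LC_ALL=C sort -k2,2 -S 35% --parallel 6 --compress-program pzstd docs.tsv > sorted.tsv

$ # 3. Turn runs of rows sharing a key into one JSON cluster per line
$ #    (stdin/stdout assumed here).
$ skate-cluster < sorted.tsv > clusters.jsonlines
```

The resulting clusters.jsonlines can then be handed to verification (e.g.
skate-verify), one independent cluster per line.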
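
Since each cluster is one JSON object per line with the shape
`{"k": ..., "v": [...]}`, plain shell tools suffice for quick sanity checks.
For example, a cluster size distribution with jq — a hypothetical check, not
part of skate itself:

```
$ # Tally cluster sizes: number of docs per cluster, most common sizes first.
$ jq -rc '.v | length' < clusters.jsonlines | sort -n | uniq -c | sort -rn | head
```

Very large clusters usually point to an overly generic key and are worth a
look before verification.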