# skate

A library and suite of command line tools related to generating a [citation graph](https://en.wikipedia.org/wiki/Citation_graph).

> There is no standard format for the citations in bibliographies, and the
> record linkage of citations can be a time-consuming and complicated process.

## Background

Python was a bit too slow, even when parallelized (with GNU parallel), e.g. for generating clusters of similar documents or for verification. One option for the future would be to resort to [Cython](https://cython.org/). Parts of [fuzzycat](https://git.archive.org/webgroup/fuzzycat) have been ported into this project for performance (we saw a 25x speedup for certain tasks).

![](static/zipkey.png)

## Overview

First, generate a "sorted key file", for our purposes a TSV containing a key and the original document. Various mappers are implemented, and it is relatively easy to add another one.

```
$ skate-map -m ts < file.jsonl | sort -k1,1 > map.tsv
```

Repeat the mapping for any file you want to compare against the catalog. Then, decide which *reduce* mode is desired.

```
$ skate-reduce -r bref -f file.1 -g file.2
```

Depending on what the reducer does, it can generate a verification status or some export schema.

WIP: ...

## Core Utils

* `skate-map`
* `skate-reduce`

`skate-map` extracts various keys from datasets; `skate-reduce` runs various matching and verification algorithms.

## Extra

* `skate-wikipedia-doi`

> TSV (page title, DOI, doc) from wikipedia refs.

```
$ parquet-tools cat --json minimal_dataset.parquet | skate-wikipedia-doi
Rational point    10.1515/crll.1988.386.32    {"type_of_citation" ...
Cubic surface     10.2140/ant.2007.1.393      {"type_of_citation" ...
```

* `skate-bref-id`

> Temporary helper to add a key to a biblioref document.

* `skate-from-unstructured`

> Takes a refs file and plucks identifiers out of the unstructured field.

* `skate-conv`

> Converts a ref (or open library) document to a release. Part of the first step,
> merging refs and releases.

* `skate-to-doi`

> Sanitizes DOIs in a tabular file.

## Misc

The tools are built to handle a TB of JSON and billions of documents, especially for the following use case:

* deriving a key from a document
* sorting documents by (that) key
* clustering and verifying documents within clusters

The main use case is match candidate generation and verification for fuzzy matching, especially for building a citation graph dataset from [fatcat](https://fatcat.wiki).

![](static/two_cluster_synopsis.png)
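
Putting the pieces together, an end-to-end run over two datasets could look like the following sketch. The file names are placeholders, and the flags are simply the ones from the examples above; which mapper (`-m`) and which reducer (`-r`) to use depends on the matching task at hand.

```
# Sketch only: input and output names are placeholders.
$ skate-map -m ts < releases.jsonl | sort -k1,1 > releases.tsv
$ skate-map -m ts < refs.jsonl | sort -k1,1 > refs.tsv
$ skate-reduce -r bref -f releases.tsv -g refs.tsv > out.jsonl
```

Keeping both inputs sorted on the key column (`sort -k1,1`) is what allows documents sharing a key to be grouped and compared in a single pass over the two files.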