# skate

A library and suite of command line tools related to generating a [citation
graph](https://en.wikipedia.org/wiki/Citation_graph).

> There is no standard format for the citations in bibliographies, and the
> record linkage of citations can be a time-consuming and complicated process.

## Background

Python was a bit too slow, even when parallelized (with GNU parallel), e.g.
for generating clusters of similar documents or for verification. An option
for the future would be to resort to [Cython](https://cython.org/). Parts of
[fuzzycat](https://git.archive.org/webgroup/fuzzycat) have been ported into
this project for performance (we saw a 25x speedup for certain tasks).

![](static/zipkey.png)

## Overview

We follow a map-reduce style approach (on a single machine): we extract
specific keys from the data, group items sharing the same *key* together, and
apply some computation on these groups.

A Mapper is defined as a function type, mapping a blob of data (e.g. a single
JSON object) to a number of fields (e.g. key and value).

```go
// Mapper maps a blob to an arbitrary number of fields, e.g. for (key,
// doc). We want fields, but we do not want to bake TSV into each function.
type Mapper func([]byte) ([][]byte, error)
```

We can attach a serialization method to this function type to emit TSV - this
way we only have to deal with TSV in one place.

```go
// AsTSV serializes the result of a field mapper as TSV. This is a slim
// adapter, e.g. to parallel.Processor, which expects this function signature.
// A newline will be appended, if not there already.
func (f Mapper) AsTSV(p []byte) ([]byte, error) {
	var (
		fields [][]byte
		err    error
		b      []byte
	)
	if fields, err = f(p); err != nil {
		return nil, err
	}
	if len(fields) == 0 {
		return nil, nil
	}
	b = bytes.Join(fields, bTab)
	if len(b) > 0 && !bytes.HasSuffix(b, bNewline) {
		b = append(b, bNewline...)
	}
	return b, nil
}
```

Reducers typically take two streams of (key, doc) lines, sorted by key, find
all documents sharing a key, and then apply a function to each such group.
This is made a bit generic in the subpackage [zipkey](zipkey); a sketch of the
map and reduce steps follows below.

### Example Map/Reduce

* extract DOI (and other identifiers) and emit "biblioref"
* extract normalized titles (or container titles), verify candidates and emit
  a biblioref for exact and strong matches; e.g. between papers, and between
  papers and books, etc.
* extract ids and find unmatched refs in the raw blob

Scale: a few million up to a few billion docs.
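
### Sketch: a simple Mapper

To make the Mapper idea above concrete, here is a minimal, self-contained
sketch. The `TitleKey` function, its normalization, and its field layout are
illustrative only and not part of skate; the `Mapper` type and `AsTSV`
adapter mirror the definitions shown earlier, with the helper byte slices
inlined.

```go
// A hypothetical mapper: extract a normalized title from a JSON document and
// emit (key, doc) as a TSV line. Not the actual skate code, just the pattern.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"strings"
)

// Mapper mirrors the function type from the overview.
type Mapper func([]byte) ([][]byte, error)

var (
	bTab     = []byte("\t")
	bNewline = []byte("\n")
)

// AsTSV serializes the mapper result as a single TSV line, mirroring the
// adapter shown above.
func (f Mapper) AsTSV(p []byte) ([]byte, error) {
	fields, err := f(p)
	if err != nil {
		return nil, err
	}
	if len(fields) == 0 {
		return nil, nil
	}
	b := bytes.Join(fields, bTab)
	if len(b) > 0 && !bytes.HasSuffix(b, bNewline) {
		b = append(b, bNewline...)
	}
	return b, nil
}

// TitleKey is a hypothetical mapper: lowercase the title, drop spaces, and
// emit (key, original doc). Documents without a title are skipped.
func TitleKey(p []byte) ([][]byte, error) {
	var doc struct {
		Title string `json:"title"`
	}
	if err := json.Unmarshal(p, &doc); err != nil {
		return nil, err
	}
	key := strings.ReplaceAll(strings.ToLower(doc.Title), " ", "")
	if key == "" {
		return nil, nil
	}
	return [][]byte{[]byte(key), p}, nil
}

func main() {
	line := []byte(`{"title": "Citation Graph Basics"}`)
	out, err := Mapper(TitleKey).AsTSV(line)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s", out) // citationgraphbasics<TAB>{"title": ...}
}
```

Keys emitted this way can be fed through `sort(1)` so that the reduce step
only ever sees key-ordered input.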
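
### Sketch: zipping two sorted streams

The following sketch illustrates the idea behind the reduce step: walk two
key-sorted (key, doc) streams in lockstep and invoke a callback for every key
that appears in both. This is not the actual zipkey API; `groupReader`,
`next`, and `zip` are hypothetical names, and the real subpackage differs.

```go
// A merge-join over two key-sorted TSV streams, for illustration only.
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// groupReader yields runs of consecutive lines sharing the same key (the
// first TSV column) from a key-sorted stream.
type groupReader struct {
	scanner *bufio.Scanner
	peeked  string
	hasPeek bool
}

func newGroupReader(r io.Reader) *groupReader {
	return &groupReader{scanner: bufio.NewScanner(r)}
}

// keyOf returns the first TSV column of a line.
func keyOf(line string) string {
	if i := strings.IndexByte(line, '\t'); i >= 0 {
		return line[:i]
	}
	return line
}

// next returns the next key and all lines carrying it; ok is false at EOF.
func (g *groupReader) next() (key string, group []string, ok bool) {
	if !g.hasPeek {
		if !g.scanner.Scan() {
			return "", nil, false
		}
		g.peeked = g.scanner.Text()
	}
	key = keyOf(g.peeked)
	group = append(group, g.peeked)
	g.hasPeek = false
	for g.scanner.Scan() {
		line := g.scanner.Text()
		if keyOf(line) != key {
			g.peeked, g.hasPeek = line, true
			break
		}
		group = append(group, line)
	}
	return key, group, true
}

// zip walks two key-sorted streams in lockstep and calls f for every key
// present in both, passing the full group from each side.
func zip(a, b *groupReader, f func(key string, ga, gb []string)) {
	ka, ga, oka := a.next()
	kb, gb, okb := b.next()
	for oka && okb {
		switch {
		case ka < kb:
			ka, ga, oka = a.next()
		case ka > kb:
			kb, gb, okb = b.next()
		default:
			f(ka, ga, gb)
			ka, ga, oka = a.next()
			kb, gb, okb = b.next()
		}
	}
}

func main() {
	refs := strings.NewReader("k1\tref1\nk2\tref2\nk2\tref3\n")
	releases := strings.NewReader("k2\trelease9\nk3\trelease4\n")
	zip(newGroupReader(refs), newGroupReader(releases), func(key string, ga, gb []string) {
		fmt.Printf("%s: %d refs x %d releases\n", key, len(ga), len(gb))
	})
	// Output: k2: 2 refs x 1 releases
}
```

Because both inputs are key-sorted, the join runs in a single pass with
memory proportional to the largest group, which is what makes this approach
viable at the scale mentioned above.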