# skate

A library and suite of command line tools related to generating a [citation
graph](https://en.wikipedia.org/wiki/Citation_graph).

> There is no standard format for the citations in bibliographies, and the
> record linkage of citations can be a time-consuming and complicated process.

## Background

Python was a bit too slow, even when parallelized (with GNU parallel), e.g.
for generating clusters of similar documents or for verification. An option
for the future would be to resort to [Cython](https://cython.org/). Parts of
[fuzzycat](https://git.archive.org/webgroup/fuzzycat) have been ported into
this project for performance (we saw a 25x speedup for certain tasks).

![](static/zipkey.png)

## Overview

We follow a map-reduce style approach (on a single machine): we extract
specific keys from the data, group items with the same *key* together (via
sort) and apply some computation on these groups.

A mapper is defined as a function type, mapping a blob of data (e.g. a single
JSON object) to a number of fields (e.g. key, value).

```go
// Mapper maps a blob to an arbitrary number of fields, e.g. for (key,
// doc). We want fields, but we do not want to bake TSV into each function.
type Mapper func([]byte) ([][]byte, error)
```

We can attach a serialization method to this function type to emit TSV - this
way we only have to deal with TSV in one place.

```go
// AsTSV serializes the result of a field mapper as TSV. This is a slim
// adapter, e.g. to parallel.Processor, which expects this function signature.
// A newline will be appended, if not there already.
func (f Mapper) AsTSV(p []byte) ([]byte, error) {
	var (
		fields [][]byte
		err    error
		b      []byte
	)
	if fields, err = f(p); err != nil {
		return nil, err
	}
	if len(fields) == 0 {
		return nil, nil
	}
	b = bytes.Join(fields, bTab)
	if len(b) > 0 && !bytes.HasSuffix(b, bNewline) {
		b = append(b, bNewline...)
	}
	return b, nil
}
```

Reducers typically take two sorted streams of (key, doc) lines, find all
documents sharing a key, then apply a function to each group. This is made a
bit more generic in the subpackage [zipkey](zipkey).

### Example Map/Reduce

* extract DOI (and other identifiers) and emit a "biblioref"
* extract normalized titles (or container titles), verify candidates and emit
  a biblioref for exact and strong matches; e.g. between papers, or between
  papers and books, etc.
* extract ids and find unmatched refs in the raw blob

Scale: a few million up to a few billion docs.

## TODO and Issues

+ [ ] a clearer way to deduplicate edges

Currently, we use the `source_release_ident` and `ref_index` as an
elasticsearch key. This means that reference docs coming from different
sources (e.g. crossref, grobid, etc.) but representing the same item must
match in their index, which we can neither guarantee nor be robust about,
since indices may change frequently across sources.

A clearer path could be:

1. group all matches (and non-matches) by source work id (already the case)
2. generate a list of unique refs, unique by source-target, w/o any index
   (see the sketch below)
3. for any source-target pair with more than one occurrence, decide which one
   of these we want to include

Step 3 can then go into full depth understanding the multiplicities, e.g. is
it an "ebd", "ff" type? Does it come from a different source (then e.g.
choose the one most likely to be correct), ...
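To make step 2 concrete, here is a minimal, hypothetical sketch (not part of
skate) that collapses biblioref edges to unique (source, target) pairs,
dropping the `ref_index`. The struct and the `target_release_ident` field name
are assumptions for illustration, not the actual schema.

```go
// Hypothetical sketch: read biblioref docs as JSON lines from stdin and keep
// only the first edge seen per (source, target) pair, ignoring ref_index.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"log"
	"os"
)

// BiblioRef is a minimal, assumed slice of the real document schema.
type BiblioRef struct {
	SourceReleaseIdent string `json:"source_release_ident"`
	TargetReleaseIdent string `json:"target_release_ident"` // assumed field name
	RefIndex           int    `json:"ref_index"`
}

func main() {
	seen := make(map[string]BiblioRef) // key: source-target, w/o any index
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		var ref BiblioRef
		if err := json.Unmarshal(scanner.Bytes(), &ref); err != nil {
			log.Fatal(err)
		}
		key := ref.SourceReleaseIdent + "-" + ref.TargetReleaseIdent
		if _, ok := seen[key]; ok {
			// Step 3 would decide which of the duplicates to keep;
			// here we simply keep the first one.
			continue
		}
		seen[key] = ref
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("unique edges: %d\n", len(seen))
}
```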