-rw-r--r--  skate/README.md | 103
1 file changed, 42 insertions(+), 61 deletions(-)
diff --git a/skate/README.md b/skate/README.md
index a055f57..f3a4463 100644
--- a/skate/README.md
+++ b/skate/README.md
@@ -18,74 +18,55 @@ project for performance (and we saw a 25x speedup for certain tasks).
 
 ## Overview
 
-First, generate a "sorted key file" - for our purposes a TSV containing a key
-and the original document. Various mappers are implemented and it is relatively
-easy to add another one.
+We follow a map-reduce style approach (on a single machine): we extract
+specific keys from the data, group items with the same *key* together, and
+apply some computation on these groups.
 
-```
-$ skate-map -m ts < file.jsonl | sort -k1,1 > map.tsv
-```
-
-Repeat the mapping for any file you want to compare against the catalog. Then,
-decide which *reduce* mode is desired.
 
+A mapper is defined as a function type, mapping a blob of data (e.g. a single
+JSON object) to a number of fields (e.g. a key and a value).
+
+```go
+// Mapper maps a blob to an arbitrary number of fields, e.g. for (key,
+// doc). We want fields, but we do not want to bake TSV into each function.
+type Mapper func([]byte) ([][]byte, error)
 ```
-$ skate-reduce -r bref -f file.1 -g file.2
-```
 
-Depending on what the reducer does, it can generate a verification status or
-some export schema.
-
-WIP: ...
-
-## Core Utils
-
-* `skate-map`
-* `skate-reduce`
-
-The `skate-map` extract various keys from datasets, `skate-reduce` runs various matching and verification algorithms.
-
-## Extra
-
-* skate-wikipedia-doi
-
-> TSV (page title, DOI, doc) from wikipedia refs.
-
-```
-$ parquet-tools cat --json minimal_dataset.parquet | skate-wikipedia-doi
-Rational point 10.1515/crll.1988.386.32 {"type_of_citation" ...
-Cubic surface 10.2140/ant.2007.1.393 {"type_of_citation" ...
+We can attach a serialization method to this function type to emit TSV; this
+way we have to deal with TSV only once.
+
+```go
+// AsTSV serializes the result of a field mapper as TSV. This is a slim
+// adapter, e.g. to parallel.Processor, which expects this function signature.
+// A newline will be appended, if not there already.
+func (f Mapper) AsTSV(p []byte) ([]byte, error) {
+	var (
+		fields [][]byte
+		err    error
+		b      []byte
+	)
+	if fields, err = f(p); err != nil {
+		return nil, err
+	}
+	if len(fields) == 0 {
+		return nil, nil
+	}
+	b = bytes.Join(fields, bTab)
+	if len(b) > 0 && !bytes.HasSuffix(b, bNewline) {
+		b = append(b, bNewline...)
+	}
+	return b, nil
+}
 ```
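+
+For illustration, here is a hypothetical custom mapper (a sketch, not one of
+the built-in mappers) that keys a document by a normalized title field; it
+assumes "encoding/json" and "strings" are imported:
+
+```go
+// titleKey is a hypothetical example Mapper: it extracts a lowercased,
+// trimmed title from a JSON document and emits (key, original document).
+var titleKey Mapper = func(p []byte) ([][]byte, error) {
+	var doc struct {
+		Title string `json:"title"`
+	}
+	if err := json.Unmarshal(p, &doc); err != nil {
+		return nil, err
+	}
+	key := strings.ToLower(strings.TrimSpace(doc.Title))
+	if key == "" {
+		return nil, nil // skip documents without a title
+	}
+	return [][]byte{[]byte(key), p}, nil
+}
+```
+
+Something like `titleKey.AsTSV` can then be handed to a line-oriented
+processor to turn a stream of JSON blobs into sortable (key, doc) TSV lines.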
 
-* skate-bref-id
-
-> Temporary helper to add a key to a biblioref document.
-
-* skate-from-unstructured
-
-> Takes a refs file and plucks out identifiers from unstructured field.
-
-* skate-conv
-
-> Converts a ref (or open library) document to a release. Part of first step,
-> merging refs and releases.
-
-* skate-to-doi
-
-> Sanitize DOI in tabular file.
-
-## Misc
-
-Handling a TB of JSON and billions of documents, especially for the following
-use case:
+Reducers typically take two sorted streams of (key, doc) lines, find all
+documents sharing a key, and then apply a function to each such group. This is
+made a bit generic in the [zipkey](zipkey) subpackage; a condensed sketch of
+the underlying merge follows at the end of this overview.
 
-* deriving a key from a document
-* sort documents by (that) key
-* clustering and verifing documents in clusters
+### Example Map/Reduce
 
-The main use case is match candidate generation and verification for fuzzy
-matching, especially for building a citation graph dataset from
-[fatcat](https://fatcat.wiki).
+* extract DOI (and other identifiers) and emit a "biblioref"
+* extract normalized titles (or container titles), verify candidates, and emit
+  a biblioref for exact and strong matches, e.g. between papers, or between
+  papers and books
+* extract ids and find unmatched refs in the raw blob
 
-![](static/two_cluster_synopsis.png)
+Scale: a few million up to a few billion docs
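+
+The following is a condensed, hypothetical sketch of that merge, written here
+for illustration and independent of the actual zipkey API; it assumes
+key-sorted input and only standard library imports ("bufio", "io", "strings"):
+
+```go
+// firstColumn returns the key of a TSV line: the text up to the first tab.
+func firstColumn(line string) string {
+	if i := strings.IndexByte(line, '\t'); i >= 0 {
+		return line[:i]
+	}
+	return line
+}
+
+// groups yields runs of consecutive lines sharing a key from one sorted stream.
+type groups struct {
+	sc      *bufio.Scanner
+	pending string
+	done    bool
+}
+
+func newGroups(r io.Reader) *groups {
+	g := &groups{sc: bufio.NewScanner(r)}
+	if g.sc.Scan() {
+		g.pending = g.sc.Text()
+	} else {
+		g.done = true
+	}
+	return g
+}
+
+// next returns the next key and all lines sharing it; ok is false at EOF.
+func (g *groups) next() (key string, lines []string, ok bool) {
+	if g.done {
+		return "", nil, false
+	}
+	key = firstColumn(g.pending)
+	for {
+		lines = append(lines, g.pending)
+		if !g.sc.Scan() {
+			g.done = true
+			return key, lines, true
+		}
+		g.pending = g.sc.Text()
+		if firstColumn(g.pending) != key {
+			return key, lines, true
+		}
+	}
+}
+
+// zipGroups walks two key-sorted streams in lockstep and calls f for every
+// key present in both, passing the lines from each side that share the key.
+// This is a sketch only; the real implementation in zipkey differs in detail.
+func zipGroups(r, s io.Reader, f func(key string, a, b []string) error) error {
+	ga, gb := newGroups(r), newGroups(s)
+	ka, la, oka := ga.next()
+	kb, lb, okb := gb.next()
+	for oka && okb {
+		switch {
+		case ka < kb:
+			ka, la, oka = ga.next()
+		case ka > kb:
+			kb, lb, okb = gb.next()
+		default:
+			if err := f(ka, la, lb); err != nil {
+				return err
+			}
+			ka, la, oka = ga.next()
+			kb, lb, okb = gb.next()
+		}
+	}
+	return nil
+}
+```
+
+A concrete reducer then only supplies f, e.g. a function that verifies a
+match between the two groups and emits a biblioref.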