Diffstat (limited to 'skate')
-rw-r--r--  skate/README.md  103
1 file changed, 42 insertions(+), 61 deletions(-)
diff --git a/skate/README.md b/skate/README.md
index a055f57..f3a4463 100644
--- a/skate/README.md
+++ b/skate/README.md
@@ -18,74 +18,55 @@ project for performance (and we saw a 25x speedup for certain tasks).
## Overview
-First, generate a "sorted key file" - for our purposes a TSV containing a key
-and the original document. Various mappers are implemented and it is relatively
-easy to add another one.
+We follow a map-reduce style approach (on a single machine): we extract a
+specific key from each document, group items that share a *key*, and apply
+some computation to each group.
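+
+In practice this is a pipeline of small command line tools. A sketch, with
+commands and flags taken from an earlier revision of this README (they may
+have changed since):
+
+```
+$ skate-map -m ts < file.jsonl | sort -k1,1 > file.1
+$ skate-map -m ts < other.jsonl | sort -k1,1 > file.2
+$ skate-reduce -r bref -f file.1 -g file.2
+```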
-```
-$ skate-map -m ts < file.jsonl | sort -k1,1 > map.tsv
-```
-
-Repeat the mapping for any file you want to compare against the catalog. Then,
-decide which *reduce* mode is desired.
+A Mapper is defined as a function type, mapping a blob of data (e.g. a
+single JSON object) to a number of fields (e.g. a key and a value).
+```go
+// Mapper maps a blob to an arbitrary number of fields, e.g. for (key,
+// doc). We want fields, but we do not want to bake in TSV into each function.
+type Mapper func([]byte) ([][]byte, error)
```
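+
+For illustration, a hypothetical mapper with this signature could key a
+document by its title (a sketch, not a mapper from the package):
+
+```go
+import (
+	"encoding/json"
+	"strings"
+)
+
+// TitleMapper keys a JSON document by a lowercased "title" field and keeps
+// the original blob as the value. Hypothetical example, for illustration.
+func TitleMapper(p []byte) ([][]byte, error) {
+	var doc struct {
+		Title string `json:"title"`
+	}
+	if err := json.Unmarshal(p, &doc); err != nil {
+		return nil, err
+	}
+	key := strings.ToLower(strings.TrimSpace(doc.Title))
+	return [][]byte{[]byte(key), p}, nil
+}
+```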
-$ skate-reduce -r bref -f file.1 -g file.2
-```
-
-Depending on what the reducer does, it can generate a verification status or
-some export schema.
-
-WIP: ...
-
-## Core Utils
-
-* `skate-map`
-* `skate-reduce`
-
-The `skate-map` extract various keys from datasets, `skate-reduce` runs various matching and verification algorithms.
-
-## Extra
-
-* skate-wikipedia-doi
-> TSV (page title, DOI, doc) from wikipedia refs.
-
-```
-$ parquet-tools cat --json minimal_dataset.parquet | skate-wikipedia-doi
-Rational point 10.1515/crll.1988.386.32 {"type_of_citation" ...
-Cubic surface 10.2140/ant.2007.1.393 {"type_of_citation" ...
+We can attach a serialization method to this function type to emit TSV -
+this way we only have to deal with TSV in one place.
+
+```go
+// AsTSV serializes the result of a field mapper as TSV. This is a slim
+// adapter, e.g. to parallel.Processor, which expects this function signature.
+// A newline will be appended, if not there already.
+func (f Mapper) AsTSV(p []byte) ([]byte, error) {
+ var (
+ fields [][]byte
+ err error
+ b []byte
+ )
+ if fields, err = f(p); err != nil {
+ return nil, err
+ }
+ if len(fields) == 0 {
+ return nil, nil
+ }
+ b = bytes.Join(fields, bTab)
+ if len(b) > 0 && !bytes.HasSuffix(b, bNewline) {
+ b = append(b, bNewline...)
+ }
+ return b, nil
+}
```
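+
+A Mapper can then be plugged into a parallel processing helper. A sketch,
+assuming the github.com/miku/parallel package referenced in the comment
+above, reusing the hypothetical TitleMapper from earlier:
+
+```go
+import (
+	"log"
+	"os"
+
+	"github.com/miku/parallel"
+)
+
+func main() {
+	// Run the mapper over stdin with multiple workers, write TSV to stdout.
+	p := parallel.NewProcessor(os.Stdin, os.Stdout, Mapper(TitleMapper).AsTSV)
+	if err := p.Run(); err != nil {
+		log.Fatal(err)
+	}
+}
+```
+
+The TSV output can then be key-sorted with plain `sort -k1,1` before it is
+passed to a reducer.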
-* skate-bref-id
-
-> Temporary helper to add a key to a biblioref document.
-
-* skate-from-unstructured
-
-> Takes a refs file and plucks out identifiers from unstructured field.
-
-* skate-conv
-
-> Converts a ref (or open library) document to a release. Part of first step,
-> merging refs and releases.
-
-* skate-to-doi
-
-> Sanitize DOI in tabular file.
-
-## Misc
-
-Handling a TB of JSON and billions of documents, especially for the following
-use case:
+Reducers typically take two key-sorted streams of (key, doc) lines, find
+all documents sharing a key, and then apply a function to each such group.
+This is made a bit generic in the subpackage [zipkey](zipkey).
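+
+The core idea is a zip over two key-sorted streams, similar to the merge
+step of mergesort. A self-contained sketch of that grouping (the idea
+only, not the actual zipkey API), operating on slices for brevity:
+
+```go
+import "strings"
+
+// keyOf returns the first TSV column of a line.
+func keyOf(line string) string {
+	if i := strings.IndexByte(line, '\t'); i >= 0 {
+		return line[:i]
+	}
+	return line
+}
+
+// groupByKey zips two key-sorted line slices and calls f once per key with
+// the lines from each side that carry that key.
+func groupByKey(a, b []string, f func(key string, g0, g1 []string)) {
+	var i, j int
+	for i < len(a) || j < len(b) {
+		var key string
+		switch {
+		case i >= len(a):
+			key = keyOf(b[j])
+		case j >= len(b):
+			key = keyOf(a[i])
+		default:
+			key = keyOf(a[i])
+			if k := keyOf(b[j]); k < key {
+				key = k
+			}
+		}
+		var g0, g1 []string
+		for i < len(a) && keyOf(a[i]) == key {
+			g0 = append(g0, a[i])
+			i++
+		}
+		for j < len(b) && keyOf(b[j]) == key {
+			g1 = append(g1, b[j])
+			j++
+		}
+		f(key, g0, g1)
+	}
+}
+```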
-* deriving a key from a document
-* sort documents by (that) key
-* clustering and verifing documents in clusters
+### Example Map/Reduce
-The main use case is match candidate generation and verification for fuzzy
-matching, especially for building a citation graph dataset from
-[fatcat](https://fatcat.wiki).
+* extract DOIs (and other identifiers) and emit a "biblioref" document
+* extract normalized titles (or container titles), verify candidates and
+  emit a biblioref for exact and strong matches, e.g. between papers, or
+  between papers and books (a sketch of such a key function follows below)
+* extract ids and find unmatched refs in the raw blob
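+
+As referenced in the second item above, a hypothetical normalization for a
+title key might look like this (the actual normalization in skate may
+differ):
+
+```go
+import (
+	"strings"
+	"unicode"
+)
+
+// NormalizeTitle derives a match key from a title: lowercase, letters and
+// digits only, so that formatting differences do not prevent grouping.
+func NormalizeTitle(s string) string {
+	var b strings.Builder
+	for _, r := range strings.ToLower(s) {
+		if unicode.IsLetter(r) || unicode.IsDigit(r) {
+			b.WriteRune(r)
+		}
+	}
+	return b.String()
+}
+```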
-![](static/two_cluster_synopsis.png)
+Scale: a few million up to a few billion documents.