Diffstat (limited to 'skate')
-rw-r--r-- | skate/.gitignore | 5
-rw-r--r-- | skate/README.md | 48
-rw-r--r-- | skate/cmd/skate-from-unstructured/main.go | 4
3 files changed, 50 insertions, 7 deletions
diff --git a/skate/.gitignore b/skate/.gitignore
index 723853e..4e893a0 100644
--- a/skate/.gitignore
+++ b/skate/.gitignore
@@ -17,14 +17,11 @@
 /skate-ref-to-release
 /skate-derive-key
 /skate-cluster
-/skate-cluster-stats
-/skate-biblioref
 /skate-verify
-/skate-fixup
 /skate-to-doi
 /skate-bref-id
 /skate-from-unstructured
-/skate-biblioref-from-wikipedia
+/skate-wikipedia-doi
 
 packaging/debian/skate/usr
 skate_*_amd64.deb
diff --git a/skate/README.md b/skate/README.md
index bd66c3d..1962dc6 100644
--- a/skate/README.md
+++ b/skate/README.md
@@ -5,10 +5,56 @@ citation graph pipeline.
 
 ## Tools
 
-### skate-biblioref-from-wikipedia
+### skate-wikipedia-doi
+
+TSV (page title, DOI, doc) from wikipedia refs.
+
+```
+$ parquet-tools cat --json minimal_dataset.parquet | skate-wikipedia-doi
+Rational point 10.1515/crll.1988.386.32 {"type_of_citation" ...
+Cubic surface 10.2140/ant.2007.1.393 {"type_of_citation" ...
+```
+
 ### skate-bref-id
+
+Temporary helper to add a key to a biblioref document.
+
 ### skate-cluster
+
+Converts a sorted key output into a jsonlines clusters.
+
+For example, this:
+
+    id123 somekey123 {"a":"b", ...}
+    id391 somekey123 {"x":"y", ...}
+
+would turn into (a single line containing all docs with the same key).
+
+    {"k": "somekey123", "v": [{"a":"b", ...},{"x":"y",...}]}
+
+A single line cluster is easier to parallelize (e.g. for verification, etc.).
+
 ### skate-derive-key
+
+skate-derive-key derives a key from release entity JSON documents.
+
+```
+$ skate-derive-key < release_entities.jsonlines > docs.tsv
+```
+
+Result will be a three column TSV (ident, key, doc).
+
+```
+---- ident --------------- ---- key --------- ---- doc ----------
+
+4lzgf5wzljcptlebhyobccj7ru 2568diamagneticsus {"abstracts":[],...
+```
+
+After this step:
+
+* sort by key, e.g. `LC_ALL=C sort -k2,2 -S 35% --parallel 6 --compress-program pzstd ...`
+* cluster, e.g. `skate-cluster ...`
+
 ### skate-from-unstructured
 ### skate-ref-to-release
 ### skate-to-doi
diff --git a/skate/cmd/skate-from-unstructured/main.go b/skate/cmd/skate-from-unstructured/main.go
index 1775f4d..0208d91 100644
--- a/skate/cmd/skate-from-unstructured/main.go
+++ b/skate/cmd/skate-from-unstructured/main.go
@@ -1,5 +1,5 @@
-// skate-from-unstructured tries to parse various pieces of information from an
-// unstrctured string.
+// skate-from-unstructured tries to parse various pieces of information from
+// the unstructured field in refs.
 package main
 
 import (