author Martin Czygan <martin.czygan@gmail.com> 2021-04-01 00:29:03 +0200
committer Martin Czygan <martin.czygan@gmail.com> 2021-04-01 00:29:03 +0200
commit 304a994951daf3930d0951b80c7ba22103f3a7f0
tree b1ad10048b6bb0961cdd2785750ecdbcf2882714
parent d60dff7db926cc40a288584ac3f9970bb85c30c0
update README
-rw-r--r-- skate/.gitignore 5
-rw-r--r-- skate/README.md 48
-rw-r--r-- skate/cmd/skate-from-unstructured/main.go 4
3 files changed, 50 insertions, 7 deletions
diff --git a/skate/.gitignore b/skate/.gitignore
index 723853e..4e893a0 100644
--- a/skate/.gitignore
+++ b/skate/.gitignore
@@ -17,14 +17,11 @@
/skate-ref-to-release
/skate-derive-key
/skate-cluster
-/skate-cluster-stats
-/skate-biblioref
/skate-verify
-/skate-fixup
/skate-to-doi
/skate-bref-id
/skate-from-unstructured
-/skate-biblioref-from-wikipedia
+/skate-wikipedia-doi
packaging/debian/skate/usr
skate_*_amd64.deb
diff --git a/skate/README.md b/skate/README.md
index bd66c3d..1962dc6 100644
--- a/skate/README.md
+++ b/skate/README.md
@@ -5,10 +5,56 @@ citation graph pipeline.
## Tools
-### skate-biblioref-from-wikipedia
+### skate-wikipedia-doi
+
+TSV (page title, DOI, doc) from wikipedia refs.
+
+```
+$ parquet-tools cat --json minimal_dataset.parquet | skate-wikipedia-doi
+Rational point 10.1515/crll.1988.386.32 {"type_of_citation" ...
+Cubic surface 10.2140/ant.2007.1.393 {"type_of_citation" ...
+```
+
### skate-bref-id
+
+Temporary helper to add a key to a biblioref document.
+
### skate-cluster
+
+Converts sorted key output into jsonlines clusters.
+
+For example, this:
+
+    id123 somekey123 {"a":"b", ...}
+    id391 somekey123 {"x":"y", ...}
+
+would turn into a single line containing all docs with the same key:
+
+ {"k": "somekey123", "v": [{"a":"b", ...},{"x":"y",...}]}
+
+A single line cluster is easier to parallelize (e.g. for verification, etc.).
+
### skate-derive-key
+
+skate-derive-key derives a key from release entity JSON documents.
+
+```
+$ skate-derive-key < release_entities.jsonlines > docs.tsv
+```
+
+Result will be a three column TSV (ident, key, doc).
+
+```
+---- ident --------------- ---- key --------- ---- doc ----------
+
+4lzgf5wzljcptlebhyobccj7ru 2568diamagneticsus {"abstracts":[],...
+```
+
+After this step:
+
+* sort by key, e.g. `LC_ALL=C sort -k2,2 -S 35% --parallel 6 --compress-program pzstd ...`
+* cluster, e.g. `skate-cluster ...`
+
### skate-from-unstructured
### skate-ref-to-release
### skate-to-doi
diff --git a/skate/cmd/skate-from-unstructured/main.go b/skate/cmd/skate-from-unstructured/main.go
index 1775f4d..0208d91 100644
--- a/skate/cmd/skate-from-unstructured/main.go
+++ b/skate/cmd/skate-from-unstructured/main.go
@@ -1,5 +1,5 @@
-// skate-from-unstructured tries to parse various pieces of information from an
-// unstrctured string.
+// skate-from-unstructured tries to parse various pieces of information from
+// the unstructured field in refs.
package main
import (