author: Martin Czygan <martin.czygan@gmail.com> 2021-04-14 00:24:46 +0200
committer: Martin Czygan <martin.czygan@gmail.com> 2021-04-19 20:29:17 +0200
commit: 9a66d7c4896f3415816d2df97bba9a01ac0ebf0c
tree: 2371a84935e92044ab024e7b849f00265c4ab2b9
parent: c7b6745dd75fdf4b0e636f39f2bc256da5231195
update notes
Diffstat (limited to 'python/notes')
-rw-r--r-- python/notes/version_3.md 22
1 files changed, 22 insertions, 0 deletions
diff --git a/python/notes/version_3.md b/python/notes/version_3.md
index 71d4dd1..891b61c 100644
--- a/python/notes/version_3.md
+++ b/python/notes/version_3.md
@@ -208,3 +208,25 @@ $ time zstdcat -T0 /magna/refcat/UnmatchedRefs/date-2021-02-20.json.zst | LC_ALL
A first run only got 64008 docs; improbable that we are missing so many doi.
Also, need to generalize some skate code a bit.
+
+----
+
+# Verification stats
+
+* have 40257623 clusters, `zstdcat -T0 /magna/refcat/RefsFatcatClusters/date-2021-02-20.json.zst | wc -l`
+* have X clusters of size less than 10
+
+```
+$ zstdcat -T0 /magna/refcat/RefsFatcatClusters/date-2021-02-20.json.zst |
+ jq -rc 'select(.v|length < 10)' | LC_ALL=C wc -l
+```
+
+A 5M sample.
+
+```
+$ awk '{print $3}' cluster_verify_5m.txt | sort | uniq -c | sort -nr
+6886124 StatusDifferent
+4619805 StatusStrong
+3587478 StatusExact
+ 120215 StatusAmbiguous
+```
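
The status tally in the notes can also be reproduced in Python, which makes it easy to add relative shares next to the raw counts. The following is a minimal sketch, not part of the commit: it assumes, as the `awk '{print $3}'` call suggests, that the status is the third whitespace-separated field of each line in `cluster_verify_5m.txt`. The function names `tally_third_column` and `shares` are hypothetical, and unlike `awk`, the sketch skips lines with fewer than three fields instead of counting them as empty values.

```python
import collections

# Counts copied from the `uniq -c` output in the notes above.
NOTES_COUNTS = collections.Counter({
    "StatusDifferent": 6886124,
    "StatusStrong": 4619805,
    "StatusExact": 3587478,
    "StatusAmbiguous": 120215,
})


def tally_third_column(lines):
    """Count values of the third whitespace-separated field of each line.

    Roughly mirrors: awk '{print $3}' | sort | uniq -c
    Assumption: lines with fewer than three fields are skipped.
    """
    counts = collections.Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


def shares(counts):
    """Return each status as a fraction of the total number of counted lines."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {status: n / total for status, n in counts.items()}


if __name__ == "__main__":
    # To tally a file instead, pass an open file handle, e.g.
    # with open("cluster_verify_5m.txt") as f: counts = tally_third_column(f)
    fractions = shares(NOTES_COUNTS)
    # most_common() orders by descending count, like `sort -nr`.
    for status, n in NOTES_COUNTS.most_common():
        print(f"{n:>8} {fractions[status]:7.2%} {status}")
```

Using a `Counter` avoids the two sort passes of the shell pipeline, since only one small dictionary of status names is kept in memory.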