Diffstat (limited to 'python')
-rw-r--r--  python/notes/version_3.md  22
1 files changed, 22 insertions, 0 deletions
diff --git a/python/notes/version_3.md b/python/notes/version_3.md
index 71d4dd1..891b61c 100644
--- a/python/notes/version_3.md
+++ b/python/notes/version_3.md
@@ -208,3 +208,25 @@ $ time zstdcat -T0 /magna/refcat/UnmatchedRefs/date-2021-02-20.json.zst | LC_ALL
A first run only got 64008 docs; improbable that we are missing so many doi.
Also, need to generalize some skate code a bit.
+
+----
+
+# Verification stats
+
+* have 40257623 clusters, `zstdcat -T0 /magna/refcat/RefsFatcatClusters/date-2021-02-20.json.zst | wc -l`
+* have X clusters of size less than 10
+
+```
+$ zstdcat -T0 /magna/refcat/RefsFatcatClusters/date-2021-02-20.json.zst |
+ jq -rc 'select(.v|length < 10)' | LC_ALL=C wc -l
+```
+
+Verification status counts from a 5M sample (`cluster_verify_5m.txt`):
+
+```
+$ awk '{print $3}' cluster_verify_5m.txt | sort | uniq -c | sort -nr
+6886124 StatusDifferent
+4619805 StatusStrong
+3587478 StatusExact
+ 120215 StatusAmbiguous
+```
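
The raw counts above are easier to compare as shares of the total. The following is a minimal sketch, not part of the patch, that parses `uniq -c` style output and prints each status with its percentage. The helper names `parse_uniq_counts` and `shares` are hypothetical, and the input is simply the four lines shown in the diff pasted in as a string.

```python
def parse_uniq_counts(text):
    """Parse `uniq -c` style lines ("<count> <label>") into a dict."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        count, label = parts
        counts[label] = int(count)
    return counts


def shares(counts):
    """Return each label's fraction of the total count."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


# Counts copied from the `sort | uniq -c | sort -nr` output in the diff.
UNIQ_OUTPUT = """
6886124 StatusDifferent
4619805 StatusStrong
3587478 StatusExact
 120215 StatusAmbiguous
"""

counts = parse_uniq_counts(UNIQ_OUTPUT)
total = sum(counts.values())

for label, share in shares(counts).items():
    print(f"{label:<16} {counts[label]:>9} {share:7.2%}")
print(f"{'total':<16} {total:>9}")
```

The four counts sum to 15213622 status lines, so the percentages are relative to that total rather than to the 5M figure mentioned in the notes.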