author | Martin Czygan <martin.czygan@gmail.com> | 2021-04-14 00:24:46 +0200 |
---|---|---|
committer | Martin Czygan <martin.czygan@gmail.com> | 2021-04-19 20:29:17 +0200 |
commit | 9a66d7c4896f3415816d2df97bba9a01ac0ebf0c (patch) | |
tree | 2371a84935e92044ab024e7b849f00265c4ab2b9 /python | |
parent | c7b6745dd75fdf4b0e636f39f2bc256da5231195 (diff) | |
update notes
Diffstat (limited to 'python')
-rw-r--r-- | python/notes/version_3.md | 22 |
1 files changed, 22 insertions, 0 deletions
diff --git a/python/notes/version_3.md b/python/notes/version_3.md
index 71d4dd1..891b61c 100644
--- a/python/notes/version_3.md
+++ b/python/notes/version_3.md
@@ -208,3 +208,25 @@ $ time zstdcat -T0 /magna/refcat/UnmatchedRefs/date-2021-02-20.json.zst | LC_ALL
 
 A first run only got 64008 docs; improbable that we are missing so many doi. Also,
 need to generalize some skate code a bit.
+
+----
+
+# Verification stats
+
+* have 40257623 clusters, `zstdcat -T0 /magna/refcat/RefsFatcatClusters/date-2021-02-20.json.zst | wc -l`
+* have X cluster of size less than 10
+
+```
+$ zstdcat -T0 /magna/refcat/RefsFatcatClusters/date-2021-02-20.json.zst |
+    jq -rc 'select(.v|length < 10)' | LC_ALL=C wc -l
+```
+
+A 5M sample.
+
+```
+$ awk '{print $3}' cluster_verify_5m.txt | sort | uniq -c | sort -nr
+6886124 StatusDifferent
+4619805 StatusStrong
+3587478 StatusExact
+ 120215 StatusAmbiguous
+```
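
The two shell pipelines in the added notes could also be written in Python, for instance to reuse the counts in a script. This is a minimal sketch and not code from the repository. It assumes that each cluster document is one JSON object per line with a list under the key `v` (as the `jq` filter suggests), and that the verification status sits in the third whitespace-separated column of the sample file (as the `awk` call suggests). The function names `count_small_clusters` and `tally_status` are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable, List, Tuple


def count_small_clusters(lines: Iterable[str], max_size: int = 10) -> int:
    """Count JSON-lines documents whose "v" list has fewer than max_size items.

    Assumption: "v" holds the cluster members as a list. A missing or null
    "v" is treated as size 0, which is what jq's `length` yields for null.
    """
    n = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        if len(doc.get("v") or []) < max_size:
            n += 1
    return n


def tally_status(lines: Iterable[str], column: int = 3) -> List[Tuple[str, int]]:
    """Count values in a 1-based whitespace-separated column, most frequent first.

    Roughly `awk '{print $3}' | sort | uniq -c | sort -nr`, except that
    lines with fewer than `column` fields are skipped instead of being
    counted as empty strings.
    """
    counts: Counter = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= column:
            counts[fields[column - 1]] += 1
    return counts.most_common()
```

Both functions accept any iterable of lines, so they work on an open file such as `open("cluster_verify_5m.txt")` or on `sys.stdin` when fed from `zstdcat -T0 ... | python script.py`.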