-rw-r--r-- notes/2022_01_10_refcat_update.md 37
1 files changed, 37 insertions, 0 deletions
diff --git a/notes/2022_01_10_refcat_update.md b/notes/2022_01_10_refcat_update.md
index f5c6bb5..a7de46c 100644
--- a/notes/2022_01_10_refcat_update.md
+++ b/notes/2022_01_10_refcat_update.md
@@ -15,3 +15,40 @@ $ grep -c DOI minimal.json
```
Convert format to existing minimal format, for "BrefZipWikiDOI" task.
+
+First result, bref combined.
+
+Previous version:
+
+```
+$ time zstdcat -T0 date-2021-07-28.json.zst |pv -l|wc -lc
+2.08G 0:45:56 [ 753k/s] [ <=> ]
+2077597833 981406745860
+```
+
+Current:
+
+```
+$ zstdcat -T0 date-2022-01-03.json.zst | pv -l | wc -lc
+2.28G 0:37:55 [1.00M/s] [ <=> ]
+2282864413 1077436490574
+```
+
+* 2,282,864,413 edges (matched and unmatched)
+* 1,077,436,490,574 bytes uncompressed (about 1T)
+
+About 11G more compressed, about 96G more uncompressed data (981,406,745,860
+to 1,077,436,490,574 bytes); estimated from a 100M sample: 1.483B matches
+(match ratio 0.65)
+
+Previous (v1):
+
+* 1,323,423,672 - estimate based on filesize: 1.439B matches.
+
+Current (v2):
+
+* 1,481,079,426 (76,235,927 strong, 1,404,843,499 exact, still about 5% fuzzy)
+
+Diff:
+
+* about 12% increase in number of matched edges (1,323,423,672 to 1,481,079,426); about 10% more edges overall
+* latest (v12) OCI: 1,235,170,583 (so refcat about 20% larger with 1,481,079,426)
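
The percentages in the "Diff" list can be re-derived from the counts quoted in the hunk above. The following is a minimal sketch, not part of the committed notes; the variable names and the `pct_increase` helper are made up for illustration, and only the numbers are taken from the notes.

```python
# Counts copied from the notes in the hunk above; all names are illustrative.
edges_v1_total = 2_077_597_833   # lines in date-2021-07-28.json.zst
edges_v2_total = 2_282_864_413   # lines in date-2022-01-03.json.zst
bytes_v1 = 981_406_745_860       # uncompressed bytes, previous version
bytes_v2 = 1_077_436_490_574     # uncompressed bytes, current version
matches_v1 = 1_323_423_672       # matched edges, v1
matches_v2 = 1_481_079_426       # matched edges, v2
strong_v2 = 76_235_927           # v2 matches with status "strong"
exact_v2 = 1_404_843_499         # v2 matches with status "exact"
oci_v12 = 1_235_170_583          # OCI v12 count quoted in the notes
sample_ratio = 0.65              # match ratio from the 100M sample


def pct_increase(old: int, new: int) -> float:
    """Relative growth from old to new, in percent."""
    return 100.0 * (new - old) / old


# The strong and exact counts should add up to the v2 match total.
assert strong_v2 + exact_v2 == matches_v2

print(f"matched edges, v1 to v2: {pct_increase(matches_v1, matches_v2):.1f}%")
print(f"all edges, v1 to v2:     {pct_increase(edges_v1_total, edges_v2_total):.1f}%")
print(f"refcat v2 vs OCI v12:    {pct_increase(oci_v12, matches_v2):.1f}%")
print(f"strong share of matches: {100.0 * strong_v2 / matches_v2:.1f}%")
print(f"extra uncompressed data: {(bytes_v2 - bytes_v1) / 1e9:.1f} GB")
print(f"sample-based estimate:   {sample_ratio * edges_v2_total / 1e9:.3f}B matches")
```

Keeping the raw counts next to the derived percentages makes it easy to rerun the comparison when the next refcat or OCI release lands.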