From 84661d967f889fa4b38e4172a1341b9b64f17b83 Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Sun, 16 Jan 2022 20:31:48 +0100
Subject: update notes

---
 notes/2022_01_10_refcat_update.md | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/notes/2022_01_10_refcat_update.md b/notes/2022_01_10_refcat_update.md
index f5c6bb5..a7de46c 100644
--- a/notes/2022_01_10_refcat_update.md
+++ b/notes/2022_01_10_refcat_update.md
@@ -15,3 +15,40 @@ $ grep -c DOI minimal.json
 ```
 
 Convert format to existing minimal format, for "BrefZipWikiDOI" task.
+
+First result, bref combined.
+
+Previous version:
+
+```
+$ time zstdcat -T0 date-2021-07-28.json.zst |pv -l|wc -lc
+2.08G 0:45:56 [ 753k/s] [ <=> ]
+2077597833 981406745860
+```
+
+Current:
+
+```
+$ zstdcat -T0 date-2022-01-03.json.zst | pv -l | wc -lc
+2.28G 0:37:55 [1.00M/s] [ <=> ]
+2282864413 1077436490574
+```
+
+* 2,282,864,413 edges (matched and unmatched)
+* 1,077,436,490,574 / 1T
+
+About 11G more compressed, about 80G more data; estimated (from 100M sample)
+1.483B matches (ratio, 0.65)
+
+Previous (v1):
+
+* 1,323,423,672 - estimate based on filesize: 1.439B matches.
+
+Current (v2):
+
+* 1,481,079,426 (76,235,927 strong, 1,404,843,499 exact, still about 5% fuzzy)
+
+Diff:
+
+* about 12% increase in number of edges
+* latest (v12) OCI: 1,235,170,583 (so refcat about 19% larger with 1,481,079,426)
-- 
cgit v1.2.3
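
The ratios quoted in the notes can be rechecked from the raw counts. The following is a minimal sketch in Python, not part of the refcat tooling; all variable names and the `pct_increase` helper are made up here for illustration. It assumes that "strong" matches are the ones the notes call "fuzzy". With these inputs the v1 to v2 growth comes out at roughly 12%, the fuzzy share at about 5%, and the match ratio at about 0.65, while the OCI comparison lands just under 20% rather than 19%.

```python
# Figures copied from the notes in the patch; names are illustrative only.
edges_total_v2 = 2_282_864_413  # line count of date-2022-01-03.json.zst
matches_v1 = 1_323_423_672      # previous (v1) matched edges
strong_v2 = 76_235_927          # assumed to be the "fuzzy" matches
exact_v2 = 1_404_843_499
matches_v2 = 1_481_079_426      # current (v2) matched edges
oci_v12 = 1_235_170_583         # latest OCI (v12) count quoted in the notes


def pct_increase(old: int, new: int) -> float:
    """Relative growth from old to new, in percent."""
    return (new - old) / old * 100


# The strong and exact counts should add up to the reported v2 total.
assert strong_v2 + exact_v2 == matches_v2

print(f"matches v1 -> v2:     {pct_increase(matches_v1, matches_v2):.1f}%")
print(f"refcat v2 vs OCI v12: {pct_increase(oci_v12, matches_v2):.1f}%")
print(f"strong share of v2:   {strong_v2 / matches_v2 * 100:.1f}%")
print(f"match ratio:          {matches_v2 / edges_total_v2:.2f}")
```

Keeping the counts as integers and computing the percentages only at print time avoids carrying rounded figures from one estimate into the next.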