# Refcat update

* new refs export, about 10% more (2.7B)
* new fatcat export

New wikipedia extraction:

```
martin@ia601101:/magna/data/wikipedia_citations_2020-07-14

$ LC_ALL=C grep ID_list minimal_dataset.json | grep -c DOI
1442189

$ jq -rc '.refs[] | select(.ID_list != null) | {"URL": .URL, "Title": .title, "ID_list": .ID_list}' enwiki-20211201-pages-articles.citations.json | pv -l > minimal.json

$ grep -c DOI minimal.json
1932578
```

Next: convert this into the existing minimal format for the "BrefZipWikiDOI" task.

First result, bref combined.

Previous version:

```
$ time zstdcat -T0 date-2021-07-28.json.zst | pv -l | wc -lc
2.08G 0:45:56 [ 753k/s] [ <=> ]
2077597833 981406745860
```

Current:

```
$ zstdcat -T0 date-2022-01-03.json.zst | pv -l | wc -lc
2.28G 0:37:55 [1.00M/s] [ <=> ]
2282864413 1077436490574
```

* 2,282,864,413 edges (matched and unmatched)
* 1,077,436,490,574 bytes, about 1T

That is about 11G more compressed and about 96G more uncompressed data; from a 100M sample (match ratio 0.65), an estimated 1.48B matches.

Previous (v1):

* 1,323,423,672 matches (an earlier filesize-based estimate was 1.439B)

Current (v2):

* 1,481,079,426 matches (76,235,927 strong, 1,404,843,499 exact; still about 5% fuzzy)

Diff:

* about 12% increase in the number of matched edges
* latest (v12) OCI: 1,235,170,583 edges, so refcat (1,481,079,426) is about 19% larger
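The `jq` filter above keeps only records with a non-null `ID_list`; the conversion to the minimal format can be sketched in Python. This is a hedged sketch, not the actual task code: it assumes `ID_list` is a dict keyed by identifier type (e.g. `"DOI"`, `"PMID"`), which is an assumption about the extraction output, and `extract_dois` is a hypothetical helper name.

```python
import json


def extract_dois(lines):
    """Yield (URL, title, DOI) triples from jq-reduced Wikipedia
    citation records; records without a DOI in ID_list are skipped.
    The ID_list-as-dict shape is assumed, not a documented schema."""
    for line in lines:
        doc = json.loads(line)
        id_list = doc.get("ID_list") or {}
        doi = id_list.get("DOI")
        if doi:
            yield doc.get("URL"), doc.get("Title"), doi


# Two illustrative records (made up for the example).
records = [
    '{"URL": "https://en.wikipedia.org/wiki/Example", "Title": "Example", "ID_list": {"DOI": "10.1000/xyz123"}}',
    '{"URL": "https://en.wikipedia.org/wiki/Other", "Title": "Other", "ID_list": {"PMID": "123456"}}',
]
print(list(extract_dois(records)))
```

Only the first record survives, since the second carries a PMID but no DOI.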
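The growth figures quoted above can be sanity-checked directly from the counts (all numbers are from this note; only the arithmetic is new):

```python
# Counts quoted in the note above.
edges_v1, edges_v2 = 2_077_597_833, 2_282_864_413          # wc -l, v1 vs v2
bytes_v1, bytes_v2 = 981_406_745_860, 1_077_436_490_574    # wc -c, v1 vs v2
matches_v1, matches_v2 = 1_323_423_672, 1_481_079_426      # matched edges
oci = 1_235_170_583                                        # OCI v12 edges

# Sample-based estimate: 0.65 of all edges matched.
est = int(edges_v2 * 0.65)

print(f"uncompressed growth: {(bytes_v2 - bytes_v1) / 1e9:.0f} GB")
print(f"edges growth:        {edges_v2 / edges_v1 - 1:.1%}")
print(f"matches growth:      {matches_v2 / matches_v1 - 1:.1%}")
print(f"refcat vs OCI:       {matches_v2 / oci - 1:.1%}")
print(f"estimated matches:   {est:,}")
```

This reproduces the figures in the note: roughly 10% more edges, about 12% more matches, about 19-20% more matched edges than OCI, and an estimate near 1.48B.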