# Refcat update

* new refs export, about 10% more (2.7B)
* new fatcat export

## TL;DR

* v1: 1,323,423,672
* v2: 1,481,079,426 (+11.9%, 76,235,927 strong, 1,404,843,499 exact, still about 5% fuzzy)

```
$ du -hs 2022-01-03/
1.6T    2022-01-03/

$ tree -sh 2022-01-03/Bref*
2022-01-03/Bref
└── [ 81G]  date-2022-01-03.json.zst
2022-01-03/BrefCombined
├── [135G]  date-2022-01-03.json.zst
├── [4.5M]  matches_sorted.tsv.zst
├── [1.1G]  matches.tsv.zst
└── [2.4K]  uniqc.txt
2022-01-03/BrefOpenLibraryZipISBN
└── [ 30M]  date-2022-01-03.json.zst
2022-01-03/BrefSortedByWorkID
└── [ 72G]  date-2022-01-03.json.zst
2022-01-03/BrefZipArxiv
└── [266M]  date-2022-01-03.json.zst
2022-01-03/BrefZipDOI
└── [ 70G]  date-2022-01-03.json.zst
2022-01-03/BrefZipFuzzy
└── [6.1G]  date-2022-01-03-mapper-ts.json.zst
2022-01-03/BrefZipOpenLibrary
└── [ 43M]  date-2022-01-03.json.zst
2022-01-03/BrefZipPMCID
└── [8.7M]  date-2022-01-03.json.zst
2022-01-03/BrefZipPMID
└── [4.9G]  date-2022-01-03.json.zst
2022-01-03/BrefZipWikiDOI
└── [ 75M]  date-2022-01-03.json.zst

0 directories, 14 files
```

Match distribution:

```
1060312017 crossref exact doi
366035675 crossref unmatched unknown
353068331 grobid unmatched unknown
180656009 fatcat-datacite exact doi
65244213 fatcat-pubmed exact pmid
59858436 crossref-grobid unmatched unknown
52388120 fuzzy strong jaccardauthors
48732594 grobid exact doi
32262589 fatcat-pubmed exact doi
14236248 fatcat unmatched unknown
12671780 fuzzy strong slugtitleauthormatch
9711647 fuzzy strong tokenizedauthors
8277050 fatcat-crossref exact doi
4126772 fatcat-crossref unmatched unknown
3998894 grobid exact arxiv
3962173 fatcat-pubmed unmatched unknown
2561175 fuzzy exact titleauthormatch
1621193 grobid exact pmid
563569 fuzzy strong versioneddoi
519064 grobid exact isbn
497352 fatcat-datacite unmatched unknown
366615 crossref strong jaccardauthors
260217 crossref-grobid exact arxiv
166014 crossref-grobid exact doi
92927 crossref exact isbn
76785 fuzzy strong dataciterelatedid
75798 fatcat-pubmed strong jaccardauthors
71430 grobid strong jaccardauthors
65643 fatcat-crossref strong jaccardauthors
63527 fuzzy strong pmiddoipair
53837 crossref exact arxiv
47504 fuzzy strong arxivversion
43016 fuzzy strong customieeearxiv
40166 grobid exact pmcid
22094 crossref-grobid strong jaccardauthors
21836 crossref strong tokenizedauthors
13587 grobid strong slugtitleauthormatch
9589 crossref strong slugtitleauthormatch
8936 crossref exact titleauthormatch
8750 fatcat-pubmed exact arxiv
6990 crossref-grobid exact pmid
6455 fatcat-crossref strong tokenizedauthors
4670 grobid exact titleauthormatch
4573 crossref-grobid strong slugtitleauthormatch
4363 grobid strong tokenizedauthors
3581 fatcat-pubmed exact isbn
3364 fatcat-crossref exact arxiv
3344 fatcat-crossref exact isbn
2654 fatcat-pubmed strong tokenizedauthors
2129 fuzzy exact workid
1591 fatcat-pubmed strong slugtitleauthormatch
1579 crossref-grobid exact titleauthormatch
1174 fuzzy strong figshareversion
1149 crossref-grobid strong tokenizedauthors
1029 fatcat-pubmed exact titleauthormatch
721 crossref-grobid exact pmcid
625 fatcat-crossref strong slugtitleauthormatch
448 grobid strong titleartifact
446 fatcat-crossref exact titleauthormatch
181 fuzzy strong titleartifact
84 crossref-grobid strong titleartifact
80 crossref strong titleartifact
5 fuzzy strong custombsiundated
3 fuzzy strong custombsisubdoc
2 fatcat-pubmed strong titleartifact
1 fatcat exact doi
```

## Misc

New Wikipedia extraction:

```
martin@ia601101:/magna/data/wikipedia_citations_2020-07-14 $ LC_ALL=C grep ID_list minimal_dataset.json | grep -c DOI
1442189

$ jq -rc '.refs[] | select(.ID_list != null) | {"URL": .URL, "Title": .title, "ID_list": .ID_list}' enwiki-20211201-pages-articles.citations.json | pv -l > minimal.json
$ grep -c DOI minimal.json
1932578
```

Convert format to existing minimal format, for "BrefZipWikiDOI" task.

First result, bref combined. Previous version:

```
$ time zstdcat -T0 date-2021-07-28.json.zst |pv -l|wc -lc
2.08G 0:45:56 [ 753k/s] [ <=> ]
2077597833 981406745860
```

Current:

```
$ zstdcat -T0 date-2022-01-03.json.zst | pv -l | wc -lc
2.28G 0:37:55 [1.00M/s] [ <=> ]
2282864413 1077436490574
```

* 2,282,864,413 edges (matched and unmatched)
* 1,077,436,490,574 bytes (about 1T)

About 11G more compressed, about 80G more data; estimated (from 100M sample) 1.483B matches (ratio, 0.65)

Previous (v1):

* 1,323,423,672 - estimate based on filesize: 1.439B matches.

Current (v2):

* 1,481,079,426 (76,235,927 strong, 1,404,843,499 exact, still about 5% fuzzy)

Diff:

* about 12% increase in number of edges
* latest (v12) OCI: 1,235,170,583 (so refcat about 20% larger with 1,481,079,426)
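
The per-status totals and the percentage differences quoted in these notes can be re-derived from the match distribution. The following is a minimal sketch, not part of the refcat pipeline: it assumes each distribution line has the form `count source status reason` (these column names are assumed labels, not taken from the pipeline), and the helpers `tally_by_status` and `pct_change` are hypothetical names introduced only for illustration.

```python
from collections import Counter


def tally_by_status(lines):
    """Sum counts per match status from `uniq -c` style lines.

    Assumed line layout: '<count> <source> <status> <reason>'.
    """
    totals = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        count, _source, status, _reason = fields
        totals[status] += int(count)
    return totals


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


# Three lines copied from the match distribution, as a small sample.
sample = """
1060312017 crossref exact doi
366035675 crossref unmatched unknown
52388120 fuzzy strong jaccardauthors
""".splitlines()

totals = tally_by_status(sample)
for status, count in totals.most_common():
    print(f"{count:>12,} {status}")

# Figures quoted in these notes: v1, v2 and OCI v12 edge counts.
V1, V2, OCI_V12 = 1_323_423_672, 1_481_079_426, 1_235_170_583
print(f"v1 -> v2: {pct_change(V1, V2):+.1f}%")
print(f"refcat v2 vs OCI v12: {pct_change(OCI_V12, V2):+.1f}%")
```

To run it over the full distribution, the lines of a decompressed copy of the distribution (for example the contents of `uniqc.txt`, assuming that file holds it) could be passed to `tally_by_status` in place of `sample`.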