From 5f2b492cdef6710abb916eeac545fd91ff600c06 Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Tue, 18 Jan 2022 17:02:53 +0100
Subject: refcat-v2: notes

---
 notes/2022_01_10_refcat_update.md | 112 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 112 insertions(+)

(limited to 'notes')

diff --git a/notes/2022_01_10_refcat_update.md b/notes/2022_01_10_refcat_update.md
index a7de46c..41b2024 100644
--- a/notes/2022_01_10_refcat_update.md
+++ b/notes/2022_01_10_refcat_update.md
@@ -3,6 +3,118 @@
 * new refs export, about 10% more (2.7B)
 * new fatcat export
 
+## TL;DR
+
+* v1: 1,323,423,672
+* v2: 1,481,079,426 (+11.9%, 76,235,927 strong, 1,404,843,499 exact, still about 5% fuzzy)
+
+```
+$ du -hs 2022-01-03/
+1.6T 2022-01-03/
+
+$ tree -sh 2022-01-03/Bref*
+2022-01-03/Bref
+└── [ 81G] date-2022-01-03.json.zst
+2022-01-03/BrefCombined
+├── [135G] date-2022-01-03.json.zst
+├── [4.5M] matches_sorted.tsv.zst
+├── [1.1G] matches.tsv.zst
+└── [2.4K] uniqc.txt
+2022-01-03/BrefOpenLibraryZipISBN
+└── [ 30M] date-2022-01-03.json.zst
+2022-01-03/BrefSortedByWorkID
+└── [ 72G] date-2022-01-03.json.zst
+2022-01-03/BrefZipArxiv
+└── [266M] date-2022-01-03.json.zst
+2022-01-03/BrefZipDOI
+└── [ 70G] date-2022-01-03.json.zst
+2022-01-03/BrefZipFuzzy
+└── [6.1G] date-2022-01-03-mapper-ts.json.zst
+2022-01-03/BrefZipOpenLibrary
+└── [ 43M] date-2022-01-03.json.zst
+2022-01-03/BrefZipPMCID
+└── [8.7M] date-2022-01-03.json.zst
+2022-01-03/BrefZipPMID
+└── [4.9G] date-2022-01-03.json.zst
+2022-01-03/BrefZipWikiDOI
+└── [ 75M] date-2022-01-03.json.zst
+
+0 directories, 14 files
+```
+
+Match distribution:
+
+```
+1060312017 crossref exact doi
+366035675 crossref unmatched unknown
+353068331 grobid unmatched unknown
+180656009 fatcat-datacite exact doi
+65244213 fatcat-pubmed exact pmid
+59858436 crossref-grobid unmatched unknown
+52388120 fuzzy strong jaccardauthors
+48732594 grobid exact doi
+32262589 fatcat-pubmed exact doi
+14236248 fatcat unmatched unknown
+12671780 fuzzy strong slugtitleauthormatch
+9711647 fuzzy strong tokenizedauthors
+8277050 fatcat-crossref exact doi
+4126772 fatcat-crossref unmatched unknown
+3998894 grobid exact arxiv
+3962173 fatcat-pubmed unmatched unknown
+2561175 fuzzy exact titleauthormatch
+1621193 grobid exact pmid
+563569 fuzzy strong versioneddoi
+519064 grobid exact isbn
+497352 fatcat-datacite unmatched unknown
+366615 crossref strong jaccardauthors
+260217 crossref-grobid exact arxiv
+166014 crossref-grobid exact doi
+92927 crossref exact isbn
+76785 fuzzy strong dataciterelatedid
+75798 fatcat-pubmed strong jaccardauthors
+71430 grobid strong jaccardauthors
+65643 fatcat-crossref strong jaccardauthors
+63527 fuzzy strong pmiddoipair
+53837 crossref exact arxiv
+47504 fuzzy strong arxivversion
+43016 fuzzy strong customieeearxiv
+40166 grobid exact pmcid
+22094 crossref-grobid strong jaccardauthors
+21836 crossref strong tokenizedauthors
+13587 grobid strong slugtitleauthormatch
+9589 crossref strong slugtitleauthormatch
+8936 crossref exact titleauthormatch
+8750 fatcat-pubmed exact arxiv
+6990 crossref-grobid exact pmid
+6455 fatcat-crossref strong tokenizedauthors
+4670 grobid exact titleauthormatch
+4573 crossref-grobid strong slugtitleauthormatch
+4363 grobid strong tokenizedauthors
+3581 fatcat-pubmed exact isbn
+3364 fatcat-crossref exact arxiv
+3344 fatcat-crossref exact isbn
+2654 fatcat-pubmed strong tokenizedauthors
+2129 fuzzy exact workid
+1591 fatcat-pubmed strong slugtitleauthormatch
+1579 crossref-grobid exact titleauthormatch
+1174 fuzzy strong figshareversion
+1149 crossref-grobid strong tokenizedauthors
+1029 fatcat-pubmed exact titleauthormatch
+721 crossref-grobid exact pmcid
+625 fatcat-crossref strong slugtitleauthormatch
+448 grobid strong titleartifact
+446 fatcat-crossref exact titleauthormatch
+181 fuzzy strong titleartifact
+84 crossref-grobid strong titleartifact
+80 crossref strong titleartifact
+5 fuzzy strong custombsiundated
+3 fuzzy strong custombsisubdoc
+2 fatcat-pubmed strong titleartifact
+1 fatcat exact doi
+```
+
+## Misc
+
 New wikipedia extraction:
 
 ```
-- 
cgit v1.2.3