author | Martin Czygan <martin.czygan@gmail.com> | 2022-01-18 17:02:53 +0100 |
---|---|---|
committer | Martin Czygan <martin.czygan@gmail.com> | 2022-01-18 17:02:53 +0100 |
commit | 5f2b492cdef6710abb916eeac545fd91ff600c06 (patch) | |
tree | f7c4e3bffc1d9a6f985d685eefbae5538b3ad92b | |
parent | 84661d967f889fa4b38e4172a1341b9b64f17b83 (diff) | |
-rw-r--r-- | notes/2022_01_10_refcat_update.md | 112 |
1 files changed, 112 insertions, 0 deletions
diff --git a/notes/2022_01_10_refcat_update.md b/notes/2022_01_10_refcat_update.md
index a7de46c..41b2024 100644
--- a/notes/2022_01_10_refcat_update.md
+++ b/notes/2022_01_10_refcat_update.md
@@ -3,6 +3,118 @@
 * new refs export, about 10% more (2.7B)
 * new fatcat export
 
+## TL;DR
+
+* v1: 1,323,423,672
+* v2: 1,481,079,426 (+11.9%, 76,235,927 strong, 1,404,843,499 exact, still about 5% fuzzy)
+
+```
+$ du -hs 2022-01-03/
+1.6T 2022-01-03/
+
+$ tree -sh 2022-01-03/Bref*
+2022-01-03/Bref
+└── [ 81G] date-2022-01-03.json.zst
+2022-01-03/BrefCombined
+├── [135G] date-2022-01-03.json.zst
+├── [4.5M] matches_sorted.tsv.zst
+├── [1.1G] matches.tsv.zst
+└── [2.4K] uniqc.txt
+2022-01-03/BrefOpenLibraryZipISBN
+└── [ 30M] date-2022-01-03.json.zst
+2022-01-03/BrefSortedByWorkID
+└── [ 72G] date-2022-01-03.json.zst
+2022-01-03/BrefZipArxiv
+└── [266M] date-2022-01-03.json.zst
+2022-01-03/BrefZipDOI
+└── [ 70G] date-2022-01-03.json.zst
+2022-01-03/BrefZipFuzzy
+└── [6.1G] date-2022-01-03-mapper-ts.json.zst
+2022-01-03/BrefZipOpenLibrary
+└── [ 43M] date-2022-01-03.json.zst
+2022-01-03/BrefZipPMCID
+└── [8.7M] date-2022-01-03.json.zst
+2022-01-03/BrefZipPMID
+└── [4.9G] date-2022-01-03.json.zst
+2022-01-03/BrefZipWikiDOI
+└── [ 75M] date-2022-01-03.json.zst
+
+0 directories, 14 files
+```
+
+Match distribution:
+
+```
+1060312017 crossref exact doi
+366035675 crossref unmatched unknown
+353068331 grobid unmatched unknown
+180656009 fatcat-datacite exact doi
+65244213 fatcat-pubmed exact pmid
+59858436 crossref-grobid unmatched unknown
+52388120 fuzzy strong jaccardauthors
+48732594 grobid exact doi
+32262589 fatcat-pubmed exact doi
+14236248 fatcat unmatched unknown
+12671780 fuzzy strong slugtitleauthormatch
+9711647 fuzzy strong tokenizedauthors
+8277050 fatcat-crossref exact doi
+4126772 fatcat-crossref unmatched unknown
+3998894 grobid exact arxiv
+3962173 fatcat-pubmed unmatched unknown
+2561175 fuzzy exact titleauthormatch
+1621193 grobid exact pmid
+563569 fuzzy strong versioneddoi
+519064 grobid exact isbn
+497352 fatcat-datacite unmatched unknown
+366615 crossref strong jaccardauthors
+260217 crossref-grobid exact arxiv
+166014 crossref-grobid exact doi
+92927 crossref exact isbn
+76785 fuzzy strong dataciterelatedid
+75798 fatcat-pubmed strong jaccardauthors
+71430 grobid strong jaccardauthors
+65643 fatcat-crossref strong jaccardauthors
+63527 fuzzy strong pmiddoipair
+53837 crossref exact arxiv
+47504 fuzzy strong arxivversion
+43016 fuzzy strong customieeearxiv
+40166 grobid exact pmcid
+22094 crossref-grobid strong jaccardauthors
+21836 crossref strong tokenizedauthors
+13587 grobid strong slugtitleauthormatch
+9589 crossref strong slugtitleauthormatch
+8936 crossref exact titleauthormatch
+8750 fatcat-pubmed exact arxiv
+6990 crossref-grobid exact pmid
+6455 fatcat-crossref strong tokenizedauthors
+4670 grobid exact titleauthormatch
+4573 crossref-grobid strong slugtitleauthormatch
+4363 grobid strong tokenizedauthors
+3581 fatcat-pubmed exact isbn
+3364 fatcat-crossref exact arxiv
+3344 fatcat-crossref exact isbn
+2654 fatcat-pubmed strong tokenizedauthors
+2129 fuzzy exact workid
+1591 fatcat-pubmed strong slugtitleauthormatch
+1579 crossref-grobid exact titleauthormatch
+1174 fuzzy strong figshareversion
+1149 crossref-grobid strong tokenizedauthors
+1029 fatcat-pubmed exact titleauthormatch
+721 crossref-grobid exact pmcid
+625 fatcat-crossref strong slugtitleauthormatch
+448 grobid strong titleartifact
+446 fatcat-crossref exact titleauthormatch
+181 fuzzy strong titleartifact
+84 crossref-grobid strong titleartifact
+80 crossref strong titleartifact
+5 fuzzy strong custombsiundated
+3 fuzzy strong custombsisubdoc
+2 fatcat-pubmed strong titleartifact
+1 fatcat exact doi
+```
+
+## Misc
+
 New wikipedia extraction:
 
 ```
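
The TL;DR figures in the added notes can be cross-checked against the match distribution. The following is a minimal sketch, not part of the refcat code base: it assumes each distribution row has the shape `count source status reason` (as in the listing above), and the function name `totals_by_status` and the four-row `sample` are illustrative only.

```python
from collections import Counter


def totals_by_status(lines):
    """Sum counts per match status from 'count source status reason' rows."""
    totals = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed rows
        count, _source, status, _reason = fields
        totals[status] += int(count)
    return totals


# A few rows copied from the match distribution (not the full table).
sample = """
1060312017 crossref exact doi
366035675 crossref unmatched unknown
52388120 fuzzy strong jaccardauthors
1 fatcat exact doi
""".splitlines()

sample_totals = totals_by_status(sample)

# TL;DR figures as stated in the notes.
v1, v2 = 1_323_423_672, 1_481_079_426
strong, exact = 76_235_927, 1_404_843_499

# Relative growth from v1 to v2, and the share of "strong" matches in v2.
growth_pct = round(100 * (v2 - v1) / v1, 1)
strong_share_pct = round(100 * strong / v2, 1)

print(sample_totals)
print(growth_pct, strong_share_pct, strong + exact == v2)
```

Feeding the complete distribution (for example the contents of `uniqc.txt`, if it holds the same rows) into `totals_by_status` would give the per-status totals; the `unmatched` rows are not part of the v2 count, which is the sum of the `strong` and `exact` figures.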