| author | Martin Czygan <martin.czygan@gmail.com> | 2022-01-18 17:02:53 +0100 | 
|---|---|---|
| committer | Martin Czygan <martin.czygan@gmail.com> | 2022-01-18 17:02:53 +0100 | 
| commit | 5f2b492cdef6710abb916eeac545fd91ff600c06 (patch) | |
| tree | f7c4e3bffc1d9a6f985d685eefbae5538b3ad92b | |
| parent | 84661d967f889fa4b38e4172a1341b9b64f17b83 (diff) | |
| -rw-r--r-- | notes/2022_01_10_refcat_update.md | 112 | 
1 files changed, 112 insertions, 0 deletions
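The headline figures in the TL;DR added by this commit can be cross-checked against the match distribution with a few lines of Python. This is a minimal sketch, assuming the distribution is whitespace-separated `count source status reason` rows; the helper name `totals_by_status` is hypothetical and not part of refcat, and only a handful of rows are copied in as a sample.

```python
from collections import Counter


def totals_by_status(lines):
    """Sum the leading count of each 'count source status reason' row per status."""
    totals = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed rows
        count, _source, status, _reason = fields
        totals[status] += int(count)
    return totals


# A few rows copied from the match distribution, not the full table.
sample = """
1060312017  crossref         exact      doi
366035675   crossref         unmatched  unknown
52388120    fuzzy            strong     jaccardauthors
12671780    fuzzy            strong     slugtitleauthormatch
"""
sample_totals = totals_by_status(sample.splitlines())

# Headline figures copied from the TL;DR.
v1, v2 = 1_323_423_672, 1_481_079_426
strong, exact = 76_235_927, 1_404_843_499

growth_pct = 100 * (v2 - v1) / v1   # relative growth from v1 to v2
strong_pct = 100 * strong / v2      # share of non-exact ("strong") matches in v2

print(dict(sample_totals))
print(f"growth: {growth_pct:.1f}%, strong share: {strong_pct:.1f}%")
```

Feeding the full distribution into the same helper, instead of the sample, gives the per-status totals that the TL;DR summarizes.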
diff --git a/notes/2022_01_10_refcat_update.md b/notes/2022_01_10_refcat_update.md
index a7de46c..41b2024 100644
--- a/notes/2022_01_10_refcat_update.md
+++ b/notes/2022_01_10_refcat_update.md
@@ -3,6 +3,118 @@
 * new refs export, about 10% more (2.7B)
 * new fatcat export
+## TL;DR
+
+* v1: 1,323,423,672
+* v2: 1,481,079,426 (+11.9%, 76,235,927 strong, 1,404,843,499 exact, still about 5% fuzzy)
+
+```
+$ du -hs 2022-01-03/
+1.6T    2022-01-03/
+
+$ tree -sh 2022-01-03/Bref*
+2022-01-03/Bref
+└── [ 81G]  date-2022-01-03.json.zst
+2022-01-03/BrefCombined
+├── [135G]  date-2022-01-03.json.zst
+├── [4.5M]  matches_sorted.tsv.zst
+├── [1.1G]  matches.tsv.zst
+└── [2.4K]  uniqc.txt
+2022-01-03/BrefOpenLibraryZipISBN
+└── [ 30M]  date-2022-01-03.json.zst
+2022-01-03/BrefSortedByWorkID
+└── [ 72G]  date-2022-01-03.json.zst
+2022-01-03/BrefZipArxiv
+└── [266M]  date-2022-01-03.json.zst
+2022-01-03/BrefZipDOI
+└── [ 70G]  date-2022-01-03.json.zst
+2022-01-03/BrefZipFuzzy
+└── [6.1G]  date-2022-01-03-mapper-ts.json.zst
+2022-01-03/BrefZipOpenLibrary
+└── [ 43M]  date-2022-01-03.json.zst
+2022-01-03/BrefZipPMCID
+└── [8.7M]  date-2022-01-03.json.zst
+2022-01-03/BrefZipPMID
+└── [4.9G]  date-2022-01-03.json.zst
+2022-01-03/BrefZipWikiDOI
+└── [ 75M]  date-2022-01-03.json.zst
+
+0 directories, 14 files
+```
+
+Match distribution:
+
+```
+1060312017  crossref         exact      doi
+366035675   crossref         unmatched  unknown
+353068331   grobid           unmatched  unknown
+180656009   fatcat-datacite  exact      doi
+65244213    fatcat-pubmed    exact      pmid
+59858436    crossref-grobid  unmatched  unknown
+52388120    fuzzy            strong     jaccardauthors
+48732594    grobid           exact      doi
+32262589    fatcat-pubmed    exact      doi
+14236248    fatcat           unmatched  unknown
+12671780    fuzzy            strong     slugtitleauthormatch
+9711647     fuzzy            strong     tokenizedauthors
+8277050     fatcat-crossref  exact      doi
+4126772     fatcat-crossref  unmatched  unknown
+3998894     grobid           exact      arxiv
+3962173     fatcat-pubmed    unmatched  unknown
+2561175     fuzzy            exact      titleauthormatch
+1621193     grobid           exact      pmid
+563569      fuzzy            strong     versioneddoi
+519064      grobid           exact      isbn
+497352      fatcat-datacite  unmatched  unknown
+366615      crossref         strong     jaccardauthors
+260217      crossref-grobid  exact      arxiv
+166014      crossref-grobid  exact      doi
+92927       crossref         exact      isbn
+76785       fuzzy            strong     dataciterelatedid
+75798       fatcat-pubmed    strong     jaccardauthors
+71430       grobid           strong     jaccardauthors
+65643       fatcat-crossref  strong     jaccardauthors
+63527       fuzzy            strong     pmiddoipair
+53837       crossref         exact      arxiv
+47504       fuzzy            strong     arxivversion
+43016       fuzzy            strong     customieeearxiv
+40166       grobid           exact      pmcid
+22094       crossref-grobid  strong     jaccardauthors
+21836       crossref         strong     tokenizedauthors
+13587       grobid           strong     slugtitleauthormatch
+9589        crossref         strong     slugtitleauthormatch
+8936        crossref         exact      titleauthormatch
+8750        fatcat-pubmed    exact      arxiv
+6990        crossref-grobid  exact      pmid
+6455        fatcat-crossref  strong     tokenizedauthors
+4670        grobid           exact      titleauthormatch
+4573        crossref-grobid  strong     slugtitleauthormatch
+4363        grobid           strong     tokenizedauthors
+3581        fatcat-pubmed    exact      isbn
+3364        fatcat-crossref  exact      arxiv
+3344        fatcat-crossref  exact      isbn
+2654        fatcat-pubmed    strong     tokenizedauthors
+2129        fuzzy            exact      workid
+1591        fatcat-pubmed    strong     slugtitleauthormatch
+1579        crossref-grobid  exact      titleauthormatch
+1174        fuzzy            strong     figshareversion
+1149        crossref-grobid  strong     tokenizedauthors
+1029        fatcat-pubmed    exact      titleauthormatch
+721         crossref-grobid  exact      pmcid
+625         fatcat-crossref  strong     slugtitleauthormatch
+448         grobid           strong     titleartifact
+446         fatcat-crossref  exact      titleauthormatch
+181         fuzzy            strong     titleartifact
+84          crossref-grobid  strong     titleartifact
+80          crossref         strong     titleartifact
+5           fuzzy            strong     custombsiundated
+3           fuzzy            strong     custombsisubdoc
+2           fatcat-pubmed    strong     titleartifact
+1           fatcat           exact      doi
+```
+
+## Misc
+
 New wikipedia extraction:
 ```
