author: Martin Czygan <martin.czygan@gmail.com> 2022-01-18 17:02:53 +0100
committer: Martin Czygan <martin.czygan@gmail.com> 2022-01-18 17:02:53 +0100
commit: 5f2b492cdef6710abb916eeac545fd91ff600c06
tree: f7c4e3bffc1d9a6f985d685eefbae5538b3ad92b
parent: 84661d967f889fa4b38e4172a1341b9b64f17b83
refcat-v2: notes (HEAD, master)
-rw-r--r-- notes/2022_01_10_refcat_update.md 112
1 files changed, 112 insertions, 0 deletions
diff --git a/notes/2022_01_10_refcat_update.md b/notes/2022_01_10_refcat_update.md
index a7de46c..41b2024 100644
--- a/notes/2022_01_10_refcat_update.md
+++ b/notes/2022_01_10_refcat_update.md
@@ -3,6 +3,118 @@
* new refs export, about 10% more (2.7B)
* new fatcat export
+## TL;DR
+
+* v1: 1,323,423,672
+* v2: 1,481,079,426 (+11.9%, 76,235,927 strong, 1,404,843,499 exact, still about 5% fuzzy)
+
+```
+$ du -hs 2022-01-03/
+1.6T 2022-01-03/
+
+$ tree -sh 2022-01-03/Bref*
+2022-01-03/Bref
+└── [ 81G] date-2022-01-03.json.zst
+2022-01-03/BrefCombined
+├── [135G] date-2022-01-03.json.zst
+├── [4.5M] matches_sorted.tsv.zst
+├── [1.1G] matches.tsv.zst
+└── [2.4K] uniqc.txt
+2022-01-03/BrefOpenLibraryZipISBN
+└── [ 30M] date-2022-01-03.json.zst
+2022-01-03/BrefSortedByWorkID
+└── [ 72G] date-2022-01-03.json.zst
+2022-01-03/BrefZipArxiv
+└── [266M] date-2022-01-03.json.zst
+2022-01-03/BrefZipDOI
+└── [ 70G] date-2022-01-03.json.zst
+2022-01-03/BrefZipFuzzy
+└── [6.1G] date-2022-01-03-mapper-ts.json.zst
+2022-01-03/BrefZipOpenLibrary
+└── [ 43M] date-2022-01-03.json.zst
+2022-01-03/BrefZipPMCID
+└── [8.7M] date-2022-01-03.json.zst
+2022-01-03/BrefZipPMID
+└── [4.9G] date-2022-01-03.json.zst
+2022-01-03/BrefZipWikiDOI
+└── [ 75M] date-2022-01-03.json.zst
+
+0 directories, 14 files
+```
+
+Match distribution:
+
+```
+1060312017 crossref exact doi
+366035675 crossref unmatched unknown
+353068331 grobid unmatched unknown
+180656009 fatcat-datacite exact doi
+65244213 fatcat-pubmed exact pmid
+59858436 crossref-grobid unmatched unknown
+52388120 fuzzy strong jaccardauthors
+48732594 grobid exact doi
+32262589 fatcat-pubmed exact doi
+14236248 fatcat unmatched unknown
+12671780 fuzzy strong slugtitleauthormatch
+9711647 fuzzy strong tokenizedauthors
+8277050 fatcat-crossref exact doi
+4126772 fatcat-crossref unmatched unknown
+3998894 grobid exact arxiv
+3962173 fatcat-pubmed unmatched unknown
+2561175 fuzzy exact titleauthormatch
+1621193 grobid exact pmid
+563569 fuzzy strong versioneddoi
+519064 grobid exact isbn
+497352 fatcat-datacite unmatched unknown
+366615 crossref strong jaccardauthors
+260217 crossref-grobid exact arxiv
+166014 crossref-grobid exact doi
+92927 crossref exact isbn
+76785 fuzzy strong dataciterelatedid
+75798 fatcat-pubmed strong jaccardauthors
+71430 grobid strong jaccardauthors
+65643 fatcat-crossref strong jaccardauthors
+63527 fuzzy strong pmiddoipair
+53837 crossref exact arxiv
+47504 fuzzy strong arxivversion
+43016 fuzzy strong customieeearxiv
+40166 grobid exact pmcid
+22094 crossref-grobid strong jaccardauthors
+21836 crossref strong tokenizedauthors
+13587 grobid strong slugtitleauthormatch
+9589 crossref strong slugtitleauthormatch
+8936 crossref exact titleauthormatch
+8750 fatcat-pubmed exact arxiv
+6990 crossref-grobid exact pmid
+6455 fatcat-crossref strong tokenizedauthors
+4670 grobid exact titleauthormatch
+4573 crossref-grobid strong slugtitleauthormatch
+4363 grobid strong tokenizedauthors
+3581 fatcat-pubmed exact isbn
+3364 fatcat-crossref exact arxiv
+3344 fatcat-crossref exact isbn
+2654 fatcat-pubmed strong tokenizedauthors
+2129 fuzzy exact workid
+1591 fatcat-pubmed strong slugtitleauthormatch
+1579 crossref-grobid exact titleauthormatch
+1174 fuzzy strong figshareversion
+1149 crossref-grobid strong tokenizedauthors
+1029 fatcat-pubmed exact titleauthormatch
+721 crossref-grobid exact pmcid
+625 fatcat-crossref strong slugtitleauthormatch
+448 grobid strong titleartifact
+446 fatcat-crossref exact titleauthormatch
+181 fuzzy strong titleartifact
+84 crossref-grobid strong titleartifact
+80 crossref strong titleartifact
+5 fuzzy strong custombsiundated
+3 fuzzy strong custombsisubdoc
+2 fatcat-pubmed strong titleartifact
+1 fatcat exact doi
+```
+
+## Misc
+
New wikipedia extraction:
```