# Refcat update
* new refs export, about 10% more (2.7B)
* new fatcat export
## TL;DR
* v1: 1,323,423,672
* v2: 1,481,079,426 (+11.9%, 76,235,927 strong, 1,404,843,499 exact, still about 5% fuzzy)
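The v2 breakdown can be checked with a few lines of arithmetic. This is only a sketch; it assumes that "fuzzy" in the summary refers to the share of matches with status `strong`, which is an interpretation of the notes rather than something stated in them.

```python
# Figures copied from the TL;DR above.
v2_total = 1_481_079_426
strong = 76_235_927
exact = 1_404_843_499

# The two status buckets should add up to the v2 total.
assert strong + exact == v2_total

# Assumption: the "about 5% fuzzy" remark refers to the strong share.
strong_share = strong / v2_total
print(f"strong share: {strong_share:.1%}")
```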
```
$ du -hs 2022-01-03/
1.6T 2022-01-03/
$ tree -sh 2022-01-03/Bref*
2022-01-03/Bref
└── [ 81G] date-2022-01-03.json.zst
2022-01-03/BrefCombined
├── [135G] date-2022-01-03.json.zst
├── [4.5M] matches_sorted.tsv.zst
├── [1.1G] matches.tsv.zst
└── [2.4K] uniqc.txt
2022-01-03/BrefOpenLibraryZipISBN
└── [ 30M] date-2022-01-03.json.zst
2022-01-03/BrefSortedByWorkID
└── [ 72G] date-2022-01-03.json.zst
2022-01-03/BrefZipArxiv
└── [266M] date-2022-01-03.json.zst
2022-01-03/BrefZipDOI
└── [ 70G] date-2022-01-03.json.zst
2022-01-03/BrefZipFuzzy
└── [6.1G] date-2022-01-03-mapper-ts.json.zst
2022-01-03/BrefZipOpenLibrary
└── [ 43M] date-2022-01-03.json.zst
2022-01-03/BrefZipPMCID
└── [8.7M] date-2022-01-03.json.zst
2022-01-03/BrefZipPMID
└── [4.9G] date-2022-01-03.json.zst
2022-01-03/BrefZipWikiDOI
└── [ 75M] date-2022-01-03.json.zst
0 directories, 14 files
```
Match distribution:
```
1060312017 crossref exact doi
366035675 crossref unmatched unknown
353068331 grobid unmatched unknown
180656009 fatcat-datacite exact doi
65244213 fatcat-pubmed exact pmid
59858436 crossref-grobid unmatched unknown
52388120 fuzzy strong jaccardauthors
48732594 grobid exact doi
32262589 fatcat-pubmed exact doi
14236248 fatcat unmatched unknown
12671780 fuzzy strong slugtitleauthormatch
9711647 fuzzy strong tokenizedauthors
8277050 fatcat-crossref exact doi
4126772 fatcat-crossref unmatched unknown
3998894 grobid exact arxiv
3962173 fatcat-pubmed unmatched unknown
2561175 fuzzy exact titleauthormatch
1621193 grobid exact pmid
563569 fuzzy strong versioneddoi
519064 grobid exact isbn
497352 fatcat-datacite unmatched unknown
366615 crossref strong jaccardauthors
260217 crossref-grobid exact arxiv
166014 crossref-grobid exact doi
92927 crossref exact isbn
76785 fuzzy strong dataciterelatedid
75798 fatcat-pubmed strong jaccardauthors
71430 grobid strong jaccardauthors
65643 fatcat-crossref strong jaccardauthors
63527 fuzzy strong pmiddoipair
53837 crossref exact arxiv
47504 fuzzy strong arxivversion
43016 fuzzy strong customieeearxiv
40166 grobid exact pmcid
22094 crossref-grobid strong jaccardauthors
21836 crossref strong tokenizedauthors
13587 grobid strong slugtitleauthormatch
9589 crossref strong slugtitleauthormatch
8936 crossref exact titleauthormatch
8750 fatcat-pubmed exact arxiv
6990 crossref-grobid exact pmid
6455 fatcat-crossref strong tokenizedauthors
4670 grobid exact titleauthormatch
4573 crossref-grobid strong slugtitleauthormatch
4363 grobid strong tokenizedauthors
3581 fatcat-pubmed exact isbn
3364 fatcat-crossref exact arxiv
3344 fatcat-crossref exact isbn
2654 fatcat-pubmed strong tokenizedauthors
2129 fuzzy exact workid
1591 fatcat-pubmed strong slugtitleauthormatch
1579 crossref-grobid exact titleauthormatch
1174 fuzzy strong figshareversion
1149 crossref-grobid strong tokenizedauthors
1029 fatcat-pubmed exact titleauthormatch
721 crossref-grobid exact pmcid
625 fatcat-crossref strong slugtitleauthormatch
448 grobid strong titleartifact
446 fatcat-crossref exact titleauthormatch
181 fuzzy strong titleartifact
84 crossref-grobid strong titleartifact
80 crossref strong titleartifact
5 fuzzy strong custombsiundated
3 fuzzy strong custombsisubdoc
2 fatcat-pubmed strong titleartifact
1 fatcat exact doi
```
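To get from the distribution above to per-status totals, the rows can be summed by their third column. The sketch below assumes each row has the layout `count provenance status reason`; these column names are labels chosen for illustration and do not come from the listing itself. The sample is a four-row excerpt of the listing.

```python
from collections import Counter


def sum_by_status(lines):
    """Sum counts per status for rows shaped 'count provenance status reason'."""
    totals = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed rows
        count, _provenance, status, _reason = parts
        totals[status] += int(count)
    return totals


# Excerpt of the distribution above, not the full table.
sample = """\
1060312017 crossref exact doi
366035675 crossref unmatched unknown
52388120 fuzzy strong jaccardauthors
12671780 fuzzy strong slugtitleauthormatch
"""

totals = sum_by_status(sample.splitlines())
for status, count in totals.most_common():
    print(f"{count:>12,} {status}")
```

Run over the full listing, the `strong` rows add up to the 76,235,927 reported in the TL;DR.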
## Misc
New Wikipedia extraction (enwiki-20211201), compared with the previous one (2020-07-14) by number of lines mentioning a DOI:
```
martin@ia601101:/magna/data/wikipedia_citations_2020-07-14 $ LC_ALL=C grep ID_list minimal_dataset.json | grep -c DOI
1442189
$ jq -rc '.refs[] | select(.ID_list != null) | {"URL": .URL, "Title": .title, "ID_list": .ID_list}' enwiki-20211201-pages-articles.citations.json | pv -l > minimal.json
$ grep -c DOI minimal.json
1932578
```
The new extraction is converted to the existing minimal format, which the "BrefZipWikiDOI" task consumes.
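For readers who prefer Python over `jq`, the filter used above can be approximated as follows. This is a sketch, not the pipeline's code: the function name `minimal_refs` is hypothetical, the sample document is made up for illustration, and the input is assumed to be one JSON document per line, as the `jq` invocation suggests.

```python
import json


def minimal_refs(doc):
    """Yield minimal records for all refs in a document that carry an ID_list.

    Mirrors: .refs[] | select(.ID_list != null)
             | {"URL": .URL, "Title": .title, "ID_list": .ID_list}
    Unlike jq, a missing or null "refs" key is skipped instead of raising.
    """
    for ref in doc.get("refs") or []:
        if ref.get("ID_list") is not None:
            yield {
                "URL": ref.get("URL"),
                "Title": ref.get("title"),
                "ID_list": ref["ID_list"],
            }


# Made-up input line, only to show the shape of the transformation.
line = json.dumps({
    "refs": [
        {"URL": "http://example.org/a", "title": "A", "ID_list": "{DOI=10.1000/a}"},
        {"URL": "http://example.org/b", "title": "B", "ID_list": None},
    ]
})

records = list(minimal_refs(json.loads(line)))
for record in records:
    print(json.dumps(record))
```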
First result: the combined bref file (BrefCombined).
Previous version:
```
$ time zstdcat -T0 date-2021-07-28.json.zst |pv -l|wc -lc
2.08G 0:45:56 [ 753k/s] [ <=> ]
2077597833 981406745860
```
Current:
```
$ zstdcat -T0 date-2022-01-03.json.zst | pv -l | wc -lc
2.28G 0:37:55 [1.00M/s] [ <=> ]
2282864413 1077436490574
```
* 2,282,864,413 edges (matched and unmatched)
* 1,077,436,490,574 bytes uncompressed (about 1T)
About 11G more compressed and about 96G more uncompressed data (981,406,745,860 to 1,077,436,490,574 bytes). Estimated from a 100M sample: 1.483B matches (match ratio 0.65).
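The sample-based estimate is a simple extrapolation: multiply the total number of edges by the match ratio observed in the sample. A minimal sketch, assuming the ratio of 0.65 is the rounded value noted above (the helper name `estimate_matches` is made up here):

```python
def estimate_matches(total_edges: int, sample_match_ratio: float) -> int:
    """Extrapolate the number of matched edges from a sample match ratio."""
    return round(total_edges * sample_match_ratio)


total_edges = 2_282_864_413  # line count of the current combined file
ratio = 0.65                 # rounded ratio from the 100M sample

estimate = estimate_matches(total_edges, ratio)
actual = 1_481_079_426       # v2 match count reported in these notes

print(f"estimate: {estimate:,}")
print(f"relative error vs. actual: {abs(estimate - actual) / actual:.2%}")
```

Because the ratio is rounded to two digits, the result lands slightly above the 1.483B noted above, and within a fraction of a percent of the actual v2 count.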
Previous (v1):
* 1,323,423,672 matches; the earlier estimate based on file size was 1.439B.
Current (v2):
* 1,481,079,426 (76,235,927 strong, 1,404,843,499 exact, still about 5% fuzzy)
Diff:
* about 12% increase in the number of matched edges (1,323,423,672 to 1,481,079,426)
* latest (v12) OCI: 1,235,170,583 (so refcat, with 1,481,079,426, is about 20% larger)
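The relative sizes in this diff follow from the three counts given in these notes. A short sketch for recomputing them; `pct_larger` is a helper name made up for this example:

```python
def pct_larger(new: int, base: int) -> float:
    """Return how much larger `new` is than `base`, in percent."""
    return (new - base) / base * 100


refcat_v1 = 1_323_423_672
refcat_v2 = 1_481_079_426
oci_v12 = 1_235_170_583

print(f"v2 vs. v1:  +{pct_larger(refcat_v2, refcat_v1):.1f}%")
print(f"v2 vs. OCI: +{pct_larger(refcat_v2, oci_v12):.1f}%")
```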