path: root/notes/2022_01_10_refcat_update.md
# Refcat update

* new refs export, about 10% more (2.7B)
* new fatcat export

## TL;DR

* v1: 1,323,423,672
* v2: 1,481,079,426 (+11.9%, 76,235,927 strong, 1,404,843,499 exact, still about 5% fuzzy)
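The figures above can be cross-checked with a few lines of arithmetic. This is only a sketch using the counts quoted in this section; the variable names are made up for illustration, and it assumes the "about 5% fuzzy" share refers to the strong matches.

```python
# Counts copied from the TL;DR above; names are illustrative only.
V1_MATCHES = 1_323_423_672
V2_STRONG = 76_235_927
V2_EXACT = 1_404_843_499

# v2 total is the sum of strong and exact matches.
v2_matches = V2_STRONG + V2_EXACT

# Relative growth of v2 over v1, and the share of strong (fuzzy) matches.
growth_pct = 100 * (v2_matches - V1_MATCHES) / V1_MATCHES
strong_pct = 100 * V2_STRONG / v2_matches

print(f"{v2_matches:,} matches, +{growth_pct:.1f}% over v1, {strong_pct:.1f}% strong")
```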

```
$ du -hs 2022-01-03/
1.6T    2022-01-03/

$ tree -sh 2022-01-03/Bref*
2022-01-03/Bref
└── [ 81G]  date-2022-01-03.json.zst
2022-01-03/BrefCombined
├── [135G]  date-2022-01-03.json.zst
├── [4.5M]  matches_sorted.tsv.zst
├── [1.1G]  matches.tsv.zst
└── [2.4K]  uniqc.txt
2022-01-03/BrefOpenLibraryZipISBN
└── [ 30M]  date-2022-01-03.json.zst
2022-01-03/BrefSortedByWorkID
└── [ 72G]  date-2022-01-03.json.zst
2022-01-03/BrefZipArxiv
└── [266M]  date-2022-01-03.json.zst
2022-01-03/BrefZipDOI
└── [ 70G]  date-2022-01-03.json.zst
2022-01-03/BrefZipFuzzy
└── [6.1G]  date-2022-01-03-mapper-ts.json.zst
2022-01-03/BrefZipOpenLibrary
└── [ 43M]  date-2022-01-03.json.zst
2022-01-03/BrefZipPMCID
└── [8.7M]  date-2022-01-03.json.zst
2022-01-03/BrefZipPMID
└── [4.9G]  date-2022-01-03.json.zst
2022-01-03/BrefZipWikiDOI
└── [ 75M]  date-2022-01-03.json.zst

0 directories, 14 files
```

Match distribution:

```
1060312017  crossref         exact      doi
366035675   crossref         unmatched  unknown
353068331   grobid           unmatched  unknown
180656009   fatcat-datacite  exact      doi
65244213    fatcat-pubmed    exact      pmid
59858436    crossref-grobid  unmatched  unknown
52388120    fuzzy            strong     jaccardauthors
48732594    grobid           exact      doi
32262589    fatcat-pubmed    exact      doi
14236248    fatcat           unmatched  unknown
12671780    fuzzy            strong     slugtitleauthormatch
9711647     fuzzy            strong     tokenizedauthors
8277050     fatcat-crossref  exact      doi
4126772     fatcat-crossref  unmatched  unknown
3998894     grobid           exact      arxiv
3962173     fatcat-pubmed    unmatched  unknown
2561175     fuzzy            exact      titleauthormatch
1621193     grobid           exact      pmid
563569      fuzzy            strong     versioneddoi
519064      grobid           exact      isbn
497352      fatcat-datacite  unmatched  unknown
366615      crossref         strong     jaccardauthors
260217      crossref-grobid  exact      arxiv
166014      crossref-grobid  exact      doi
92927       crossref         exact      isbn
76785       fuzzy            strong     dataciterelatedid
75798       fatcat-pubmed    strong     jaccardauthors
71430       grobid           strong     jaccardauthors
65643       fatcat-crossref  strong     jaccardauthors
63527       fuzzy            strong     pmiddoipair
53837       crossref         exact      arxiv
47504       fuzzy            strong     arxivversion
43016       fuzzy            strong     customieeearxiv
40166       grobid           exact      pmcid
22094       crossref-grobid  strong     jaccardauthors
21836       crossref         strong     tokenizedauthors
13587       grobid           strong     slugtitleauthormatch
9589        crossref         strong     slugtitleauthormatch
8936        crossref         exact      titleauthormatch
8750        fatcat-pubmed    exact      arxiv
6990        crossref-grobid  exact      pmid
6455        fatcat-crossref  strong     tokenizedauthors
4670        grobid           exact      titleauthormatch
4573        crossref-grobid  strong     slugtitleauthormatch
4363        grobid           strong     tokenizedauthors
3581        fatcat-pubmed    exact      isbn
3364        fatcat-crossref  exact      arxiv
3344        fatcat-crossref  exact      isbn
2654        fatcat-pubmed    strong     tokenizedauthors
2129        fuzzy            exact      workid
1591        fatcat-pubmed    strong     slugtitleauthormatch
1579        crossref-grobid  exact      titleauthormatch
1174        fuzzy            strong     figshareversion
1149        crossref-grobid  strong     tokenizedauthors
1029        fatcat-pubmed    exact      titleauthormatch
721         crossref-grobid  exact      pmcid
625         fatcat-crossref  strong     slugtitleauthormatch
448         grobid           strong     titleartifact
446         fatcat-crossref  exact      titleauthormatch
181         fuzzy            strong     titleartifact
84          crossref-grobid  strong     titleartifact
80          crossref         strong     titleartifact
5           fuzzy            strong     custombsiundated
3           fuzzy            strong     custombsisubdoc
2           fatcat-pubmed    strong     titleartifact
1           fatcat           exact      doi
```
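To roll the distribution up per match status, a listing like the one above can be parsed and summed. The sketch below assumes four whitespace-separated columns, labelled here as count, provider, status and reason (these labels are an assumption, the listing itself has no header), and it only uses four rows from the listing as sample input.

```python
from collections import Counter


def totals_by_status(lines):
    """Sum the count column per status (exact, strong, unmatched)."""
    totals = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        count, _provider, status, _reason = fields
        totals[status] += int(count)
    return totals


# Four rows copied from the listing above, not the full distribution.
sample = [
    "1060312017  crossref  exact      doi",
    "366035675   crossref  unmatched  unknown",
    "52388120    fuzzy     strong     jaccardauthors",
    "12671780    fuzzy     strong     slugtitleauthormatch",
]

totals = totals_by_status(sample)
for status, count in totals.most_common():
    print(f"{status:10} {count:>15,}")
```

Run over the full listing, the `strong` total should agree with the 76,235,927 strong matches reported in the TL;DR.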

## Misc

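New Wikipedia extraction; lines mentioning a DOI went from 1,442,189 (2020-07-14 dataset) to 1,932,578 (enwiki-20211201 dump):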

```
martin@ia601101:/magna/data/wikipedia_citations_2020-07-14 $ LC_ALL=C grep ID_list minimal_dataset.json | grep -c DOI
1442189

$ jq -rc '.refs[] | select(.ID_list != null) | {"URL": .URL, "Title": .title, "ID_list": .ID_list}' enwiki-20211201-pages-articles.citations.json | pv -l  > minimal.json
$ grep -c DOI minimal.json
1932578
```
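For reference, the `jq` filter above could be written in Python roughly as follows. This is a sketch, not the code used for the export: it assumes one JSON document per input line, it is more lenient than `jq` in that a missing `refs` key is skipped rather than raising an error, and the example record (including the shape of the `ID_list` value) is invented for illustration.

```python
import json


def minimal_refs(lines):
    """Yield URL, Title and ID_list for every ref that has a non-null ID_list."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        for ref in doc.get("refs") or []:
            if ref.get("ID_list") is None:
                continue  # mirrors: select(.ID_list != null)
            yield {
                "URL": ref.get("URL"),
                "Title": ref.get("title"),
                "ID_list": ref["ID_list"],
            }


# Invented example input: one page with two refs, only one carrying an ID_list.
example = [
    '{"refs": [{"URL": "https://example.org", "title": "A", "ID_list": "{DOI=10.1/x}"},'
    ' {"title": "B"}]}'
]

rows = list(minimal_refs(example))
for row in rows:
    print(json.dumps(row))
```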

Convert format to existing minimal format, for "BrefZipWikiDOI" task.

First result: the combined file (BrefCombined).

Previous version:

```
$ time zstdcat -T0 date-2021-07-28.json.zst |pv -l|wc -lc
2.08G 0:45:56 [ 753k/s] [ <=> ]
2077597833 981406745860
```

Current:

```
$ zstdcat -T0 date-2022-01-03.json.zst | pv -l | wc -lc
2.28G 0:37:55 [1.00M/s] [ <=> ]
2282864413 1077436490574
```

* 2,282,864,413 edges (matched and unmatched)
* 1,077,436,490,574 bytes uncompressed (about 1T)

About 11G more compressed and about 96 GB (89 GiB) more uncompressed data
(1,077,436,490,574 vs. 981,406,745,860 bytes). Estimated from a 100M sample:
1.483B matches (match ratio 0.65).
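The extrapolation from a sample to the full file is a simple proportion. In the sketch below the sample counts are hypothetical, chosen only to reproduce the rounded ratio of 0.65; the real sample ratio was presumably slightly lower, which would explain the quoted 1.483B.

```python
def estimate_matches(total_edges, sample_matched, sample_size):
    """Scale the matched share observed in a sample up to the full edge count."""
    return round(total_edges * sample_matched / sample_size)


TOTAL_EDGES = 2_282_864_413   # line count of the current combined file
ACTUAL_MATCHES = 1_481_079_426  # exact count reported for v2

# Hypothetical sample: 65M matched out of 100M lines (ratio 0.65).
estimate = estimate_matches(TOTAL_EDGES, 65_000_000, 100_000_000)
error_pct = 100 * abs(estimate - ACTUAL_MATCHES) / ACTUAL_MATCHES

print(f"estimated {estimate:,} matches, {error_pct:.2f}% off the exact count")
```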

Previous (v1):

* 1,323,423,672 matches (an estimate based on file size gave 1.439B)

Current (v2):

* 1,481,079,426 (76,235,927 strong, 1,404,843,499 exact, still about 5% fuzzy)

Diff:

* about 12% increase in the number of matched edges
* latest (v12) OCI: 1,235,170,583 (so refcat is about 20% larger with 1,481,079,426)
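The remaining comparisons follow the same pattern. A small helper, sketched here with the numbers quoted in this section (the function and variable names are illustrative), gives the relative change in total edges, in uncompressed bytes, and the size of refcat relative to OCI.

```python
def relative_increase_pct(old, new):
    """Percentage by which `new` exceeds `old`."""
    return 100 * (new - old) / old


# Total edges (matched and unmatched), from the two `wc -l` runs.
edges_pct = relative_increase_pct(2_077_597_833, 2_282_864_413)

# Uncompressed size in bytes, from the two `wc -c` runs.
bytes_pct = relative_increase_pct(981_406_745_860, 1_077_436_490_574)

# Refcat v2 matches relative to the OCI v12 citation count.
oci_pct = relative_increase_pct(1_235_170_583, 1_481_079_426)

print(f"edges +{edges_pct:.1f}%, bytes +{bytes_pct:.1f}%, refcat vs. OCI +{oci_pct:.1f}%")
```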