# V3
V2 plus:
* [ ] wikipedia
* [ ] some unstructured refs
* [ ] OL
* [ ] weblinks
## Unstructured
* about 300M w/o title, etc.
* some docs mention a "doi" in "unstructured"
Possible extractable information:
* page ranges with regex
* doi, isbn, issn
* author names with some NER?
* journal abbreviation
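A minimal sketch of the regex part of that list, assuming Python and deliberately loose patterns. The function name `extract_hints` and all four patterns are illustrative assumptions, not what any existing tool does; identifiers are removed from the string before the page-range search so that an ISSN such as `1234-5679` is not mistaken for pages.

```python
import re

# Loose, heuristic patterns (assumptions); expect false positives on real data.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')
ISBN13_RE = re.compile(r'\b97[89](?:-?\d){10}\b')
ISSN_RE = re.compile(r'\b\d{4}-\d{3}[\dXx]\b')
PAGES_RE = re.compile(r'\b(\d{1,5})\s*[-\u2013\u2014]\s*(\d{1,5})\b')


def extract_hints(text):
    """Return a dict with any of: doi, isbn, issn, pages found in text."""
    hints = {}
    rest = text
    for name, pattern in (("doi", DOI_RE), ("isbn", ISBN13_RE), ("issn", ISSN_RE)):
        m = pattern.search(rest)
        if m:
            value = m.group(0)
            if name == "doi":
                # Sentence punctuation often sticks to the end of a DOI.
                value = value.rstrip(".,;)")
            hints[name] = value
            # Blank out the identifier so later patterns do not re-match it.
            rest = rest[:m.start()] + " " + rest[m.end():]
    m = PAGES_RE.search(rest)
    if m and int(m.group(1)) <= int(m.group(2)):
        hints["pages"] = f"{m.group(1)}-{m.group(2)}"
    return hints
```

Author names and journal abbreviations are left out on purpose; they likely need NER or a trained parser rather than a regex.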
Numbers:
$ time zstdcat -T0 dataset-full-date-2021-02-20.json.zst | LC_ALL=C pv -l | LC_ALL=C grep -c -i "doi"
2772622
Sometimes, the key contains an ISBN:
```
"key":"9781108604222#EMT-rl-1_BIBe-r-213"
```
key with doi:
```
"index":63,"key":"10.1002/9781118960608.gbm01177-BIB6970|gbm01177-cit-6970","locator":"7
```
ISBN format:
* 978-9279113639
* 9781566773362
* 978-80-7357-299-0
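The three spellings above differ only in hyphenation, so a normalization step could look like the following sketch. All function names are hypothetical, and taking the part of a key before `#` as an ISBN candidate is an assumption based on the single key example above. The ISBN-13 check digit rule (weights 1 and 3 alternating, sum divisible by 10) is standard; a failing check may also point at a typo in the source data rather than a non-ISBN.

```python
def normalize_isbn13(raw):
    """Strip hyphens and spaces; return 13 digits or None if it does not look like an ISBN-13."""
    digits = raw.replace("-", "").replace(" ", "")
    if len(digits) == 13 and digits.isdigit() and digits[:3] in ("978", "979"):
        return digits
    return None


def isbn13_checksum_ok(digits):
    """ISBN-13 check: digits weighted 1,3,1,3,... must sum to a multiple of 10."""
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return total % 10 == 0


def isbn_from_key(key):
    """Assumed convention: the ISBN, if any, is the part of the key before '#'."""
    return normalize_isbn13(key.split("#", 1)[0])
```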
URLs may be broken:
```
http://www. unaids.org/hivaidsinfo/statistics/fact_sheets/pdfs/Thailand_en.pdf
```
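For the specific breakage shown above (whitespace right after `www.`), a narrow repair might be enough. This is a sketch covering only that one pattern; `repair_url` is a hypothetical helper and other kinds of broken URLs would need their own rules.

```python
import re


def repair_url(url):
    """Remove whitespace that directly follows 'http(s)://www.'; leave everything else alone."""
    return re.sub(r"(https?://www\.)\s+", r"\1", url)
```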
* 2030021 DOI
* 36376 arxiv
Some cases only contain authors and year, e.g.
```
{
"biblio": {
"contrib_raw_names": [
"W H Hartmann",
"B H Hahn",
"H Abbey",
"L E Shulman"
],
"unstructured": "Hartmann, W. H., Hahn, B. H., Abbey, H., and Shulman, L. E., Lancer, 1965, 1, 123.",
"year": 1965
},
```
Here, we could run a query, e.g.
https://fatcat.wiki/release/search?q=hahn+shulman+abbey+hartmann, and check for
result set size, year, etc.
Other example:
* https://fatcat.wiki/release/search?q=Goudie+Anderson+Gray+boyle+buchanan+year%3A1965
```
{
"biblio": {
"contrib_raw_names": [
"R B Goudie",
"J R Anderson",
"K G Gray",
"J A Boyle",
"W W Buchanar"
],
"unstructured": "Goudie, R. B., Anderson, J. R., Gray, K. G., Boyle, J. A., and Buchanar, W. W., ibid., 1965, 1, 322.",
"year": 1965
},
```
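The query construction for such lookups could be sketched as below. It assumes the surname is the last whitespace-separated token of each raw name and reuses the `year:` filter seen in the second search link; `surname` and `search_url` are hypothetical names. Raw names can carry typos (the record says "Buchanar" while the query above uses "buchanan"), so exact surname terms may miss, and the result set would still need checking.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://fatcat.wiki/release/search"


def surname(raw_name):
    """Assumption: raw names look like 'W H Hartmann', surname last."""
    return raw_name.split()[-1]


def search_url(contrib_raw_names, year=None):
    """Build a release search URL from surnames and an optional year filter."""
    terms = [surname(name) for name in contrib_raw_names if name.strip()]
    if year is not None:
        terms.append(f"year:{year}")
    return SEARCH_URL + "?" + urlencode({"q": " ".join(terms)})
```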
----
With `skate-from-unstructured` we get some more doi and arxiv identifiers from
unstructured refs (unstructured, key). How many?
```
$ time zstdcat -T0 dataset-full-date-2021-02-20.json.zst | pv -l | \
skate-from-unstructured | jq -rc 'select(.biblio.doi != null or .biblio.arxiv_id != null)' | wc -l
```
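For reference, the `jq` filter in that pipeline amounts to the following check per JSON line. This is a Python sketch of the same condition, not part of `skate-from-unstructured`; the function names are made up.

```python
import json


def has_identifier(line):
    """True if the record has a non-null biblio.doi or biblio.arxiv_id."""
    rec = json.loads(line)
    biblio = rec.get("biblio") or {}
    return biblio.get("doi") is not None or biblio.get("arxiv_id") is not None


def count_with_identifier(lines):
    """Count JSON lines that carry a DOI or arXiv identifier; blank lines are skipped."""
    return sum(1 for line in lines if line.strip() and has_identifier(line))
```

Passing `sys.stdin` as `lines` would mirror the `jq ... | wc -l` tail of the command.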
The https://anystyle.io/ CRF implementation seems really useful to parse out
the rest of the unstructured data.
* [ ] parse fields with a containerized anystyle (create an OCI container
and get it running with or without docker; maybe podman can be used as a
library?)
Example:
```
$ anystyle -f json parse xxxx.txt
[
{
"citation-number": [
"3. "
],
"author": [
{
"family": "JP",
"given": "Morgan"
},
{
"family": "CS",
"given": "Bailey"
}
],
"title": [
"Cauda equina syndrome in the dog: radiographical evaluation"
],
"volume": [
"21"
],
"pages": [
"45 – 58"
],
"type": "article-journal",
"container-title": [
"J Small Anim Practice"
],
"date": [
"1980"
]
}
]
```
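Most fields in that output are single-element lists, so a flattening step would be needed before merging with ref records. A sketch, with hypothetical output field names; note that in the example above `family` and `given` look swapped ("JP" as family name), so the sketch simply joins both parts instead of trusting either.

```python
def flatten_anystyle(rec):
    """Turn one anystyle JSON record into a flat dict (output field names are assumptions)."""

    def first(field):
        values = rec.get(field) or []
        return values[0].strip() if values else None

    authors = [
        " ".join(part for part in (a.get("given"), a.get("family")) if part)
        for a in rec.get("author", [])
    ]
    return {
        "authors": authors,
        "title": first("title"),
        "container_title": first("container-title"),
        "volume": first("volume"),
        "pages": first("pages"),
        "date": first("date"),
        "type": rec.get("type"),  # a plain string in the output above, not a list
    }
```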
Can dump the whole unstructured list into a single file (one per line).
* 10K lines take: 32s
* 100M would take probably ~100h to parse.
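The estimate is a linear extrapolation from the 10K-line sample. A small sketch of the arithmetic; the `workers` parameter is a hypothetical addition that assumes perfectly linear scaling across parallel processes, which has not been measured.

```python
def estimate_hours(total_lines, sample_lines=10_000, sample_seconds=32.0, workers=1):
    """Extrapolate parse time linearly from a timed sample; returns hours."""
    seconds = total_lines / sample_lines * sample_seconds / workers
    return seconds / 3600
```

With the numbers above, `estimate_hours(100_000_000)` comes out at roughly 89 hours for a single process, in line with the rough ~100h figure.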
----
* from 308 "UnmatchedRefs" we would extract doi/arxiv for 47696153.
Stats:
* 759,516,507 citation links.
* ~723,350,228 + 47,696,153
* 771,046,381 edges (723,350,228 + 47,696,153)
----
* aitio has docker installed
```
Client:
Version: 17.06.0-ce
API version: 1.30
Go version: go1.8.3
Git commit: 02c1d87
Built: Fri Jun 23 21:23:31 2017
OS/Arch: linux/amd64
Server:
Version: 17.06.0-ce
API version: 1.30 (minimum version 1.12)
Go version: go1.8.3
Git commit: 02c1d87
Built: Fri Jun 23 21:19:04 2017
OS/Arch: linux/amd64
Experimental: false
```
Maybe build an alpine-based image?