# V3

V2 plus:

* [ ] no dups
* [ ] unmatched
* [ ] wikipedia
* [ ] some unstructured refs
* [ ] OL
* [ ] weblinks

## Duplicates

```
$ zstdcat -T0 /magna/refcat/BiblioRefV2/date-2021-02-20.json.zst | jq -rc 'select(.source_release_ident == .target_release_ident)'
```

Only 0.001% though.

## Unstructured

* about 300M w/o title, etc.
* some docs mention a "doi" in "unstructured"

Possible extractable information:

* pages ranges with regex
* doi, isbn, issn
* author names with some NER?
* journal abbreviation
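A rough sketch of such extraction with stdlib regexes — the patterns below are illustrative guesses, not what any production matcher uses, and they will both over- and under-match:

```python
import re

# Illustrative patterns only; real DOI/ISBN matching needs more care.
DOI_PAT = re.compile(r"10\.\d{4,9}/[^\s\"']+")
ISBN_PAT = re.compile(r"\b(?:97[89][-\s]?)?(?:\d[-\s]?){9}[\dXx]\b")
PAGES_PAT = re.compile(r"\b(\d+)\s*[-–]\s*(\d+)\b")

def extract_hints(unstructured: str) -> dict:
    """Pull candidate identifiers and page ranges out of a free-form reference."""
    return {
        "doi": [d.rstrip(".,;") for d in DOI_PAT.findall(unstructured)],
        "isbn": ISBN_PAT.findall(unstructured),
        "pages": PAGES_PAT.findall(unstructured),
    }
```

The page-range pattern will also fire on hyphenated ISBNs, so in practice identifier extraction would run first and mask its matches.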

Numbers:

```
$ time zstdcat -T0 dataset-full-date-2021-02-20.json.zst | LC_ALL=C pv -l | LC_ALL=C grep -c -i "doi"
2772622
```

Note that `grep -c` counts matching lines, i.e. docs whose JSON contains the string "doi" anywhere, so this is a rough upper bound.

Sometimes, the key contains an ISBN:

```
"key":"9781108604222#EMT-rl-1_BIBe-r-213"
```

key with doi:

```
"index":63,"key":"10.1002/9781118960608.gbm01177-BIB6970|gbm01177-cit-6970","locator":"7
```

ISBN format:

* 978-9279113639
* 9781566773362
* 978-80-7357-299-0
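Since hyphenation varies, normalizing and then checking the ISBN-13 checksum helps separate real ISBNs from lookalike digit runs. A minimal sketch:

```python
def is_valid_isbn13(raw: str) -> bool:
    """Validate an ISBN-13 after stripping hyphens and spaces.

    Checksum: digits weighted alternately 1 and 3 must sum to 0 mod 10.
    """
    digits = raw.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not digits.isdigit():
        return False
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0
```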

URLs may be broken:

```
http://www. unaids.org/hivaidsinfo/statistics/fact_sheets/pdfs/Thailand_en.pdf
```
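For a string that is known to be a single URL, a crude repair is to strip all whitespace — a heuristic that handles the injected-space case above, but would mangle fields that contain more than one URL:

```python
import re

def repair_url(url: str) -> str:
    """Strip whitespace that extraction injected into a URL string."""
    return re.sub(r"\s+", "", url)

print(repair_url("http://www. unaids.org/hivaidsinfo/statistics/fact_sheets/pdfs/Thailand_en.pdf"))
# http://www.unaids.org/hivaidsinfo/statistics/fact_sheets/pdfs/Thailand_en.pdf
```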

* 2030021 DOI
* 36376 arxiv

Some cases only contain authors and year, e.g.

```
{
  "biblio": {
    "contrib_raw_names": [
      "W H Hartmann",
      "B H Hahn",
      "H Abbey",
      "L E Shulman"
    ],
    "unstructured": "Hartmann, W. H., Hahn, B. H., Abbey, H., and Shulman, L. E., Lancer, 1965, 1, 123.",
    "year": 1965
  },
```

Here, we could run a query, e.g.
https://fatcat.wiki/release/search?q=hahn+shulman+abbey+hartmann, and check for
result set size, year, etc.
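A sketch of building such a query URL from `contrib_raw_names` — the surname heuristic (last token of each raw name) and the helper name are assumptions, not what skate does:

```python
from urllib.parse import urlencode

def fatcat_query_url(contrib_raw_names, year=None):
    """Build a fatcat release search URL from raw contributor names.

    Heuristic: take the last token of each name as the surname.
    """
    terms = [name.split()[-1].lower() for name in contrib_raw_names]
    if year is not None:
        terms.append(f"year:{year}")
    return "https://fatcat.wiki/release/search?" + urlencode({"q": " ".join(terms)})
```

The caller would then compare result-set size and year against the reference before accepting a match.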

Other example:

* https://fatcat.wiki/release/search?q=Goudie+Anderson+Gray+boyle+buchanan+year%3A1965

```
{
  "biblio": {
    "contrib_raw_names": [
      "R B Goudie",
      "J R Anderson",
      "K G Gray",
      "J A Boyle",
      "W W Buchanar"
    ],
    "unstructured": "Goudie, R. B., Anderson, J. R., Gray, K. G., Boyle, J. A., and Buchanar, W. W., ibid., 1965, 1, 322.",
    "year": 1965
  },
```

----

With `skate-from-unstructured` we get some more doi and arxiv identifiers from
unstructured refs (unstructured, key). How many?

```
$ time zstdcat -T0 dataset-full-date-2021-02-20.json.zst | pv -l | \
    skate-from-unstructured | jq -rc 'select(.biblio.doi != null or .biblio.arxiv_id != null)' | wc -l
```

The https://anystyle.io/ CRF implementation seems really useful to parse out
the rest of the unstructured data.

* [ ] parse fields with some containerized anystyle (create an OCI container
  and get it running with or without docker; maybe podman can run it as a
  library?)

Example:

```
$ anystyle -f json parse xxxx.txt
[
  {
    "citation-number": [
      "3. "
    ],
    "author": [
      {
        "family": "JP",
        "given": "Morgan"
      },
      {
        "family": "CS",
        "given": "Bailey"
      }
    ],
    "title": [
      "Cauda equina syndrome in the dog: radiographical evaluation"
    ],
    "volume": [
      "21"
    ],
    "pages": [
      "45 – 58"
    ],
    "type": "article-journal",
    "container-title": [
      "J Small Anim Practice"
    ],
    "date": [
      "1980"
    ]
  }
]
```

Can dump the whole unstructured list into a single file (one record per line).

* 10K lines take 32 s
* 100M lines would take roughly 90 hours at that rate
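The extrapolation, spelled out (single process; assumes parsing time scales linearly with line count):

```python
# 10K lines in 32 s, extrapolated to 100M lines.
lines_sampled, seconds = 10_000, 32
target_lines = 100_000_000
est_hours = target_lines / lines_sampled * seconds / 3600
print(f"~{est_hours:.0f} hours")  # ~89 hours
```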

----

* From 308 "UnmatchedRefs" we would extract a doi/arxiv id for 47,696,153 refs.

Stats:

* 759,516,507 citation links
* ~723,350,228 + 47,696,153 = 771,046,381 edges

----

* aitio has docker installed

```
Client:
 Version:      17.06.0-ce
 API version:  1.30
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:23:31 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.06.0-ce
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:19:04 2017
 OS/Arch:      linux/amd64
 Experimental: false
```

Maybe build an alpine-based image?
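A hypothetical starting point — the base image, packages, and the `anystyle-cli` gem install are all untested assumptions:

```dockerfile
# Untested sketch: anystyle-cli in an alpine ruby image.
# build-base is a guess, needed if the wapiti bindings compile native code.
FROM ruby:3.1-alpine
RUN apk add --no-cache build-base
RUN gem install anystyle-cli
ENTRYPOINT ["anystyle"]
```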

Both anystyle and grobid use wapiti under the hood, but they seem to differ
slightly. anystyle seems to be the smaller codebase overall; grobid has an API
and various modes.

Note-to-self: Run a comparison between wapiti based citation extractors.

----

```
$ time zstdcat -T0 /magna/refcat/UnmatchedRefs/date-2021-02-20.json.zst | LC_ALL=C wc -l
260768384
```

----

# Wikipedia

* /magna/data/wikipedia_citations_2020-07-14

A first run only got 64008 docs; it is improbable that we are missing so many DOIs.

Also, need to generalize some skate code a bit.

----

# Verification stats

* have 40257623 clusters, `zstdcat -T0 /magna/refcat/RefsFatcatClusters/date-2021-02-20.json.zst | wc -l`
* have 29290668 clusters of size < 10

```
$ zstdcat -T0 /magna/refcat/RefsFatcatClusters/date-2021-02-20.json.zst |
    jq -rc 'select(.v|length < 10)' | LC_ALL=C wc -l
```

A 5M sample.

```
$ awk '{print $3}' cluster_verify_5m.txt | sort | uniq -c | sort -nr
6886124 StatusDifferent
4619805 StatusStrong
3587478 StatusExact
 120215 StatusAmbiguous
```

----

# Unmatched

* We want the unmatched refs as well, e.g. to display.

In order to do that offline, we would need to sort all matches by source ident
and the original refs file by source ident as well.

Then iterate over both files and fill in the unmatched targets (unstructured, csl, ...).

Options:

* we have `source ident` and `ref_index` (+1)
* can sort biblioref by source ident
* can sort refs by source ident

That's almost the same as the matching process, just another function working on the match group.
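Under those assumptions, the merge over the two sorted streams could look like this — the field names `ident` and `ref_index` are placeholders for the real schema:

```python
import itertools

def fill_unmatched(refs, matches):
    """Yield refs whose (ident, ref_index) has no corresponding match.

    Both inputs must already be sorted by source ident; 'ident' and
    'ref_index' are assumed field names standing in for the real schema.
    """
    match_groups = itertools.groupby(matches, key=lambda m: m["ident"])
    cur_ident, cur_group = next(match_groups, (None, None))
    for ident, group in itertools.groupby(refs, key=lambda r: r["ident"]):
        # Advance the match stream until it catches up with this ident.
        while cur_ident is not None and cur_ident < ident:
            cur_ident, cur_group = next(match_groups, (None, None))
        matched_idx = set()
        if cur_ident == ident:
            matched_idx = {m["ref_index"] for m in cur_group}
        for ref in group:
            if ref["ref_index"] not in matched_idx:
                yield ref
```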

----