# Grobid reparse
Want: Better match yield.
> Find out what we have not matched yet and try to parse the remaining data
> with grobid again.
## TODO
* [ ] find all reparsable strings, e.g. "unmatched refs"
* [ ] run via `grobid_xml_parse` (see the sketch below)
* [ ] collect examples of parsing issues
Reparsing the whole corpus will be part of the scholar raw refs pipeline.
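The per-citation call looks roughly like this. A minimal sketch, assuming a local GROBID server on port 8070; `grobid_xml_parse` from the TODO above is our own wrapper, and `parse_citation_xml` is an assumption about the grobid-tei-xml entry point, to be checked against the installed version.
```
import requests
import grobid_tei_xml  # pip install grobid-tei-xml

GROBID_URL = "http://localhost:8070/api/processCitation"  # assumption: local server

def reparse_citation(unstructured):
    """Send one raw citation string to GROBID, return the parsed biblio data."""
    resp = requests.post(GROBID_URL, data={"citations": unstructured}, timeout=30)
    resp.raise_for_status()
    # entry point name is an assumption; check the grobid-tei-xml docs
    return grobid_tei_xml.parse_citation_xml(resp.text)
```
GROBID's `processCitation` endpoint takes one raw string per request, which is also part of why the single-threaded run below is slow.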
## Notes
```
martin@ia601101:/magna/refcat/2021-07-28/UnmatchedRefs $ zstdcat -T0 date-2021-07-28.json.zst | pv -l | wc -l
272M 0:05:13 [ 867k/s] [ <=> ]
272119381
```
The unmatched refs set seems small enough to start with: 272,119,381 docs currently.
Expecting about 70% of docs to have an "unstructured" field, though many carry other fields as well.
```
$ zstdcat -T0 date-2021-07-28.json.zst | pv -l | LC_ALL=C grep -c -F '"unstructured"'
272M 0:04:51 [ 933k/s] [ <=> ]
192754239
```
192M have unstructured (70%), but may have other fields, too.
Sample field counts:
```
$ zstdcat -T0 date-2021-07-28.json.zst | head -1000000 | indigo.py | jq .c
{
"biblio": 1000000,
"biblio.container_name": 362777,
"biblio.contrib_raw_names": 544585,
"biblio.pages": 356993,
"biblio.volume": 338590,
"biblio.year": 441336,
"biblio.extra": 1000000,
"biblio.extra.isbn": 1000000,
"index": 1000000,
"key": 968748,
"ref_source": 1000000,
"release_year": 944441,
"release_ident": 1000000,
"release_stage": 945897,
"work_ident": 1000000,
"biblio.unstructured": 706717,
"biblio.issue": 50639,
"biblio.publisher": 12162,
"locator": 12808,
"biblio.url": 7418
}
```
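`indigo.py` is an in-house helper; for reference, a minimal sketch of the same field-path counting over JSON lines on stdin (the `{"c": ...}` output shape is modeled on the sample above):
```
import collections
import json
import sys

def paths(obj, prefix=""):
    """Yield dotted key paths for all non-null fields, including nested objects."""
    for k, v in obj.items():
        p = prefix + "." + k if prefix else k
        if isinstance(v, dict):
            yield p
            yield from paths(v, p)
        elif v is not None:
            yield p

counts = collections.Counter()
for line in sys.stdin:
    counts.update(paths(json.loads(line)))

json.dump({"c": dict(counts)}, sys.stdout, indent=2)
```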
A first run with `grobid-tei-xml`, single threaded, took about 50 min for 100K
citations, i.e. ~33 qps. Each citation goes out as its own HTTP request and we
do not batch, so this can probably be made much faster.
With threads etc., about 1000 citations/s should be possible, against a
single-threaded baseline of ~33.
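A thread pool around the blocking HTTP calls is the simplest way to get there; a sketch reusing `reparse_citation` from above (the worker count is a guess, to be tuned against the GROBID server):
```
import concurrent.futures

citations = [...]  # e.g. "biblio.unstructured" values from the dump

with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
    results = list(pool.map(reparse_citation, citations))
```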