# Grobid reparse

Want: Better match yield.

> Find out what we have not matched yet and try to parse remaining data
with grobid, again.

## TODO

* [ ] find all reparsable strings, e.g. "unmatched refs"
* [ ] run via `grobid_xml_parse`
* [ ] collect examples of parsing issues

Reparsing the whole corpus will be part of the scholar raw refs pipeline.

## Notes

```
martin@ia601101:/magna/refcat/2021-07-28/UnmatchedRefs $ zstdcat -T0 date-2021-07-28.json.zst | pv -l | wc -l
272M 0:05:13 [ 867k/s] [ <=> ]
272119381
```

The unmatched refs set seems small: 272119381 docs currently. Start with that anyway.

Expecting about 70% of docs to have an "unstructured" field; many of those already have other fields as well.

```
$ zstdcat -T0 date-2021-07-28.json.zst | pv -l | LC_ALL=C grep -c -F '"unstructured"'
272M 0:04:51 [ 933k/s] [ <=> ]
192754239
```

192M docs (about 71% of 272M) have an "unstructured" field, though they may have other fields, too.
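
To get from the count to actual reparsable strings, a minimal sketch of a filter over the JSON lines. It assumes one JSON object per line with the raw string under `biblio.unstructured`, as the dotted field names in the sample counts suggest; the function name `unstructured_strings` is hypothetical.

```python
import json


def unstructured_strings(lines):
    """Yield (release_ident, index, raw string) for refs that carry a
    non-empty biblio.unstructured field (assumed document layout)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        raw = (doc.get("biblio") or {}).get("unstructured")
        if raw:
            yield doc.get("release_ident"), doc.get("index"), raw
```

This could be fed from `zstdcat -T0 date-2021-07-28.json.zst` via `sys.stdin`. Unlike the `grep -F` count, it only counts a doc when the field is actually present under `biblio` and non-empty, so the numbers may differ slightly.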

Sample field counts:

```
$ zstdcat -T0 date-2021-07-28.json.zst | head -1000000 | indigo.py | jq .c
{
  "biblio": 1000000,
  "biblio.container_name": 362777,
  "biblio.contrib_raw_names": 544585,
  "biblio.pages": 356993,
  "biblio.volume": 338590,
  "biblio.year": 441336,
  "biblio.extra": 1000000,
  "biblio.extra.isbn": 1000000,
  "index": 1000000,
  "key": 968748,
  "ref_source": 1000000,
  "release_year": 944441,
  "release_ident": 1000000,
  "release_stage": 945897,
  "work_ident": 1000000,
  "biblio.unstructured": 706717,
  "biblio.issue": 50639,
  "biblio.publisher": 12162,
  "locator": 12808,
  "biblio.url": 7418
}
```
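
The `indigo.py` script is not shown here; as a rough sketch of what such a field count might do, the following counts dotted key paths over JSON lines. It assumes each line is a JSON object, and the names `key_paths` and `count_key_paths` are hypothetical.

```python
import collections
import json


def key_paths(obj, prefix=""):
    """Yield a dotted path for every key in a nested dict."""
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        yield path
        if isinstance(value, dict):
            yield from key_paths(value, path)


def count_key_paths(lines):
    """Count in how many docs each dotted key path occurs."""
    counter = collections.Counter()
    for line in lines:
        if line.strip():
            counter.update(key_paths(json.loads(line)))
    return counter
```

Each path occurs at most once per doc, so the counts can be read directly as "number of docs with this field", as in the sample above.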

A first run with `grobid-tei-xml`, single-threaded, took about 50 min for 100K
citations, or about 33 requests/s. Each citation is a separate HTTP request, we
do not batch; there is probably a lot of room to make this much faster.
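
To put that rate in perspective, a small back-of-the-envelope calculation, using the 192754239 docs with an "unstructured" field from above as the workload. The helper names are hypothetical.

```python
def throughput(items, seconds):
    """Items processed per second."""
    return items / seconds


def eta_days(total, per_second):
    """Days needed to process `total` items at a given rate."""
    return total / per_second / 86400


baseline = throughput(100_000, 50 * 60)  # 100K citations in 50 min
total = 192_754_239                       # docs with "unstructured"
print(f"baseline: {baseline:.1f}/s, full run: {eta_days(total, baseline):.0f} days")
# prints "baseline: 33.3/s, full run: 67 days"
```

At the single-threaded rate, the full set would take roughly two months, which is why speeding this up matters.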

With threads etc., about 1000 citations/s seem possible; the single-threaded baseline is about 30.
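
A minimal sketch of the threaded variant, not the code actually used. It assumes a GROBID instance at `http://localhost:8070` and posts each raw string to the `/api/processCitation` endpoint as the form field `citations`; the worker count and the names `parse_citation` and `parse_many` are hypothetical.

```python
import concurrent.futures
import urllib.parse
import urllib.request

# Assumption: local GROBID instance on its default port.
GROBID_URL = "http://localhost:8070/api/processCitation"


def parse_citation(raw, url=GROBID_URL, timeout=30):
    """POST one raw citation string, return the XML response as text."""
    data = urllib.parse.urlencode({"citations": raw}).encode("utf-8")
    req = urllib.request.Request(url, data=data, headers={"Accept": "application/xml"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_many(raws, parse_one=parse_citation, workers=16):
    """Run parse_one over raws in a thread pool; results keep input order.

    A failed request yields None instead of aborting the whole batch.
    """
    def safe(raw):
        try:
            return parse_one(raw)
        except Exception:
            return None

    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(safe, raws))
```

Threads fit here because the work is I/O-bound (waiting on HTTP responses). `parse_many` holds all results in memory, so for the full set it would need to be called on chunks, e.g. 10K strings at a time.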