-rw-r--r-- | notes/2021_10_grobid_reparse.md | 8 |
1 file changed, 7 insertions, 1 deletion
diff --git a/notes/2021_10_grobid_reparse.md b/notes/2021_10_grobid_reparse.md
index 0cca5d8..8101ad6 100644
--- a/notes/2021_10_grobid_reparse.md
+++ b/notes/2021_10_grobid_reparse.md
@@ -9,6 +9,9 @@ with grobid, again.
 
 * [ ] find all reparsable strings, e.g. "unmatched refs"
 * [ ] run via `grobid_xml_parse`
+* [ ] collect examples of parsing issues
+
+Reparsing the whole corpus will be part of the scholar raw refs pipeline.
 
 ## Notes
 
@@ -58,5 +61,8 @@ $ zstdcat -T0 date-2021-07-28.json.zst | head -1000000 | indigo.py | jq .c
 }
 ```
 
+A first run with `grobid-tei-xml`, single threaded, about 50min for 100K
+citations, or 33 qps. Each request uses http, we do not batch; this will
+probably be much faster.
 
-
+About 1000 citations/s possible with threads, etc; baseline: 30.
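
Reviewer note: the throughput figures in the added lines can be checked with a few lines of Python. The sketch below also shows one way independent requests might be spread over threads. It is only an illustration: `parse_citation` is a hypothetical stand-in that makes no network call, and the `workers=8` default is an arbitrary assumption; neither comes from the commit.

```python
from concurrent.futures import ThreadPoolExecutor


def throughput(n_items: int, seconds: float) -> float:
    """Items processed per second."""
    return n_items / seconds


def parse_citation(raw: str) -> dict:
    # Hypothetical stand-in for one HTTP request to a GROBID server;
    # it only echoes its input so the sketch runs without a network.
    return {"raw": raw, "ok": True}


def parse_all(citations, workers: int = 8):
    # Threads let the waiting time of independent requests overlap;
    # Executor.map() yields results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_citation, citations))


# 100K citations in about 50 minutes, as stated in the note.
baseline_qps = throughput(100_000, 50 * 60)

# Factor needed to get from the baseline to 1000 citations/s.
speedup_needed = 1000 / baseline_qps
```

Rounded, `baseline_qps` matches the 33 qps in the note, so reaching 1000 citations/s would take roughly a 30x improvement over the single-threaded run.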