From 393b5bb2c6cc53f879239c2410f6b18a7736a1de Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Thu, 28 Oct 2021 20:28:31 +0200
Subject: grobid: update notes

---
 notes/2021_10_grobid_reparse.md | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/notes/2021_10_grobid_reparse.md b/notes/2021_10_grobid_reparse.md
index 0cca5d8..8101ad6 100644
--- a/notes/2021_10_grobid_reparse.md
+++ b/notes/2021_10_grobid_reparse.md
@@ -9,6 +9,9 @@ with grobid, again.
 
 * [ ] find all reparsable strings, e.g. "unmatched refs"
 * [ ] run via `grobid_xml_parse`
+* [ ] collect examples of parsing issues
+
+Reparsing the whole corpus will be part of the scholar raw refs pipeline.
 
 ## Notes
 
@@ -58,5 +61,8 @@ $ zstdcat -T0 date-2021-07-28.json.zst | head -1000000 | indigo.py | jq .c
 }
 ```
 
+A first run with `grobid-tei-xml`, single threaded, about 50min for 100K
+citations, or 33 qps. Each request uses http, we do not batch; this will
+probably be much faster.
 
-
+About 1000 citations/s possible with threads, etc; baseline: 30.
-- 
cgit v1.2.3
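
The throughput figures in the added notes (50 min for 100K citations, roughly 33 qps single threaded, with far more expected from threads) can be checked and illustrated with a minimal sketch. This is not the code used for the run: `qps`, `parse_all` and `fake_parse` are hypothetical names, and `fake_parse` only simulates one HTTP round trip with a short sleep instead of calling GROBID.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def qps(n_items, seconds):
    """Throughput in items per second."""
    return n_items / seconds


def parse_all(citations, parse_one, workers=32):
    """Apply parse_one to every citation via a thread pool; input order is preserved."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_one, citations))


def fake_parse(raw):
    # Hypothetical stand-in for one HTTP request to the parser (assumption).
    time.sleep(0.01)
    return {"raw": raw}


if __name__ == "__main__":
    # Single-threaded baseline from the notes: 100K citations in 50 minutes.
    print(round(qps(100_000, 50 * 60), 1))

    # Threaded fan-out over simulated requests; the measured rate depends on the machine.
    started = time.perf_counter()
    results = parse_all([f"ref {i}" for i in range(200)], fake_parse, workers=32)
    print(len(results), round(qps(len(results), time.perf_counter() - started)))
```

Threads suit this workload because each request spends most of its time waiting on the network. How close a real run gets to 1000 citations/s depends on the GROBID server's own concurrency limits, which this sketch does not model.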