grobid: update notes

author: Martin Czygan <martin.czygan@gmail.com> 2021-10-28 20:28:31 +0200
committer: Martin Czygan <martin.czygan@gmail.com> 2021-10-28 20:28:31 +0200
commit: 393b5bb2c6cc53f879239c2410f6b18a7736a1de (patch)
tree: 01738beb331175e28d50859ca85ed6b2e3a83db4
parent: 7e757b19a4f88ec2639008bfffbe50894674d28d (diff)
1 files changed, 7 insertions, 1 deletions
diff --git a/notes/2021_10_grobid_reparse.md b/notes/2021_10_grobid_reparse.md
index 0cca5d8..8101ad6 100644
--- a/notes/2021_10_grobid_reparse.md
+++ b/notes/2021_10_grobid_reparse.md
@@ -9,6 +9,9 @@ with grobid, again.
 
 * [ ] find all reparsable strings, e.g. "unmatched refs"
 * [ ] run via `grobid_xml_parse`
+* [ ] collect examples of parsing issues
+
+Reparsing the whole corpus will be part of the scholar raw refs pipeline.
 
 ## Notes
 
@@ -58,5 +61,8 @@ $ zstdcat -T0 date-2021-07-28.json.zst | head -1000000 | indigo.py | jq .c
 }
 ```
 
+A first run with `grobid-tei-xml`, single threaded, took about 50min for 100K
+citations, or 33 qps. Each citation is a separate HTTP request, without
+batching; batching or parallel requests will probably be much faster.
 
-
+About 1000 citations/s possible with threads, etc.; single-threaded baseline: about 30 citations/s.
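
The throughput figures in the note can be checked with a little arithmetic, and the "with threads" idea can be sketched with a thread pool. This is a minimal sketch, not the code used in refcat: `throughput`, `parse_all`, and `fake_parse` are hypothetical names, and `fake_parse` is a stand-in for the real per-citation HTTP request to grobid, so the example runs without a server.

```python
from concurrent.futures import ThreadPoolExecutor


def throughput(n_items, seconds):
    """Return items processed per second."""
    return n_items / seconds


def parse_all(citations, parse_one, workers=16):
    """Apply parse_one to every citation using a pool of threads.

    Threads fit here because each call is assumed to wait mostly on
    network I/O. pool.map returns results in input order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_one, citations))


def fake_parse(raw):
    """Hypothetical stand-in for one HTTP request per raw citation string."""
    return {"raw": raw, "length": len(raw)}


# Baseline from the note: 100K citations in about 50 minutes.
baseline_qps = throughput(100_000, 50 * 60)

# Two made-up raw strings, parsed with two worker threads.
parsed = parse_all(["Doe, J. (2020). A title.", "Roe, R. (2019)."], fake_parse, workers=2)
```

With a real request function in place of `fake_parse`, the number of workers would need tuning against what the grobid server can handle; the 1000 citations/s figure in the note is not something this sketch demonstrates.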