author | Martin Czygan <martin.czygan@gmail.com> | 2021-10-28 20:28:31 +0200 |
---|---|---|
committer | Martin Czygan <martin.czygan@gmail.com> | 2021-10-28 20:28:31 +0200 |
commit | 393b5bb2c6cc53f879239c2410f6b18a7736a1de | |
tree | 01738beb331175e28d50859ca85ed6b2e3a83db4 | |
parent | 7e757b19a4f88ec2639008bfffbe50894674d28d | |
grobid: update notes
-rw-r--r-- | notes/2021_10_grobid_reparse.md | 8 |
1 files changed, 7 insertions, 1 deletions
diff --git a/notes/2021_10_grobid_reparse.md b/notes/2021_10_grobid_reparse.md
index 0cca5d8..8101ad6 100644
--- a/notes/2021_10_grobid_reparse.md
+++ b/notes/2021_10_grobid_reparse.md
@@ -9,6 +9,9 @@ with grobid, again.
 
 * [ ] find all reparsable strings, e.g. "unmatched refs"
 * [ ] run via `grobid_xml_parse`
+* [ ] collect examples of parsing issues
+
+Reparsing the whole corpus will be part of the scholar raw refs pipeline.
 
 ## Notes
 
@@ -58,5 +61,8 @@ $ zstdcat -T0 date-2021-07-28.json.zst | head -1000000 | indigo.py | jq .c
 }
 ```
 
+A first run with `grobid-tei-xml`, single threaded, about 50min for 100K
+citations, or 33 qps. Each request uses http, we do not batch; this will
+probably be much faster.
 
-
+About 1000 citations/s possible with threads, etc; baseline: 30.
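
The throughput figures in the added notes can be checked, and the threaded variant sketched, roughly as follows. This is a minimal sketch and not code from the repository: `qps`, `parse_all`, and `parse_one` are hypothetical names, and `parse_one` stands in for whatever issues a single HTTP request to GROBID for one citation string.

```python
from concurrent.futures import ThreadPoolExecutor


def qps(n_items, seconds):
    """Return throughput in items per second."""
    return n_items / seconds


def parse_all(citations, parse_one, max_workers=16):
    """Apply parse_one to every citation using a thread pool.

    parse_one is a hypothetical callable (assumption: one HTTP request
    per citation, so the work is I/O bound and threads can overlap it).
    Results are returned in input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(parse_one, citations))


if __name__ == "__main__":
    # Baseline from the notes: 100K citations in about 50 minutes.
    print(round(qps(100_000, 50 * 60)))  # 33

    # Stub in place of a real request, just to show the call shape.
    print(parse_all(["ref a", "ref b"], str.upper, max_workers=2))
```

Whether threads actually get from about 30 to about 1000 citations/s depends on the server side; the pool only removes the client-side wait between requests.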