author Martin Czygan <martin.czygan@gmail.com> 2021-10-28 20:28:31 +0200
committer Martin Czygan <martin.czygan@gmail.com> 2021-10-28 20:28:31 +0200
commit 393b5bb2c6cc53f879239c2410f6b18a7736a1de
tree 01738beb331175e28d50859ca85ed6b2e3a83db4
parent 7e757b19a4f88ec2639008bfffbe50894674d28d
grobid: update notes
-rw-r--r-- notes/2021_10_grobid_reparse.md 8
1 files changed, 7 insertions, 1 deletions
diff --git a/notes/2021_10_grobid_reparse.md b/notes/2021_10_grobid_reparse.md
index 0cca5d8..8101ad6 100644
--- a/notes/2021_10_grobid_reparse.md
+++ b/notes/2021_10_grobid_reparse.md
@@ -9,6 +9,9 @@ with grobid, again.
* [ ] find all reparsable strings, e.g. "unmatched refs"
* [ ] run via `grobid_xml_parse`
+* [ ] collect examples of parsing issues
+
+Reparsing the whole corpus will be part of the scholar raw refs pipeline.
## Notes
@@ -58,5 +61,8 @@ $ zstdcat -T0 date-2021-07-28.json.zst | head -1000000 | indigo.py | jq .c
}
```
+A first run with `grobid-tei-xml`, single threaded, took about 50min for 100K
+citations, or 33 qps. Each request uses http and we do not batch; batched or
+parallel requests will probably be much faster.
-
+About 1000 citations/s possible with threads, etc; baseline: 30.
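
The throughput figures in the note can be checked, and the threaded variant sketched, roughly as follows. This is a minimal sketch, not the code used for the run: `qps`, `parse_all`, and `fake_parse` are hypothetical names, and `fake_parse` only simulates the latency of one HTTP round trip instead of calling a real GROBID service.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def qps(count: int, seconds: float) -> float:
    """Throughput in items per second."""
    return count / seconds


def parse_all(
    citations: Iterable[str],
    parse_one: Callable[[str], str],
    workers: int = 16,
) -> List[str]:
    """Apply parse_one to each citation in a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_one, citations))


def fake_parse(citation: str) -> str:
    """Hypothetical stand-in for one HTTP request to a citation parser."""
    time.sleep(0.01)  # simulated network and parse latency (assumption)
    return citation.upper()


if __name__ == "__main__":
    # Single-threaded figure from the note: 100K citations in about 50 minutes.
    print(round(qps(100_000, 50 * 60), 1))

    refs = [f"ref {i}" for i in range(200)]
    started = time.perf_counter()
    parsed = parse_all(refs, fake_parse, workers=16)
    elapsed = time.perf_counter() - started
    print(len(parsed), "citations at about", round(qps(len(parsed), elapsed)), "per second")
```

Threads fit here because each call spends most of its time waiting on the network, so the worker count mainly bounds the number of requests in flight. The speedup that can actually be reached depends on how many concurrent requests the parser service handles.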