-rw-r--r-- notes/2021_10_grobid_reparse.md 8
1 files changed, 7 insertions, 1 deletions
diff --git a/notes/2021_10_grobid_reparse.md b/notes/2021_10_grobid_reparse.md
index 0cca5d8..8101ad6 100644
--- a/notes/2021_10_grobid_reparse.md
+++ b/notes/2021_10_grobid_reparse.md
@@ -9,6 +9,9 @@ with grobid, again.
* [ ] find all reparsable strings, e.g. "unmatched refs"
* [ ] run via `grobid_xml_parse`
+* [ ] collect examples of parsing issues
+
+Reparsing the whole corpus will be part of the scholar raw refs pipeline.
## Notes
@@ -58,5 +61,8 @@ $ zstdcat -T0 date-2021-07-28.json.zst | head -1000000 | indigo.py | jq .c
}
```
+A first run with `grobid-tei-xml`, single-threaded, took about 50 min for 100K
+citations, or about 33 qps. Each citation is a separate HTTP request, with no
+batching; parallel or batched requests will probably be much faster.
-
+About 1000 citations/s is possible with threads, etc.; single-threaded baseline: about 30.
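
The single-threaded figure follows from 100,000 citations / (50 × 60 s) ≈ 33 per second. The snippet below is not part of the commit; it is a minimal sketch of the "with threads" idea in the note, assuming each citation is parsed by one blocking call. `parse_all` and `fake_parse` are hypothetical names, and `fake_parse` only simulates request latency with a sleep; it does not talk to a GROBID server, and the sleep time is not a measured value.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def parse_all(
    citations: Iterable[str],
    parse_one: Callable[[str], str],
    max_workers: int = 16,
) -> List[str]:
    """Apply parse_one to every citation in a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of the inputs
        return list(pool.map(parse_one, citations))


def fake_parse(citation: str) -> str:
    """Hypothetical stand-in for one blocking HTTP round trip per citation."""
    time.sleep(0.01)  # simulated latency (assumption, not a measurement)
    return citation.upper()


if __name__ == "__main__":
    refs = [f"ref {i}" for i in range(200)]
    started = time.perf_counter()
    parsed = parse_all(refs, fake_parse)
    elapsed = time.perf_counter() - started
    print(f"{len(parsed) / elapsed:.0f} citations/s")
```

Threads fit here because the work is dominated by waiting on I/O; how far this scales in practice would depend on how many concurrent requests the parsing server can handle, which this sketch does not model.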