1 files changed, 62 insertions, 0 deletions
diff --git a/notes/2021_10_grobid_reparse.md b/notes/2021_10_grobid_reparse.md
new file mode 100644
index 0000000..0cca5d8
--- /dev/null
+++ b/notes/2021_10_grobid_reparse.md
@@ -0,0 +1,62 @@
+# Grobid reparse
+
+Want: Better match yield.
+
+> Find out what we have not matched yet and try to parse remaining data
+with grobid, again.
+
+## TODO
+
+* [ ] find all reparsable strings, e.g. "unmatched refs"
+* [ ] run via `grobid_xml_parse`
+
+## Notes
+
+```
+martin@ia601101:/magna/refcat/2021-07-28/UnmatchedRefs $ zstdcat -T0 date-2021-07-28.json.zst | pv -l | wc -l
+272M 0:05:13 [ 867k/s] [ <=> ]
+272119381
+```
+
+The unmatched refs set seems small: 272119381 docs currently. Start with that anyway.
+
+Expecting about 70% of docs to have an "unstructured" field, though many already have other fields as well.
+
+```
+$ zstdcat -T0 date-2021-07-28.json.zst | pv -l | LC_ALL=C grep -c -F '"unstructured"'
+272M 0:04:51 [ 933k/s] [ <=> ]
+192754239
+```
+
+192.7M docs (about 71%) have an "unstructured" field, but may have other fields, too.
+
+Sample field counts:
+
+```
+$ zstdcat -T0 date-2021-07-28.json.zst | head -1000000 | indigo.py | jq .c
+{
+  "biblio": 1000000,
+  "biblio.container_name": 362777,
+  "biblio.contrib_raw_names": 544585,
+  "biblio.pages": 356993,
+  "biblio.volume": 338590,
+  "biblio.year": 441336,
+  "biblio.extra": 1000000,
+  "biblio.extra.isbn": 1000000,
+  "index": 1000000,
+  "key": 968748,
+  "ref_source": 1000000,
+  "release_year": 944441,
+  "release_ident": 1000000,
+  "release_stage": 945897,
+  "work_ident": 1000000,
+  "biblio.unstructured": 706717,
+  "biblio.issue": 50639,
+  "biblio.publisher": 12162,
+  "locator": 12808,
+  "biblio.url": 7418
+}
+```
+
+
+