-rw-r--r-- notes/2021_10_grobid_reparse.md 62
1 files changed, 62 insertions, 0 deletions
diff --git a/notes/2021_10_grobid_reparse.md b/notes/2021_10_grobid_reparse.md
new file mode 100644
index 0000000..0cca5d8
--- /dev/null
+++ b/notes/2021_10_grobid_reparse.md
@@ -0,0 +1,62 @@
+# Grobid reparse
+
+Want: Better match yield.
+
+> Find out what we have not matched yet and try to parse the remaining data
+with grobid again.
+
+## TODO
+
+* [ ] find all reparsable strings, e.g. "unmatched refs"
+* [ ] run via `grobid_xml_parse`
+
+## Notes
+
+```
+martin@ia601101:/magna/refcat/2021-07-28/UnmatchedRefs $ zstdcat -T0 date-2021-07-28.json.zst | pv -l | wc -l
+272M 0:05:13 [ 867k/s] [ <=> ]
+272119381
+```
+
+The unmatched refs set seems small: currently 272119381 docs. Start with that anyway.
+
+Expecting about 70% of docs to have an "unstructured" field, though many already have other fields as well.
+
+```
+$ zstdcat -T0 date-2021-07-28.json.zst | pv -l | LC_ALL=C grep -c -F '"unstructured"'
+272M 0:04:51 [ 933k/s] [ <=> ]
+192754239
+```
+
+192M docs (about 70.8%) have an "unstructured" field, but they may have other fields, too.
+
+Sample field counts:
+
+```
+$ zstdcat -T0 date-2021-07-28.json.zst | head -1000000 | indigo.py | jq .c
+{
+ "biblio": 1000000,
+ "biblio.container_name": 362777,
+ "biblio.contrib_raw_names": 544585,
+ "biblio.pages": 356993,
+ "biblio.volume": 338590,
+ "biblio.year": 441336,
+ "biblio.extra": 1000000,
+ "biblio.extra.isbn": 1000000,
+ "index": 1000000,
+ "key": 968748,
+ "ref_source": 1000000,
+ "release_year": 944441,
+ "release_ident": 1000000,
+ "release_stage": 945897,
+ "work_ident": 1000000,
+ "biblio.unstructured": 706717,
+ "biblio.issue": 50639,
+ "biblio.publisher": 12162,
+ "locator": 12808,
+ "biblio.url": 7418
+}
+```
+
+
+
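The sampled field counts in the note could be reproduced along the following lines. This is a minimal sketch, not the `indigo.py` script used above (its source is not part of this diff). It assumes one JSON object per line with nested fields such as `biblio.extra.isbn`; the names `key_paths` and `count_key_paths` and the sample documents are hypothetical.

```python
import json
from collections import Counter


def key_paths(doc, prefix=""):
    """Yield the dotted path of every key in a nested dict."""
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        yield path
        if isinstance(value, dict):
            # Descend into nested objects, e.g. biblio -> biblio.extra.
            yield from key_paths(value, path)


def count_key_paths(lines):
    """Count, per dotted key path, how many documents contain it.

    Assumes every non-empty line is a JSON object (an assumption about
    the input, not something the note states).
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        # A set makes the per-document contribution explicit: each path
        # adds at most one to its count for a given document.
        counts.update(set(key_paths(doc)))
    return counts


# Made-up sample documents, for illustration only.
sample = [
    '{"biblio": {"unstructured": "Smith, J. (2001) ...", "year": 2001}, "key": "a"}',
    '{"biblio": {"extra": {"isbn": "0000000000"}}, "key": "b"}',
]
counts = count_key_paths(sample)
print(json.dumps(dict(sorted(counts.items())), indent=2))
```

To run it over the real data, the lines could come from `sys.stdin` instead of `sample`, fed by the same `zstdcat -T0 ... | head -1000000` pipeline as in the note. Counting `biblio.unstructured` this way matches on the parsed key rather than on a raw substring, which is what `grep -F '"unstructured"'` does.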