-rw-r--r-- | notes/2021_10_grobid_reparse.md | 62 |
1 files changed, 62 insertions, 0 deletions
diff --git a/notes/2021_10_grobid_reparse.md b/notes/2021_10_grobid_reparse.md
new file mode 100644
index 0000000..0cca5d8
--- /dev/null
+++ b/notes/2021_10_grobid_reparse.md
@@ -0,0 +1,62 @@
+# Grobid reparse
+
+Want: Better match yield.
+
+> Find out what we have not matched yet and try to parse remaining data
+with grobid, again.
+
+## TODO
+
+* [ ] find all reparsable strings, e.g. "unmatched refs"
+* [ ] run via `grobid_xml_parse`
+
+## Notes
+
+```
+martin@ia601101:/magna/refcat/2021-07-28/UnmatchedRefs $ zstdcat -T0 date-2021-07-28.json.zst | pv -l | wc -l
+272M 0:05:13 [ 867k/s] [ <=> ]
+272119381
+```
+
+Unmatched refs seems small: 272119381 docs, currently, start with that, anyway.
+
+Expecting 70% docs with "unstructured" field; but many have other fields also, already.
+
+```
+$ zstdcat -T0 date-2021-07-28.json.zst | pv -l | LC_ALL=C grep -c -F '"unstructured"'
+272M 0:04:51 [ 933k/s] [ <=> ]
+192754239
+```
+
+192M have unstructured (70%), but may have other fields, too.
+
+Sample field counts:
+
+```
+$ zstdcat -T0 date-2021-07-28.json.zst | head -1000000 | indigo.py | jq .c
+{
+  "biblio": 1000000,
+  "biblio.container_name": 362777,
+  "biblio.contrib_raw_names": 544585,
+  "biblio.pages": 356993,
+  "biblio.volume": 338590,
+  "biblio.year": 441336,
+  "biblio.extra": 1000000,
+  "biblio.extra.isbn": 1000000,
+  "index": 1000000,
+  "key": 968748,
+  "ref_source": 1000000,
+  "release_year": 944441,
+  "release_ident": 1000000,
+  "release_stage": 945897,
+  "work_ident": 1000000,
+  "biblio.unstructured": 706717,
+  "biblio.issue": 50639,
+  "biblio.publisher": 12162,
+  "locator": 12808,
+  "biblio.url": 7418
+}
+```
+
+
+
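The `indigo.py` script used for the sample field counts is not part of this diff. Judging only from its output (dotted key paths mapped to document counts under a top-level `c` key), a minimal sketch of a comparable counter might look like the following. The function names `key_paths` and `count_key_paths` are hypothetical, and the behavior is an assumption, not the actual script.

```python
import json
from collections import Counter


def key_paths(obj, prefix=""):
    """Yield the dotted path of every key in a nested JSON object.

    Intermediate objects are reported too, so {"biblio": {"year": 1}}
    yields both "biblio" and "biblio.year".
    """
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        yield path
        if isinstance(value, dict):
            yield from key_paths(value, path)


def count_key_paths(lines):
    """Count in how many documents each dotted key path occurs.

    Assumes one JSON object per line (as in the unmatched refs dump);
    blank lines are skipped.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # set(): count each path at most once per document
        counts.update(set(key_paths(json.loads(line))))
    return counts
```

To mimic the pipeline in the notes, one could pass `sys.stdin` to `count_key_paths` and print `json.dumps({"c": counts})`, then pipe the result through `jq .c`. Parsing each document this way counts only real keys, whereas the `grep -c -F '"unstructured"'` count above also matches lines where that string occurs inside a value.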