# Grobid reparse

Want: better match yield.

> Find out what we have not matched yet and try to parse the remaining data with GROBID, again.

## TODO

* [ ] find all reparsable strings, e.g. "unmatched refs"
* [ ] run via `grobid_xml_parse`
* [ ] collect examples of parsing issues

Reparsing the whole corpus will be part of the scholar raw refs pipeline.

## Notes

```
martin@ia601101:/magna/refcat/2021-07-28/UnmatchedRefs $ zstdcat -T0 date-2021-07-28.json.zst | pv -l | wc -l
 272M 0:05:13 [ 867k/s] [ <=> ]
272119381
```

The unmatched refs set seems small: 272,119,381 docs currently; start with that, anyway. Expecting about 70% of docs to have an "unstructured" field, though many already have other fields as well.

```
$ zstdcat -T0 date-2021-07-28.json.zst | pv -l | LC_ALL=C grep -c -F '"unstructured"'
 272M 0:04:51 [ 933k/s] [ <=> ]
192754239
```

192M docs (70%) have "unstructured", but may have other fields, too. Sample field counts over the first 1M docs:

```
$ zstdcat -T0 date-2021-07-28.json.zst | head -1000000 | indigo.py | jq .c
{
  "biblio": 1000000,
  "biblio.container_name": 362777,
  "biblio.contrib_raw_names": 544585,
  "biblio.pages": 356993,
  "biblio.volume": 338590,
  "biblio.year": 441336,
  "biblio.extra": 1000000,
  "biblio.extra.isbn": 1000000,
  "index": 1000000,
  "key": 968748,
  "ref_source": 1000000,
  "release_year": 944441,
  "release_ident": 1000000,
  "release_stage": 945897,
  "work_ident": 1000000,
  "biblio.unstructured": 706717,
  "biblio.issue": 50639,
  "biblio.publisher": 12162,
  "locator": 12808,
  "biblio.url": 7418
}
```

A first run with `grobid-tei-xml`, single-threaded, took about 50 min for 100K citations, or roughly 33 qps. Each request goes over HTTP and we do not batch; batching would probably make this much faster. With threads etc., about 1,000 citations/s should be possible, versus the ~33/s single-threaded baseline.
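
For the "run via `grobid_xml_parse`" TODO item, a minimal sketch of the reparse loop, not the pipeline code itself: it assumes a GROBID server at localhost:8070 and uses `grobid_tei_xml` for TEI parsing (the `parse_citation_xml` call follows that library's README; names may differ across versions).

```python
# Sketch only: pipe unmatched-refs JSON lines through GROBID's citation parser.
# Assumes a GROBID instance at localhost:8070; parse_citation_xml() follows
# the grobid_tei_xml README and may return None for unparseable strings.
import json
import sys

import grobid_tei_xml
import requests

GROBID_URL = "http://localhost:8070/api/processCitation"

def reparse(raw: str):
    """Send one raw citation string to GROBID, return the parsed biblio (or None)."""
    resp = requests.post(
        GROBID_URL,
        data={"citations": raw, "consolidateCitations": "0"},
        timeout=30,
    )
    resp.raise_for_status()
    return grobid_tei_xml.parse_citation_xml(resp.text)

for line in sys.stdin:
    doc = json.loads(line)
    raw = (doc.get("biblio") or {}).get("unstructured")
    if not raw:
        continue
    parsed = reparse(raw)
    if parsed is not None:
        print(parsed)  # inspect, or merge back into the refs doc
```

Something like `zstdcat -T0 date-2021-07-28.json.zst | head -1000 | python reparse_sketch.py` would also surface examples for the "parsing issues" TODO item (`reparse_sketch.py` is a hypothetical filename).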
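
To get from the ~33/s baseline toward 1,000 citations/s, a rough sketch of fanning requests out over a thread pool; the worker count is a guess, and GROBID's own concurrency settings will cap throughput:

```python
# Sketch only: concurrent processCitation requests via a thread pool.
# max_workers=32 is a starting guess; tune against the GROBID server's limits.
import sys
from concurrent.futures import ThreadPoolExecutor

import requests

GROBID_URL = "http://localhost:8070/api/processCitation"

def process_one(raw: str) -> str:
    resp = requests.post(GROBID_URL, data={"citations": raw}, timeout=30)
    resp.raise_for_status()
    return resp.text  # TEI XML; parse downstream with grobid_tei_xml

with ThreadPoolExecutor(max_workers=32) as pool:
    raws = (line.strip() for line in sys.stdin if line.strip())
    # map() preserves input order, so output stays aligned with input lines.
    for tei in pool.map(process_one, raws):
        print(tei.replace("\n", " "))  # crudely: one TEI blob per line
```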
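
Aside: `indigo.py` above is an internal helper; judging from the output shape, it counts dotted key paths per document. A guess at an equivalent, for anyone without the tool:

```python
# Sketch only: a guess at what indigo.py computes, based on its output shape --
# frequency of dotted key paths across JSON lines, emitted as {"c": {...}}.
import collections
import json
import sys

counts = collections.Counter()

def walk(obj, prefix=""):
    """Count every key path like 'biblio.container_name' once per occurrence."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            counts[path] += 1
            walk(value, path)

for line in sys.stdin:
    walk(json.loads(line))

print(json.dumps({"c": counts}, indent=2))
```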