# Overview

## Data inputs

Mostly JSON, but each input differs in form and quality.

Core inputs:

* refs schema, from metadata or GROBID (1-4B)
* fatcat release entities (100-200M)
* Open Library Solr export (10-50M)

Other inputs:

* ResearchGate sitemap, titles (10-30M)
* OAI-PMH harvest metadata (50-200M)
* SIM (Serials in Microfilm, "microfilm") metadata

Inputs related to evaluation:

* BASE metadata dump (200-300M)
* Microsoft Academic Graph, MAG (100-300M)

Ad-hoc inputs:

* a single title, e.g. ILL related (1)
* lists of titles (1-1M)

## Targets

### BiblioRef

The most important high-level target and the basic schema for the current setup: small, Elasticsearch-indexable JSON documents that allow basic aggregations and lookups. Producing these is not just a format conversion; it may involve clustering, verification, etc.

## Approach

We might call it "local map-reduce": we try to do it all in a single map-reduce setup (sketched below), e.g.

* extract relevant fields and sort (map)
* apply computation on groups (reduce)

Since we want performance and sometimes custom code (e.g. for finding information in unstructured data), we group the code into a Go library with a suite of command-line tools that are easy to build and deploy. If the scaffolding is good, we can plug in mappers and reducers as we go and expose them in the tools.
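
As a rough illustration of the map/sort/reduce pattern (not the project's actual tooling or API), here is a minimal Go sketch: a mapper that key-prefixes newline-delimited JSON records so an external `sort` can bring groups together, and a reducer that walks the sorted stream group by group. The `title` field, the `-reduce` flag, and the per-key count are assumptions made for this example.

```go
// Minimal sketch of the "local map-reduce" pattern, assuming newline-delimited
// JSON input with a "title" field; names and behavior here are illustrative only.
package main

import (
	"bufio"
	"encoding/json"
	"flag"
	"fmt"
	"os"
	"strings"
)

// mapLine extracts a grouping key (here: a lowercased, trimmed title) and
// re-emits the record as "key<TAB>line", so a plain external `sort` can bring
// records with the same key next to each other.
func mapLine(line string) (string, bool) {
	var doc struct {
		Title string `json:"title"`
	}
	if err := json.Unmarshal([]byte(line), &doc); err != nil || doc.Title == "" {
		return "", false
	}
	key := strings.ToLower(strings.TrimSpace(doc.Title))
	return key + "\t" + line, true
}

// reduceGroups reads sorted "key<TAB>line" records and applies fn to each
// group of lines sharing a key (a stand-in for clustering, verification, etc.).
func reduceGroups(scanner *bufio.Scanner, fn func(key string, lines []string)) {
	var curKey string
	var group []string
	flush := func() {
		if len(group) > 0 {
			fn(curKey, group)
		}
	}
	for scanner.Scan() {
		parts := strings.SplitN(scanner.Text(), "\t", 2)
		if len(parts) != 2 {
			continue
		}
		if parts[0] != curKey {
			flush()
			curKey, group = parts[0], nil
		}
		group = append(group, parts[1])
	}
	flush()
}

func main() {
	reduce := flag.Bool("reduce", false, "run the reduce phase over sorted, key-prefixed input")
	flag.Parse()

	scanner := bufio.NewScanner(os.Stdin)
	scanner.Buffer(make([]byte, 0, 1024*1024), 16*1024*1024) // allow long JSON lines

	if *reduce {
		// Reduce phase: here we just count records per key.
		reduceGroups(scanner, func(key string, lines []string) {
			fmt.Printf("%s\t%d\n", key, len(lines))
		})
		return
	}
	// Map phase: JSON lines on stdin -> key-prefixed lines on stdout.
	for scanner.Scan() {
		if out, ok := mapLine(scanner.Text()); ok {
			fmt.Println(out)
		}
	}
}
```

Typical use would be a shell pipeline with an external sort between the two phases, e.g. `./tool < refs.jsonl | LC_ALL=C sort | ./tool -reduce` (the binary name is hypothetical). The same scaffolding should make it easy to swap in other mappers and reducers and expose them in the command-line tools.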