parsing:
- multiple editors
- proceedings and '' event name; also address fields?
- DOI and URL redundancy
    url: "DOI=http://doi.acm.org/10.1145/3098275.3051123"
    doi: "=10.1080/14786440109462720"
- arxiv:cs.LG/1301.0604

priorities:
x test with citationList output
x level=a (in title matches)
x fix URL parsing support (?)
- parse_citation_xml()
    => returns None if didn't parse well
    => does not set 'index'
- parse_citation_list_xml()
    current parse_citations_xml, which then aliases to this
. test with GROBID 0.7.0/0.7.1 output
x PDF MD5

nope
    ISBN/ISBN13: nope
    subtitle: nope
    number
    edition
    publisherPlace
    suffix (?)

skipping:
    oaUrl
        ptr type="open-access" target

done, needs test:
    PII
    ark
    istexId
    email: under author
    orcid:
    journal_abbrev
    , persName
    <contributor role="editor
    meeting for proceedings
        <meeting>
        address fields
    web
        ptr target
        <ptr target="https://sfp.dpe.gov.bd/site/policies/b675228c-7bba-4feb-ae02-eb55de027fca/" />
        alternative URL?
    conf (conference stuff generally)
    keywords

=> rg -i abbrev tests/files/
=> 'url' in a citation ?

handle orgName for <author>

refactors:
x remove old grobid2json (?)
- more test coverage:
    URL
    _simplify_dict()
    journal name variants

fields/coverage:
- other/report identifier
    Johnson, C. K. (1976). ORTEPII. Report ORNL-5138. Oak Ridge National Laboratory, Tennessee, USA.
        <idno>ORNL-5138</idno>
    Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191, November 1990.
        <idno>RFC 1191</idno>
- parse old-style arxiv identifiers out of <idno>
    K.R. Dienes, C. Kolda and J. March-Russell, hep-ph/9610479.
        <idno>hep-ph/9610479</idno>
    B.A. Dobrescu, hep-ph/9510424.
    S.P. Martin, hep-ph/9608224.
    https://github.com/internetarchive/fatcat/issues/84

features:
x citations as well as full bodies
x default is to parse into a dataclass (?) similar to XML format
x ... which can transform to JSON
x to_csl_dict() helper (both for header/document and citation)
- to_s2orc_metadata_dict() / to_s2orc_body_dict()
- structured parsing of body, abstract, etc
    => paragraphs / sections
    => citation contexts
    => table / figure / equation
    => footers
- all_urls() on document, including footnotes, body, bibref, etc
- optional post-processing "cleanups" (run on doc or at parse time via kwarg flag)
    => DOI and other identifier validity, via regex (?)
    => year/date validity (eg, sane year range, valid month/day)
    => "ibid"
- optional quality checks on header, citation, body
    => is_empty()
    => "is reference a stub"
    => "is header metadata valid"
    => "was body extracted successfully"
    => {title, author}, {journal, volume, issue, page}, {journal, title, year}

infrastructure:
- tox testing, for multiple python versions
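
sketch (identifier cleanup):
A rough sketch of how the identifier items above might be handled: the "DOI=" / "=" prefixes on url and doi, report identifiers such as RFC 1191, and old-style arxiv ids in <idno>. Every name here (clean_doi, clean_url, classify_idno, and the regexes) is hypothetical and not part of the current API. The patterns are assumptions, deliberately conservative, and probably miss edge cases.

```python
import re
from typing import Optional, Tuple

# All helpers below are hypothetical illustrations, not existing library code.

# Assumed DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_RE = re.compile(r"10\.\d{4,9}/\S+")
URL_RE = re.compile(r"https?://\S+")

# Old-style arXiv id: archive, optional subject class, "/", YYMMNNN.
OLD_ARXIV_RE = re.compile(
    r"^(?:arxiv:)?([a-z-]+(?:\.[a-z]{2})?/\d{7})(?:v\d+)?$", re.IGNORECASE
)
# New-style arXiv id (YYMM.NNNN or YYMM.NNNNN), tolerating a stray
# archive prefix as in "arxiv:cs.LG/1301.0604".
NEW_ARXIV_RE = re.compile(
    r"^(?:arxiv:)?(?:[a-z-]+(?:\.[a-z]{2})?/)?(\d{4}\.\d{4,5})(?:v\d+)?$",
    re.IGNORECASE,
)
RFC_RE = re.compile(r"^RFC\s?(\d+)$", re.IGNORECASE)


def clean_doi(raw: Optional[str]) -> Optional[str]:
    """Return the first DOI-shaped substring, lowercased, or None."""
    match = DOI_RE.search(raw or "")
    if not match:
        return None
    return match.group(0).rstrip(".,;").lower()


def clean_url(raw: Optional[str]) -> Optional[str]:
    """Drop junk such as a leading 'DOI=' and keep the http(s) URL."""
    match = URL_RE.search(raw or "")
    return match.group(0) if match else None


def classify_idno(raw: str) -> Tuple[str, str]:
    """Guess the type of a bare <idno> string: arxiv, rfc, doi, or other."""
    text = raw.strip()
    for pattern in (OLD_ARXIV_RE, NEW_ARXIV_RE):
        match = pattern.match(text)
        if match:
            return ("arxiv", match.group(1))
    match = RFC_RE.match(text)
    if match:
        return ("rfc", match.group(1))
    doi = clean_doi(text)
    if doi:
        return ("doi", doi)
    # Unrecognized report numbers (eg, ORNL-5138) pass through unchanged.
    return ("other", text)
```

Something like classify_idno("hep-ph/9610479") would then tag the value as arxiv, while ORNL-5138 falls through to "other". Whether this belongs at parse time or in the optional "cleanups" pass is still open.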