From 3f4f785857481a2e7c9a65a19a9eb2bf0480e153 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Wed, 4 Jan 2023 21:38:27 -0800
Subject: commit old TODO file

---
 TODO | 114 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 114 insertions(+)
 create mode 100644 TODO

diff --git a/TODO b/TODO
new file mode 100644
index 0000000..5b6ef7a
--- /dev/null
+++ b/TODO
@@ -0,0 +1,114 @@
+
+parsing:
+- multiple editors
+- proceedings and '' event name; also address fields?
+- DOI and URL redundancy
+    url: "DOI=http://doi.acm.org/10.1145/3098275.3051123"
+    doi: "=10.1080/14786440109462720"
+- arxiv:cs.LG/1301.0604
+
+priorities:
+x test with citationList output
+x level=a (in title matches)
+x fix URL parsing support (?)
+- parse_citation_xml()
+    => returns None if didn't parse well
+    => does not set 'index'
+- parse_citation_list_xml()
+    current parse_citations_xml, which then aliases to this
+. test with GROBID 0.7.0/0.7.1 output
+    x PDF MD5
+    nope
+    ISBN/ISBN13: nope
+    subtitle: nope
+    number
+    edition
+    publisherPlace
+    suffix (?)
+    skipping:
+        oaUrl
+        ptr type=\"open-access\" target
+    done, needs test:
+        PII
+        ark
+        istexId
+        email: under author
+        orcid:
+        journal_abbrev
+        , persName
+        <contributor role=\"editor
+        meeting
+            for proceedings
+            <meeting>
+            address fields
+
+    web
+        ptr target
+        <ptr target="https://sfp.dpe.gov.bd/site/policies/b675228c-7bba-4feb-ae02-eb55de027fca/" />
+        alternative URL?
+
+    conf (conference stuff generally)
+    keywords
+
+
+    => rg -i abbrev tests/files/
+    => 'url' in a citation
+? handle orgName for <author>
+
+refactors:
+x remove old grobid2json (?)
+- more test coverage:
+    URL
+    _simplify_dict()
+    journal name variants
+
+fields/coverage:
+- other/report identifier
+    Johnson, C. K. (1976). ORTEPII. Report ORNL-5138. Oak Ridge National Laboratory, Tennessee, USA.
+    <idno>ORNL-5138</idno>
+    Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191, November 1990.
+    <idno>RFC 1191</idno>
+- parse old-style arxiv identifiers out of <idno>
+    K.R. Dienes, C. Kolda and J. March-Russell, hep-ph/9610479.
+    <idno>hep-ph/9610479</idno>
+    B.A. Dobrescu, hep-ph/9510424.
+    S.P. Martin, hep-ph/9608224.
+    https://github.com/internetarchive/fatcat/issues/84
+
+features:
+x citations as well as full bodies
+x default is to parse into a dataclass (?) similar to XML format
+x ... which can transform to JSON
+x to_csl_dict() helper (both for header/document and citation)
+- to_s2orc_metadata_dict() / to_s2orc_body_dict()
+- structured parsing of body, abstract, etc
+    => paragraphs / sections
+    => citation contexts
+    => table / figure / equation
+    => footers
+- all_urls() on document, including footnotes, body, bibref, etc
+- optional post-processing "cleanups" (run on doc or at parse time via kwarg flag)
+    => DOI and other identifier validity, via regex (?)
+    => year/date validity (eg, sane year range, valid month/day)
+    => "ibid"
+- optional quality checks on header, citation, body
+    => is_empty()
+    => "is reference a stub"
+    => "is header metadata valid"
+    => "was body extracted successfully"
+    => {title, author}, {journal, volume, issue, page}, {journal, title, year}
+
+infrastructure:
+- tox testing, for multiple python versions
+
+
-- 
cgit v1.2.3
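
A minimal sketch of two identifier cleanups named in the TODO: pulling a bare DOI out of noisy `url`/`doi` values, and recognizing old-style arXiv identifiers found in `<idno>`. It assumes plain regex matching is sufficient. The function names `extract_doi` and `parse_old_arxiv_id`, and both patterns, are hypothetical illustrations and not part of the library's API.

```python
import re
from typing import Optional

# Assumption: a DOI is "10.", a 4-9 digit registrant code, "/", then a
# non-whitespace suffix. This is a common heuristic, not a full validator.
DOI_RE = re.compile(r"(10\.\d{4,9}/\S+)")

# Assumption: an old-style arXiv identifier is an archive name, an optional
# two-letter subject class, "/", and seven digits, with an optional "arxiv:"
# prefix and an optional version suffix such as "v2".
OLD_ARXIV_RE = re.compile(
    r"^(?:arxiv:)?([a-z][a-z\-]+(?:\.[a-z]{2})?/\d{7})(?:v\d+)?$",
    re.IGNORECASE,
)


def extract_doi(raw: str) -> Optional[str]:
    """Return the first DOI-shaped substring of `raw`, or None."""
    match = DOI_RE.search(raw)
    return match.group(1) if match else None


def parse_old_arxiv_id(raw: str) -> Optional[str]:
    """Return an old-style arXiv identifier without prefix/version, or None."""
    match = OLD_ARXIV_RE.match(raw.strip())
    return match.group(1) if match else None


if __name__ == "__main__":
    # Values taken from the TODO notes.
    print(extract_doi("DOI=http://doi.acm.org/10.1145/3098275.3051123"))
    print(extract_doi("=10.1080/14786440109462720"))
    print(parse_old_arxiv_id("hep-ph/9610479"))
    print(parse_old_arxiv_id("RFC 1191"))
```

With these patterns, report identifiers such as `ORNL-5138` and `RFC 1191` are not mistaken for arXiv identifiers, and the mixed-form `arxiv:cs.LG/1301.0604` is deliberately left unmatched, since how to treat it is still an open item.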