author | Bryan Newbold <bnewbold@archive.org> | 2023-01-04 21:38:27 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2023-01-04 21:38:27 -0800 |
commit | 3f4f785857481a2e7c9a65a19a9eb2bf0480e153 (patch) | |
tree | 55fc8addfa380c59c6be3e700678a1870ced427b /TODO | |
parent | 035c662020ec1e98ec25ab707a076e55791ae212 (diff) | |
download | grobid_tei_xml-3f4f785857481a2e7c9a65a19a9eb2bf0480e153.tar.gz grobid_tei_xml-3f4f785857481a2e7c9a65a19a9eb2bf0480e153.zip |
commit old TODO file
Diffstat (limited to 'TODO')
-rw-r--r-- | TODO | 114 |
1 files changed, 114 insertions, 0 deletions
@@ -0,0 +1,114 @@
+
+parsing:
+- multiple editors
+- proceedings and '<meeting>' event name; also address fields?
+- DOI and URL redundancy
+    url: "DOI=http://doi.acm.org/10.1145/3098275.3051123"
+    doi: "=10.1080/14786440109462720"
+- <idno>arxiv:cs.LG/1301.0604</idno>
+
+priorities:
+x test with citationList output
+x level=a (in title matches)
+x fix URL parsing support (?)
+- parse_citation_xml()
+    => returns None if didn't parse well
+    => does not set 'index'
+- parse_citation_list_xml()
+    current parse_citations_xml, which then aliases to this
+. test with GROBID 0.7.0/0.7.1 output
+    x PDF MD5
+    nope
+        ISBN/ISBN13: nope
+        subtitle: nope
+        number
+        edition
+        publisherPlace
+        suffix (?)
+    skipping:
+        oaUrl
+            ptr type=\"open-access\" target
+    done, needs test:
+        PII
+        ark
+        istexId
+        email: <email> under author
+        orcid: <idno type=\"ORCID\">
+        journal_abbrev
+            <title level=\"j\" type=\"abbrev
+        journal
+            level=\"j
+        bookTitle
+            title level=\"m (not main)
+        serieTitle
+            title level=\"s (not main)
+        institution (why?)
+            respStmt, orgName
+        editor
+            <editor>, persName
+            <contributor role=\"editor
+        meeting
+            for proceedings
+            <meeting>
+            address fields
+
+    web
+        ptr target
+            <ptr target="https://sfp.dpe.gov.bd/site/policies/b675228c-7bba-4feb-ae02-eb55de027fca/" />
+        alternative URL?
+
+    conf (conference stuff generally)
+    keywords
+
+
+    => rg -i abbrev tests/files/
+    => 'url' in a citation
+? handle orgName for <author>
+
+refactors:
+x remove old grobid2json (?)
+- more test coverage:
+    URL
+    _simplify_dict()
+    journal name variants
+
+fields/coverage:
+- other/report identifier
+    Johnson, C. K. (1976). ORTEPII. Report ORNL-5138. Oak Ridge National Laboratory, Tennessee, USA.
+        <idno>ORNL-5138</idno>
+    Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191, November 1990.
+        <idno>RFC 1191</idno>
+- parse old-style arxiv identifiers out of <idno>
+    K.R. Dienes, C. Kolda and J. March-Russell, hep-ph/9610479.
+        <idno>hep-ph/9610479</idno>
+    B.A. Dobrescu, hep-ph/9510424.
+    S.P. Martin, hep-ph/9608224.
+    https://github.com/internetarchive/fatcat/issues/84
+
+features:
+x citations as well as full bodies
+x default is to parse into a dataclass (?) similar to XML format
+x ... which can transform to JSON
+x to_csl_dict() helper (both for header/document and citation)
+- to_s2orc_metadata_dict() / to_s2orc_body_dict()
+- structured parsing of body, abstract, etc
+    => paragraphs / sections
+    => citation contexts
+    => table / figure / equation
+    => footers
+- all_urls() on document, including footnotes, body, bibref, etc
+- optional post-processing "cleanups" (run on doc or at parse time via kwarg flag)
+    => DOI and other identifier validity, via regex (?)
+    => year/date validity (eg, sane year range, valid month/day)
+    => "ibid"
+- optional quality checks on header, citation, body
+    => is_empty()
+    => "is reference a stub"
+    => "is header metadata valid"
+    => "was body extracted successfully"
+    => {title, author}, {journal, volume, issue, page}, {journal, title, year}
+
+infrastructure:
+- tox testing, for multiple python versions
+
+
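Two of the TODO items in this diff (the "DOI and URL redundancy" cases and "parse old-style arxiv identifiers out of <idno>") can be illustrated with a minimal Python sketch. The helper names `extract_doi` and `extract_old_arxiv_id`, and both regular expressions, are assumptions made for illustration only. They are not part of grobid_tei_xml, and the patterns are deliberately loose shape checks rather than full validators.

```python
import re
from typing import Optional

# Hypothetical helpers for illustration; not part of grobid_tei_xml.

# Loose DOI shape (an assumption): "10.", a 4-9 digit registrant code,
# a slash, then any run of non-whitespace characters.
DOI_RE = re.compile(r"10\.\d{4,9}/\S+")

# Old-style arXiv identifier shape (an assumption): an archive name such as
# "hep-ph", an optional two-letter subject class, a slash, and 7 digits,
# with an optional "arxiv:" prefix and an optional version suffix.
OLD_ARXIV_RE = re.compile(
    r"^(?:arxiv:)?([a-z][a-z-]*(?:\.[A-Za-z]{2})?/\d{7})(?:v\d+)?$",
    re.IGNORECASE,
)


def extract_doi(raw: str) -> Optional[str]:
    """Pull a bare DOI out of a noisy 'doi' or 'url' value, or return None."""
    match = DOI_RE.search(raw)
    if match is None:
        return None
    # Drop trailing punctuation and lowercase (DOIs are case-insensitive).
    return match.group(0).rstrip(".,;").lower()


def extract_old_arxiv_id(raw: str) -> Optional[str]:
    """Return an old-style arXiv id (e.g. 'hep-ph/9610479'), or None."""
    match = OLD_ARXIV_RE.match(raw.strip())
    return match.group(1) if match else None
```

With the sample values listed in the TODO, `extract_doi` reduces both `"DOI=http://doi.acm.org/10.1145/3098275.3051123"` and `"=10.1080/14786440109462720"` to a bare DOI, and `extract_old_arxiv_id` accepts `"hep-ph/9610479"` while returning `None` for report identifiers such as `"ORNL-5138"` and `"RFC 1191"`.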