author Bryan Newbold <bnewbold@archive.org> 2023-01-04 21:38:27 -0800
committer Bryan Newbold <bnewbold@archive.org> 2023-01-04 21:38:27 -0800
commit 3f4f785857481a2e7c9a65a19a9eb2bf0480e153
tree 55fc8addfa380c59c6be3e700678a1870ced427b /TODO
parent 035c662020ec1e98ec25ab707a076e55791ae212
commit old TODO file (main)
Diffstat (limited to 'TODO')
-rw-r--r-- TODO 114
1 files changed, 114 insertions, 0 deletions
diff --git a/TODO b/TODO
new file mode 100644
index 0000000..5b6ef7a
--- /dev/null
+++ b/TODO
@@ -0,0 +1,114 @@
+
+parsing:
+- multiple editors
+- proceedings and '<meeting>' event name; also address fields?
+- DOI and URL redundancy
+ url: "DOI=http://doi.acm.org/10.1145/3098275.3051123"
+ doi: "=10.1080/14786440109462720"
+- <idno>arxiv:cs.LG/1301.0604</idno>
+
+priorities:
+x test with citationList output
+x level=a (in title matches)
+x fix URL parsing support (?)
+- parse_citation_xml()
+ => returns None if didn't parse well
+ => does not set 'index'
+- parse_citation_list_xml()
+ current parse_citations_xml, which then aliases to this
+. test with GROBID 0.7.0/0.7.1 output
+ x PDF MD5
+ nope
+ ISBN/ISBN13: nope
+ subtitle: nope
+ number
+ edition
+ publisherPlace
+ suffix (?)
+ skipping:
+ oaUrl
+ ptr type=\"open-access\" target
+ done, needs test:
+ PII
+ ark
+ istexId
+ email: <email> under author
+ orcid: <idno type=\"ORCID\">
+ journal_abbrev
+ <title level=\"j\" type=\"abbrev
+ journal
+ level=\"j
+ bookTitle
+ title level=\"m (not main)
+ serieTitle
+ title level=\"s (not main)
+ institution (why?)
+ respStmt, orgName
+ editor
+ <editor>, persName
+ <contributor role=\"editor
+ meeting
+ for proceedings
+ <meeting>
+ address fields
+
+ web
+ ptr target
+ <ptr target="https://sfp.dpe.gov.bd/site/policies/b675228c-7bba-4feb-ae02-eb55de027fca/" />
+ alternative URL?
+
+ conf (conference stuff generally)
+ keywords
+
+
+ => rg -i abbrev tests/files/
+ => 'url' in a citation
+? handle orgName for <author>
+
+refactors:
+x remove old grobid2json (?)
+- more test coverage:
+ URL
+ _simplify_dict()
+ journal name variants
+
+fields/coverage:
+- other/report identifier
+ Johnson, C. K. (1976). ORTEPII. Report ORNL-5138. Oak Ridge National Laboratory, Tennessee, USA.
+ <idno>ORNL-5138</idno>
+ Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191, November 1990.
+ <idno>RFC 1191</idno>
+- parse old-style arxiv identifiers out of <idno>
+ K.R. Dienes, C. Kolda and J. March-Russell, hep-ph/9610479.
+ <idno>hep-ph/9610479</idno>
+ B.A. Dobrescu, hep-ph/9510424.
+ S.P. Martin, hep-ph/9608224.
+ https://github.com/internetarchive/fatcat/issues/84
+
+features:
+x citations as well as full bodies
+x default is to parse into a dataclass (?) similar to XML format
+x ... which can transform to JSON
+x to_csl_dict() helper (both for header/document and citation)
+- to_s2orc_metadata_dict() / to_s2orc_body_dict()
+- structured parsing of body, abstract, etc
+ => paragraphs / sections
+ => citation contexts
+ => table / figure / equation
+ => footers
+- all_urls() on document, including footnotes, body, bibref, etc
+- optional post-processing "cleanups" (run on doc or at parse time via kwarg flag)
+ => DOI and other identifier validity, via regex (?)
+ => year/date validity (eg, sane year range, valid month/day)
+ => "ibid"
+- optional quality checks on header, citation, body
+ => is_empty()
+ => "is reference a stub"
+ => "is header metadata valid"
+ => "was body extracted successfully"
+ => {title, author}, {journal, volume, issue, page}, {journal, title, year}
+
+infrastructure:
+- tox testing, for multiple python versions
+
+
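
The TODO above lists "DOI and URL redundancy" and "parse old-style arxiv identifiers out of <idno>" as open items, with regex-based identifier validity as a possible cleanup. A minimal sketch of what such helpers might look like follows. The function names `clean_doi` and `extract_old_arxiv_id` and both regular expressions are assumptions made for illustration; they are not taken from the grobid_tei_xml code, and the patterns are deliberately loose rather than a complete validator.

```python
import re
from typing import Optional

# Hypothetical helpers, not part of grobid_tei_xml.

# Loose DOI pattern: "10.", a 4-9 digit registrant code, "/", then a
# suffix with no whitespace. An approximation, not a full validator.
DOI_RE = re.compile(r"10\.\d{4,9}/\S+")

# Old-style arXiv identifier: archive name, optional two-letter subject
# class, "/", seven digits, optional version suffix (not captured).
OLD_ARXIV_RE = re.compile(
    r"\b([a-z][a-z-]*(?:\.[A-Za-z]{2})?/\d{7})(?:v\d+)?\b"
)


def clean_doi(raw: str) -> Optional[str]:
    """Pull a bare DOI out of a noisy 'doi' or 'url' field."""
    match = DOI_RE.search(raw)
    if match is None:
        return None
    # Strip trailing sentence punctuation; DOIs are case-insensitive,
    # so lowercasing is one possible normalization.
    return match.group(0).rstrip(".,;").lower()


def extract_old_arxiv_id(raw: str) -> Optional[str]:
    """Find an old-style arXiv identifier in raw citation or idno text."""
    match = OLD_ARXIV_RE.search(raw)
    if match is None:
        return None
    return match.group(1)


if __name__ == "__main__":
    # Inputs are the examples quoted in the TODO.
    print(clean_doi("=10.1080/14786440109462720"))
    print(clean_doi("DOI=http://doi.acm.org/10.1145/3098275.3051123"))
    print(extract_old_arxiv_id("S.P. Martin, hep-ph/9608224."))
```

Both helpers return None instead of raising, which would let a caller run them as an optional post-processing pass and leave the original field untouched when nothing matches. The form `arxiv:cs.LG/1301.0604` from the parsing list is not an old-style identifier and is not handled by this sketch.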