commit old TODO file (main)

author: Bryan Newbold <bnewbold@archive.org> 2023-01-04 21:38:27 -0800
committer: Bryan Newbold <bnewbold@archive.org> 2023-01-04 21:38:27 -0800
commit: 3f4f785857481a2e7c9a65a19a9eb2bf0480e153 (patch)
tree: 55fc8addfa380c59c6be3e700678a1870ced427b /TODO
parent: 035c662020ec1e98ec25ab707a076e55791ae212 (diff)
download: grobid_tei_xml-3f4f785857481a2e7c9a65a19a9eb2bf0480e153.tar.gz
grobid_tei_xml-3f4f785857481a2e7c9a65a19a9eb2bf0480e153.zip
1 files changed, 114 insertions, 0 deletions
diff --git a/TODO b/TODO
new file mode 100644
index 0000000..5b6ef7a
--- /dev/null
+++ b/TODO
@@ -0,0 +1,114 @@
+
+parsing:
+- multiple editors
+- proceedings and '<meeting>' event name; also address fields?
+- DOI and URL redundancy
+    url: "DOI=http://doi.acm.org/10.1145/3098275.3051123"
+    doi: "=10.1080/14786440109462720"
+- <idno>arxiv:cs.LG/1301.0604</idno>
+
+priorities:
+x test with citationList output
+x level=a (in title matches)
+x fix URL parsing support (?)
+- parse_citation_xml()
+    => returns None if didn't parse well
+    => does not set 'index'
+- parse_citation_list_xml()
+    current parse_citations_xml, which then aliases to this
+. test with GROBID 0.7.0/0.7.1 output
+    x  PDF MD5
+        nope
+            ISBN/ISBN13:  nope
+            subtitle: nope
+            number
+            edition
+            publisherPlace
+            suffix (?)
+        skipping:
+            oaUrl
+                ptr type=\"open-access\" target
+        done, needs test:
+            PII
+            ark
+            istexId
+            email: <email> under author
+            orcid: <idno type=\"ORCID\">
+            journal_abbrev
+                <title level=\"j\" type=\"abbrev
+            journal
+                level=\"j
+            bookTitle
+                title level=\"m (not main)
+            serieTitle
+                title level=\"s (not main)
+            institution (why?)
+                respStmt, orgName
+        editor
+            <editor>, persName
+            <contributor role=\"editor
+        meeting
+            for proceedings
+            <meeting>
+                address fields
+
+        web
+            ptr target
+            <ptr target="https://sfp.dpe.gov.bd/site/policies/b675228c-7bba-4feb-ae02-eb55de027fca/" />
+            alternative URL?
+
+        conf (conference stuff generally)
+        keywords
+
+
+    => rg -i abbrev tests/files/
+    => 'url' in a citation
+? handle orgName for <author>
+
+refactors:
+x remove old grobid2json (?)
+- more test coverage:
+    URL
+    _simplify_dict()
+    journal name variants
+
+fields/coverage:
+- other/report identifier
+    Johnson, C. K. (1976). ORTEPII. Report ORNL-5138. Oak Ridge National Laboratory, Tennessee, USA.
+        <idno>ORNL-5138</idno>
+    Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191, November 1990.
+        <idno>RFC 1191</idno>
+- parse old-style arxiv identifiers out of <idno>
+    K.R. Dienes, C. Kolda and J. March-Russell, hep-ph/9610479.
+         <idno>hep-ph/9610479</idno>
+    B.A. Dobrescu, hep-ph/9510424.
+    S.P. Martin, hep-ph/9608224.
+    https://github.com/internetarchive/fatcat/issues/84
+
+features:
+x citations as well as full bodies
+x default is to parse into a dataclass (?) similar to XML format
+x ... which can transform to JSON
+x to_csl_dict() helper (both for header/document and citation)
+- to_s2orc_metadata_dict() / to_s2orc_body_dict()
+- structured parsing of body, abstract, etc
+    => paragraphs / sections
+    => citation contexts
+    => table / figure / equation
+    => footers
+- all_urls() on document, including footnotes, body, bibref, etc
+- optional post-processing "cleanups" (run on doc or at parse time via kwarg flag)
+    => DOI and other identifier validity, via regex (?)
+    => year/date validity (eg, sane year range, valid month/day)
+    => "ibid"
+- optional quality checks on header, citation, body
+    => is_empty()
+    => "is reference a stub"
+    => "is header metadata valid"
+    => "was body extracted successfully"
+    => {title, author}, {journal, volume, issue, page}, {journal, title, year}
+
+infrastructure:
+- tox testing, for multiple python versions
+
+
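The identifier items in the TODO above (stray "DOI=" and "=" prefixes in DOI fields, and old-style arXiv identifiers inside <idno>) can be sketched as small regex helpers. This is a minimal sketch, not code from this repository: the names extract_old_arxiv_id and clean_doi and both regular expressions are assumptions made for illustration, and the patterns are deliberately loose.

```python
import re
from typing import Optional

# Assumed pattern for old-style arXiv identifiers such as "hep-ph/9610479":
# archive name, optional ".XX" subject class, a slash, then seven digits.
OLD_ARXIV_RE = re.compile(r"\b([a-z]+(?:-[a-z]+)?(?:\.[A-Z]{2})?/\d{7})\b")

# Assumed loose DOI pattern: "10.", a 4-9 digit registrant code, a slash,
# then any run of non-whitespace characters.
DOI_RE = re.compile(r"\b(10\.\d{4,9}/\S+)")


def extract_old_arxiv_id(raw: str) -> Optional[str]:
    """Return the first old-style arXiv identifier found in raw, else None."""
    match = OLD_ARXIV_RE.search(raw)
    return match.group(1) if match else None


def clean_doi(raw: str) -> Optional[str]:
    """Return the bare DOI found in raw (dropping any prefix), else None."""
    match = DOI_RE.search(raw)
    if not match:
        return None
    # Drop trailing punctuation that commonly follows a DOI in a citation.
    return match.group(1).rstrip(".,;")


if __name__ == "__main__":
    print(extract_old_arxiv_id("B.A. Dobrescu, hep-ph/9510424."))
    print(clean_doi("DOI=http://doi.acm.org/10.1145/3098275.3051123"))
    print(clean_doi("=10.1080/14786440109462720"))
```

Returning None instead of raising keeps such helpers usable as optional post-processing "cleanups", where a field that fails to match is simply left alone. The mixed-style value "arxiv:cs.LG/1301.0604" from the parsing list does not match the old-style pattern and would need separate handling.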