| -rw-r--r-- | TODO | 114 | 
1 files changed, 114 insertions, 0 deletions
@@ -0,0 +1,114 @@
+
+parsing:
+- multiple editors
+- proceedings and '<meeting>' event name; also address fields?
+- DOI and URL redundancy
+    url: "DOI=http://doi.acm.org/10.1145/3098275.3051123"
+    doi: "=10.1080/14786440109462720"
+- <idno>arxiv:cs.LG/1301.0604</idno>
+
+priorities:
+x test with citationList output
+x level=a (in title matches)
+x fix URL parsing support (?)
+- parse_citation_xml()
+    => returns None if didn't parse well
+    => does not set 'index'
+- parse_citation_list_xml()
+    current parse_citations_xml, which then aliases to this
+. test with GROBID 0.7.0/0.7.1 output
+    x  PDF MD5
+        nope
+            ISBN/ISBN13:  nope
+            subtitle: nope
+            number
+            edition
+            publisherPlace
+            suffix (?)
+        skipping:
+            oaUrl
+                ptr type=\"open-access\" target
+        done, needs test:
+            PII
+            ark
+            istexId
+            email: <email> under author
+            orcid: <idno type=\"ORCID\">
+            journal_abbrev
+                <title level=\"j\" type=\"abbrev
+            journal
+                level=\"j
+            bookTitle
+                title level=\"m (not main)
+            serieTitle
+                title level=\"s (not main)
+            institution (why?)
+                respStmt, orgName
+        editor
+            <editor>, persName
+            <contributor role=\"editor
+        meeting
+            for proceedings
+            <meeting>
+                address fields
+
+        web
+            ptr target
+            <ptr target="https://sfp.dpe.gov.bd/site/policies/b675228c-7bba-4feb-ae02-eb55de027fca/" />
+            alternative URL?
+
+        conf (conference stuff generally)
+        keywords
+
+
+    => rg -i abbrev tests/files/
+    => 'url' in a citation
+? handle orgName for <author>
+
+refactors:
+x remove old grobid2json (?)
+- more test coverage:
+    URL
+    _simplify_dict()
+    journal name variants
+
+fields/coverage:
+- other/report identifier
+    Johnson, C. K. (1976). ORTEPII. Report ORNL-5138. Oak Ridge National Laboratory, Tennessee, USA.
+        <idno>ORNL-5138</idno>
+    Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191, November 1990.
+        <idno>RFC 1191</idno>
+- parse old-style arxiv identifiers out of <idno>
+    K.R. Dienes, C. Kolda and J. March-Russell, hep-ph/9610479.
+         <idno>hep-ph/9610479</idno>
+    B.A. Dobrescu, hep-ph/9510424.
+    S.P. Martin, hep-ph/9608224.
+    https://github.com/internetarchive/fatcat/issues/84
+
+features:
+x citations as well as full bodies
+x default is to parse into a dataclass (?) similar to XML format
+x ... which can transform to JSON
+x to_csl_dict() helper (both for header/document and citation)
+- to_s2orc_metadata_dict() / to_s2orc_body_dict()
+- structured parsing of body, abstract, etc
+    => paragraphs / sections
+    => citation contexts
+    => table / figure / equation
+    => footers
+- all_urls() on document, including footnotes, body, bibref, etc
+- optional post-processing "cleanups" (run on doc or at parse time via kwarg flag)
+    => DOI and other identifier validity, via regex (?)
+    => year/date validity (eg, sane year range, valid month/day)
+    => "ibid"
+- optional quality checks on header, citation, body
+    => is_empty()
+    => "is reference a stub"
+    => "is header metadata valid"
+    => "was body extracted successfully"
+    => {title, author}, {journal, volume, issue, page}, {journal, title, year}
+
+infrastructure:
+- tox testing, for multiple python versions
+
+
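The identifier items in the TODO ("DOI and URL redundancy", "parse old-style arxiv identifiers out of <idno>", "DOI and other identifier validity, via regex (?)") could be approached roughly as below. This is only a sketch under assumptions: `extract_doi` and `extract_arxiv_id` are hypothetical helper names that do not appear in the diff, the regexes are deliberately loose, and dropping the category prefix and version suffix from new-style arXiv identifiers is a choice made here for illustration.

```python
import re
from typing import Optional

# Hypothetical helpers for illustration; none of these names come from the diff.

# Loose DOI pattern: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_RE = re.compile(r"10\.\d{4,9}/\S+")

# Old-style arXiv identifier: archive (optionally with a two-letter subject
# class), a slash, then seven digits, e.g. "hep-ph/9610479".
ARXIV_OLD_RE = re.compile(r"^([a-z-]+(?:\.[a-z]{2})?/\d{7})(v\d+)?$", re.IGNORECASE)

# New-style arXiv identifier, tolerating a leading category such as "cs.LG/".
ARXIV_NEW_RE = re.compile(
    r"^(?:[a-z-]+(?:\.[a-z]{2})?/)?(\d{4}\.\d{4,5})(v\d+)?$", re.IGNORECASE
)


def extract_doi(raw: str) -> Optional[str]:
    """Return the first DOI-looking substring of a noisy 'doi' or 'url' value."""
    match = DOI_RE.search(raw)
    if match is None:
        return None
    # Trailing punctuation is usually sentence debris, not part of the DOI.
    return match.group(0).rstrip(".,;")


def extract_arxiv_id(raw: str) -> Optional[str]:
    """Return a bare arXiv identifier from <idno> text, or None if it is not one."""
    text = raw.strip()
    if text.lower().startswith("arxiv:"):
        text = text[len("arxiv:"):].strip()
    for pattern in (ARXIV_OLD_RE, ARXIV_NEW_RE):
        match = pattern.match(text)
        if match is not None:
            # Assumption: the version suffix (and any category prefix on
            # new-style identifiers) is dropped.
            return match.group(1)
    return None
```

Run against the examples quoted in the TODO, the report-style values (`ORNL-5138`, `RFC 1191`) fall through to `None`, which would leave them available for a separate "other/report identifier" field.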

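The "year/date validity (eg, sane year range, valid month/day)" cleanup listed under features might look something like the following. It is a minimal sketch, assuming dates arrive as ISO-like strings (`YYYY`, `YYYY-MM`, or `YYYY-MM-DD`); `clean_date` and its `min_year` / `max_year` parameters are hypothetical names and defaults, not anything defined in the diff.

```python
import datetime
import re
from typing import Optional

# Accepts "YYYY", "YYYY-MM", or "YYYY-MM-DD" with ASCII digits only.
ISO_PARTIAL_DATE_RE = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?", re.ASCII)


def clean_date(
    raw: str, min_year: int = 1500, max_year: Optional[int] = None
) -> Optional[str]:
    """Return the stripped date string if it is plausible, otherwise None.

    Hypothetical helper: the year bounds are illustrative assumptions.
    """
    if max_year is None:
        # Allow next year, since publication dates are sometimes set ahead.
        max_year = datetime.date.today().year + 1
    text = raw.strip()
    match = ISO_PARTIAL_DATE_RE.fullmatch(text)
    if match is None:
        return None
    year = int(match.group(1))
    month = int(match.group(2)) if match.group(2) else 1
    day = int(match.group(3)) if match.group(3) else 1
    if not min_year <= year <= max_year:
        return None
    try:
        # datetime.date rejects impossible month/day combinations.
        datetime.date(year, month, day)
    except ValueError:
        return None
    return text
```

Returning `None` rather than raising keeps this usable as an optional post-processing step: a caller can simply clear the field when the check fails.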