
parsing:
- multiple editors
- proceedings and '<meeting>' event name; also address fields?
- DOI and URL redundancy
    url: "DOI=http://doi.acm.org/10.1145/3098275.3051123"
    doi: "=10.1080/14786440109462720"
- <idno>arxiv:cs.LG/1301.0604</idno>
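
One possible way to handle the DOI/URL redundancy above, as a minimal sketch only. The helper names `extract_doi()` and `dedupe_url()` are hypothetical (they are not existing functions in this codebase), the regex is a loose approximation of DOI syntax, and dropping a URL that merely restates the DOI is an assumed policy.

```python
import re
from typing import Optional

# Loose DOI pattern (assumption): "10.", a 4-9 digit registrant code, "/",
# then any run of non-whitespace characters.
DOI_RE = re.compile(r"10\.\d{4,9}/\S+")


def extract_doi(raw: Optional[str]) -> Optional[str]:
    """Hypothetical helper: pull a bare DOI out of a noisy 'doi' or 'url' value."""
    if not raw:
        return None
    match = DOI_RE.search(raw)
    if not match:
        return None
    # Trailing punctuation is usually citation formatting, not part of the DOI.
    return match.group(0).rstrip(".,;")


def dedupe_url(url: Optional[str], doi: Optional[str]) -> Optional[str]:
    """Hypothetical helper: drop the URL when it only restates the DOI."""
    if url and doi and extract_doi(url) == doi:
        return None
    return url


doi = extract_doi("=10.1080/14786440109462720")
url = dedupe_url(
    "DOI=http://doi.acm.org/10.1145/3098275.3051123",
    "10.1145/3098275.3051123",
)
```

With the two example values from the list, the first call should yield the bare DOI and the second should discard the URL, since it carries nothing beyond the DOI.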

priorities:
x test with citationList output
x level=a (in title matches)
x fix URL parsing support (?)
- parse_citation_xml()
    => returns None if the citation didn't parse well
    => does not set 'index'
- parse_citation_list_xml()
    rename of the current parse_citations_xml(), which then becomes an alias for this
. test with GROBID 0.7.0/0.7.1 output
    x  PDF MD5
        nope:
            ISBN/ISBN13:  nope
            subtitle: nope
            number
            edition
            publisherPlace
            suffix (?)
        skipping:
            oaUrl
                ptr type="open-access" target
        done, needs test:
            PII
            ark
            istexId
            email: <email> under author
            orcid: <idno type="ORCID">
            journal_abbrev
                <title level="j" type="abbrev
            journal
                level="j
            bookTitle
                title level="m (not main)
            serieTitle
                title level="s (not main)
            institution (why?)
                respStmt, orgName
        editor
            <editor>, persName
            <contributor role="editor
        meeting
            for proceedings
            <meeting>
                address fields

        web
            ptr target
            <ptr target="https://sfp.dpe.gov.bd/site/policies/b675228c-7bba-4feb-ae02-eb55de027fca/" />
            alternative URL?

        conf (conference stuff generally)
        keywords


    => rg -i abbrev tests/files/
    => 'url' in a citation
? handle orgName for <author>
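
A rough sketch of how the parse_citation_xml() / parse_citation_list_xml() split above might look, using only the standard library. Everything here is an assumption rather than the actual implementation: the dict output (the real code returns richer objects), the `_biblstruct_to_dict()` helper, and the reading of "didn't parse well" as "malformed XML, or neither a title nor an author surname found".

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

TEI_NS = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI_NS}
BIBLSTRUCT_TAG = "{%s}biblStruct" % TEI_NS


def _biblstruct_to_dict(elem: ET.Element) -> Optional[dict]:
    """Hypothetical helper: extract a tiny subset of fields from one <biblStruct>."""
    title = elem.findtext(".//tei:title", default="", namespaces=NS).strip()
    surnames = []
    for author in elem.iterfind(".//tei:author", NS):
        surname = author.findtext(".//tei:surname", default="", namespaces=NS).strip()
        if surname:
            surnames.append(surname)
    if not title and not surnames:
        # Assumed meaning of "didn't parse well": nothing usable was found.
        return None
    return {"title": title or None, "authors": surnames}


def parse_citation_xml(xml_text: str) -> Optional[dict]:
    """Parse a single citation; returns None on failure and never sets 'index'."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None
    elem = next(root.iter(BIBLSTRUCT_TAG), None)
    return None if elem is None else _biblstruct_to_dict(elem)


def parse_citation_list_xml(xml_text: str) -> List[dict]:
    """Parse every <biblStruct> in the document, setting 'index' by position."""
    root = ET.fromstring(xml_text)
    citations = []
    for index, elem in enumerate(root.iter(BIBLSTRUCT_TAG)):
        parsed = _biblstruct_to_dict(elem)
        if parsed is not None:
            parsed["index"] = index
            citations.append(parsed)
    return citations


# The old name could simply stay around as an alias.
parse_citations_xml = parse_citation_list_xml
```

Keeping 'index' out of the single-citation path means a lone <biblStruct> never gets a misleading position of 0; only the list parser knows the position.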

refactors:
x remove old grobid2json (?)
- more test coverage:
    URL
    _simplify_dict()
    journal name variants

fields/coverage:
- other/report identifier
    Johnson, C. K. (1976). ORTEPII. Report ORNL-5138. Oak Ridge National Laboratory, Tennessee, USA.
        <idno>ORNL-5138</idno>
    Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191, November 1990.
        <idno>RFC 1191</idno>
- parse old-style arxiv identifiers out of <idno>
    K.R. Dienes, C. Kolda and J. March-Russell, hep-ph/9610479.
        <idno>hep-ph/9610479</idno>
    B.A. Dobrescu, hep-ph/9510424.
    S.P. Martin, hep-ph/9608224.
    https://github.com/internetarchive/fatcat/issues/84
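
A minimal sketch of classifying untyped <idno> text, assuming old-style arXiv identifiers look like "archive/YYMMNNN" with an optional ".XX" subject class and an optional "arxiv:" prefix or "vN" suffix. The function names `old_arxiv_id()` and `classify_idno()` and the ("arxiv" / "other") labels are hypothetical. Anything that does not match, such as the report and RFC numbers above, is passed through as an "other" identifier.

```python
import re
from typing import Optional, Tuple

# Assumed shape of an old-style arXiv identifier, e.g. "hep-ph/9610479":
# archive name, optional two-letter subject class, slash, seven digits.
OLD_ARXIV_RE = re.compile(
    r"^(?:arxiv:)?([a-z]+(?:-[a-z]+)*(?:\.[a-z]{2})?/\d{7})(?:v\d+)?$",
    re.IGNORECASE,
)


def old_arxiv_id(idno_text: str) -> Optional[str]:
    """Hypothetical helper: return the bare old-style arXiv id, or None."""
    match = OLD_ARXIV_RE.match(idno_text.strip())
    return match.group(1) if match else None


def classify_idno(idno_text: str) -> Tuple[str, str]:
    """Hypothetical helper: label an untyped <idno> value."""
    arxiv_id = old_arxiv_id(idno_text)
    if arxiv_id:
        return ("arxiv", arxiv_id)
    # Report numbers, RFCs, etc. are kept verbatim as a generic identifier.
    return ("other", idno_text.strip())


labels = [classify_idno(t) for t in ("hep-ph/9610479", "RFC 1191", "ORNL-5138")]
```

Mixed forms such as "arxiv:cs.LG/1301.0604" deliberately do not match this pattern and would need separate handling.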

features:
x citations as well as full bodies
x default is to parse into a dataclass (?) similar to XML format
x ... which can transform to JSON
x to_csl_dict() helper (both for header/document and citation)
- to_s2orc_metadata_dict() / to_s2orc_body_dict()
- structured parsing of body, abstract, etc
    => paragraphs / sections
    => citation contexts
    => table / figure / equation
    => footers
- all_urls() on document, including footnotes, body, bibref, etc
- optional post-processing "cleanups" (run on doc or at parse time via kwarg flag)
    => DOI and other identifier validity, via regex (?)
    => year/date validity (eg, sane year range, valid month/day)
    => "ibid"
- optional quality checks on header, citation, body
    => is_empty()
    => "is reference a stub"
    => "is header metadata valid"
    => "was body extracted successfully"
    => {title, author}, {journal, volume, issue, page}, {journal, title, year}
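
The quality checks and year cleanup above could be sketched roughly as follows, working on plain dicts for simplicity. The field groups are copied from the last line of the list; reading them as "any one fully populated group means the reference is not a stub" is an assumption, as are the function names `is_stub()` and `sane_year()` and the 1500 lower bound for years.

```python
import datetime
from typing import Mapping, Optional

# Field groups from the list above. Assumed meaning: a citation with every
# field of at least one group filled in is identifiable, so not a stub.
IDENTIFYING_GROUPS = [
    {"title", "author"},
    {"journal", "volume", "issue", "page"},
    {"journal", "title", "year"},
]


def is_empty(citation: Mapping[str, object]) -> bool:
    """True when no field carries a truthy value."""
    return not any(citation.values())


def is_stub(citation: Mapping[str, object]) -> bool:
    """Hypothetical check: True when no identifying field group is complete."""
    present = {key for key, value in citation.items() if value}
    return not any(group <= present for group in IDENTIFYING_GROUPS)


def sane_year(raw: object, min_year: int = 1500) -> Optional[int]:
    """Hypothetical cleanup: return the year as an int if plausible, else None."""
    if raw is None:
        return None
    try:
        year = int(str(raw).strip())
    except ValueError:
        return None
    # Allow one year of slack for forthcoming publications (assumption).
    max_year = datetime.date.today().year + 1
    return year if min_year <= year <= max_year else None
```

Checks like these could run either on an already-parsed document or at parse time behind a kwarg flag, as the list suggests; the sketch takes no position on which.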

infrastructure:
- tox testing, for multiple python versions