`grobid_tei_xml`: Python parser and transforms for GROBID-flavor TEI-XML
========================================================================

This is a simple Python library for parsing the TEI-XML structured documents
returned by [GROBID](https://github.com/kermitt2/grobid), a machine learning
tool for extracting text and bibliographic metadata from research article PDFs.

TEI-XML is a standard format, and there exist other libraries to parse entire
documents and work with annotated text. This library is focused specifically on
extracting "header" metadata from a document (e.g., title, authors, journal
name, volume, issue), content in flattened text form (full abstract and body
text as single strings, for things like search indexing), and structured
citation metadata.
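To make "header metadata" concrete, here is a minimal sketch of pulling a title and author names out of a TEI header using only the standard library. It is not this library's implementation: the XML fragment is hand-written and heavily simplified (real GROBID output has many more elements), and `extract_header` is a hypothetical helper named for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI-XML documents.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written, simplified fragment for illustration only (an assumption,
# not captured GROBID output).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title level="a" type="main">An Example Article</title>
      </titleStmt>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <author>
              <persName><forename type="first">Ada</forename><surname>Lovelace</surname></persName>
            </author>
            <author>
              <persName><forename type="first">Alan</forename><surname>Turing</surname></persName>
            </author>
          </analytic>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""


def extract_header(xml_text: str) -> dict:
    """Return the title and author names found in a TEI header (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    title_el = root.find(".//tei:titleStmt/tei:title", NS)
    authors = []
    for pers in root.findall(".//tei:analytic/tei:author/tei:persName", NS):
        # Join the name parts (forename, surname) in document order.
        parts = [(child.text or "").strip() for child in pers]
        authors.append(" ".join(p for p in parts if p))
    return {
        "title": title_el.text if title_el is not None else None,
        "authors": authors,
    }


header = extract_header(SAMPLE_TEI)
print(header)
```

The namespace mapping is the part that most often trips people up: every element in a TEI document lives in the TEI namespace, so un-prefixed paths such as `.//title` match nothing.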


## Quickstart

`grobid_tei_xml` works with Python 3, using only the standard library. It does
not talk to the GROBID HTTP API or read files off disk on its own, but see
examples below. The library is packaged on [pypi.org](https://pypi.org).

Install using `pip`, usually within a `virtualenv`:

    pip install grobid_tei_xml

The main entry points are the functions `parse_document_xml(xml_text)` and
`parse_citation_xml(xml_text)` (or `parse_citation_list_xml(xml_text)` for
multiple citations), which return Python dataclass objects. The helper method
`.to_dict()` can be useful for, e.g., serializing these objects to JSON.
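As a rough illustration of the dataclass-to-JSON pattern described above, the sketch below uses stand-in classes. `ExampleAuthor` and `ExampleHeader` are hypothetical and are not the classes this library returns; the `to_dict()` here is simply built on `dataclasses.asdict`, which may differ from what the library does internally.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ExampleAuthor:
    """Hypothetical stand-in, not a class from grobid_tei_xml."""
    full_name: str


@dataclass
class ExampleHeader:
    """Hypothetical stand-in, not a class from grobid_tei_xml."""
    title: Optional[str] = None
    doi: Optional[str] = None
    authors: List[ExampleAuthor] = field(default_factory=list)

    def to_dict(self) -> dict:
        # asdict() recurses into nested dataclasses and lists, so the
        # result contains only plain dicts, lists, strings, and None.
        return asdict(self)


header = ExampleHeader(
    title="An Example Article",
    authors=[ExampleAuthor(full_name="Ada Lovelace")],
)
as_json = json.dumps(header.to_dict(), sort_keys=True)
print(as_json)
```

Dataclass instances are not JSON-serializable directly, which is why a conversion step to plain dictionaries comes before `json.dumps`.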


## Usage Examples

Read an XML file from disk, parse it, and print to stdout as JSON:

```python
import json
import grobid_tei_xml

xml_path = "./tests/files/small.xml"

with open(xml_path, 'r') as xml_file:
    doc = grobid_tei_xml.parse_document_xml(xml_file.read())

print(json.dumps(doc.to_dict(), indent=2))
```

Use `requests` to download a PDF from the web, submit to GROBID (via HTTP API),
parse the TEI-XML response with `grobid_tei_xml`, and print some metadata
fields:

```python
import requests
import grobid_tei_xml

pdf_resp = requests.get("https://arxiv.org/pdf/1802.01168v3")
pdf_resp.raise_for_status()

grobid_resp = requests.post(
    "https://cloud.science-miner.com/grobid/api/processFulltextDocument",
    # the PDF goes in as a multipart file upload; options are plain form fields
    files={
        'input': pdf_resp.content,
    },
    data={
        'consolidateCitations': 0,
        'includeRawCitations': 1,
    },
    timeout=60.0,
)
grobid_resp.raise_for_status()

doc = grobid_tei_xml.parse_document_xml(grobid_resp.text)

# f-strings avoid a TypeError when a field is missing (None)
print(f"title: {doc.header.title}")
print("authors: " + ", ".join([a.full_name for a in doc.header.authors]))
print(f"doi: {doc.header.doi}")
print(f"citation count: {len(doc.citations)}")
print(f"abstract: {doc.abstract}")
```

Use `requests` to submit a "raw" citation string to GROBID for extraction,
parse the response with `grobid_tei_xml`, and print the structured output to
stdout:

```python
import requests
import grobid_tei_xml

raw_citation = "Kvenvolden K.A. and Field M.E. 1981. Thermogenic hydrocarbons in unconsolidated sediment of Eel River Basin, offshore northern California. AAPG Bulletin 65:1642-1646"

grobid_resp = requests.post(
    "https://cloud.science-miner.com/grobid/api/processCitation",
    data={
        'citations': raw_citation,
        'consolidateCitations': 0,
        'includeRawCitations': 1,
    },
    timeout=10.0,
)
grobid_resp.raise_for_status()

citation = grobid_tei_xml.parse_citation_xml(grobid_resp.text)
print(citation)
```

## See Also

[`grobid_client_python`](https://github.com/kermitt2/grobid_client_python):
Python client and CLI tool for making requests to GROBID via HTTP API. Returns
TEI-XML; could be used with this library (`grobid_tei_xml`) for parsing into
Python objects or, e.g., JSON.

[GROBID Documentation](https://grobid.readthedocs.io/en/latest/)

[s2orc-doc2json](https://github.com/allenai/s2orc-doc2json): a similar Python
library from AI2 for extracting both bibliographic metadata and (structured)
full text from GROBID TEI-XML. Has nice features like resolving in-text
references to their bibliography entries.

[delb](https://github.com/funkyfuture/delb): more flexible/powerful interface
to TEI-XML documents. Would be a better tool for working with structured text
(body, abstract, etc.).

["Parsing TEI XML documents with
Python"](https://komax.github.io/blog/text/python/xml/parsing_tei_xml_python/)
(2019): blog post about basic parsing of GROBID TEI-XML files into Pandas
DataFrames.


## License

This library is available under the permissive MIT License. See `LICENSE.txt`
for a copy.