`grobid_tei_xml`: Python parser and transforms for GROBID-flavor TEI-XML
========================================================================

This is a simple Python library for parsing the TEI-XML structured documents
returned by [GROBID](https://github.com/kermitt2/grobid), a machine learning
tool for extracting text and bibliographic metadata from research article PDFs.

TEI-XML is a standard format, and there are other libraries to parse entire
documents and work with annotated text. This library focuses specifically on
extracting "header" metadata from documents (e.g., title, authors, journal name,
volume, issue), content in flattened text form (full abstract and body text as
single strings, for things like search indexing), and structured citation
metadata.
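
To make that scope concrete, here is a minimal sketch of what extracting header
fields and flattened text from TEI-XML can look like, using only the standard
library's `xml.etree.ElementTree`. It is not this library's implementation: the
sample XML is hand-written and far simpler than real GROBID output, and the
names `SAMPLE_TEI`, `flatten_text`, and `extract_header` are made up for
illustration.

```python
import xml.etree.ElementTree as ET

# TEI elements live in this XML namespace.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written, heavily simplified stand-in for a GROBID response (assumption).
SAMPLE_TEI = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title level="a" type="main">An Example Article</title>
      </titleStmt>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <author>
              <persName>
                <forename type="first">Ada</forename>
                <surname>Lovelace</surname>
              </persName>
            </author>
            <author>
              <persName>
                <forename type="first">Alan</forename>
                <surname>Turing</surname>
              </persName>
            </author>
          </analytic>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
    <profileDesc>
      <abstract>
        <p>First sentence. <ref type="bibr">[1]</ref></p>
        <p>Second sentence.</p>
      </abstract>
    </profileDesc>
  </teiHeader>
</TEI>
"""


def flatten_text(elem):
    """Join all text under elem into one whitespace-normalized string."""
    if elem is None:
        return None
    return " ".join("".join(elem.itertext()).split())


def extract_header(tei_xml):
    """Return title, author names, and flattened abstract as a plain dict."""
    root = ET.fromstring(tei_xml.strip())
    header = root.find("tei:teiHeader", NS)
    authors = []
    for pers in header.findall(".//tei:analytic/tei:author/tei:persName", NS):
        # Join forename(s) and surname in document order.
        parts = [part.text.strip() for part in pers if part.text]
        authors.append(" ".join(parts))
    return {
        "title": flatten_text(header.find(".//tei:titleStmt/tei:title", NS)),
        "authors": authors,
        "abstract": flatten_text(header.find(".//tei:abstract", NS)),
    }


print(extract_header(SAMPLE_TEI))
```

Flattening discards inline markup such as `<ref>` and keeps only the text,
which is the trade-off that makes the output convenient for search indexing.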

`grobid_tei_xml` works with Python 3, using only the standard library. It does
not talk to the GROBID HTTP API or read files off disk on its own, but see the
examples below.

In the near future, it should be possible to install `grobid_tei_xml` from
[pypi.org](https://pypi.org) using `pip`.


## Usage Examples

Read an XML file from disk, parse it, and print to stdout as JSON:

```python
import json
import grobid_tei_xml

xml_path = "./tests/files/small.xml"

with open(xml_path, 'r', encoding='utf-8') as xml_file:
    doc = grobid_tei_xml.parse_document_xml(xml_file.read())

print(json.dumps(doc.to_dict(), indent=2))
```
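
Because the parser takes XML as a string, batch processing is left to the
caller. The following is a rough sketch of one way to handle a directory of
saved TEI-XML files. The helper `parse_xml_dir` and the `./tei_output`
directory are hypothetical, and the parse function is passed in as an argument
so the helper itself depends only on the standard library.

```python
from pathlib import Path
from typing import Callable, Dict


def parse_xml_dir(dir_path: str, parse: Callable[[str], object]) -> Dict[str, object]:
    """Read every *.xml file in dir_path and apply `parse` to its text.

    Returns a dict mapping file name to whatever `parse` returned.
    """
    results = {}
    for path in sorted(Path(dir_path).glob("*.xml")):
        # File I/O stays here; the parser only ever sees a string.
        results[path.name] = parse(path.read_text(encoding="utf-8"))
    return results


# Intended use (assumes grobid_tei_xml is installed and ./tei_output exists):
#   import grobid_tei_xml
#   docs = parse_xml_dir("./tei_output", grobid_tei_xml.parse_document_xml)
#   for name, doc in docs.items():
#       print(name, doc.header.title)
```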

Use `requests` to download a PDF from the web, submit to GROBID (via HTTP API),
parse the TEI-XML response with `grobid_tei_xml`, and print some metadata
fields:

```python
import requests
import grobid_tei_xml

pdf_resp = requests.get("https://arxiv.org/pdf/1802.01168v3")
pdf_resp.raise_for_status()

grobid_resp = requests.post(
    "https://cloud.science-miner.com/grobid/api/processFulltextDocument",
    files={
        'input': pdf_resp.content,
        'consolidateCitations': 0,
        'includeRawCitations': 1,
    },
    timeout=60.0,
)
grobid_resp.raise_for_status()

doc = grobid_tei_xml.parse_document_xml(grobid_resp.text)

print("title: " + doc.header.title)
print("authors: " + ", ".join([a.name for a in doc.header.authors]))
print("doi: " + str(doc.header.doi))
print("citation count: " + str(len(doc.citations)))
print("abstract: " + doc.abstract)
```

Use `requests` to submit a "raw" citation string to GROBID for extraction,
parse the response with `grobid_tei_xml`, and print the structured output to
stdout:

```python
import requests
import grobid_tei_xml

raw_citation = "Kvenvolden K.A. and Field M.E. 1981. Thermogenic hydrocarbons in unconsolidated sediment of Eel River Basin, offshore northern California. AAPG Bulletin 65:1642-1646"

grobid_resp = requests.post(
    "https://cloud.science-miner.com/grobid/api/processCitation",
    data={
        'citations': raw_citation,
        'consolidateCitations': 0,
    },
    timeout=10.0,
)
grobid_resp.raise_for_status()

citation = grobid_tei_xml.parse_citations_xml(grobid_resp.text)[0]
print(citation)
```
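
For a sense of what "structured" means here, the sketch below pulls a few
fields out of a TEI `<biblStruct>` element with the standard library alone. The
XML is a hand-written approximation of what GROBID might return for the raw
citation above, not captured output, and real responses may differ in shape.
The names `SAMPLE_BIBL`, `text_of`, and `extract_citation` are illustrative and
are not part of `grobid_tei_xml`.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written approximation of a parsed citation (assumption, not real output).
SAMPLE_BIBL = """
<biblStruct xmlns="http://www.tei-c.org/ns/1.0">
  <analytic>
    <title level="a" type="main">Thermogenic hydrocarbons in unconsolidated sediment of Eel River Basin, offshore northern California</title>
    <author>
      <persName>
        <forename type="first">K</forename>
        <forename type="middle">A</forename>
        <surname>Kvenvolden</surname>
      </persName>
    </author>
    <author>
      <persName>
        <forename type="first">M</forename>
        <forename type="middle">E</forename>
        <surname>Field</surname>
      </persName>
    </author>
  </analytic>
  <monogr>
    <title level="j">AAPG Bulletin</title>
    <imprint>
      <biblScope unit="volume">65</biblScope>
      <biblScope unit="page" from="1642" to="1646" />
      <date type="published" when="1981" />
    </imprint>
  </monogr>
</biblStruct>
"""


def text_of(elem):
    """Return stripped element text, or None if the element or text is missing."""
    return elem.text.strip() if elem is not None and elem.text else None


def extract_citation(bibl_xml):
    """Turn one <biblStruct> into a flat dict of common citation fields."""
    bibl = ET.fromstring(bibl_xml.strip())
    imprint = "tei:monogr/tei:imprint/"
    authors = []
    for pers in bibl.findall("tei:analytic/tei:author/tei:persName", NS):
        parts = [text_of(part) for part in pers]
        authors.append(" ".join(p for p in parts if p))
    pages = bibl.find(imprint + "tei:biblScope[@unit='page']", NS)
    date = bibl.find(imprint + "tei:date", NS)
    return {
        "title": text_of(bibl.find("tei:analytic/tei:title", NS)),
        "authors": authors,
        "journal": text_of(bibl.find("tei:monogr/tei:title", NS)),
        "volume": text_of(bibl.find(imprint + "tei:biblScope[@unit='volume']", NS)),
        # Page ranges are carried in attributes rather than element text.
        "pages": f"{pages.get('from')}-{pages.get('to')}" if pages is not None else None,
        "year": date.get("when") if date is not None else None,
    }


print(extract_citation(SAMPLE_BIBL))
```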

## See Also

[`grobid_client_python`](https://github.com/kermitt2/grobid_client_python):
Python client and CLI tool for making requests to GROBID via HTTP API. Returns
TEI-XML; could be used with this library (`grobid_tei_xml`) for parsing into
Python objects or, e.g., JSON.

[GROBID Documentation](https://grobid.readthedocs.io/en/latest/)

[delb](https://github.com/funkyfuture/delb): a more flexible and powerful
interface to TEI-XML documents. It would be a better tool for working with
structured text (body, abstract, etc.).

["Parsing TEI XML documents with
Python"](https://komax.github.io/blog/text/python/xml/parsing_tei_xml_python/)
(2019): blog post about basic parsing of GROBID TEI-XML files into Pandas
DataFrames