author Bryan Newbold <bnewbold@archive.org> 2021-10-21 19:59:56 -0700
committer Bryan Newbold <bnewbold@archive.org> 2021-10-21 19:59:58 -0700
commit c82dbcaaa89d99cbe482eeb2d8ffbce28201fd14
tree 019481f865b283833bbbf9cddb97af9b9717b9bb /README.md
parent 45deea74f80d1e8deed6076f2a93d711d16a3a83
add examples to README, and test those examples in CI
These tests don't run as part of 'make test' by default because they do live fetches against the internet.
Diffstat (limited to 'README.md')
-rw-r--r-- README.md 111
1 files changed, 109 insertions, 2 deletions
diff --git a/README.md b/README.md
index ca91cfe..0700757 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,111 @@
-grobid-tei-xml: Python parser and transforms for GROBID-flavor TEI-XML
-======================================================================
+`grobid_tei_xml`: Python parser and transforms for GROBID-flavor TEI-XML
+========================================================================
+This is a simple Python library for parsing the TEI-XML structured documents
+returned by [GROBID](https://github.com/kermitt2/grobid), a machine learning
+tool for extracting text and bibliographic metadata from research article PDFs.
+
+TEI-XML is a standard format, and there are other libraries to parse entire
+documents and work with annotated text. This library is focused specifically on
+extracting "header" metadata from documents (e.g., title, authors, journal name,
+volume, issue), content in flattened text form (full abstract and body text as
+single strings, for things like search indexing), and structured citation
+metadata.
+
+`grobid_tei_xml` works with Python 3, using only the standard library. It does
+not talk to the GROBID HTTP API or read files off disk on its own, but see the
+examples below.
+
+In the near future, it should be possible to install `grobid_tei_xml` from
+[pypi.org](https://pypi.org) using `pip`.
+
+
+## Use Examples
+
+Read an XML file from disk, parse it, and print to stdout as JSON:
+
+```python
+import json
+import grobid_tei_xml
+
+xml_path = "./tests/files/small.xml"
+
+with open(xml_path, 'r') as xml_file:
+    doc = grobid_tei_xml.parse_document_xml(xml_file.read())
+
+print(json.dumps(doc.to_dict(), indent=2))
+```
+
+Use `requests` to download a PDF from the web, submit to GROBID (via HTTP API),
+parse the TEI-XML response with `grobid_tei_xml`, and print some metadata
+fields:
+
+```python
+import requests
+import grobid_tei_xml
+
+pdf_resp = requests.get("https://arxiv.org/pdf/1802.01168v3")
+pdf_resp.raise_for_status()
+
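+grobid_resp = requests.post(
+    "https://cloud.science-miner.com/grobid/api/processFulltextDocument",
+    # the PDF bytes go in the multipart file field
+    files={
+        'input': pdf_resp.content,
+    },
+    # plain form parameters go in `data`, not `files`
+    data={
+        'consolidateCitations': 0,
+        'includeRawCitations': 1,
+    },
+    timeout=60.0,
+)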
+grobid_resp.raise_for_status()
+
+doc = grobid_tei_xml.parse_document_xml(grobid_resp.text)
+
+print("title: " + doc.header.title)
+print("authors: " + ", ".join([a.name for a in doc.header.authors]))
+print("doi: " + str(doc.header.doi))
+print("citation count: " + str(len(doc.citations)))
+print("abstract: " + doc.abstract)
+```
+
+Use `requests` to submit a "raw" citation string to GROBID for extraction,
+parse the response with `grobid_tei_xml`, and print the structured output to
+stdout:
+
+```python
+import requests
+import grobid_tei_xml
+
+raw_citation = "Kvenvolden K.A. and Field M.E. 1981. Thermogenic hydrocarbons in unconsolidated sediment of Eel River Basin, offshore northern California. AAPG Bulletin 65:1642-1646"
+
+grobid_resp = requests.post(
+    "https://cloud.science-miner.com/grobid/api/processCitation",
+    data={
+        'citations': raw_citation,
+        'consolidateCitations': 0,
+    },
+    timeout=10.0,
+)
+grobid_resp.raise_for_status()
+
+citation = grobid_tei_xml.parse_citations_xml(grobid_resp.text)[0]
+print(citation)
+```
+
+## See Also
+
+[`grobid_client_python`](https://github.com/kermitt2/grobid_client_python):
+Python client and CLI tool for making requests to GROBID via HTTP API. Returns
+TEI-XML; could be used with this library (`grobid_tei_xml`) for parsing into
+Python objects or, e.g., JSON.
+
+[GROBID Documentation](https://grobid.readthedocs.io/en/latest/)
+
+[delb](https://github.com/funkyfuture/delb): more flexible/powerful interface
+to TEI-XML documents. Would be a better tool for working with structured text
+(body, abstract, etc.).
+
+["Parsing TEI XML documents with
+Python"](https://komax.github.io/blog/text/python/xml/parsing_tei_xml_python/)
+(2019): blog post about basic parsing of GROBID TEI-XML files into Pandas
+DataFrames
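
Note (not part of the commit): the commit message says the README examples are tested in CI, but the test code is not shown in this diff. The following is only a minimal sketch of one way such a check could work, assuming the README keeps its examples in fenced `python` blocks. The names `extract_python_blocks` and `compile_blocks` are hypothetical and do not come from the repository.

```python
import re

# Build the fence marker programmatically so this snippet can itself live
# inside a fenced block without confusing a markdown renderer.
FENCE = "`" * 3

# Matches a fenced block that opens with the fence plus "python" on its own
# line and closes with a bare fence on its own line. Hypothetical helper,
# not the project's actual test harness.
FENCE_RE = re.compile(
    r"^" + FENCE + r"python[ \t]*\n(.*?)^" + FENCE + r"[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_python_blocks(markdown_text):
    """Return the body of every fenced python block, in document order."""
    return [match.group(1) for match in FENCE_RE.finditer(markdown_text)]


def compile_blocks(blocks):
    """Syntax-check each block without executing it (no network needed)."""
    return [
        compile(source, f"<README block {index}>", "exec")
        for index, source in enumerate(blocks)
    ]
```

Compiling only checks syntax. Actually running the blocks with `exec` would perform the live fetches the commit message mentions, which is presumably why those tests are kept out of the default `make test` run.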