author | Bryan Newbold <bnewbold@archive.org> | 2020-11-06 18:32:35 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2020-11-06 18:32:35 -0800 |
commit | 175019c96fced3e21d0f60ea1a4a37da6b8872ac (patch) | |
tree | f42fbbe9c8ac06ae9eb06373ab9eec96d2b3a177 /proposals/20201103_xml_ingest.md | |
parent | b0b66c20c6ffb9d8acc626068964d7dfd5d3bcdc (diff) | |
parent | 47ca1a273912c8836630b0930b71a4e66fd2c85b (diff) | |
Merge branch 'bnewbold-html-ingest'
Diffstat (limited to 'proposals/20201103_xml_ingest.md')
-rw-r--r-- | proposals/20201103_xml_ingest.md | 81 |
1 files changed, 81 insertions, 0 deletions
diff --git a/proposals/20201103_xml_ingest.md b/proposals/20201103_xml_ingest.md
new file mode 100644
index 0000000..181cc11
--- /dev/null
+++ b/proposals/20201103_xml_ingest.md
@@ -0,0 +1,81 @@
+
+status: wip
+
+TODO:
+x XML fulltext URL extractor (based on HTML biblio metadata, not PDF url extractor)
+x differentiate JATS XML and scielo XML from generic XML?
+    application/xml+jats is what fatcat is doing for abstracts
+    but it should be application/jats+xml?
+    application/tei+xml
+    if startswith "<article " and "<article-meta>" => JATS
+x refactor ingest worker to be more general
+x have ingest code publish body to kafka topic
+x write a persist worker
+/ create/configure kafka topic
+- test everything locally
+- fatcat: ingest tool to create requests
+- fatcat: entity updates worker creates XML ingest requests for specific sources
+- fatcat: ingest file import worker allows XML results
+- ansible: deployment of persist worker
+
+XML Fulltext Ingest
+====================
+
+This document details changes to include XML fulltext ingest in the same way
+that we currently ingest PDF fulltext.
+
+Currently this will just fetch the single XML document, which is often lacking
+figures, tables, and other required files.
+
+## Text Encoding
+
+Because we would like to treat XML as a string in a couple contexts, but XML
+can have multiple encodings (indicated in an XML header), we are in a bit of a
+bind. Simply parsing into unicode and then re-encoding as UTF-8 could result in
+a header/content mismatch. Any form of re-encoding will change the hash of the
+document. For recording in fatcat, the file metadata will be passed through.
+For storing in Kafka and blob store (for downstream analysis), we will parse
+the raw XML document (as "bytes") with an XML parser, then re-output with UTF-8
+encoding. The hash of the *original* XML file will be used as the key for
+referring to this document. This is unintuitive, but similar to what we are
+doing with PDF and HTML documents (extracting in a useful format, but keeping
+the original document's hash as a key).
+
+Unclear if we need to do this re-encode process for XML documents already in
+UTF-8 encoding.
+
+## Ingest Worker
+
+Could either re-use HTML metadata extractor to fetch XML fulltext links, or
+fork that code off to a separate method, like the PDF fulltext URL extractor.
+
+Hopefully can re-use almost all of the PDF pipeline code, by making that ingest
+worker class more generic and subclassing it.
+
+Result objects are treated the same as PDF ingest results: the result object
+has context about status, and if successful, file metadata and CDX row of the
+terminal object.
+
+TODO: should it be assumed that XML fulltext will end up in S3 bucket? or
+should there be an `xml_meta` SQL table tracking this, like we have for PDFs
+and HTML?
+
+TODO: should we detect and specify the XML schema better? Eg, indicate if JATS.
+
+
+## Persist Pipeline
+
+### Kafka Topic
+
+sandcrawler-ENV.xml-doc
+    similar to other fulltext topics; JSON wrapping the XML
+    key compaction, content compression
+
+### S3/SeaweedFS
+
+`sandcrawler` bucket, `xml` folder. Extension could depend on sub-type of XML?
+
+### Persist Worker
+
+New S3-only worker that pulls from kafka topic and pushes to S3. Works
+basically the same as PDF persist in S3-only mode, or like pdf-text worker.
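
The re-encoding step described under "Text Encoding" could look roughly like the sketch below. It is not the code merged in this branch, only a minimal illustration using the Python standard library. The function name `normalize_xml`, the record fields `sha1hex`, `xml_utf8`, and `is_jats`, and the choice of SHA-1 are all assumptions made for the example; the proposal only says "the hash of the original XML file". The JATS check is a tree-based variant of the string heuristic in the TODO list, not the heuristic itself.

```python
import hashlib
import json
import xml.etree.ElementTree as ET


def normalize_xml(raw: bytes) -> dict:
    """Re-encode an XML document as UTF-8, keyed by the hash of the original bytes."""
    # Hash the bytes exactly as fetched, before any re-encoding.
    # SHA-1 is an assumption; the proposal does not name the hash function.
    original_sha1 = hashlib.sha1(raw).hexdigest()

    # Parsing from bytes lets the parser honor the encoding named in the XML header.
    root = ET.fromstring(raw)

    # Re-serialize with a fresh declaration so header and content agree on UTF-8.
    # Caveat: ElementTree drops any DOCTYPE and comments outside the root element.
    body = ET.tostring(root, encoding="utf-8", xml_declaration=True).decode("utf-8")

    # Hypothetical tree-based variant of the TODO heuristic
    # ('<article ' plus '<article-meta>' => JATS).
    is_jats = root.tag == "article" and root.find(".//article-meta") is not None

    return {"sha1hex": original_sha1, "xml_utf8": body, "is_jats": is_jats}


if __name__ == "__main__":
    raw = (
        '<?xml version="1.0" encoding="ISO-8859-1"?>'
        "<article><front><article-meta/></front><body><p>caf\u00e9</p></body></article>"
    ).encode("iso-8859-1")
    record = normalize_xml(raw)
    # JSON wrapping the XML, in the spirit of the Kafka topic description.
    print(json.dumps(record, ensure_ascii=False))
```

Because the key is computed before parsing, two fetches of the same bytes map to the same key even though the stored body differs from the original. Running already-UTF-8 documents through the same path keeps the pipeline uniform, at the cost of normalizing their serialization as well, which is the open question the proposal raises.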