author | Bryan Newbold <bnewbold@archive.org> | 2020-11-03 22:40:14 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2020-11-03 22:40:14 -0800 |
commit | 653fac9632c6ae9dd036ad844454cf419cd5320b (patch) | |
tree | c09d8a3d8a2524a991f082ab500bce53d1986caa /proposals | |
parent | 9beafd7c5fc98571ec26b49d223ce660378d7b9e (diff) | |
download | sandcrawler-653fac9632c6ae9dd036ad844454cf419cd5320b.tar.gz sandcrawler-653fac9632c6ae9dd036ad844454cf419cd5320b.zip |
xml: re-encode XML docs into UTF-8 for persisting
Diffstat (limited to 'proposals')
-rw-r--r-- | proposals/20201103_xml_ingest.md | 19 |
1 files changed, 18 insertions, 1 deletions
diff --git a/proposals/20201103_xml_ingest.md b/proposals/20201103_xml_ingest.md
index c0d0a79..181cc11 100644
--- a/proposals/20201103_xml_ingest.md
+++ b/proposals/20201103_xml_ingest.md
@@ -10,8 +10,8 @@ x differential JATS XML and scielo XML from generic XML?
     if startswith "<article " and "<article-meta>" => JATS
 x refactor ingest worker to be more general
 x have ingest code publish body to kafka topic
+x write a persist worker
 / create/configure kafka topic
-/ write a persist worker
 - test everything locally
 - fatcat: ingest tool to create requests
 - fatcat: entity updates worker creates XML ingest requests for specific sources
@@ -27,6 +27,23 @@ that we currently ingest PDF fulltext.
 Currently this will just fetch the single XML document, which is often lacking
 figures, tables, and other required files.
 
+## Text Encoding
+
+Because we would like to treat XML as a string in a couple contexts, but XML
+can have multiple encodings (indicated in an XML header), we are in a bit of a
+bind. Simply parsing into unicode and then re-encoding as UTF-8 could result in
+a header/content mismatch. Any form of re-encoding will change the hash of the
+document. For recording in fatcat, the file metadata will be passed through.
+For storing in Kafka and blob store (for downstream analysis), we will parse
+the raw XML document (as "bytes") with an XML parser, then re-output with UTF-8
+encoding. The hash of the *original* XML file will be used as the key for
+referring to this document. This is unintuitive, but similar to what we are
+doing with PDF and HTML documents (extracting in a useful format, but keeping
+the original document's hash as a key).
+
+Unclear if we need to do this re-encode process for XML documents already in
+UTF-8 encoding.
+
 ## Ingest Worker
 
 Could either re-use HTML metadata extractor to fetch XML fulltext links, or
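
The re-encoding approach described under "Text Encoding" in the patch could be sketched roughly as follows. This is a minimal illustration and not the code from the sandcrawler repository. The function name `reencode_xml_utf8`, the choice of SHA-1 as the key hash, and the use of the standard-library `xml.etree.ElementTree` parser are all assumptions made for the example; the proposal only says "an XML parser" and "the hash".

```python
import hashlib
import xml.etree.ElementTree as ET
from typing import Tuple


def reencode_xml_utf8(raw_xml: bytes) -> Tuple[str, bytes]:
    """Hypothetical helper: return (key, utf8_xml) for a raw XML document.

    The key is computed from the *original* bytes, so it stays stable even
    though the persisted body is re-encoded. SHA-1 is an assumed choice.
    """
    key = hashlib.sha1(raw_xml).hexdigest()

    # Parse from bytes (not str) so the parser can honor the encoding
    # declared in the XML header.
    root = ET.fromstring(raw_xml)

    # Re-serialize as UTF-8 with a fresh declaration, so that header and
    # content agree. The xml_declaration argument needs Python 3.8+.
    utf8_xml = ET.tostring(root, encoding="utf-8", xml_declaration=True)
    return key, utf8_xml


if __name__ == "__main__":
    # A Latin-1 encoded document with a matching header.
    raw = (
        '<?xml version="1.0" encoding="ISO-8859-1"?>'
        "<article><title>caf\u00e9</title></article>"
    ).encode("latin-1")
    key, body = reencode_xml_utf8(raw)
    print(key)
    print(body.decode("utf-8"))
```

Note that `ElementTree` does not round-trip everything: comments, processing instructions, and any DOCTYPE are dropped by the default parser, and namespace prefixes may be rewritten on output. Whether that loss is acceptable for downstream analysis is a separate decision that the proposal does not address.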