xml: re-encode XML docs into UTF-8 for persisting

author: Bryan Newbold <bnewbold@archive.org> 2020-11-03 22:40:14 -0800
committer: Bryan Newbold <bnewbold@archive.org> 2020-11-03 22:40:14 -0800
commit: 653fac9632c6ae9dd036ad844454cf419cd5320b (patch)
tree: c09d8a3d8a2524a991f082ab500bce53d1986caa /proposals/20201103_xml_ingest.md
parent: 9beafd7c5fc98571ec26b49d223ce660378d7b9e (diff)
download: sandcrawler-653fac9632c6ae9dd036ad844454cf419cd5320b.tar.gz
sandcrawler-653fac9632c6ae9dd036ad844454cf419cd5320b.zip
1 files changed, 18 insertions, 1 deletions
diff --git a/proposals/20201103_xml_ingest.md b/proposals/20201103_xml_ingest.md
index c0d0a79..181cc11 100644
--- a/proposals/20201103_xml_ingest.md
+++ b/proposals/20201103_xml_ingest.md
@@ -10,8 +10,8 @@ x differential JATS XML and scielo XML from generic XML?
     if startswith "<article " and "<article-meta>" => JATS
 x refactor ingest worker to be more general
 x have ingest code publish body to kafka topic
+x write a persist worker
 / create/configure kafka topic
-/ write a persist worker
 - test everything locally
 - fatcat: ingest tool to create requests
 - fatcat: entity updates worker creates XML ingest requests for specific sources
@@ -27,6 +27,23 @@ that we currently ingest PDF fulltext.
 Currently this will just fetch the single XML document, which is often lacking
 figures, tables, and other required files.
 
+## Text Encoding
+
+Because we would like to treat XML as a string in a couple of contexts, but XML
+can have multiple encodings (indicated in an XML header), we are in a bit of a
+bind. Simply parsing into unicode and then re-encoding as UTF-8 could result in
+a header/content mismatch. Any form of re-encoding will change the hash of the
+document. For recording in fatcat, the file metadata will be passed through.
+For storing in Kafka and blob store (for downstream analysis), we will parse
+the raw XML document (as "bytes") with an XML parser, then re-output with UTF-8
+encoding. The hash of the *original* XML file will be used as the key for
+referring to this document. This is unintuitive, but similar to what we are
+doing with PDF and HTML documents (extracting in a useful format, but keeping
+the original document's hash as a key).
+
+Unclear if we need to do this re-encode process for XML documents already in
+UTF-8 encoding.
+
 ## Ingest Worker
 
 Could either re-use HTML metadata extractor to fetch XML fulltext links, or
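
The re-encoding step described in the new "Text Encoding" section could be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the function name `reencode_xml`, the choice of SHA-1 as the hash, and the use of the standard-library `xml.etree.ElementTree` parser are all assumptions made for the example. It requires Python 3.8 or later for the `xml_declaration` argument.

```python
import hashlib
import xml.etree.ElementTree as ET
from typing import Tuple


def reencode_xml(raw: bytes) -> Tuple[str, bytes]:
    """Return (sha1 hex of the original bytes, document re-serialized as UTF-8).

    Hypothetical helper: the name and the SHA-1 choice are assumptions.
    """
    # Hash the untouched input first, so the key always refers to the original file.
    original_sha1 = hashlib.sha1(raw).hexdigest()
    # Parsing from bytes lets the parser honor the encoding named in the XML header.
    root = ET.fromstring(raw)
    # Re-serialize with a fresh declaration so header and content agree on UTF-8.
    utf8_doc = ET.tostring(root, encoding="utf-8", xml_declaration=True)
    return original_sha1, utf8_doc


# Example input: a Latin-1 document whose header declares its own encoding.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<article><title>caf\u00e9</title></article>"
).encode("iso-8859-1")

key, body = reencode_xml(latin1_doc)
```

Here `key` identifies the original bytes while `body` is what would be stored in Kafka and the blob store. One caveat of this particular parser: `ElementTree` does not preserve the DOCTYPE or any comments and processing instructions outside the root element, so a different parser may be preferable if those need to survive.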