aboutsummaryrefslogtreecommitdiffstats
path: root/kafka
diff options
context:
space:
mode:
authorBryan Newbold <bnewbold@archive.org>2020-11-06 18:32:35 -0800
committerBryan Newbold <bnewbold@archive.org>2020-11-06 18:32:35 -0800
commit175019c96fced3e21d0f60ea1a4a37da6b8872ac (patch)
treef42fbbe9c8ac06ae9eb06373ab9eec96d2b3a177 /kafka
parentb0b66c20c6ffb9d8acc626068964d7dfd5d3bcdc (diff)
parent47ca1a273912c8836630b0930b71a4e66fd2c85b (diff)
downloadsandcrawler-175019c96fced3e21d0f60ea1a4a37da6b8872ac.tar.gz
sandcrawler-175019c96fced3e21d0f60ea1a4a37da6b8872ac.zip
Merge branch 'bnewbold-html-ingest'
Diffstat (limited to 'kafka')
-rw-r--r--kafka/topics.md16
1 files changed, 16 insertions, 0 deletions
diff --git a/kafka/topics.md b/kafka/topics.md
index 7a34c83..06faf8e 100644
--- a/kafka/topics.md
+++ b/kafka/topics.md
@@ -59,6 +59,18 @@ retention (on both a size and time basis).
=> 12 partitions
=> key is sha1hex of PDF; enable key compaction; gzip compression
+ sandcrawler-ENV.xml-doc
+ => fulltext XML; mostly JATS XML
+ => schema is JSON, with 'jats_xml' field containing the XML as a string
+ => 6 partitions
+ => key is sha1hex of XML document; enable key compaction; gzip compression
+
+ sandcrawler-ENV.html-teixml
+ => extracted fulltext from HTML; mostly TEI-XML
+ => schema is JSON, with 'tei_xml' field containing the XML as a string
+ => 6 partitions
+ => key is sha1hex of source HTML document; enable key compaction; gzip compression
+
sandcrawler-ENV.pdf-thumbnail-SIZE-TYPE
=> thumbnail images (eg, png, jpg) from PDFs
=> raw bytes in message (no JSON or other wrapping). fields average 10 KByte
@@ -194,3 +206,7 @@ exists`; this seems safe, and the settings won't be over-ridden.
./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 6 --topic scholar-qa.sim-updates
./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 12 --topic scholar-qa.update-docs --config compression.type=gzip --config cleanup.policy=compact --config retention.ms=7889400000
+
+ ./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 6 --topic sandcrawler-qa.xml-doc --config compression.type=gzip --config cleanup.policy=compact
+ ./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 6 --topic sandcrawler-qa.html-teixml --config compression.type=gzip --config cleanup.policy=compact
+