From 386cb8335d4d1a66b75301a244f7baed49658588 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Wed, 17 Jun 2020 18:06:12 -0700
Subject: tweak kafka topic names and seaweedfs layout

---
 proposals/2020_pdf_meta_thumbnails.md | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

(limited to 'proposals')

diff --git a/proposals/2020_pdf_meta_thumbnails.md b/proposals/2020_pdf_meta_thumbnails.md
index d7578cb..eacbfa5 100644
--- a/proposals/2020_pdf_meta_thumbnails.md
+++ b/proposals/2020_pdf_meta_thumbnails.md
@@ -22,7 +22,7 @@ against the existing SQL table to avoid duplication of processing.
 
 ## PDF Metadata and Text
 
-Kafka topic (name: `sandcrawler-ENV.pdftext`; 12x partitions; gzip
+Kafka topic (name: `sandcrawler-ENV.pdf-text`; 12x partitions; gzip
 compression) JSON schema:
 
 sha1hex (string; used as key)
@@ -73,8 +73,9 @@ Kafka, and we don't want SQL table size to explode. Schema:
 Kafka Schema is raw image bytes as message body; sha1sum of PDF as the key. No
 compression, 12x partitions.
 
-Topic name is `sandcrawler-ENV.thumbnail-SIZE-png`. Thus, topic name contains
-the "metadata" of thumbail size/shape.
+Kafka topic name is `sandcrawler-ENV.pdf-thumbnail-SIZE-TYPE` (eg,
+`sandcrawler-qa.pdf-thumbnail-180px-jpg`). Thus, topic name contains the
+"metadata" of thumbail size/shape.
 
 Have decided to use JPEG thumbnails, 180px wide (and max 300px high, though
 width restriction is almost always the limiting factor). This size matches that
-- 
cgit v1.2.3
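
The topic naming scheme introduced by this patch (`sandcrawler-ENV.pdf-text` and `sandcrawler-ENV.pdf-thumbnail-SIZE-TYPE`) could be expressed roughly as follows. This is only a minimal sketch: the helper names `pdf_text_topic` and `pdf_thumbnail_topic` are hypothetical and are not taken from the sandcrawler codebase.

```python
# Hypothetical helpers illustrating the topic naming convention from the patch.
# Neither function name comes from sandcrawler itself.

def pdf_text_topic(env: str) -> str:
    """Build the PDF metadata/text topic name for an environment (eg, 'qa')."""
    return f"sandcrawler-{env}.pdf-text"


def pdf_thumbnail_topic(env: str, size: str, image_type: str) -> str:
    """Build a thumbnail topic name; size and image type are encoded in the name."""
    return f"sandcrawler-{env}.pdf-thumbnail-{size}-{image_type}"


if __name__ == "__main__":
    print(pdf_text_topic("qa"))
    print(pdf_thumbnail_topic("qa", "180px", "jpg"))
```

With `env="qa"`, `size="180px"`, and `image_type="jpg"`, the second helper yields the example topic name given in the patch, `sandcrawler-qa.pdf-thumbnail-180px-jpg`.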