author | Bryan Newbold <bnewbold@archive.org> | 2020-06-17 18:06:12 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2020-06-17 18:06:12 -0700 |
commit | 386cb8335d4d1a66b75301a244f7baed49658588 (patch) | |
tree | a837ded7f4579ca7d9adcbd93f711347c7455b86 /proposals | |
parent | 815c2d115bbc2a64595a682bd15b95beac497c82 (diff) | |
tweak kafka topic names and seaweedfs layout
Diffstat (limited to 'proposals')
-rw-r--r-- | proposals/2020_pdf_meta_thumbnails.md | 7 |
1 files changed, 4 insertions, 3 deletions
diff --git a/proposals/2020_pdf_meta_thumbnails.md b/proposals/2020_pdf_meta_thumbnails.md
index d7578cb..eacbfa5 100644
--- a/proposals/2020_pdf_meta_thumbnails.md
+++ b/proposals/2020_pdf_meta_thumbnails.md
@@ -22,7 +22,7 @@ against the existing SQL table to avoid duplication of processing.
 
 ## PDF Metadata and Text
 
-Kafka topic (name: `sandcrawler-ENV.pdftext`; 12x partitions; gzip
+Kafka topic (name: `sandcrawler-ENV.pdf-text`; 12x partitions; gzip
 compression) JSON schema:
 
 sha1hex (string; used as key)
@@ -73,8 +73,9 @@ Kafka, and we don't want SQL table size to explode. Schema:
 Kafka Schema is raw image bytes as message body; sha1sum of PDF as the key. No
 compression, 12x partitions.
 
-Topic name is `sandcrawler-ENV.thumbnail-SIZE-png`. Thus, topic name contains
-the "metadata" of thumbail size/shape.
+Kafka topic name is `sandcrawler-ENV.pdf-thumbnail-SIZE-TYPE` (eg,
+`sandcrawler-qa.pdf-thumbnail-180px-jpg`). Thus, topic name contains the
+"metadata" of thumbail size/shape.
 
 Have decided to use JPEG thumbnails, 180px wide (and max 300px high, though
 width restriction is almost always the limiting factor). This size matches that
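
The renamed topics follow a fixed pattern, so a rough sketch of how a producer or consumer might build and parse them may help. This is not code from sandcrawler: the function names `pdf_text_topic`, `pdf_thumbnail_topic`, and `parse_pdf_thumbnail_topic` are hypothetical, and the defaults (`180px`, `jpg`) are only taken from the example topic in the proposal text.

```python
import re
from typing import Optional

# Pattern for `sandcrawler-ENV.pdf-thumbnail-SIZE-TYPE`, as named in the proposal.
# Assumption: ENV contains no dots, and SIZE and TYPE contain no hyphens.
_THUMBNAIL_TOPIC_RE = re.compile(
    r"^sandcrawler-(?P<env>[^.]+)\.pdf-thumbnail-(?P<size>[^-]+)-(?P<image_type>[^-]+)$"
)


def pdf_text_topic(env: str) -> str:
    """Topic for PDF metadata/text JSON messages (hypothetical helper)."""
    return f"sandcrawler-{env}.pdf-text"


def pdf_thumbnail_topic(env: str, size: str = "180px", image_type: str = "jpg") -> str:
    """Topic for raw thumbnail bytes; size and type are encoded in the name."""
    return f"sandcrawler-{env}.pdf-thumbnail-{size}-{image_type}"


def parse_pdf_thumbnail_topic(topic: str) -> Optional[dict]:
    """Recover the env/size/type "metadata" from a thumbnail topic name.

    Returns None when the name does not match the thumbnail pattern.
    """
    match = _THUMBNAIL_TOPIC_RE.match(topic)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(pdf_text_topic("qa"))
    print(pdf_thumbnail_topic("qa"))
    print(parse_pdf_thumbnail_topic("sandcrawler-qa.pdf-thumbnail-180px-jpg"))
```

Because the size and type live in the topic name rather than in each message, a consumer only needs the name to know what kind of image bytes it is reading, which is the point of the "metadata in the topic name" remark above.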