path: root/proposals/2020_pdf_meta_thumbnails.md
author Bryan Newbold <bnewbold@archive.org> 2020-06-17 18:06:12 -0700
committer Bryan Newbold <bnewbold@archive.org> 2020-06-17 18:06:12 -0700
commit 386cb8335d4d1a66b75301a244f7baed49658588
tree a837ded7f4579ca7d9adcbd93f711347c7455b86 /proposals/2020_pdf_meta_thumbnails.md
parent 815c2d115bbc2a64595a682bd15b95beac497c82
tweak kafka topic names and seaweedfs layout
Diffstat (limited to 'proposals/2020_pdf_meta_thumbnails.md')
-rw-r--r-- proposals/2020_pdf_meta_thumbnails.md | 7
1 files changed, 4 insertions, 3 deletions
diff --git a/proposals/2020_pdf_meta_thumbnails.md b/proposals/2020_pdf_meta_thumbnails.md
index d7578cb..eacbfa5 100644
--- a/proposals/2020_pdf_meta_thumbnails.md
+++ b/proposals/2020_pdf_meta_thumbnails.md
@@ -22,7 +22,7 @@ against the existing SQL table to avoid duplication of processing.
## PDF Metadata and Text
-Kafka topic (name: `sandcrawler-ENV.pdftext`; 12x partitions; gzip
+Kafka topic (name: `sandcrawler-ENV.pdf-text`; 12x partitions; gzip
compression) JSON schema:
sha1hex (string; used as key)
@@ -73,8 +73,9 @@ Kafka, and we don't want SQL table size to explode. Schema:
Kafka Schema is raw image bytes as message body; sha1sum of PDF as the key. No
compression, 12x partitions.
-Topic name is `sandcrawler-ENV.thumbnail-SIZE-png`. Thus, topic name contains
-the "metadata" of thumbail size/shape.
+Kafka topic name is `sandcrawler-ENV.pdf-thumbnail-SIZE-TYPE` (eg,
+`sandcrawler-qa.pdf-thumbnail-180px-jpg`). Thus, topic name contains the
+"metadata" of thumbail size/shape.
Have decided to use JPEG thumbnails, 180px wide (and max 300px high, though
width restriction is almost always the limiting factor). This size matches that