diff options
author | Bryan Newbold <bnewbold@archive.org> | 2020-06-25 16:32:47 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2020-06-25 16:32:47 -0700 |
commit | b15354bd8e1bb44aa8f774f2d73db8742826c4f6 (patch) | |
tree | f4754e665def1781df2945b839ef699e9b903a34 | |
parent | 561a1afe1c5c55a21f9b1db52344e58b2635a3a6 (diff) | |
download | sandcrawler-b15354bd8e1bb44aa8f774f2d73db8742826c4f6.tar.gz sandcrawler-b15354bd8e1bb44aa8f774f2d73db8742826c4f6.zip |
commit old (2020-02) pdftrio commands
-rw-r--r-- | notes/tasks/2020-02-14_pdftrio.md | 162 |
1 files changed, 162 insertions, 0 deletions
diff --git a/notes/tasks/2020-02-14_pdftrio.md b/notes/tasks/2020-02-14_pdftrio.md new file mode 100644 index 0000000..e6f8d8e --- /dev/null +++ b/notes/tasks/2020-02-14_pdftrio.md @@ -0,0 +1,162 @@ + +First end-to-end `pdf_trio` results! + +## Source + +Will use AIT partner #1830 (U Alberta) CDX as input. These are unique by +digest, about 100k. + + ArchiveIt-Collection-1830.download.cdx + +## Testing/Prep + +Versions/setup: + + sandcrawler: f613f69a40fcc9a445f21cadd35d7c36c8061db8 + => patched to 'auto' mode + + pdf_trio: 03bd3fdc15418462b2b1582e4f967f26ddcb43e2 + + pdftrio: 'auto' mode + + uwsgi: 16x processes + + sudo docker run --rm -p 8501:8501 -e TF_XLA_FLAGS=--tf_xla_cpu_global_jit -e KMP_AFFINITY=granularity=fine,compact,1,0 -e KMP_BLOCKTIME=0 -e OMP_NUM_THREADS=24 -e TENSORFLOW_INTER_OP_PARALLELISM=1 -e TENSORFLOW_INTRA_OP_PARALLELISM=24 -v /srv/pdftrio//models/bert_models:/models/bert_model -v /srv/pdftrio//models/pdf_image_classifier_model:/models/image_model -v /srv/pdftrio//config/tfserving_models_docker.config:/models/tfserving_models.config -v /srv/pdftrio/config/tfserving_batch.config:/models/tfserving_batch.config --name pdftrio-tfserving tensorflow/serving --model_config_file=/models/tfserving_models.config --enable_batching=true --batching_parameters_file=/models/tfserving_batch.config + +Basic testing:: + + head -n100 /srv/sandcrawler/tasks/ArchiveIt-Collection-1830.download.cdx | parallel -j20 --pipe --linebuffer ./pdftrio_tool.py --pdftrio-host http://localhost:3939 -j0 classify-pdf-cdx - | jq . + + head -n100 /srv/sandcrawler/tasks/ArchiveIt-Collection-1830.download.cdx | parallel -j20 --pipe --linebuffer ./pdftrio_tool.py --kafka-mode --kafka-env qa --kafka-hosts wbgrp-svc263.us.archive.org,wbgrp-svc284.us.archive.org,wbgrp-svc285.us.archive.org --pdftrio-host http://localhost:3939 -j0 classify-pdf-cdx - + => Running in kafka output mode, publishing to sandcrawler-qa.pdftrio-output + + +On the persist side: + + kafkacat -C -b wbgrp-svc263.us.archive.org -t sandcrawler-qa.pdftrio-output | head | jq . + => looks fine + + ./sandcrawler_worker.py --kafka-hosts wbgrp-svc263.us.archive.org --env qa persist-pdftrio + => Consuming from kafka topic sandcrawler-qa.pdftrio-output, group persist-pdftrio + +Ah, don't forget, start persist before writing to topic! Or would need to reset +offsets to start. + +Seems to be only a single pdftext instance running? Very low CPU + + head -n500 /srv/sandcrawler/tasks/ArchiveIt-Collection-1830.download.cdx | parallel -j40 -N1 --pipe --round-robin --linebuffer ./pdftrio_tool.py --kafka-mode --kafka-env qa --kafka-hosts wbgrp-svc263.us.archive.org,wbgrp-svc284.us.archive.org,wbgrp-svc285.us.archive.org --pdftrio-host http://localhost:3939 -j0 classify-pdf-cdx - + +That is much better! CPU still not pegged, so maybe could do 50x processes? Lots of I/O wait. Blech. + +Zero ("0") not getting persisted for any columns (fixed in sandcrawler/db.py) + +`models_date` not getting set. Added `PDFTRIO_MODELS_DATE="2020-01-01"` to env. (TODO: ansible) + +## Prod Run + + ./sandcrawler_worker.py --kafka-hosts wbgrp-svc263.us.archive.org --env prod persist-pdftrio + + time cat /srv/sandcrawler/tasks/ArchiveIt-Collection-1830.download.cdx | parallel -j40 -N1 --pipe --round-robin --linebuffer ./pdftrio_tool.py --kafka-mode --kafka-env prod --kafka-hosts wbgrp-svc263.us.archive.org,wbgrp-svc284.us.archive.org,wbgrp-svc285.us.archive.org --pdftrio-host http://localhost:3939 -j0 classify-pdf-cdx - + +Worker CPU basically blocked on pdftotext, multiple 100% CPU. Presumably I/O +wait? Though not totally sure. + +htop: + + PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command + 17951 pdftrio 20 0 51756 12868 5856 R 90.1 0.0 0:06.61 pdftotext -nopgbrk -eol unix -enc UTF-8 /tmp/research-p + 17870 pdftrio 20 0 52004 12964 5684 R 87.4 0.0 0:08.61 pdftotext -nopgbrk -eol unix -enc UTF-8 /tmp/research-p + 13735 root 20 0 10.4G 3815M 4144 S 79.6 7.6 48h02:37 tensorflow_model_server --port=8500 --rest_api_port=850 + 14522 pdftrio 20 0 2817M 1331M 16896 R 43.1 2.6 0:57.75 /usr/bin/uwsgi --ini /srv/pdftrio/src/uwsgi.ini + 18027 pdftrio 20 0 49192 10692 6116 R 39.8 0.0 0:00.61 pdftotext -nopgbrk -eol unix -enc UTF-8 /tmp/research-p + 14518 pdftrio 20 0 2818M 1336M 16836 S 33.3 2.7 0:47.46 /usr/bin/uwsgi --ini /srv/pdftrio/src/uwsgi.ini + 14504 pdftrio 20 0 2731M 1310M 13164 D 32.6 2.6 0:34.81 /usr/bin/uwsgi --ini /srv/pdftrio/src/uwsgi.ini + 14526 pdftrio 20 0 2816M 1333M 16832 R 28.7 2.7 0:57.22 /usr/bin/uwsgi --ini /srv/pdftrio/src/uwsgi.ini + 14500 pdftrio 20 0 2729M 1306M 13160 R 20.9 2.6 0:22.57 /usr/bin/uwsgi --ini /srv/pdftrio/src/uwsgi.ini + 14492 pdftrio 20 0 2729M 1307M 13156 S 17.6 2.6 0:17.91 /usr/bin/uwsgi --ini /srv/pdftrio/src/uwsgi.ini + 14508 pdftrio 20 0 2734M 1312M 14380 D 14.4 2.6 0:38.75 /usr/bin/uwsgi --ini /srv/pdftrio/src/uwsgi.ini + 14496 pdftrio 20 0 2728M 1300M 13160 S 13.7 2.6 0:18.00 /usr/bin/uwsgi --ini /srv/pdftrio/src/uwsgi.ini + 17314 sandcrawl 20 0 56668 18228 4304 D 13.7 0.0 0:02.31 perl /usr/bin/parallel -j40 -N1 --pipe --round-robin -- + 14472 pdftrio 20 0 2725M 1283M 13136 S 12.4 2.6 0:05.69 /usr/bin/uwsgi --ini /srv/pdftrio/src/uwsgi.ini + 14513 pdftrio 20 0 2730M 1309M 14300 S 11.1 2.6 0:40.32 /usr/bin/uwsgi --ini /srv/pdftrio/src/uwsgi.ini + 14480 pdftrio 20 0 2725M 1291M 13144 S 10.4 2.6 0:08.77 /usr/bin/uwsgi --ini /srv/pdftrio/src/uwsgi.ini + 14488 pdftrio 20 0 2725M 1294M 13152 S 9.8 2.6 0:08.18 /usr/bin/uwsgi --ini /srv/pdftrio/src/uwsgi.ini + 14468 pdftrio 20 0 2717M 1271M 13088 S 6.5 2.5 0:02.42 /usr/bin/uwsgi --ini /srv/pdftrio/src/uwsgi.ini + 17411 sandcrawl 20 0 556M 53840 14936 S 6.5 0.1 0:01.57 python3 ./pdftrio_tool.py --kafka-mode --kafka-env qa - + 14530 pdftrio 20 0 2524M 1252M 3492 S 4.6 2.5 0:12.72 /usr/bin/uwsgi --ini /srv/pdftrio/src/uwsgi.ini + 7311 bnewbold 20 0 27716 5520 3128 R 3.9 0.0 0:41.59 htop + 17444 sandcrawl 20 0 552M 50456 14892 S 3.9 0.1 0:01.54 python3 ./pdftrio_tool.py --kafka-mode --kafka-env qa - + 18042 pdftrio 20 0 46068 6588 5328 R 3.3 0.0 0:00.05 pdftotext -nopgbrk -eol unix -enc UTF-8 /tmp/research-p + 18043 pdftrio 20 0 4 4 0 R 2.6 0.0 0:00.04 + 2203 grobid 20 0 6334M 126M 4188 S 0.7 0.3 3h27:32 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -XX:MaxMetas + 17419 sandcrawl 20 0 619M 116M 15248 S 0.7 0.2 0:02.68 python3 ./pdftrio_tool.py --kafka-mode --kafka-env qa - + 17440 sandcrawl 20 0 578M 76948 15160 S 0.7 0.1 0:01.54 python3 ./pdftrio_tool.py --kafka-mode --kafka-env qa - + 13848 root 20 0 0 0 0 D 0.7 0.0 0:00.69 kworker/u60:1 + 17443 sandcrawl 20 0 578M 76500 14912 S 0.7 0.1 0:01.74 python3 ./pdftrio_tool.py --kafka-mode --kafka-env qa - + 17414 sandcrawl 20 0 580M 77720 15036 S 0.0 0.2 0:01.77 python3 ./pdftrio_tool.py --kafka-mode --kafka-env qa - + 17432 sandcrawl 20 0 563M 61460 14976 S 0.0 0.1 0:01.59 python3 ./pdftrio_tool.py --kafka-mode --kafka-env qa - + 17442 sandcrawl 20 0 561M 53096 15240 S 0.0 0.1 0:01.47 python3 ./pdftrio_tool.py --kafka-mode --kafka-env qa - + 17433 sandcrawl 20 0 559M 57160 15176 S 0.0 0.1 0:01.52 python3 ./pdftrio_tool.py --kafka-mode --kafka-env qa - + 17431 sandcrawl 20 0 554M 50960 14892 S 0.0 0.1 0:01.37 python3 ./pdftrio_tool.py --kafka-mode --kafka-env qa - + 17413 sandcrawl 20 0 554M 52376 14920 S 0.0 0.1 0:01.57 python3 ./pdftrio_tool.py --kafka-mode --kafka-env qa - + +dstat: + + ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system-- + usr sys idl wai hiq siq| read writ| recv send| in out | int csw + 32 9 22 37 0 0| 0 37M| 20M 12M| 0 0 | 35k 64k + 20 6 24 50 0 0| 0 20M| 30M 5662k| 0 0 | 27k 48k + 27 7 24 43 0 0| 0 26M|8712k 6289k| 0 0 | 21k 114k + 30 8 23 38 0 0|4096B 61M| 17M 20M| 0 0 | 31k 54k + 33 6 17 44 0 0| 0 32M| 14M 6384k| 0 0 | 27k 46k + 25 6 24 44 0 0| 0 19M| 18M 13M| 0 0 | 27k 179k + 40 6 19 35 0 0|8192B 25M|7855k 6661k| 0 0 | 31k 85k + 59 8 12 20 0 0| 0 39M|4177k 33M| 0 0 | 34k 64k + 34 4 17 44 0 0| 0 16M|7527k 11M| 0 0 | 22k 45k + 44 7 17 32 0 0| 0 30M| 20M 291k| 0 0 | 36k 62k + +Create tmpfs: + + sudo mkdir -p /pdftrio-ramdisk + #sudo mount -t tmpfs -o size=2g tmpfs /pdftrio-ramdisk + sudo mount -t tmpfs -o size=6g tmpfs /pdftrio-ramdisk + +add to pdftrio config env and restart: + + TEMP=/run/pdf_trio + +Seems to have worked. Pretty much maxed CPU, may need to back-off parallelism. Doing more than 31/sec. + +Errors were not getting encoded correctly: + + File "/fast/sandcrawler/python/sandcrawler/persist.py", line 331, in push_batch + r['pdf_trio']['key'] = r['key'] + KeyError: 'pdf_trio' + +Fixed in sandcrawler worker, and patched persist to work around this. + + time cat /srv/sandcrawler/tasks/ArchiveIt-Collection-1830.download.cdx | parallel -j30 -N1 --pipe --round-robin --linebuffer ./pdftrio_tool.py --kafka-mode --kafka-env prod --kafka-hosts wbgrp-svc263.us.archive.org,wbgrp-svc284.us.archive.org,wbgrp-svc285.us.archive.org --pdftrio-host http://localhost:3939 -j0 classify-pdf-cdx - + +Wow, 30x parallelism waaaay less? + + time cat /srv/sandcrawler/tasks/ArchiveIt-Collection-1830.download.cdx | parallel -j30 -N1 --pipe --round-robin --linebuffer ./pdftrio_tool.py --kafka-mode --kafka-env prod --kafka-hosts wbgrp-svc263.us.archive.org,wbgrp-svc284.us.archive.org,wbgrp-svc285.us.archive.org --pdftrio-host http://localhost:3939 -j0 classify-pdf-cdx - + +What changed? Confused. Load average was like 40. + +Via kafka, as much as 69.71/sec! Errors? + +Hrm, this whole `auto` thing. I am very skeptical. Should also do a run as `all`, -j20. + + Worker: Counter({'total': 1916, 'pushed': 1916}) + CDX lines pushed: Counter({'total': 1934, 'pushed': 1916, 'skip-parse': 18}) + +Hit some bugs, causing failure, but still seem to have processed a good chunk. + +Switched to `all`, running a different batch: + + time cat /srv/sandcrawler/tasks/ArchiveIt-Collection-1914.download.cdx | parallel -j20 -N1 --pipe --round-robin --linebuffer ./pdftrio_tool.py --kafka-mode --kafka-env prod --kafka-hosts wbgrp-svc263.us.archive.org,wbgrp-svc284.us.archive.org,wbgrp-svc285.us.archive.org --pdftrio-host http://localhost:3939 -j0 classify-pdf-cdx - + +After flag change, another batch in `all`: + + time cat /srv/sandcrawler/tasks/ArchiveIt-Collection-2566.download.cdx | parallel -j20 -N1 --pipe --round-robin --linebuffer ./pdftrio_tool.py --kafka-mode --kafka-env prod --kafka-hosts wbgrp-svc263.us.archive.org,wbgrp-svc284.us.archive.org,wbgrp-svc285.us.archive.org --pdftrio-host http://localhost:3939 -j0 classify-pdf-cdx - + |