diff options
author | Bryan Newbold <bnewbold@archive.org> | 2019-11-15 15:53:28 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2019-11-15 15:53:28 -0800 |
commit | e28125db2735b53e28ab5148379cb8b804c184c6 (patch) | |
tree | 2905bf5341e6a38cca139fcd5885c206606b4f19 /notes | |
parent | 21e9fd509b63182657d1b76b02c2921b2a5334b0 (diff) | |
download | sandcrawler-e28125db2735b53e28ab5148379cb8b804c184c6.tar.gz sandcrawler-e28125db2735b53e28ab5148379cb8b804c184c6.zip |
updated re-GROBID job log entry
Diffstat (limited to 'notes')
-rw-r--r-- | notes/job_log.txt | 31 |
1 files changed, 31 insertions, 0 deletions
diff --git a/notes/job_log.txt b/notes/job_log.txt index 6051b91..68bef9b 100644 --- a/notes/job_log.txt +++ b/notes/job_log.txt @@ -142,3 +142,34 @@ this batch as well. NOTE: really should get parallel kafka worker going soon. if there is a reboot or something in the middle of this process, will need to re-run from the start. +Was getting a bunch of weird kafka INVALID_MSG errors on produce. Would be nice to be able to retry, so doing: + + cat /srv/sandcrawler/tasks/regrobid_cdx.split_*.json | pv -l | parallel --joblog regrobid_job.log --retries 5 -j40 --linebuffer --pipe ./grobid_tool.py --kafka-env prod --kafka-hosts wbgrp-svc263.us.archive.org:9092,wbgrp-svc284.us.archive.org:9092,wbgrp-svc285.us.archive.org:9092 --kafka-mode --grobid-host http://localhost:8070 -j0 extract-json - + +Never mind, going to split into chunks which can be retried. + + cd /srv/sandcrawler/tasks + sudo chown sandcrawler:staff . + cat regrobid_cdx.split_* | split -l 20000 -a4 -d --additional-suffix=.json - chunk_ + ls /srv/sandcrawler/tasks/chunk_*.json | parallel -j4 ./extract_chunk.sh {} + +extract_chunk.sh: + + + #!/bin/bash + + set -x -e -u -o pipefail + + if [ -f $1.SUCCESS ]; then + echo "Skipping: $1..." + exit + fi + + echo "Extracting $1..." + + date + cat $1 | parallel -j10 --linebuffer --round-robin --pipe ./grobid_tool.py --kafka-env prod --kafka-hosts wbgrp-svc263.us.archive.org:9092,wbgrp-svc284.us.archive.org:9092,wbgrp-svc285.us.archive.org:9092 --kafka-mode --grobid-host http://localhost:8070 -j0 extract-json - + + touch $1.SUCCESS + +seems to be working better! tested and if there is a problem with one chunk the others continue |