 notes/ingest/2020-03-04_mag.md | 71 +++++++++++++++++++++++++++++++++++++++++
 proposals/20200211_nsq.md      | 79 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 150 insertions(+), 0 deletions(-)
diff --git a/notes/ingest/2020-03-04_mag.md b/notes/ingest/2020-03-04_mag.md
index 97594c8..9b000a3 100644
--- a/notes/ingest/2020-03-04_mag.md
+++ b/notes/ingest/2020-03-04_mag.md
@@ -406,3 +406,74 @@ Full run:
 
 2020-04-07 12:19 (pacific): 11,703,871
 
+## Post-bulk-ingest
+
+Around 2020-04-28 it seems like the main wave of bulk ingest is complete. Will
+need to re-try things like cdx-error.
+
+Current status:
+
+     status                        |  count
+    -------------------------------+----------
+     success                       | 18491799
+     redirect-loop                 |  1968530
+     no-capture                    |  1373657
+     no-pdf-link                   |  1311842
+     link-loop                     |  1296439
+     terminal-bad-status           |   627577
+     cdx-error                     |   418278
+     wrong-mimetype                |    50141
+     wayback-error                 |    37159
+     petabox-error                 |    11249
+     null-body                     |     6295
+     gateway-timeout               |     3051
+     spn2-cdx-lookup-failure       |      328
+     spn2-error:invalid-url-syntax |       93
+     bad-redirect                  |       75
+                                   |       47
+     invalid-host-resolution       |       28
+     spn2-error                    |       10
+     bad-gzip-encoding             |        7
+     redirects-exceeded            |        2
+    (20 rows)
+
+Lots of cdx-error to retry; a dump query for those is sketched at the end of
+this section.
+
+The no-capture links are probably a mix of domain-blocklist and things that
+failed in bulk mode. Will dump and re-attempt them:
+
+    COPY (
+        SELECT row_to_json(ingest_request.*) FROM ingest_request
+        LEFT JOIN ingest_file_result
+            ON ingest_file_result.ingest_type = ingest_request.ingest_type
+            AND ingest_file_result.base_url = ingest_request.base_url
+        WHERE
+            ingest_request.ingest_type = 'pdf'
+            AND ingest_request.link_source = 'mag'
+            AND ingest_file_result.status = 'no-capture'
+            AND ingest_request.base_url NOT LIKE '%journals.sagepub.com%'
+            AND ingest_request.base_url NOT LIKE '%pubs.acs.org%'
+            AND ingest_request.base_url NOT LIKE '%ahajournals.org%'
+            AND ingest_request.base_url NOT LIKE '%www.journal.csj.jp%'
+            AND ingest_request.base_url NOT LIKE '%aip.scitation.org%'
+            AND ingest_request.base_url NOT LIKE '%academic.oup.com%'
+            AND ingest_request.base_url NOT LIKE '%tandfonline.com%'
+    ) TO '/grande/snapshots/mag_nocapture_20200420.rows.json';
+    => 859849
+
+What domains are these?
+
+    cat mag_nocapture_20200420.rows.json \
+        | jq .base_url -r \
+        | cut -f3 -d/ \
+        | sort | uniq -c | sort -nr | head -n30
+
+Let's filter down more:
+
+    cat mag_nocapture_20200420.rows.json \
+        | rg -v 'www.researchgate.net' \
+        | rg -v 'muse.jhu.edu' \
+        | rg -v 'www.omicsonline.org' \
+        | rg -v 'link.springer.com' \
+        | rg -v 'iopscience.iop.org' \
+        | rg -v 'ieeexplore.ieee.org' \
+        | shuf \
+        > mag_nocapture_20200420.rows.filtered.json
+
+    wc -l mag_nocapture_20200420.rows.filtered.json
+    423085 mag_nocapture_20200420.rows.filtered.json
+
+Ok, enqueue!
+
+    cat mag_nocapture_20200420.rows.filtered.json \
+        | shuf | jq . -c \
+        | kafkacat -P -b wbgrp-svc263.us.archive.org \
+            -t sandcrawler-prod.ingest-file-requests -p -1
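+As a sketch of the cdx-error retry mentioned above (not yet run; the output
+path and date are hypothetical), the dump would mirror the no-capture query
+with just the status swapped:
+
+    COPY (
+        SELECT row_to_json(ingest_request.*) FROM ingest_request
+        LEFT JOIN ingest_file_result
+            ON ingest_file_result.ingest_type = ingest_request.ingest_type
+            AND ingest_file_result.base_url = ingest_request.base_url
+        WHERE
+            ingest_request.ingest_type = 'pdf'
+            AND ingest_request.link_source = 'mag'
+            AND ingest_file_result.status = 'cdx-error'
+    ) TO '/grande/snapshots/mag_cdx_error_20200428.rows.json';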
diff --git a/proposals/20200211_nsq.md b/proposals/20200211_nsq.md
new file mode 100644
index 0000000..6aa885b
--- /dev/null
+++ b/proposals/20200211_nsq.md
@@ -0,0 +1,79 @@
+
+status: planned
+
+In short, Kafka is not working well as a job task scheduler, and I want to try
+NSQ as a medium-term solution to that problem.
+
+
+## Motivation
+
+Thinking of setting up NSQ to use for scheduling distributed work, to replace
+Kafka for some topics. For example, "regrobid" requests where we enqueue
+millions of (basically) CDX lines and want to process them on dozens of cores
+or multiple machines. Or file ingest backfill. Results would still go to Kafka
+(to persist), and pipelines like DOI harvest -> import -> elasticsearch would
+still be Kafka.
+
+The pain point with Kafka is having dozens of workers on tasks that take more
+than a couple seconds each. We could keep tweaking Kafka and writing weird
+consumer group things to handle this, but I think it will never work very
+well. NSQ supports re-queues with delay (eg, on failure, defer to re-process
+later), allows many workers to connect and leave with no disruption, doesn't
+require messages to be processed in order, and has a very simple enqueue API
+(HTTP POST).
+
+The slowish tasks we have now are file ingest (wayback and/or SPNv2 +
+GROBID) and re-GROBID. In the near future we will also have an ML backlog to
+work through.
+
+Throughput isn't much of a concern, as tasks take 10+ seconds each.
+
+
+## Specific Plan
+
+Continue publishing ingest requests to the Kafka topic. Have a new persist
+worker consume from this topic and push to the request table (but not the
+result table) using `ON CONFLICT DO NOTHING`. Have a new single-process Kafka
+consumer pull from the topic and push to NSQ. This consumer monitors NSQ and
+doesn't push too many requests (eg, 1k maximum); a sketch of this pusher is at
+the end of this document. NSQ could potentially even run in in-memory mode.
+Add a new worker/pusher class that acts as an NSQ client, possibly with
+parallelism.
+
+*Clean* NSQ shutdown/restart always persists data locally to disk.
+
+Unclean shutdown (eg, power failure) would mean NSQ might have lost state.
+Because we are persisting requests to sandcrawler-db, cleanup is simple:
+re-enqueue all requests from the past N days with a null result, or a result
+older than M days.
+
+Still need multiple Kafka and NSQ topics to have priority queues (eg, bulk vs.
+platform-specific).
+
+To start, have a single static NSQ host; we don't need nsqlookupd. Could use
+wbgrp-svc506 (datanode VM with SSD, lots of CPU and RAM).
+
+To move hosts, simply restart the Kafka pusher pointing at the new NSQ host.
+When the old host's queue is empty, restart the workers to consume from the
+new host, and destroy the old NSQ host.
+
+
+## Alternatives
+
+Workarounds I've used to date have been the `grobid_tool.py` and
+`ingest_tool.py` JSON input modes, piping JSON task files (millions of lines)
+through GNU parallel. I guess GNU parallel's distributed mode is also an
+option here.
+
+Other things that could be used:
+
+**celery**: popular, many features; needs a separate Redis; no disk
+persistence (?)
+
+**disque**: needs to run Redis; no disk persistence (?)
+<https://github.com/antirez/disque>
+
+**gearman**: <http://gearman.org/>; no disk persistence (?)
+
+
+## Old Notes
+
+TBD whether we would want to switch ingest requests from fatcat ->
+sandcrawler over as well, and have the continuous ingests run out of NSQ, or
+keep using Kafka for that. We can currently only do about 10x parallelism
+with SPNv2, so that isn't a scaling pain point.
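+
+## Appendix: Pusher Sketch
+
+A minimal sketch (in Python) of the Kafka-to-NSQ pusher described in the
+Specific Plan. The consumer group id, NSQ topic name, host choice, and depth
+cap are assumptions, and confluent-kafka is a stand-in for whichever Kafka
+client we actually use; nsqd's real HTTP endpoints (`/pub`, `/stats`) do the
+enqueueing and the back-pressure check:
+
+    import time
+
+    import requests
+    from confluent_kafka import Consumer
+
+    KAFKA_TOPIC = "sandcrawler-prod.ingest-file-requests"
+    NSQ_HOST = "http://wbgrp-svc506.us.archive.org:4151"  # assumed host:port
+    NSQ_TOPIC = "ingest-file-requests"                    # hypothetical topic
+    MAX_DEPTH = 1000  # "eg, 1k maximum"
+
+    def nsq_depth(topic):
+        """Current queue depth for a topic, via nsqd's HTTP stats endpoint."""
+        resp = requests.get(NSQ_HOST + "/stats", params={"format": "json"})
+        resp.raise_for_status()
+        stats = resp.json()
+        # some nsqd versions wrap the payload in a "data" key
+        stats = stats.get("data", stats)
+        for t in stats.get("topics", []):
+            if t["topic_name"] == topic:
+                return t["depth"]
+        return 0
+
+    def run():
+        consumer = Consumer({
+            "bootstrap.servers": "wbgrp-svc263.us.archive.org",
+            "group.id": "nsq-pusher",  # hypothetical consumer group
+            "enable.auto.commit": False,
+        })
+        consumer.subscribe([KAFKA_TOPIC])
+        while True:
+            # back-pressure: don't push while NSQ depth is above the cap
+            while nsq_depth(NSQ_TOPIC) >= MAX_DEPTH:
+                time.sleep(5)
+            msg = consumer.poll(timeout=1.0)
+            if msg is None or msg.error():
+                continue
+            # the "very simple enqueue API": a bare HTTP POST of the body
+            requests.post(NSQ_HOST + "/pub", params={"topic": NSQ_TOPIC},
+                          data=msg.value()).raise_for_status()
+            # commit only after a successful push, so requests aren't lost
+            consumer.commit(message=msg)
+
+    if __name__ == "__main__":
+        run()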