 notes/ingest/2020-03-04_mag.md | 71 +++++++++++++++++++++++++++++++++++++++++
 proposals/20200211_nsq.md      | 79 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 150 insertions(+), 0 deletions(-)
diff --git a/notes/ingest/2020-03-04_mag.md b/notes/ingest/2020-03-04_mag.md
index 97594c8..9b000a3 100644
--- a/notes/ingest/2020-03-04_mag.md
+++ b/notes/ingest/2020-03-04_mag.md
@@ -406,3 +406,74 @@ Full run:
 
 2020-04-07 12:19 (pacific): 11,703,871
 
+## Post-bulk-ingest
+
+Around 2020-04-28 it seems like the main wave of bulk ingest is complete. Will
+need to re-try things like cdx-error.
+
+Current status:
+
+     status                        |  count
+    -------------------------------+----------
+     success                       | 18491799
+     redirect-loop                 |  1968530
+     no-capture                    |  1373657
+     no-pdf-link                   |  1311842
+     link-loop                     |  1296439
+     terminal-bad-status           |   627577
+     cdx-error                     |   418278
+     wrong-mimetype                |    50141
+     wayback-error                 |    37159
+     petabox-error                 |    11249
+     null-body                     |     6295
+     gateway-timeout               |     3051
+     spn2-cdx-lookup-failure       |      328
+     spn2-error:invalid-url-syntax |       93
+     bad-redirect                  |       75
+                                   |       47
+     invalid-host-resolution       |       28
+     spn2-error                    |       10
+     bad-gzip-encoding             |        7
+     redirects-exceeded            |        2
+    (20 rows)
+
+Lots of cdx-error to retry; a dump query for those is sketched at the end of
+this section.
+
+The no-capture links are probably a mix of domain-blocklist and things that
+failed in bulk mode. Will dump and re-attempt them:
+
+    COPY (
+        SELECT row_to_json(ingest_request.*) FROM ingest_request
+        LEFT JOIN ingest_file_result
+            ON ingest_file_result.ingest_type = ingest_request.ingest_type
+            AND ingest_file_result.base_url = ingest_request.base_url
+        WHERE
+            ingest_request.ingest_type = 'pdf'
+            AND ingest_request.link_source = 'mag'
+            AND ingest_file_result.status = 'no-capture'
+            AND ingest_request.base_url NOT LIKE '%journals.sagepub.com%'
+            AND ingest_request.base_url NOT LIKE '%pubs.acs.org%'
+            AND ingest_request.base_url NOT LIKE '%ahajournals.org%'
+            AND ingest_request.base_url NOT LIKE '%www.journal.csj.jp%'
+            AND ingest_request.base_url NOT LIKE '%aip.scitation.org%'
+            AND ingest_request.base_url NOT LIKE '%academic.oup.com%'
+            AND ingest_request.base_url NOT LIKE '%tandfonline.com%'
+    ) TO '/grande/snapshots/mag_nocapture_20200420.rows.json';
+    => 859849
+
+What domains are these?
+
+    cat mag_nocapture_20200420.rows.json \
+        | jq .base_url -r \
+        | cut -f3 -d/ \
+        | sort | uniq -c | sort -nr | head -n30
+
+Let's filter down more:
+
+    cat mag_nocapture_20200420.rows.json \
+        | rg -v 'www.researchgate.net' \
+        | rg -v 'muse.jhu.edu' \
+        | rg -v 'www.omicsonline.org' \
+        | rg -v 'link.springer.com' \
+        | rg -v 'iopscience.iop.org' \
+        | rg -v 'ieeexplore.ieee.org' \
+        | shuf \
+        > mag_nocapture_20200420.rows.filtered.json
+
+    wc -l mag_nocapture_20200420.rows.filtered.json
+    423085 mag_nocapture_20200420.rows.filtered.json
+
+Ok, enqueue!
+
+    cat mag_nocapture_20200420.rows.filtered.json \
+        | shuf | jq . -c \
+        | kafkacat -P -b wbgrp-svc263.us.archive.org \
+            -t sandcrawler-prod.ingest-file-requests -p -1
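+As a sketch of the cdx-error retry mentioned above (not yet run; the output
+path and date are hypothetical), the dump would mirror the no-capture query
+with just the status swapped:
+
+    COPY (
+        SELECT row_to_json(ingest_request.*) FROM ingest_request
+        LEFT JOIN ingest_file_result
+            ON ingest_file_result.ingest_type = ingest_request.ingest_type
+            AND ingest_file_result.base_url = ingest_request.base_url
+        WHERE
+            ingest_request.ingest_type = 'pdf'
+            AND ingest_request.link_source = 'mag'
+            AND ingest_file_result.status = 'cdx-error'
+    ) TO '/grande/snapshots/mag_cdx_error_20200428.rows.json';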
diff --git a/proposals/20200211_nsq.md b/proposals/20200211_nsq.md
new file mode 100644
index 0000000..6aa885b
--- /dev/null
+++ b/proposals/20200211_nsq.md
@@ -0,0 +1,79 @@
+
+status: planned
+
+In short, Kafka is not working well as a job task scheduler, and I want to try
+NSQ as a medium-term solution to that problem.
+
+
+## Motivation
+
+Thinking of setting up NSQ to use for scheduling distributed work, to replace
+Kafka for some topics. For example, "regrobid" requests where we enqueue
+millions of (basically) CDX lines and want to process them on dozens of cores
+or multiple machines. Or file ingest backfill. Results would still go to Kafka
+(to persist), and pipelines like DOI harvest -> import -> elasticsearch would
+still be Kafka.
+
+The pain point with Kafka is having dozens of workers on tasks that take more
+than a couple seconds each. We could keep tweaking Kafka and writing weird
+consumer group things to handle this, but I think it will never work very
+well. NSQ supports re-queues with delay (eg, on failure, defer to re-process
+later), allows many workers to connect and leave with no disruption, doesn't
+require messages to be processed in order, and has a very simple enqueue API
+(HTTP POST).
+
+The slowish tasks we have now are file ingest (wayback and/or SPNv2 +
+GROBID) and re-GROBID. In the near future we will also have an ML backlog to
+work through.
+
+Throughput isn't much of a concern, as tasks take 10+ seconds each.
+
+
+## Specific Plan
+
+Continue publishing ingest requests to the Kafka topic. Have a new persist
+worker consume from this topic and push to the request table (but not the
+result table) using `ON CONFLICT DO NOTHING`. Have a new single-process Kafka
+consumer pull from the topic and push to NSQ. This consumer monitors NSQ and
+doesn't push too many requests (eg, 1k maximum); a sketch of this pusher is at
+the end of this document. NSQ could potentially even run in in-memory mode.
+Add a new worker/pusher class that acts as an NSQ client, possibly with
+parallelism.
+
+*Clean* NSQ shutdown/restart always persists data locally to disk.
+
+Unclean shutdown (eg, power failure) would mean NSQ might have lost state.
+Because we are persisting requests to sandcrawler-db, cleanup is simple:
+re-enqueue all requests from the past N days with a null result, or a result
+older than M days.
+
+Still need multiple Kafka and NSQ topics to have priority queues (eg, bulk vs.
+platform-specific).
+
+To start, have a single static NSQ host; we don't need nsqlookupd. Could use
+wbgrp-svc506 (datanode VM with SSD, lots of CPU and RAM).
+
+To move hosts, simply restart the Kafka pusher pointing at the new NSQ host.
+When the old host's queue is empty, restart the workers to consume from the
+new host, and destroy the old NSQ host.
+
+
+## Alternatives
+
+Workarounds I've used to date have been the `grobid_tool.py` and
+`ingest_tool.py` JSON input modes, piping JSON task files (millions of lines)
+through GNU parallel. I guess GNU parallel's distributed mode is also an
+option here.
+
+Other things that could be used:
+
+**celery**: popular, many features; needs a separate Redis; no disk
+persistence (?)
+
+**disque**: needs to run Redis; no disk persistence (?)
+<https://github.com/antirez/disque>
+
+**gearman**: <http://gearman.org/>; no disk persistence (?)
+
+
+## Old Notes
+
+TBD whether we would want to switch ingest requests from fatcat ->
+sandcrawler over as well, and have the continuous ingests run out of NSQ, or
+keep using Kafka for that. We can currently only do about 10x parallelism
+with SPNv2, so that isn't a scaling pain point.
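+
+## Appendix: Pusher Sketch
+
+A minimal sketch (in Python) of the Kafka-to-NSQ pusher described in the
+Specific Plan. The consumer group id, NSQ topic name, host choice, and depth
+cap are assumptions, and confluent-kafka is a stand-in for whichever Kafka
+client we actually use; nsqd's real HTTP endpoints (`/pub`, `/stats`) do the
+enqueueing and the back-pressure check:
+
+    import time
+
+    import requests
+    from confluent_kafka import Consumer
+
+    KAFKA_TOPIC = "sandcrawler-prod.ingest-file-requests"
+    NSQ_HOST = "http://wbgrp-svc506.us.archive.org:4151"  # assumed host:port
+    NSQ_TOPIC = "ingest-file-requests"                    # hypothetical topic
+    MAX_DEPTH = 1000  # "eg, 1k maximum"
+
+    def nsq_depth(topic):
+        """Current queue depth for a topic, via nsqd's HTTP stats endpoint."""
+        resp = requests.get(NSQ_HOST + "/stats", params={"format": "json"})
+        resp.raise_for_status()
+        stats = resp.json()
+        # some nsqd versions wrap the payload in a "data" key
+        stats = stats.get("data", stats)
+        for t in stats.get("topics", []):
+            if t["topic_name"] == topic:
+                return t["depth"]
+        return 0
+
+    def run():
+        consumer = Consumer({
+            "bootstrap.servers": "wbgrp-svc263.us.archive.org",
+            "group.id": "nsq-pusher",  # hypothetical consumer group
+            "enable.auto.commit": False,
+        })
+        consumer.subscribe([KAFKA_TOPIC])
+        while True:
+            # back-pressure: don't push while NSQ depth is above the cap
+            while nsq_depth(NSQ_TOPIC) >= MAX_DEPTH:
+                time.sleep(5)
+            msg = consumer.poll(timeout=1.0)
+            if msg is None or msg.error():
+                continue
+            # the "very simple enqueue API": a bare HTTP POST of the body
+            requests.post(NSQ_HOST + "/pub", params={"topic": NSQ_TOPIC},
+                          data=msg.value()).raise_for_status()
+            # commit only after a successful push, so requests aren't lost
+            consumer.commit(message=msg)
+
+    if __name__ == "__main__":
+        run()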