commit     99cc7de073baee53bb97075377906743d364ab84 (patch)
author:    Bryan Newbold <bnewbold@archive.org>  2023-01-02 19:16:09 -0800
committer: Bryan Newbold <bnewbold@archive.org>  2023-01-02 19:16:09 -0800
tree:      12b68a9695097c69eed68b1f8ece12b3007e3d4c /proposals/brainstorm
parent:    e433990172c157707d92452652aefe2f21b6a4a0
proposals: update status; include some brainstorm-only docs
Diffstat (limited to 'proposals/brainstorm')
 proposals/brainstorm/2021-debug_web_interface.md               |  9 +
 proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md | 36 +
 2 files changed, 45 insertions(+), 0 deletions(-)
diff --git a/proposals/brainstorm/2021-debug_web_interface.md b/proposals/brainstorm/2021-debug_web_interface.md
new file mode 100644
index 0000000..442b439
--- /dev/null
+++ b/proposals/brainstorm/2021-debug_web_interface.md
@@ -0,0 +1,9 @@
+
+status: brainstorm idea
+
+Simple internal-only web interface to help debug ingest issues.
+
+- paste a hash, URL, or identifier and get a display of "everything we know" about it
+- enter a URL/SURT prefix and get aggregate stats (?)
+- enter a domain/host/prefix and get recent attempts/results
+- pre-computed periodic reports on ingest pipeline (?)
diff --git a/proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md b/proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md
new file mode 100644
index 0000000..b3ad447
--- /dev/null
+++ b/proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md
@@ -0,0 +1,36 @@
+
+status: brainstorming
+
+We continue to see issues with heritrix3-based crawling. Would like to have an
+option to switch to higher-throughput heritrix-based crawling.
+
+SPNv2 path would stick around at least for save-paper-now style ingest.
+
+
+## Sketch
+
+Ingest requests are created continuously by fatcat, with daily spikes.
+
+Ingest workers run mostly in "bulk" mode, aka they don't make SPNv2 calls.
+`no-capture` responses are recorded in the sandcrawler SQL database.
+
+Periodically (daily?), a script queries for new no-capture results, filtered to
+the most recent period. These are batch processed into a URL list, then
+converted to a heritrix frontier, and sent to crawlers. This could either be an
+h3 instance (?), or a simple `scp` to a running crawl directory.
+
+The crawler crawls, with the usual landing page config, and draintasker runs.
+
+TODO: can we have draintasker/heritrix set a maximum WARC life? Like 6 hours?
+Or, target a smaller draintasker item size, so items get updated more frequently.
+
+Another SQL script dumps ingest requests from the *previous* period, and
+re-submits them for bulk-style ingest (by workers).
+
+The end result would be things getting crawled and updated within a couple of
+days.
+
+
+## Sketch 2
+
+Upload URL list to petabox item, wait for heritrix derive to run (!)
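The periodic frontier-generation step in the first sketch (query recent `no-capture` results, dedupe, emit a URL list for heritrix) could be prototyped roughly as below. This is a minimal sketch only: the `(url, attempted_at)` row shape and the one-day window are assumptions for illustration, not sandcrawler's actual SQL schema or batch cadence.

```python
from datetime import datetime, timedelta

def build_frontier(no_capture_rows, now, period_days=1):
    """Turn recent no-capture ingest results into a deduplicated URL list
    suitable for writing out as a heritrix frontier/seed file.

    no_capture_rows: iterable of (url, attempted_at) tuples, e.g. rows
    returned by a SQL query against an ingest-results table (assumed shape).
    """
    cutoff = now - timedelta(days=period_days)
    seen = set()
    frontier = []
    for url, attempted_at in no_capture_rows:
        # keep only the most recent period, per the daily-batch sketch
        if attempted_at < cutoff:
            continue
        # dedupe repeated attempts against the same URL
        if url in seen:
            continue
        seen.add(url)
        frontier.append(url)
    return frontier

rows = [
    ("https://example.com/paper1.pdf", datetime(2022, 4, 18, 9, 0)),
    ("https://example.com/paper1.pdf", datetime(2022, 4, 18, 10, 0)),  # dupe
    ("https://example.com/old.pdf", datetime(2022, 4, 10, 9, 0)),      # stale
]
urls = build_frontier(rows, now=datetime(2022, 4, 18, 12, 0))
print("\n".join(urls))
```

The resulting list would then be converted to a frontier and handed to the crawler (h3 instance, or `scp` into a running crawl directory, as the sketch suggests).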