commit     99cc7de073baee53bb97075377906743d364ab84 (patch)
author:    Bryan Newbold <bnewbold@archive.org>  2023-01-02 19:16:09 -0800
committer: Bryan Newbold <bnewbold@archive.org>  2023-01-02 19:16:09 -0800
tree:      12b68a9695097c69eed68b1f8ece12b3007e3d4c /proposals/brainstorm
parent:    e433990172c157707d92452652aefe2f21b6a4a0
proposals: update status; include some brainstorm-only docs
Diffstat (limited to 'proposals/brainstorm')
 proposals/brainstorm/2021-debug_web_interface.md               |  9 +
 proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md | 36 +
 2 files changed, 45 insertions(+), 0 deletions(-)
diff --git a/proposals/brainstorm/2021-debug_web_interface.md b/proposals/brainstorm/2021-debug_web_interface.md
new file mode 100644
index 0000000..442b439
--- /dev/null
+++ b/proposals/brainstorm/2021-debug_web_interface.md
@@ -0,0 +1,9 @@
+
+status: brainstorm idea
+
+Simple internal-only web interface to help debug ingest issues.
+
+- paste a hash, URL, or identifier and get a display of "everything we know" about it
+- enter a URL/SURT prefix and get aggregate stats (?)
+- enter a domain/host/prefix and get recent attempts/results
+- pre-computed periodic reports on ingest pipeline (?)
diff --git a/proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md b/proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md
new file mode 100644
index 0000000..b3ad447
--- /dev/null
+++ b/proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md
@@ -0,0 +1,36 @@
+
+status: brainstorming
+
+We continue to see issues with heritrix3-based crawling. Would like to have an
+option to switch to higher-throughput heritrix-based crawling.
+
+SPNv2 path would stick around at least for save-paper-now style ingest.
+
+
+## Sketch
+
+Ingest requests are created continuously by fatcat, with daily spikes.
+
+Ingest workers run mostly in "bulk" mode, aka they don't make SPNv2 calls.
+`no-capture` responses are recorded in the sandcrawler SQL database.
+
+Periodically (daily?), a script queries for new no-capture results, filtered to
+the most recent period. These are batch processed into a URL list, then
+converted to a heritrix frontier, and sent to crawlers. This could either be an
+h3 instance (?), or a simple `scp` to a running crawl directory.
+
+The crawler crawls, with the usual landing page config, and draintasker runs.
+
+TODO: can we have draintasker/heritrix set a maximum WARC life? Like 6 hours?
+Or, target a smaller draintasker item size, so items get updated more frequently.
+
+Another SQL script dumps ingest requests from the *previous* period, and
+re-submits them for bulk-style ingest (by workers).
+
+The end result would be things getting crawled and updated within a couple of
+days.
+
+
+## Sketch 2
+
+Upload URL list to petabox item, wait for heritrix derive to run (!)
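The periodic frontier-generation step in the first sketch (query recent `no-capture` results, dedupe, emit a URL list for heritrix) could be prototyped roughly as below. This is a minimal sketch only: the `(url, attempted_at)` row shape and the one-day window are assumptions for illustration, not sandcrawler's actual SQL schema or batch cadence.

```python
from datetime import datetime, timedelta

def build_frontier(no_capture_rows, now, period_days=1):
    """Turn recent no-capture ingest results into a deduplicated URL list
    suitable for writing out as a heritrix frontier/seed file.

    no_capture_rows: iterable of (url, attempted_at) tuples, e.g. rows
    returned by a SQL query against an ingest-results table (assumed shape).
    """
    cutoff = now - timedelta(days=period_days)
    seen = set()
    frontier = []
    for url, attempted_at in no_capture_rows:
        # keep only the most recent period, per the daily-batch sketch
        if attempted_at < cutoff:
            continue
        # dedupe repeated attempts against the same URL
        if url in seen:
            continue
        seen.add(url)
        frontier.append(url)
    return frontier

rows = [
    ("https://example.com/paper1.pdf", datetime(2022, 4, 18, 9, 0)),
    ("https://example.com/paper1.pdf", datetime(2022, 4, 18, 10, 0)),  # dupe
    ("https://example.com/old.pdf", datetime(2022, 4, 10, 9, 0)),      # stale
]
urls = build_frontier(rows, now=datetime(2022, 4, 18, 12, 0))
print("\n".join(urls))
```

The resulting list would then be converted to a frontier and handed to the crawler (h3 instance, or `scp` into a running crawl directory, as the sketch suggests).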