From 99cc7de073baee53bb97075377906743d364ab84 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Mon, 2 Jan 2023 19:16:09 -0800
Subject: proposals: update status; include some brainstorm-only docs

---
 proposals/2019_ingest.md                         |  2 +-
 proposals/20200129_pdf_ingest.md                 |  2 +-
 proposals/20200207_pdftrio.md                    |  5 ++-
 proposals/20201012_no_capture.md                 |  5 ++-
 proposals/20201103_xml_ingest.md                 | 19 +-----------
 proposals/2020_pdf_meta_thumbnails.md            |  2 +-
 proposals/2021-04-22_crossref_db.md              |  2 +-
 proposals/2021-12-09_trawling.md                 |  5 ++-
 proposals/brainstorm/2021-debug_web_interface.md |  9 ++++++
 .../2022-04-18_automated_heritrix_crawling.md    | 36 ++++++++++++++++++++++
 10 files changed, 62 insertions(+), 25 deletions(-)
 create mode 100644 proposals/brainstorm/2021-debug_web_interface.md
 create mode 100644 proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md

diff --git a/proposals/2019_ingest.md b/proposals/2019_ingest.md
index c05c9df..768784f 100644
--- a/proposals/2019_ingest.md
+++ b/proposals/2019_ingest.md
@@ -1,5 +1,5 @@
 
-status: work-in-progress
+status: deployed
 
 This document proposes structure and systems for ingesting (crawling) paper
 PDFs and other content as part of sandcrawler.
diff --git a/proposals/20200129_pdf_ingest.md b/proposals/20200129_pdf_ingest.md
index 620ed09..157607e 100644
--- a/proposals/20200129_pdf_ingest.md
+++ b/proposals/20200129_pdf_ingest.md
@@ -1,5 +1,5 @@
 
-status: planned
+status: deployed
 
 2020q1 Fulltext PDF Ingest Plan
 ===================================
diff --git a/proposals/20200207_pdftrio.md b/proposals/20200207_pdftrio.md
index 31a2db6..6f6443f 100644
--- a/proposals/20200207_pdftrio.md
+++ b/proposals/20200207_pdftrio.md
@@ -1,5 +1,8 @@
 
-status: in progress
+status: deployed
+
+NOTE: while this has been used in production, as of December 2022 the results
+are not used much in practice, and we don't score every PDF that comes along
 
 PDF Trio (ML Classification)
 ==============================
diff --git a/proposals/20201012_no_capture.md b/proposals/20201012_no_capture.md
index 27c14d1..7f6a1f5 100644
--- a/proposals/20201012_no_capture.md
+++ b/proposals/20201012_no_capture.md
@@ -1,5 +1,8 @@
 
-status: in-progress
+status: work-in-progress
+
+NOTE: as of December 2022, bnewbold can't remember if this was fully
+implemented or not.
 
 Storing no-capture missing URLs in `terminal_url`
 =================================================
diff --git a/proposals/20201103_xml_ingest.md b/proposals/20201103_xml_ingest.md
index 25ec973..34e00b0 100644
--- a/proposals/20201103_xml_ingest.md
+++ b/proposals/20201103_xml_ingest.md
@@ -1,22 +1,5 @@
 
-status: wip
-
-TODO:
-x XML fulltext URL extractor (based on HTML biblio metadata, not PDF url extractor)
-x differential JATS XML and scielo XML from generic XML?
-    application/xml+jats is what fatcat is doing for abstracts
-    but it should be application/jats+xml?
-    application/tei+xml
-    if startswith "
" => JATS -x refactor ingest worker to be more general -x have ingest code publish body to kafka topic -x write a persist worker -/ create/configure kafka topic -- test everything locally -- fatcat: ingest tool to create requests -- fatcat: entity updates worker creates XML ingest requests for specific sources -- fatcat: ingest file import worker allows XML results -- ansible: deployment of persist worker +status: deployed XML Fulltext Ingest ==================== diff --git a/proposals/2020_pdf_meta_thumbnails.md b/proposals/2020_pdf_meta_thumbnails.md index f231a7f..141ece8 100644 --- a/proposals/2020_pdf_meta_thumbnails.md +++ b/proposals/2020_pdf_meta_thumbnails.md @@ -1,5 +1,5 @@ -status: work-in-progress +status: deployed New PDF derivatives: thumbnails, metadata, raw text =================================================== diff --git a/proposals/2021-04-22_crossref_db.md b/proposals/2021-04-22_crossref_db.md index bead7a4..1d4c3f8 100644 --- a/proposals/2021-04-22_crossref_db.md +++ b/proposals/2021-04-22_crossref_db.md @@ -1,5 +1,5 @@ -status: work-in-progress +status: deployed Crossref DOI Metadata in Sandcrawler DB ======================================= diff --git a/proposals/2021-12-09_trawling.md b/proposals/2021-12-09_trawling.md index 96c5f3f..33b6b4c 100644 --- a/proposals/2021-12-09_trawling.md +++ b/proposals/2021-12-09_trawling.md @@ -1,5 +1,8 @@ -status: in-progress +status: work-in-progress + +NOTE: as of December 2022, the implementation on these features haven't been +merged to the main branch. Development stalled in December 2021. 
 
 Trawling for Unstructured Scholarly Web Content
 ===============================================
diff --git a/proposals/brainstorm/2021-debug_web_interface.md b/proposals/brainstorm/2021-debug_web_interface.md
new file mode 100644
index 0000000..442b439
--- /dev/null
+++ b/proposals/brainstorm/2021-debug_web_interface.md
@@ -0,0 +1,9 @@
+
+status: brainstorm idea
+
+Simple internal-only web interface to help debug ingest issues.
+
+- paste a hash, URL, or identifier and get a display of "everything we know" about it
+- enter a URL/SURT prefix and get aggregate stats (?)
+- enter a domain/host/prefix and get recent attempts/results
+- pre-computed periodic reports on ingest pipeline (?)
diff --git a/proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md b/proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md
new file mode 100644
index 0000000..b3ad447
--- /dev/null
+++ b/proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md
@@ -0,0 +1,36 @@
+
+status: brainstorming
+
+We continue to see issues with heritrix3-based crawling. Would like to have an
+option to switch to higher-throughput heritrix-based crawling.
+
+SPNv2 path would stick around at least for save-paper-now style ingest.
+
+
+## Sketch
+
+Ingest requests are created continuously by fatcat, with daily spikes.
+
+Ingest workers run mostly in "bulk" mode, aka they don't make SPNv2 calls.
+`no-capture` responses are recorded in sandcrawler SQL database.
+
+Periodically (daily?), a script queries for new no-capture results, filtered to
+the most recent period. These are processed a bit into a URL list, then
+converted to a heritrix frontier, and sent to crawlers. This could either be an
+h3 instance (?), or simple `scp` to a running crawl directory.
+
+The crawler crawls, with usual landing page config, and draintasker runs.
+
+TODO: can we have draintasker/heritrix set a maximum WARC life? Like 6 hours?
+or, target a smaller draintasker item size, so they get updated more frequently
+
+Another SQL script dumps ingest requests from the *previous* period, and
+re-submits them for bulk-style ingest (by workers).
+
+The end result would be things getting crawled and updated within a couple
+days.
+
+
+## Sketch 2
+
+Upload URL list to petabox item, wait for heritrix derive to run (!)
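The daily loop in the heritrix brainstorm above (dump recent `no-capture` results into a URL list / crawl frontier, then re-submit the previous period's requests for bulk-style ingest) could be sketched roughly as follows. This is a hedged illustration only: the row shape, the `no-capture` status string, and the request fields are assumptions, not the actual sandcrawler SQL schema.

```python
# Hedged sketch of the daily pipeline described in the proposal above.
# Row shape ("url", "status", "ts") and the "no-capture" status string are
# assumptions for illustration, not the real sandcrawler schema.
from datetime import datetime, timedelta

def build_frontier(rows, now, period=timedelta(days=1)):
    """Filter ingest results to the most recent period's no-capture URLs,
    deduplicated and sorted, ready to write out as a URL list / frontier."""
    cutoff = now - period
    return sorted({
        r["url"] for r in rows
        if r["status"] == "no-capture" and r["ts"] >= cutoff
    })

def resubmit_previous_period(rows, now, period=timedelta(days=1)):
    """Select the *previous* period's no-capture URLs and re-queue them as
    bulk-style ingest requests (i.e. workers make no SPNv2 calls)."""
    start, end = now - 2 * period, now - period
    return [
        {"base_url": r["url"], "ingest_type": "pdf"}
        for r in rows
        if r["status"] == "no-capture" and start <= r["ts"] < end
    ]

now = datetime(2022, 4, 18)
rows = [
    {"url": "https://example.com/a.pdf", "status": "no-capture",
     "ts": datetime(2022, 4, 17, 12)},
    {"url": "https://example.com/b.pdf", "status": "no-capture",
     "ts": datetime(2022, 4, 16, 12)},
    {"url": "https://example.com/c.pdf", "status": "success",
     "ts": datetime(2022, 4, 17, 18)},
]
print(build_frontier(rows, now))            # today's frontier
print(resubmit_previous_period(rows, now))  # yesterday's re-submissions
```

A real version would query the sandcrawler SQL database and write the frontier to a file for `scp` (or an h3 instance), but the filtering and hand-off logic would look similar.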