From 99cc7de073baee53bb97075377906743d364ab84 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Mon, 2 Jan 2023 19:16:09 -0800
Subject: proposals: update status; include some brainstorm-only docs

---
 proposals/2019_ingest.md                         |  2 +-
 proposals/20200129_pdf_ingest.md                 |  2 +-
 proposals/20200207_pdftrio.md                    |  5 ++-
 proposals/20201012_no_capture.md                 |  5 ++-
 proposals/20201103_xml_ingest.md                 | 19 +-----------
 proposals/2020_pdf_meta_thumbnails.md            |  2 +-
 proposals/2021-04-22_crossref_db.md              |  2 +-
 proposals/2021-12-09_trawling.md                 |  5 ++-
 proposals/brainstorm/2021-debug_web_interface.md |  9 ++++++
 .../2022-04-18_automated_heritrix_crawling.md    | 36 ++++++++++++++++++++++
 10 files changed, 62 insertions(+), 25 deletions(-)
 create mode 100644 proposals/brainstorm/2021-debug_web_interface.md
 create mode 100644 proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md

diff --git a/proposals/2019_ingest.md b/proposals/2019_ingest.md
index c05c9df..768784f 100644
--- a/proposals/2019_ingest.md
+++ b/proposals/2019_ingest.md
@@ -1,5 +1,5 @@
 
-status: work-in-progress
+status: deployed
 
 This document proposes structure and systems for ingesting (crawling) paper
 PDFs and other content as part of sandcrawler.
diff --git a/proposals/20200129_pdf_ingest.md b/proposals/20200129_pdf_ingest.md
index 620ed09..157607e 100644
--- a/proposals/20200129_pdf_ingest.md
+++ b/proposals/20200129_pdf_ingest.md
@@ -1,5 +1,5 @@
 
-status: planned
+status: deployed
 
 2020q1 Fulltext PDF Ingest Plan
 ===================================
diff --git a/proposals/20200207_pdftrio.md b/proposals/20200207_pdftrio.md
index 31a2db6..6f6443f 100644
--- a/proposals/20200207_pdftrio.md
+++ b/proposals/20200207_pdftrio.md
@@ -1,5 +1,8 @@
 
-status: in progress
+status: deployed
+
+NOTE: while this has been used in production, as of December 2022 the results
+are not used much in practice, and we don't score every PDF that comes along
 
 PDF Trio (ML Classification)
 ==============================
diff --git a/proposals/20201012_no_capture.md b/proposals/20201012_no_capture.md
index 27c14d1..7f6a1f5 100644
--- a/proposals/20201012_no_capture.md
+++ b/proposals/20201012_no_capture.md
@@ -1,5 +1,8 @@
 
-status: in-progress
+status: work-in-progress
+
+NOTE: as of December 2022, bnewbold can't remember if this was fully
+implemented or not.
 
 Storing no-capture missing URLs in `terminal_url`
 =================================================
diff --git a/proposals/20201103_xml_ingest.md b/proposals/20201103_xml_ingest.md
index 25ec973..34e00b0 100644
--- a/proposals/20201103_xml_ingest.md
+++ b/proposals/20201103_xml_ingest.md
@@ -1,22 +1,5 @@
 
-status: wip
-
-TODO:
-x XML fulltext URL extractor (based on HTML biblio metadata, not PDF url extractor)
-x differential JATS XML and scielo XML from generic XML?
-    application/xml+jats is what fatcat is doing for abstracts
-    but it should be application/jats+xml?
-    application/tei+xml
-    if startswith "
" => JATS -x refactor ingest worker to be more general -x have ingest code publish body to kafka topic -x write a persist worker -/ create/configure kafka topic -- test everything locally -- fatcat: ingest tool to create requests -- fatcat: entity updates worker creates XML ingest requests for specific sources -- fatcat: ingest file import worker allows XML results -- ansible: deployment of persist worker +status: deployed XML Fulltext Ingest ==================== diff --git a/proposals/2020_pdf_meta_thumbnails.md b/proposals/2020_pdf_meta_thumbnails.md index f231a7f..141ece8 100644 --- a/proposals/2020_pdf_meta_thumbnails.md +++ b/proposals/2020_pdf_meta_thumbnails.md @@ -1,5 +1,5 @@ -status: work-in-progress +status: deployed New PDF derivatives: thumbnails, metadata, raw text =================================================== diff --git a/proposals/2021-04-22_crossref_db.md b/proposals/2021-04-22_crossref_db.md index bead7a4..1d4c3f8 100644 --- a/proposals/2021-04-22_crossref_db.md +++ b/proposals/2021-04-22_crossref_db.md @@ -1,5 +1,5 @@ -status: work-in-progress +status: deployed Crossref DOI Metadata in Sandcrawler DB ======================================= diff --git a/proposals/2021-12-09_trawling.md b/proposals/2021-12-09_trawling.md index 96c5f3f..33b6b4c 100644 --- a/proposals/2021-12-09_trawling.md +++ b/proposals/2021-12-09_trawling.md @@ -1,5 +1,8 @@ -status: in-progress +status: work-in-progress + +NOTE: as of December 2022, the implementation on these features haven't been +merged to the main branch. Development stalled in December 2021. 
 
 Trawling for Unstructured Scholarly Web Content
 ===============================================
diff --git a/proposals/brainstorm/2021-debug_web_interface.md b/proposals/brainstorm/2021-debug_web_interface.md
new file mode 100644
index 0000000..442b439
--- /dev/null
+++ b/proposals/brainstorm/2021-debug_web_interface.md
@@ -0,0 +1,9 @@
+
+status: brainstorm idea
+
+Simple internal-only web interface to help debug ingest issues.
+
+- paste a hash, URL, or identifier and get a display of "everything we know" about it
+- enter a URL/SURT prefix and get aggregate stats (?)
+- enter a domain/host/prefix and get recent attempts/results
+- pre-computed periodic reports on ingest pipeline (?)
diff --git a/proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md b/proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md
new file mode 100644
index 0000000..b3ad447
--- /dev/null
+++ b/proposals/brainstorm/2022-04-18_automated_heritrix_crawling.md
@@ -0,0 +1,36 @@
+
+status: brainstorming
+
+We continue to see issues with heritrix3-based crawling. Would like to have an
+option to switch to higher-throughput heritrix-based crawling.
+
+SPNv2 path would stick around at least for save-paper-now style ingest.
+
+
+## Sketch
+
+Ingest requests are created continuously by fatcat, with daily spikes.
+
+Ingest workers run mostly in "bulk" mode, aka they don't make SPNv2 calls.
+`no-capture` responses are recorded in sandcrawler SQL database.
+
+Periodically (daily?), a script queries for new no-capture results, filtered to
+the most recent period. These are processed a bit into a URL list, then
+converted to a heritrix frontier, and sent to crawlers. This could either be an
+h3 instance (?), or simple `scp` to a running crawl directory.
+
+The crawler crawls, with usual landing page config, and draintasker runs.
+
+TODO: can we have draintasker/heritrix set a maximum WARC life? Like 6 hours?
+or, target a smaller draintasker item size, so they get updated more frequently
+
+Another SQL script dumps ingest requests from the *previous* period, and
+re-submits them for bulk-style ingest (by workers).
+
+The end result would be things getting crawled and updated within a couple
+days.
+
+
+## Sketch 2
+
+Upload URL list to petabox item, wait for heritrix derive to run (!)
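The daily loop in the heritrix brainstorm above (dump recent `no-capture` results into a URL list / crawl frontier, then re-submit the previous period's requests for bulk-style ingest) could be sketched roughly as follows. This is a hedged illustration only: the row shape, the `no-capture` status string, and the request fields are assumptions, not the actual sandcrawler SQL schema.

```python
# Hedged sketch of the daily pipeline described in the proposal above.
# Row shape ("url", "status", "ts") and the "no-capture" status string are
# assumptions for illustration, not the real sandcrawler schema.
from datetime import datetime, timedelta

def build_frontier(rows, now, period=timedelta(days=1)):
    """Filter ingest results to the most recent period's no-capture URLs,
    deduplicated and sorted, ready to write out as a URL list / frontier."""
    cutoff = now - period
    return sorted({
        r["url"] for r in rows
        if r["status"] == "no-capture" and r["ts"] >= cutoff
    })

def resubmit_previous_period(rows, now, period=timedelta(days=1)):
    """Select the *previous* period's no-capture URLs and re-queue them as
    bulk-style ingest requests (i.e. workers make no SPNv2 calls)."""
    start, end = now - 2 * period, now - period
    return [
        {"base_url": r["url"], "ingest_type": "pdf"}
        for r in rows
        if r["status"] == "no-capture" and start <= r["ts"] < end
    ]

now = datetime(2022, 4, 18)
rows = [
    {"url": "https://example.com/a.pdf", "status": "no-capture",
     "ts": datetime(2022, 4, 17, 12)},
    {"url": "https://example.com/b.pdf", "status": "no-capture",
     "ts": datetime(2022, 4, 16, 12)},
    {"url": "https://example.com/c.pdf", "status": "success",
     "ts": datetime(2022, 4, 17, 18)},
]
print(build_frontier(rows, now))            # today's frontier
print(resubmit_previous_period(rows, now))  # yesterday's re-submissions
```

A real version would query the sandcrawler SQL database and write the frontier to a file for `scp` (or an h3 instance), but the filtering and hand-off logic would look similar.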