From ab9d7c2ba70e53b58631e1ae5c8769461f6ae5de Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Fri, 23 Dec 2022 15:25:59 -0800
Subject: old notes on possible places to ingest from

---
 notes/possible_ingest_targets.txt | 15 +++++++++++++++
 1 file changed, 15 insertions(+)
 create mode 100644 notes/possible_ingest_targets.txt

diff --git a/notes/possible_ingest_targets.txt b/notes/possible_ingest_targets.txt
new file mode 100644
index 0000000..fcdc3e4
--- /dev/null
+++ b/notes/possible_ingest_targets.txt
@@ -0,0 +1,15 @@
+
+- all releases from small journals, regardless of OA status, if small (eg, less than 200 papers published), and not big5
+
+more complex crawling/content:
+- add video link to alternative content demo ingest: https://open.library.ubc.ca/cIRcle/collections/48630/items/1.0400764
+- watermark.silverchair.com: if terminal-bad-status, then do recrawl via heritrix with base_url
+- www.morressier.com: interesting site for rich web crawling/preservation (video+slides+data)
+- doi.ala.org.au: possible dataset ingest source
+- peerj.com, at least reviews, should be HTML ingest? or are some PDF?
+- publons.com should be HTML ingest, possibly special case for scope
+- frontiersin.org: any 'component' releases with PDF file are probably a metadata bug
+
+other tasks:
+- handle this related withdrawn notice? https://open.library.ubc.ca/cIRcle/collections/48630/items/1.0401512
+- push/deploy sandcrawler changes
--
cgit v1.2.3