author: Bryan Newbold <bnewbold@archive.org> 2022-12-23 15:25:59 -0800
committer: Bryan Newbold <bnewbold@archive.org> 2022-12-23 15:25:59 -0800
commit: ab9d7c2ba70e53b58631e1ae5c8769461f6ae5de
tree: e95ee068878840001ad34f60d6d83d6dbb1b625a
parent: b878b4a5036332e95145d2e70257d757cfecfc9c
old notes on possible places to ingest from
-rw-r--r-- notes/possible_ingest_targets.txt | 15
1 files changed, 15 insertions, 0 deletions
diff --git a/notes/possible_ingest_targets.txt b/notes/possible_ingest_targets.txt
new file mode 100644
index 0000000..fcdc3e4
--- /dev/null
+++ b/notes/possible_ingest_targets.txt
@@ -0,0 +1,15 @@
+
+- all releases from small journals (eg, fewer than 200 papers published), regardless of OA status, and not big5
+
+more complex crawling/content:
+- add video link to alternative content demo ingest: https://open.library.ubc.ca/cIRcle/collections/48630/items/1.0400764
+- watermark.silverchair.com: if terminal-bad-status, then do recrawl via heritrix with base_url
+- www.morressier.com: interesting site for rich web crawling/preservation (video+slides+data)
+- doi.ala.org.au: possible dataset ingest source
+- peerj.com, at least reviews, should be HTML ingest? or are some PDF?
+- publons.com should be HTML ingest, possibly special case for scope
+- frontiersin.org: any 'component' releases with PDF file are probably a metadata bug
+
+other tasks:
+- handle this related withdrawn notice? https://open.library.ubc.ca/cIRcle/collections/48630/items/1.0401512
+- push/deploy sandcrawler changes
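
Two of the notes above read as decision rules: the small-journal selection and the watermark.silverchair.com recrawl condition. The sketch below shows one way they might be written down. It is illustration only, not sandcrawler code: `Release`, `is_small_journal_target`, `needs_heritrix_recrawl`, and the `big5_publishers` set are hypothetical names, and the notes do not say how the paper count or the "big5" membership would be determined, so both are passed in by the caller.

```python
from dataclasses import dataclass
from typing import FrozenSet
from urllib.parse import urlparse

# Threshold taken from the note ("less than 200 papers published").
SMALL_JOURNAL_MAX_PAPERS = 200


@dataclass
class Release:
    """Hypothetical minimal record; not the real catalog schema."""
    journal_paper_count: int  # total papers published by the journal
    publisher: str
    is_oa: bool               # carried along, but deliberately ignored below


def is_small_journal_target(release: Release, big5_publishers: FrozenSet[str]) -> bool:
    """Select releases from small journals, regardless of OA status, unless big5.

    `big5_publishers` is an assumed, caller-supplied set of lowercase
    publisher names; the notes do not list them.
    """
    if release.publisher.lower() in big5_publishers:
        return False
    return release.journal_paper_count < SMALL_JOURNAL_MAX_PAPERS


def needs_heritrix_recrawl(terminal_url: str, status: str) -> bool:
    """True when a watermark.silverchair.com fetch ended in terminal-bad-status.

    The note says such cases should be recrawled via heritrix with base_url;
    scheduling that recrawl is out of scope for this sketch.
    """
    host = urlparse(terminal_url).hostname or ""
    return host == "watermark.silverchair.com" and status == "terminal-bad-status"
```

Keeping both checks as pure functions over plain values means they can be tried against a sample of existing records before wiring either rule into a real ingest pipeline.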