author Bryan Newbold <bnewbold@archive.org> 2020-08-11 17:37:03 -0700
committer Bryan Newbold <bnewbold@archive.org> 2020-08-11 17:37:09 -0700
commit 644e412c38c8897e171e3aa1244f1aa6955d8e65
tree 7fef947fcd882cdb1ed7776dabcafe351278391d
parent 7e8ff96fb90ddd1c853418a6c405d97afbc45355
ingest: actually use force_get flag with SPN
The code path existed, but the flag was not yet being set for our most popular daily domains. Hopefully this will make a big difference in SPN throughput.
Diffstat (limited to 'python')
-rw-r--r-- python/sandcrawler/ingest.py | 13
1 files changed, 13 insertions, 0 deletions
diff --git a/python/sandcrawler/ingest.py b/python/sandcrawler/ingest.py
index 918a832..d910665 100644
--- a/python/sandcrawler/ingest.py
+++ b/python/sandcrawler/ingest.py
@@ -113,6 +113,19 @@ class IngestFileWorker(SandcrawlerWorker):
         # future possibly to increase download efficiency (wget/fetch being
         # faster than browser fetch)
         self.spn2_simple_get_domains = [
+            # direct PDF links
+            "://arxiv.org/pdf/",
+            "://europepmc.org/backend/ptpmcrender.fcgi",
+            "://pdfs.semanticscholar.org/",
+            "://res.mdpi.com/",
+
+            # platform sites
+            "://zenodo.org/",
+            "://figshare.org/",
+            "://springernature.figshare.com/",
+
+            # popular simple cloud storage or direct links
+            "://s3-eu-west-1.amazonaws.com/",
         ]
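
The hunk above only populates the list; the code that consumes it is outside this diff. As a rough illustration of how such entries could be applied, the sketch below assumes they are matched as plain substrings of the request URL. The names SIMPLE_GET_PATTERNS and wants_simple_get are hypothetical and do not come from sandcrawler; only the pattern strings are taken from the commit.

```python
# Hypothetical sketch, not the sandcrawler implementation.
# Assumption: each entry is tested as a plain substring of the URL.
SIMPLE_GET_PATTERNS = [
    # direct PDF links
    "://arxiv.org/pdf/",
    "://europepmc.org/backend/ptpmcrender.fcgi",
    "://pdfs.semanticscholar.org/",
    "://res.mdpi.com/",
    # platform sites
    "://zenodo.org/",
    "://figshare.org/",
    "://springernature.figshare.com/",
    # popular simple cloud storage or direct links
    "://s3-eu-west-1.amazonaws.com/",
]


def wants_simple_get(url, patterns=SIMPLE_GET_PATTERNS):
    """Return True if any pattern occurs as a substring of the URL."""
    return any(pattern in url for pattern in patterns)


if __name__ == "__main__":
    for example in (
        "https://arxiv.org/pdf/2008.01234",
        "https://arxiv.org/abs/2008.01234",
        "https://www.zenodo.org/record/1",
    ):
        print(example, wants_simple_get(example))
```

Under that substring assumption, the leading "://" ties each pattern to the start of the hostname, so both http and https URLs match while a different subdomain such as www.zenodo.org does not. Entries that include a path, like "://arxiv.org/pdf/", restrict the match to direct file links rather than the whole site.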