author | Bryan Newbold <bnewbold@archive.org> | 2020-08-11 17:37:03 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2020-08-11 17:37:09 -0700 |
commit | 644e412c38c8897e171e3aa1244f1aa6955d8e65 (patch) | |
tree | 7fef947fcd882cdb1ed7776dabcafe351278391d /python | |
parent | 7e8ff96fb90ddd1c853418a6c405d97afbc45355 (diff) | |
download | sandcrawler-644e412c38c8897e171e3aa1244f1aa6955d8e65.tar.gz sandcrawler-644e412c38c8897e171e3aa1244f1aa6955d8e65.zip |
ingest: actually use force_get flag with SPN
The code path was there, but the flag wasn't actually being set for our
most popular daily domains yet. Hopefully this will make a big difference
in SPN throughput.
Diffstat (limited to 'python')
-rw-r--r-- | python/sandcrawler/ingest.py | 13 |
1 files changed, 13 insertions, 0 deletions
diff --git a/python/sandcrawler/ingest.py b/python/sandcrawler/ingest.py
index 918a832..d910665 100644
--- a/python/sandcrawler/ingest.py
+++ b/python/sandcrawler/ingest.py
@@ -113,6 +113,19 @@ class IngestFileWorker(SandcrawlerWorker):
         # future possibly to increase download efficiency (wget/fetch being
         # faster than browser fetch)
         self.spn2_simple_get_domains = [
+            # direct PDF links
+            "://arxiv.org/pdf/",
+            "://europepmc.org/backend/ptpmcrender.fcgi",
+            "://pdfs.semanticscholar.org/",
+            "://res.mdpi.com/",
+
+            # platform sites
+            "://zenodo.org/",
+            "://figshare.org/",
+            "://springernature.figshare.com/",
+
+            # popular simple cloud storage or direct links
+            "://s3-eu-west-1.amazonaws.com/",
         ]
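
For readers unfamiliar with how a pattern list like the one in this diff might be consulted, here is a minimal sketch. It assumes the patterns are matched as plain substrings of the request URL; that matching rule, the constant `SIMPLE_GET_DOMAINS`, and the helper `should_force_simple_get` are hypothetical names for illustration and are not taken from the sandcrawler code in this commit.

```python
# Patterns copied from the diff above. How sandcrawler actually matches
# them is NOT shown in this commit; substring matching is an assumption.
SIMPLE_GET_DOMAINS = [
    # direct PDF links
    "://arxiv.org/pdf/",
    "://europepmc.org/backend/ptpmcrender.fcgi",
    "://pdfs.semanticscholar.org/",
    "://res.mdpi.com/",
    # platform sites
    "://zenodo.org/",
    "://figshare.org/",
    "://springernature.figshare.com/",
    # popular simple cloud storage or direct links
    "://s3-eu-west-1.amazonaws.com/",
]


def should_force_simple_get(url, patterns=SIMPLE_GET_DOMAINS):
    """Hypothetical helper: True if any pattern occurs anywhere in the URL."""
    return any(pattern in url for pattern in patterns)


if __name__ == "__main__":
    for candidate in (
        "https://arxiv.org/pdf/2008.01234.pdf",  # direct PDF path
        "https://arxiv.org/abs/2008.01234",      # abstract page, no /pdf/
        "https://sandbox.zenodo.org/record/1",   # different host
    ):
        print(candidate, should_force_simple_get(candidate))
```

Under this assumption, the leading `://` ties each pattern to the start of the host, so `"://zenodo.org/"` does not match `sandbox.zenodo.org`. A plain substring test would still match a pattern embedded later in a URL (for example inside a query string), which a stricter implementation might want to rule out by parsing the URL first.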