| author | Bryan Newbold <bnewbold@archive.org> | 2020-08-11 17:37:03 -0700 | 
|---|---|---|
| committer | Bryan Newbold <bnewbold@archive.org> | 2020-08-11 17:37:09 -0700 | 
| commit | 644e412c38c8897e171e3aa1244f1aa6955d8e65 (patch) | |
| tree | 7fef947fcd882cdb1ed7776dabcafe351278391d | |
| parent | 7e8ff96fb90ddd1c853418a6c405d97afbc45355 (diff) | |
ingest: actually use force_get flag with SPN
The code path was there, but our most popular daily domains weren't
actually flagged yet. Hopefully this will make a big difference in SPN
throughput.
| -rw-r--r-- | python/sandcrawler/ingest.py | 13 | 
1 files changed, 13 insertions, 0 deletions
diff --git a/python/sandcrawler/ingest.py b/python/sandcrawler/ingest.py
index 918a832..d910665 100644
--- a/python/sandcrawler/ingest.py
+++ b/python/sandcrawler/ingest.py
@@ -113,6 +113,19 @@ class IngestFileWorker(SandcrawlerWorker):
         # future possibly to increase download efficiency (wget/fetch being
         # faster than browser fetch)
         self.spn2_simple_get_domains = [
+            # direct PDF links
+            "://arxiv.org/pdf/",
+            "://europepmc.org/backend/ptpmcrender.fcgi",
+            "://pdfs.semanticscholar.org/",
+            "://res.mdpi.com/",
+
+            # platform sites
+            "://zenodo.org/",
+            "://figshare.org/",
+            "://springernature.figshare.com/",
+
+            # popular simple cloud storage or direct links
+            "://s3-eu-west-1.amazonaws.com/",
         ]
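For readers skimming the patch, here is a minimal sketch of how a list like `spn2_simple_get_domains` could be consulted before issuing an SPN request. It assumes the entries are matched as plain substrings of the request URL. The helper name `should_force_simple_get` and the module-level list are hypothetical illustrations, not code from this commit, and the real lookup in `ingest.py` may differ.

```python
from typing import List

# A few of the patterns added in this commit. Assumption: each entry is
# treated as a plain substring to look for in the URL.
SIMPLE_GET_DOMAINS: List[str] = [
    "://arxiv.org/pdf/",
    "://zenodo.org/",
    "://s3-eu-west-1.amazonaws.com/",
]


def should_force_simple_get(url: str, patterns: List[str] = SIMPLE_GET_DOMAINS) -> bool:
    """Hypothetical helper: True if any pattern occurs anywhere in the URL."""
    return any(pattern in url for pattern in patterns)


if __name__ == "__main__":
    # Direct PDF link on arXiv: matches "://arxiv.org/pdf/".
    print(should_force_simple_get("https://arxiv.org/pdf/2001.00001.pdf"))
    # Abstract page on arXiv: no pattern matches.
    print(should_force_simple_get("https://arxiv.org/abs/2001.00001"))
```

Under that substring assumption, the leading `://` makes one pattern cover both `http` and `https` while keeping subdomains out: `https://sandbox.zenodo.org/` does not contain `://zenodo.org/`. A URL that embeds a listed host inside its query string would still match, so the check is a heuristic, not a strict host comparison.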
