author | Bryan Newbold <bnewbold@archive.org> | 2021-09-30 18:47:17 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2021-09-30 18:47:23 -0700 |
commit | 1b5ee74818da93fd80201a60a18632ff28692d91 (patch) | |
tree | 515438e78d1f066d659afba11accae43cef41983 /python | |
parent | 3247dca63af8fecc07e7dfb79063e9c881490d88 (diff) | |
download | sandcrawler-1b5ee74818da93fd80201a60a18632ff28692d91.tar.gz sandcrawler-1b5ee74818da93fd80201a60a18632ff28692d91.zip |
ingest CDX lookup: weigh year+month of capture against in-petabox-or-not
This tries to work around an issue where ingests fail because an SPN
capture is much newer, but the old sorting preference ignored that.
Note that the sorting logic is pretty busted anyway, and we should
probably allow returning multiple matching files to try.
Diffstat (limited to 'python')
-rw-r--r-- | python/sandcrawler/ia.py | 1 |
1 files changed, 1 insertions, 0 deletions
diff --git a/python/sandcrawler/ia.py b/python/sandcrawler/ia.py
index a5d19cd..c586972 100644
--- a/python/sandcrawler/ia.py
+++ b/python/sandcrawler/ia.py
@@ -297,6 +297,7 @@ class CdxApiClient:
                 int(0 - (r.status_code or 999)),
                 int(r.mimetype == best_mimetype),
                 int(r.mimetype != "warc/revisit"),
+                int(r.datetime[:6]),
                 int('/' in r.warc_path),
                 int(r.datetime),
             )
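To illustrate the effect of the added tuple element, here is a minimal standalone sketch. It is not the sandcrawler code: `Row`, `sort_key`, `pick_best`, the `use_month` flag, and the sample WARC paths are hypothetical names made up for this example, and only the tuple elements visible in the hunk are reproduced. It assumes, as the commit subject suggests, that a '/' in `warc_path` marks a capture already in petabox, and that `datetime` is a 14-digit `YYYYMMDDhhmmss` string.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Row:
    # Hypothetical stand-in for a CDX row; only the fields the key reads.
    status_code: Optional[int]
    mimetype: str
    datetime: str  # assumed 14-digit timestamp, YYYYMMDDhhmmss
    warc_path: str


def sort_key(r: Row, best_mimetype: str, use_month: bool = True) -> tuple:
    # Mirrors the tuple elements shown in the hunk. use_month=False
    # reproduces the ordering from before this commit.
    key = [
        int(0 - (r.status_code or 999)),
        int(r.mimetype == best_mimetype),
        int(r.mimetype != "warc/revisit"),
    ]
    if use_month:
        # New element: YYYYMM of the capture, compared before the
        # in-petabox flag.
        key.append(int(r.datetime[:6]))
    key.append(int("/" in r.warc_path))
    key.append(int(r.datetime))
    return tuple(key)


def pick_best(rows: List[Row], best_mimetype: str, use_month: bool = True) -> Row:
    # Highest key wins, i.e. a reverse sort followed by taking the first row.
    ranked = sorted(
        rows,
        key=lambda r: sort_key(r, best_mimetype, use_month),
        reverse=True,
    )
    return ranked[0]


# Made-up sample captures.
old_petabox = Row(200, "application/pdf", "20200115120000", "example-collection/example.warc.gz")
new_spn = Row(200, "application/pdf", "20210930180000", "example-spn.warc.gz")
same_month_petabox = Row(200, "application/pdf", "20210901000000", "example-collection/other.warc.gz")
```

With `use_month=False` the older petabox capture outranks the much newer SPN capture; with the month element in place the newer capture is chosen, while the petabox preference still decides between captures from the same month. Day and time only break ties after that.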