author: Bryan Newbold <bnewbold@archive.org> 2021-09-30 18:47:17 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2021-09-30 18:47:23 -0700
commit: 1b5ee74818da93fd80201a60a18632ff28692d91
tree: 515438e78d1f066d659afba11accae43cef41983 /python/sandcrawler/ia.py
parent: 3247dca63af8fecc07e7dfb79063e9c881490d88
ingest CDX lookup: weigh year+month of capture against in-petabox-or-not
This tries to work around an issue where ingests fail because an SPN capture is much newer, but the old sorting preference ignored that. Note that the sorting logic is pretty busted anyway, and we should probably allow returning multiple matching files to try.
Diffstat (limited to 'python/sandcrawler/ia.py')
-rw-r--r-- python/sandcrawler/ia.py 1
1 files changed, 1 insertions, 0 deletions
diff --git a/python/sandcrawler/ia.py b/python/sandcrawler/ia.py
index a5d19cd..c586972 100644
--- a/python/sandcrawler/ia.py
+++ b/python/sandcrawler/ia.py
@@ -297,6 +297,7 @@ class CdxApiClient:
                 int(0 - (r.status_code or 999)),
                 int(r.mimetype == best_mimetype),
                 int(r.mimetype != "warc/revisit"),
+                int(r.datetime[:6]),
                 int('/' in r.warc_path),
                 int(r.datetime),
             )
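
The effect of the added tuple element can be illustrated with a small standalone sketch. It is not the sandcrawler implementation. The `Row` type, the sample timestamps, and the WARC paths are all hypothetical. Only the tuple elements visible in the hunk are reproduced, and the real sort key may contain more. The sketch assumes the highest-sorting tuple is the preferred capture, and that a `/` in `warc_path` marks a capture that is in petabox, as the commit subject suggests.

```python
from collections import namedtuple

# Hypothetical stand-in for a CDX row; only the fields used in the hunk.
Row = namedtuple("Row", ["status_code", "mimetype", "datetime", "warc_path"])


def old_key(r, best_mimetype):
    # Tuple elements as they appear in the hunk before the change.
    return (
        int(0 - (r.status_code or 999)),
        int(r.mimetype == best_mimetype),
        int(r.mimetype != "warc/revisit"),
        int("/" in r.warc_path),
        int(r.datetime),
    )


def new_key(r, best_mimetype):
    # Same tuple with the added element: the first six digits of the
    # 14-digit timestamp (YYYYMM) now outrank the '/' in warc_path check.
    return (
        int(0 - (r.status_code or 999)),
        int(r.mimetype == best_mimetype),
        int(r.mimetype != "warc/revisit"),
        int(r.datetime[:6]),
        int("/" in r.warc_path),
        int(r.datetime),
    )


# Made-up captures: an old one in petabox, a much newer one that is not,
# and one in petabox from the same month as the newer capture.
older_petabox = Row(200, "application/pdf", "20180314092653", "EXAMPLE-ITEM/old.warc.gz")
newer_spn = Row(200, "application/pdf", "20210930120000", "example-spn.warc.gz")
same_month_petabox = Row(200, "application/pdf", "20210902080000", "EXAMPLE-ITEM/new.warc.gz")

best = "application/pdf"
pick_old = max([older_petabox, newer_spn], key=lambda r: old_key(r, best))
pick_new = max([older_petabox, newer_spn], key=lambda r: new_key(r, best))
pick_same_month = max([same_month_petabox, newer_spn], key=lambda r: new_key(r, best))
```

Under these assumptions, the old key picks `older_petabox` because the path check is compared before the timestamp, while the new key picks `newer_spn` because its year and month are later. Within a single month the path check still decides, so `same_month_petabox` wins over `newer_spn`.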