author | Bryan Newbold <bnewbold@archive.org> | 2022-07-14 15:03:49 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2022-07-14 15:03:51 -0700 |
commit | b5217753166956eed14cf2c91ec52d883d6a5a56 (patch) | |
tree | 758026fb0061d66e49fede1b3ef451d56ab8ac93 /pig/filter-cdx-paper-pdfs.pig | |
parent | b680c255508e6721185c6793bc872c0dc97864a0 (diff) | |
download | sandcrawler-b5217753166956eed14cf2c91ec52d883d6a5a56.tar.gz sandcrawler-b5217753166956eed14cf2c91ec52d883d6a5a56.zip |
cdx lookups: prioritize truly exact URL matches
This hopefully resolves an issue causing many apparent redirect loops,
which were actually timing or HTTP status code near-loops caused by
http/https fuzzy matching in the CDX API, despite its "exact" lookup semantics.
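
As a rough illustration of the idea (not the code from this commit), the sketch below reorders CDX result rows so that rows whose URL is byte-for-byte identical to the requested URL come before fuzzy http/https matches. The function name `prioritize_exact_matches` and the row fields `url` and `status_code` are hypothetical, chosen only for this example.

```python
from typing import Dict, List


def prioritize_exact_matches(rows: List[Dict[str, str]], url: str) -> List[Dict[str, str]]:
    """Return rows reordered so that exact URL matches come first.

    Hypothetical helper: the row layout (a dict with a "url" key) is an
    assumption made for this sketch, not the project's actual data model.
    """
    # sorted() is stable, and False sorts before True, so exact matches
    # move to the front while the original relative order is preserved.
    return sorted(rows, key=lambda row: row["url"] != url)


# Example data (made up): the CDX API returned both the http and https
# variants for a lookup of the https URL.
rows = [
    {"url": "http://example.com/paper.pdf", "status_code": "301"},
    {"url": "https://example.com/paper.pdf", "status_code": "200"},
]
ordered = prioritize_exact_matches(rows, "https://example.com/paper.pdf")
best = ordered[0]
```

With the exact match first, a redirect from the http variant to the https variant resolves to the https capture instead of bouncing back to the http row.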
Diffstat (limited to 'pig/filter-cdx-paper-pdfs.pig')
0 files changed, 0 insertions, 0 deletions