path: root/python/sandcrawler/ia.py
Commit message | Author | Age | Files | Lines
* sandcrawler: try to handle weird CDX API response | Bryan Newbold | 2022-11-01 | 1 | -0/+5
    Hard to debug this because sentry is broken.
* ingest: don't prefer WARC over SPN so strongly | Bryan Newbold | 2022-10-24 | 1 | -1/+2
    We generally prefer an older WARC record over an SPN record, because the lookup is easier. But, this was causing problems with repeated ingest, so demote it. We may want to make this more configurable in the future, so things like HTML sub-resource lookups or bulk ingest won't prefer random new SPN captures.
* spn2: fix tests by not retrying on HTTP 500 | Bryan Newbold | 2022-09-14 | 1 | -1/+3
* CDX: skip sha-256 digests | Bryan Newbold | 2022-07-25 | 1 | -1/+5
* cdx: tweak CDX lookups and resolution (sort) | Bryan Newbold | 2022-07-16 | 1 | -4/+7
* wayback: use same 5xx/4xx-allowing tricks for replay body fetch as for replay redirect | Bryan Newbold | 2022-07-15 | 1 | -7/+7
* cdx api: add another allowable URL fuzzy-match pattern (double slashes) | Bryan Newbold | 2022-07-15 | 1 | -0/+9
* spn2: handle case of re-attempting a recent crawl (race condition) | Bryan Newbold | 2022-07-15 | 1 | -0/+14
* cdx lookups: prioritize truly exact URL matches | Bryan Newbold | 2022-07-14 | 1 | -0/+1
    This hopefully resolves an issue causing many apparent redirect loops, which were actually timing or HTTP status code near-loops with http/https fuzzy matching in the CDX API, despite the "exact" API lookup semantics.
* ingest: handle another type of wayback redirect | Bryan Newbold | 2022-07-14 | 1 | -2/+5
* wayback fetch: handle upstream 5xx replays | Bryan Newbold | 2022-07-13 | 1 | -4/+15
* ingest: catch more ConnectionErrors (SPN, replay fetch, GROBID) | Bryan Newbold | 2022-05-16 | 1 | -4/+10
* ingest spn2: fix tests | Bryan Newbold | 2022-05-05 | 1 | -1/+1
* SPNv2: several fixes for prod throughput | Bryan Newbold | 2022-04-26 | 1 | -11/+34
    Most importantly, for some API flags, if the value is not truthy, do not set the flag at all. Setting any flag was resulting in screenshots and outlinks actually getting created/captured, which was a huge slowdown. Also, check per-user SPNv2 slots available, using API, before requesting an actual capture.
* file ingest: don't 'backoff' on spn2 backoff error | Bryan Newbold | 2022-03-22 | 1 | -0/+1
    The intent of this is to try and get through the daily ingest requests faster, so we can loop and retry if needed. A 200 second delay, usually resulting in a kafka topic reshuffle, really slows things down. This will presumably result in a bunch of spn2-backoff status requests, but we can just retry those.
* spn: handle blocked-url (etc) better | Bryan Newbold | 2022-01-11 | 1 | -0/+10
* SPNv2: make 'resources' optional | Bryan Newbold | 2021-11-16 | 1 | -1/+1
    This was always present previously. A change was made to the SPNv2 API recently that borked it a bit, though in theory it should be present on new captures. I'm not seeing it for some captures, so pushing this workaround. It seems like we don't actually use this field anyways, at least for the ingest pipeline.
* IA (wayback): actually use an HTTP session for replay fetches | Bryan Newbold | 2021-11-03 | 1 | -2/+3
    I am embarrassed this wasn't actually the case already! Looks like I had even instantiated a session but wasn't using it. Hopefully this change, which adds extra retries and better backoff behavior, will improve sandcrawler ingest throughput.
* make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 1 | -257/+354
* fix type annotations for petabox body fetch helper | Bryan Newbold | 2021-10-26 | 1 | -1/+1
* lint collection membership (last lint for now) | Bryan Newbold | 2021-10-26 | 1 | -7/+7
* more progress on type annotations and linting | Bryan Newbold | 2021-10-26 | 1 | -2/+2
* ia: more tweaks to delicate code to satisfy type checker | Bryan Newbold | 2021-10-26 | 1 | -10/+12
    Ran the 'live' wayback tests after this commit as a check, and they worked (once the FTP status code behavior change is fixed).
* ia helpers: enforce max_redirects count correctly | Bryan Newbold | 2021-10-26 | 1 | -1/+1
    AKA, should run fetch even if max_redirects = 0; the first loop iteration is not a redirect.
* set CDX request params are str, not int or datetime | Bryan Newbold | 2021-10-26 | 1 | -3/+6
    This might be a bugfix, changing CDX lookup behavior?
* bugfix: was setting 'from' parameter as a tuple, not a string | Bryan Newbold | 2021-10-26 | 1 | -1/+1
* start type annotating IA helper code | Bryan Newbold | 2021-10-26 | 1 | -37/+65
* flake8 clean (with current settings) | Bryan Newbold | 2021-10-26 | 1 | -2/+2
* start handling trivial lint cleanups: unused imports, 'is None', etc | Bryan Newbold | 2021-10-26 | 1 | -15/+14
* make fmt | Bryan Newbold | 2021-10-26 | 1 | -68/+124
* python: isort all imports | Bryan Newbold | 2021-10-26 | 1 | -10/+12
* move SPNv2 'simple_get' logic to SPN client | Bryan Newbold | 2021-10-15 | 1 | -2/+29
* spn: avoid 'None' job_id | Bryan Newbold | 2021-10-11 | 1 | -2/+2
    Thanks Vanglis for reporting these. Not sure this commit fixes *all* instances of the problem.
* ingest CDX lookup: weigh year+month of capture against in-petabox-or-not | Bryan Newbold | 2021-09-30 | 1 | -0/+1
    This is to try working around an issue where ingests fail because an SPN capture is much newer, but the old sorting preference ignored that. Note that the sorting logic is pretty busted anyways, and we should probably allow returning multiple matching files to try.
* tune SPN CDX retry/wait depending on mode (priority vs daily) | Bryan Newbold | 2021-09-30 | 1 | -2/+4
* crawl: SPN2 non-200 success code path | Bryan Newbold | 2021-07-13 | 1 | -11/+25
* crawl: SPN self-redirect hack | Bryan Newbold | 2021-07-13 | 1 | -0/+9
* crawl: small comment updates | Bryan Newbold | 2021-07-13 | 1 | -3/+6
* ia CDX: handle bad CDX rows | Bryan Newbold | 2021-01-05 | 1 | -2/+4
* spn: more status codes | Bryan Newbold | 2020-12-21 | 1 | -1/+2
* handle more wayback error conditions | Bryan Newbold | 2020-11-20 | 1 | -0/+6
* spn 'forbidden' status code | Bryan Newbold | 2020-11-12 | 1 | -1/+1
* spn2-internal-server-error is a problem with remote server, not SPN2 | Bryan Newbold | 2020-11-08 | 1 | -0/+2
* ingest: better non-full URL fixup | Bryan Newbold | 2020-11-08 | 1 | -4/+3
* direct some more warnings to sys.stderr, not stdout | Bryan Newbold | 2020-11-08 | 1 | -2/+2
* html: handle no-capture for sub-resources | Bryan Newbold | 2020-11-08 | 1 | -0/+3
* ingest: fix null-body case | Bryan Newbold | 2020-11-08 | 1 | -0/+2
    Broke this in earlier refactor.
* ia: use newer gwb (petabox) loading class | Bryan Newbold | 2020-11-04 | 1 | -5/+8
    This fixes zstandard WARC reading.
* move transfer encoding helper to sandcrawler/ia.py | Bryan Newbold | 2020-11-03 | 1 | -1/+26
* ingest: tweak debug printing alignment | Bryan Newbold | 2020-11-03 | 1 | -8/+7
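
Note on the 2021-10-26 entry "ia helpers: enforce max_redirects count correctly": its message says the fetch should still run when max_redirects = 0, because the first loop iteration is not a redirect. The sketch below illustrates only that counting rule. It is not the code from ia.py; resolve_redirects, FetchFn, and TooManyRedirects are hypothetical names, and the fetch callable is assumed to return a status code plus an optional redirect target.

```python
from typing import Callable, Optional, Tuple

# Hypothetical fetch signature: returns (status_code, redirect_target_or_None).
FetchFn = Callable[[str], Tuple[int, Optional[str]]]


class TooManyRedirects(Exception):
    """Raised when the chain needs more than max_redirects redirects (hypothetical name)."""


def resolve_redirects(url: str, fetch: FetchFn, max_redirects: int) -> Tuple[int, str]:
    # One initial fetch plus up to max_redirects follow-ups, so
    # max_redirects = 0 still performs exactly one fetch.
    for _ in range(max_redirects + 1):
        status, location = fetch(url)
        if location is None:
            # Terminal response: report its status and the URL that produced it.
            return status, url
        url = location
    raise TooManyRedirects(f"gave up after {max_redirects} redirects")
```

With max_redirects = 0, a URL that answers directly is fetched once and returned; only a URL that actually redirects raises.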