Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | CDX: skip sha-256 digests | Bryan Newbold | 2022-07-25 | 1 | -1/+5 |
| | |||||
* | cdx: tweak CDX lookups and resolution (sort) | Bryan Newbold | 2022-07-16 | 1 | -4/+7 |
| | |||||
* | wayback: use same 5xx/4xx-allowing tricks for replay body fetch as for replay redirect | Bryan Newbold | 2022-07-15 | 1 | -7/+7 |
| | |||||
* | cdx api: add another allowable URL fuzzy-match pattern (double slashes) | Bryan Newbold | 2022-07-15 | 1 | -0/+9 |
| | |||||
* | spn2: handle case of re-attempting a recent crawl (race condition) | Bryan Newbold | 2022-07-15 | 1 | -0/+14 |
| | |||||
* | cdx lookups: prioritize truly exact URL matches | Bryan Newbold | 2022-07-14 | 1 | -0/+1 |
| | | | | | | This hopefully resolves an issue causing many apparent redirect loops, which were actually timing or HTTP status code near-loops caused by http/https fuzzy matching in the CDX API, despite "exact" API lookup semantics. | ||||
* | ingest: handle another type of wayback redirect | Bryan Newbold | 2022-07-14 | 1 | -2/+5 |
| | |||||
* | wayback fetch: handle upstream 5xx replays | Bryan Newbold | 2022-07-13 | 1 | -4/+15 |
| | |||||
* | ingest: catch more ConnectionErrors (SPN, replay fetch, GROBID) | Bryan Newbold | 2022-05-16 | 1 | -4/+10 |
| | |||||
* | ingest spn2: fix tests | Bryan Newbold | 2022-05-05 | 1 | -1/+1 |
| | |||||
* | SPNv2: several fixes for prod throughput | Bryan Newbold | 2022-04-26 | 1 | -11/+34 |
| | | | | | | | | | | Most importantly, for some API flags, if the value is not truthy, do not set the flag at all. Setting any flag was resulting in screenshots and outlinks actually getting created/captured, which was a huge slowdown. Also, check per-user SPNv2 slots available, using API, before requesting an actual capture. | ||||
* | file ingest: don't 'backoff' on spn2 backoff error | Bryan Newbold | 2022-03-22 | 1 | -0/+1 |
| | | | | | | | | The intent of this is to try and get through the daily ingest requests faster, so we can loop and retry if needed. A 200 second delay, usually resulting in a kafka topic reshuffle, really slows things down. This will presumably result in a bunch of spn2-backoff status requests, but we can just retry those. | ||||
* | spn: handle blocked-url (etc) better | Bryan Newbold | 2022-01-11 | 1 | -0/+10 |
| | |||||
* | SPNv2: make 'resources' optional | Bryan Newbold | 2021-11-16 | 1 | -1/+1 |
| | | | | | | | | This was always present previously. A change was made to SPNv2 API recently that borked it a bit, though in theory it should be present on new captures. I'm not seeing it for some captures, so pushing this workaround. It seems like we don't actually use this field anyways, at least for ingest pipeline. | ||||
* | IA (wayback): actually use an HTTP session for replay fetches | Bryan Newbold | 2021-11-03 | 1 | -2/+3 |
| | | | | | | | | I am embarrassed this wasn't actually the case already! Looks like I had even instantiated a session but wasn't using it. Hopefully this change, which adds extra retries and better backoff behavior, will improve sandcrawler ingest throughput. | ||||
* | make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 1 | -257/+354 |
| | |||||
* | fix type annotations for petabox body fetch helper | Bryan Newbold | 2021-10-26 | 1 | -1/+1 |
| | |||||
* | lint collection membership (last lint for now) | Bryan Newbold | 2021-10-26 | 1 | -7/+7 |
| | |||||
* | more progress on type annotations and linting | Bryan Newbold | 2021-10-26 | 1 | -2/+2 |
| | |||||
* | ia: more tweaks to delicate code to satisfy type checker | Bryan Newbold | 2021-10-26 | 1 | -10/+12 |
| | | | | | Ran the 'live' wayback tests after this commit as a check, and they worked (once FTP status code behavior change is fixed) | ||||
* | ia helpers: enforce max_redirects count correctly | Bryan Newbold | 2021-10-26 | 1 | -1/+1 |
| | | | | | AKA, should run fetch even if max_redirects = 0; the first loop iteration is not a redirect. | ||||
* | set CDX request params as str, not int or datetime | Bryan Newbold | 2021-10-26 | 1 | -3/+6 |
| | | | | This might be a bugfix, changing CDX lookup behavior? | ||||
* | bugfix: was setting 'from' parameter as a tuple, not a string | Bryan Newbold | 2021-10-26 | 1 | -1/+1 |
| | |||||
* | start type annotating IA helper code | Bryan Newbold | 2021-10-26 | 1 | -37/+65 |
| | |||||
* | flake8 clean (with current settings) | Bryan Newbold | 2021-10-26 | 1 | -2/+2 |
| | |||||
* | start handling trivial lint cleanups: unused imports, 'is None', etc | Bryan Newbold | 2021-10-26 | 1 | -15/+14 |
| | |||||
* | make fmt | Bryan Newbold | 2021-10-26 | 1 | -68/+124 |
| | |||||
* | python: isort all imports | Bryan Newbold | 2021-10-26 | 1 | -10/+12 |
| | |||||
* | move SPNv2 'simple_get' logic to SPN client | Bryan Newbold | 2021-10-15 | 1 | -2/+29 |
| | |||||
* | spn: avoid 'None' job_id | Bryan Newbold | 2021-10-11 | 1 | -2/+2 |
| | | | | | | Thanks Vanglis for reporting these. Not sure this commit fixes *all* instances of the problem. | ||||
* | ingest CDX lookup: weigh year+month of capture against in-petabox-or-not | Bryan Newbold | 2021-09-30 | 1 | -0/+1 |
| | | | | | | | | This is to try working around an issue where ingests fail because an SPN capture is much newer, but the old sorting preference ignored that. Note that the sorting logic is pretty busted anyways, and we should probably allow returning multiple matching files to try. | ||||
* | tune SPN CDX retry/wait depending on mode (priority vs daily) | Bryan Newbold | 2021-09-30 | 1 | -2/+4 |
| | |||||
* | crawl: SPN2 non-200 success code path | Bryan Newbold | 2021-07-13 | 1 | -11/+25 |
| | |||||
* | crawl: SPN self-redirect hack | Bryan Newbold | 2021-07-13 | 1 | -0/+9 |
| | |||||
* | crawl: small comment updates | Bryan Newbold | 2021-07-13 | 1 | -3/+6 |
| | |||||
* | ia CDX: handle bad CDX rows | Bryan Newbold | 2021-01-05 | 1 | -2/+4 |
| | |||||
* | spn: more status codes | Bryan Newbold | 2020-12-21 | 1 | -1/+2 |
| | |||||
* | handle more wayback error conditions | Bryan Newbold | 2020-11-20 | 1 | -0/+6 |
| | |||||
* | spn 'forbidden' status code | Bryan Newbold | 2020-11-12 | 1 | -1/+1 |
| | |||||
* | spn2-internal-server-error is a problem with remote server, not SPN2 | Bryan Newbold | 2020-11-08 | 1 | -0/+2 |
| | |||||
* | ingest: better non-full URL fixup | Bryan Newbold | 2020-11-08 | 1 | -4/+3 |
| | |||||
* | direct some more warnings to sys.stderr, not stdout | Bryan Newbold | 2020-11-08 | 1 | -2/+2 |
| | |||||
* | html: handle no-capture for sub-resources | Bryan Newbold | 2020-11-08 | 1 | -0/+3 |
| | |||||
* | ingest: fix null-body case | Bryan Newbold | 2020-11-08 | 1 | -0/+2 |
| | | | | Broke this in earlier refactor. | ||||
* | ia: use newer gwb (petabox) loading class | Bryan Newbold | 2020-11-04 | 1 | -5/+8 |
| | | | | This fixes zstandard WARC reading. | ||||
* | move transfer encoding helper to sandcrawler/ia.py | Bryan Newbold | 2020-11-03 | 1 | -1/+26 |
| | |||||
* | ingest: tweak debug printing alignment | Bryan Newbold | 2020-11-03 | 1 | -8/+7 |
| | |||||
* | cdx: fix 'closest' support | Bryan Newbold | 2020-11-03 | 1 | -3/+2 |
| | |||||
* | cdx: add support for 'closest' time parameter | Bryan Newbold | 2020-10-30 | 1 | -3/+9 |
| | |||||
* | ingest: decrease CDX timeout retries again | Bryan Newbold | 2020-10-22 | 1 | -1/+1 |
| |