Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | more progress on type annotations and linting | Bryan Newbold | 2021-10-26 | 1 | -2/+2 |
| | |||||
* | ia: more tweaks to delicate code to satisfy type checker | Bryan Newbold | 2021-10-26 | 1 | -10/+12 |
| | | | | | Ran the 'live' wayback tests after this commit as a check, and they worked (once FTP status code behavior change is fixed) | ||||
* | ia helpers: enforce max_redirects count correctly | Bryan Newbold | 2021-10-26 | 1 | -1/+1 |
| | | | | | AKA, should run fetch even if max_redirects = 0; the first loop iteration is not a redirect. | ||||
* | ensure CDX request params are str, not int or datetime | Bryan Newbold | 2021-10-26 | 1 | -3/+6 |
| | | | | This might be a bugfix, changing CDX lookup behavior? | ||||
* | bugfix: was setting 'from' parameter as a tuple, not a string | Bryan Newbold | 2021-10-26 | 1 | -1/+1 |
| | |||||
* | start type annotating IA helper code | Bryan Newbold | 2021-10-26 | 1 | -37/+65 |
| | |||||
* | flake8 clean (with current settings) | Bryan Newbold | 2021-10-26 | 1 | -2/+2 |
| | |||||
* | start handling trivial lint cleanups: unused imports, 'is None', etc | Bryan Newbold | 2021-10-26 | 1 | -15/+14 |
| | |||||
* | make fmt | Bryan Newbold | 2021-10-26 | 1 | -68/+124 |
| | |||||
* | python: isort all imports | Bryan Newbold | 2021-10-26 | 1 | -10/+12 |
| | |||||
* | move SPNv2 'simple_get' logic to SPN client | Bryan Newbold | 2021-10-15 | 1 | -2/+29 |
| | |||||
* | spn: avoid 'None' job_id | Bryan Newbold | 2021-10-11 | 1 | -2/+2 |
| | | | | | | Thanks Vanglis for reporting these. Not sure this commit fixes *all* instances of the problem. | ||||
* | ingest CDX lookup: weigh year+month of capture against in-petabox-or-not | Bryan Newbold | 2021-09-30 | 1 | -0/+1 |
| | | | | | | | | This is to try working around an issue where ingests fail because an SPN capture is much newer, but the old sorting preference ignored that. Note that the sorting logic is pretty busted anyways, and we should probably allow returning multiple matching files to try. | ||||
* | tune SPN CDX retry/wait depending on mode (priority vs daily) | Bryan Newbold | 2021-09-30 | 1 | -2/+4 |
| | |||||
* | crawl: SPN2 non-200 success code path | Bryan Newbold | 2021-07-13 | 1 | -11/+25 |
| | |||||
* | crawl: SPN self-redirect hack | Bryan Newbold | 2021-07-13 | 1 | -0/+9 |
| | |||||
* | crawl: small comment updates | Bryan Newbold | 2021-07-13 | 1 | -3/+6 |
| | |||||
* | ia CDX: handle bad CDX rows | Bryan Newbold | 2021-01-05 | 1 | -2/+4 |
| | |||||
* | spn: more status codes | Bryan Newbold | 2020-12-21 | 1 | -1/+2 |
| | |||||
* | handle more wayback error conditions | Bryan Newbold | 2020-11-20 | 1 | -0/+6 |
| | |||||
* | spn 'forbidden' status code | Bryan Newbold | 2020-11-12 | 1 | -1/+1 |
| | |||||
* | spn2-internal-server-error is a problem with remote server, not SPN2 | Bryan Newbold | 2020-11-08 | 1 | -0/+2 |
| | |||||
* | ingest: better non-full URL fixup | Bryan Newbold | 2020-11-08 | 1 | -4/+3 |
| | |||||
* | direct some more warnings to sys.stderr, not stdout | Bryan Newbold | 2020-11-08 | 1 | -2/+2 |
| | |||||
* | html: handle no-capture for sub-resources | Bryan Newbold | 2020-11-08 | 1 | -0/+3 |
| | |||||
* | ingest: fix null-body case | Bryan Newbold | 2020-11-08 | 1 | -0/+2 |
| | | | | Broke this in earlier refactor. | ||||
* | ia: use newer gwb (petabox) loading class | Bryan Newbold | 2020-11-04 | 1 | -5/+8 |
| | | | | This fixes zstandard WARC reading. | ||||
* | move transfer encoding helper to sandcrawler/ia.py | Bryan Newbold | 2020-11-03 | 1 | -1/+26 |
| | |||||
* | ingest: tweak debug printing alignment | Bryan Newbold | 2020-11-03 | 1 | -8/+7 |
| | |||||
* | cdx: fix 'closest' support | Bryan Newbold | 2020-11-03 | 1 | -3/+2 |
| | |||||
* | cdx: add support for 'closest' time parameter | Bryan Newbold | 2020-10-30 | 1 | -3/+9 |
| | |||||
* | ingest: decrease CDX timeout retries again | Bryan Newbold | 2020-10-22 | 1 | -1/+1 |
| | |||||
* | differentiate wayback-error from wayback-content-error | Bryan Newbold | 2020-10-21 | 1 | -10/+13 |
| | | | | | | The motivation here is to distinguish errors due to current content in wayback (eg, in WARCs) from operational errors (eg, wayback machine is down, or network failures/disruption). | ||||
* | SPN CDX delay now seems reasonable; increase to 40sec to catch most | Bryan Newbold | 2020-10-19 | 1 | -1/+1 |
| | |||||
* | CDX: when retrying, do so every 3 seconds up to limit | Bryan Newbold | 2020-10-19 | 1 | -5/+9 |
| | |||||
* | SPN: more verbose status logging | Bryan Newbold | 2020-10-19 | 1 | -0/+4 |
| | |||||
* | CDX: revert post-SPN CDX lookup retry to 10 seconds | Bryan Newbold | 2020-10-19 | 1 | -1/+1 |
| | | | | | Hoping to have many fewer SPN requests and issues, so willing to wait longer for each. | ||||
* | ingest: catch wayback-fail-after-SPN as separate status | Bryan Newbold | 2020-10-19 | 1 | -4/+17 |
| | |||||
* | SPN: better log line when starting a request | Bryan Newbold | 2020-10-19 | 1 | -0/+1 |
| | |||||
* | SPN: look for non-200 CDX responses | Bryan Newbold | 2020-10-19 | 1 | -1/+1 |
| | | | | Suspect that this has been the source of many `spn2-cdx-lookup-failure` | ||||
* | SPN: better check for partial URLs returned | Bryan Newbold | 2020-10-19 | 1 | -2/+2 |
| | |||||
* | CDX fetch: more permissive fuzzy/normalization check | Bryan Newbold | 2020-10-19 | 1 | -3/+9 |
| | | | | | | | This might be the source of some `spn2-cdx-lookup-failure`. Wayback/CDX does this check via full-on SURT, with many more changes, and potentially we should be doing that here as well. | ||||
* | ingest: experimentally reduce CDX API retry delay | Bryan Newbold | 2020-10-17 | 1 | -1/+1 |
| | | | | | | | This code path is only working about 1/7 times in production. Going to try with a much shorter retry delay and see if we get no success with that. Considering also just disabling this attempt altogether and relying on retries after hours/days. | ||||
* | ingest: handle cookieAbsent and partial SPNv2 URL response cases better | Bryan Newbold | 2020-10-17 | 1 | -0/+31 |
| | |||||
* | store no-capture URLs in terminal_url | Bryan Newbold | 2020-10-12 | 1 | -1/+1 |
| | |||||
* | Revert "ingest: reduce CDX retry_sleep to 3.0 sec (after SPN)" | Bryan Newbold | 2020-08-11 | 1 | -1/+1 |
| | | | | | | | This reverts commit 92bf9bc28ac0eacab2e06fa3b25b52f0882804c2. In practice, in prod, this resulted in much larger spn2-cdx-lookup-failure error rates. | ||||
* | ingest: reduce CDX retry_sleep to 3.0 sec (after SPN) | Bryan Newbold | 2020-08-11 | 1 | -1/+1 |
| | | | | | | | | As we are moving towards just retrying entire ingest requests, we should probably just make this zero. But until then we should give SPN CDX a small chance to sync before giving up. This change is expected to improve overall throughput. | ||||
* | refactor: force_get -> force_simple_get | Bryan Newbold | 2020-08-11 | 1 | -5/+5 |
| | | | | | For clarity. The SPNv2 API hasn't changed, just changing the variable/parameter name. | ||||
* | spn2: skip js behavior (experiment) | Bryan Newbold | 2020-08-05 | 1 | -0/+1 |
| | | | | | Hoping this will increase crawling throughput with little-to-no impact on fidelity. | ||||
* | SPN2: ensure not fetching outlinks | Bryan Newbold | 2020-08-05 | 1 | -0/+1 |
| |
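
The "enforce max_redirects count correctly" commit above notes that a fetch should still run when max_redirects = 0, because the first loop iteration is not a redirect. Below is a minimal sketch of that off-by-one, not the actual sandcrawler code: the function name, the `Fetcher` type, and the `(status, location)` return shape are all assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical fetcher type: returns (status_code, redirect_target_or_None).
Fetcher = Callable[[str], Tuple[int, Optional[str]]]


def fetch_following_redirects(
    url: str, fetch: Fetcher, max_redirects: int
) -> Tuple[int, List[str]]:
    """Fetch `url`, following at most `max_redirects` redirects.

    Returns the final status code and the list of URLs actually fetched.
    """
    fetched: List[str] = []
    # max_redirects + 1 iterations: the first fetch is not a redirect, so
    # max_redirects = 0 still performs exactly one fetch.
    for _ in range(max_redirects + 1):
        status, location = fetch(url)
        fetched.append(url)
        if location is None:
            return status, fetched
        url = location
    raise RuntimeError(f"redirect limit exceeded (max_redirects={max_redirects})")
```

With a loop of `range(max_redirects)` instead, a limit of zero would never fetch anything, which appears to be the behavior the commit message describes fixing.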