Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | handle more wayback error conditions | Bryan Newbold | 2020-11-20 | 1 | -0/+6 |
| | |||||
* | spn 'forbidden' status code | Bryan Newbold | 2020-11-12 | 1 | -1/+1 |
| | |||||
* | spn2-internal-server-error is a problem with remote server, not SPN2 | Bryan Newbold | 2020-11-08 | 1 | -0/+2 |
| | |||||
* | ingest: better non-full URL fixup | Bryan Newbold | 2020-11-08 | 1 | -4/+3 |
| | |||||
* | direct some more warnings to sys.stderr, not stdout | Bryan Newbold | 2020-11-08 | 1 | -2/+2 |
| | |||||
* | html: handle no-capture for sub-resources | Bryan Newbold | 2020-11-08 | 1 | -0/+3 |
| | |||||
* | ingest: fix null-body case | Bryan Newbold | 2020-11-08 | 1 | -0/+2 |
| | | | | Broke this in earlier refactor. | ||||
* | ia: use newer gwb (petabox) loading class | Bryan Newbold | 2020-11-04 | 1 | -5/+8 |
| | | | | This fixes zstandard WARC reading. | ||||
* | move transfer encoding helper to sandcrawler/ia.py | Bryan Newbold | 2020-11-03 | 1 | -1/+26 |
| | |||||
* | ingest: tweak debug printing alignment | Bryan Newbold | 2020-11-03 | 1 | -8/+7 |
| | |||||
* | cdx: fix 'closest' support | Bryan Newbold | 2020-11-03 | 1 | -3/+2 |
| | |||||
* | cdx: add support for 'closest' time parameter | Bryan Newbold | 2020-10-30 | 1 | -3/+9 |
| | |||||
* | ingest: decrease CDX timeout retries again | Bryan Newbold | 2020-10-22 | 1 | -1/+1 |
| | |||||
* | differentiate wayback-error from wayback-content-error | Bryan Newbold | 2020-10-21 | 1 | -10/+13 |
| | | | | | | The motivation here is to distinguish errors due to current content in wayback (eg, in WARCs) from operational errors (eg, wayback machine is down, or network failures/disruption). | ||||
* | SPN CDX delay now seems reasonable; increase to 40sec to catch most | Bryan Newbold | 2020-10-19 | 1 | -1/+1 |
| | |||||
* | CDX: when retrying, do so every 3 seconds up to limit | Bryan Newbold | 2020-10-19 | 1 | -5/+9 |
| | |||||
* | SPN: more verbose status logging | Bryan Newbold | 2020-10-19 | 1 | -0/+4 |
| | |||||
* | CDX: revert post-SPN CDX lookup retry to 10 seconds | Bryan Newbold | 2020-10-19 | 1 | -1/+1 |
| | | | | | Hoping to have many fewer SPN requests and issues, so willing to wait longer for each. | ||||
* | ingest: catch wayback-fail-after-SPN as separate status | Bryan Newbold | 2020-10-19 | 1 | -4/+17 |
| | |||||
* | SPN: better log line when starting a request | Bryan Newbold | 2020-10-19 | 1 | -0/+1 |
| | |||||
* | SPN: look for non-200 CDX responses | Bryan Newbold | 2020-10-19 | 1 | -1/+1 |
| | | | | Suspect that this has been the source of many `spn2-cdx-lookup-failure` | ||||
* | SPN: better check for partial URLs returned | Bryan Newbold | 2020-10-19 | 1 | -2/+2 |
| | |||||
* | CDX fetch: more permissive fuzzy/normalization check | Bryan Newbold | 2020-10-19 | 1 | -3/+9 |
| | | | | | | | This might be the source of some `spn2-cdx-lookup-failure`. Wayback/CDX does this check via full-on SURT, with many more changes, and potentially we should be doing that here as well. | ||||
* | ingest: experimentally reduce CDX API retry delay | Bryan Newbold | 2020-10-17 | 1 | -1/+1 |
| | | | | | | | This code path is only working about 1/7 times in production. Going to try with a much shorter retry delay and see if we get no success with that. Considering also just disabling this attempt altogether and relying on retries after hours/days. | ||||
* | ingest: handle cookieAbsent and partial SPNv2 URL response cases better | Bryan Newbold | 2020-10-17 | 1 | -0/+31 |
| | |||||
* | store no-capture URLs in terminal_url | Bryan Newbold | 2020-10-12 | 1 | -1/+1 |
| | |||||
* | Revert "ingest: reduce CDX retry_sleep to 3.0 sec (after SPN)" | Bryan Newbold | 2020-08-11 | 1 | -1/+1 |
| | | | | | | | This reverts commit 92bf9bc28ac0eacab2e06fa3b25b52f0882804c2. In practice, in prod, this resulted in much larger spn2-cdx-lookup-failure error rates. | ||||
* | ingest: reduce CDX retry_sleep to 3.0 sec (after SPN) | Bryan Newbold | 2020-08-11 | 1 | -1/+1 |
| | | | | | | | | As we are moving towards just retrying entire ingest requests, we should probably just make this zero. But until then we should give SPN CDX a small chance to sync before giving up. This change is expected to improve overall throughput. | ||||
* | refactor: force_get -> force_simple_get | Bryan Newbold | 2020-08-11 | 1 | -5/+5 |
| | | | | | For clarity. The SPNv2 API hasn't changed, just changing the variable/parameter name. | ||||
* | spn2: skip js behavior (experiment) | Bryan Newbold | 2020-08-05 | 1 | -0/+1 |
| | | | | | Hoping this will increase crawling throughput with little-to-no impact on fidelity. | ||||
* | SPN2: ensure not fetching outlinks | Bryan Newbold | 2020-08-05 | 1 | -0/+1 |
| | |||||
* | use new SPNv2 'skip_first_archive' param | Bryan Newbold | 2020-07-22 | 1 | -0/+1 |
| | | | | For speed and efficiency. | ||||
* | report revisit non-200 as a WaybackError | Bryan Newbold | 2020-06-26 | 1 | -7/+7 |
| | |||||
* | pdf: mypy and typo fixes | Bryan Newbold | 2020-06-17 | 1 | -1/+1 |
| | |||||
* | ingest: quick hack to capture CNKI outlinks | Bryan Newbold | 2020-04-13 | 1 | -2/+9 |
| | |||||
* | ia: set User-Agent for replay fetch from wayback | Bryan Newbold | 2020-03-29 | 1 | -0/+5 |
| | | | | | | | Did this for all the other "client" helpers, but forgot to for wayback replay. Was starting to get "445" errors from wayback. | ||||
* | ingest: better spn2 pending error code | Bryan Newbold | 2020-03-27 | 1 | -0/+2 |
| | |||||
* | ia: more conservative use of clean_url() | Bryan Newbold | 2020-03-24 | 1 | -3/+5 |
| | | | | | | Fixes `AttributeError: 'NoneType' object has no attribute 'strip'`. Seen in production on the lookup_resource code path. | ||||
* | ingest: clean_url() in more places | Bryan Newbold | 2020-03-23 | 1 | -1/+4 |
| | | | | | | Some 'cdx-error' results were due to URLs with ':' after the hostname or trailing newline ("\n") characters in the URL. This attempts to work around this category of error. | ||||
* | ingest: log every URL (from ia code side) | Bryan Newbold | 2020-03-18 | 1 | -0/+1 |
| | |||||
* | implement (unused) force_get flag for SPN2 | Bryan Newbold | 2020-03-18 | 1 | -3/+4 |
| | | | | | | | | | I hoped this feature would make it possible to crawl journals.lww.com PDFs, because the token URLs work with `wget`, but it still doesn't seem to work. Maybe because of user agent? Anyways, this feature might be useful for crawling efficiency, so adding to master. | ||||
* | work around local redirect (resource.location) | Bryan Newbold | 2020-03-17 | 1 | -1/+6 |
| | | | | | | Some redirects are host-local. This patch crudely detects this (full-path redirects starting with "/" only), and appends the URL to the host of the original URL. | ||||
* | ia: catch wayback ChunkedEncodingError | Bryan Newbold | 2020-03-05 | 1 | -0/+3 |
| | |||||
* | fix warc_offset -> offset | Bryan Newbold | 2020-02-24 | 1 | -1/+1 |
| | |||||
* | ingest: handle broken revisit records | Bryan Newbold | 2020-02-24 | 1 | -1/+4 |
| | |||||
* | ingest: treat CDX lookup error as a wayback-error | Bryan Newbold | 2020-02-24 | 1 | -1/+4 |
| | |||||
* | fetch_petabox_body: allow non-200 status code fetches | Bryan Newbold | 2020-02-24 | 1 | -2/+10 |
| | | | | | | But only if it matches what the revisit record indicated. This is mostly to enable better revisit fetching. | ||||
* | allow fuzzy revisit matches | Bryan Newbold | 2020-02-24 | 1 | -1/+26 |
| | |||||
* | ingest: more revisit fixes | Bryan Newbold | 2020-02-22 | 1 | -4/+4 |
| | |||||
* | ia: improve warc/revisit implementation | Bryan Newbold | 2020-02-22 | 1 | -26/+46 |
| | | | | | A lot of the terminal-bad-status seems to have been due to not handling revisits correctly. They have status_code = '-' or None. |
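
The "work around local redirect (resource.location)" entry above describes detecting host-local redirects (full-path locations starting with "/") and joining them to the host of the original URL. A minimal sketch of that idea follows. The function name `fix_local_redirect` and its signature are hypothetical and not taken from the sandcrawler code, and the extra check that skips protocol-relative `//host/...` locations is an assumption added here, not something the commit message states.

```python
from urllib.parse import urlsplit


def fix_local_redirect(original_url: str, location: str) -> str:
    """Hypothetical helper: resolve a host-local redirect target.

    If `location` is a full-path redirect (starts with a single "/"),
    prepend the scheme and host of `original_url`. Anything else is
    returned unchanged.
    """
    # Assumption: "//host/path" is protocol-relative, not host-local,
    # so it is left alone here.
    if location.startswith("/") and not location.startswith("//"):
        parts = urlsplit(original_url)
        return f"{parts.scheme}://{parts.netloc}{location}"
    return location


if __name__ == "__main__":
    print(fix_local_redirect("https://example.com/article/123?x=1", "/pdf/123.pdf"))
```

The standard library's `urllib.parse.urljoin` handles this case as well as relative paths; the narrower check here only mirrors the "crude" detection the commit message describes.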