Commit message | Author | Age | Files | Lines | ||
---|---|---|---|---|---|---|
... | ||||||
* | SPN: better check for partial URLs returned | Bryan Newbold | 2020-10-19 | 1 | -2/+2 | |
| | ||||||
* | CDX fetch: more permissive fuzzy/normalization check | Bryan Newbold | 2020-10-19 | 1 | -3/+9 | |
| | | | | | | | This might be the source of some `spn2-cdx-lookup-failure` errors. Wayback/CDX does this check via full-on SURT, with many more changes, and potentially we should be doing that here as well. | |||||
* | ingest: experimentally reduce CDX API retry delay | Bryan Newbold | 2020-10-17 | 1 | -1/+1 | |
| | | | | | | | This code path is only working about 1 in 7 times in production. Going to try with a much shorter retry delay and see if we get no success with that. Also considering just disabling this attempt altogether and relying on retries after hours/days. | |||||
* | ingest: handle cookieAbsent and partial SPNv2 URL response cases better | Bryan Newbold | 2020-10-17 | 1 | -0/+31 | |
| | ||||||
* | store no-capture URLs in terminal_url | Bryan Newbold | 2020-10-12 | 1 | -1/+1 | |
| | ||||||
* | Revert "ingest: reduce CDX retry_sleep to 3.0 sec (after SPN)" | Bryan Newbold | 2020-08-11 | 1 | -1/+1 | |
| | | | | | | | This reverts commit 92bf9bc28ac0eacab2e06fa3b25b52f0882804c2. In practice, in prod, this resulted in much larger spn2-cdx-lookup-failure error rates. | |||||
* | ingest: reduce CDX retry_sleep to 3.0 sec (after SPN) | Bryan Newbold | 2020-08-11 | 1 | -1/+1 | |
| | | | | | | | | As we are moving towards just retrying entire ingest requests, we should probably just make this zero. But until then we should give SPN CDX a small chance to sync before giving up. This change is expected to improve overall throughput. | |||||
* | refactor: force_get -> force_simple_get | Bryan Newbold | 2020-08-11 | 1 | -5/+5 | |
| | | | | | For clarity. The SPNv2 API hasn't changed, just changing the variable/parameter name. | |||||
* | spn2: skip js behavior (experiment) | Bryan Newbold | 2020-08-05 | 1 | -0/+1 | |
| | | | | | Hoping this will increase crawling throughput with little-to-no impact on fidelity. | |||||
* | SPN2: ensure not fetching outlinks | Bryan Newbold | 2020-08-05 | 1 | -0/+1 | |
| | ||||||
* | use new SPNv2 'skip_first_archive' param | Bryan Newbold | 2020-07-22 | 1 | -0/+1 | |
| | | | | For speed and efficiency. | |||||
* | report revisit non-200 as a WaybackError | Bryan Newbold | 2020-06-26 | 1 | -7/+7 | |
| | ||||||
* | pdf: mypy and typo fixes | Bryan Newbold | 2020-06-17 | 1 | -1/+1 | |
| | ||||||
* | ingest: quick hack to capture CNKI outlinks | Bryan Newbold | 2020-04-13 | 1 | -2/+9 | |
| | ||||||
* | ia: set User-Agent for replay fetch from wayback | Bryan Newbold | 2020-03-29 | 1 | -0/+5 | |
| | | | | | | | Did this for all the other "client" helpers, but forgot to for wayback replay. Was starting to get "445" errors from wayback. | |||||
* | ingest: better spn2 pending error code | Bryan Newbold | 2020-03-27 | 1 | -0/+2 | |
| | ||||||
* | ia: more conservative use of clean_url() | Bryan Newbold | 2020-03-24 | 1 | -3/+5 | |
| | | | | | | Fixes AttributeError: 'NoneType' object has no attribute 'strip'. Seen in production on the lookup_resource code path. | |||||
* | ingest: clean_url() in more places | Bryan Newbold | 2020-03-23 | 1 | -1/+4 | |
| | | | | | | Some 'cdx-error' results were due to URLs with ':' after the hostname or trailing newline ("\n") characters in the URL. This attempts to work around this category of error. | |||||
* | ingest: log every URL (from ia code side) | Bryan Newbold | 2020-03-18 | 1 | -0/+1 | |
| | ||||||
* | implement (unused) force_get flag for SPN2 | Bryan Newbold | 2020-03-18 | 1 | -3/+4 | |
| | | | | | | | | | I hoped this feature would make it possible to crawl journals.lww.com PDFs, because the token URLs work with `wget`, but it still doesn't seem to work. Maybe because of user agent? Anyways, this feature might be useful for crawling efficiency, so adding to master. | |||||
* | work around local redirect (resource.location) | Bryan Newbold | 2020-03-17 | 1 | -1/+6 | |
| | | | | | | Some redirects are host-local. This patch crudely detects this (full-path redirects starting with "/" only), and appends the URL to the host of the original URL. | |||||
* | ia: catch wayback ChunkedEncodingError | Bryan Newbold | 2020-03-05 | 1 | -0/+3 | |
| | ||||||
* | fix warc_offset -> offset | Bryan Newbold | 2020-02-24 | 1 | -1/+1 | |
| | ||||||
* | ingest: handle broken revisit records | Bryan Newbold | 2020-02-24 | 1 | -1/+4 | |
| | ||||||
* | ingest: treat CDX lookup error as a wayback-error | Bryan Newbold | 2020-02-24 | 1 | -1/+4 | |
| | ||||||
* | fetch_petabox_body: allow non-200 status code fetches | Bryan Newbold | 2020-02-24 | 1 | -2/+10 | |
| | | | | | | But only if it matches what the revisit record indicated. This is mostly to enable better revisit fetching. | |||||
* | allow fuzzy revisit matches | Bryan Newbold | 2020-02-24 | 1 | -1/+26 | |
| | ||||||
* | ingest: more revisit fixes | Bryan Newbold | 2020-02-22 | 1 | -4/+4 | |
| | ||||||
* | ia: improve warc/revisit implementation | Bryan Newbold | 2020-02-22 | 1 | -26/+46 | |
| | | | | | A lot of the terminal-bad-status seems to have been due to not handling revisits correctly. They have status_code = '-' or None. | |||||
* | cdx: handle empty/null CDX response | Bryan Newbold | 2020-02-22 | 1 | -0/+2 | |
| | | | | Sometimes seem to get empty string instead of empty JSON list | |||||
* | filter out CDX rows missing WARC playback fields | Bryan Newbold | 2020-02-19 | 1 | -0/+4 | |
| | ||||||
* | X-Archive-Src more robust than X-Archive-Redirect-Reason | Bryan Newbold | 2020-02-18 | 1 | -2/+3 | |
| | ||||||
* | wayback: on bad redirects, log instead of assert | Bryan Newbold | 2020-02-18 | 1 | -2/+13 | |
| | | | | This is a different form of mangled redirect. | |||||
* | attempt to work around corrupt ARC files from alexa issue | Bryan Newbold | 2020-02-18 | 1 | -0/+5 | |
| | ||||||
* | handle alternative dt format in WARC headers | Bryan Newbold | 2020-02-05 | 1 | -2/+4 | |
| | | | | | If there is a UTC timestamp, with trailing 'Z' indicating timezone, that is valid but increases string length by one. | |||||
* | decrease SPNv2 polling timeout to 3 minutes | Bryan Newbold | 2020-02-05 | 1 | -2/+2 | |
| | ||||||
* | improvements to reliability from prod testing | Bryan Newbold | 2020-02-03 | 1 | -5/+11 | |
| | ||||||
* | hack-y backoff ingest attempt | Bryan Newbold | 2020-02-03 | 1 | -2/+11 | |
| | | | | | | | | | | | | | | | The goal here is to have SPNv2 requests backoff when we get back-pressure (usually caused by some sessions taking too long). Lack of proper back-pressure is making it hard to turn up parallelism. This is a hack because we still timeout and drop the slow request. A better way is probably to have a background thread run, while the KafkaPusher thread does polling. Maybe with timeouts to detect slow processing (greater than 30 seconds?) and only pause/resume in that case. This would also make taking batches easier. Unlike the existing code, however, the parallelism needs to happen at the Pusher level to do the polling (Kafka) and "await" (for all worker threads to complete) correctly. | |||||
* | wayback: try to resolve HTTPException due to many HTTP headers | Bryan Newbold | 2020-02-02 | 1 | -1/+9 | |
| | | | | | | | | | This is within GWB wayback code. Trying two things: - bump default max headers from 100 to 1000 in the (global?) http.client module itself. I didn't think through whether we would expect this to actually work - catch the exception, record it, move on | |||||
* | fix WaybackError exception formatting | Bryan Newbold | 2020-01-28 | 1 | -1/+1 | |
| | ||||||
* | fix elif syntax error | Bryan Newbold | 2020-01-28 | 1 | -1/+1 | |
| | ||||||
* | clarify petabox fetch behavior | Bryan Newbold | 2020-01-28 | 1 | -3/+6 | |
| | ||||||
* | wayback: replay redirects have X-Archive-Redirect-Reason | Bryan Newbold | 2020-01-21 | 1 | -2/+4 | |
| | ||||||
* | handle UnicodeDecodeError in the other GET instance | Bryan Newbold | 2020-01-15 | 1 | -0/+2 | |
| | ||||||
* | increase SPNv2 polling timeout to 4 minutes | Bryan Newbold | 2020-01-15 | 1 | -1/+3 | |
| | ||||||
* | make failed replay fetch an error, not assert error | Bryan Newbold | 2020-01-15 | 1 | -1/+2 | |
| | ||||||
* | wayback replay: catch UnicodeDecodeError | Bryan Newbold | 2020-01-15 | 1 | -0/+2 | |
| | | | | | | | | In prod, ran into a redirect URL like: b'/web/20200116043630id_/https://mediarep.org/bitstream/handle/doc/1127/Barth\xe9l\xe9my_2015_Life_and_Technology.pdf;jsessionid=A9EFB2798846F5E14A8473BBFD6AB46C?sequence=1' which broke requests. | |||||
* | pass through revisit_cdx | Bryan Newbold | 2020-01-15 | 1 | -5/+18 | |
| | ||||||
* | fix revisit resolution | Bryan Newbold | 2020-01-15 | 1 | -4/+12 | |
| | | | | | Returns the *original* CDX record, but keeps the terminal_url and terminal_sha1hex info. | |||||
* | SPNv2 doesn't support FTP; add a live test for non-revisit FTP | Bryan Newbold | 2020-01-14 | 1 | -0/+10 | |
| |
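
The "work around local redirect (resource.location)" entry (2020-03-17) describes detecting host-local redirects, meaning full-path values starting with "/", and joining them onto the host of the original URL. A minimal sketch of that idea follows. It is not the code from the commit: the function name `resolve_host_local_redirect` is hypothetical, and the guard for protocol-relative `//` locations is an assumption added here that the commit message does not mention.

```python
from urllib.parse import urlsplit


def resolve_host_local_redirect(original_url: str, location: str) -> str:
    """Return an absolute URL for a host-local redirect target.

    Hypothetical helper, for illustration only. Only full-path
    locations (starting with "/") are rewritten; anything else is
    returned unchanged.
    """
    # Assumption (not in the commit message): leave protocol-relative
    # locations such as "//cdn.example.org/x" alone.
    if location.startswith("/") and not location.startswith("//"):
        parts = urlsplit(original_url)
        # Reuse the scheme and host (including any port) of the original URL.
        return f"{parts.scheme}://{parts.netloc}{location}"
    return location


if __name__ == "__main__":
    print(resolve_host_local_redirect("https://example.com/a/b?x=1", "/c/d.pdf"))
```

The standard library's `urllib.parse.urljoin` covers the same case and also handles relative paths, so it may be a reasonable alternative when broader resolution is wanted.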