Commit message (Collapse) | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | ia: catch wayback ChunkedEncodingError | Bryan Newbold | 2020-03-05 | 1 | -0/+3 |
| | |||||
* | fix warc_offset -> offset | Bryan Newbold | 2020-02-24 | 1 | -1/+1 |
| | |||||
* | ingest: handle broken revisit records | Bryan Newbold | 2020-02-24 | 1 | -1/+4 |
| | |||||
* | ingest: treat CDX lookup error as a wayback-error | Bryan Newbold | 2020-02-24 | 1 | -1/+4 |
| | |||||
* | fetch_petabox_body: allow non-200 status code fetches | Bryan Newbold | 2020-02-24 | 1 | -2/+10 |
| | But only if it matches what the revisit record indicated. This is mostly to enable better revisit fetching. | ||||
* | allow fuzzy revisit matches | Bryan Newbold | 2020-02-24 | 1 | -1/+26 |
| | |||||
* | ingest: more revisit fixes | Bryan Newbold | 2020-02-22 | 1 | -4/+4 |
| | |||||
* | ia: improve warc/revisit implementation | Bryan Newbold | 2020-02-22 | 1 | -26/+46 |
| | A lot of the terminal-bad-status results seem to have been due to not handling revisits correctly. They have status_code = '-' or None. | ||||
* | cdx: handle empty/null CDX response | Bryan Newbold | 2020-02-22 | 1 | -0/+2 |
| | Sometimes we seem to get an empty string instead of an empty JSON list. | ||||
* | filter out CDX rows missing WARC playback fields | Bryan Newbold | 2020-02-19 | 1 | -0/+4 |
| | |||||
* | X-Archive-Src more robust than X-Archive-Redirect-Reason | Bryan Newbold | 2020-02-18 | 1 | -2/+3 |
| | |||||
* | wayback: on bad redirects, log instead of assert | Bryan Newbold | 2020-02-18 | 1 | -2/+13 |
| | This is a different form of mangled redirect. | ||||
* | attempt to work around corrupt ARC files from alexa issue | Bryan Newbold | 2020-02-18 | 1 | -0/+5 |
| | |||||
* | handle alternative dt format in WARC headers | Bryan Newbold | 2020-02-05 | 1 | -2/+4 |
| | If there is a UTC timestamp with a trailing 'Z' indicating the timezone, that is valid but increases the string length by one. | ||||
* | decrease SPNv2 polling timeout to 3 minutes | Bryan Newbold | 2020-02-05 | 1 | -2/+2 |
| | |||||
* | improvements to reliability from prod testing | Bryan Newbold | 2020-02-03 | 1 | -5/+11 |
| | |||||
* | hack-y backoff ingest attempt | Bryan Newbold | 2020-02-03 | 1 | -2/+11 |
| | The goal here is to have SPNv2 requests back off when we get back-pressure (usually caused by some sessions taking too long). Lack of proper back-pressure is making it hard to turn up parallelism. This is a hack because we still time out and drop the slow request. A better way is probably to have a background thread run while the KafkaPusher thread does polling, maybe with timeouts to detect slow processing (greater than 30 seconds?) and only pause/resume in that case. This would also make taking batches easier. Unlike the existing code, however, the parallelism needs to happen at the Pusher level to do the polling (Kafka) and "await" (for all worker threads to complete) correctly. | ||||
* | wayback: try to resolve HTTPException due to many HTTP headers | Bryan Newbold | 2020-02-02 | 1 | -1/+9 |
| | This is within GWB wayback code. Trying two things: (1) bump default max headers from 100 to 1000 in the (global?) http.client module itself (I didn't think through whether we would expect this to actually work); (2) catch the exception, record it, move on. | ||||
* | fix WaybackError exception formatting | Bryan Newbold | 2020-01-28 | 1 | -1/+1 |
| | |||||
* | fix elif syntax error | Bryan Newbold | 2020-01-28 | 1 | -1/+1 |
| | |||||
* | clarify petabox fetch behavior | Bryan Newbold | 2020-01-28 | 1 | -3/+6 |
| | |||||
* | wayback: replay redirects have X-Archive-Redirect-Reason | Bryan Newbold | 2020-01-21 | 1 | -2/+4 |
| | |||||
* | handle UnicodeDecodeError in the other GET instance | Bryan Newbold | 2020-01-15 | 1 | -0/+2 |
| | |||||
* | increase SPNv2 polling timeout to 4 minutes | Bryan Newbold | 2020-01-15 | 1 | -1/+3 |
| | |||||
* | make failed replay fetch an error, not assert error | Bryan Newbold | 2020-01-15 | 1 | -1/+2 |
| | |||||
* | wayback replay: catch UnicodeDecodeError | Bryan Newbold | 2020-01-15 | 1 | -0/+2 |
| | In prod, ran into a redirect URL like: b'/web/20200116043630id_/https://mediarep.org/bitstream/handle/doc/1127/Barth\xe9l\xe9my_2015_Life_and_Technology.pdf;jsessionid=A9EFB2798846F5E14A8473BBFD6AB46C?sequence=1' which broke requests. | ||||
* | pass through revisit_cdx | Bryan Newbold | 2020-01-15 | 1 | -5/+18 |
| | |||||
* | fix revisit resolution | Bryan Newbold | 2020-01-15 | 1 | -4/+12 |
| | Returns the *original* CDX record, but keeps the terminal_url and terminal_sha1hex info. | ||||
* | SPNv2 doesn't support FTP; add a live test for non-revisit FTP | Bryan Newbold | 2020-01-14 | 1 | -0/+10 |
| | |||||
* | basic FTP ingest support; revisit record resolution | Bryan Newbold | 2020-01-14 | 1 | -34/+77 |
| | Supporting revisits means more wayback hits (fewer crawls), so faster ingest; but this is only partial support, and we will also need to work through the sandcrawler db schema, etc. Current status should be safe to merge/use. FTP support works via treating an FTP hit as a 200. | ||||
* | better print() output | Bryan Newbold | 2020-01-10 | 1 | -3/+3 |
| | |||||
* | fix redirect replay fetch method | Bryan Newbold | 2020-01-10 | 1 | -1/+4 |
| | |||||
* | handle SPNv2-then-CDX lookup failures | Bryan Newbold | 2020-01-10 | 1 | -6/+23 |
| | Use a 10 second delay if the CDX result isn't immediately available (blech); if there is a lookup failure, call it a wayback-error and move on. | ||||
* | SPNv2 hack specifically for elsevier lookups | Bryan Newbold | 2020-01-10 | 1 | -0/+15 |
| | I'm not really sure why this is needed, and maybe with more careful testing it isn't. But it works! | ||||
* | add support for redirect lookups from replay | Bryan Newbold | 2020-01-10 | 1 | -9/+69 |
| | |||||
* | more general ingest tweaks and affordances | Bryan Newbold | 2020-01-10 | 1 | -5/+18 |
| | |||||
* | add sleep-and-retry workaround for CDX after SPNv2 | Bryan Newbold | 2020-01-10 | 1 | -1/+9 |
| | |||||
* | more live tests (for regressions) | Bryan Newbold | 2020-01-10 | 1 | -0/+1 |
| | |||||
* | disable CDX best lookup 'collapse'; leave comment | Bryan Newbold | 2020-01-10 | 1 | -1/+3 |
| | |||||
* | hack: reverse sort of CDX exact seems broken with SPNv2 results | Bryan Newbold | 2020-01-10 | 1 | -1/+1 |
| | |||||
* | wayback: datetime mismatch as an error | Bryan Newbold | 2020-01-09 | 1 | -1/+2 |
| | |||||
* | lots of progress on wayback refactoring | Bryan Newbold | 2020-01-09 | 1 | -39/+123 |
| | Too much to list; canonical flags to control crawling; cdx_to_dict helper. | ||||
* | location comes as a string, not list | Bryan Newbold | 2020-01-09 | 1 | -1/+1 |
| | |||||
* | fix http/https issue with GlobalWayback library | Bryan Newbold | 2020-01-09 | 1 | -1/+2 |
| | |||||
* | wayback fetch via replay; confirm hashes in crawl_resource() | Bryan Newbold | 2020-01-09 | 1 | -5/+40 |
| | |||||
* | wrap up basic (locally testable) ingest refactor | Bryan Newbold | 2020-01-09 | 1 | -19/+23 |
| | |||||
* | more wayback and SPN tests and fixes | Bryan Newbold | 2020-01-09 | 1 | -38/+152 |
| | |||||
* | refactor CdxApiClient, add tests | Bryan Newbold | 2020-01-08 | 1 | -40/+130 |
| | Always use auth token and get full CDX rows; simplify to "fetch" (exact url/dt match) and "lookup_best" methods; all redirect stuff will be moved to a higher level. | ||||
* | refactor SavePaperNowClient and add test | Bryan Newbold | 2020-01-07 | 1 | -28/+154 |
| | Response as a namedtuple; "remote" errors (aka, SPN API was HTTP 200 but returned error) aren't an exception. | ||||
* | remove SPNv1 code paths | Bryan Newbold | 2020-01-07 | 1 | -35/+1 |
| |
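
The 2020-02-05 commit "handle alternative dt format in WARC headers" notes that a UTC timestamp may carry a trailing 'Z', which is valid but makes the string one character longer. The following is a minimal sketch of that kind of normalization, not the code from the commit: the helper name `warc_date_to_wayback_dt` and the 14-digit output format are assumptions made for illustration.

```python
from datetime import datetime


def warc_date_to_wayback_dt(raw: str) -> str:
    """Hypothetical helper: normalize a WARC-style datetime string.

    Accepts 'YYYY-MM-DDTHH:MM:SS' with or without a trailing 'Z'
    and returns a 14-digit 'YYYYMMDDHHMMSS' string.
    """
    value = raw.strip()
    # A trailing 'Z' (UTC marker) is valid, but adds one character.
    if value.endswith("Z"):
        value = value[:-1]
    # strptime raises ValueError if the remaining string is malformed.
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")
    return parsed.strftime("%Y%m%d%H%M%S")


if __name__ == "__main__":
    print(warc_date_to_wayback_dt("2020-02-05T01:02:03Z"))
    print(warc_date_to_wayback_dt("2020-02-05T01:02:03"))
```

Stripping the suffix before parsing keeps a single format string for both variants, so a fixed-length check on the raw header is no longer needed.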