path: root/python/sandcrawler/ia.py
Commit history (newest first). Each entry: message (author, date; files changed, lines removed/added)
* handle alternative dt format in WARC headers (Bryan Newbold, 2020-02-05; 1 file, -2/+4)

  If there is a UTC timestamp with a trailing 'Z' indicating the timezone,
  that is valid but increases the string length by one.
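  A minimal sketch of handling both forms; the helper name is hypothetical,
  and only the "one extra 'Z' character" detail comes from the commit:

      from datetime import datetime

      def parse_warc_dt(raw):
          """Hypothetical helper: accept WARC header datetimes both with
          and without the trailing 'Z' (UTC) suffix."""
          if len(raw) == 20 and raw.endswith('Z'):
              raw = raw[:-1]  # strip the one extra character
          return datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S")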
* decrease SPNv2 polling timeout to 3 minutes (Bryan Newbold, 2020-02-05; 1 file, -2/+2)
* improvements to reliability from prod testing (Bryan Newbold, 2020-02-03; 1 file, -5/+11)
* hack-y backoff ingest attempt (Bryan Newbold, 2020-02-03; 1 file, -2/+11)

  The goal here is to have SPNv2 requests back off when we get
  back-pressure (usually caused by some sessions taking too long). Lack
  of proper back-pressure is making it hard to turn up parallelism.

  This is a hack because we still time out and drop the slow request. A
  better way is probably to have a background thread run while the
  KafkaPusher thread does polling, maybe with timeouts to detect slow
  processing (greater than 30 seconds?) and only pause/resume in that
  case. This would also make taking batches easier. Unlike the existing
  code, however, the parallelism needs to happen at the Pusher level to
  do the polling (Kafka) and "await" (for all worker threads to
  complete) correctly.
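  A rough sketch of the hack as described; function and parameter names
  are hypothetical, and TimeoutError stands in for whatever the SPNv2
  client actually raises on a slow session:

      import time

      def ingest_with_backoff(requests_iter, submit,
                              initial_delay=10.0, max_delay=600.0):
          """Hypothetical sketch: drop a request that times out (the
          hack), but sleep with a growing delay before the next one."""
          delay = 0.0
          for req in requests_iter:
              if delay:
                  time.sleep(delay)
              try:
                  submit(req)   # eg, an SPNv2 save request
                  delay = 0.0   # success: clear the backoff
              except TimeoutError:
                  # back-pressure signal: slow session, so back off
                  delay = min(delay * 2 or initial_delay, max_delay)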
* wayback: try to resolve HTTPException due to many HTTP headers (Bryan Newbold, 2020-02-02; 1 file, -1/+9)

  This is within GWB wayback code. Trying two things:
  - bump the default max headers from 100 to 1000 in the (global?)
    http.client module itself. I didn't think through whether we would
    expect this to actually work
  - catch the exception, record it, and move on
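  A minimal sketch of both workarounds, assuming the stdlib's private
  http.client._MAXHEADERS counter (default 100) is what triggers the
  "got more than 100 headers" HTTPException:

      import http.client

      # first workaround: bump the module-global header-count limit;
      # _MAXHEADERS is a private stdlib attribute, so this is best-effort
      http.client._MAXHEADERS = 1000

      # second workaround: catch the exception, record it, move on
      def fetch_body(fetcher, url):
          try:
              return fetcher(url)
          except http.client.HTTPException as e:
              return {'status': 'wayback-error', 'reason': str(e), 'url': url}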
* fix WaybackError exception formatting (Bryan Newbold, 2020-01-28; 1 file, -1/+1)
* fix elif syntax error (Bryan Newbold, 2020-01-28; 1 file, -1/+1)
* clarify petabox fetch behavior (Bryan Newbold, 2020-01-28; 1 file, -3/+6)
* wayback: replay redirects have X-Archive-Redirect-Reason (Bryan Newbold, 2020-01-21; 1 file, -2/+4)
* handle UnicodeDecodeError in the other GET instance (Bryan Newbold, 2020-01-15; 1 file, -0/+2)
* increase SPNv2 polling timeout to 4 minutes (Bryan Newbold, 2020-01-15; 1 file, -1/+3)
* make failed replay fetch an error, not an assert error (Bryan Newbold, 2020-01-15; 1 file, -1/+2)
* wayback replay: catch UnicodeDecodeError (Bryan Newbold, 2020-01-15; 1 file, -0/+2)

  In prod, ran into a redirect URL like:

      b'/web/20200116043630id_/https://mediarep.org/bitstream/handle/doc/1127/Barth\xe9l\xe9my_2015_Life_and_Technology.pdf;jsessionid=A9EFB2798846F5E14A8473BBFD6AB46C?sequence=1'

  which broke requests.
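  A small sketch of the catch, assuming a WaybackError exception class
  like the one named in the commits above; the wrapper is hypothetical:

      import requests

      class WaybackError(Exception):
          pass

      def replay_get(session, url):
          """Hypothetical wrapper: redirect URLs containing raw non-UTF-8
          bytes can make requests raise UnicodeDecodeError mid-redirect."""
          try:
              return session.get(url, allow_redirects=True, timeout=60)
          except UnicodeDecodeError:
              raise WaybackError(
                  "UnicodeDecodeError following replay redirect: {}".format(url))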
* pass through revisit_cdx (Bryan Newbold, 2020-01-15; 1 file, -5/+18)
* fix revisit resolution (Bryan Newbold, 2020-01-15; 1 file, -4/+12)

  Returns the *original* CDX record, but keeps the terminal_url and
  terminal_sha1hex info.
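  A sketch of what that resolution might look like; the ResourceResult
  shape and field names are assumptions read off the commit message:

      from collections import namedtuple

      # hypothetical result shape; only terminal_url/terminal_sha1hex
      # are actually named in the commit message
      ResourceResult = namedtuple(
          'ResourceResult', ['cdx', 'terminal_url', 'terminal_sha1hex'])

      def resolve_revisit(original_cdx, revisit_cdx):
          """Return the original capture's CDX record (where the bytes
          actually live), but keep the terminal URL/hash of the revisit
          we resolved through."""
          return ResourceResult(
              cdx=original_cdx,
              terminal_url=revisit_cdx.url,
              terminal_sha1hex=revisit_cdx.sha1hex,
          )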
* SPNv2 doesn't support FTP; add a live test for non-revisit FTP (Bryan Newbold, 2020-01-14; 1 file, -0/+10)
* basic FTP ingest support; revisit record resolution (Bryan Newbold, 2020-01-14; 1 file, -34/+77)

  - supporting revisits means more wayback hits (fewer crawls) => faster
  - ... but this is only partial support; will also need to work through
    the sandcrawler db schema, etc. current status should be safe to
    merge/use
  - ftp support via treating an ftp hit as a 200
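  One way the "ftp hit as a 200" normalization could look; the helper
  name is hypothetical, and 226 ("transfer complete") stands in for
  whatever status code FTP captures actually carry:

      def normalize_ftp_status(url, status_code):
          """Hypothetical normalization: FTP fetches have no HTTP status,
          so map a successful FTP hit to 200 for downstream code."""
          if url.startswith('ftp://') and status_code in (226, None):
              return 200
          return status_code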
* better print() output (Bryan Newbold, 2020-01-10; 1 file, -3/+3)
* fix redirect replay fetch method (Bryan Newbold, 2020-01-10; 1 file, -1/+4)
* handle SPNv2-then-CDX lookup failures (Bryan Newbold, 2020-01-10; 1 file, -6/+23)

  - use a 10 second delay if the CDX result isn't immediately available.
    blech.
  - if there is a lookup failure, call it a wayback-error and move on
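  A sketch of that retry-then-error flow, reusing the hypothetical
  WaybackError stub from the earlier sketch and assuming a
  fetch(url, datetime) method on the CDX client:

      import time

      def cdx_lookup_after_spn(cdx_client, url, datetime_str,
                               retries=2, delay=10.0):
          """Hypothetical sketch: SPNv2 reports the capture done, but the
          CDX index can lag, so sleep and retry before giving up."""
          for _ in range(retries):
              row = cdx_client.fetch(url, datetime_str)
              if row is not None:
                  return row
              time.sleep(delay)  # blech
          raise WaybackError("CDX lookup failed after SPNv2: {}".format(url))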
* SPNv2 hack specifically for elsevier lookups (Bryan Newbold, 2020-01-10; 1 file, -0/+15)

  I'm not really sure why this is needed, and maybe with more careful
  testing it isn't. But it works!
* add support for redirect lookups from replay (Bryan Newbold, 2020-01-10; 1 file, -9/+69)
* more general ingest tweaks and affordances (Bryan Newbold, 2020-01-10; 1 file, -5/+18)
* add sleep-and-retry workaround for CDX after SPNv2 (Bryan Newbold, 2020-01-10; 1 file, -1/+9)
* more live tests (for regressions) (Bryan Newbold, 2020-01-10; 1 file, -0/+1)
* disable CDX best lookup 'collapse'; leave comment (Bryan Newbold, 2020-01-10; 1 file, -1/+3)
* hack: reverse sort of CDX exact seems broken with SPNv2 results (Bryan Newbold, 2020-01-10; 1 file, -1/+1)
* wayback: datetime mismatch as an error (Bryan Newbold, 2020-01-09; 1 file, -1/+2)
* lots of progress on wayback refactoring (Bryan Newbold, 2020-01-09; 1 file, -39/+123)

  - too much to list
  - canonical flags to control crawling
  - cdx_to_dict helper
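  The cdx_to_dict helper presumably flattens a CDX row for
  serialization; a sketch under assumed field names (the actual row
  fields aren't shown in this log):

      def cdx_to_dict(cdx):
          """Hypothetical: turn a CDX row (eg, a namedtuple) into a plain
          dict for JSON output; the field names are assumptions."""
          return {
              'datetime': cdx.datetime,
              'url': cdx.url,
              'surt': cdx.surt,
              'mimetype': cdx.mimetype,
              'status_code': cdx.status_code,
              'sha1hex': cdx.sha1hex,
          }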
* location comes as a string, not a list (Bryan Newbold, 2020-01-09; 1 file, -1/+1)
* fix http/https issue with GlobalWayback library (Bryan Newbold, 2020-01-09; 1 file, -1/+2)
* wayback fetch via replay; confirm hashes in crawl_resource() (Bryan Newbold, 2020-01-09; 1 file, -5/+40)
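  Hash confirmation likely amounts to recomputing SHA-1 over the replay
  body and comparing it against the CDX record; a sketch with
  hypothetical names, reusing the WaybackError stub from above:

      import hashlib

      def confirm_hash(cdx_sha1hex, body):
          """Hypothetical check after a replay fetch: the bytes we got
          back must match the hash the CDX index claims."""
          actual = hashlib.sha1(body).hexdigest()
          if actual != cdx_sha1hex:
              raise WaybackError(
                  "replay body hash mismatch: {} != {}".format(
                      actual, cdx_sha1hex))
          return body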
* wrap up basic (locally testable) ingest refactor (Bryan Newbold, 2020-01-09; 1 file, -19/+23)
* more wayback and SPN tests and fixes (Bryan Newbold, 2020-01-09; 1 file, -38/+152)
* refactor CdxApiClient, add tests (Bryan Newbold, 2020-01-08; 1 file, -40/+130)

  - always use auth token and get full CDX rows
  - simplify to "fetch" (exact url/dt match) and "lookup_best" methods
  - all redirect stuff will be moved to a higher level
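  A skeleton of the two-method shape the commit describes, against the
  public IA CDX API; the exact query parameters, auth wiring, and "best"
  selection here are assumptions:

      import requests

      class CdxApiClient:
          def __init__(self, host_url="https://web.archive.org/cdx/search/cdx"):
              self.host_url = host_url
              self.session = requests.Session()

          def _query(self, params):
              resp = self.session.get(self.host_url, params=params)
              resp.raise_for_status()
              rows = resp.json()
              # with output=json, the first row is a header row
              return rows[1] if len(rows) > 1 else None

          def fetch(self, url, datetime):
              """Exact url/datetime match; one full CDX row or None."""
              return self._query({
                  'url': url, 'from': datetime, 'to': datetime,
                  'matchType': 'exact', 'limit': 1, 'output': 'json'})

          def lookup_best(self, url):
              """Best capture for a URL; real 'best' ranking is more
              involved, and redirect handling lives a level up."""
              return self._query({
                  'url': url, 'matchType': 'exact',
                  'filter': 'statuscode:200', 'limit': 1, 'output': 'json'})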
* refactor SavePaperNowClient and add test (Bryan Newbold, 2020-01-07; 1 file, -28/+154)

  - response as a namedtuple
  - "remote" errors (aka, SPN API was HTTP 200 but returned an error)
    aren't an exception
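  A sketch of the namedtuple-response idea; the field names and the SPN
  JSON keys used here are assumptions:

      from collections import namedtuple

      SavePageNowResult = namedtuple(
          'SavePageNowResult', ['success', 'status', 'job_id', 'terminal_url'])

      def interpret_spn_response(resp_json):
          """Hypothetical: the SPN API answered HTTP 200 but may still
          report a 'remote' failure; return a result, don't raise."""
          if resp_json.get('status') == 'error':
              return SavePageNowResult(
                  False, resp_json.get('status_ext', 'error'),
                  resp_json.get('job_id'), None)
          return SavePageNowResult(
              True, 'success', resp_json.get('job_id'),
              resp_json.get('original_url'))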
* remove SPNv1 code paths (Bryan Newbold, 2020-01-07; 1 file, -35/+1)
* handle SPNv1 redirect loop (Bryan Newbold, 2019-11-14; 1 file, -0/+2)
* handle SPNv2 polling timeout (Bryan Newbold, 2019-11-14; 1 file, -6/+10)
* status_forcelist is on the session, not the request (Bryan Newbold, 2019-11-13; 1 file, -2/+2)
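  The fix points at how requests retries are configured: status_forcelist
  belongs to a urllib3 Retry mounted on the Session via an HTTPAdapter,
  not to an individual request call. A minimal sketch (the retry counts
  here are arbitrary):

      import requests
      from requests.adapters import HTTPAdapter
      from urllib3.util.retry import Retry

      session = requests.Session()
      retry = Retry(total=3, backoff_factor=3,
                    status_forcelist=[500, 502, 503, 504])
      adapter = HTTPAdapter(max_retries=retry)
      # retry behavior hangs off the session's adapters, not the request
      session.mount('http://', adapter)
      session.mount('https://', adapter)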
* handle SPNv1 remote server HTTP status codes better (Bryan Newbold, 2019-11-13; 1 file, -8/+15)
* handle requests (http) redirect loop from wayback (Bryan Newbold, 2019-11-13; 1 file, -1/+4)
* clean up redirect-following CDX API path (Bryan Newbold, 2019-11-13; 1 file, -8/+15)
* have SPN client differentiate between SPN and remote errors (Bryan Newbold, 2019-11-13; 1 file, -2/+10)

  This is only a partial implementation. The requests client will still
  make way too many SPN requests trying to figure out if this is a real
  error or not (eg, if remote was a 502, we'll retry many times). We may
  just want to switch to SPNv2 for everything.
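  The distinction might look like this: a failure of the SPN service
  itself raises, while the remote host failing (eg, an upstream 502) is
  returned as an unsuccessful result. All names here are hypothetical:

      class SavePageNowError(Exception):
          """The SPN service itself failed (not the remote host)."""

      def interpret_spn_response(http_status, remote_error):
          """remote_error: whatever the SPN response reports about the
          upstream host (hypothetical parameter)."""
          if http_status != 200:
              # the SPN API endpoint is unhealthy: raise
              raise SavePageNowError("SPN API HTTP {}".format(http_status))
          if remote_error:
              # the *remote* server failed: a normal, unsuccessful result
              return {'success': False, 'error': remote_error}
          return {'success': True}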
* more progress on file ingest (Bryan Newbold, 2019-11-13; 1 file, -6/+17)
* much progress on file ingest path (Bryan Newbold, 2019-10-22; 1 file, -15/+73)
* lots of grobid tool implementation (still WIP) (Bryan Newbold, 2019-09-26; 1 file, -0/+135)