path: root/python/sandcrawler/ia.py
Commit message | Author | Age | Files | Lines
...
* store no-capture URLs in terminal_url | Bryan Newbold | 2020-10-12 | 1 | -1/+1
|
* Revert "ingest: reduce CDX retry_sleep to 3.0 sec (after SPN)" | Bryan Newbold | 2020-08-11 | 1 | -1/+1
| This reverts commit 92bf9bc28ac0eacab2e06fa3b25b52f0882804c2. In practice, in prod, this resulted in much larger spn2-cdx-lookup-failure error rates.
* ingest: reduce CDX retry_sleep to 3.0 sec (after SPN) | Bryan Newbold | 2020-08-11 | 1 | -1/+1
| As we are moving towards just retrying entire ingest requests, we should probably just make this zero. But until then we should give SPN CDX a small chance to sync before giving up. This change is expected to improve overall throughput.
* refactor: force_get -> force_simple_get | Bryan Newbold | 2020-08-11 | 1 | -5/+5
| For clarity. The SPNv2 API hasn't changed, just changing the variable/parameter name.
* spn2: skip js behavior (experiment) | Bryan Newbold | 2020-08-05 | 1 | -0/+1
| Hoping this will increase crawling throughput with little-to-no impact on fidelity.
* SPN2: ensure not fetching outlinks | Bryan Newbold | 2020-08-05 | 1 | -0/+1
|
* use new SPNv2 'skip_first_archive' param | Bryan Newbold | 2020-07-22 | 1 | -0/+1
| For speed and efficiency.
* report revisit non-200 as a WaybackError | Bryan Newbold | 2020-06-26 | 1 | -7/+7
|
* pdf: mypy and typo fixes | Bryan Newbold | 2020-06-17 | 1 | -1/+1
|
* ingest: quick hack to capture CNKI outlinks | Bryan Newbold | 2020-04-13 | 1 | -2/+9
|
* ia: set User-Agent for replay fetch from wayback | Bryan Newbold | 2020-03-29 | 1 | -0/+5
| Did this for all the other "client" helpers, but forgot to for wayback replay. Was starting to get "445" errors from wayback.
* ingest: better spn2 pending error code | Bryan Newbold | 2020-03-27 | 1 | -0/+2
|
* ia: more conservative use of clean_url() | Bryan Newbold | 2020-03-24 | 1 | -3/+5
| Fixes AttributeError: 'NoneType' object has no attribute 'strip'. Seen in production on the lookup_resource code path.
* ingest: clean_url() in more places | Bryan Newbold | 2020-03-23 | 1 | -1/+4
| Some 'cdx-error' results were due to URLs with ':' after the hostname or trailing newline ("\n") characters in the URL. This attempts to work around this category of error.
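The two failure modes named above (a stray ':' after the hostname, and trailing newline characters) can be illustrated with a minimal sketch. The function name `clean_url_sketch` is hypothetical and this is not the project's actual `clean_url()` implementation; it only assumes the behavior the messages describe, including passing `None` through rather than calling `.strip()` on it.

```python
from typing import Optional
from urllib.parse import urlsplit


def clean_url_sketch(url: Optional[str]) -> Optional[str]:
    """Hypothetical illustration only; not the real clean_url()."""
    # Pass None through instead of calling .strip() on it
    if url is None:
        return None
    # Drop surrounding whitespace, including trailing "\n"
    url = url.strip()
    parts = urlsplit(url)
    # "example.com:" has a ':' after the hostname but no port number
    if parts.netloc.endswith(":"):
        parts = parts._replace(netloc=parts.netloc[:-1])
    return parts.geturl()
```

For example, `clean_url_sketch("https://example.com:/paper.pdf\n")` yields a URL without the stray ':' or the newline.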
* ingest: log every URL (from ia code side) | Bryan Newbold | 2020-03-18 | 1 | -0/+1
|
* implement (unused) force_get flag for SPN2 | Bryan Newbold | 2020-03-18 | 1 | -3/+4
| I hoped this feature would make it possible to crawl journals.lww.com PDFs, because the token URLs work with `wget`, but it still doesn't seem to work. Maybe because of user agent? Anyways, this feature might be useful for crawling efficiency, so adding to master.
* work around local redirect (resource.location) | Bryan Newbold | 2020-03-17 | 1 | -1/+6
| Some redirects are host-local. This patch crudely detects this (full-path redirects starting with "/" only), and appends the URL to the host of the original URL.
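A minimal sketch of the host-local redirect workaround described above. The helper name `resolve_local_redirect` is hypothetical, and the code is an assumption about the approach, not the patch itself.

```python
from urllib.parse import urlsplit


def resolve_local_redirect(original_url: str, location: str) -> str:
    """Hypothetical helper: rebuild a full URL from a host-local redirect."""
    # Crude check, as in the message: only full-path redirects starting with "/"
    if location.startswith("/"):
        parts = urlsplit(original_url)
        # Append the redirect path to the scheme and host of the original URL
        return f"{parts.scheme}://{parts.netloc}{location}"
    # Anything else is assumed to already be an absolute URL
    return location
```

The standard library's `urllib.parse.urljoin` covers more cases (relative paths, protocol-relative URLs) and may be preferable where the crude check is not enough.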
* ia: catch wayback ChunkedEncodingError | Bryan Newbold | 2020-03-05 | 1 | -0/+3
|
* fix warc_offset -> offset | Bryan Newbold | 2020-02-24 | 1 | -1/+1
|
* ingest: handle broken revisit records | Bryan Newbold | 2020-02-24 | 1 | -1/+4
|
* ingest: treat CDX lookup error as a wayback-error | Bryan Newbold | 2020-02-24 | 1 | -1/+4
|
* fetch_petabox_body: allow non-200 status code fetches | Bryan Newbold | 2020-02-24 | 1 | -2/+10
| But only if it matches what the revisit record indicated. This is mostly to enable better revisit fetching.
* allow fuzzy revisit matches | Bryan Newbold | 2020-02-24 | 1 | -1/+26
|
* ingest: more revisit fixes | Bryan Newbold | 2020-02-22 | 1 | -4/+4
|
* ia: improve warc/revisit implementation | Bryan Newbold | 2020-02-22 | 1 | -26/+46
| A lot of the terminal-bad-status seems to have been due to not handling revisits correctly. They have status_code = '-' or None.
* cdx: handle empty/null CDX response | Bryan Newbold | 2020-02-22 | 1 | -0/+2
| Sometimes we seem to get an empty string instead of an empty JSON list.
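One way to tolerate an empty or null response body is sketched below. The function name `parse_cdx_json` is hypothetical and the response shape (a JSON list of rows) is an assumption; this is not the code from the commit.

```python
import json
from typing import List


def parse_cdx_json(body: str) -> List[list]:
    """Hypothetical helper: treat empty or null CDX bodies as 'no rows'."""
    # An empty string is not valid JSON, so check before parsing
    if not body or not body.strip():
        return []
    rows = json.loads(body)
    # A JSON null (None) or an empty list both mean no results
    return rows or []
```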
* filter out CDX rows missing WARC playback fields | Bryan Newbold | 2020-02-19 | 1 | -0/+4
|
* X-Archive-Src more robust than X-Archive-Redirect-Reason | Bryan Newbold | 2020-02-18 | 1 | -2/+3
|
* wayback: on bad redirects, log instead of assert | Bryan Newbold | 2020-02-18 | 1 | -2/+13
| This is a different form of mangled redirect.
* attempt to work around corrupt ARC files from alexa issue | Bryan Newbold | 2020-02-18 | 1 | -0/+5
|
* handle alternative dt format in WARC headers | Bryan Newbold | 2020-02-05 | 1 | -2/+4
| If there is a UTC timestamp, with trailing 'Z' indicating timezone, that is valid but increases string length by one.
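The length difference can be handled by stripping the optional 'Z' before parsing. The sketch below assumes an ISO-8601 style `WARC-Date` value and a 14-digit wayback-style timestamp as the target; both the function name and that target format are assumptions, not taken from the commit.

```python
from datetime import datetime


def warc_date_to_wayback_dt(warc_date: str) -> str:
    """Hypothetical helper: accept WARC dates with or without a trailing 'Z'."""
    # "2020-01-16T04:36:30Z" is one character longer than "2020-01-16T04:36:30"
    if warc_date.endswith("Z"):
        warc_date = warc_date[:-1]
    parsed = datetime.strptime(warc_date, "%Y-%m-%dT%H:%M:%S")
    # Emit a 14-digit timestamp (YYYYMMDDhhmmss)
    return parsed.strftime("%Y%m%d%H%M%S")
```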
* decrease SPNv2 polling timeout to 3 minutes | Bryan Newbold | 2020-02-05 | 1 | -2/+2
|
* improvements to reliability from prod testing | Bryan Newbold | 2020-02-03 | 1 | -5/+11
|
* hack-y backoff ingest attempt | Bryan Newbold | 2020-02-03 | 1 | -2/+11
| The goal here is to have SPNv2 requests back off when we get back-pressure (usually caused by some sessions taking too long). Lack of proper back-pressure is making it hard to turn up parallelism.
| This is a hack because we still time out and drop the slow request. A better way is probably to have a background thread run, while the KafkaPusher thread does polling. Maybe with timeouts to detect slow processing (greater than 30 seconds?) and only pause/resume in that case. This would also make taking batches easier.
| Unlike the existing code, however, the parallelism needs to happen at the Pusher level to do the polling (Kafka) and "await" (for all worker threads to complete) correctly.
* wayback: try to resolve HTTPException due to many HTTP headers | Bryan Newbold | 2020-02-02 | 1 | -1/+9
| This is within GWB wayback code. Trying two things:
| - bump default max headers from 100 to 1000 in the (global?) http.client module itself. I didn't think through whether we would expect this to actually work
| - catch the exception, record it, move on
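The two attempts listed above can be sketched as follows. `_MAXHEADERS` is a private, module-level constant in CPython's `http.client`, so overriding it is an unsupported workaround that affects the whole process; the wrapper name `fetch_or_record` and the result dictionary are hypothetical.

```python
import http.client
from typing import Callable

# Private CPython constant (default 100); changing it is global to the process
http.client._MAXHEADERS = 1000


def fetch_or_record(fetch: Callable[[], bytes]) -> dict:
    """Hypothetical wrapper: catch the exception, record it, move on."""
    try:
        return {"status": "success", "body": fetch()}
    except http.client.HTTPException as exc:
        # Record the failure instead of crashing the worker
        return {"status": "wayback-error", "error": repr(exc)}
```

Whether the constant is honored depends on the HTTP stack in use, which matches the uncertainty expressed in the message itself.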
* fix WaybackError exception formatting | Bryan Newbold | 2020-01-28 | 1 | -1/+1
|
* fix elif syntax error | Bryan Newbold | 2020-01-28 | 1 | -1/+1
|
* clarify petabox fetch behavior | Bryan Newbold | 2020-01-28 | 1 | -3/+6
|
* wayback: replay redirects have X-Archive-Redirect-Reason | Bryan Newbold | 2020-01-21 | 1 | -2/+4
|
* handle UnicodeDecodeError in the other GET instance | Bryan Newbold | 2020-01-15 | 1 | -0/+2
|
* increase SPNv2 polling timeout to 4 minutes | Bryan Newbold | 2020-01-15 | 1 | -1/+3
|
* make failed replay fetch an error, not assert error | Bryan Newbold | 2020-01-15 | 1 | -1/+2
|
* wayback replay: catch UnicodeDecodeError | Bryan Newbold | 2020-01-15 | 1 | -0/+2
| In prod, ran into a redirect URL like:
| b'/web/20200116043630id_/https://mediarep.org/bitstream/handle/doc/1127/Barth\xe9l\xe9my_2015_Life_and_Technology.pdf;jsessionid=A9EFB2798846F5E14A8473BBFD6AB46C?sequence=1'
| which broke requests.
* pass through revisit_cdx | Bryan Newbold | 2020-01-15 | 1 | -5/+18
|
* fix revisit resolution | Bryan Newbold | 2020-01-15 | 1 | -4/+12
| Returns the *original* CDX record, but keeps the terminal_url and terminal_sha1hex info.
* SPNv2 doesn't support FTP; add a live test for non-revisit FTP | Bryan Newbold | 2020-01-14 | 1 | -0/+10
|
* basic FTP ingest support; revisit record resolution | Bryan Newbold | 2020-01-14 | 1 | -34/+77
| - supporting revisits means more wayback hits (fewer crawls) => faster
| - ... but this is only partial support. will also need to work through sandcrawler db schema, etc. current status should be safe to merge/use.
| - ftp support via treating an ftp hit as a 200
* better print() output | Bryan Newbold | 2020-01-10 | 1 | -3/+3
|
* fix redirect replay fetch method | Bryan Newbold | 2020-01-10 | 1 | -1/+4
|
* handle SPNv2-then-CDX lookup failures | Bryan Newbold | 2020-01-10 | 1 | -6/+23
| - use a 10 second delay if CDX result isn't immediately available. blech.
| - if there is a lookup failure, call it a wayback-error and move on
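The delay-then-give-up behavior in the last entry can be sketched as a single retry around the lookup. All names here (`lookup_with_retry`, the `lookup` callable, the result dictionary) are hypothetical; the `sleep` parameter is injectable only so the sketch can be exercised without actually waiting.

```python
import time
from typing import Callable, Optional


def lookup_with_retry(
    lookup: Callable[[], Optional[dict]],
    retry_sleep: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Hypothetical sketch: one delayed retry, then report a wayback-error."""
    row = lookup()
    if row is None:
        # Give the CDX index a chance to catch up after the SPNv2 capture
        sleep(retry_sleep)
        row = lookup()
    if row is None:
        # Lookup failure: call it a wayback-error and move on
        return {"status": "wayback-error", "cdx": None}
    return {"status": "success", "cdx": row}
```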