| Commit message | Author | Age | Files | Lines |
... | |
|
|
|
|
|
|
| |
This might be the source of some `spn2-cdx-lookup-failure` errors.
Wayback/CDX does this check via full-on SURT, with many more changes,
and potentially we should be doing that here as well.
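A minimal sketch of the difference, assuming the `surt` library (the
`clean_url` helper is illustrative, not sandcrawler's actual code):

    from surt import surt

    def clean_url(url: str) -> str:
        # the light-touch normalization applied here before CDX lookup
        return url.strip()

    # full SURT canonicalization applies many more transformations,
    # e.g. lowercasing, host reversal, and query-argument sorting:
    print(surt("http://WWW.Example.com/path?b=2&a=1"))
    # -> something like 'com,example)/path?a=1&b=2'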
|
|
|
|
|
|
|
| |
This code path is only working about 1 in 7 times in production. Going to
try with a much shorter retry delay and see whether we have any success
with that. Considering also just disabling this attempt altogether and
relying on retries after hours/days.
|
| |
|
| |
|
|
|
|
|
|
|
| |
This reverts commit 92bf9bc28ac0eacab2e06fa3b25b52f0882804c2.
In practice, in prod, this resulted in much larger
spn2-cdx-lookup-failure error rates.
|
|
|
|
|
|
|
|
| |
As we are moving towards just retrying entire ingest requests, we should
probably just make this zero. But until then we should give SPN CDX a
small chance to sync before giving up.
This change is expected to improve overall throughput.
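For context, a minimal sketch of the lookup-with-retries pattern being
tuned here (function and parameter names are illustrative, not
sandcrawler's actual API):

    import time

    def cdx_lookup_with_retries(lookup, url, attempts=2, delay=5.0):
        # `lookup` is an assumed callable returning a CDX row or None
        for i in range(attempts):
            result = lookup(url)
            if result is not None:
                return result
            if i + 1 < attempts:
                time.sleep(delay)  # give SPN CDX a small chance to sync
        return None

Dropping to attempts=1 corresponds to the "just make this zero" option
above.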
|
|
|
|
|
| |
For clarity. The SPNv2 API hasn't changed; this just renames the
variable/parameter.
|
|
|
|
|
| |
Hoping this will increase crawling throughput with little-to-no impact
on fidelity.
|
| |
|
|
|
|
| |
For speed and efficiency.
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
| |
Did this for all the other "client" helpers, but forgot to do it for
wayback replay.
Was starting to get "445" errors from wayback.
|
| |
|
|
|
|
|
|
| |
Fixes AttributeError: 'NoneType' object has no attribute 'strip'
Seen in production on the lookup_resource code path.
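The general shape of the fix, as a sketch (`clean_value` is an
illustrative name, not the actual helper):

    def clean_value(value):
        # value can be None on the lookup_resource path; calling
        # None.strip() raises the AttributeError above
        if value is None:
            return None
        return value.strip()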
|
|
|
|
|
|
| |
Some 'cdx-error' results were due to URLs with ':' after the hostname or
trailing newline ("\n") characters in the URL. This attempts to work
around this category of error.
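A sketch of this category of workaround (the helper name is
illustrative):

    from urllib.parse import urlsplit, urlunsplit

    def fix_cdx_url(url: str) -> str:
        url = url.strip()  # drop trailing "\n" and other whitespace
        parts = urlsplit(url)
        # "example.com:" (colon but no port number) -> "example.com"
        netloc = parts.netloc.rstrip(':')
        return urlunsplit((parts.scheme, netloc, parts.path,
                           parts.query, parts.fragment))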
|
| |
|
|
|
|
|
|
|
|
|
| |
I hoped this feature would make it possible to crawl journals.lww.com
PDFs, because the token URLs work with `wget`, but it still doesn't seem
to work. Maybe because of the user agent?
Anyway, this feature might be useful for crawling efficiency, so adding
it to master.
|
|
|
|
|
|
| |
Some redirects are host-local. This patch crudely detects these
(full-path redirects starting with "/" only) and joins the redirect path
onto the host of the original URL.
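The stdlib join does the heavy lifting; a minimal sketch (the URLs are
made-up examples):

    from urllib.parse import urljoin

    original = "https://example.com/article/123"
    location = "/pdf/123.pdf"  # host-local, full-path redirect
    if location.startswith('/'):
        resolved = urljoin(original, location)
        # -> "https://example.com/pdf/123.pdf"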
|
| |
|
| |
|
| |
|
| |
|
|
|
|
|
|
| |
But only if it matches what the revisit record indicated.
This is mostly to enable better revisit fetching.
|
| |
|
| |
|
|
|
|
|
| |
A lot of the terminal-bad-status results seem to have been due to not
handling revisits correctly. Revisits have status_code = '-' or None.
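A sketch of the special-casing this implies (the field name is an
assumption):

    cdx_row = {'http_status': '-'}  # example revisit row
    raw = cdx_row.get('http_status')
    if raw in ('-', '', None):
        status_code = None  # revisit: real status is on the original capture
    else:
        status_code = int(raw)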
|
|
|
|
| |
Sometimes we seem to get an empty string instead of an empty JSON list.
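A sketch of tolerating that response shape:

    import json

    body = ""  # what the API sometimes returns instead of "[]"
    rows = json.loads(body) if body.strip() else []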
|
| |
|
| |
|
|
|
|
| |
This is a different form of mangled redirect.
|
| |
|
|
|
|
|
| |
If there is a UTC timestamp with a trailing 'Z' indicating the timezone,
that is valid but increases the string length by one.
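For example, parsing both forms (note that `datetime.fromisoformat()`
rejected the 'Z' suffix before Python 3.11):

    from datetime import datetime

    ts = "2020-01-16T04:36:30Z"  # one character longer than the naive form
    dt = datetime.fromisoformat(ts[:-1] if ts.endswith('Z') else ts)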
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The goal here is to have SPNv2 requests back off when we get
back-pressure (usually caused by some sessions taking too long). Lack of
proper back-pressure is making it hard to turn up parallelism.
This is a hack because we still time out and drop the slow request. A
better way is probably to have a background thread run while the
KafkaPusher thread does the polling, maybe with timeouts to detect slow
processing (greater than 30 seconds?) and only pause/resume in that
case. This would also make taking batches easier. Unlike the existing
code, however, the parallelism needs to happen at the Pusher level to do
the polling (Kafka) and "await" (for all worker threads to complete)
correctly.
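A rough sketch of that pause/resume idea, assuming the confluent-kafka
client (the topic, group, and `process()` handler are illustrative, not
the actual KafkaPusher code):

    import queue
    import threading
    from confluent_kafka import Consumer

    work_q = queue.Queue(maxsize=50)  # bounded queue provides back-pressure

    def process(msg):
        # assumed per-request handler (e.g. an SPNv2 capture attempt)
        pass

    def worker():
        while True:
            msg = work_q.get()
            process(msg)
            work_q.task_done()

    threading.Thread(target=worker, daemon=True).start()

    consumer = Consumer({'bootstrap.servers': 'localhost:9092',
                         'group.id': 'ingest'})
    consumer.subscribe(['ingest-requests'])
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue  # (msg.error() handling omitted for brevity)
        try:
            work_q.put_nowait(msg)
        except queue.Full:
            # back-pressure: stop fetching until workers drain the queue
            consumer.pause(consumer.assignment())
            work_q.put(msg)  # blocks until there is room
            consumer.resume(consumer.assignment())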
|
|
|
|
|
|
|
|
|
| |
This is within GWB wayback code. Trying two things:
- bump the default max headers from 100 to 1000 in the (global?)
  http.client module itself (sketched below). I didn't think through
  whether we should expect this to actually work
- catch the exception, record it, and move on
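The first option is a known stdlib workaround: `_MAXHEADERS` is a
private module-level constant in `http.client`, so overriding it affects
every HTTP connection in the process:

    import http.client

    # default is 100; responses with more headers raise
    # http.client.HTTPException: got more than 100 headers
    http.client._MAXHEADERS = 1000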
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
| |
In prod, ran into a redirect URL like:
b'/web/20200116043630id_/https://mediarep.org/bitstream/handle/doc/1127/Barth\xe9l\xe9my_2015_Life_and_Technology.pdf;jsessionid=A9EFB2798846F5E14A8473BBFD6AB46C?sequence=1'
which broke requests.
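One way to make such a Location value safe for `requests` is to
percent-encode the stray non-ASCII bytes (a sketch, not necessarily the
exact fix applied here):

    from urllib.parse import quote

    raw = b'/web/20200116043630id_/https://mediarep.org/bitstream/handle/doc/1127/Barth\xe9l\xe9my_2015_Life_and_Technology.pdf;jsessionid=A9EFB2798846F5E14A8473BBFD6AB46C?sequence=1'
    safe = quote(raw, safe=':/;?&=%')  # percent-encodes the \xe9 bytes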
|
| |
|
|
|
|
|
| |
Returns the *original* CDX record, but keeps the terminal_url and
terminal_sha1hex info.
|
| |
|
|
|
|
|
|
|
| |
- supporting revisits means more wayback hits (fewer crawls) => faster
- ... but this is only partial support. will also need to work through
sandcrawler db schema, etc. current status should be safe to merge/use.
- ftp support via treating an ftp hit as a 200 (see the sketch below)
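A sketch of that FTP special case (variable names are illustrative):

    terminal_url = "ftp://example.com/pub/paper.pdf"  # example FTP capture
    raw_status = '-'  # FTP rows carry no HTTP status
    if terminal_url.startswith('ftp://') and raw_status in ('-', '', None):
        status_code = 200  # treat a successful FTP hit as a 200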
|