path: root/python/sandcrawler/ia.py
Commit message | Author | Age | Files | Lines
* spn: avoid 'None' job_id | Bryan Newbold | 2021-10-11 | 1 | -2/+2
    Thanks Vanglis for reporting these. Not sure this commit fixes *all* instances of the problem.
* ingest CDX lookup: weigh year+month of capture against in-petabox-or-not | Bryan Newbold | 2021-09-30 | 1 | -0/+1
    This is to try working around an issue where ingests fail because an SPN capture is much newer, but the old sorting preference ignored that. Note that the sorting logic is pretty busted anyways, and we should probably allow returning multiple matching files to try.
* tune SPN CDX retry/wait depending on mode (priority vs daily) | Bryan Newbold | 2021-09-30 | 1 | -2/+4
* crawl: SPN2 non-200 success code path | Bryan Newbold | 2021-07-13 | 1 | -11/+25
* crawl: SPN self-redirect hack | Bryan Newbold | 2021-07-13 | 1 | -0/+9
* crawl: small comment updates | Bryan Newbold | 2021-07-13 | 1 | -3/+6
* ia CDX: handle bad CDX rows | Bryan Newbold | 2021-01-05 | 1 | -2/+4
* spn: more status codes | Bryan Newbold | 2020-12-21 | 1 | -1/+2
* handle more wayback error conditions | Bryan Newbold | 2020-11-20 | 1 | -0/+6
* spn 'forbidden' status code | Bryan Newbold | 2020-11-12 | 1 | -1/+1
* spn2-internal-server-error is a problem with remote server, not SPN2 | Bryan Newbold | 2020-11-08 | 1 | -0/+2
* ingest: better non-full URL fixup | Bryan Newbold | 2020-11-08 | 1 | -4/+3
* direct some more warnings to sys.stderr, not stdout | Bryan Newbold | 2020-11-08 | 1 | -2/+2
* html: handle no-capture for sub-resources | Bryan Newbold | 2020-11-08 | 1 | -0/+3
* ingest: fix null-body case | Bryan Newbold | 2020-11-08 | 1 | -0/+2
    Broke this in earlier refactor.
* ia: use newer gwb (petabox) loading class | Bryan Newbold | 2020-11-04 | 1 | -5/+8
    This fixes zstandard WARC reading.
* move transfer encoding helper to sandcrawler/ia.py | Bryan Newbold | 2020-11-03 | 1 | -1/+26
* ingest: tweak debug printing alignment | Bryan Newbold | 2020-11-03 | 1 | -8/+7
* cdx: fix 'closest' support | Bryan Newbold | 2020-11-03 | 1 | -3/+2
* cdx: add support for 'closest' time parameter | Bryan Newbold | 2020-10-30 | 1 | -3/+9
* ingest: decrease CDX timeout retries again | Bryan Newbold | 2020-10-22 | 1 | -1/+1
* differentiate wayback-error from wayback-content-error | Bryan Newbold | 2020-10-21 | 1 | -10/+13
    The motivation here is to distinguish errors due to current content in wayback (eg, in WARCs) from operational errors (eg, wayback machine is down, or network failures/disruption).
* SPN CDX delay now seems reasonable; increase to 40sec to catch most | Bryan Newbold | 2020-10-19 | 1 | -1/+1
* CDX: when retrying, do so every 3 seconds up to limit | Bryan Newbold | 2020-10-19 | 1 | -5/+9
* SPN: more verbose status logging | Bryan Newbold | 2020-10-19 | 1 | -0/+4
* CDX: revert post-SPN CDX lookup retry to 10 seconds | Bryan Newbold | 2020-10-19 | 1 | -1/+1
    Hoping to have many fewer SPN requests and issues, so willing to wait longer for each.
* ingest: catch wayback-fail-after-SPN as separate status | Bryan Newbold | 2020-10-19 | 1 | -4/+17
* SPN: better log line when starting a request | Bryan Newbold | 2020-10-19 | 1 | -0/+1
* SPN: look for non-200 CDX responses | Bryan Newbold | 2020-10-19 | 1 | -1/+1
    Suspect that this has been the source of many `spn2-cdx-lookup-failure`
* SPN: better check for partial URLs returned | Bryan Newbold | 2020-10-19 | 1 | -2/+2
* CDX fetch: more permissive fuzzy/normalization check | Bryan Newbold | 2020-10-19 | 1 | -3/+9
    This might be the source of some `spn2-cdx-lookup-failure`. Wayback/CDX does this check via full-on SURT, with many more changes, and potentially we should be doing that here as well.
* ingest: experimentally reduce CDX API retry delay | Bryan Newbold | 2020-10-17 | 1 | -1/+1
    This code path is only working about 1/7 times in production. Going to try with a much shorter retry delay and see if we get no success with that. Considering also just disabling this attempt altogether and relying on retries after hours/days.
* ingest: handle cookieAbsent and partial SPNv2 URL response cases better | Bryan Newbold | 2020-10-17 | 1 | -0/+31
* store no-capture URLs in terminal_url | Bryan Newbold | 2020-10-12 | 1 | -1/+1
* Revert "ingest: reduce CDX retry_sleep to 3.0 sec (after SPN)" | Bryan Newbold | 2020-08-11 | 1 | -1/+1
    This reverts commit 92bf9bc28ac0eacab2e06fa3b25b52f0882804c2. In practice, in prod, this resulted in much larger spn2-cdx-lookup-failure error rates.
* ingest: reduce CDX retry_sleep to 3.0 sec (after SPN) | Bryan Newbold | 2020-08-11 | 1 | -1/+1
    As we are moving towards just retrying entire ingest requests, we should probably just make this zero. But until then we should give SPN CDX a small chance to sync before giving up. This change is expected to improve overall throughput.
* refactor: force_get -> force_simple_get | Bryan Newbold | 2020-08-11 | 1 | -5/+5
    For clarity. The SPNv2 API hasn't changed, just changing the variable/parameter name.
* spn2: skip js behavior (experiment) | Bryan Newbold | 2020-08-05 | 1 | -0/+1
    Hoping this will increase crawling throughput with little-to-no impact on fidelity.
* SPN2: ensure not fetching outlinks | Bryan Newbold | 2020-08-05 | 1 | -0/+1
* use new SPNv2 'skip_first_archive' param | Bryan Newbold | 2020-07-22 | 1 | -0/+1
    For speed and efficiency.
* report revisit non-200 as a WaybackError | Bryan Newbold | 2020-06-26 | 1 | -7/+7
* pdf: mypy and typo fixes | Bryan Newbold | 2020-06-17 | 1 | -1/+1
* ingest: quick hack to capture CNKI outlinks | Bryan Newbold | 2020-04-13 | 1 | -2/+9
* ia: set User-Agent for replay fetch from wayback | Bryan Newbold | 2020-03-29 | 1 | -0/+5
    Did this for all the other "client" helpers, but forgot to for wayback replay. Was starting to get "445" errors from wayback.
* ingest: better spn2 pending error code | Bryan Newbold | 2020-03-27 | 1 | -0/+2
* ia: more conservative use of clean_url() | Bryan Newbold | 2020-03-24 | 1 | -3/+5
    Fixes `AttributeError: 'NoneType' object has no attribute 'strip'`, seen in production on the lookup_resource code path.
* ingest: clean_url() in more places | Bryan Newbold | 2020-03-23 | 1 | -1/+4
    Some 'cdx-error' results were due to URLs with ':' after the hostname or trailing newline ("\n") characters in the URL. This attempts to work around this category of error.
* ingest: log every URL (from ia code side) | Bryan Newbold | 2020-03-18 | 1 | -0/+1
* implement (unused) force_get flag for SPN2 | Bryan Newbold | 2020-03-18 | 1 | -3/+4
    I hoped this feature would make it possible to crawl journals.lww.com PDFs, because the token URLs work with `wget`, but it still doesn't seem to work. Maybe because of user agent? Anyways, this feature might be useful for crawling efficiency, so adding to master.
* work around local redirect (resource.location) | Bryan Newbold | 2020-03-17 | 1 | -1/+6
    Some redirects are host-local. This patch crudely detects this (full-path redirects starting with "/" only), and appends the redirect path to the host of the original URL.
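
The host-local redirect workaround described in the last entry can be sketched roughly as below. This is an illustration only, not the code from ia.py: the function name `fix_local_redirect` and its parameters are hypothetical, and skipping protocol-relative ("//host/...") values is an extra assumption not stated in the commit message.

```python
from urllib.parse import urlsplit


def fix_local_redirect(original_url: str, location: str) -> str:
    """Hypothetical helper: resolve a host-local redirect target.

    If `location` is a full-path redirect (starts with a single "/"),
    prepend the scheme and host of `original_url`. Anything else is
    returned unchanged.
    """
    # Assumption: "//host/path" is protocol-relative, not host-local, so skip it.
    if location.startswith("/") and not location.startswith("//"):
        parts = urlsplit(original_url)
        if parts.scheme and parts.netloc:
            return f"{parts.scheme}://{parts.netloc}{location}"
    return location


if __name__ == "__main__":
    print(fix_local_redirect("https://example.com/article/123?x=1", "/pdf/123.pdf"))
    print(fix_local_redirect("https://example.com/article/123", "https://cdn.example.org/a.pdf"))
```

The standard library's `urllib.parse.urljoin(original_url, location)` resolves these and other relative forms more generally; the narrower check above only mirrors the "starting with '/' only" detection that the commit message describes.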