path: root/python/sandcrawler/ia.py
Commit message | Author | Date | Files | Lines
* sandcrawler: try to handle weird CDX API response | Bryan Newbold | 2022-11-01 | 1 | -0/+5
* ingest: don't prefer WARC over SPN so strongly | Bryan Newbold | 2022-10-24 | 1 | -1/+2
* spn2: fix tests by not retrying on HTTP 500 | Bryan Newbold | 2022-09-14 | 1 | -1/+3
* CDX: skip sha-256 digests | Bryan Newbold | 2022-07-25 | 1 | -1/+5
* cdx: tweak CDX lookups and resolution (sort) | Bryan Newbold | 2022-07-16 | 1 | -4/+7
* wayback: use same 5xx/4xx-allowing tricks for replay body fetch as for replay... | Bryan Newbold | 2022-07-15 | 1 | -7/+7
* cdx api: add another allowable URL fuzzy-match pattern (double slashes) | Bryan Newbold | 2022-07-15 | 1 | -0/+9
* spn2: handle case of re-attempting a recent crawl (race condition) | Bryan Newbold | 2022-07-15 | 1 | -0/+14
* cdx lookups: prioritize truely exact URL matches | Bryan Newbold | 2022-07-14 | 1 | -0/+1
* ingest: handle another type of wayback redirect | Bryan Newbold | 2022-07-14 | 1 | -2/+5
* wayback fetch: handle upstream 5xx replays | Bryan Newbold | 2022-07-13 | 1 | -4/+15
* ingest: catch more ConnectionErrors (SPN, replay fetch, GROBID) | Bryan Newbold | 2022-05-16 | 1 | -4/+10
* ingest spn2: fix tests | Bryan Newbold | 2022-05-05 | 1 | -1/+1
* SPNv2: several fixes for prod throughput | Bryan Newbold | 2022-04-26 | 1 | -11/+34
* file ingest: don't 'backoff' on spn2 backoff error | Bryan Newbold | 2022-03-22 | 1 | -0/+1
* spn: handle blocked-url (etc) better | Bryan Newbold | 2022-01-11 | 1 | -0/+10
* SPNv2: make 'resources' optional | Bryan Newbold | 2021-11-16 | 1 | -1/+1
* IA (wayback): actually use an HTTP session for replay fetches | Bryan Newbold | 2021-11-03 | 1 | -2/+3
* make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 1 | -257/+354
* fix type annotations for petabox body fetch helper | Bryan Newbold | 2021-10-26 | 1 | -1/+1
* lint collection membership (last lint for now) | Bryan Newbold | 2021-10-26 | 1 | -7/+7
* more progress on type annotations and linting | Bryan Newbold | 2021-10-26 | 1 | -2/+2
* ia: more tweaks to delicate code to satisfy type checker | Bryan Newbold | 2021-10-26 | 1 | -10/+12
* ia helpers: enforce max_redirects count correctly | Bryan Newbold | 2021-10-26 | 1 | -1/+1
* set CDX request params are str, not int or datetime | Bryan Newbold | 2021-10-26 | 1 | -3/+6
* bugfix: was setting 'from' parameter as a tuple, not a string | Bryan Newbold | 2021-10-26 | 1 | -1/+1
* start type annotating IA helper code | Bryan Newbold | 2021-10-26 | 1 | -37/+65
* flake8 clean (with current settings) | Bryan Newbold | 2021-10-26 | 1 | -2/+2
* start handling trivial lint cleanups: unused imports, 'is None', etc | Bryan Newbold | 2021-10-26 | 1 | -15/+14
* make fmt | Bryan Newbold | 2021-10-26 | 1 | -68/+124
* python: isort all imports | Bryan Newbold | 2021-10-26 | 1 | -10/+12
* move SPNv2 'simple_get' logic to SPN client | Bryan Newbold | 2021-10-15 | 1 | -2/+29
* spn: avoid 'None' job_id | Bryan Newbold | 2021-10-11 | 1 | -2/+2
* ingest CDX lookup: weigh year+month of capture against in-petabox-or-not | Bryan Newbold | 2021-09-30 | 1 | -0/+1
* tune SPN CDX retry/wait depending on mode (priority vs daily) | Bryan Newbold | 2021-09-30 | 1 | -2/+4
* crawl: SPN2 non-200 success code path | Bryan Newbold | 2021-07-13 | 1 | -11/+25
* crawl: SPN self-redirect hack | Bryan Newbold | 2021-07-13 | 1 | -0/+9
* crawl: small comment updates | Bryan Newbold | 2021-07-13 | 1 | -3/+6
* ia CDX: handle bad CDX rows | Bryan Newbold | 2021-01-05 | 1 | -2/+4
* spn: more status codes | Bryan Newbold | 2020-12-21 | 1 | -1/+2
* handle more wayback error conditions | Bryan Newbold | 2020-11-20 | 1 | -0/+6
* spn 'forbidden' status code | Bryan Newbold | 2020-11-12 | 1 | -1/+1
* spn2-internal-server-error is a problem with remote server, not SPN2 | Bryan Newbold | 2020-11-08 | 1 | -0/+2
* ingest: better non-full URL fixup | Bryan Newbold | 2020-11-08 | 1 | -4/+3
* direct some more warnings to sys.stderr, not stdout | Bryan Newbold | 2020-11-08 | 1 | -2/+2
* html: handle no-capture for sub-resources | Bryan Newbold | 2020-11-08 | 1 | -0/+3
* ingest: fix null-body case | Bryan Newbold | 2020-11-08 | 1 | -0/+2
* ia: use newer gwb (petabox) loading class | Bryan Newbold | 2020-11-04 | 1 | -5/+8
* move transfer encoding helper to sandcrawler/ia.py | Bryan Newbold | 2020-11-03 | 1 | -1/+26
* ingest: tweak debug printing alignment | Bryan Newbold | 2020-11-03 | 1 | -8/+7