Commit log (most recent first): commit message, author, date, files changed, lines removed/added.
...
* html: remove old citation_pdf_url code path (Bryan Newbold, 2022-07-15; 1 file, -32/+1)
    This code path doesn't check for 'skip' patterns, resulting in a bunch of bad CDX checks/errors.
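    A minimal sketch of the kind of skip-pattern guard the surviving code path relies on, assuming a simple tuple of URL prefixes checked before any CDX lookup (names like CDX_SKIP_PREFIXES and lookup_best are illustrative, not the actual sandcrawler identifiers):

        # Illustrative only: filter obviously-bad fulltext URL candidates before
        # spending a CDX API lookup (and a possible error) on them.
        CDX_SKIP_PREFIXES = (
            "https://doi.org/",
            "https://web.archive.org/",
        )

        def candidate_cdx_lookup(url: str, cdx_client):
            if any(url.startswith(prefix) for prefix in CDX_SKIP_PREFIXES):
                return None
            return cdx_client.lookup_best(url)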
* wayback: use same 5xx/4xx-allowing tricks for replay body fetch as for replay redirect (Bryan Newbold, 2022-07-15; 1 file, -7/+7)
* cdx api: add another allowable URL fuzzy-match pattern (double slashes) (Bryan Newbold, 2022-07-15; 1 file, -0/+9)
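    A hedged sketch of what a double-slash fuzzy-match rule can look like: treat a CDX row URL as matching the requested URL when the only difference is a repeated slash in the path (alongside the usual http/https swap). Function and field names are illustrative, not the actual sandcrawler code:

        import re

        def cdx_urls_fuzzy_match(request_url: str, row_url: str) -> bool:
            """Illustrative: equate 'https://host//path' with 'https://host/path',
            and ignore http vs https, when deciding whether a CDX row matches."""
            def normalize(url: str) -> str:
                scheme, sep, rest = url.partition("://")
                return "https://" + re.sub(r"/{2,}", "/", rest)
            return normalize(request_url) == normalize(row_url)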
* ingest: more bogus domain patterns (Bryan Newbold, 2022-07-15; 1 file, -0/+3)
* spn2: handle case of re-attempting a recent crawl (race condition) (Bryan Newbold, 2022-07-15; 1 file, -0/+14)
* html: fulltext URL prefixes to skip; also fix broken pattern matching (Bryan Newbold, 2022-07-15; 1 file, -4/+19)
    Due to both the 'continue-in-a-for-loop' and 'missing-trailing-commas', the existing pattern matching was not working.
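    Both failure modes described here are classic Python pitfalls; a minimal illustration (not the actual sandcrawler code) of how a missing trailing comma silently concatenates adjacent string literals into one useless prefix, and how a stray `continue` in the matching loop can skip checks entirely:

        # Bug: with the comma missing, Python concatenates the two adjacent string
        # literals into a single (useless) prefix, so neither URL is ever matched.
        BROKEN_PREFIXES = (
            "https://example.com/login"      # <-- missing comma merges these
            "https://example.com/captcha",
        )
        FIXED_PREFIXES = (
            "https://example.com/login",
            "https://example.com/captcha",
        )

        def matches_prefix(url: str, prefixes) -> bool:
            for prefix in prefixes:
                # Bug variant: a stray 'continue' before this check (e.g. guarding
                # some other condition) silently skips the startswith() test.
                if url.startswith(prefix):
                    return True
            return False

        assert not matches_prefix("https://example.com/captcha?id=1", BROKEN_PREFIXES)
        assert matches_prefix("https://example.com/captcha?id=1", FIXED_PREFIXES)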
* row2json script: fix argument type (Bryan Newbold, 2022-07-15; 1 file, -1/+1)
* row2json script: add flag to enable recrawling (Bryan Newbold, 2022-07-15; 1 file, -1/+8)
* ingest: another form of cookie block URL (Bryan Newbold, 2022-07-15; 1 file, -0/+2)
    This still doesn't short-cut the CDX lookup chain, because that is all pure redirects happening in ia.py.
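    A hedged sketch of this kind of check: flag the ingest as cookie-blocked when the terminal URL matches a known cookie-wall pattern. The specific patterns and the 'blocked-cookie' status string are illustrative, not necessarily the ones added in this commit:

        COOKIE_BLOCK_URL_PATTERNS = [
            "/cookieAbsent",
            "cookieSet=1",
        ]

        def detect_cookie_block(terminal_url: str):
            """Return a terminal ingest status if the URL looks like a cookie wall."""
            if any(pattern in terminal_url for pattern in COOKIE_BLOCK_URL_PATTERNS):
                return "blocked-cookie"
            return None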
* HTML ingest: most sub-resource patterns to skip (Bryan Newbold, 2022-07-15; 1 file, -1/+13)
* cdx lookups: prioritize truly exact URL matches (Bryan Newbold, 2022-07-14; 1 file, -0/+1)
    This hopefully resolves an issue causing many apparent redirect loops, which were actually timing or HTTP-status-code near-loops caused by http/https fuzzy matching in the CDX API, despite the API's "exact" lookup semantics.
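    A hedged sketch of what prioritizing truly exact matches can mean in practice: when the CDX API returns several candidate rows for an "exact" lookup, prefer a row whose original URL is byte-for-byte the requested URL before falling back to a fuzzy (e.g. http/https-swapped) row. Field names are illustrative:

        from typing import Optional

        def pick_best_cdx_row(request_url: str, rows: list) -> Optional[dict]:
            """Prefer a row whose 'url' field exactly equals the requested URL;
            otherwise fall back to the first fuzzy match (if any)."""
            for row in rows:
                if row.get("url") == request_url:
                    return row
            return rows[0] if rows else None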
* ingest: handle another type of wayback redirect (Bryan Newbold, 2022-07-14; 1 file, -2/+5)
* unpaywall crawl wrap-up notes (Bryan Newbold, 2022-07-14; 1 file, -2/+145)
* yet another bad PDF (Bryan Newbold, 2022-07-13; 1 file, -0/+1)
* wayback fetch: handle upstream 5xx replays (Bryan Newbold, 2022-07-13; 1 file, -4/+15)
* shorten default HTTP backoff factor (Bryan Newbold, 2022-07-13; 1 file, -1/+1)
    The existing factor was resulting in many-minute-long backoffs and Kafka timeouts.
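    For context: with urllib3/requests retry handling, the sleep before each retry grows roughly as backoff_factor * 2**(n-1), so a large factor quickly reaches multi-minute waits that can exceed Kafka consumer timeouts. A small, illustrative configuration (the actual value chosen in this commit may differ):

        import requests
        from requests.adapters import HTTPAdapter
        from urllib3.util.retry import Retry

        # With backoff_factor=3.0, the sleep between retries grows roughly as
        # 3 * 2**(n-1) seconds (a few seconds, doubling each retry), instead of
        # many minutes with a larger factor.
        retry = Retry(total=5, backoff_factor=3.0, status_forcelist=[500, 502, 503, 504])
        session = requests.Session()
        session.mount("https://", HTTPAdapter(max_retries=retry))
        session.mount("http://", HTTPAdapter(max_retries=retry))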
* ingest: random site PDF link pattern (Bryan Newbold, 2022-07-12; 1 file, -0/+7)
* ingest: doaj.org article landing page access links (Bryan Newbold, 2022-07-12; 2 files, -1/+12)
* ingest: targeted 2022-04 notes (Bryan Newbold, 2022-07-07; 1 file, -1/+3)
* stats: may 2022 ingest-by-domain stats (Bryan Newbold, 2022-07-07; 1 file, -0/+410)
* ingest: IEEE domain is blocking us (Bryan Newbold, 2022-07-07; 1 file, -1/+2)
* ingest: catch more ConnectionErrors (SPN, replay fetch, GROBID) (Bryan Newbold, 2022-05-16; 2 files, -4/+19)
* ingest: skip arxiv.org DOIs, we already direct-ingest (Bryan Newbold, 2022-05-11; 1 file, -0/+1)
* python make fmt (Bryan Newbold, 2022-05-05; 1 file, -3/+1)
* ingest spn2: fix tests (Bryan Newbold, 2022-05-05; 4 files, -6/+108)
* ingest: more loginwall patterns (Bryan Newbold, 2022-05-05; 1 file, -0/+3)
* ingest_tool: fix arg parsing (Bryan Newbold, 2022-05-03; 1 file, -8/+8)
* finished re-GROBID-ing (Bryan Newbold, 2022-05-03; 1 file, -5/+7)
* PDF URL lists update (Bryan Newbold, 2022-05-03; 2 files, -0/+76)
* some weekly crawl numbers (not very helpful) (Bryan Newbold, 2022-05-03; 1 file, -0/+191)
* switch default kafka-broker host from wbgrp-svc263 to wbgrp-svc350 (Bryan Newbold, 2022-05-03; 9 files, -14/+14)
* April 2022 sandcrawler DB stats (Bryan Newbold, 2022-04-27; 1 file, -0/+432)
* more dataset crawl notes (Bryan Newbold, 2022-04-26; 1 file, -0/+53)
* .ua crawling follow-up stats (Bryan Newbold, 2022-04-26; 1 file, -2/+2)
* update HBase Thrift gateway host (Bryan Newbold, 2022-04-26; 1 file, -1/+1)
* SPNv2: several fixes for prod throughput (Bryan Newbold, 2022-04-26; 1 file, -11/+34)
    Most importantly, for some API flags, if the value is not truthy, do not set the flag at all. Setting any flag was resulting in screenshots and outlinks actually getting created/captured, which was a huge slowdown. Also, check per-user SPNv2 slots available, using the API, before requesting an actual capture.
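    A hedged sketch of both behaviors: only include optional SPNv2 form fields when the value is truthy, and check the user's available capture slots before submitting. The endpoint paths and field names follow the public SPN2 API as I understand it and are assumptions, not a verified copy of sandcrawler's client:

        import requests

        def spn2_capture(url, access_key, secret_key, capture_outlinks=False, capture_screenshot=False):
            # Only send optional flags when truthy: the mere presence of the field
            # (even with a falsy value) was enough to trigger outlink/screenshot
            # capture, which slowed everything down.
            data = {"url": url}
            if capture_outlinks:
                data["capture_outlinks"] = "1"
            if capture_screenshot:
                data["capture_screenshot"] = "1"
            headers = {
                "Accept": "application/json",
                "Authorization": f"LOW {access_key}:{secret_key}",
            }

            # Check per-user capture slots before requesting an actual capture (assumed endpoint).
            status = requests.get("https://web.archive.org/save/status/user", headers=headers, timeout=30).json()
            if status.get("available", 0) <= 0:
                return {"status": "spn2-backoff", "reason": "no SPNv2 capture slots available"}

            resp = requests.post("https://web.archive.org/save", data=data, headers=headers, timeout=30)
            resp.raise_for_status()
            return resp.json()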
* make fmt (Bryan Newbold, 2022-04-26; 1 file, -2/+5)
* ingest_tool: spn-status command to check user's quota (Bryan Newbold, 2022-04-26; 1 file, -0/+19)
* flake8: allow 'Any' types (Bryan Newbold, 2022-04-26; 1 file, -1/+2)
* start notes on unpaywall and targeted/patch crawls (Bryan Newbold, 2022-04-20; 2 files, -0/+277)
* block isiarticles.com from future PDF crawls (Bryan Newbold, 2022-04-20; 1 file, -0/+2)
* pipenv: update; newer devpi hostname (Bryan Newbold, 2022-04-06; 2 files, -781/+850)
* ingest: drive.google.com ingest support (Bryan Newbold, 2022-04-04; 1 file, -0/+8)
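    For context, a common approach to ingesting drive.google.com links is to rewrite the viewer URL into a direct-download form before crawling. The rewrite below is a well-known public pattern and only an assumption about what this commit implements:

        import re

        def rewrite_gdrive_url(url: str):
            """Illustrative: turn https://drive.google.com/file/d/<id>/view into a
            direct-download URL that a PDF crawler can fetch."""
            m = re.match(r"https?://drive\.google\.com/file/d/([^/]+)", url)
            if m:
                return f"https://drive.google.com/uc?export=download&id={m.group(1)}"
            return None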
* .ua ingest notes (Bryan Newbold, 2022-04-04; 1 file, -0/+29)
* sql: add source/created index on ingest_request table (Bryan Newbold, 2022-04-04; 1 file, -0/+1)
* sql: fix reingest query missing type on LEFT JOIN; wrap in read-only transaction (Bryan Newbold, 2022-04-04; 5 files, -5/+27)
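    A hedged illustration of the two fixes named in the subject: join on both base_url and ingest_type (the previously missing join condition) and run the query inside a read-only transaction. Table and column names follow the sandcrawler schema loosely and are assumptions; this is not the actual query:

        import psycopg2

        conn = psycopg2.connect("dbname=sandcrawler")
        conn.set_session(readonly=True, autocommit=False)  # read-only transaction
        with conn.cursor() as cur:
            cur.execute("""
                SELECT req.ingest_type, req.base_url
                FROM ingest_request AS req
                LEFT JOIN ingest_file_result AS result
                    ON req.base_url = result.base_url
                    AND req.ingest_type = result.ingest_type  -- previously-missing join condition
                WHERE result.status IS NULL
                LIMIT 1000;
            """)
            rows = cur.fetchall()
        conn.rollback()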
* filesets: fix archive.org path naming (Bryan Newbold, 2022-03-29; 1 file, -7/+8)
* bugfix: sha1/md5 typo (Bryan Newbold, 2022-03-23; 1 file, -1/+1)
    Caught this prepping to ingest into fatcat. Derp!
* various ingest/task notes (Bryan Newbold, 2022-03-22; 4 files, -5/+97)
* file ingest: don't 'backoff' on spn2 backoff error (Bryan Newbold, 2022-03-22; 2 files, -0/+8)
    The intent of this is to try and get through the daily ingest requests faster, so we can loop and retry if needed. A 200-second delay, usually resulting in a Kafka topic reshuffle, really slows things down. This will presumably result in a bunch of spn2-backoff status requests, but we can just retry those.
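    A minimal sketch of the behavior change described, assuming a worker loop consuming ingest requests from Kafka (function and status names are illustrative):

        def process_ingest_request(request, spn_client):
            result = spn_client.crawl(request["base_url"])
            if result.get("status") == "spn2-backoff":
                # Previously: sleep ~200 seconds here, which stalled the Kafka consumer
                # long enough to trigger a group rebalance ("topic reshuffle").
                # Now: just record the status, keep consuming, and retry these
                # requests on a later pass.
                return {"status": "spn2-backoff", "request": request}
            return {"status": result.get("status", "success"), "request": request}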