path: root/python/sandcrawler
Commit message (author, date, files changed, lines -/+)
* wayback: use same 5xx/4xx-allowing tricks for replay body fetch as for replay redirect (Bryan Newbold, 2022-07-15, 1 file, -7/+7)
* cdx api: add another allowable URL fuzzy-match pattern (double slashes) (Bryan Newbold, 2022-07-15, 1 file, -0/+9)
* ingest: more bogus domain patterns (Bryan Newbold, 2022-07-15, 1 file, -0/+3)
* spn2: handle case of re-attempting a recent crawl (race condition) (Bryan Newbold, 2022-07-15, 1 file, -0/+14)
* html: fulltext URL prefixes to skip; also fix broken pattern matching (Bryan Newbold, 2022-07-15, 1 file, -4/+19)
  Due to both the 'continue-in-a-for-loop' and 'missing-trailing-commas' issues, the existing pattern matching was not working.
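The missing-trailing-comma failure mode described above is a classic Python pitfall and is easy to reproduce in a minimal sketch (the pattern values here are hypothetical, not sandcrawler's actual list):

```python
# A missing comma between adjacent string literals silently concatenates
# them, so the tuple ends up with one fused pattern instead of two.
BROKEN_PREFIXES = (
    "https://example.com/fulltext/"  # <- missing comma here
    "https://example.com/download/",
)

FIXED_PREFIXES = (
    "https://example.com/fulltext/",
    "https://example.com/download/",  # trailing comma guards future edits
)

print(len(BROKEN_PREFIXES))  # 1: the two literals fused into one string
print(len(FIXED_PREFIXES))   # 2
```

Because the fused string matches nothing, any `url.startswith(prefix)` check over the broken tuple quietly never fires, which is consistent with "the existing pattern matching was not working".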
* ingest: another form of cookie block URL (Bryan Newbold, 2022-07-15, 1 file, -0/+2)
  This still doesn't short-cut the CDX lookup chain, because that is all pure redirects happening in ia.py.
* HTML ingest: most sub-resource patterns to skip (Bryan Newbold, 2022-07-15, 1 file, -1/+13)
* cdx lookups: prioritize truly exact URL matches (Bryan Newbold, 2022-07-14, 1 file, -0/+1)
  This hopefully resolves an issue causing many apparent redirect loops, which were actually timing or HTTP-status-code near-loops caused by http/https fuzzy matching in the CDX API, despite "exact" API lookup semantics.
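One plausible way to implement this prioritization, sketched here with a hypothetical row shape rather than sandcrawler's actual CDX client code, is a stable sort that ranks byte-exact URL matches ahead of http/https fuzzy matches:

```python
def sort_cdx_rows(rows, target_url):
    """Order CDX hits so byte-exact URL matches beat fuzzy matches.

    `rows` is a list of dicts with a 'url' key (hypothetical shape).
    Python's sort is stable, so ties keep the API's original ordering.
    """
    return sorted(rows, key=lambda row: 0 if row["url"] == target_url else 1)

rows = [
    {"url": "http://example.com/paper.pdf"},   # http/https fuzzy match
    {"url": "https://example.com/paper.pdf"},  # byte-exact match
]
ordered = sort_cdx_rows(rows, "https://example.com/paper.pdf")
print(ordered[0]["url"])  # https://example.com/paper.pdf
```

Preferring the exact row means a follow-up fetch replays the same scheme that was requested, which avoids the http/https "near-loop" redirects described above.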
* ingest: handle another type of wayback redirect (Bryan Newbold, 2022-07-14, 1 file, -2/+5)
* yet another bad PDF (Bryan Newbold, 2022-07-13, 1 file, -0/+1)
* wayback fetch: handle upstream 5xx replays (Bryan Newbold, 2022-07-13, 1 file, -4/+15)
* shorten default HTTP backoff factor (Bryan Newbold, 2022-07-13, 1 file, -1/+1)
  The existing factor was resulting in many-minute-long backoffs, and in Kafka timeouts.
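For context, urllib3-style retry logic sleeps roughly backoff_factor * 2**(attempt - 1) seconds between attempts (capped at a maximum), so a large factor compounds into minutes of cumulative delay. The numbers below are illustrative arithmetic, not sandcrawler's actual settings:

```python
def backoff_delays(backoff_factor, retries, cap=120.0):
    # urllib3-style exponential backoff: factor * 2**(attempt - 1), capped.
    return [min(backoff_factor * (2 ** (n - 1)), cap) for n in range(1, retries + 1)]

# A large factor quickly adds up to many minutes of total sleeping:
print(sum(backoff_delays(3.0, 8)))   # 429.0 seconds (~7 minutes)
print(sum(backoff_delays(0.5, 8)))   # 127.5 seconds
```

If a Kafka consumer sits in these sleeps without polling, the broker can decide it is dead and reshuffle partitions, which matches the "Kafka timeouts" symptom in the commit message.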
* ingest: random site PDF link pattern (Bryan Newbold, 2022-07-12, 1 file, -0/+7)
* ingest: doaj.org article landing page access links (Bryan Newbold, 2022-07-12, 2 files, -1/+12)
* ingest: IEEE domain is blocking us (Bryan Newbold, 2022-07-07, 1 file, -1/+2)
* ingest: catch more ConnectionErrors (SPN, replay fetch, GROBID) (Bryan Newbold, 2022-05-16, 2 files, -4/+19)
* ingest: skip arxiv.org DOIs, we already direct-ingest (Bryan Newbold, 2022-05-11, 1 file, -0/+1)
* ingest spn2: fix tests (Bryan Newbold, 2022-05-05, 2 files, -1/+2)
* ingest: more loginwall patterns (Bryan Newbold, 2022-05-05, 1 file, -0/+3)
* SPNv2: several fixes for prod throughput (Bryan Newbold, 2022-04-26, 1 file, -11/+34)
  Most importantly, for some API flags, if the value is not truthy, do not set the flag at all. Setting any flag was resulting in screenshots and outlinks actually getting created/captured, which was a huge slowdown. Also, check per-user SPNv2 slots available, using the API, before requesting an actual capture.
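The "only send truthy flags" fix can be sketched as follows; the flag names mirror SPNv2's capture options, but the function and request-dict shape are hypothetical simplifications, not sandcrawler's actual ia.py code:

```python
def spn2_post_data(url, capture_outlinks=False, capture_screenshot=False):
    # Only include a flag when its value is truthy: per the commit above,
    # the mere presence of some flags was treated as "enabled" by the API,
    # regardless of the value sent.
    data = {"url": url}
    flags = {
        "capture_outlinks": capture_outlinks,
        "capture_screenshot": capture_screenshot,
    }
    data.update({k: v for k, v in flags.items() if v})
    return data

print(spn2_post_data("https://example.com"))
# {'url': 'https://example.com'}  -- no flag keys at all
```

Omitting the keys entirely (rather than sending `False` or `0`) is the safe choice when an API checks for key presence instead of parsing the value.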
* make fmt (Bryan Newbold, 2022-04-26, 1 file, -2/+5)
* block isiarticles.com from future PDF crawls (Bryan Newbold, 2022-04-20, 1 file, -0/+2)
* ingest: drive.google.com ingest support (Bryan Newbold, 2022-04-04, 1 file, -0/+8)
* filesets: fix archive.org path naming (Bryan Newbold, 2022-03-29, 1 file, -7/+8)
* bugfix: sha1/md5 typo (Bryan Newbold, 2022-03-23, 1 file, -1/+1)
  Caught this prepping to ingest into fatcat. Derp!
* file ingest: don't 'backoff' on spn2 backoff error (Bryan Newbold, 2022-03-22, 2 files, -0/+8)
  The intent of this is to try and get through the daily ingest requests faster, so we can loop and retry if needed. A 200-second delay, usually resulting in a Kafka topic reshuffle, really slows things down. This will presumably result in a bunch of spn2-backoff status requests, but we can just retry those.
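A minimal sketch of that retry policy, using a hypothetical status vocabulary based on the description above: statuses like 'spn2-backoff' are emitted for a later retry pass instead of sleeping inline in the consumer:

```python
def should_sleep_before_retry(status: str) -> bool:
    # Statuses where sleeping inline (e.g. ~200 seconds) just stalls the
    # Kafka consumer; better to record the failure now and retry the
    # request on a later pass through the queue.
    NO_INLINE_SLEEP = {"spn2-backoff"}
    return status not in NO_INLINE_SLEEP

print(should_sleep_before_retry("spn2-backoff"))  # False
print(should_sleep_before_retry("spn2-error"))    # True
```

The trade-off, as the commit notes, is more recorded spn2-backoff results, in exchange for the worker never blocking long enough to trigger a consumer-group rebalance.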
* small lint/typo/fmt fixes (Bryan Newbold, 2022-02-24, 3 files, -5/+5)
* another bad PDF sha1 (Bryan Newbold, 2022-02-23, 1 file, -0/+1)
* ingest: fix mistakenly commented-out except block (?) (Bryan Newbold, 2022-02-18, 1 file, -4/+3)
* ingest: handle more fileset failure modes (Bryan Newbold, 2022-02-18, 2 files, -3/+30)
* yet another bad PDF sha1 (Bryan Newbold, 2022-02-08, 1 file, -0/+1)
* sandcrawler: additional extracts, mostly OJS (Bryan Newbold, 2022-01-13, 1 file, -1/+23)
* filesets: more figshare URL patterns (Bryan Newbold, 2022-01-13, 1 file, -0/+13)
* fileset ingest: better verification of resources (Bryan Newbold, 2022-01-13, 1 file, -7/+23)
* ingest: PDF pattern for integrityresjournals.org (Bryan Newbold, 2022-01-13, 1 file, -0/+8)
* null-body -> empty-blob (Bryan Newbold, 2022-01-13, 3 files, -4/+8)
* spn: handle blocked-url (etc) better (Bryan Newbold, 2022-01-11, 1 file, -0/+10)
* filesets: handle weird figshare link-only case better (Bryan Newbold, 2021-12-16, 1 file, -1/+4)
* lint ('not in') (Bryan Newbold, 2021-12-15, 1 file, -2/+2)
* more fileset ingest tweaks (Bryan Newbold, 2021-12-15, 2 files, -0/+7)
* fileset ingest: more requests timeouts, sessions (Bryan Newbold, 2021-12-15, 3 files, -37/+68)
* fileset ingest: create tmp subdirectories if needed (Bryan Newbold, 2021-12-15, 1 file, -0/+5)
* fileset ingest: configure IA session from env (Bryan Newbold, 2021-12-15, 1 file, -1/+6)
  Note that this doesn't currently work for `upload()`, and as a work-around I created `~/.config/ia.ini` manually on the worker VM.
* fileset ingest: actually use spn2 CLI flag (Bryan Newbold, 2021-12-11, 2 files, -3/+4)
* grobid: set a maximum file size (256 MByte) (Bryan Newbold, 2021-12-07, 1 file, -0/+8)
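A size guard like the one described might look like this; the 256 MByte limit comes from the commit message, while the function name and return shape are hypothetical, not sandcrawler's actual GROBID worker code:

```python
MAX_GROBID_BLOB_SIZE = 256 * 1024 * 1024  # 256 MByte, per the commit above

def check_grobid_size(blob: bytes, max_size: int = MAX_GROBID_BLOB_SIZE):
    # Refuse to POST oversized PDFs to GROBID; return an error-status
    # dict instead of attempting extraction (shape is illustrative).
    if len(blob) > max_size:
        return {"status": "blob-too-large", "size": len(blob)}
    return None

print(check_grobid_size(b"%PDF-1.4 tiny"))  # None -> safe to submit
```

Checking the blob length before the HTTP POST avoids tying up a GROBID worker (and its memory) on a file that was never going to parse in reasonable time.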
* codespell typos in python (comments) (Bryan Newbold, 2021-11-24, 4 files, -4/+4)
* html_meta: actual typo in code (CSS selector) caught by codespell (Bryan Newbold, 2021-11-24, 1 file, -1/+1)
* make fmt (Bryan Newbold, 2021-11-16, 1 file, -1/+1)
* SPNv2: make 'resources' optional (Bryan Newbold, 2021-11-16, 1 file, -1/+1)
  This field was always present previously. A recent change to the SPNv2 API borked it a bit; in theory it should still be present on new captures, but I'm not seeing it for some, so pushing this workaround. It seems like we don't actually use this field anyway, at least for the ingest pipeline.
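Treating an API field as optional is a one-line change; here is a hedged sketch (hypothetical function name, mirroring the description above rather than the actual diff):

```python
def parse_spn2_resources(api_response: dict) -> list:
    # 'resources' used to always be present in SPNv2 status responses;
    # treat it as optional now, and also normalize an explicit null,
    # defaulting to an empty list either way.
    return api_response.get("resources") or []

print(parse_spn2_resources({"status": "success"}))  # []
print(parse_spn2_resources({"resources": ["https://example.com/img.png"]}))
```

Using `.get(...) or []` instead of `api_response["resources"]` means a missing or null field degrades to "no resources" rather than raising a `KeyError` mid-ingest.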
* grobid: handle XML parsing errors, and have them recorded in sandcrawler-db (Bryan Newbold, 2021-11-12, 1 file, -1/+5)