| | Commit message | Author | Age | Files | Lines |
---|---|---|---|---|---|
* | spn2: handle case of re-attempting a recent crawl (race condition) | Bryan Newbold | 2022-07-15 | 1 | -0/+14 |
| | |||||
* | html: fulltext URL prefixes to skip; also fix broken pattern matching | Bryan Newbold | 2022-07-15 | 1 | -4/+19 |
| | Due to both the 'continue-in-a-for-loop' and 'missing-trailing-commas' bugs, the existing pattern matching was not working. | | | | |
* | ingest: another form of cookie block URL | Bryan Newbold | 2022-07-15 | 1 | -0/+2 |
| | This still doesn't short-cut the CDX lookup chain, because that chain is all pure redirects happening in ia.py. | | | | |
* | HTML ingest: most sub-resource patterns to skip | Bryan Newbold | 2022-07-15 | 1 | -1/+13 |
| | |||||
* | cdx lookups: prioritize truly exact URL matches | Bryan Newbold | 2022-07-14 | 1 | -0/+1 |
| | This hopefully resolves an issue causing many apparent redirect loops, which were actually timing or HTTP status code near-loops caused by http/https fuzzy matching in the CDX API, despite its "exact" lookup semantics. | | | | |
* | ingest: handle another type of wayback redirect | Bryan Newbold | 2022-07-14 | 1 | -2/+5 |
| | |||||
* | yet another bad PDF | Bryan Newbold | 2022-07-13 | 1 | -0/+1 |
| | |||||
* | wayback fetch: handle upstream 5xx replays | Bryan Newbold | 2022-07-13 | 1 | -4/+15 |
| | |||||
* | shorten default HTTP backoff factor | Bryan Newbold | 2022-07-13 | 1 | -1/+1 |
| | The existing factor was resulting in many-minute-long backoffs and Kafka timeouts. | | | | |
* | ingest: random site PDF link pattern | Bryan Newbold | 2022-07-12 | 1 | -0/+7 |
| | |||||
* | ingest: doaj.org article landing page access links | Bryan Newbold | 2022-07-12 | 2 | -1/+12 |
| | |||||
* | ingest: IEEE domain is blocking us | Bryan Newbold | 2022-07-07 | 1 | -1/+2 |
| | |||||
* | ingest: catch more ConnectionErrors (SPN, replay fetch, GROBID) | Bryan Newbold | 2022-05-16 | 2 | -4/+19 |
| | |||||
* | ingest: skip arxiv.org DOIs, we already direct-ingest | Bryan Newbold | 2022-05-11 | 1 | -0/+1 |
| | |||||
* | ingest spn2: fix tests | Bryan Newbold | 2022-05-05 | 2 | -1/+2 |
| | |||||
* | ingest: more loginwall patterns | Bryan Newbold | 2022-05-05 | 1 | -0/+3 |
| | |||||
* | SPNv2: several fixes for prod throughput | Bryan Newbold | 2022-04-26 | 1 | -11/+34 |
| | Most importantly, for some API flags, if the value is not truthy, do not set the flag at all. Setting any flag was resulting in screenshots and outlinks actually getting created/captured, which was a huge slowdown. Also, check the per-user SPNv2 slots available, using the API, before requesting an actual capture. | | | | |
* | make fmt | Bryan Newbold | 2022-04-26 | 1 | -2/+5 |
| | |||||
* | block isiarticles.com from future PDF crawls | Bryan Newbold | 2022-04-20 | 1 | -0/+2 |
| | |||||
* | ingest: drive.google.com ingest support | Bryan Newbold | 2022-04-04 | 1 | -0/+8 |
| | |||||
* | filesets: fix archive.org path naming | Bryan Newbold | 2022-03-29 | 1 | -7/+8 |
| | |||||
* | bugfix: sha1/md5 typo | Bryan Newbold | 2022-03-23 | 1 | -1/+1 |
| | Caught this prepping to ingest into fatcat. Derp! | | | | |
* | file ingest: don't 'backoff' on spn2 backoff error | Bryan Newbold | 2022-03-22 | 2 | -0/+8 |
| | The intent of this is to try and get through the daily ingest requests faster, so we can loop and retry if needed. A 200 second delay, usually resulting in a Kafka topic reshuffle, really slows things down. This will presumably result in a bunch of spn2-backoff status requests, but we can just retry those. | | | | |
* | small lint/typo/fmt fixes | Bryan Newbold | 2022-02-24 | 3 | -5/+5 |
| | |||||
* | another bad PDF sha1 | Bryan Newbold | 2022-02-23 | 1 | -0/+1 |
| | |||||
* | ingest: fix mistakenly commented except block (?) | Bryan Newbold | 2022-02-18 | 1 | -4/+3 |
| | |||||
* | ingest: handle more fileset failure modes | Bryan Newbold | 2022-02-18 | 2 | -3/+30 |
| | |||||
* | yet another bad PDF sha1 | Bryan Newbold | 2022-02-08 | 1 | -0/+1 |
| | |||||
* | sandcrawler: additional extracts, mostly OJS | Bryan Newbold | 2022-01-13 | 1 | -1/+23 |
| | |||||
* | filesets: more figshare URL patterns | Bryan Newbold | 2022-01-13 | 1 | -0/+13 |
| | |||||
* | fileset ingest: better verification of resources | Bryan Newbold | 2022-01-13 | 1 | -7/+23 |
| | |||||
* | ingest: PDF pattern for integrityresjournals.org | Bryan Newbold | 2022-01-13 | 1 | -0/+8 |
| | |||||
* | null-body -> empty-blob | Bryan Newbold | 2022-01-13 | 3 | -4/+8 |
| | |||||
* | spn: handle blocked-url (etc) better | Bryan Newbold | 2022-01-11 | 1 | -0/+10 |
| | |||||
* | filesets: handle weird figshare link-only case better | Bryan Newbold | 2021-12-16 | 1 | -1/+4 |
| | |||||
* | lint ('not in') | Bryan Newbold | 2021-12-15 | 1 | -2/+2 |
| | |||||
* | more fileset ingest tweaks | Bryan Newbold | 2021-12-15 | 2 | -0/+7 |
| | |||||
* | fileset ingest: more requests timeouts, sessions | Bryan Newbold | 2021-12-15 | 3 | -37/+68 |
| | |||||
* | fileset ingest: create tmp subdirectories if needed | Bryan Newbold | 2021-12-15 | 1 | -0/+5 |
| | |||||
* | fileset ingest: configure IA session from env | Bryan Newbold | 2021-12-15 | 1 | -1/+6 |
| | Note that this doesn't currently work for `upload()`; as a workaround I created `~/.config/ia.ini` manually on the worker VM. | | | | |
* | fileset ingest: actually use spn2 CLI flag | Bryan Newbold | 2021-12-11 | 2 | -3/+4 |
| | |||||
* | grobid: set a maximum file size (256 MByte) | Bryan Newbold | 2021-12-07 | 1 | -0/+8 |
| | |||||
* | codespell typos in python (comments) | Bryan Newbold | 2021-11-24 | 4 | -4/+4 |
| | |||||
* | html_meta: actual typo in code (CSS selector) caught by codespell | Bryan Newbold | 2021-11-24 | 1 | -1/+1 |
| | |||||
* | make fmt | Bryan Newbold | 2021-11-16 | 1 | -1/+1 |
| | |||||
* | SPNv2: make 'resources' optional | Bryan Newbold | 2021-11-16 | 1 | -1/+1 |
| | This was always present previously. A recent change to the SPNv2 API broke it somewhat, though in theory it should be present on new captures. I'm not seeing it for some captures, so pushing this workaround. It seems like we don't actually use this field anyway, at least in the ingest pipeline. | | | | |
* | grobid: handle XML parsing errors, and have them recorded in sandcrawler-db | Bryan Newbold | 2021-11-12 | 1 | -1/+5 |
| | |||||
* | ingest_file: more efficient GROBID metadata copy | Bryan Newbold | 2021-11-12 | 1 | -3/+3 |
| | |||||
* | ingest: start re-processing GROBID with newer version | Bryan Newbold | 2021-11-10 | 1 | -2/+6 |
| | |||||
* | simple persist worker/tool to backfill grobid_refs | Bryan Newbold | 2021-11-10 | 1 | -0/+40 |
| |
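
The 2022-07-15 entry "html: fulltext URL prefixes to skip; also fix broken pattern matching" names two Python pitfalls: a `continue` inside a nested `for` loop, and missing trailing commas in a pattern list. A minimal sketch of how each one can silently disable matching follows. The pattern strings, URLs, and function names are hypothetical illustrations, not taken from the sandcrawler source.

```python
# Hypothetical skip patterns, for illustration only.
BROKEN_PATTERNS = [
    "://example.com/login"   # missing comma: adjacent string literals are concatenated
    "://example.org/cookie",
]
FIXED_PATTERNS = [
    "://example.com/login",
    "://example.org/cookie",
]


def filter_urls_broken(urls, patterns):
    """Buggy variant: `continue` only advances the inner loop, so nothing is skipped."""
    kept = []
    for url in urls:
        for pattern in patterns:
            if pattern in url:
                continue  # moves on to the next pattern, not the next URL
        kept.append(url)
    return kept


def filter_urls_fixed(urls, patterns):
    """Fixed variant: decide per URL whether any pattern matches, then drop it."""
    return [url for url in urls if not any(p in url for p in patterns)]


urls = [
    "https://example.com/login?next=/paper",
    "https://example.net/paper.pdf",
]
```

With these definitions, `BROKEN_PATTERNS` contains a single merged string rather than two patterns, and `filter_urls_broken` returns every URL unchanged even when given the corrected list, which matches the "not working" symptom the commit describes.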
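
The 2022-04-26 entry "SPNv2: several fixes for prod throughput" says that optional API flags should be left out of the request entirely unless their value is truthy, because sending a flag at all (even with a false value) triggered the extra screenshot or outlink work. A rough sketch of that idea is below. The helper name `build_spn2_params` and the flag names are assumptions chosen for illustration, and the actual request code in sandcrawler may look quite different.

```python
def build_spn2_params(url, **flags):
    """Build a request parameter dict, omitting any flag whose value is falsy.

    Hypothetical helper: a flag that is False, 0, or None is not sent at all,
    rather than being sent with a "false" value.
    """
    params = {"url": url}
    for name, value in flags.items():
        if value:
            params[name] = 1
    return params


# Flag names here are assumed for the example.
params = build_spn2_params(
    "https://example.com/paper.pdf",
    capture_outlinks=False,
    capture_screenshot=0,
    force_get=True,
)
```

Here only `url` and `force_get` end up in `params`; the two falsy flags never reach the request.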