Commit message | Author | Age | Files | Lines | ||
---|---|---|---|---|---|---|
... | ||||||
* | html: remove old citation_pdf_url code path | Bryan Newbold | 2022-07-15 | 1 | -32/+1 | |
| | | | | This code path doesn't check for 'skip' patterns, resulting in a bunch of bad CDX checks/errors. | |||||
* | wayback: use same 5xx/4xx-allowing tricks for replay body fetch as for replay redirect | Bryan Newbold | 2022-07-15 | 1 | -7/+7 |
| |
* | cdx api: add another allowable URL fuzzy-match pattern (double slashes) | Bryan Newbold | 2022-07-15 | 1 | -0/+9 | |
| | ||||||
* | ingest: more bogus domain patterns | Bryan Newbold | 2022-07-15 | 1 | -0/+3 | |
| | ||||||
* | spn2: handle case of re-attempting a recent crawl (race condition) | Bryan Newbold | 2022-07-15 | 1 | -0/+14 | |
| | ||||||
* | html: fulltext URL prefixes to skip; also fix broken pattern matching | Bryan Newbold | 2022-07-15 | 1 | -4/+19 | |
| | | | | Due to both a 'continue' inside the for loop and missing trailing commas, the existing pattern matching was not working (see the sketch after this log). | |||||
* | row2json script: fix argument type | Bryan Newbold | 2022-07-15 | 1 | -1/+1 | |
| | ||||||
* | row2json script: add flag to enable recrawling | Bryan Newbold | 2022-07-15 | 1 | -1/+8 | |
| | ||||||
* | ingest: another form of cookie block URL | Bryan Newbold | 2022-07-15 | 1 | -0/+2 | |
| | | | | This still doesn't short-cut the CDX lookup chain, because that is all handled as pure redirects in ia.py. | |||||
* | HTML ingest: more sub-resource patterns to skip | Bryan Newbold | 2022-07-15 | 1 | -1/+13 |
| | ||||||
* | cdx lookups: prioritize truly exact URL matches | Bryan Newbold | 2022-07-14 | 1 | -0/+1 |
| | | | | This hopefully resolves an issue causing many apparent redirect loops, which were actually timing or HTTP-status-code near-loops caused by http/https fuzzy matching in the CDX API, despite "exact" API lookup semantics (see the sketch after this log). | |||||
* | ingest: handle another type of wayback redirect | Bryan Newbold | 2022-07-14 | 1 | -2/+5 | |
| | ||||||
* | unpaywall crawl wrap-up notes | Bryan Newbold | 2022-07-14 | 1 | -2/+145 | |
| | ||||||
* | yet another bad PDF | Bryan Newbold | 2022-07-13 | 1 | -0/+1 | |
| | ||||||
* | wayback fetch: handle upstream 5xx replays | Bryan Newbold | 2022-07-13 | 1 | -4/+15 | |
| | ||||||
* | shorten default HTTP backoff factor | Bryan Newbold | 2022-07-13 | 1 | -1/+1 | |
| | | | | The existing factor was resulting in many-minute-long backoffs and Kafka timeouts (see the sketch after this log). | |||||
* | ingest: random site PDF link pattern | Bryan Newbold | 2022-07-12 | 1 | -0/+7 | |
| | ||||||
* | ingest: doaj.org article landing page access links | Bryan Newbold | 2022-07-12 | 2 | -1/+12 | |
| | ||||||
* | ingest: targeted 2022-04 notes | Bryan Newbold | 2022-07-07 | 1 | -1/+3 | |
| | ||||||
* | stats: May 2022 ingest-by-domain stats | Bryan Newbold | 2022-07-07 | 1 | -0/+410 |
| | ||||||
* | ingest: IEEE domain is blocking us | Bryan Newbold | 2022-07-07 | 1 | -1/+2 | |
| | ||||||
* | ingest: catch more ConnectionErrors (SPN, replay fetch, GROBID) | Bryan Newbold | 2022-05-16 | 2 | -4/+19 | |
| | ||||||
* | ingest: skip arxiv.org DOIs, we already direct-ingest | Bryan Newbold | 2022-05-11 | 1 | -0/+1 | |
| | ||||||
* | python make fmt | Bryan Newbold | 2022-05-05 | 1 | -3/+1 | |
| | ||||||
* | ingest spn2: fix tests | Bryan Newbold | 2022-05-05 | 4 | -6/+108 | |
| | ||||||
* | ingest: more loginwall patterns | Bryan Newbold | 2022-05-05 | 1 | -0/+3 | |
| | ||||||
* | ingest_tool: fix arg parsing | Bryan Newbold | 2022-05-03 | 1 | -8/+8 | |
| | ||||||
* | finished re-GROBID-ing | Bryan Newbold | 2022-05-03 | 1 | -5/+7 | |
| | ||||||
* | PDF URL lists update | Bryan Newbold | 2022-05-03 | 2 | -0/+76 | |
| | ||||||
* | some weekly crawl numbers (not very helpful) | Bryan Newbold | 2022-05-03 | 1 | -0/+191 | |
| | ||||||
* | switch default kafka-broker host from wbgrp-svc263 to wbgrp-svc350 | Bryan Newbold | 2022-05-03 | 9 | -14/+14 | |
| | ||||||
* | April 2022 sandcrawler DB stats | Bryan Newbold | 2022-04-27 | 1 | -0/+432 | |
| | ||||||
* | more dataset crawl notes | Bryan Newbold | 2022-04-26 | 1 | -0/+53 | |
| | ||||||
* | .ua crawling follow-up stats | Bryan Newbold | 2022-04-26 | 1 | -2/+2 | |
| | ||||||
* | update HBase Thrift gateway host | Bryan Newbold | 2022-04-26 | 1 | -1/+1 | |
| | ||||||
* | SPNv2: several fixes for prod throughput | Bryan Newbold | 2022-04-26 | 1 | -11/+34 | |
| | | | | Most importantly, for some API flags, if the value is not truthy, do not set the flag at all. Setting any flag was resulting in screenshots and outlinks actually getting created/captured, which was a huge slowdown. Also, check per-user SPNv2 slots available, via the API, before requesting an actual capture (see the sketch after this log). | |||||
* | make fmt | Bryan Newbold | 2022-04-26 | 1 | -2/+5 | |
| | ||||||
* | ingest_tool: spn-status command to check user's quota | Bryan Newbold | 2022-04-26 | 1 | -0/+19 | |
| | ||||||
* | flake8: allow 'Any' types | Bryan Newbold | 2022-04-26 | 1 | -1/+2 | |
| | ||||||
* | start notes on unpaywall and targeted/patch crawls | Bryan Newbold | 2022-04-20 | 2 | -0/+277 | |
| | ||||||
* | block isiarticles.com from future PDF crawls | Bryan Newbold | 2022-04-20 | 1 | -0/+2 | |
| | ||||||
* | pipenv: update; newer devpi hostname | Bryan Newbold | 2022-04-06 | 2 | -781/+850 | |
| | ||||||
* | ingest: drive.google.com ingest support | Bryan Newbold | 2022-04-04 | 1 | -0/+8 | |
| | ||||||
* | .ua ingest notes | Bryan Newbold | 2022-04-04 | 1 | -0/+29 | |
| | ||||||
* | sql: add source/created index on ingest_request table | Bryan Newbold | 2022-04-04 | 1 | -0/+1 | |
| | ||||||
* | sql: fix reingest query missing type on LEFT JOIN; wrap in read-only transaction | Bryan Newbold | 2022-04-04 | 5 | -5/+27 | |
| | ||||||
* | filesets: fix archive.org path naming | Bryan Newbold | 2022-03-29 | 1 | -7/+8 | |
| | ||||||
* | bugfix: sha1/md5 typo | Bryan Newbold | 2022-03-23 | 1 | -1/+1 | |
| | | | | Caught this while prepping to ingest into fatcat. Derp! | |||||
* | various ingest/task notes | Bryan Newbold | 2022-03-22 | 4 | -5/+97 | |
| | ||||||
* | file ingest: don't 'backoff' on spn2 backoff error | Bryan Newbold | 2022-03-22 | 2 | -0/+8 | |
| | | | | The intent is to get through the daily ingest requests faster, so we can loop and retry as needed. A 200-second delay, usually resulting in a Kafka topic reshuffle, really slows things down. This will presumably result in a bunch of spn2-backoff status requests, but we can just retry those. |
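
The "missing trailing commas" failure mode from the fulltext-URL-skip commit (2022-07-15) is a classic Python pitfall: adjacent string literals with no comma between them are silently concatenated into a single element, so prefix checks never match (the "continue in a for loop" half of that bug is not shown). A minimal sketch, with invented prefix values rather than sandcrawler's actual skip list:

```python
# BUG: no comma between the literals, so Python concatenates them into one
# long (useless) prefix and the tuple has only a single element.
BROKEN_SKIP_PREFIXES = (
    "https://example.com/cookieAbsent"
    "https://example.com/action/login",
)

# FIX: trailing commas keep each prefix a separate tuple element.
SKIP_PREFIXES = (
    "https://example.com/cookieAbsent",
    "https://example.com/action/login",
)


def should_skip(url: str) -> bool:
    """Return True if a candidate fulltext URL matches any skip prefix."""
    return any(url.startswith(prefix) for prefix in SKIP_PREFIXES)


assert should_skip("https://example.com/action/login?next=/paper.pdf")
# With the broken tuple, even an exact cookieAbsent URL never matches:
assert not any(
    "https://example.com/cookieAbsent".startswith(p) for p in BROKEN_SKIP_PREFIXES
)
```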
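
The http/https fuzzy matching described in the "truly exact URL matches" commit (2022-07-14) can be compensated for client-side by preferring CDX rows whose original URL is byte-for-byte identical to the requested URL. A sketch under that assumption; the row shape and field names are illustrative, not sandcrawler's actual CDX client:

```python
from typing import Any, Dict, List, Optional


def pick_cdx_row(url: str, rows: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Pick the best CDX row, preferring a truly exact URL match.

    Even with exact-match lookup parameters, the CDX API may return rows for
    http/https (or otherwise fuzzy-matched) variants of the requested URL,
    which can look like redirect loops downstream. Sorting exact matches to
    the front avoids chasing those variants.
    """
    if not rows:
        return None
    ranked = sorted(rows, key=lambda row: 0 if row.get("url") == url else 1)
    return ranked[0]
```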
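
The "shorten default HTTP backoff factor" commit (2022-07-13) concerns the retry backoff on outgoing HTTP sessions: with urllib3's Retry, sleep times grow exponentially with the backoff factor, so a large factor can stall a worker for minutes and trip Kafka timeouts. A sketch of the general pattern, with assumed (not sandcrawler's actual) factor and retry values:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(backoff_factor: float = 0.5, retries: int = 3) -> requests.Session:
    """Build a requests Session that retries transient errors with backoff.

    urllib3 sleeps roughly backoff_factor * 2**n seconds before the n-th
    retry (capped at a built-in maximum), so the factor directly controls
    how long a stuck worker blocks; the 0.5/3 defaults here are illustrative.
    """
    retry = Retry(
        total=retries,
        backoff_factor=backoff_factor,
        status_forcelist=[500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```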
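
The key SPNv2 fix (2022-04-26) is that falsy flag values must be omitted entirely rather than sent as 0/False, since the mere presence of a flag was triggering screenshot and outlink capture. A minimal sketch of that behavior; the endpoint constant and flag handling are assumptions for illustration, not necessarily sandcrawler's actual SPNv2 client:

```python
from typing import Any, Dict

import requests

SPN2_SAVE_ENDPOINT = "https://web.archive.org/save"  # Save Page Now v2 (assumed)


def spn2_save_request(
    session: requests.Session, url: str, **flags: Any
) -> requests.Response:
    """POST a capture request, including only flags with truthy values.

    Sending e.g. capture_screenshot=0 still counts as "flag present" on the
    server side, so falsy flags are dropped from the form data entirely.
    """
    data: Dict[str, Any] = {"url": url}
    for key, value in flags.items():
        if value:  # skip None, False, 0, "" -- do not send the flag at all
            data[key] = value
    return session.post(SPN2_SAVE_ENDPOINT, data=data)
```

The same commit also checks per-user SPNv2 capture slots via the API before submitting a request; that pre-flight check is not sketched here.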