Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | handle alternative dt format in WARC headers | Bryan Newbold | 2020-02-05 | 1 | -2/+4 |
| | | | | | If there is a UTC timestamp, with trailing 'Z' indicating timezone, that is valid but increases string length by one. | ||||
* | decrease SPNv2 polling timeout to 3 minutes | Bryan Newbold | 2020-02-05 | 1 | -2/+2 |
| | |||||
* | improvements to reliability from prod testing | Bryan Newbold | 2020-02-03 | 2 | -7/+20 |
| | |||||
* | hack-y backoff ingest attempt | Bryan Newbold | 2020-02-03 | 2 | -3/+26 |
| | | | | | | | | | | | | | | | The goal here is to have SPNv2 requests back off when we get back-pressure (usually caused by some sessions taking too long). Lack of proper back-pressure is making it hard to turn up parallelism. This is a hack because we still time out and drop the slow request. A better way is probably to have a background thread run while the KafkaPusher thread does polling, maybe with timeouts to detect slow processing (greater than 30 seconds?), and only pause/resume in that case. This would also make taking batches easier. Unlike the existing code, however, the parallelism needs to happen at the Pusher level to do the polling (Kafka) and "await" (for all worker threads to complete) correctly. | ||||
* | grobid petabox: fix fetch body/content | Bryan Newbold | 2020-02-03 | 1 | -1/+1 |
| | |||||
* | wayback: try to resolve HTTPException due to many HTTP headers | Bryan Newbold | 2020-02-02 | 1 | -1/+9 |
| | | | | | | | | | This is within GWB wayback code. Trying two things: - bump default max headers from 100 to 1000 in the (global?) http.client module itself. I didn't think through whether we would expect this to actually work - catch the exception, record it, move on | ||||
* | sandcrawler_worker: ingest worker distinct consumer groups | Bryan Newbold | 2020-01-29 | 1 | -1/+3 |
| | | | | | | I'm in the process of resetting these consumer groups, so might as well take the opportunity to split by topic and use the new canonical naming format. | ||||
* | grobid worker: catch PetaboxError also | Bryan Newbold | 2020-01-28 | 1 | -2/+2 |
| | |||||
* | worker kafka setting tweaks | Bryan Newbold | 2020-01-28 | 1 | -2/+4 |
| | | | | These are all attempts to get kafka workers operating more smoothly. | ||||
* | make grobid-extract worker batch size 1 | Bryan Newbold | 2020-01-28 | 1 | -0/+1 |
| | | | | | This is part of attempts to fix Kafka errors that look like they might be timeouts. | ||||
* | workers: yes, poll is necessary | Bryan Newbold | 2020-01-28 | 1 | -1/+1 |
| | |||||
* | grobid worker: always set a key in response | Bryan Newbold | 2020-01-28 | 1 | -4/+25 |
| | | | | | | | | | We have key-based compaction enabled for the GROBID output topic. This means it is an error to publish to that topic without a key set. Hopefully this change will end these errors, which look like: KafkaError{code=INVALID_MSG,val=2,str="Broker: Invalid message"} | ||||
* | fix kafka worker partition-specific error | Bryan Newbold | 2020-01-28 | 1 | -1/+1 |
| | |||||
* | fix WaybackError exception formatting | Bryan Newbold | 2020-01-28 | 1 | -1/+1 |
| | |||||
* | fix elif syntax error | Bryan Newbold | 2020-01-28 | 1 | -1/+1 |
| | |||||
* | block springer page-one domain | Bryan Newbold | 2020-01-28 | 1 | -0/+3 |
| | |||||
* | clarify petabox fetch behavior | Bryan Newbold | 2020-01-28 | 1 | -3/+6 |
| | |||||
* | re-enable figshare and zenodo crawling | Bryan Newbold | 2020-01-21 | 1 | -8/+0 |
| | | | | For daily imports | ||||
* | persist grobid: actually, status_code is required | Bryan Newbold | 2020-01-21 | 2 | -3/+10 |
| | | | | | | | Instead of working around when missing, force it to exist but skip in database insert section. Disk mode still needs to check if blank. | ||||
* | ingest: check for null-body before file_meta | Bryan Newbold | 2020-01-21 | 1 | -0/+3 |
| | | | | | gen_file_metadata raises an assert error if body is None (or false-y in general) | ||||
* | wayback: replay redirects have X-Archive-Redirect-Reason | Bryan Newbold | 2020-01-21 | 1 | -2/+4 |
| | |||||
* | persist: work around GROBID timeouts with no status_code | Bryan Newbold | 2020-01-21 | 2 | -3/+3 |
| | |||||
* | grobid: fix error_msg typo; set status_code for timeouts | Bryan Newbold | 2020-01-21 | 1 | -1/+2 |
| | |||||
* | add 200 second timeout to GROBID requests | Bryan Newbold | 2020-01-17 | 1 | -8/+15 |
| | |||||
* | add SKIP log line for skip-url-blocklist path | Bryan Newbold | 2020-01-17 | 1 | -0/+1 |
| | |||||
* | ingest: add URL blocklist feature | Bryan Newbold | 2020-01-17 | 2 | -4/+49 |
| | | | | And, temporarily, block zenodo and figshare. | ||||
* | handle UnicodeDecodeError in the other GET instance | Bryan Newbold | 2020-01-15 | 1 | -0/+2 |
| | |||||
* | increase SPNv2 polling timeout to 4 minutes | Bryan Newbold | 2020-01-15 | 1 | -1/+3 |
| | |||||
* | make failed replay fetch an error, not assert error | Bryan Newbold | 2020-01-15 | 1 | -1/+2 |
| | |||||
* | improve sentry reporting with 'release' git hash | Bryan Newbold | 2020-01-15 | 2 | -2/+5 |
| | |||||
* | wayback replay: catch UnicodeDecodeError | Bryan Newbold | 2020-01-15 | 1 | -0/+2 |
| | | | | | | | | In prod, ran into a redirect URL like: b'/web/20200116043630id_/https://mediarep.org/bitstream/handle/doc/1127/Barth\xe9l\xe9my_2015_Life_and_Technology.pdf;jsessionid=A9EFB2798846F5E14A8473BBFD6AB46C?sequence=1' which broke requests. | ||||
* | persist: fix dupe field copying | Bryan Newbold | 2020-01-15 | 1 | -1/+8 |
| | | | | | | In testing hit: AttributeError: 'str' object has no attribute 'get' | ||||
* | persist worker: implement updated ingest result semantics | Bryan Newbold | 2020-01-15 | 2 | -12/+17 |
| | |||||
* | clarify ingest result schema and semantics | Bryan Newbold | 2020-01-15 | 3 | -7/+32 |
| | |||||
* | pass through revisit_cdx | Bryan Newbold | 2020-01-15 | 2 | -5/+21 |
| | |||||
* | fix revisit resolution | Bryan Newbold | 2020-01-15 | 1 | -4/+12 |
| | | | | | Returns the *original* CDX record, but keeps the terminal_url and terminal_sha1hex info. | ||||
* | add postgrest checks to test mocks | Bryan Newbold | 2020-01-14 | 1 | -1/+9 |
| | |||||
* | tests: don't use localhost as a responses mock host | Bryan Newbold | 2020-01-14 | 2 | -6/+6 |
| | |||||
* | bulk ingest file request topic support | Bryan Newbold | 2020-01-14 | 1 | -1/+7 |
| | |||||
* | ingest: sketch out more of how 'existing' path would work | Bryan Newbold | 2020-01-14 | 1 | -8/+22 |
| | |||||
* | ingest: check existing GROBID; also push results to sink | Bryan Newbold | 2020-01-14 | 1 | -4/+22 |
| | |||||
* | ingest persist skips 'existing' ingest results | Bryan Newbold | 2020-01-14 | 1 | -0/+3 |
| | |||||
* | grobid-to-kafka support in ingest worker | Bryan Newbold | 2020-01-14 | 1 | -0/+6 |
| | |||||
* | grobid worker fixes for newer ia lib refactors | Bryan Newbold | 2020-01-14 | 1 | -3/+9 |
| | |||||
* | small fixups to SandcrawlerPostgrestClient | Bryan Newbold | 2020-01-14 | 2 | -1/+11 |
| | |||||
* | filter out archive.org and web.archive.org (until implemented) | Bryan Newbold | 2020-01-14 | 1 | -1/+12 |
| | |||||
* | SPNv2 doesn't support FTP; add a live test for non-revisit FTP | Bryan Newbold | 2020-01-14 | 2 | -0/+26 |
| | |||||
* | more ftp status 226 support | Bryan Newbold | 2020-01-14 | 5 | -9/+23 |
| | |||||
* | add live tests for ftp, revisits | Bryan Newbold | 2020-01-14 | 1 | -1/+36 |
| | |||||
* | basic FTP ingest support; revisit record resolution | Bryan Newbold | 2020-01-14 | 2 | -35/+78 |
| | | | | | | | - supporting revisits means more wayback hits (fewer crawls) => faster - ... but this is only partial support. will also need to work through sandcrawler db schema, etc. current status should be safe to merge/use. - ftp support via treating an ftp hit as a 200 |
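
The last entry above describes FTP support "via treating an ftp hit as a 200", and a nearby commit mentions FTP status 226 (the FTP reply code for a completed transfer). A minimal sketch of that idea follows. The helper name `effective_status` and the constants are hypothetical, chosen for illustration, and are not taken from the sandcrawler code.

```python
from urllib.parse import urlparse

# Hypothetical names for illustration; not the actual sandcrawler implementation.
FTP_TRANSFER_COMPLETE = 226  # FTP reply code: transfer succeeded, data connection closed
HTTP_OK = 200


def effective_status(url: str, status_code: int) -> int:
    """Map a successful FTP capture onto HTTP 200 so later checks can stay HTTP-only."""
    scheme = urlparse(url).scheme.lower()
    if scheme == "ftp" and status_code == FTP_TRANSFER_COMPLETE:
        return HTTP_OK
    # Everything else (HTTP statuses, FTP errors) passes through unchanged.
    return status_code
```

Under this assumption, code that only accepts a terminal status of 200 would accept FTP hits without further changes, while an HTTP response that happens to carry status 226 is left alone.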