Commit message (Collapse) | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | jan 2020 bulk ingest notes | Bryan Newbold | 2020-02-12 | 1 | -0/+26 |
| | |||||
* | pdftrio proposal and start on schema+kafka | Bryan Newbold | 2020-02-12 | 3 | -0/+122 |
| | |||||
* | add notes on recent ingest and backfill tasks | Bryan Newbold | 2020-02-05 | 3 | -0/+221 |
| | |||||
* | add ingestrequest_row2json.py | Bryan Newbold | 2020-02-05 | 1 | -0/+48 |
| | |||||
* | fix persist bug where ingest_request_source not saved | Bryan Newbold | 2020-02-05 | 1 | -0/+1 |
| | |||||
* | fix bug where ingest_request extra fields not persisted | Bryan Newbold | 2020-02-05 | 1 | -1/+2 |
| | |||||
* | handle alternative dt format in WARC headers | Bryan Newbold | 2020-02-05 | 1 | -2/+4 |
| | | | | | If there is a UTC timestamp, with trailing 'Z' indicating timezone, that is valid but increases string length by one. | ||||
* | decrease SPNv2 polling timeout to 3 minutes | Bryan Newbold | 2020-02-05 | 1 | -2/+2 |
| | |||||
* | improvements to reliability from prod testing | Bryan Newbold | 2020-02-03 | 2 | -7/+20 |
| | |||||
* | hack-y backoff ingest attempt | Bryan Newbold | 2020-02-03 | 2 | -3/+26 |
| | | | | | | | | | | | | | | | The goal here is to have SPNv2 requests backoff when we get back-pressure (usually caused by some sessions taking too long). Lack of proper back-pressure is making it hard to turn up parallelism. This is a hack because we still timeout and drop the slow request. A better way is probably to have a background thread run, while the KafkaPusher thread does polling. Maybe with timeouts to detect slow processing (greater than 30 seconds?) and only pause/resume in that case. This would also make taking batches easier. Unlike the existing code, however, the parallelism needs to happen at the Pusher level to do the polling (Kafka) and "await" (for all worker threads to complete) correctly. | ||||
* | more random sandcrawler-db queries | Bryan Newbold | 2020-02-03 | 2 | -32/+62 |
| | |||||
* | grobid petabox: fix fetch body/content | Bryan Newbold | 2020-02-03 | 1 | -1/+1 |
| | |||||
* | more SQL commands | Bryan Newbold | 2020-02-02 | 1 | -0/+15 |
| | |||||
* | wayback: try to resolve HTTPException due to many HTTP headers | Bryan Newbold | 2020-02-02 | 1 | -1/+9 |
| | | | | | | | | | This is withing GWB wayback code. Trying two things: - bump default max headers from 100 to 1000 in the (global?) http.client module itself. I didn't think through whether we would expect this to actually work - catch the exception, record it, move on | ||||
* | sandcrawler_worker: ingest worker distinct consumer groups | Bryan Newbold | 2020-01-29 | 1 | -1/+3 |
| | | | | | | I'm in the process of resetting these consumer groups, so might as well take the opportunity to split by topic and use the new canonical naming format. | ||||
* | 2020q1 fulltext ingest plans | Bryan Newbold | 2020-01-29 | 1 | -0/+272 |
| | |||||
* | grobid worker: catch PetaboxError also | Bryan Newbold | 2020-01-28 | 1 | -2/+2 |
| | |||||
* | worker kafka setting tweaks | Bryan Newbold | 2020-01-28 | 1 | -2/+4 |
| | | | | These are all attempts to get kafka workers operating more smoothly. | ||||
* | make grobid-extract worker batch size 1 | Bryan Newbold | 2020-01-28 | 1 | -0/+1 |
| | | | | | This is part of attempts to fix Kafka errors that look like they might be timeouts. | ||||
* | sql stats: typo fix | Bryan Newbold | 2020-01-28 | 1 | -1/+1 |
| | |||||
* | sql howto: database dumps | Bryan Newbold | 2020-01-28 | 1 | -0/+7 |
| | |||||
* | workers: yes, poll is necessary | Bryan Newbold | 2020-01-28 | 1 | -1/+1 |
| | |||||
* | grobid worker: always set a key in response | Bryan Newbold | 2020-01-28 | 1 | -4/+25 |
| | | | | | | | | | We have key-based compaction enabled for the GROBID output topic. This means it is an error to public to that topic without a key set. Hopefully this change will end these errors, which look like: KafkaError{code=INVALID_MSG,val=2,str="Broker: Invalid message"} | ||||
* | fix kafka worker partition-specific error | Bryan Newbold | 2020-01-28 | 1 | -1/+1 |
| | |||||
* | fix WaybackError exception formating | Bryan Newbold | 2020-01-28 | 1 | -1/+1 |
| | |||||
* | fix elif syntax error | Bryan Newbold | 2020-01-28 | 1 | -1/+1 |
| | |||||
* | block springer page-one domain | Bryan Newbold | 2020-01-28 | 1 | -0/+3 |
| | |||||
* | clarify petabox fetch behavior | Bryan Newbold | 2020-01-28 | 1 | -3/+6 |
| | |||||
* | re-enable figshare and zenodo crawling | Bryan Newbold | 2020-01-21 | 1 | -8/+0 |
| | | | | For daily imports | ||||
* | persist grobid: actually, status_code is required | Bryan Newbold | 2020-01-21 | 2 | -3/+10 |
| | | | | | | | Instead of working around when missing, force it to exist but skip in database insert section. Disk mode still needs to check if blank. | ||||
* | ingest: check for null-body before file_meta | Bryan Newbold | 2020-01-21 | 1 | -0/+3 |
| | | | | | gen_file_metadata raises an assert error if body is None (or false-y in general) | ||||
* | wayback: replay redirects have X-Archive-Redirect-Reason | Bryan Newbold | 2020-01-21 | 1 | -2/+4 |
| | |||||
* | persist: work around GROBID timeouts with no status_code | Bryan Newbold | 2020-01-21 | 2 | -3/+3 |
| | |||||
* | grobid: fix error_msg typo; set status_code for timeouts | Bryan Newbold | 2020-01-21 | 1 | -1/+2 |
| | |||||
* | add 200 second timeout to GROBID requests | Bryan Newbold | 2020-01-17 | 1 | -8/+15 |
| | |||||
* | add SKIP log line for skip-url-blocklist path | Bryan Newbold | 2020-01-17 | 1 | -0/+1 |
| | |||||
* | ingest: add URL blocklist feature | Bryan Newbold | 2020-01-17 | 2 | -4/+49 |
| | | | | And, temporarily, block zenodo and figshare. | ||||
* | handle UnicodeDecodeError in the other GET instance | Bryan Newbold | 2020-01-15 | 1 | -0/+2 |
| | |||||
* | increase SPNv2 polling timeout to 4 minutes | Bryan Newbold | 2020-01-15 | 1 | -1/+3 |
| | |||||
* | make failed replay fetch an error, not assert error | Bryan Newbold | 2020-01-15 | 1 | -1/+2 |
| | |||||
* | kafka config: actually we do want large bulk ingest request topic | Bryan Newbold | 2020-01-15 | 1 | -1/+1 |
| | |||||
* | improve sentry reporting with 'release' git hash | Bryan Newbold | 2020-01-15 | 2 | -2/+5 |
| | |||||
* | wayback replay: catch UnicodeDecodeError | Bryan Newbold | 2020-01-15 | 1 | -0/+2 |
| | | | | | | | | In prod, ran in to a redirect URL like: b'/web/20200116043630id_/https://mediarep.org/bitstream/handle/doc/1127/Barth\xe9l\xe9my_2015_Life_and_Technology.pdf;jsessionid=A9EFB2798846F5E14A8473BBFD6AB46C?sequence=1' which broke requests. | ||||
* | persist: fix dupe field copying | Bryan Newbold | 2020-01-15 | 1 | -1/+8 |
| | | | | | | In testing hit: AttributeError: 'str' object has no attribute 'get' | ||||
* | persist worker: implement updated ingest result semantics | Bryan Newbold | 2020-01-15 | 2 | -12/+17 |
| | |||||
* | clarify ingest result schema and semantics | Bryan Newbold | 2020-01-15 | 5 | -30/+82 |
| | |||||
* | pass through revisit_cdx | Bryan Newbold | 2020-01-15 | 2 | -5/+21 |
| | |||||
* | fix revisit resolution | Bryan Newbold | 2020-01-15 | 1 | -4/+12 |
| | | | | | Returns the *original* CDX record, but keeps the terminal_url and terminal_sha1hex info. | ||||
* | database stats | Bryan Newbold | 2020-01-14 | 2 | -0/+289 |
| | |||||
* | add new bulk ingest request topic | Bryan Newbold | 2020-01-14 | 1 | -1/+6 |
| |