| | Commit message | Author | Age | Files | Lines |
---|---|---|---|---|---|
* | fetch_petabox_body: allow non-200 status code fetches | Bryan Newbold | 2020-02-24 | 1 | -2/+10 |
| | But only if it matches what the revisit record indicated. This is mostly to enable better revisit fetching. | | | | |
* | allow fuzzy revisit matches | Bryan Newbold | 2020-02-24 | 1 | -1/+26 |
| | |||||
* | ingest: more revisit fixes | Bryan Newbold | 2020-02-22 | 1 | -4/+4 |
| | |||||
* | html: more publisher-specific fulltext extraction tricks | Bryan Newbold | 2020-02-22 | 1 | -0/+47 |
| | |||||
* | ia: improve warc/revisit implementation | Bryan Newbold | 2020-02-22 | 1 | -26/+46 |
| | A lot of the terminal-bad-status results seem to have been due to not handling revisits correctly. Revisit records have status_code = '-' or None. | | | | |
* | html: degruyter extraction; disabled journals.lww.com | Bryan Newbold | 2020-02-22 | 1 | -0/+19 |
| | |||||
* | ingest: include better terminal URL/status_code/dt | Bryan Newbold | 2020-02-22 | 1 | -0/+8 |
| | Was getting a lot of "last hit" metadata for these columns. | | | | |
* | ingest: skip more non-pdf, non-paper domains | Bryan Newbold | 2020-02-22 | 1 | -0/+9 |
| | |||||
* | cdx: handle empty/null CDX response | Bryan Newbold | 2020-02-22 | 1 | -0/+2 |
| | Sometimes we seem to get an empty string instead of an empty JSON list. | | | | |
* | html: handle TypeError during bs4 parse | Bryan Newbold | 2020-02-22 | 1 | -1/+7 |
| | |||||
* | filter out CDX rows missing WARC playback fields | Bryan Newbold | 2020-02-19 | 1 | -0/+4 |
| | |||||
* | pdf_trio persist fixes from prod | Bryan Newbold | 2020-02-19 | 2 | -5/+9 |
| | |||||
* | allow <meta property=citation_pdf_url> | Bryan Newbold | 2020-02-18 | 1 | -0/+3 |
| | At least researchgate does this (!) | | | | |
* | X-Archive-Src more robust than X-Archive-Redirect-Reason | Bryan Newbold | 2020-02-18 | 1 | -2/+3 |
| | |||||
* | move edit_extra path to top-level | Bryan Newbold | 2020-02-18 | 1 | -2/+1 |
| | |||||
* | wayback: on bad redirects, log instead of assert | Bryan Newbold | 2020-02-18 | 1 | -2/+13 |
| | This is a different form of mangled redirect. | | | | |
* | attempt to work around corrupt ARC files from alexa issue | Bryan Newbold | 2020-02-18 | 1 | -0/+5 |
| | |||||
* | unpaywall2ingestrequest transform script | Bryan Newbold | 2020-02-18 | 2 | -1/+104 |
| | |||||
* | pdftrio: mode controlled by CLI arg | Bryan Newbold | 2020-02-18 | 2 | -10/+14 |
| | |||||
* | pdftrio: fix error nesting in pdftrio key | Bryan Newbold | 2020-02-18 | 1 | -12/+20 |
| | |||||
* | include rel and oa_status in ingest request 'extra' | Bryan Newbold | 2020-02-18 | 3 | -2/+6 |
| | |||||
* | ingest: bulk workers don't hit SPNv2 | Bryan Newbold | 2020-02-13 | 1 | -0/+2 |
| | |||||
* | pdftrio fixes from testing | Bryan Newbold | 2020-02-13 | 1 | -3/+9 |
| | |||||
* | move pdf_trio results back under key in JSON/Kafka | Bryan Newbold | 2020-02-13 | 3 | -22/+49 |
| | |||||
* | pdftrio JSON object as top-level in Kafka results | Bryan Newbold | 2020-02-12 | 1 | -16/+16 |
| | To be the same as GROBID results. | | | | |
* | pdftrio: small fixes from testing | Bryan Newbold | 2020-02-12 | 1 | -2/+2 |
| | |||||
* | pdftrio basic python code | Bryan Newbold | 2020-02-12 | 8 | -3/+395 |
| | This is basically just a copy/paste of GROBID code, only simpler! | | | | |
* | add minio.conf | Bryan Newbold | 2020-02-12 | 1 | -0/+14 |
| | |||||
* | dump_regrobid_pdf_petabox.sql script | Bryan Newbold | 2020-02-12 | 1 | -0/+15 |
| | |||||
* | sandcrawler-db extra stats | Bryan Newbold | 2020-02-12 | 1 | -0/+42 |
| | |||||
* | jan 2020 bulk ingest notes | Bryan Newbold | 2020-02-12 | 1 | -0/+26 |
| | |||||
* | pdftrio proposal and start on schema+kafka | Bryan Newbold | 2020-02-12 | 3 | -0/+122 |
| | |||||
* | add notes on recent ingest and backfill tasks | Bryan Newbold | 2020-02-05 | 3 | -0/+221 |
| | |||||
* | add ingestrequest_row2json.py | Bryan Newbold | 2020-02-05 | 1 | -0/+48 |
| | |||||
* | fix persist bug where ingest_request_source not saved | Bryan Newbold | 2020-02-05 | 1 | -0/+1 |
| | |||||
* | fix bug where ingest_request extra fields not persisted | Bryan Newbold | 2020-02-05 | 1 | -1/+2 |
| | |||||
* | handle alternative dt format in WARC headers | Bryan Newbold | 2020-02-05 | 1 | -2/+4 |
| | If there is a UTC timestamp, with trailing 'Z' indicating timezone, that is valid but increases string length by one. | | | | |
* | decrease SPNv2 polling timeout to 3 minutes | Bryan Newbold | 2020-02-05 | 1 | -2/+2 |
| | |||||
* | improvements to reliability from prod testing | Bryan Newbold | 2020-02-03 | 2 | -7/+20 |
| | |||||
* | hack-y backoff ingest attempt | Bryan Newbold | 2020-02-03 | 2 | -3/+26 |
| | The goal here is to have SPNv2 requests back off when we get back-pressure (usually caused by some sessions taking too long). Lack of proper back-pressure is making it hard to turn up parallelism. This is a hack because we still time out and drop the slow request. A better way is probably to have a background thread run, while the KafkaPusher thread does polling. Maybe with timeouts to detect slow processing (greater than 30 seconds?) and only pause/resume in that case. This would also make taking batches easier. Unlike the existing code, however, the parallelism needs to happen at the Pusher level to do the polling (Kafka) and "await" (for all worker threads to complete) correctly. | | | | |
* | more random sandcrawler-db queries | Bryan Newbold | 2020-02-03 | 2 | -32/+62 |
| | |||||
* | grobid petabox: fix fetch body/content | Bryan Newbold | 2020-02-03 | 1 | -1/+1 |
| | |||||
* | more SQL commands | Bryan Newbold | 2020-02-02 | 1 | -0/+15 |
| | |||||
* | wayback: try to resolve HTTPException due to many HTTP headers | Bryan Newbold | 2020-02-02 | 1 | -1/+9 |
| | This is within GWB wayback code. Trying two things: (1) bump default max headers from 100 to 1000 in the (global?) http.client module itself (I didn't think through whether we would expect this to actually work); (2) catch the exception, record it, move on. | | | | |
* | sandcrawler_worker: ingest worker distinct consumer groups | Bryan Newbold | 2020-01-29 | 1 | -1/+3 |
| | I'm in the process of resetting these consumer groups, so might as well take the opportunity to split by topic and use the new canonical naming format. | | | | |
* | 2020q1 fulltext ingest plans | Bryan Newbold | 2020-01-29 | 1 | -0/+272 |
| | |||||
* | grobid worker: catch PetaboxError also | Bryan Newbold | 2020-01-28 | 1 | -2/+2 |
| | |||||
* | worker kafka setting tweaks | Bryan Newbold | 2020-01-28 | 1 | -2/+4 |
| | These are all attempts to get kafka workers operating more smoothly. | | | | |
* | make grobid-extract worker batch size 1 | Bryan Newbold | 2020-01-28 | 1 | -0/+1 |
| | This is part of attempts to fix Kafka errors that look like they might be timeouts. | | | | |
* | sql stats: typo fix | Bryan Newbold | 2020-01-28 | 1 | -1/+1 |
| |
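
The 2020-02-05 entry "handle alternative dt format in WARC headers" notes that a UTC timestamp with a trailing 'Z' is valid but one character longer. A minimal sketch of that kind of normalization follows. It assumes the header value is an ISO 8601 string such as `2020-02-05T12:34:56` or `2020-02-05T12:34:56Z`, and that the desired output is a 14-digit wayback-style timestamp. The function name `warc_date_to_dt` is hypothetical and is not taken from the repository; the actual change may look quite different.

```python
from datetime import datetime


def warc_date_to_dt(raw: str) -> str:
    """Convert an ISO 8601 date string to a 14-digit YYYYMMDDHHMMSS string.

    Hypothetical helper (not from the repository): accepts the value with
    or without a trailing 'Z'.
    """
    value = raw.strip()
    # A trailing 'Z' marks UTC; it is valid, but it makes the string one
    # character longer (20 instead of 19), so drop it before parsing.
    if value.endswith("Z"):
        value = value[:-1]
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")
    return parsed.strftime("%Y%m%d%H%M%S")
```

Stripping the suffix instead of slicing to a fixed length keeps the helper from depending on the exact string length, which is the failure mode the commit message describes.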