Commit message (Collapse) | Author | Age | Files | Lines | ||
---|---|---|---|---|---|---|
... | ||||||
* | ingest: make content-decoding more robust | Bryan Newbold | 2020-03-03 | 1 | -1/+2 | |
| | ||||||
* | make gzip content-encoding path more robust | Bryan Newbold | 2020-03-03 | 1 | -1/+10 | |
| | ||||||
* | ingest: crude content-encoding support | Bryan Newbold | 2020-03-02 | 1 | -1/+19 | |
| | | | | | | This perhaps should be handled in IA wrapper tool directly, instead of in ingest code. Or really, possibly a bug in wayback python library or SPN? | |||||
* | ingest: add force_recrawl flag to skip historical wayback lookup | Bryan Newbold | 2020-03-02 | 1 | -3/+5 | |
| | ||||||
* | remove protocols.io octet-stream hack | Bryan Newbold | 2020-03-02 | 1 | -6/+2 | |
| | ||||||
* | more mime normalization | Bryan Newbold | 2020-02-27 | 1 | -1/+18 | |
| | ||||||
* | ingest: narrow xhtml filter | Bryan Newbold | 2020-02-25 | 1 | -1/+1 | |
| | ||||||
* | pdftrio: tweaks to avoid connection errors | Bryan Newbold | 2020-02-24 | 1 | -1/+9 | |
| | ||||||
* | fix warc_offset -> offset | Bryan Newbold | 2020-02-24 | 1 | -1/+1 | |
| | ||||||
* | ingest: handle broken revisit records | Bryan Newbold | 2020-02-24 | 1 | -1/+4 | |
| | ||||||
* | ingest: handle missing chemrxiv tag | Bryan Newbold | 2020-02-24 | 1 | -1/+1 | |
| | ||||||
* | ingest: treat CDX lookup error as a wayback-error | Bryan Newbold | 2020-02-24 | 1 | -1/+4 | |
| | ||||||
* | ingest: more direct americanarchivist PDF url guess | Bryan Newbold | 2020-02-24 | 1 | -0/+4 | |
| | ||||||
* | ingest: make ehp.niehs.nih.gov rule more robust | Bryan Newbold | 2020-02-24 | 1 | -2/+3 | |
| | ||||||
* | small tweak to americanarchivist.org URL extraction | Bryan Newbold | 2020-02-24 | 1 | -1/+1 | |
| | ||||||
* | fetch_petabox_body: allow non-200 status code fetches | Bryan Newbold | 2020-02-24 | 1 | -2/+10 | |
| | | | | | | But only if it matches what the revisit record indicated. This is mostly to enable better revisit fetching. | |||||
* | allow fuzzy revisit matches | Bryan Newbold | 2020-02-24 | 1 | -1/+26 | |
| | ||||||
* | ingest: more revisit fixes | Bryan Newbold | 2020-02-22 | 1 | -4/+4 | |
| | ||||||
* | html: more publisher-specific fulltext extraction tricks | Bryan Newbold | 2020-02-22 | 1 | -0/+47 | |
| | ||||||
* | ia: improve warc/revisit implementation | Bryan Newbold | 2020-02-22 | 1 | -26/+46 | |
| | | | | | A lot of the terminal-bad-status seems to have been due to not handling revisits correctly. They have status_code = '-' or None. | |||||
* | html: degruyter extraction; disabled journals.lww.com | Bryan Newbold | 2020-02-22 | 1 | -0/+19 | |
| | ||||||
* | ingest: include better terminal URL/status_code/dt | Bryan Newbold | 2020-02-22 | 1 | -0/+8 | |
| | | | | Was getting a lot of "last hit" metadata for these columns. | |||||
* | ingest: skip more non-pdf, non-paper domains | Bryan Newbold | 2020-02-22 | 1 | -0/+9 | |
| | ||||||
* | cdx: handle empty/null CDX response | Bryan Newbold | 2020-02-22 | 1 | -0/+2 | |
| | | | | Sometimes seem to get empty string instead of empty JSON list | |||||
* | html: handle TypeError during bs4 parse | Bryan Newbold | 2020-02-22 | 1 | -1/+7 | |
| | ||||||
* | filter out CDX rows missing WARC playback fields | Bryan Newbold | 2020-02-19 | 1 | -0/+4 | |
| | ||||||
* | pdf_trio persist fixes from prod | Bryan Newbold | 2020-02-19 | 2 | -5/+9 | |
| | ||||||
* | allow <meta property=citation_pdf_url> | Bryan Newbold | 2020-02-18 | 1 | -0/+3 | |
| | | | | at least researchgate does this (!) | |||||
* | X-Archive-Src more robust than X-Archive-Redirect-Reason | Bryan Newbold | 2020-02-18 | 1 | -2/+3 | |
| | ||||||
* | wayback: on bad redirects, log instead of assert | Bryan Newbold | 2020-02-18 | 1 | -2/+13 | |
| | | | | This is a different form of mangled redirect. | |||||
* | attempt to work around corrupt ARC files from alexa issue | Bryan Newbold | 2020-02-18 | 1 | -0/+5 | |
| | ||||||
* | unpaywall2ingestrequest transform script | Bryan Newbold | 2020-02-18 | 2 | -1/+104 | |
| | ||||||
* | pdftrio: mode controlled by CLI arg | Bryan Newbold | 2020-02-18 | 2 | -10/+14 | |
| | ||||||
* | pdftrio: fix error nesting in pdftrio key | Bryan Newbold | 2020-02-18 | 1 | -12/+20 | |
| | ||||||
* | include rel and oa_status in ingest request 'extra' | Bryan Newbold | 2020-02-18 | 2 | -2/+2 | |
| | ||||||
* | ingest: bulk workers don't hit SPNv2 | Bryan Newbold | 2020-02-13 | 1 | -0/+2 | |
| | ||||||
* | pdftrio fixes from testing | Bryan Newbold | 2020-02-13 | 1 | -3/+9 | |
| | ||||||
* | move pdf_trio results back under key in JSON/Kafka | Bryan Newbold | 2020-02-13 | 2 | -7/+31 | |
| | ||||||
* | pdftrio: small fixes from testing | Bryan Newbold | 2020-02-12 | 1 | -2/+2 | |
| | ||||||
* | pdftrio basic python code | Bryan Newbold | 2020-02-12 | 7 | -1/+393 | |
| | | | | This is basically just a copy/paste of GROBID code, only simpler! | |||||
* | add ingestrequest_row2json.py | Bryan Newbold | 2020-02-05 | 1 | -0/+48 | |
| | ||||||
* | fix persist bug where ingest_request_source not saved | Bryan Newbold | 2020-02-05 | 1 | -0/+1 | |
| | ||||||
* | fix bug where ingest_request extra fields not persisted | Bryan Newbold | 2020-02-05 | 1 | -1/+2 | |
| | ||||||
* | handle alternative dt format in WARC headers | Bryan Newbold | 2020-02-05 | 1 | -2/+4 | |
| | | | | | If there is a UTC timestamp, with trailing 'Z' indicating timezone, that is valid but increases string length by one. | |||||
* | decrease SPNv2 polling timeout to 3 minutes | Bryan Newbold | 2020-02-05 | 1 | -2/+2 | |
| | ||||||
* | improvements to reliability from prod testing | Bryan Newbold | 2020-02-03 | 2 | -7/+20 | |
| | ||||||
* | hack-y backoff ingest attempt | Bryan Newbold | 2020-02-03 | 2 | -3/+26 | |
| | | | | | | | | | | | | | | | The goal here is to have SPNv2 requests back off when we get back-pressure (usually caused by some sessions taking too long). Lack of proper back-pressure is making it hard to turn up parallelism. This is a hack because we still time out and drop the slow request. A better way is probably to have a background thread run, while the KafkaPusher thread does polling. Maybe with timeouts to detect slow processing (greater than 30 seconds?) and only pause/resume in that case. This would also make taking batches easier. Unlike the existing code, however, the parallelism needs to happen at the Pusher level to do the polling (Kafka) and "await" (for all worker threads to complete) correctly. | |||||
* | grobid petabox: fix fetch body/content | Bryan Newbold | 2020-02-03 | 1 | -1/+1 | |
| | ||||||
* | wayback: try to resolve HTTPException due to many HTTP headers | Bryan Newbold | 2020-02-02 | 1 | -1/+9 | |
| | | | | | | | | | This is within GWB wayback code. Trying two things: (1) bump default max headers from 100 to 1000 in the (global?) http.client module itself (I didn't think through whether we would expect this to actually work); (2) catch the exception, record it, move on. | |||||
* | sandcrawler_worker: ingest worker distinct consumer groups | Bryan Newbold | 2020-01-29 | 1 | -1/+3 | |
| | | | | | | I'm in the process of resetting these consumer groups, so might as well take the opportunity to split by topic and use the new canonical naming format. |
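
The 2020-02-05 entry "handle alternative dt format in WARC headers" notes that a UTC timestamp with a trailing 'Z' is valid but one character longer than the form without it. A minimal sketch of tolerating both forms is below. The function name `warc_date_to_dt` and the conversion to a 14-digit wayback-style timestamp are assumptions made for illustration; the log does not show the actual change.

```python
from datetime import datetime


def warc_date_to_dt(warc_date: str) -> str:
    """Convert a WARC header date into a 14-digit timestamp string.

    Hypothetical helper, not taken from the repository. Accepts both
    "2020-02-05T12:34:56" (19 characters) and the UTC form with a
    trailing "Z", "2020-02-05T12:34:56Z" (20 characters).
    """
    # Drop the optional trailing "Z" so both forms parse the same way.
    if warc_date.endswith("Z"):
        warc_date = warc_date[:-1]
    # strptime raises ValueError for anything that is not this exact layout.
    parsed = datetime.strptime(warc_date, "%Y-%m-%dT%H:%M:%S")
    return parsed.strftime("%Y%m%d%H%M%S")
```

Stripping the suffix rather than checking for a fixed string length means either form is accepted, while malformed values still fail loudly with a `ValueError`.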