Commit message (Collapse) | Author | Age | Files | Lines | ||
---|---|---|---|---|---|---|
... | ||||||
* | fix warc_offset -> offset | Bryan Newbold | 2020-02-24 | 1 | -1/+1 | |
| | ||||||
* | ingest: handle broken revisit records | Bryan Newbold | 2020-02-24 | 1 | -1/+4 | |
| | ||||||
* | recent sandcrawler-db / ingest stats (interesting) | Bryan Newbold | 2020-02-24 | 2 | -0/+488 | |
| | ||||||
* | ingest: handle missing chemrxvi tag | Bryan Newbold | 2020-02-24 | 1 | -1/+1 | |
| | ||||||
* | ingest: treat CDX lookup error as a wayback-error | Bryan Newbold | 2020-02-24 | 1 | -1/+4 | |
| | ||||||
* | ingest: more direct americanarchivist PDF url guess | Bryan Newbold | 2020-02-24 | 1 | -0/+4 | |
| | ||||||
* | ingest backfill notes | Bryan Newbold | 2020-02-24 | 3 | -0/+150 | |
| | ||||||
* | ingest: make ehp.niehs.nih.gov rule more robust | Bryan Newbold | 2020-02-24 | 1 | -2/+3 | |
| | ||||||
* | small tweak to americanarchivist.org URL extraction | Bryan Newbold | 2020-02-24 | 1 | -1/+1 | |
| | ||||||
* | fetch_petabox_body: allow non-200 status code fetches | Bryan Newbold | 2020-02-24 | 1 | -2/+10 | |
| | But only if it matches what the revisit record indicated. This is mostly to enable better revisit fetching. | |||||
* | allow fuzzy revisit matches | Bryan Newbold | 2020-02-24 | 1 | -1/+26 | |
| | ||||||
* | ingest: more revisit fixes | Bryan Newbold | 2020-02-22 | 1 | -4/+4 | |
| | ||||||
* | html: more publisher-specific fulltext extraction tricks | Bryan Newbold | 2020-02-22 | 1 | -0/+47 | |
| | ||||||
* | ia: improve warc/revisit implementation | Bryan Newbold | 2020-02-22 | 1 | -26/+46 | |
| | A lot of the terminal-bad-status seems to have been due to not handling revisits correctly. They have status_code = '-' or None. | |||||
* | html: degruyter extraction; disabled journals.lww.com | Bryan Newbold | 2020-02-22 | 1 | -0/+19 | |
| | ||||||
* | ingest: include better terminal URL/status_code/dt | Bryan Newbold | 2020-02-22 | 1 | -0/+8 | |
| | Was getting a lot of "last hit" metadata for these columns. | |||||
* | ingest: skip more non-pdf, non-paper domains | Bryan Newbold | 2020-02-22 | 1 | -0/+9 | |
| | ||||||
* | cdx: handle empty/null CDX response | Bryan Newbold | 2020-02-22 | 1 | -0/+2 | |
| | Sometimes seem to get empty string instead of empty JSON list | |||||
* | html: handle TypeError during bs4 parse | Bryan Newbold | 2020-02-22 | 1 | -1/+7 | |
| | ||||||
* | filter out CDX rows missing WARC playback fields | Bryan Newbold | 2020-02-19 | 1 | -0/+4 | |
| | ||||||
* | pdf_trio persist fixes from prod | Bryan Newbold | 2020-02-19 | 2 | -5/+9 | |
| | ||||||
* | allow <meta property=citation_pdf_url> | Bryan Newbold | 2020-02-18 | 1 | -0/+3 | |
| | at least researchgate does this (!) | |||||
* | X-Archive-Src more robust than X-Archive-Redirect-Reason | Bryan Newbold | 2020-02-18 | 1 | -2/+3 | |
| | ||||||
* | move edit_extra path to top-level | Bryan Newbold | 2020-02-18 | 1 | -2/+1 | |
| | ||||||
* | wayback: on bad redirects, log instead of assert | Bryan Newbold | 2020-02-18 | 1 | -2/+13 | |
| | This is a different form of mangled redirect. | |||||
* | attempt to work around corrupt ARC files from alexa issue | Bryan Newbold | 2020-02-18 | 1 | -0/+5 | |
| | ||||||
* | unpaywall2ingestrequest transform script | Bryan Newbold | 2020-02-18 | 2 | -1/+104 | |
| | ||||||
* | pdftrio: mode controlled by CLI arg | Bryan Newbold | 2020-02-18 | 2 | -10/+14 | |
| | ||||||
* | pdftrio: fix error nesting in pdftrio key | Bryan Newbold | 2020-02-18 | 1 | -12/+20 | |
| | ||||||
* | include rel and oa_status in ingest request 'extra' | Bryan Newbold | 2020-02-18 | 3 | -2/+6 | |
| | ||||||
* | ingest: bulk workers don't hit SPNv2 | Bryan Newbold | 2020-02-13 | 1 | -0/+2 | |
| | ||||||
* | pdftrio fixes from testing | Bryan Newbold | 2020-02-13 | 1 | -3/+9 | |
| | ||||||
* | move pdf_trio results back under key in JSON/Kafka | Bryan Newbold | 2020-02-13 | 3 | -22/+49 | |
| | ||||||
* | pdftrio JSON object as top-level in Kafka results | Bryan Newbold | 2020-02-12 | 1 | -16/+16 | |
| | To be same as GROBID results | |||||
* | pdftrio: small fixes from testing | Bryan Newbold | 2020-02-12 | 1 | -2/+2 | |
| | ||||||
* | pdftrio basic python code | Bryan Newbold | 2020-02-12 | 8 | -3/+395 | |
| | This is basically just a copy/paste of GROBID code, only simpler! | |||||
* | add minio.conf | Bryan Newbold | 2020-02-12 | 1 | -0/+14 | |
| | ||||||
* | dump_regrobid_pdf_petabox.sql script | Bryan Newbold | 2020-02-12 | 1 | -0/+15 | |
| | ||||||
* | sandcrawler-db extra stats | Bryan Newbold | 2020-02-12 | 1 | -0/+42 | |
| | ||||||
* | jan 2020 bulk ingest notes | Bryan Newbold | 2020-02-12 | 1 | -0/+26 | |
| | ||||||
* | pdftrio proposal and start on schema+kafka | Bryan Newbold | 2020-02-12 | 3 | -0/+122 | |
| | ||||||
* | add notes on recent ingest and backfill tasks | Bryan Newbold | 2020-02-05 | 3 | -0/+221 | |
| | ||||||
* | add ingestrequest_row2json.py | Bryan Newbold | 2020-02-05 | 1 | -0/+48 | |
| | ||||||
* | fix persist bug where ingest_request_source not saved | Bryan Newbold | 2020-02-05 | 1 | -0/+1 | |
| | ||||||
* | fix bug where ingest_request extra fields not persisted | Bryan Newbold | 2020-02-05 | 1 | -1/+2 | |
| | ||||||
* | handle alternative dt format in WARC headers | Bryan Newbold | 2020-02-05 | 1 | -2/+4 | |
| | If there is a UTC timestamp, with trailing 'Z' indicating timezone, that is valid but increases string length by one. | |||||
* | decrease SPNv2 polling timeout to 3 minutes | Bryan Newbold | 2020-02-05 | 1 | -2/+2 | |
| | ||||||
* | improvements to reliability from prod testing | Bryan Newbold | 2020-02-03 | 2 | -7/+20 | |
| | ||||||
* | hack-y backoff ingest attempt | Bryan Newbold | 2020-02-03 | 2 | -3/+26 | |
| | The goal here is to have SPNv2 requests back off when we get back-pressure (usually caused by some sessions taking too long). Lack of proper back-pressure is making it hard to turn up parallelism. This is a hack because we still time out and drop the slow request. A better way is probably to have a background thread run, while the KafkaPusher thread does polling. Maybe with timeouts to detect slow processing (greater than 30 seconds?) and only pause/resume in that case. This would also make taking batches easier. Unlike the existing code, however, the parallelism needs to happen at the Pusher level to do the polling (Kafka) and "await" (for all worker threads to complete) correctly. | |||||
* | more random sandcrawler-db queries | Bryan Newbold | 2020-02-03 | 2 | -32/+62 | |
| |
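
The "handle alternative dt format in WARC headers" entry above notes that a UTC timestamp with a trailing 'Z' is valid but one character longer than the form without it. A minimal sketch of that idea follows; it is not the code from the commit. The function name `normalize_warc_dt` is hypothetical, and the sketch assumes the header value looks like `2020-02-05T12:34:56` with an optional trailing `Z`, and that the desired output is a 14-digit timestamp.

```python
from datetime import datetime


def normalize_warc_dt(raw: str) -> str:
    """Convert an ISO 8601 style timestamp to a 14-digit timestamp string.

    Hypothetical helper: the name, input format, and output format are
    assumptions for illustration, not taken from the repository.
    """
    # A trailing 'Z' marks UTC; it is valid, but makes the string one
    # character longer, so strip it before parsing.
    if raw.endswith("Z"):
        raw = raw[:-1]
    parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S")
    return parsed.strftime("%Y%m%d%H%M%S")
```

Stripping the suffix before parsing keeps a single parse format for both variants, rather than branching on string length.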