Commit message | Author | Age | Files | Lines
* ingest: include better terminal URL/status_code/dt | Bryan Newbold | 2020-02-22 | 1 | -0/+8
|   Was getting a lot of "last hit" metadata for these columns.
* ingest: skip more non-pdf, non-paper domains | Bryan Newbold | 2020-02-22 | 1 | -0/+9
|
* cdx: handle empty/null CDX response | Bryan Newbold | 2020-02-22 | 1 | -0/+2
|   Sometimes seem to get empty string instead of empty JSON list
* html: handle TypeError during bs4 parse | Bryan Newbold | 2020-02-22 | 1 | -1/+7
|
* filter out CDX rows missing WARC playback fields | Bryan Newbold | 2020-02-19 | 1 | -0/+4
|
* pdf_trio persist fixes from prod | Bryan Newbold | 2020-02-19 | 2 | -5/+9
|
* allow <meta property=citation_pdf_url> | Bryan Newbold | 2020-02-18 | 1 | -0/+3
|   at least researchgate does this (!)
* X-Archive-Src more robust than X-Archive-Redirect-Reason | Bryan Newbold | 2020-02-18 | 1 | -2/+3
|
* move edit_extra path to top-level | Bryan Newbold | 2020-02-18 | 1 | -2/+1
|
* wayback: on bad redirects, log instead of assert | Bryan Newbold | 2020-02-18 | 1 | -2/+13
|   This is a different form of mangled redirect.
* attempt to work around corrupt ARC files from alexa issue | Bryan Newbold | 2020-02-18 | 1 | -0/+5
|
* unpaywall2ingestrequest transform script | Bryan Newbold | 2020-02-18 | 2 | -1/+104
|
* pdftrio: mode controlled by CLI arg | Bryan Newbold | 2020-02-18 | 2 | -10/+14
|
* pdftrio: fix error nesting in pdftrio key | Bryan Newbold | 2020-02-18 | 1 | -12/+20
|
* include rel and oa_status in ingest request 'extra' | Bryan Newbold | 2020-02-18 | 3 | -2/+6
|
* ingest: bulk workers don't hit SPNv2 | Bryan Newbold | 2020-02-13 | 1 | -0/+2
|
* pdftrio fixes from testing | Bryan Newbold | 2020-02-13 | 1 | -3/+9
|
* move pdf_trio results back under key in JSON/Kafka | Bryan Newbold | 2020-02-13 | 3 | -22/+49
|
* pdftrio JSON object as top-level in Kafka results | Bryan Newbold | 2020-02-12 | 1 | -16/+16
|   To be same as GROBID results
* pdftrio: small fixes from testing | Bryan Newbold | 2020-02-12 | 1 | -2/+2
|
* pdftrio basic python code | Bryan Newbold | 2020-02-12 | 8 | -3/+395
|   This is basically just a copy/paste of GROBID code, only simpler!
* add minio.conf | Bryan Newbold | 2020-02-12 | 1 | -0/+14
|
* dump_regrobid_pdf_petabox.sql script | Bryan Newbold | 2020-02-12 | 1 | -0/+15
|
* sandcrawler-db extra stats | Bryan Newbold | 2020-02-12 | 1 | -0/+42
|
* jan 2020 bulk ingest notes | Bryan Newbold | 2020-02-12 | 1 | -0/+26
|
* pdftrio proposal and start on schema+kafka | Bryan Newbold | 2020-02-12 | 3 | -0/+122
|
* add notes on recent ingest and backfill tasks | Bryan Newbold | 2020-02-05 | 3 | -0/+221
|
* add ingestrequest_row2json.py | Bryan Newbold | 2020-02-05 | 1 | -0/+48
|
* fix persist bug where ingest_request_source not saved | Bryan Newbold | 2020-02-05 | 1 | -0/+1
|
* fix bug where ingest_request extra fields not persisted | Bryan Newbold | 2020-02-05 | 1 | -1/+2
|
* handle alternative dt format in WARC headers | Bryan Newbold | 2020-02-05 | 1 | -2/+4
|   If there is a UTC timestamp, with trailing 'Z' indicating timezone, that is valid but increases string length by one.
* decrease SPNv2 polling timeout to 3 minutes | Bryan Newbold | 2020-02-05 | 1 | -2/+2
|
* improvements to reliability from prod testing | Bryan Newbold | 2020-02-03 | 2 | -7/+20
|
* hack-y backoff ingest attempt | Bryan Newbold | 2020-02-03 | 2 | -3/+26
|   The goal here is to have SPNv2 requests back off when we get back-pressure (usually caused by some sessions taking too long). Lack of proper back-pressure is making it hard to turn up parallelism.
|
|   This is a hack because we still time out and drop the slow request. A better way is probably to have a background thread run, while the KafkaPusher thread does polling. Maybe with timeouts to detect slow processing (greater than 30 seconds?) and only pause/resume in that case. This would also make taking batches easier. Unlike the existing code, however, the parallelism needs to happen at the Pusher level to do the polling (Kafka) and "await" (for all worker threads to complete) correctly.
* more random sandcrawler-db queries | Bryan Newbold | 2020-02-03 | 2 | -32/+62
|
* grobid petabox: fix fetch body/content | Bryan Newbold | 2020-02-03 | 1 | -1/+1
|
* more SQL commands | Bryan Newbold | 2020-02-02 | 1 | -0/+15
|
* wayback: try to resolve HTTPException due to many HTTP headers | Bryan Newbold | 2020-02-02 | 1 | -1/+9
|   This is within GWB wayback code. Trying two things:
|   - bump default max headers from 100 to 1000 in the (global?) http.client module itself. I didn't think through whether we would expect this to actually work
|   - catch the exception, record it, move on
* sandcrawler_worker: ingest worker distinct consumer groups | Bryan Newbold | 2020-01-29 | 1 | -1/+3
|   I'm in the process of resetting these consumer groups, so might as well take the opportunity to split by topic and use the new canonical naming format.
* 2020q1 fulltext ingest plans | Bryan Newbold | 2020-01-29 | 1 | -0/+272
|
* grobid worker: catch PetaboxError also | Bryan Newbold | 2020-01-28 | 1 | -2/+2
|
* worker kafka setting tweaks | Bryan Newbold | 2020-01-28 | 1 | -2/+4
|   These are all attempts to get kafka workers operating more smoothly.
* make grobid-extract worker batch size 1 | Bryan Newbold | 2020-01-28 | 1 | -0/+1
|   This is part of attempts to fix Kafka errors that look like they might be timeouts.
* sql stats: typo fix | Bryan Newbold | 2020-01-28 | 1 | -1/+1
|
* sql howto: database dumps | Bryan Newbold | 2020-01-28 | 1 | -0/+7
|
* workers: yes, poll is necessary | Bryan Newbold | 2020-01-28 | 1 | -1/+1
|
* grobid worker: always set a key in response | Bryan Newbold | 2020-01-28 | 1 | -4/+25
|   We have key-based compaction enabled for the GROBID output topic. This means it is an error to publish to that topic without a key set.
|
|   Hopefully this change will end these errors, which look like:
|
|       KafkaError{code=INVALID_MSG,val=2,str="Broker: Invalid message"}
* fix kafka worker partition-specific error | Bryan Newbold | 2020-01-28 | 1 | -1/+1
|
* fix WaybackError exception formatting | Bryan Newbold | 2020-01-28 | 1 | -1/+1
|
* fix elif syntax error | Bryan Newbold | 2020-01-28 | 1 | -1/+1
|
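
The 2020-02-05 commit on WARC headers notes that a UTC timestamp with a trailing 'Z' is valid but one character longer than the form without it. A minimal sketch of tolerating both lengths might look like the following. The function name `warc_date_to_dt14` and the 14-digit output format are assumptions made for illustration and are not taken from the repository.

```python
from datetime import datetime


def warc_date_to_dt14(raw: str) -> str:
    """Convert an ISO-style WARC date header to a 14-digit timestamp string.

    Hypothetical helper: accepts "YYYY-MM-DDTHH:MM:SS" (19 chars) and the
    same value with a trailing "Z" (20 chars).
    """
    raw = raw.strip()
    # A trailing 'Z' marks UTC; it is valid, but adds one character.
    if len(raw) == 20 and raw.endswith("Z"):
        raw = raw[:-1]
    if len(raw) != 19:
        raise ValueError(f"unexpected WARC date length: {raw!r}")
    parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S")
    return parsed.strftime("%Y%m%d%H%M%S")
```

Checking the length after stripping the suffix keeps a strict length check in place while still accepting both header variants.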
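
The 2020-02-02 wayback commit describes two workarounds for responses with too many HTTP headers: raising the header limit in `http.client`, and catching the exception so it can be recorded. The sketch below shows roughly what both could look like. `_MAXHEADERS` is a private, module-level constant in CPython's `http.client`, so changing it is process-wide and not a supported API. The `fetch_or_record` wrapper, the `fetch` callable, and the result dict shape are hypothetical.

```python
import http.client

# CPython's http.client stops parsing after _MAXHEADERS response headers
# (100 by default) and raises HTTPException. This is a private constant,
# and overriding it affects every connection in the process.
http.client._MAXHEADERS = 1000


def fetch_or_record(fetch, url):
    """Call the hypothetical `fetch(url)`; record HTTPException instead of raising."""
    try:
        return {"status": "success", "body": fetch(url)}
    except http.client.HTTPException as exc:
        # Record the failure and move on rather than crashing the worker.
        return {"status": "error", "error": repr(exc)}
```

Because the override is global, it is probably best set once at import time in the module that owns the HTTP fetching, which matches the "(global?)" caveat in the commit message.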