Commit log, path: root/python

Each entry below shows: subject (author, date; files changed, lines -removed/+added).

* grobid persist: if status_code is not set, default to 0 [bnewbold-persist-grobid-errors] (Bryan Newbold, 2020-01-28; 2 files, -7/+2)

    We have to set something currently because of a NOT NULL constraint on
    the table. Originally I thought we would just not record rows if there
    was an error, and that is still sort of a valid stance. However, when
    doing bulk GROBID-ing from the cdx table, there exist some "bad" CDX
    rows which cause wayback or petabox errors. We should fix bugs or
    delete these rows as a cleanup, but until that happens we should record
    the error state so we don't loop forever. One danger of this commit is
    that we can rapidly clobber existing good rows with new errors if there
    is wayback downtime or something like that.

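A minimal sketch of the defaulting described above (function name and row shape are illustrative, not the repo's actual code):

    # Hypothetical helper: default status_code before insert, since the
    # table has a NOT NULL constraint on that column.
    def prepare_grobid_row(result):
        row = dict(result)
        if row.get('status_code') is None:
            # "bad" CDX rows that hit wayback/petabox errors have no HTTP
            # status; record the error state with a placeholder so the row
            # still satisfies the NOT NULL constraint
            row['status_code'] = 0
        return row
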
* workers: yes, poll is necessary (Bryan Newbold, 2020-01-28; 1 file, -1/+1)
* grobid worker: always set a key in response (Bryan Newbold, 2020-01-28; 1 file, -4/+25)

    We have key-based compaction enabled for the GROBID output topic. This
    means it is an error to publish to that topic without a key set.
    Hopefully this change will end these errors, which look like:

    KafkaError{code=INVALID_MSG,val=2,str="Broker: Invalid message"}

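A sketch of the shape of this fix, assuming the confluent-kafka client that the quoted librdkafka error suggests; the topic name and key fallback here are illustrative:

    import json
    from confluent_kafka import Producer

    producer = Producer({'bootstrap.servers': 'localhost:9092'})

    def publish_grobid_result(result):
        # never produce without a key: the topic is compacted, and
        # key-less messages are rejected with "Broker: Invalid message"
        key = result.get('key') or result.get('sha1hex') or 'unknown'
        producer.produce('grobid-output', key=key, value=json.dumps(result))
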
* fix kafka worker partition-specific error (Bryan Newbold, 2020-01-28; 1 file, -1/+1)
* fix WaybackError exception formatting (Bryan Newbold, 2020-01-28; 1 file, -1/+1)
* fix elif syntax error (Bryan Newbold, 2020-01-28; 1 file, -1/+1)
* block springer page-one domain (Bryan Newbold, 2020-01-28; 1 file, -0/+3)
* clarify petabox fetch behavior (Bryan Newbold, 2020-01-28; 1 file, -3/+6)
* re-enable figshare and zenodo crawling (Bryan Newbold, 2020-01-21; 1 file, -8/+0)

    For daily imports.

* persist grobid: actually, status_code is required (Bryan Newbold, 2020-01-21; 2 files, -3/+10)

    Instead of working around a missing value, force it to exist, but skip
    it in the database insert section. Disk mode still needs to check
    whether it is blank.

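A rough sketch of that split, with an illustrative table and column layout (the real schema is not shown in this log):

    def insert_grobid_rows(cursor, results):
        # status_code is required upstream; rows still missing it are
        # skipped at insert time rather than worked around
        rows = [r for r in results if r.get('status_code') is not None]
        cursor.executemany(
            "INSERT INTO grobid (sha1hex, status_code, status) "
            "VALUES (%s, %s, %s)",
            [(r['sha1hex'], r['status_code'], r.get('status')) for r in rows],
        )
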
* ingest: check for null-body before file_meta (Bryan Newbold, 2020-01-21; 1 file, -0/+3)

    gen_file_metadata raises an AssertionError if body is None (or false-y
    in general).

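A minimal sketch of the guard; the sha1 computation stands in for the real gen_file_metadata() call named in the commit, and the status strings are illustrative:

    import hashlib

    def process_body(body):
        # gen_file_metadata asserts on false-y input, so short-circuit
        # with a distinct status before calling it
        if not body:
            return dict(status='null-body')
        return dict(status='success', sha1hex=hashlib.sha1(body).hexdigest())
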
* wayback: replay redirects have X-Archive-Redirect-Reason (Bryan Newbold, 2020-01-21; 1 file, -2/+4)
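Per the subject line, wayback replay responses for archived redirects carry this header; a hedged sketch of how a client could use that fact:

    def is_replayed_redirect(headers):
        # wayback replay of an archived redirect includes this header,
        # letting clients tell a replayed 3xx capture apart from an
        # ordinary replay response
        return 'X-Archive-Redirect-Reason' in headers
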
* persist: work around GROBID timeouts with no status_code (Bryan Newbold, 2020-01-21; 2 files, -3/+3)
* grobid: fix error_msg typo; set status_code for timeouts (Bryan Newbold, 2020-01-21; 1 file, -1/+2)
* add 200 second timeout to GROBID requests (Bryan Newbold, 2020-01-17; 1 file, -8/+15)
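A sketch of the bounded call, assuming GROBID's standard processFulltextDocument endpoint; the error dict shape and sentinel status_code are illustrative (see the "set status_code for timeouts" commit above):

    import requests

    def grobid_process_fulltext(pdf_bytes):
        try:
            resp = requests.post(
                'http://localhost:8070/api/processFulltextDocument',
                files={'input': pdf_bytes},
                timeout=200.0,  # the 200-second ceiling from the commit
            )
        except requests.exceptions.Timeout:
            # timeouts also get a status_code (sentinel is illustrative)
            return dict(status='error-timeout', status_code=-1,
                        error_msg='GROBID request timeout')
        return dict(status_code=resp.status_code, tei_xml=resp.text)
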
* add SKIP log line for skip-url-blocklist path (Bryan Newbold, 2020-01-17; 1 file, -0/+1)
* ingest: add URL blocklist feature (Bryan Newbold, 2020-01-17; 2 files, -4/+49)

    And, temporarily, block zenodo and figshare.

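A minimal sketch of a domain blocklist check of the kind described; the actual list and matching rules live in the ingest worker, and these entries just reflect the commit message:

    from urllib.parse import urlparse

    DOMAIN_BLOCKLIST = [
        # temporarily blocked, per the commit message
        'zenodo.org',
        'figshare.com',
    ]

    def url_is_blocked(url):
        domain = urlparse(url).netloc.lower()
        return any(domain == d or domain.endswith('.' + d)
                   for d in DOMAIN_BLOCKLIST)
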
* handle UnicodeDecodeError in the other GET instance (Bryan Newbold, 2020-01-15; 1 file, -0/+2)
* increase SPNv2 polling timeout to 4 minutes (Bryan Newbold, 2020-01-15; 1 file, -1/+3)
* make failed replay fetch an error, not an assert error (Bryan Newbold, 2020-01-15; 1 file, -1/+2)
* improve sentry reporting with 'release' git hash (Bryan Newbold, 2020-01-15; 2 files, -2/+5)
* wayback replay: catch UnicodeDecodeError (Bryan Newbold, 2020-01-15; 1 file, -0/+2)

    In prod, ran into a redirect URL like:

    b'/web/20200116043630id_/https://mediarep.org/bitstream/handle/doc/1127/Barth\xe9l\xe9my_2015_Life_and_Technology.pdf;jsessionid=A9EFB2798846F5E14A8473BBFD6AB46C?sequence=1'

    which broke requests.

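A sketch of the wrap: a non-ASCII redirect location like the mediarep.org example above surfaces as a handled error result instead of an uncaught exception (function and result shape are illustrative):

    import requests

    def fetch_replay(url):
        try:
            resp = requests.get(url)
        except UnicodeDecodeError:
            # e.g. a Location header containing raw Latin-1 bytes
            return dict(status='wayback-error',
                        error_msg='UnicodeDecodeError during replay fetch')
        return dict(status_code=resp.status_code, body=resp.content)
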
* persist: fix dupe field copying (Bryan Newbold, 2020-01-15; 1 file, -1/+8)

    In testing, hit: AttributeError: 'str' object has no attribute 'get'

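The log does not show the exact fields involved; a hedged sketch of the pattern that avoids calling .get() on a string (names hypothetical):

    def copy_terminal_fields(raw):
        out = {}
        terminal = raw.get('terminal')
        # a bare string can appear here; only dicts have .get()
        if isinstance(terminal, dict):
            out['terminal_url'] = terminal.get('terminal_url')
            out['terminal_sha1hex'] = terminal.get('terminal_sha1hex')
        return out
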
* persist worker: implement updated ingest result semantics (Bryan Newbold, 2020-01-15; 2 files, -12/+17)
* clarify ingest result schema and semantics (Bryan Newbold, 2020-01-15; 3 files, -7/+32)
* pass through revisit_cdx (Bryan Newbold, 2020-01-15; 2 files, -5/+21)
* fix revisit resolution (Bryan Newbold, 2020-01-15; 1 file, -4/+12)

    Returns the *original* CDX record, but keeps the terminal_url and
    terminal_sha1hex info.

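A sketch of that contract: resolve a revisit back to the original capture's CDX row while preserving where the fetch actually terminated (the dataclass shape is illustrative, not the repo's types):

    from dataclasses import dataclass

    @dataclass
    class ResourceResult:
        cdx: dict             # the *original* CDX record (real body/sha1)
        terminal_url: str     # URL of the revisit capture actually fetched
        terminal_sha1hex: str

    def resolve_revisit(revisit_cdx, original_cdx):
        return ResourceResult(
            cdx=original_cdx,
            terminal_url=revisit_cdx['url'],
            terminal_sha1hex=revisit_cdx['sha1hex'],
        )
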
* add postgrest checks to test mocks (Bryan Newbold, 2020-01-14; 1 file, -1/+9)
* tests: don't use localhost as a responses mock host (Bryan Newbold, 2020-01-14; 2 files, -6/+6)
* bulk ingest file request topic support (Bryan Newbold, 2020-01-14; 1 file, -1/+7)
* ingest: sketch out more of how 'existing' path would work (Bryan Newbold, 2020-01-14; 1 file, -8/+22)
* ingest: check existing GROBID; also push results to sink (Bryan Newbold, 2020-01-14; 1 file, -4/+22)
* ingest persist skips 'existing' ingest results (Bryan Newbold, 2020-01-14; 1 file, -0/+3)
* grobid-to-kafka support in ingest worker (Bryan Newbold, 2020-01-14; 1 file, -0/+6)
* grobid worker fixes for newer ia lib refactors (Bryan Newbold, 2020-01-14; 1 file, -3/+9)
* small fixups to SandcrawlerPostgrestClient (Bryan Newbold, 2020-01-14; 2 files, -1/+11)
* filter out archive.org and web.archive.org (until implemented) (Bryan Newbold, 2020-01-14; 1 file, -1/+12)
* SPNv2 doesn't support FTP; add a live test for non-revisit FTP (Bryan Newbold, 2020-01-14; 2 files, -0/+26)
* more ftp status 226 support (Bryan Newbold, 2020-01-14; 5 files, -9/+23)
* add live tests for ftp, revisits (Bryan Newbold, 2020-01-14; 1 file, -1/+36)
* basic FTP ingest support; revisit record resolution (Bryan Newbold, 2020-01-14; 2 files, -35/+78)

    - supporting revisits means more wayback hits (fewer crawls) => faster
    - ... but this is only partial support. will also need to work through
      sandcrawler db schema, etc. current status should be safe to merge/use.
    - ftp support via treating an ftp hit as a 200

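A minimal sketch of the "treat an ftp hit as a 200" normalization; FTP records a completed transfer as status 226 ("Closing data connection"), per the ftp status 226 commits above, and the function name here is illustrative:

    def normalize_wayback_status(url, status_code):
        # FTP has no HTTP status codes; a 226 on an ftp:// URL is a
        # successful transfer, so map it to 200 for the ingest pipeline
        if url.startswith('ftp://') and status_code == 226:
            return 200
        return status_code
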
* arabesque2ingestrequest: ingest type flag (Bryan Newbold, 2020-01-14; 1 file, -1/+4)
* better print() output (Bryan Newbold, 2020-01-10; 2 files, -4/+4)
* fix trivial typo (Bryan Newbold, 2020-01-10; 1 file, -1/+1)
* hack/workaround for protocols.io octet PDFs (Bryan Newbold, 2020-01-10; 1 file, -2/+4)
* html extract: protocols.io, fix americanarchivist (Bryan Newbold, 2020-01-10; 1 file, -1/+7)
* fix redirect replay fetch method (Bryan Newbold, 2020-01-10; 1 file, -1/+4)
* limit length of error messages (Bryan Newbold, 2020-01-10; 1 file, -4/+4)
* handle SPNv2-then-CDX lookup failures (Bryan Newbold, 2020-01-10; 1 file, -6/+23)

    - use a 10 second delay if CDX result isn't immediately available. blech.
    - if there is a lookup failure, call it a wayback-error and move on

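A sketch of that workaround: after an SPNv2 capture, give the CDX API time to catch up, and map a still-missing record to a wayback-error result rather than crashing (the lookup callable and result shape are illustrative):

    import time

    def cdx_lookup_after_spn(cdx_lookup, url, datetime):
        row = cdx_lookup(url, datetime)
        if row is None:
            time.sleep(10.0)  # blech: CDX indexing can lag the capture
            row = cdx_lookup(url, datetime)
        if row is None:
            # call it a wayback-error and move on
            return dict(status='wayback-error',
                        error_msg='CDX lookup failed after SPNv2 capture')
        return row
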
* SPNv2 hack specifically for elsevier lookups (Bryan Newbold, 2020-01-10; 1 file, -0/+15)

    I'm not really sure why this is needed, and maybe with more careful
    testing it isn't. But it works!