path: root/python/sandcrawler
Commit log (commit message, author, date, files changed, lines -/+):
* url cleaning (canonicalization) for ingest base_url (Bryan Newbold, 2020-03-10; 3 files, -3/+14)

  As mentioned in a comment, this first version does not re-write the URL in the `base_url` field. If we did so, then ingest_request rows would not SQL JOIN to ingest_file_result rows, which we wouldn't want. In the future, the behaviour should maybe be to refuse to process URLs that aren't clean (eg, if base_url != clean_url(base_url)) and return a 'bad-url' status or something. Then we would only accept clean URLs in both tables, and clear out all old/bad URLs with a cleanup script.
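The "refuse unclean URLs" behaviour described above could be sketched roughly as follows. This is a hypothetical illustration, not the actual sandcrawler code: `clean_url()` here is a crude stand-in for whatever canonicalization the ingest code really uses.

```python
# Hypothetical sketch of the proposed 'bad-url' check; clean_url() is
# a stand-in for the real canonicalization function.
from urllib.parse import urlsplit, urlunsplit

def clean_url(url: str) -> str:
    """Very crude canonicalization: lowercase scheme/host, drop fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))

def check_base_url(base_url: str) -> dict:
    # Return a 'bad-url' status instead of silently rewriting, so that
    # ingest_request and ingest_file_result rows keep JOINing on base_url.
    if base_url != clean_url(base_url):
        return {"status": "bad-url", "base_url": base_url}
    return {"status": "ok", "base_url": base_url}
```

The key design point is that the stored `base_url` is never mutated; unclean input is rejected up front rather than rewritten after the fact.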
* fixes to ingest-request persist (Bryan Newbold, 2020-03-05; 1 file, -3/+1)

* persist: ingest_request tool (with no ingest_file_result) (Bryan Newbold, 2020-03-05; 2 files, -1/+30)

* ia: catch wayback ChunkedEncodingError (Bryan Newbold, 2020-03-05; 1 file, -0/+3)

* ingest: make content-decoding more robust (Bryan Newbold, 2020-03-03; 1 file, -1/+2)

* make gzip content-encoding path more robust (Bryan Newbold, 2020-03-03; 1 file, -1/+10)
* ingest: crude content-encoding support (Bryan Newbold, 2020-03-02; 1 file, -1/+19)

  This should perhaps be handled in the IA wrapper tool directly, instead of in ingest code. Or really, it is possibly a bug in the wayback python library or SPN?
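Crude content-encoding handling along these lines could look like the sketch below. This is illustrative only (the function name and signature are assumptions, not the actual sandcrawler API): inflate gzip-encoded bodies, and fall back to the raw bytes if the header turns out to be wrong or the body was already decoded upstream.

```python
# Sketch of tolerant Content-Encoding handling; not the actual
# sandcrawler implementation.
import gzip
import zlib
from typing import Optional

def decode_body(body: bytes, content_encoding: Optional[str]) -> bytes:
    if content_encoding and content_encoding.lower() == "gzip":
        try:
            return gzip.decompress(body)
        except (OSError, zlib.error):
            # Header lied, or body was already decoded upstream;
            # keep the raw bytes rather than failing the ingest.
            return body
    return body
```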
* ingest: add force_recrawl flag to skip historical wayback lookup (Bryan Newbold, 2020-03-02; 1 file, -3/+5)

* remove protocols.io octet-stream hack (Bryan Newbold, 2020-03-02; 1 file, -6/+2)

* more mime normalization (Bryan Newbold, 2020-02-27; 1 file, -1/+18)

* ingest: narrow xhtml filter (Bryan Newbold, 2020-02-25; 1 file, -1/+1)

* pdftrio: tweaks to avoid connection errors (Bryan Newbold, 2020-02-24; 1 file, -1/+9)

* fix warc_offset -> offset (Bryan Newbold, 2020-02-24; 1 file, -1/+1)

* ingest: handle broken revisit records (Bryan Newbold, 2020-02-24; 1 file, -1/+4)

* ingest: handle missing chemrxvi tag (Bryan Newbold, 2020-02-24; 1 file, -1/+1)

* ingest: treat CDX lookup error as a wayback-error (Bryan Newbold, 2020-02-24; 1 file, -1/+4)

* ingest: more direct americanarchivist PDF url guess (Bryan Newbold, 2020-02-24; 1 file, -0/+4)

* ingest: make ehp.niehs.nih.gov rule more robust (Bryan Newbold, 2020-02-24; 1 file, -2/+3)

* small tweak to americanarchivist.org URL extraction (Bryan Newbold, 2020-02-24; 1 file, -1/+1)
* fetch_petabox_body: allow non-200 status code fetches (Bryan Newbold, 2020-02-24; 1 file, -2/+10)

  But only if it matches what the revisit record indicated. This is mostly to enable better revisit fetching.
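The "non-200 only if expected" rule could be sketched like this. The function shape and `WaybackError` here are illustrative assumptions, not the real sandcrawler signatures: a body fetch that normally insists on HTTP 200, but accepts a non-200 capture when the caller (e.g. revisit resolution) already knows the expected status code.

```python
# Hedged sketch; fetch() stands in for the petabox/WARC record lookup
# and returns (http_status, body_bytes) for the capture at (path, offset).
class WaybackError(Exception):
    pass

def fetch_petabox_body(fetch, path, offset, expected_status=200):
    status, body = fetch(path, offset)
    # Only accept a non-200 status if it matches what the revisit
    # record indicated (expected_status).
    if status != expected_status:
        raise WaybackError(
            f"petabox body fetch status {status}, expected {expected_status}")
    return body
```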
* allow fuzzy revisit matches (Bryan Newbold, 2020-02-24; 1 file, -1/+26)

* ingest: more revisit fixes (Bryan Newbold, 2020-02-22; 1 file, -4/+4)

* html: more publisher-specific fulltext extraction tricks (Bryan Newbold, 2020-02-22; 1 file, -0/+47)
* ia: improve warc/revisit implementation (Bryan Newbold, 2020-02-22; 1 file, -26/+46)

  A lot of the terminal-bad-status results seem to have been due to not handling revisits correctly: they have status_code = '-' or None.
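A minimal sketch of the status normalization this implies (the helper name is an assumption): revisit CDX rows carry `'-'` or `None` rather than a real integer status, so coerce before ever comparing against 200.

```python
# Illustrative helper, not the actual sandcrawler code: normalize a CDX
# status_code field that may be an int, a numeric string, '-', or None.
from typing import Optional, Union

def normalize_cdx_status(raw: Union[str, int, None]) -> Optional[int]:
    if raw in (None, "-", ""):
        return None  # revisit records have no status of their own
    return int(raw)
```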
* html: degruyter extraction; disabled journals.lww.com (Bryan Newbold, 2020-02-22; 1 file, -0/+19)

* ingest: include better terminal URL/status_code/dt (Bryan Newbold, 2020-02-22; 1 file, -0/+8)

  Was getting a lot of "last hit" metadata for these columns.

* ingest: skip more non-pdf, non-paper domains (Bryan Newbold, 2020-02-22; 1 file, -0/+9)
* cdx: handle empty/null CDX response (Bryan Newbold, 2020-02-22; 1 file, -0/+2)

  We sometimes seem to get an empty string instead of an empty JSON list.
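Handling that quirk might look like the following sketch (function name assumed for illustration): treat an empty-string CDX API body the same as an empty JSON list, and remember that the first row of the JSON CDX output format is a header row.

```python
# Sketch of defensive CDX response parsing; not the actual sandcrawler code.
import json
from typing import Optional

def parse_cdx_response(body: str) -> Optional[list]:
    if not body.strip():
        return None  # API sometimes returns "" instead of "[]"
    rows = json.loads(body)
    if not rows:
        return None
    # First row of the JSON CDX format is the column-header row.
    return rows[1:]
```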
* html: handle TypeError during bs4 parse (Bryan Newbold, 2020-02-22; 1 file, -1/+7)

* filter out CDX rows missing WARC playback fields (Bryan Newbold, 2020-02-19; 1 file, -0/+4)

* pdf_trio persist fixes from prod (Bryan Newbold, 2020-02-19; 2 files, -5/+9)
* allow <meta property=citation_pdf_url> (Bryan Newbold, 2020-02-18; 1 file, -0/+3)

  At least researchgate does this (!)
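The idea, in a self-contained stdlib sketch (the sandcrawler code uses BeautifulSoup; this stand-in just shows the lookup rule): accept `<meta property="citation_pdf_url">` in addition to the conventional `<meta name="citation_pdf_url">`.

```python
# Illustrative stdlib-only sketch of the name=/property= fallback.
from html.parser import HTMLParser
from typing import Optional

class CitationPdfUrlParser(HTMLParser):
    """Pick up citation_pdf_url from either a name= or property= meta tag."""
    def __init__(self):
        super().__init__()
        self.pdf_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.pdf_url:
            return
        d = dict(attrs)
        key = d.get("name") or d.get("property")
        if key == "citation_pdf_url" and d.get("content"):
            self.pdf_url = d["content"]

def find_citation_pdf_url(html: str) -> Optional[str]:
    parser = CitationPdfUrlParser()
    parser.feed(html)
    return parser.pdf_url
```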
* X-Archive-Src more robust than X-Archive-Redirect-Reason (Bryan Newbold, 2020-02-18; 1 file, -2/+3)

* wayback: on bad redirects, log instead of assert (Bryan Newbold, 2020-02-18; 1 file, -2/+13)

  This is a different form of mangled redirect.

* attempt to work around corrupt ARC files from alexa issue (Bryan Newbold, 2020-02-18; 1 file, -0/+5)

* unpaywall2ingestrequest transform script (Bryan Newbold, 2020-02-18; 1 file, -1/+1)

* pdftrio: mode controlled by CLI arg (Bryan Newbold, 2020-02-18; 1 file, -4/+5)

* pdftrio: fix error nesting in pdftrio key (Bryan Newbold, 2020-02-18; 1 file, -12/+20)

* include rel and oa_status in ingest request 'extra' (Bryan Newbold, 2020-02-18; 2 files, -2/+2)

* pdftrio fixes from testing (Bryan Newbold, 2020-02-13; 1 file, -3/+9)

* move pdf_trio results back under key in JSON/Kafka (Bryan Newbold, 2020-02-13; 2 files, -7/+31)

* pdftrio: small fixes from testing (Bryan Newbold, 2020-02-12; 1 file, -2/+2)

* pdftrio basic python code (Bryan Newbold, 2020-02-12; 4 files, -1/+238)

  This is basically just a copy/paste of GROBID code, only simpler!

* fix persist bug where ingest_request_source not saved (Bryan Newbold, 2020-02-05; 1 file, -0/+1)

* fix bug where ingest_request extra fields not persisted (Bryan Newbold, 2020-02-05; 1 file, -1/+2)
* handle alternative dt format in WARC headers (Bryan Newbold, 2020-02-05; 1 file, -2/+4)

  If there is a UTC timestamp with a trailing 'Z' indicating the timezone, that is valid but increases the string length by one.
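A tolerant parse along the lines described above could look like this (helper name and exact format string are assumptions for illustration): accept both `YYYY-MM-DDTHH:MM:SS` and the one-character-longer `...Z` variant, which both denote UTC.

```python
# Sketch of tolerant WARC-header datetime parsing; not the actual
# sandcrawler code.
import datetime

def parse_warc_dt(raw: str) -> datetime.datetime:
    if raw.endswith("Z"):
        raw = raw[:-1]  # strip the UTC designator; value is UTC either way
    return datetime.datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S")
```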
* decrease SPNv2 polling timeout to 3 minutes (Bryan Newbold, 2020-02-05; 1 file, -2/+2)

* improvements to reliability from prod testing (Bryan Newbold, 2020-02-03; 2 files, -7/+20)
* hack-y backoff ingest attempt (Bryan Newbold, 2020-02-03; 2 files, -3/+26)

  The goal here is to have SPNv2 requests back off when we get back-pressure (usually caused by some sessions taking too long). Lack of proper back-pressure is making it hard to turn up parallelism.

  This is a hack because we still time out and drop the slow request. A better way is probably to have a background thread run while the KafkaPusher thread does polling, maybe with timeouts to detect slow processing (greater than 30 seconds?), and only pause/resume in that case. This would also make taking batches easier. Unlike in the existing code, however, the parallelism needs to happen at the Pusher level to do the polling (Kafka) and "await" (for all worker threads to complete) correctly.
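A very rough sketch of the hack-y backoff described above. Everything here (class name, delays, the exponential policy) is an illustrative assumption, not the actual sandcrawler values: when SPNv2 requests start timing out, delay before issuing the next one, and reset the delay on the first success.

```python
# Illustrative back-pressure gate; the real fix proposed in the commit
# body is pause/resume at the Kafka Pusher level instead.
import time

class BackoffGate:
    def __init__(self, base_delay=1.0, max_delay=60.0):
        self.delay = 0.0
        self.base_delay = base_delay
        self.max_delay = max_delay

    def wait(self):
        # Called before each SPNv2 request.
        if self.delay:
            time.sleep(self.delay)

    def record(self, timed_out: bool):
        # Exponential backoff on timeouts, full reset on success.
        if timed_out:
            self.delay = min(self.max_delay,
                             max(self.base_delay, self.delay * 2))
        else:
            self.delay = 0.0
```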
* grobid petabox: fix fetch body/content (Bryan Newbold, 2020-02-03; 1 file, -1/+1)