path: root/python/sandcrawler
Commit message | Author | Age | Files | Lines (-/+)
* more bad PDF sha1; print sha1 before poppler extract | Bryan Newbold | 2020-08-05 | 1 | -0/+7
* spn2: skip js behavior (experiment) | Bryan Newbold | 2020-08-05 | 1 | -0/+1
* SPN2: ensure not fetching outlinks | Bryan Newbold | 2020-08-05 | 1 | -0/+1
* another bad PDF sha1 | Bryan Newbold | 2020-08-04 | 1 | -0/+1
* another PDF sha1hex | Bryan Newbold | 2020-07-27 | 1 | -0/+1
* yet another 'bad' PDF sha1hex | Bryan Newbold | 2020-07-27 | 1 | -0/+1
* use new SPNv2 'skip_first_archive' param | Bryan Newbold | 2020-07-22 | 1 | -0/+1
* add more slow PDF hashes | Bryan Newbold | 2020-07-05 | 1 | -0/+2
* add another bad PDF sha1hex | Bryan Newbold | 2020-07-02 | 1 | -0/+1
* another bad PDF SHA-1 | Bryan Newbold | 2020-06-30 | 1 | -0/+1
* hack to unblock thumbnail processing pipeline | Bryan Newbold | 2020-06-29 | 1 | -0/+16
* customize timeout per worker; 120sec for pdf-extract | Bryan Newbold | 2020-06-29 | 2 | -2/+3
* handle empty fetched blob | Bryan Newbold | 2020-06-27 | 1 | -1/+6
* CDX KeyError as WaybackError from fetch worker | Bryan Newbold | 2020-06-26 | 1 | -1/+1
* handle None 'metadata' field correctly | Bryan Newbold | 2020-06-26 | 1 | -1/+1
* handle non-success case of parsing extract from JSON/dict | Bryan Newbold | 2020-06-26 | 1 | -1/+1
* report revisit non-200 as a WaybackError | Bryan Newbold | 2020-06-26 | 1 | -7/+7
* Revert "simpler handling of null PDF text pages" | Bryan Newbold | 2020-06-25 | 1 | -4/+11
* simpler handling of null PDF text pages | Bryan Newbold | 2020-06-25 | 1 | -11/+4
* pdfextract: attributerror with text extraction | Bryan Newbold | 2020-06-25 | 1 | -4/+12
* catch UnicodeDecodeError in pdfextract | Bryan Newbold | 2020-06-25 | 1 | -1/+10
* don't nest generic fetch errors under pdf_trio | Bryan Newbold | 2020-06-25 | 1 | -12/+6
* pdfextract: handle too-large fulltext | Bryan Newbold | 2020-06-25 | 1 | -0/+17
* another bad/non PDF test; catch correct error | Bryan Newbold | 2020-06-25 | 1 | -1/+1
* pdfextract: catch poppler.LockedDocumentError | Bryan Newbold | 2020-06-25 | 1 | -1/+1
* pdfextract support in ingest worker | Bryan Newbold | 2020-06-25 | 2 | -1/+59
* poppler: correct RGBA buffer endian-ness | Bryan Newbold | 2020-06-25 | 1 | -1/+1
* pdfextract_tool fixes from prod usage | Bryan Newbold | 2020-06-25 | 1 | -1/+1
* pdfextract: fix pdf_extra key names | Bryan Newbold | 2020-06-25 | 1 | -2/+2
* ensure pdf_meta isn't passed an empty dict() | Bryan Newbold | 2020-06-25 | 1 | -1/+4
* changes from prod | Bryan Newbold | 2020-06-25 | 2 | -4/+18
* fixes and tweaks from testing locally | Bryan Newbold | 2020-06-17 | 5 | -17/+132
* tweak kafka topic names and seaweedfs layout | Bryan Newbold | 2020-06-17 | 1 | -1/+2
* make process_pdf() more robust to parse errors | Bryan Newbold | 2020-06-17 | 1 | -5/+29
* note about text layout with pdf extraction | Bryan Newbold | 2020-06-17 | 1 | -0/+8
* lint fixes | Bryan Newbold | 2020-06-17 | 1 | -1/+1
* rename pdf tools to pdfextract | Bryan Newbold | 2020-06-17 | 1 | -0/+0
* partial test coverage of pdf extract worker | Bryan Newbold | 2020-06-17 | 1 | -6/+9
* add new pdf workers/persisters | Bryan Newbold | 2020-06-17 | 2 | -2/+101
* pdf: mypy and typo fixes | Bryan Newbold | 2020-06-17 | 2 | -15/+22
* workers: refactor to pass key to process() | Bryan Newbold | 2020-06-17 | 6 | -20/+28
* initial work on PDF extraction worker | Bryan Newbold | 2020-06-16 | 2 | -1/+158
* refactor worker fetch code into wrapper class | Bryan Newbold | 2020-06-16 | 3 | -141/+111
* rename KafkaGrobidSink -> KafkaCompressSink | Bryan Newbold | 2020-06-16 | 2 | -2/+2
* handle UnboundLocalError in HTML parsing | Bryan Newbold | 2020-05-19 | 1 | -1/+4
* hotfix for html meta extract codepath | Bryan Newbold | 2020-05-03 | 1 | -1/+1
* ingest: handle partial citation_pdf_url tag | Bryan Newbold | 2020-05-03 | 1 | -0/+3
* workers: add missing want() dataflow path | Bryan Newbold | 2020-04-30 | 1 | -0/+9
* ingest: don't 'want' non-PDF ingest | Bryan Newbold | 2020-04-30 | 1 | -0/+5
* timeouts: don't push through None error messages | Bryan Newbold | 2020-04-29 | 1 | -2/+2