path: root/python/sandcrawler/pdfextract.py
Commit message | Author | Age | Files | Lines
...
* handle non-success case of parsing extract from JSON/dict | Bryan Newbold | 2020-06-26 | 1 | -1/+1
* Revert "simpler handling of null PDF text pages" | Bryan Newbold | 2020-06-25 | 1 | -4/+11
* simpler handling of null PDF text pages | Bryan Newbold | 2020-06-25 | 1 | -11/+4
* pdfextract: attributerror with text extraction | Bryan Newbold | 2020-06-25 | 1 | -4/+12
* catch UnicodeDecodeError in pdfextract | Bryan Newbold | 2020-06-25 | 1 | -1/+10
* pdfextract: handle too-large fulltext | Bryan Newbold | 2020-06-25 | 1 | -0/+17
* another bad/non PDF test; catch correct error | Bryan Newbold | 2020-06-25 | 1 | -1/+1
* pdfextract: catch poppler.LockedDocumentError | Bryan Newbold | 2020-06-25 | 1 | -1/+1
* pdfextract support in ingest worker | Bryan Newbold | 2020-06-25 | 1 | -0/+24
* poppler: correct RGBA buffer endian-ness | Bryan Newbold | 2020-06-25 | 1 | -1/+1
* pdfextract_tool fixes from prod usage | Bryan Newbold | 2020-06-25 | 1 | -1/+1
* pdfextract: fix pdf_extra key names | Bryan Newbold | 2020-06-25 | 1 | -2/+2
* ensure pdf_meta isn't passed an empty dict() | Bryan Newbold | 2020-06-25 | 1 | -1/+4
* fixes and tweaks from testing locally | Bryan Newbold | 2020-06-17 | 1 | -3/+64
* make process_pdf() more robust to parse errors | Bryan Newbold | 2020-06-17 | 1 | -5/+29
* note about text layout with pdf extraction | Bryan Newbold | 2020-06-17 | 1 | -0/+8
* rename pdf tools to pdfextract | Bryan Newbold | 2020-06-17 | 1 | -0/+167