path: root/python/sandcrawler/pdfextract.py
Commit message | Author | Age | Files | Lines
...
* handle non-success case of parsing extract from JSON/dict | Bryan Newbold | 2020-06-26 | 1 | -1/+1
* Revert "simpler handling of null PDF text pages" | Bryan Newbold | 2020-06-25 | 1 | -4/+11
    This reverts commit 254f24ad6566c9d4b5814868911b604802847b58.
    Attribute was actually internal to text() call, not a None page.
* simpler handling of null PDF text pages | Bryan Newbold | 2020-06-25 | 1 | -11/+4
* pdfextract: attributerror with text extraction | Bryan Newbold | 2020-06-25 | 1 | -4/+12
* catch UnicodeDecodeError in pdfextract | Bryan Newbold | 2020-06-25 | 1 | -1/+10
* pdfextract: handle too-large fulltext | Bryan Newbold | 2020-06-25 | 1 | -0/+17
* another bad/non PDF test; catch correct error | Bryan Newbold | 2020-06-25 | 1 | -1/+1
    This test doesn't actually catch the error. I'm not sure why type checks don't discover the "LockedDocumentError not part of poppler" issue though.
* pdfextract: catch poppler.LockedDocumentError | Bryan Newbold | 2020-06-25 | 1 | -1/+1
* pdfextract support in ingest worker | Bryan Newbold | 2020-06-25 | 1 | -0/+24
* poppler: correct RGBA buffer endian-ness | Bryan Newbold | 2020-06-25 | 1 | -1/+1
* pdfextract_tool fixes from prod usage | Bryan Newbold | 2020-06-25 | 1 | -1/+1
* pdfextract: fix pdf_extra key names | Bryan Newbold | 2020-06-25 | 1 | -2/+2
* ensure pdf_meta isn't passed an empty dict() | Bryan Newbold | 2020-06-25 | 1 | -1/+4
* fixes and tweaks from testing locally | Bryan Newbold | 2020-06-17 | 1 | -3/+64
* make process_pdf() more robust to parse errors | Bryan Newbold | 2020-06-17 | 1 | -5/+29
* note about text layout with pdf extraction | Bryan Newbold | 2020-06-17 | 1 | -0/+8
* rename pdf tools to pdfextract | Bryan Newbold | 2020-06-17 | 1 | -0/+167