author | Bryan Newbold <bnewbold@archive.org> | 2020-06-16 17:28:33 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2020-06-16 17:28:36 -0700 |
commit | 5c32007e23a4f3b6902b760b5e06e4dd341918b3 (patch) | |
tree | 86fe446ef6f980d09fa95867ddb0bae847cc2765 /python/.gitignore | |
parent | d49ea4fb3f567351c63816e703348d8a9fd49ff0 (diff) | |
download | sandcrawler-5c32007e23a4f3b6902b760b5e06e4dd341918b3.tar.gz sandcrawler-5c32007e23a4f3b6902b760b5e06e4dd341918b3.zip |
initial work on PDF extraction worker
This worker fetches full PDFs, then extracts thumbnails, raw text, and
PDF metadata. It is similar to the GROBID worker.
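
The flow described above (fetch, then extract thumbnail, text, and metadata) might look roughly like the following. This is a minimal sketch, not the code from this commit: `PdfExtractResult`, `process_pdf`, the status strings, and the injected `fetch` and `extract` callables are all hypothetical names chosen for illustration, and the actual PDF parsing is left to whatever library the caller supplies.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class PdfExtractResult:
    # Hypothetical result record; field names are illustrative only.
    sha1hex: str
    status: str
    text: Optional[str] = None
    thumbnail: Optional[bytes] = None
    metadata: Dict[str, str] = field(default_factory=dict)


def process_pdf(
    fetch: Callable[[str], bytes],
    extract: Callable[[bytes], dict],
    url: str,
) -> PdfExtractResult:
    """Fetch one PDF and pass its bytes to the supplied extractor.

    `fetch` and `extract` are injected so the sketch does not assume
    any particular HTTP client or PDF library.
    """
    blob = fetch(url)
    sha1hex = hashlib.sha1(blob).hexdigest()

    # Cheap sanity check: PDF files begin with the "%PDF-" magic bytes.
    if not blob.startswith(b"%PDF-"):
        return PdfExtractResult(sha1hex=sha1hex, status="not-pdf")

    try:
        # Assumed to return a dict with "text", "thumbnail", "metadata".
        parts = extract(blob)
    except Exception:
        # Report the failure as a status instead of crashing the worker.
        return PdfExtractResult(sha1hex=sha1hex, status="parse-error")

    return PdfExtractResult(
        sha1hex=sha1hex,
        status="success",
        text=parts.get("text"),
        thumbnail=parts.get("thumbnail"),
        metadata=parts.get("metadata", {}),
    )
```

Returning a status for every input, rather than raising, means one bad PDF does not stop the rest of the batch; whether the real worker makes the same choice is not stated in this commit.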
Diffstat (limited to 'python/.gitignore')
0 files changed, 0 insertions, 0 deletions