path: root/python/persist_tool.py
author: Bryan Newbold <bnewbold@archive.org> 2020-06-16 17:28:33 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2020-06-16 17:28:36 -0700
commit: 5c32007e23a4f3b6902b760b5e06e4dd341918b3
tree: 86fe446ef6f980d09fa95867ddb0bae847cc2765 /python/persist_tool.py
parent: d49ea4fb3f567351c63816e703348d8a9fd49ff0
initial work on PDF extraction worker
This worker fetches full PDFs, then extracts thumbnails, raw text, and PDF metadata. Similar to GROBID worker.
Diffstat (limited to 'python/persist_tool.py')
0 files changed, 0 insertions, 0 deletions