path: root/python_hadoop/extraction_cdx_grobid.py
author: Bryan Newbold <bnewbold@archive.org> 2021-08-16 20:17:30 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2021-08-16 20:17:30 -0700
commit: e1cde3c95e5176f232ecbc22a8619149078dc91f
tree: 2624b700015663272e5d9edd21d7bf180e3803b6 /python_hadoop/extraction_cdx_grobid.py
parent: 26d90505bda2d1dfcc25af6b8a0270faa11729e7
download: sandcrawler-e1cde3c95e5176f232ecbc22a8619149078dc91f.tar.gz
download: sandcrawler-e1cde3c95e5176f232ecbc22a8619149078dc91f.zip
html ingest: detect some blog platforms, and allow lower wordcount threshold
Diffstat (limited to 'python_hadoop/extraction_cdx_grobid.py')
0 files changed, 0 insertions, 0 deletions