author | Bryan Newbold <bnewbold@archive.org> | 2022-10-24 14:17:44 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2022-10-24 14:17:46 -0700 |
commit | 4f0d10f4b38534eda673a8dfe28e3a58af9a8a8a | |
tree | 87c84d496a9976084fc4af7825e549c07fbcffb9 /python_hadoop/kafka_grobid_hbase.py | |
parent | 855153ae4fe03656adde16c56a4347f4b3d26487 | |
ingest: don't prefer WARC over SPN so strongly
We generally prefer an older WARC record over an SPN record because the
lookup is easier. However, this was causing problems with repeated ingest,
so demote that preference.
We may want to make this more configurable in the future, so things like
HTML sub-resource lookups or bulk ingest won't prefer random new SPN
captures.
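
One way to picture "demoting" a preference is as a change in the position of a field within a tuple sort key. The sketch below is illustrative only and is not the code changed by this commit: the `Capture` class, its fields, and both key functions are hypothetical, and the choice of recency as the criterion that now outranks the WARC-vs-SPN check is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capture:
    """Hypothetical stand-in for a capture index row (not sandcrawler's real type)."""

    datetime: str  # 14-digit capture timestamp, e.g. "20221024141744"
    status_code: int
    is_spn: bool  # True if the record came from SPN rather than a regular WARC


def strong_warc_key(c: Capture) -> tuple:
    # WARC-vs-SPN is compared before recency, so any WARC record
    # beats any SPN record that has the same status.
    return (int(c.status_code == 200), int(not c.is_spn), int(c.datetime))


def demoted_warc_key(c: Capture) -> tuple:
    # Recency (an assumed criterion) is compared first;
    # WARC-vs-SPN is now only a tie-breaker.
    return (int(c.status_code == 200), int(c.datetime), int(not c.is_spn))


captures = [
    Capture("20190101000000", 200, is_spn=False),  # older WARC record
    Capture("20221024000000", 200, is_spn=True),  # newer SPN record
]

# Tuples compare element by element, so field order sets the priority.
best_strong = max(captures, key=strong_warc_key)
best_demoted = max(captures, key=demoted_warc_key)
```

With the strong key the older WARC record is selected; with the demoted key the newer SPN record is selected, and the WARC flag only matters when the earlier fields are equal. Making the ordering configurable, as the message suggests, could amount to choosing between key functions like these per ingest type.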
Diffstat (limited to 'python_hadoop/kafka_grobid_hbase.py')
0 files changed, 0 insertions, 0 deletions