author | Bryan Newbold <bnewbold@archive.org> | 2018-05-08 10:06:14 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2018-05-08 10:06:20 -0700 |
commit | 18a55d37a87d4391bd8161201c523dd7d7f0f1e7 (patch) | |
tree | 86db4c84cf4fd0dde5ea9508617344018e640104 /TODO | |
parent | 1831a3b4495aee275e4b4b187fa545eba75eb87b (diff) | |
fix tests post-DISTINCT
Confirms it's working!
Diffstat (limited to 'TODO')
-rw-r--r-- | TODO | 5 |
1 files changed, 5 insertions, 0 deletions
@@ -1,4 +1,9 @@
+pig:
+- potentially want to *not* de-dupe CDX lines by uniq sha1 in all cases; run
+  this as a second-stage filter? for example, may want many URL links in fatcat
+  for a single file (different links, different policies)
+
 - include input file name (and chunk? and CDX?) in sentry context
 - play with test image on older releases (eg, trusty)
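
The new TODO item suggests keeping every URL for a given file and treating SHA-1 de-duplication as a separate, optional second stage. A minimal Python sketch of that split follows. It is not the project's Pig script: the function names are hypothetical, and the column positions assume the common 11-column, space-separated CDX layout (original URL in column 3, SHA-1 digest in column 6).

```python
from collections import defaultdict

# Assumption: 11-column, space-separated CDX lines, with the original URL in
# column 3 and the SHA-1 digest in column 6 (0-based indexes 2 and 5).
URL_COL = 2
SHA1_COL = 5


def group_urls_by_sha1(cdx_lines):
    """First stage: keep every distinct URL seen for each SHA-1."""
    urls_by_sha1 = defaultdict(list)
    for line in cdx_lines:
        fields = line.split()
        if len(fields) <= SHA1_COL:
            continue  # skip malformed or short lines
        url, sha1 = fields[URL_COL], fields[SHA1_COL]
        if url not in urls_by_sha1[sha1]:
            urls_by_sha1[sha1].append(url)
    return dict(urls_by_sha1)


def dedupe_by_sha1(urls_by_sha1):
    """Optional second stage: collapse to the first URL seen per SHA-1."""
    return {sha1: urls[0] for sha1, urls in urls_by_sha1.items()}
```

Running only the first stage preserves the many-links-per-file case described in the TODO; the second stage can then be applied where a single record per file is wanted.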