From 4b0a50b25f72a5b2d693379f9693b37131c12371 Mon Sep 17 00:00:00 2001
From: bnewbold
Date: Thu, 25 May 2017 23:55:16 -0700
Subject: init with notes

---
 README.md | 73 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)
 create mode 100644 README.md

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..e09d1f5
--- /dev/null
+++ b/README.md
@@ -0,0 +1,73 @@
+
+bad-hashish: a tool for recursively, remotely multi-hashing files
+
+"recursively" meaning that files inside archives (.zip, .tar.gz) are hashed
+without extracting everything to disk.
+
+"remotely" meaning that large remote (HTTP/HTTPS) files can be hashed in a
+streaming fashion without saving to disk.
+
+"multi-" meaning that multiple hash algorithms are computed in a single pass.
+
+There are other ways to do most of these; in un-UNIX-y fashion (for now) this
+tool does them all together.
+
+## Planned Features
+
+- sha1, sha256, sha512, md5, blake2b
+- support base64, base32, hex (upper/lower), etc
+- can recurse on .tar and .zip (and more?) without hitting disk
+- can stream files via HTTP(S) without hitting disk
+- variable output (json, tsv, etc)
+
+Someday?
+
+- dat, ipfs, zsync index computations
+- simhash/minhash/etc, for plain text
+  https://github.com/bartolsthoorn/simhash-rs
+- support piping out to arbitrary other commands
+  (eg, for pdf extraction simhash, image hash...)
+  https://github.com/abonander/img_hash
+
+## Planned Libraries
+
+rust:
+- zip
+- tar + flate2
+- tree_magic
+- rust-crypto
+- crc
+- clap
+- error-chain
+- reqwest
+- log (or slog?)
+- rayon (for parallelization?)
+- something json
+- csv (xsv?)
+- data-encoding
+
+## Minimum Viable Version
+
+Parse arguments as local files or URLs. Either way, start reading/streaming
+data and hand the stream off to a consumer that takes 4MB chunks at a time
+and hashes them.
+
+Next, add parallelization (rayon?) for hashes.
+
+Output as space-separated (default), csv, or json, one line per file.
+
+Examples:
+
+    hashish some_file.txt
+
+    cat zip_urls.txt | parallel -j8 hashish --recurse-only {} > all_hashes.txt
+
+Arguments:
+- chunk size
+- recurse into files or not
+- output format
+- cores to use?
+
+## Later Thoughts
+
+Limited by {CPU, disk, network}? Where to parallelize? Data locality.
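
## Sketches

Rough sketches of the pieces above, to make the minimum viable version
concrete. These assume the 2017-era rust-crypto crate (`crypto` on
crates.io), whose `Digest` trait has `input()` and `result_str()` methods;
file names and URLs below are placeholders, not anything in this repo.

The core consumer: read 4MB chunks and feed every chunk to each hasher, so
the data only gets read once.

    // sketch: single-pass multi-hashing of one local file
    extern crate crypto;

    use std::env;
    use std::fs::File;
    use std::io::Read;

    use crypto::digest::Digest;
    use crypto::md5::Md5;
    use crypto::sha1::Sha1;
    use crypto::sha2::Sha256;

    fn main() {
        let path = env::args().nth(1).expect("usage: hashish <file>");
        let mut file = File::open(&path).expect("couldn't open file");

        let mut md5 = Md5::new();
        let mut sha1 = Sha1::new();
        let mut sha256 = Sha256::new();

        let mut buf = vec![0u8; 4 * 1024 * 1024]; // 4MB chunks
        loop {
            let n = file.read(&mut buf).expect("read error");
            if n == 0 { break; }
            // every hasher sees the same chunk; data is read only once
            md5.input(&buf[..n]);
            sha1.input(&buf[..n]);
            sha256.input(&buf[..n]);
        }

        // space-separated output, one line per file (the default format)
        println!("{} {} {} {}", path, md5.result_str(), sha1.result_str(),
                 sha256.result_str());
    }

blake2b should slot in the same way (`crypto::blake2b::Blake2b::new(64)` in
the same crate, taking the output length in bytes).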
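
Recursing into a .tar.gz without hitting disk: `tar`'s `Archive` wraps any
`Read`, and each entry is itself a `Read`, so archive members get
chunk-hashed exactly like plain files. A sketch, assuming flate2's
`GzDecoder::new()` returns the decoder directly (true of recent releases;
older ones returned a `Result`):

    // sketch: hash every member of a .tar.gz, streaming, no extraction
    extern crate crypto;
    extern crate flate2;
    extern crate tar;

    use std::fs::File;
    use std::io::Read;

    use crypto::digest::Digest;
    use crypto::sha1::Sha1;
    use flate2::read::GzDecoder;
    use tar::Archive;

    fn main() {
        // placeholder input path
        let file = File::open("some_archive.tar.gz").expect("couldn't open file");
        // gunzip on the fly; tar parses the decompressed stream
        let mut archive = Archive::new(GzDecoder::new(file));

        for entry in archive.entries().expect("not a tar archive") {
            let mut entry = entry.expect("bad tar entry");
            let path = entry.path().expect("bad entry path").into_owned();

            let mut sha1 = Sha1::new();
            let mut buf = vec![0u8; 4 * 1024 * 1024];
            loop {
                let n = entry.read(&mut buf).expect("read error");
                if n == 0 { break; }
                sha1.input(&buf[..n]);
            }
            println!("{} {}", path.display(), sha1.result_str());
        }
    }

Zip is the harder case: the format's central directory sits at the end of
the file and the `zip` crate wants `Read + Seek`, so fully-streaming zip
support will take more care than tar.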
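
Remote streaming is the same consumer behind a different `Read`: in the
2017-era blocking reqwest API, `Response` implements `std::io::Read`
(newer releases moved this under `reqwest::blocking`). The URL here is a
placeholder:

    // sketch: hash a remote file as it streams in, never touching disk
    extern crate crypto;
    extern crate reqwest;

    use std::io::Read;

    use crypto::digest::Digest;
    use crypto::sha2::Sha256;

    fn main() {
        let url = "https://example.com/big_file.tar.gz"; // placeholder URL
        let mut resp = reqwest::get(url).expect("request failed");

        let mut sha256 = Sha256::new();
        let mut buf = vec![0u8; 4 * 1024 * 1024];
        loop {
            // the body is consumed incrementally, never buffered whole
            let n = resp.read(&mut buf).expect("read error");
            if n == 0 { break; }
            sha256.input(&buf[..n]);
        }
        println!("{} {}", url, sha256.result_str());
    }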
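
For the "parallelization (rayon?) for hashes" step, one option is to fan
each chunk out to all hashers with a parallel iterator; whether that
actually wins depends on the CPU-vs-disk-vs-network question under "Later
Thoughts". A sketch:

    // sketch: each 4MB chunk fanned out to all hashers in parallel
    extern crate crypto;
    extern crate rayon;

    use std::env;
    use std::fs::File;
    use std::io::Read;

    use crypto::digest::Digest;
    use crypto::md5::Md5;
    use crypto::sha1::Sha1;
    use crypto::sha2::{Sha256, Sha512};
    use rayon::prelude::*;

    fn main() {
        let path = env::args().nth(1).expect("usage: hashish <file>");
        let mut file = File::open(&path).expect("couldn't open file");

        // trait objects so one Vec can hold different digest types
        let mut hashers: Vec<Box<Digest + Send>> = vec![
            Box::new(Md5::new()),
            Box::new(Sha1::new()),
            Box::new(Sha256::new()),
            Box::new(Sha512::new()),
        ];

        let mut buf = vec![0u8; 4 * 1024 * 1024];
        loop {
            let n = file.read(&mut buf).expect("read error");
            if n == 0 { break; }
            let chunk = &buf[..n];
            // one rayon task per algorithm, all reading the same chunk
            hashers.par_iter_mut().for_each(|h| h.input(chunk));
        }

        let hexes: Vec<String> = hashers.iter_mut().map(|h| h.result_str()).collect();
        println!("{} {}", path, hexes.join(" "));
    }

The other axis is to keep each file single-threaded and parallelize across
files instead (or just lean on GNU parallel, as in the examples above).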