bad-hashish: a tool for recursively, remotely multi-hashing files

"recursively" meaning that files inside archives (.zip, .tar.gz) are hashed without extracting everything to disk.

"remotely" meaning that large remote (HTTP/HTTPS) files can be hashed in a streaming fashion without saving to disk.

"multi-" meaning that mulitple hash algorithms are computed in a single pass.

There are other ways to do each of these individually; in un-UNIX-y fashion (for now), this tool does them all together.
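
As a sketch of the single-pass idea: read each chunk once and feed it to every hasher before reading the next. The crate choices here (digest, sha2, md-5, data-encoding) are assumptions for illustration; the Planned Libraries list below leaves this open.

    // Sketch: feed every chunk to all hashers in a single pass over the input.
    // digest, sha2, md-5 (RustCrypto), and data-encoding are assumed crates.
    use std::io::Read;

    use digest::DynDigest;
    use md5::Md5;
    use sha2::{Sha256, Sha512};

    fn multi_hash<R: Read>(mut input: R) -> std::io::Result<Vec<(&'static str, String)>> {
        let mut hashers: Vec<(&'static str, Box<dyn DynDigest>)> = vec![
            ("sha256", Box::new(Sha256::default()) as Box<dyn DynDigest>),
            ("sha512", Box::new(Sha512::default()) as Box<dyn DynDigest>),
            ("md5", Box::new(Md5::default()) as Box<dyn DynDigest>),
        ];
        let mut buf = vec![0u8; 4 * 1024 * 1024]; // 4 MB chunks
        loop {
            let n = input.read(&mut buf)?;
            if n == 0 {
                break;
            }
            for (_, h) in hashers.iter_mut() {
                h.update(&buf[..n]);
            }
        }
        Ok(hashers
            .into_iter()
            .map(|(name, h)| (name, data_encoding::HEXLOWER.encode(&h.finalize())))
            .collect())
    }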

Planned Features

  • sha1, sha256, sha512, md5, blake2b
  • support base64, base32, hex (upper/lower), etc
  • can recurse on .tar and .zip (and more?) without hitting disk (see the sketch after this list)
  • can stream files via HTTP(S) without hitting disk
  • variable output (json, tsv, etc)
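
For the archive bullet above, a minimal sketch using the tar and flate2 crates from the library list, reusing the multi_hash helper sketched earlier; entries stream straight from the decompressor into the hashers, nothing is written to disk.

    // Sketch: hash every file entry inside a .tar.gz without extracting it.
    // Uses tar + flate2; multi_hash is the single-pass helper sketched above.
    use std::io::Read;

    use flate2::read::GzDecoder;
    use tar::Archive;

    fn hash_tar_gz<R: Read>(input: R) -> std::io::Result<()> {
        let mut archive = Archive::new(GzDecoder::new(input));
        for entry in archive.entries()? {
            let mut entry = entry?;
            if !entry.header().entry_type().is_file() {
                continue;
            }
            let path = entry.path()?.display().to_string();
            // Each entry is itself a Read, so it streams into the hashers.
            for (name, digest) in multi_hash(&mut entry)? {
                println!("{}\t{}\t{}", path, name, digest);
            }
        }
        Ok(())
    }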

Someday?

  • dat, ipfs, zsync index computations
  • simhash/minhash/etc, for plain text https://github.com/bartolsthoorn/simhash-rs
  • support piping out to arbitrary other commands (e.g., for pdf extraction simhash, image hash...) https://github.com/abonander/img_hash

Planned Libraries

rust:

  • zip
  • tar + flate2
  • tree_magic
  • rust-crypto
  • crc
  • clap
  • error-chain
  • reqwest
  • log (or slog?)
  • rayon (for parallelization?)
  • something json
  • csv (xsv?)
  • data-encoding

Minimum Viable Version

Parse arguments as local files or URLs. Either way, start reading/streaming the data and hand it off to a consumer that reads 4 MB chunks at a time and updates each hash.
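
A sketch of that dispatch, assuming reqwest's blocking API (blocking feature enabled) and the multi_hash helper from earlier; anything starting with http:// or https:// is streamed, everything else is opened as a local file.

    // Sketch: dispatch each argument to a streaming reader.
    // reqwest's blocking API is an assumption (the async API would also work).
    use std::fs::File;

    fn hash_target(arg: &str) -> Result<Vec<(&'static str, String)>, Box<dyn std::error::Error>> {
        if arg.starts_with("http://") || arg.starts_with("https://") {
            // reqwest's blocking Response implements std::io::Read, so the body
            // streams straight into the hashers without touching disk.
            let resp = reqwest::blocking::get(arg)?;
            Ok(multi_hash(resp)?)
        } else {
            Ok(multi_hash(File::open(arg)?)?)
        }
    }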

Next, add parallelization (rayon?) for hashes.
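
One possible shape for that, as a sketch: keep the same read loop but fan the per-chunk updates out across hashers with rayon. Whether this beats per-file parallelism depends on where the bottleneck is (see Later Thoughts).

    // Sketch: same single-pass loop, but each chunk is fed to the hashers in
    // parallel. Only pays off when hashing (not I/O) is the bottleneck.
    use std::io::Read;

    use digest::DynDigest;
    use rayon::prelude::*;

    fn multi_hash_parallel<R: Read>(
        mut input: R,
        hashers: &mut [(&'static str, Box<dyn DynDigest + Send>)],
    ) -> std::io::Result<()> {
        let mut buf = vec![0u8; 4 * 1024 * 1024];
        loop {
            let n = input.read(&mut buf)?;
            if n == 0 {
                break;
            }
            // Every hasher sees the same chunk; rayon spreads the updates
            // across worker threads.
            hashers.par_iter_mut().for_each(|(_, h)| h.update(&buf[..n]));
        }
        Ok(())
    }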

Output as space-separated (default), csv, or json, one line per file.
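
Roughly like this, with serde_json standing in for the "something json" item in the library list (field layout here is made up, not a committed format):

    // Sketch: one output line per file. The csv crate would handle quoting
    // properly; this only shows the shape of the output.
    fn print_result(path: &str, hashes: &[(&'static str, String)], format: &str) {
        match format {
            "json" => {
                let mut obj = serde_json::Map::new();
                obj.insert("path".to_string(), serde_json::Value::String(path.to_string()));
                for (name, digest) in hashes {
                    obj.insert(name.to_string(), serde_json::Value::String(digest.clone()));
                }
                println!("{}", serde_json::Value::Object(obj));
            }
            "csv" => {
                let digests: Vec<&str> = hashes.iter().map(|(_, d)| d.as_str()).collect();
                println!("{},{}", path, digests.join(","));
            }
            // default: space-separated
            _ => {
                let digests: Vec<&str> = hashes.iter().map(|(_, d)| d.as_str()).collect();
                println!("{} {}", path, digests.join(" "));
            }
        }
    }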

Examples:

  hashish some_file.txt

  cat zip_urls.txt | parallel -j8 hashish --recurse-only {} > all_hashes.txt

Arguments (sketched below):

  • chunk size
  • recurse into files or not
  • output format
  • cores to use?
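
A sketch of that argument surface using clap's derive API (clap 4.x is an assumption, and these flag names are hypothetical, not a committed CLI):

    // Sketch of the argument surface with clap's derive API (clap 4.x assumed;
    // flag names are hypothetical).
    use clap::Parser;

    #[derive(Parser)]
    #[command(name = "hashish")]
    struct Args {
        /// Local files or HTTP(S) URLs to hash.
        targets: Vec<String>,

        /// Read size in bytes for each chunk.
        #[arg(long, default_value_t = 4 * 1024 * 1024)]
        chunk_size: usize,

        /// Recurse into archives (.tar, .zip) instead of hashing them as blobs.
        #[arg(long)]
        recurse: bool,

        /// Output format: plain, csv, or json.
        #[arg(long, default_value = "plain")]
        format: String,

        /// Worker threads to use (defaults to all cores).
        #[arg(long)]
        cores: Option<usize>,
    }

    fn main() {
        let args = Args::parse();
        // hash_target / multi_hash from the earlier sketches would be driven here.
        println!("{} target(s), {} byte chunks", args.targets.len(), args.chunk_size);
    }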

Later Thoughts

Will runs be limited by CPU, disk, or network? That determines where to parallelize and how much data locality matters.