# bad-hashish: a tool for recursively, remotely multi-hashing files
"recursively" meaning that files inside archives (.zip, .tar.gz) are hashed
without extracting everything to disk.
"remotely" meaning that large remote (HTTP/HTTPS) files can be hashed in a
streaming fashion without saving to disk.
"multi-" meaning that mulitple hash algorithms are computed in a single pass.
There are other ways to do most of these; in un-UNIX-y fashion (for now) this
tool does them all together.
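A minimal sketch of the single-pass idea, assuming the RustCrypto digest-trait
crates (`sha1`, `sha2`, `md-5`, `blake2`, `digest`) plus `data-encoding`
rather than the older `rust-crypto` listed below; `multi_hash` is a
hypothetical helper, not settled API:

```rust
// Sketch only: assumes recent RustCrypto crates (sha1, sha2, md-5, blake2,
// digest) and data-encoding; crate names and versions are illustrative.
use blake2::Blake2b512;
use data_encoding::HEXLOWER;
use digest::Digest;
use md5::Md5; // the RustCrypto `md-5` crate, not the older `md5`
use sha1::Sha1;
use sha2::{Sha256, Sha512};
use std::io::Read;

/// Read the input once, feeding every hasher from the same buffer.
fn multi_hash<R: Read>(mut reader: R) -> std::io::Result<Vec<(&'static str, String)>> {
    let (mut h1, mut h256, mut h512) = (Sha1::new(), Sha256::new(), Sha512::new());
    let (mut hmd5, mut hb2) = (Md5::new(), Blake2b512::new());
    let mut buf = vec![0u8; 4 * 1024 * 1024]; // 4MB chunks, as in the MVP notes
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break;
        }
        h1.update(&buf[..n]);
        h256.update(&buf[..n]);
        h512.update(&buf[..n]);
        hmd5.update(&buf[..n]);
        hb2.update(&buf[..n]);
    }
    Ok(vec![
        ("sha1", HEXLOWER.encode(h1.finalize().as_slice())),
        ("sha256", HEXLOWER.encode(h256.finalize().as_slice())),
        ("sha512", HEXLOWER.encode(h512.finalize().as_slice())),
        ("md5", HEXLOWER.encode(hmd5.finalize().as_slice())),
        ("blake2b", HEXLOWER.encode(hb2.finalize().as_slice())),
    ])
}
```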
## Planned Features
- sha1, sha256, sha512, md5, blake2b
- support base64, base32, and hex (upper/lower) output encodings, etc.
- can recurse on .tar and .zip (and more?) without hitting disk (see the sketch after this list)
- can stream files via HTTP(S) without hitting disk
- variable output (json, tsv, etc)
Someday?
- dat, ipfs, zsync index computations
- simhash/minhash/etc, for plain text:
  https://github.com/bartolsthoorn/simhash-rs
- support piping out to arbitrary other commands
  (e.g., simhash over extracted PDF text, image hashes...)
  https://github.com/abonander/img_hash
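One way the archive recursion could look without touching disk, assuming the
`tar` and `flate2` crates from the list below; `multi_hash` is the
hypothetical helper sketched above:

```rust
use flate2::read::GzDecoder;
use std::fs::File;
use tar::Archive;

/// Hash every file inside a .tar.gz as the archive streams through;
/// nothing is extracted to disk. Sketch only.
fn hash_tar_gz(path: &str) -> std::io::Result<()> {
    let mut archive = Archive::new(GzDecoder::new(File::open(path)?));
    for entry in archive.entries()? {
        let entry = entry?;
        let name = entry.path()?.display().to_string();
        // Each entry implements Read, so it can be fed straight to the hashers.
        for (algo, hex) in multi_hash(entry)? {
            println!("{}\t{}\t{}", name, algo, hex);
        }
    }
    Ok(())
}
```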
## Planned Libraries
rust:
- zip
- tar + flate2
- tree_magic
- rust-crypto
- crc
- clap
- error-chain
- reqwest
- log (or slog?)
- rayon (for parallelization?)
- something json (serde_json?)
- csv (xsv?)
- data-encoding
## Minimum Viable Version
Parse arguments as local files or URLs. Either way, start reading/streaming
data and hand the stream off to a consumer that reads 4MB chunks at a time and
feeds every hasher (see the sketch below).
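A sketch of that dispatch, assuming reqwest's blocking client (its response
type implements `Read`, so the same hypothetical `multi_hash` helper consumes
files and URLs alike):

```rust
use std::fs::File;
use std::io::Read;

/// Treat arguments starting with http(s):// as URLs, everything else as
/// local paths, and stream both through the same hashers. Sketch only.
fn hash_arg(arg: &str) -> Result<(), Box<dyn std::error::Error>> {
    let reader: Box<dyn Read> = if arg.starts_with("http://") || arg.starts_with("https://") {
        // Assumes reqwest's blocking API; the body is streamed, never saved.
        Box::new(reqwest::blocking::get(arg)?)
    } else {
        Box::new(File::open(arg)?)
    };
    for (algo, hex) in multi_hash(reader)? {
        println!("{} {} {}", arg, algo, hex);
    }
    Ok(())
}
```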
Next, add parallelization (rayon?) for hashes.
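If the hashing becomes CPU-bound, one option is to box the hashers behind the
object-safe `DynDigest` trait and let rayon fan each chunk out across them; a
sketch under that assumption:

```rust
use digest::DynDigest;
use rayon::prelude::*;
use std::io::Read;

/// Variant of multi_hash where each chunk is fed to all hashers in
/// parallel, one rayon task per algorithm. Sketch only.
fn multi_hash_par<R: Read>(
    mut reader: R,
    hashers: &mut [(&'static str, Box<dyn DynDigest + Send>)],
) -> std::io::Result<()> {
    let mut buf = vec![0u8; 4 * 1024 * 1024];
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break;
        }
        let chunk = &buf[..n];
        hashers.par_iter_mut().for_each(|(_, h)| h.update(chunk));
    }
    Ok(())
}
```

Whether this beats plain per-file parallelism (e.g. the `parallel -j8`
example below) probably depends on input sizes.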
Output as space-separated (default), csv, or json, one line per file.
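The JSON shape is undecided; a guess at one record per line, assuming
serde_json fills the "something json" slot above:

```rust
/// Print one JSON object per file; the field names here are just a guess.
fn print_json(path: &str, hashes: &[(&str, String)]) {
    let mut obj = serde_json::Map::new();
    obj.insert("path".into(), path.into());
    for (algo, hex) in hashes {
        obj.insert((*algo).into(), hex.clone().into());
    }
    println!("{}", serde_json::Value::Object(obj));
}
```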
Examples:

```
hashish some_file.txt
cat zip_urls.txt | parallel -j8 hashish --recurse-only {} > all_hashes.txt
```
Arguments:
- chunk size
- recurse into files or not
- output format
- cores to use?
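Those arguments might map onto clap roughly as follows (clap 2-era builder
API; apart from --recurse-only, which appears in the examples above, the flag
names and semantics are invented here for illustration):

```rust
use clap::{App, Arg};

// Sketch of the CLI surface; only --recurse-only comes from the examples
// above, and even its exact meaning is a guess.
fn parse_args() -> clap::ArgMatches<'static> {
    App::new("hashish")
        .arg(Arg::with_name("chunk-size")
            .long("chunk-size")
            .takes_value(true)
            .default_value("4194304") // 4MB
            .help("read buffer size in bytes"))
        .arg(Arg::with_name("recurse-only")
            .long("recurse-only")
            .help("hash the files inside archives rather than the archive itself"))
        .arg(Arg::with_name("format")
            .long("format")
            .takes_value(true)
            .possible_values(&["plain", "csv", "json"])
            .default_value("plain"))
        .arg(Arg::with_name("jobs")
            .short("j")
            .long("jobs")
            .takes_value(true)
            .help("worker threads to use"))
        .arg(Arg::with_name("inputs")
            .multiple(true)
            .required(true)
            .help("local files or HTTP(S) URLs"))
        .get_matches()
}
```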
## Later Thoughts
Limited by {CPU, disk, network}? Where to parallelize? Data locality.