wip

author: Bryan Newbold <bnewbold@archive.org> 2017-07-25 21:16:28 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2017-07-25 21:16:28 -0700
commit: 9155b7da0ff7e02776f34bed40215aca97d04cf9 (patch)
tree: 144a6d50a0def098add10abf013b68c6ff9d3a74
parent: 6cb3a82f42c8f1365fa4765f0ceb7d56a3c1547c (diff)
download: lsh-interop-9155b7da0ff7e02776f34bed40215aca97d04cf9.tar.gz
lsh-interop-9155b7da0ff7e02776f34bed40215aca97d04cf9.zip
1 files changed, 129 insertions, 0 deletions
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..b3d8a90
--- /dev/null
+++ b/README.md
@@ -0,0 +1,129 @@
+
+# Interoperability of Similarity Hashing Schemes
+
+This (work-in-progress) document proposes specific document fingerprinting
+algorithms, configuration, and practices for use in identifying and
+cross-referencing nearly identical documents in several medias. For example,
+web content across various HTML encodings and web design revisions, or research
+publications across PDF, HTML, and XML formats.
+
+The motivation to come to a public consensus on specific details of these
+techniques is to enhance collaboration in efforts like web archiving,
+automated document identification, verifying repository contents, and
+near-duplicate-detection across distributed datasets.
+
+These techniques are variously refered to as "near duplicate detection" and
+"similarity hashing". The major advances in this area historically came from
+commercial search engine development and the field of information retrieval in
+academia.
+
+Contributions, corrections, and feedback to this document welcomed! You can use
+github features (issues and pull-requests) or email contributors directly.
+
+## Background
+
+The techniques described here usually complement both regular content hashing
+and full-document similarity measures. Regular hashing (or digests) generally
+have the property that exact byte-for-byte file matches will have equal hashes,
+but any small difference between files results in a completely different hash.
+The hash is of a fixed length of some 40 to 64 bytes (very small compared to
+document size).  This is useful for fast equality checking and verify the
+integrity of files, but is too brittle for use identification across media.
+Similarity measures (such as edit distance or cosine similarity) often take a
+full document (or an extract set of tokens from the document) and give a
+fractional measure of similarity. They have the property of being relatively
+inefficient for lookups in large corpuses because each document must be
+compared one-to-one.
+
+Similarity hashing is often used as an optimization on top of similarity
+measures: the hashes of all documents are pre-computed and then sorted, and
+only those documents within a fixed distance in the sorted list need to be
+compared. This reduces identification of near-duplicates in a corpus from
+`O(N^2)` similarity computations to `O(N)` similarity compurations. Indexed
+lists also reduce work in the context of individual lookup query for large
+corpuses.
+
+Some notable algorithms to consider are:
+
+- `simhash`
+- `minhash`
+
+However, the papers describing these algorithms have a few parameters (eg, hash
+bit length, token hash algorithm), and leave several important steps up to the
+implementor, such as:
+
+- media-specific content extraction (eg, stripping some or all HTML tags)
+- input encoding normalization (UTF-8, etc)
+- tokenization (particularly for non-latin languages)
+- stop words
+- string serialization (eg, base16, base32, or base64 encoding)
+- prefixes or short identifiers for specific schemes
+
+These parameters are usually left to be tuned by implementors. Some parameters
+can still be left to tuning by implementors, because they happen at comparison
+time, not during batch indexing:
+
+- number of segmented rotations when sorting and distance of match (in bit
+  flips) to seach for 
+- algorithm or measure used for final similarity measurement
+
+Note that unlike regular hashing algorithms, the details for similarity hashes
+don't necessarily need to be specified in completeness: if differences between
+implementations only result in (statistically) tiny changes in the hash, hashes
+should still compare somewhat accurately. Complete reproducibility (exact
+similarity hash output for exact document input) between implementations is
+desirable, but not strictly necessary in all cases, leaving some wiggle room
+for implementors and optimization.
+
+## Proposed Schemes
+
+Similar to the `blake2` and `SHA` families of hashes, I can imagine that a
+small (1-4) set of variants might be useful for use in different contexts. It's
+also good practice to build software, protocols, and data models to permit
+swapping out algorithms for future flexibility, though of course the whole
+concept here is to settle on a small number of consensus schemes for
+interoperability.
+
+TODO: summary table here
+
+### simhash-doc
+
+The roughly defined scope for this scheme would be "documents" of length
+between one and fifty pages when printed. If it could scale down to 2-3
+paragraphs and up to thousand-page documents that would be great, but it should
+be optimized for the narrower scope. It should be useful across media (web
+pages, PDFs, plain text, etc), and across the most popular languages.
+
+Proposed extraction and tokenization:
+
+- strip all metadata and styling when extracting. include figure captions,
+  table contents, quoted text. Do include reference lists, but do not include
+  tokens from URLs or identifiers.
+- UTF-8 encoded tokens
+- fallback to unicode word separators for tokenization (TODO: ???)
+- no zero-width or non-printing unicode modifiers
+- tokens should include only "alphanumeric" characters (TODO: as defined by
+  unicode plane?)
+- numbers (unless part of an alphanumeric string, eg an acronym) should not be
+  included
+- tokens (words) must be 3 characters minimum
+- OPTIONALLY, a language-specific stop-list appropriate for search-engine
+  indexing may be used.
+
+Proposed algorithm and parameters:
+
+- `simhash` algorithm
+- TODO: fast token hashing algorithm?
+- TODO: 64x signed 8-bit buckets during calculation
+- 64bit length final form
+- base32 (non-capitalization-sensitive) when serialized as a string
+
+Recommended lookup parameters:
+
+- TODO: how many columns in rotation?
+- TODO: N-bit window
+- TODO: cosine similarity measure? edit distance?
+- TODO: 0.XYZ considered a loose match, 0.XYZ considered a close match
+
+## References
+
author	Bryan Newbold <bnewbold@archive.org>	2017-07-25 21:16:28 -0700
committer	Bryan Newbold <bnewbold@archive.org>	2017-07-25 21:16:28 -0700
commit	9155b7da0ff7e02776f34bed40215aca97d04cf9 (patch)
tree	144a6d50a0def098add10abf013b68c6ff9d3a74
parent	6cb3a82f42c8f1365fa4765f0ceb7d56a3c1547c (diff)
download	lsh-interop-9155b7da0ff7e02776f34bed40215aca97d04cf9.tar.gz lsh-interop-9155b7da0ff7e02776f34bed40215aca97d04cf9.zip