update tokenization notes

author: Bryan Newbold <bnewbold@archive.org> 2017-07-26 10:41:09 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2017-07-26 10:41:09 -0700
commit: 1e3189dc39c8facb6e27f076b45d7f1138e6c2eb (patch)
tree: a8e17ee348774e2eb1f8d68ec27ca6d527c185e4
parent: df56911d240d5153055285525940a549724905ba (diff)
download: lsh-interop-1e3189dc39c8facb6e27f076b45d7f1138e6c2eb.tar.gz lsh-interop-1e3189dc39c8facb6e27f076b45d7f1138e6c2eb.zip
1 files changed, 8 insertions, 5 deletions
diff --git a/README.md b/README.md
index 94dfc81..54782be 100644
--- a/README.md
+++ b/README.md
@@ -101,13 +101,16 @@ Proposed extraction and tokenization:
   table contents, quoted text. Do include reference lists, but do not include
   tokens from URLs or identifiers.
 - UTF-8 encoded tokens
-- fallback to unicode word separators for tokenization (TODO: ???)
-- no zero-width or non-printing unicode modifiers
-- tokens should include only "alphanumeric" characters (TODO: as defined by
-  unicode plane?)
+- fallback to unicode word-character boundaries for tokenization if a
+  language-specific tokenizer is not available
+- tokens should include only "word characters", as commonly included in
+  unicode-aware regex libraries. Specifically including the categories: `Ll Lu
+  Lt Lo Lm Mn Nd Pc`. They must include at least one letter/"Alphabetic"
+  character.
+- specifically, no zero-width or non-printing unicode modifiers
 - numbers (unless part of an alphanumeric string, eg an acronym) should not be
   included
-- tokens (words) must be 3 characters minimum
+- TODO: instead, strip all numeric characters?
 - OPTIONALLY, a language-specific stop-list appropriate for search-engine
   indexing may be used.
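
The fallback rule added in this patch could be sketched roughly as follows, using only Python's standard `unicodedata` module. This is an illustration under stated assumptions, not the project's implementation: the names `WORD_CATEGORIES`, `LETTER_CATEGORIES`, and `fallback_tokenize` are hypothetical, the "at least one letter" check is approximated by the five letter categories in the list (the Unicode "Alphabetic" property is somewhat broader), and characters outside the listed categories, including zero-width ones, are simply treated as token boundaries.

```python
import unicodedata

# General categories the README lists as "word characters".
WORD_CATEGORIES = {"Ll", "Lu", "Lt", "Lo", "Lm", "Mn", "Nd", "Pc"}
# Assumption: "at least one letter" is approximated by the letter categories.
LETTER_CATEGORIES = {"Ll", "Lu", "Lt", "Lo", "Lm"}


def fallback_tokenize(text):
    """Split text into runs of word characters; keep runs containing a letter."""
    tokens = []
    current = []
    for ch in text:
        if unicodedata.category(ch) in WORD_CATEGORIES:
            current.append(ch)
        elif current:
            # Any other character (punctuation, space, zero-width, ...) ends a token.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    # Drop purely numeric or punctuation-connector tokens such as "2017" or "_".
    return [
        tok for tok in tokens
        if any(unicodedata.category(c) in LETTER_CATEGORIES for c in tok)
    ]
```

For example, `fallback_tokenize("Hello, world! 2017 HTML5")` keeps `HTML5` (a number inside an alphanumeric string) but drops the bare `2017`. Encoding each token with `tok.encode("utf-8")` would produce the UTF-8 form the notes call for.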