author | Bryan Newbold <bnewbold@archive.org> | 2017-07-26 10:41:09 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2017-07-26 10:41:09 -0700 |
commit | 1e3189dc39c8facb6e27f076b45d7f1138e6c2eb (patch) | |
tree | a8e17ee348774e2eb1f8d68ec27ca6d527c185e4 | |
parent | df56911d240d5153055285525940a549724905ba (diff) | |
download | lsh-interop-1e3189dc39c8facb6e27f076b45d7f1138e6c2eb.tar.gz lsh-interop-1e3189dc39c8facb6e27f076b45d7f1138e6c2eb.zip |
update tokenization notes
-rw-r--r-- | README.md | 13 |
1 files changed, 8 insertions, 5 deletions
@@ -101,13 +101,16 @@ Proposed extraction and tokenization:
   table contents, quoted text. Do include reference lists, but do not include
   tokens from URLs or identifiers.
 - UTF-8 encoded tokens
-- fallback to unicode word separators for tokenization (TODO: ???)
-- no zero-width or non-printing unicode modifiers
-- tokens should include only "alphanumeric" characters (TODO: as defined by
-  unicode plane?)
+- fallback to unicode word-character boundaries for tokenization if a
+  language-specific tokenizer is not available
+- tokens should include only "word characters", as commonly included in
+  unicode-aware regex libraries. Specifically including the cateogires: `Ll Lu
+  Lt Lo Lm Mn Nd Pc`. They must include at least one letter/"Alphabetic"
+  character.
+- specifically, no zero-width or non-printing unicode modifiers
 - numbers (unless part of an alphanumeric string, eg an acronym) should not be
   included
-- tokens (words) must be 3 characters minimum
+- TODO: instead, strip all numeric characters?
 - OPTIONALLY, a language-specific stop-list appropriate for search-engine
   indexing may be used.
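
The fallback tokenization rules added in this hunk could be sketched roughly as follows. This is an illustration only, not code from the repository: the names `WORD_CATEGORIES`, `is_word_char`, and `tokenize` are hypothetical, and `str.isalpha()` is used as an approximation of the letter/"Alphabetic" requirement (it covers the `L*` general categories, which is slightly narrower than the Unicode Alphabetic property). Any character outside the listed categories, including zero-width and other format characters, is treated as a token boundary here; the notes do not say whether such characters should split a token or be stripped, so that choice is an assumption.

```python
import unicodedata

# General categories listed in the updated notes as "word characters".
WORD_CATEGORIES = {"Ll", "Lu", "Lt", "Lo", "Lm", "Mn", "Nd", "Pc"}


def is_word_char(ch):
    """Return True if ch falls in one of the listed Unicode categories."""
    return unicodedata.category(ch) in WORD_CATEGORIES


def tokenize(text):
    """Split text on non-word characters and keep tokens with a letter.

    Purely numeric tokens (e.g. "2017") are dropped because they contain
    no letter; alphanumeric strings such as "H2O" are kept.
    """
    tokens = []
    current = []
    # The trailing space is a sentinel that flushes the final token.
    for ch in text + " ":
        if is_word_char(ch):
            current.append(ch)
        elif current:
            token = "".join(current)
            # Require at least one letter (assumption: str.isalpha()).
            if any(c.isalpha() for c in token):
                tokens.append(token)
            current = []
    return tokens


if __name__ == "__main__":
    print(tokenize("The H2O sample, 2017 edition"))
```

Under these assumptions, a zero-width space (category `Cf`) inside a word splits it into two tokens, while a combining mark (category `Mn`) stays attached to its base letter. A language-specific tokenizer or stop-list, where available, would replace or post-process this fallback.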