author | Martin Czygan <martin.czygan@gmail.com> | 2020-08-12 15:05:51 +0200 |
---|---|---|
committer | Martin Czygan <martin.czygan@gmail.com> | 2020-08-12 15:05:51 +0200 |
commit | 0b4db31a797a582c25942e693d531ee37b618674 (patch) | |
tree | 3363c9ba42e711234a911931e65ac184520892a3 /notes | |
parent | 703fdbebc53352036bfa9e9a13599421e38d949e (diff) | |
note on optimization: marisa-trie
Currently, the JSON mapping is 172M; turning it into a dict takes a
while and consumes GBs of memory. For exact lookups, we might want to
use marisa-trie:
> String data in a MARISA-trie may take up to 50x-100x less memory than
in a standard Python dict; the raw lookup speed is comparable; trie also
provides fast advanced methods like prefix search.
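A minimal sketch of what the exact-lookup idea could look like, assuming the JSON mapping is a flat object of string keys to string values (the commit does not describe its shape). The helper names `encode_items`, `build_trie` and `lookup` are hypothetical and not part of fuzzycat; `marisa_trie.BytesTrie` comes from the third-party marisa-trie package.

```python
import json

try:
    import marisa_trie  # third-party: pip install marisa-trie
except ImportError:  # keep the sketch importable without the dependency
    marisa_trie = None


def encode_items(mapping):
    """Yield (key, value-as-bytes) pairs, the input shape BytesTrie expects."""
    for key, value in mapping.items():
        yield key, value.encode("utf-8")


def build_trie(mapping):
    """Build a static trie from a {str: str} mapping (assumed shape)."""
    if marisa_trie is None:
        raise RuntimeError("marisa-trie is not installed")
    return marisa_trie.BytesTrie(encode_items(mapping))


def lookup(trie, key):
    """Exact lookup; a BytesTrie keeps a list of values per key."""
    if key in trie:
        return trie[key][0].decode("utf-8")
    return None


if __name__ == "__main__":
    # Hypothetical sample data standing in for the real 172M mapping.
    mapping = json.loads('{"nature": "c-1", "science": "c-2"}')
    trie = build_trie(mapping)
    print(lookup(trie, "nature"))
```

The trie is static: it is built once from all items and cannot be updated afterwards, which fits a mapping that is regenerated in bulk. It could also be written to disk with `trie.save(path)` and reloaded later, avoiding the JSON parse on startup.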
Diffstat (limited to 'notes')
-rw-r--r-- | notes/plan.md | 1 |
1 files changed, 1 insertions, 0 deletions
diff --git a/notes/plan.md b/notes/plan.md
index 1660f25..0e319ae 100644
--- a/notes/plan.md
+++ b/notes/plan.md
@@ -7,6 +7,7 @@
 ## Containers
 * [ ] create notebook on duplicates
+* [ ] static mapping, that is efficient to store, maybe via: https://github.com/pytries/marisa-trie
 ## Bulk