From 0b4db31a797a582c25942e693d531ee37b618674 Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Wed, 12 Aug 2020 15:05:51 +0200
Subject: note on optimization: marisa-trie

Currently, the JSON mapping is 172M, turning this into a dict takes a bit, plus consumes GBs of memory. For exact lookups, we might want to use marisa-trie:

> String data in a MARISA-trie may take up to 50x-100x less memory than in a standard Python dict; the raw lookup speed is comparable; trie also provides fast advanced methods like prefix search.
---
 notes/plan.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/notes/plan.md b/notes/plan.md
index 1660f25..0e319ae 100644
--- a/notes/plan.md
+++ b/notes/plan.md
@@ -7,6 +7,7 @@
 ## Containers
 
 * [ ] create notebook on duplicates
+* [ ] static mapping, that is efficient to store, maybe via: https://github.com/pytries/marisa-trie
 
 ## Bulk
 
-- 
cgit v1.2.3
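
A minimal sketch of the exact-lookup idea from the commit message, assuming the mapping is a flat JSON object of string keys to string values (the commit does not say what the mapping contains, so the sample data and the helper names `build_mapping` and `lookup` are hypothetical). It uses `marisa_trie.BytesTrie`, which stores bytes payloads per key, and falls back to a plain dict when the library is not installed so that the sketch still runs.

```python
import json

try:
    import marisa_trie  # third-party: pip install marisa-trie
except ImportError:  # keep the sketch runnable without the dependency
    marisa_trie = None


def build_mapping(pairs):
    """Build a static, read-only str -> str mapping from (key, value) pairs."""
    pairs = list(pairs)
    if marisa_trie is None:
        return dict(pairs)
    # BytesTrie stores bytes payloads, so values are encoded as UTF-8.
    return marisa_trie.BytesTrie([(k, v.encode("utf-8")) for k, v in pairs])


def lookup(mapping, key, default=None):
    """Exact lookup: return the first value stored for key, or default."""
    if key not in mapping:
        return default
    value = mapping[key]
    if isinstance(mapping, dict):
        return value
    # BytesTrie returns a list of bytes payloads for a key.
    return value[0].decode("utf-8")


# Hypothetical sample data standing in for the 172M JSON mapping.
sample = json.loads('{"key-1": "value-1", "key-2": "value-2"}')
mapping = build_mapping(sample.items())

print(lookup(mapping, "key-1"))
print(lookup(mapping, "missing"))
```

A trie built this way can also be written to disk with `mapping.save(path)` and reloaded later, which would avoid re-parsing the JSON on every start; whether the memory savings quoted above hold for this particular mapping would need to be measured.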