Commit message (Author, Age, Files, Lines)
...
* issn: simhash example (Martin Czygan, 2020-08-17, 2 files, -0/+20)
|
* add notes on abbrevs (Martin Czygan, 2020-08-15, 3 files, -1/+2261)
|
* include original and normalized name in default shelve (1G) (Martin Czygan, 2020-08-15, 3 files, -8/+16)
|
* separate cleanups (Martin Czygan, 2020-08-15, 2 files, -0/+47)
|
* cleanup handling: add parameter (Martin Czygan, 2020-08-15, 4 files, -19/+26)
|
|     allow string cleanup to be called directly
|
* update static files (Martin Czygan, 2020-08-15, 2 files, -1/+3)
|
* add extra files (Martin Czygan, 2020-08-15, 3 files, -0/+17)
|
* try out shelve for name lookups (Martin Czygan, 2020-08-15, 1 file, -10/+62)
|
|     Uncompressed, about 500 MB; marisa-trie would need an extra encoding
|     approach (plus it is a heavy dependency).
|
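A minimal sketch of what a shelve-backed name lookup could look like, using only the Python standard library. The function names, the shelf location and the sample entries are assumptions made for illustration; they are not taken from the fuzzycat code.

```python
import os
import shelve
import tempfile


def build_name_shelf(path, mapping):
    """Persist a name -> list of ISSN mapping into a shelve file at path."""
    # flag="n" always creates a new, empty database.
    with shelve.open(path, flag="n") as db:
        for name, issns in mapping.items():
            db[name] = sorted(issns)


def lookup_name(path, name):
    """Return the ISSNs stored for name, or an empty list if unknown."""
    # flag="r" opens the existing database read-only.
    with shelve.open(path, flag="r") as db:
        return db.get(name, [])


# Illustrative sample data (hypothetical shelf contents).
sample = {"Darbininkas": {"0011-6637"}, "Darbininkas.": {"0011-6637"}}
shelf_path = os.path.join(tempfile.mkdtemp(), "names")
build_name_shelf(shelf_path, sample)
print(lookup_name(shelf_path, "Darbininkas"))
```

The shelf stays on disk, so a lookup does not require loading the whole mapping into memory first, which is the trade-off the commit message weighs against marisa-trie.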
* update README (Martin Czygan, 2020-08-15, 1 file, -1/+5)
|
* issn: pair with issnl (Martin Czygan, 2020-08-14, 1 file, -19/+26)
|
* update plan (Martin Czygan, 2020-08-14, 1 file, -0/+5)
|
* add de-jsonld flag (Martin Czygan, 2020-08-14, 1 file, -15/+57)
|
* issn: jsonld breakup (Martin Czygan, 2020-08-13, 1 file, -25/+190)
|
* update journal name notebook (Martin Czygan, 2020-08-13, 1 file, -434/+442)
|
* update notebook (Martin Czygan, 2020-08-12, 1 file, -86/+729)
|
* update README (Martin Czygan, 2020-08-12, 1 file, -1/+3)
|
* add journal name notebook (Martin Czygan, 2020-08-12, 4 files, -0/+16016)
|
* add deps for notebooks (Martin Czygan, 2020-08-12, 1 file, -4/+6)
|
* update setup.py (Martin Czygan, 2020-08-12, 1 file, -2/+10)
|
* note on optimization: marisa-trie (Martin Czygan, 2020-08-12, 1 file, -0/+1)
|
|     Currently, the JSON mapping is 172M, turning this into a dict takes a
|     bit, plus consumes GBs of memory. For exact lookups, we might want to
|     use marisa-trie:
|
|     > String data in a MARISA-trie may take up to 50x-100x less memory than
|     > in a standard Python dict; the raw lookup speed is comparable; trie
|     > also provides fast advanced methods like prefix search.
|
* update Makefile (Martin Czygan, 2020-08-12, 1 file, -8/+12)
|
* issn: generate a name to issn mapping (Martin Czygan, 2020-08-12, 2 files, -31/+88)
|
|     This allows making suggestions about potentially ambiguous titles.
|     Maybe suggest a minimal length. Ultimately, there are only about 2M
|     journal titles. If an arbitrary string must match a journal title (not
|     a generic container title), then we can use a combination of direct
|     lookup, plus some extra processing based on this dataset.
|
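The mapping described above could be sketched roughly as follows. This is an assumption about the general shape only: the function names, the `min_length` parameter and the sample records (including the made-up ISSNs 1111-1111 and 2222-2222) are hypothetical and not taken from fuzzycat.

```python
import collections


def build_name_to_issn(records):
    """Group (issn, name) records into a name -> set of ISSN mapping."""
    mapping = collections.defaultdict(set)
    for issn, name in records:
        mapping[name].add(issn)
    return dict(mapping)


def ambiguous_names(mapping, min_length=0):
    """Return sorted names that map to more than one ISSN.

    Names shorter than min_length are skipped; this is one possible
    reading of the "minimal length" idea, not the project's rule.
    """
    return sorted(
        name
        for name, issns in mapping.items()
        if len(issns) > 1 and len(name) >= min_length
    )


# Hypothetical records; the first two ISSNs are invented for illustration.
records = [
    ("1111-1111", "Bulletin"),
    ("2222-2222", "Bulletin"),
    ("0011-6637", "Darbininkas"),
]
name_to_issn = build_name_to_issn(records)
print(ambiguous_names(name_to_issn))
```

A title that maps to a single ISSN can be resolved by direct lookup; titles returned by `ambiguous_names` are the ones that would need the extra processing mentioned in the commit message.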
* stub tool: fuzzycat-issn to generate test data (Martin Czygan, 2020-08-12, 1 file, -0/+69)
|
|     currently:
|
|         fuzzycat-issn --make-pairs
|
|     will generate a TSV with (issn, a, b) examples, e.g.
|
|         ...
|         0011-9717    Detskaâ literatura.    Детская литература.
|         0011-9717    Detskaâ literatura.    Detskaâ literatura
|         0011-9717    Детская литература.    Detskaâ literatura
|         0011-6637    Darbininkas.           Darbininkas
|         0012-0820    deutsche Tabakbau      deutsche Tabakbau.
|         0011-5444    Daily Kent stater.     Daily Kent stater
|         ...
|
|     The idea is that these names by definition denote the same journal. We
|     might even have a fixed lookup table, since some variants involve
|     multiple scripts (and there are only around 2M names in total).
|
|     Currently 1992176 pairs can be generated.
|
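One way the pair generation described above could work is to group names by ISSN and emit every unordered pair within a group. The sketch below is an assumption about the approach, not the actual `fuzzycat-issn --make-pairs` implementation; `make_pairs` and the input record layout are hypothetical, while the sample names are taken from the example output above.

```python
import collections
import itertools


def make_pairs(records):
    """Yield (issn, a, b) for every unordered pair of distinct names sharing an ISSN."""
    names_by_issn = collections.defaultdict(list)
    for issn, name in records:
        # Keep first-seen order and drop exact duplicates.
        if name not in names_by_issn[issn]:
            names_by_issn[issn].append(name)
    for issn, names in names_by_issn.items():
        for a, b in itertools.combinations(names, 2):
            yield (issn, a, b)


records = [
    ("0011-9717", "Detskaâ literatura."),
    ("0011-9717", "Детская литература."),
    ("0011-9717", "Detskaâ literatura"),
    ("0011-6637", "Darbininkas."),
    ("0011-6637", "Darbininkas"),
]
pairs = list(make_pairs(records))
for issn, a, b in pairs:
    # Tab-separated, matching the TSV layout described in the commit message.
    print("\t".join((issn, a, b)))
```

An ISSN with n distinct names yields n(n-1)/2 pairs, so the three variants of 0011-9717 produce three rows and the two variants of 0011-6637 produce one.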
* adjust formatting (Martin Czygan, 2020-08-12, 2 files, -2/+6)
|
* fix imports (Martin Czygan, 2020-08-12, 2 files, -2/+2)
|
* update README (Martin Czygan, 2020-08-12, 1 file, -2/+4)
|
* yapf: reduce column limit (Martin Czygan, 2020-08-12, 1 file, -1/+1)
|
* improve docs and imports (Martin Czygan, 2020-08-12, 1 file, -9/+8)
|
* try: all matching methods should start with match (Martin Czygan, 2020-08-12, 2 files, -2/+2)
|
* makefile: add container export download (Martin Czygan, 2020-08-12, 1 file, -1/+6)
|
* add matching submodule (Martin Czygan, 2020-08-12, 2 files, -0/+149)
|
* add deps: ftfy, unidecode, ipython (Martin Czygan, 2020-08-12, 1 file, -1/+3)
|
* add notes/todo (Martin Czygan, 2020-08-12, 1 file, -0/+17)
|
* makefile: fix typo (Martin Czygan, 2020-08-12, 1 file, -1/+1)
|
* add coverage dependency (Martin Czygan, 2020-08-12, 2 files, -7/+12)
|
* setup: require fatcat-openapi-client (Martin Czygan, 2020-08-12, 1 file, -1/+3)
|
* switch to yapf (Martin Czygan, 2020-08-12, 6 files, -10/+28)
|
* add tests (Martin Czygan, 2020-08-12, 1 file, -0/+115)
|
* utils: fix imports (Martin Czygan, 2020-08-12, 1 file, -1/+1)
|
* fix status definition (Martin Czygan, 2020-08-12, 2 files, -1/+3)
|
* add pytest dev dependency (Martin Czygan, 2020-08-12, 1 file, -1/+1)
|
* import utility functions (Martin Czygan, 2020-08-12, 3 files, -0/+165)
|
* apply formatting style (Martin Czygan, 2020-08-12, 1 file, -1/+1)
|
* add basic str utils (Martin Czygan, 2020-08-12, 2 files, -0/+83)
|
* add makefile style target (Martin Czygan, 2020-08-12, 2 files, -3/+5)
|
* cleanup build directory as well (Martin Czygan, 2020-08-12, 1 file, -0/+1)
|
* v0.1.1 (Martin Czygan, 2020-08-12, 1 file, -0/+1)
|
* specify version in one place only (Martin Czygan, 2020-08-12, 2 files, -2/+5)
|
|     use: fuzzycat/__init__.py
|
* let make deps pipenv install use pre releases (Martin Czygan, 2020-08-12, 3 files, -3/+359)
|
|     The problem appeared because black seems to be a pre-release, cf.
|     https://github.com/microsoft/vscode-python/issues/5171.
|
* allow pypi uploads (Martin Czygan, 2020-08-12, 2 files, -3/+19)
|
|     see: https://pypi.org/project/fuzzycat/