fuzzycat

	Commit message	Author	Age	Files	Lines
*	add example grobid output	Martin Czygan	2020-08-27	1	-0/+195
\|
*	README: add performance data point	Martin Czygan	2020-08-27	2	-0/+22
\|
*	update project README	Martin Czygan	2020-08-27	4	-0/+22
\|
*	move datasets to projects	Martin Czygan	2020-08-27	4	-0/+10
\|
*	update notes	Martin Czygan	2020-08-25	1	-3/+4
\|
*	datasets: add samples item	Martin Czygan	2020-08-25	2	-1/+1
\|
*	start datasets section	Martin Czygan	2020-08-25	2	-0/+16
\|	Datasets to run fuzzy matching over, including a way to download all inputs, run with various parameters, etc.
*	stub: command line	Martin Czygan	2020-08-18	3	-7/+18
\|
*	serial name: no default path	Martin Czygan	2020-08-17	1	-1/+1
\|
*	serial name: no default path	Martin Czygan	2020-08-17	1	-0/+2
\|
*	ignore tmp	Martin Czygan	2020-08-17	1	-0/+1
\|
*	matching: verify release match stub	Martin Czygan	2020-08-17	1	-2/+24
\|
*	tests: add stub	Martin Czygan	2020-08-17	1	-0/+5
\|
*	matching: verify container can verify serial name first	Martin Czygan	2020-08-17	1	-2/+7
\|
*	add stub script	Martin Czygan	2020-08-17	2	-0/+9
\|
*	matching: two stage verification	Martin Czygan	2020-08-17	1	-18/+29
\|
*	large overhaul	Martin Czygan	2020-08-17	14	-234/+577
\|	* separate all fatcat related code into fatcat submodule
\|	* more type annotations
\|	* add verify_serial_name for journal names
*	issn: simhash example	Martin Czygan	2020-08-17	2	-0/+20
\|
*	add notes on abbrevs	Martin Czygan	2020-08-15	3	-1/+2261
\|
*	include original and normalized name in default shelve (1G)	Martin Czygan	2020-08-15	3	-8/+16
\|
*	separate cleanups	Martin Czygan	2020-08-15	2	-0/+47
\|
*	cleanup handling: add parameter	Martin Czygan	2020-08-15	4	-19/+26
\|	allow string cleanup to be called directly
*	update static files	Martin Czygan	2020-08-15	2	-1/+3
\|
*	add extra files	Martin Czygan	2020-08-15	3	-0/+17
\|
*	try out shelve for name lookups	Martin Czygan	2020-08-15	1	-10/+62
\|	Uncompressed, about 500 MB; marisa-trie would need an extra encoding approach (plus it is a heavy dependency).
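The shelve approach mentioned in this commit can be illustrated with a minimal sketch using only the standard library. The function names `build_name_shelf` and `lookup_name`, the record layout and the sample rows are assumptions made for illustration; they are not taken from the fuzzycat code.

```python
import os
import shelve
import tempfile


def build_name_shelf(path, records):
    """Write a name -> list of ISSNs mapping to an on-disk shelf (hypothetical helper)."""
    with shelve.open(path, flag="n") as db:
        for issn, name in records:
            issns = db.get(name, [])
            if issn not in issns:
                # Reassign instead of mutating in place: without writeback,
                # shelve only persists a value when it is assigned.
                db[name] = issns + [issn]


def lookup_name(path, name):
    """Return the ISSNs stored for an exact name, or an empty list."""
    with shelve.open(path, flag="r") as db:
        return db.get(name, [])


# Sample rows; the (issn, name) layout is an assumption for this sketch.
records = [
    ("0011-6637", "Darbininkas."),
    ("0011-6637", "Darbininkas"),
    ("0011-5444", "Daily Kent stater"),
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "names")
    build_name_shelf(path, records)
    found = lookup_name(path, "Darbininkas")
    missing = lookup_name(path, "No such journal")

print(found, missing)
```

A shelf keeps the mapping on disk, so exact lookups do not require loading the whole mapping into memory as a dict first.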
*	update README	Martin Czygan	2020-08-15	1	-1/+5
\|
*	issn: pair with issnl	Martin Czygan	2020-08-14	1	-19/+26
\|
*	update plan	Martin Czygan	2020-08-14	1	-0/+5
\|
*	add de-jsonld flag	Martin Czygan	2020-08-14	1	-15/+57
\|
*	issn: jsonld breakup	Martin Czygan	2020-08-13	1	-25/+190
\|
*	update journal name notebook	Martin Czygan	2020-08-13	1	-434/+442
\|
*	update notebook	Martin Czygan	2020-08-12	1	-86/+729
\|
*	update README	Martin Czygan	2020-08-12	1	-1/+3
\|
*	add journal name notebook	Martin Czygan	2020-08-12	4	-0/+16016
\|
*	add deps for notebooks	Martin Czygan	2020-08-12	1	-4/+6
\|
*	update setup.py	Martin Czygan	2020-08-12	1	-2/+10
\|
*	note on optimization: marisa-trie	Martin Czygan	2020-08-12	1	-0/+1
\|	Currently, the JSON mapping is 172M; turning this into a dict takes a while and consumes GBs of memory. For exact lookups, we might want to use marisa-trie:
\|
\|	> String data in a MARISA-trie may take up to 50x-100x less memory than in a standard Python dict; the raw lookup speed is comparable; trie also provides fast advanced methods like prefix search.
*	update Makefile	Martin Czygan	2020-08-12	1	-8/+12
\|
*	issn: generate a name to issn mapping	Martin Czygan	2020-08-12	2	-31/+88
\|	This allows making suggestions about potentially ambiguous titles. Maybe suggest a minimal length.
\|
\|	Ultimately, there are only about 2M journal titles. If an arbitrary string must match a journal title (not a generic container title), then we can use a combination of direct lookup plus some extra processing based on this dataset.
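A name to ISSN mapping of the kind described here might look like the following sketch. The helpers `build_name_index` and `suggest`, the `min_length` parameter (one possible reading of "minimal length") and the "0000-000x" placeholder ISSNs are all assumptions for illustration, not part of the repository.

```python
import collections


def build_name_index(records):
    """Map each journal name to the set of ISSNs it appears under."""
    index = collections.defaultdict(set)
    for issn, name in records:
        index[name].add(issn)
    return index


def suggest(index, name, min_length=4):
    """Direct lookup; very short strings are skipped as too ambiguous (assumption)."""
    if len(name) < min_length:
        return []
    return sorted(index.get(name, ()))


records = [
    ("0011-6637", "Darbininkas"),
    ("0000-0001", "Bulletin"),  # invented placeholder ISSN
    ("0000-0002", "Bulletin"),  # invented placeholder ISSN
]
index = build_name_index(records)

# A name is ambiguous when it maps to more than one ISSN.
ambiguous = {name for name, issns in index.items() if len(issns) > 1}
print(suggest(index, "Darbininkas"), sorted(ambiguous))
```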
*	stub tool: fuzzycat-issn to generate test data	Martin Czygan	2020-08-12	1	-0/+69
\|	currently:
\|
\|	    fuzzycat-issn --make-pairs
\|
\|	will generate a TSV with (issn, a, b) example, e.g.
\|
\|	    ...
\|	    0011-9717	Detskaâ literatura.	Детская литература.
\|	    0011-9717	Detskaâ literatura.	Detskaâ literatura
\|	    0011-9717	Детская литература.	Detskaâ literatura
\|	    0011-6637	Darbininkas.	Darbininkas
\|	    0012-0820	deutsche Tabakbau	deutsche Tabakbau.
\|	    0011-5444	Daily Kent stater.	Daily Kent stater
\|	    ...
\|
\|	The idea is that these names by definition denote the same journal. We might even have a fixed lookup table, since some variants involve multiple scripts (and there are only around 2M names in total).
\|
\|	Currently 1992176 pairs can be generated.
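The pair output shown in this commit message is consistent with emitting every unordered pair of distinct names recorded for one ISSN. The sketch below illustrates that idea under this assumption; `make_pairs` and its input layout are hypothetical and do not reproduce the actual fuzzycat-issn implementation.

```python
import itertools


def make_pairs(names_by_issn):
    """Yield (issn, a, b) for every unordered pair of distinct names of one ISSN."""
    for issn, names in names_by_issn.items():
        # dict.fromkeys drops duplicates while keeping first-seen order.
        unique = list(dict.fromkeys(names))
        for a, b in itertools.combinations(unique, 2):
            yield issn, a, b


# Names taken from the example rows in the commit message.
sample = {
    "0011-9717": ["Detskaâ literatura.", "Детская литература.", "Detskaâ literatura"],
    "0011-6637": ["Darbininkas.", "Darbininkas"],
}

rows = list(make_pairs(sample))
for row in rows:
    print("\t".join(row))
```

Three names for one ISSN yield three pairs and two names yield one, so the number of pairs grows quadratically with the number of variants per journal.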
*	adjust formatting	Martin Czygan	2020-08-12	2	-2/+6
\|
*	fix imports	Martin Czygan	2020-08-12	2	-2/+2
\|
*	update README	Martin Czygan	2020-08-12	1	-2/+4
\|
*	yapf: reduce column limit	Martin Czygan	2020-08-12	1	-1/+1
\|
*	improve docs and imports	Martin Czygan	2020-08-12	1	-9/+8
\|
*	try: all matching methods should start with match	Martin Czygan	2020-08-12	2	-2/+2
\|
*	makefile: add container export download	Martin Czygan	2020-08-12	1	-1/+6
\|
*	add matching submodule	Martin Czygan	2020-08-12	2	-0/+149
\|
*	add deps: ftfy, unidecode, ipython	Martin Czygan	2020-08-12	1	-1/+3
\|
*	add notes/todo	Martin Czygan	2020-08-12	1	-0/+17
\|