Commit message | Author | Age | Files | Lines
...
* docs: ambiguous titles | Martin Czygan | 2020-09-03 | 1 | -0/+8
|
* docs: another example of a long title | Martin Czygan | 2020-09-03 | 1 | -0/+6
|
* docs: another quality issue | Martin Czygan | 2020-09-03 | 1 | -0/+7
|
* docs: common title issue | Martin Czygan | 2020-09-03 | 1 | -0/+12
|
* docs: add link to issue | Martin Czygan | 2020-09-03 | 1 | -0/+2
|
* update various docs; start data issue log | Martin Czygan | 2020-09-03 | 5 | -2/+26
|
* add example grobid output | Martin Czygan | 2020-08-27 | 1 | -0/+195
|
* README: add performance data point | Martin Czygan | 2020-08-27 | 2 | -0/+22
|
* update project README | Martin Czygan | 2020-08-27 | 4 | -0/+22
|
* move datasets to projects | Martin Czygan | 2020-08-27 | 4 | -0/+10
|
* update notes | Martin Czygan | 2020-08-25 | 1 | -3/+4
|
* datasets: add samples item | Martin Czygan | 2020-08-25 | 2 | -1/+1
|
* start datasets section | Martin Czygan | 2020-08-25 | 2 | -0/+16
|   Datasets to run fuzzy matching over, including a way to download all inputs, run with various parameters, etc.
* stub: command line | Martin Czygan | 2020-08-18 | 3 | -7/+18
|
* serial name: no default path | Martin Czygan | 2020-08-17 | 1 | -1/+1
|
* serial name: no default path | Martin Czygan | 2020-08-17 | 1 | -0/+2
|
* ignore tmp | Martin Czygan | 2020-08-17 | 1 | -0/+1
|
* matching: verify release match stub | Martin Czygan | 2020-08-17 | 1 | -2/+24
|
* tests: add stub | Martin Czygan | 2020-08-17 | 1 | -0/+5
|
* matching: verify container can verify serial name first | Martin Czygan | 2020-08-17 | 1 | -2/+7
|
* add stub script | Martin Czygan | 2020-08-17 | 2 | -0/+9
|
* matching: two stage verification | Martin Czygan | 2020-08-17 | 1 | -18/+29
|
* large overhaul | Martin Czygan | 2020-08-17 | 14 | -234/+577
|   * separate all fatcat related code into fatcat submodule
|   * more type annotations
|   * add verify_serial_name for journal names
* issn: simhash example | Martin Czygan | 2020-08-17 | 2 | -0/+20
|
* add notes on abbrevs | Martin Czygan | 2020-08-15 | 3 | -1/+2261
|
* include original and normalized name in default shelve (1G) | Martin Czygan | 2020-08-15 | 3 | -8/+16
|
* separate cleanups | Martin Czygan | 2020-08-15 | 2 | -0/+47
|
* cleanup handling: add parameter | Martin Czygan | 2020-08-15 | 4 | -19/+26
|   allow string cleanup to be called directly
* update static files | Martin Czygan | 2020-08-15 | 2 | -1/+3
|
* add extra files | Martin Czygan | 2020-08-15 | 3 | -0/+17
|
* try out shelve for name lookups | Martin Czygan | 2020-08-15 | 1 | -10/+62
|   About 500 MB uncompressed; marisa-trie would need an extra encoding approach (plus it is a heavy dependency).
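A minimal sketch of what a shelve-backed name lookup could look like, using only the standard library. The function names and the name-to-ISSN-list layout are assumptions made for illustration; the commit itself does not show the stored structure.

```python
import shelve


def build_name_shelf(path, name_to_issns):
    """Write a name -> sorted ISSN list mapping into a shelve file at `path`.

    Hypothetical helper: the layout (str key, list value) is an assumption.
    """
    with shelve.open(path, flag="n") as db:  # "n": always create a new, empty shelf
        for name, issns in name_to_issns.items():
            db[name] = sorted(issns)


def lookup_name(path, name):
    """Return the ISSN list stored for `name`, or an empty list if absent."""
    with shelve.open(path, flag="r") as db:  # read-only access
        return db.get(name, [])
```

The shelf stays on disk, so a lookup does not require loading the full mapping into memory first; the trade-off is the on-disk size mentioned above.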
* update README | Martin Czygan | 2020-08-15 | 1 | -1/+5
|
* issn: pair with issnl | Martin Czygan | 2020-08-14 | 1 | -19/+26
|
* update plan | Martin Czygan | 2020-08-14 | 1 | -0/+5
|
* add de-jsonld flag | Martin Czygan | 2020-08-14 | 1 | -15/+57
|
* issn: jsonld breakup | Martin Czygan | 2020-08-13 | 1 | -25/+190
|
* update journal name notebook | Martin Czygan | 2020-08-13 | 1 | -434/+442
|
* update notebook | Martin Czygan | 2020-08-12 | 1 | -86/+729
|
* update README | Martin Czygan | 2020-08-12 | 1 | -1/+3
|
* add journal name notebook | Martin Czygan | 2020-08-12 | 4 | -0/+16016
|
* add deps for notebooks | Martin Czygan | 2020-08-12 | 1 | -4/+6
|
* update setup.py | Martin Czygan | 2020-08-12 | 1 | -2/+10
|
* note on optimization: marisa-trie | Martin Czygan | 2020-08-12 | 1 | -0/+1
|   Currently, the JSON mapping is 172M, turning this into a dict takes a bit, plus consumes GBs of memory. For exact lookups, we might want to use marisa-trie:
|
|   > String data in a MARISA-trie may take up to 50x-100x less memory than in a standard Python dict; the raw lookup speed is comparable; trie also provides fast advanced methods like prefix search.
* update Makefile | Martin Czygan | 2020-08-12 | 1 | -8/+12
|
* issn: generate a name to issn mapping | Martin Czygan | 2020-08-12 | 2 | -31/+88
|   This allows us to make suggestions about potentially ambiguous titles. Maybe suggest a minimal length.
|
|   Ultimately, there are only about 2M journal titles. If an arbitrary string must match a journal title (not a generic container title), then we can use a combination of direct lookup plus some extra processing based on this dataset.
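A minimal sketch of how a name to ISSN mapping can flag ambiguous titles: a name that maps to more than one ISSN cannot identify a journal on its own. The function names and the (issn, name) input shape are assumptions for illustration, not the code from this commit.

```python
from collections import defaultdict


def build_name_mapping(pairs):
    """Group ISSNs by journal name; `pairs` is an iterable of (issn, name)."""
    mapping = defaultdict(set)
    for issn, name in pairs:
        mapping[name].add(issn)
    return mapping


def ambiguous_names(mapping):
    """Return only the names that are shared by more than one ISSN."""
    return {
        name: sorted(issns)
        for name, issns in mapping.items()
        if len(issns) > 1
    }
```

A direct lookup in `mapping` covers the exact-match case; names reported by `ambiguous_names` would need the extra processing mentioned above before they can be resolved.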
* stub tool: fuzzycat-issn to generate test data | Martin Czygan | 2020-08-12 | 1 | -0/+69
|   Currently:
|
|       fuzzycat-issn --make-pairs
|
|   will generate a TSV with (issn, a, b) examples, e.g.
|
|       ...
|       0011-9717    Detskaâ literatura.    Детская литература.
|       0011-9717    Detskaâ literatura.    Detskaâ literatura
|       0011-9717    Детская литература.    Detskaâ literatura
|       0011-6637    Darbininkas.    Darbininkas
|       0012-0820    deutsche Tabakbau    deutsche Tabakbau.
|       0011-5444    Daily Kent stater.    Daily Kent stater
|       ...
|
|   The idea is that these names by definition denote the same journal. We might even have a fixed lookup table, since some variants involve multiple scripts (and there are only around 2M names in total).
|
|   Currently 1992176 pairs can be generated.
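A minimal sketch of how such (issn, a, b) rows could be produced, assuming the input is a mapping from ISSN to its known name variants. The names `make_pairs` and `to_tsv` are hypothetical, and this is not necessarily how `fuzzycat-issn --make-pairs` is implemented; it only illustrates pairing every two distinct names that share an ISSN.

```python
import itertools


def make_pairs(issn_to_names):
    """Yield (issn, a, b) for every unordered pair of distinct names per ISSN."""
    for issn, names in issn_to_names.items():
        unique = list(dict.fromkeys(names))  # drop duplicates, keep input order
        for a, b in itertools.combinations(unique, 2):
            yield issn, a, b


def to_tsv(rows):
    """Serialize rows as tab-separated lines."""
    return "\n".join("\t".join(row) for row in rows)
```

With three name variants for one ISSN this yields three rows, and an ISSN with a single name yields none.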
* adjust formatting | Martin Czygan | 2020-08-12 | 2 | -2/+6
|
* fix imports | Martin Czygan | 2020-08-12 | 2 | -2/+2
|
* update README | Martin Czygan | 2020-08-12 | 1 | -2/+4
|
* yapf: reduce column limit | Martin Czygan | 2020-08-12 | 1 | -1/+1
|