<feed xmlns='http://www.w3.org/2005/Atom'>
<title>fuzzycat/tests, branch master</title>
<subtitle>Unnamed repository; edit this file 'description' to name the repository.
</subtitle>
<id>https://git.bnewbold.net/fuzzycat/atom?h=master</id>
<link rel='self' href='https://git.bnewbold.net/fuzzycat/atom?h=master'/>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fuzzycat/'/>
<updated>2021-12-21T19:56:56+00:00</updated>
<entry>
<title>apply first round of feedback on matching</title>
<updated>2021-12-21T19:56:56+00:00</updated>
<author>
<name>Martin Czygan</name>
<email>martin.czygan@gmail.com</email>
</author>
<published>2021-12-17T09:07:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fuzzycat/commit/?id=de9f1155ea57c812171abd5517ab39f4fe135cb3'/>
<id>urn:sha1:de9f1155ea57c812171abd5517ab39f4fe135cb3</id>
<content type='text'>
</content>
</entry>
<entry>
<title>matching: cleanup test files</title>
<updated>2021-12-06T18:59:51+00:00</updated>
<author>
<name>Martin Czygan</name>
<email>martin.czygan@gmail.com</email>
</author>
<published>2021-12-06T18:59:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fuzzycat/commit/?id=5bd8ee08a3e0f52893c1b7afa6bc4f062b7c062c'/>
<id>urn:sha1:5bd8ee08a3e0f52893c1b7afa6bc4f062b7c062c</id>
<content type='text'>
</content>
</entry>
<entry>
<title>complete FuzzyReleaseMatcher refactoring</title>
<updated>2021-12-06T18:53:30+00:00</updated>
<author>
<name>Martin Czygan</name>
<email>martin.czygan@gmail.com</email>
</author>
<published>2021-11-17T13:51:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fuzzycat/commit/?id=dd6149140542585f2b0bfc3b334ec2b0a88b790e'/>
<id>urn:sha1:dd6149140542585f2b0bfc3b334ec2b0a88b790e</id>
<content type='text'>
We keep the name, since the API, "matcher.match(release)", is the
same. Changes: simplified queries; at most one query is performed against
elasticsearch; parallel release retrieval from the API; optional support
for release year windows.

Test cases are expressed in yaml and will be auto-loaded from the
specified directory. Tests run against the current search endpoint,
which means the actual output may change on index updates; for the
moment, we think this setup is relatively simple and not too unstable.

    about: title contrib, partial name
    input: &gt;
      {
        "contribs": [
          {
            "raw_name": "Adams"
          }
        ],
        "title": "digital libraries",
        "ext_ids": {}
      }
    release_year_padding: 1
    expected:
      - 7rmvqtrb2jdyhcxxodihzzcugy
      - a2u6ougtsjcbvczou6sazsulcm
      - dy45vilej5diros6zmax46nm4e
      - exuwhhayird4fdjmmsiqpponlq
      - gqrj7jikezgcfpjfazhpf4e7c4
      - mkmqt3453relbpuyktnmsg6hjq
      - t2g5sl3dgzchtnq7dynxyzje44
      - t4tvenhrvzamraxrvvxivxmvga
      - wd3oeoi3bffknfbg2ymleqc4ja
      - y63a6dhrfnb7bltlxfynydbojy
</content>
</entry>
<entry>
<title>complete migration away from match_release_fuzzy</title>
<updated>2021-11-16T20:13:46+00:00</updated>
<author>
<name>Martin Czygan</name>
<email>martin.czygan@gmail.com</email>
</author>
<published>2021-11-16T20:13:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fuzzycat/commit/?id=d104f8d0ba8eef5563555de82be66bbf17f961db'/>
<id>urn:sha1:d104f8d0ba8eef5563555de82be66bbf17f961db</id>
<content type='text'>
Instead, use `FuzzyReleaseMatcher.match`, which has approximately the
same behavior.
</content>
</entry>
<entry>
<title>turn "match_release_fuzzy" into a class</title>
<updated>2021-11-16T17:58:42+00:00</updated>
<author>
<name>Martin Czygan</name>
<email>martin.czygan@gmail.com</email>
</author>
<published>2021-11-05T16:19:07+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fuzzycat/commit/?id=0c84af603894049dd8edd95da18d8990ab0516d1'/>
<id>urn:sha1:0c84af603894049dd8edd95da18d8990ab0516d1</id>
<content type='text'>
Goal of this refactoring was to make the matching process a bit more
configurable by using a class and a cascade of queries.

For a limited test set, `FuzzyReleaseMatcher.match` works the same as
`match_release_fuzzy`.
</content>
</entry>
<entry>
<title>use grobid_tei_xml for grobid unstructured lookups</title>
<updated>2021-10-28T21:00:49+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2021-10-28T21:00:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fuzzycat/commit/?id=2f41335d268b0e2705a1ebff0ff104e965630837'/>
<id>urn:sha1:2f41335d268b0e2705a1ebff0ff104e965630837</id>
<content type='text'>
</content>
</entry>
<entry>
<title>start larger refactoring: remove cluster</title>
<updated>2021-09-24T11:58:51+00:00</updated>
<author>
<name>Martin Czygan</name>
<email>martin.czygan@gmail.com</email>
</author>
<published>2021-09-24T11:58:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fuzzycat/commit/?id=478d7d06ad9e56145cb94f3461c355b1ba9eb491'/>
<id>urn:sha1:478d7d06ad9e56145cb94f3461c355b1ba9eb491</id>
<content type='text'>
background: verifying hundreds of millions of documents turned out to be
a bit slow; anecdata: running clustering and verification over 1.8B
inputs took over 50h; cf. the Go port (skate) required about 2-4h for
those operations. Also: with Go we do not need the extra GNU parallel
wrapping.

In any case, we aim for fuzzycat refactoring to provide:

* better, more configurable verification and small scale matching
* removal of batch clustering code (and improved refcat docs)
* a place for a bit more generic, similarity based utils

The most important piece in fuzzycat is a CSV file containing hand
picked test examples for verification - and the code that is able to
fulfill that test suite. We want to make this part more robust.
</content>
</entry>
<entry>
<title>tests: temporarily disable tests</title>
<updated>2021-09-21T14:36:55+00:00</updated>
<author>
<name>Martin Czygan</name>
<email>martin.czygan@gmail.com</email>
</author>
<published>2021-09-21T14:36:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fuzzycat/commit/?id=5fa61d89320af880d5bf6b3231f6478887cfb6a6'/>
<id>urn:sha1:5fa61d89320af880d5bf6b3231f6478887cfb6a6</id>
<content type='text'>
We want to first move to elasticsearch dsl and will reactivate and
extend the tests after refactoring.
</content>
</entry>
<entry>
<title>matching: run an additional es query for fuzzy matching</title>
<updated>2021-09-21T13:55:52+00:00</updated>
<author>
<name>Martin Czygan</name>
<email>martin.czygan@gmail.com</email>
</author>
<published>2021-09-21T13:55:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fuzzycat/commit/?id=dccbaa5c1b0ba556449de6024540ba05d67ef6a0'/>
<id>urn:sha1:dccbaa5c1b0ba556449de6024540ba05d67ef6a0</id>
<content type='text'>
</content>
</entry>
<entry>
<title>style: apply formatting</title>
<updated>2021-09-21T13:54:46+00:00</updated>
<author>
<name>Martin Czygan</name>
<email>martin.czygan@gmail.com</email>
</author>
<published>2021-09-21T13:54:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fuzzycat/commit/?id=08a9242e2ed19aaec14d92fe174bee21bb4232eb'/>
<id>urn:sha1:08a9242e2ed19aaec14d92fe174bee21bb4232eb</id>
<content type='text'>
</content>
</entry>
</feed>
