# Fuzzy matching review and retrospective
> 2021-09-15
After [refcat](https://gitlab.com/internetarchive/refcat) has reached a
milestone, I'd like to review fuzzycat and fuzzy matching in general; this
should help pave the way to a slight redesign of the overall approach.
## TL;DR
* performance matters at scale and a faster language (e.g. Go) is essential
* for small scale, the api matters more than performance
* a lot of the code is currently based on specific schemas (e.g. release, a
specific elasticsearch mapping, etc), so not that much code is generic or
reusable - it also seems overkill to try to abstract the schema away
## Ideas
* [ ] use pydantic or dataclasses to make the schema more explicit (see the sketch after this list)
* [ ] extend type annotation coverage
* [ ] remove bulk stuff, remove clustering etc; improve the verification part
* [ ] use cases: work merging
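A minimal sketch of the dataclass idea; the field names here are illustrative only, not the actual release schema:

```python
# Sketch only: make required vs. optional fields explicit in one place,
# instead of ad-hoc dict.get calls scattered across the matching code.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReleaseDoc:
    ident: str
    title: Optional[str] = None
    container_name: Optional[str] = None
    release_year: Optional[int] = None
    contrib_names: List[str] = field(default_factory=list)

    @classmethod
    def from_blob(cls, blob: dict) -> "ReleaseDoc":
        return cls(
            ident=blob["ident"],
            title=blob.get("title"),
            container_name=blob.get("container_name"),
            release_year=blob.get("release_year"),
            contrib_names=[c.get("raw_name", "") for c in blob.get("contribs", [])],
        )
```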
A few more things to revisit:
* [ ] revisit journal matching; what counts as a weak or a strong match?
* [ ] refactor: author list or string comparison (see sketch below)
* [ ] go beyond title matching when querying elasticsearch
* [ ] better container name matching
Take a look at:
> https://github.com/djudd/human-name
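Related to the author list comparison item above, a rough sketch of how a small, self-contained helper could look; the normalization and threshold are illustrative only, not the current fuzzycat logic:

```python
# Sketch only: compare two author name lists via last-token overlap.
def authors_match(a, b, threshold=0.5):
    """Return True if the two author name lists look compatible."""
    def last_names(names):
        return {n.strip().lower().split()[-1] for n in names if n.strip()}
    la, lb = last_names(a), last_names(b)
    if not la or not lb:
        return False
    overlap = len(la & lb) / min(len(la), len(lb))
    return overlap >= threshold
```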
----
## Redesign ideas
### Large scale processing
JSON decoding and encoding does not seem to be the bottleneck, but working with
various, often optional fields gets expensive in Python (whereas in Go, we can
use a struct).
The mapping stage in
[refcat/skate](https://gitlab.com/internetarchive/refcat/-/blob/3a79551dfe54ba668f7eee9de88625a0d33d9c7f/skate/map.go#L109-111)
is a simple operation (blob to fields) that can be implemented in isolation
and then [added to the command
line](https://gitlab.com/internetarchive/refcat/-/blob/3a79551dfe54ba668f7eee9de88625a0d33d9c7f/skate/cmd/skate-map/main.go#L67-87).
In skate, we already have over a dozen mappers working on various types.
There's even a bit of [map
middleware](https://gitlab.com/internetarchive/refcat/-/blob/3a79551dfe54ba668f7eee9de88625a0d33d9c7f/skate/map.go#L152-161).
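On the Python side, a mapper with minimal scope could be as small as the following sketch; the normalization is illustrative, not the skate "tn" mapper:

```python
# Sketch only: map one JSON blob to (key, original line), nothing else.
import json
import re


def title_normalized_key(line: str):
    doc = json.loads(line)
    title = doc.get("title") or ""
    key = re.sub(r"[^a-z0-9]", "", title.lower())
    return key, line
```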
In fuzzycat, the
[Cluster](https://git.archive.org/webgroup/fuzzycat/-/blob/c587a084defe54103aa147b7ab91542a11a548b1/fuzzycat/cluster.py#L309-347)
class does mapping (via a
[key](https://git.archive.org/webgroup/fuzzycat/-/blob/c587a084defe54103aa147b7ab91542a11a548b1/fuzzycat/cluster.py#L331) function),
[sorting](https://git.archive.org/webgroup/fuzzycat/-/blob/c587a084defe54103aa147b7ab91542a11a548b1/fuzzycat/cluster.py#L406-426),
and a specific
[grouping](https://git.archive.org/webgroup/fuzzycat/-/blob/c587a084defe54103aa147b7ab91542a11a548b1/fuzzycat/cluster.py#L428-454)
all in one go.
For example, we no longer use the single cluster document in refcat/skate;
there, we keep two separate files and use an extra
[zipkey](https://gitlab.com/internetarchive/refcat/-/blob/3a79551dfe54ba668f7eee9de88625a0d33d9c7f/skate/zipkey/zipkey.go#L23-33)
type, which is a slightly generalized
[comm](https://en.wikipedia.org/wiki/Comm): it allows running a function
over a cluster of documents (currently coming from two streams).
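In Python, a zipkey/comm-style iteration over two key-sorted streams could be sketched like this (a simplification of the skate type, which also keeps track of which stream each line came from):

```python
# Sketch only: call on_group(key, lines) once per distinct key across
# two key-sorted line streams.
import heapq
import itertools


def zip_key(stream_a, stream_b, key, on_group):
    merged = heapq.merge(stream_a, stream_b, key=key)
    for k, group in itertools.groupby(merged, key=key):
        on_group(k, list(group))
```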
A higher-level command could encapsulate the whole pipeline, without needing an extra framework like luigi:

```
inputs      A     B
            |     |
mapped      M1    M2
            |     |
sorted      S1    S2
             \   /
              \ /
reduced        V
               |
               |
               C
```

> Not sure if we need mappers here at all, if we already have them in refcat.
A command could look like this:

```
$ fuzzycat pipeline -a A.json -b B.json --mapper-a "tn" --mapper-b "tn" --reduce "bref"
```

It would be nice if this actually ran fast. It could also be run programmatically:

```
output = fuzzycat_pipeline(a="A.json", b="B.json", mapper_a="tn", mapper_b="tn", reduce="bref")
```

Mappers should have a minimal scope; each mapper will have a format it can work
on. Reducers will have their two input types specified.
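Internally, such a pipeline helper might do little more than map, sort and reduce. A sketch, under a few assumptions: mapper and reducer are passed as callables (the CLI would look them up by name), and inputs are one JSON document per line:

```python
# Sketch only: map both inputs to "key<TAB>doc" lines, sort them with
# GNU sort, then hand the two sorted files to a reducer.
import subprocess
import tempfile


def fuzzycat_pipeline(a, b, mapper_a, mapper_b, reducer):
    def map_and_sort(path, mapper):
        tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".tsv", delete=False)
        with open(path) as f, tmp:
            for line in f:
                key, doc = mapper(line)
                tmp.write("{}\t{}\n".format(key, doc.rstrip("\n")))
        out = tmp.name + ".sorted"
        subprocess.run(["sort", "-k", "1,1", "-o", out, tmp.name], check=True)
        return out

    sorted_a = map_and_sort(a, mapper_a)
    sorted_b = map_and_sort(b, mapper_b)
    with open(sorted_a) as fa, open(sorted_b) as fb:
        return reducer(fa, fb)
```

A "bref" style reducer could then be built on top of a zipkey-style group iterator like the one sketched above.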
### Running continuously
> With a couple of the inputs (metadata, extracted data, ...) getting updated
> all the time, it might for the moment be simpler to rerun the derivation in
> batch mode.
A list of steps we would need to implement for continuous reference index updates:
* a new metadata document arrives (e.g. via "changelog")
* if the metadata contains outbound references, nice; if not, we try to download the associated PDF, run grobid and get the references out that way
At this point, we have the easier part - outbound references - covered.
Where do the outbound references of all existing docs live? In the database
only, hence we cannot search for them currently.
* [7ppmkfo5krb2zhefhwkwdp4mqe](https://search.fatcat.wiki/fatcat_release/_search?q=ident:7ppmkfo5krb2zhefhwkwdp4mqe)
says `ref_count` 12, but the list of refs can only be retrieved via the
[api](https://api.fatcat.wiki/v0/release/7ppmkfo5krb2zhefhwkwdp4mqe)
We could add another elasticsearch index just for the raw refs. E.g. every
time an item is updated, this index gets updated as well (taking refs from the
API and putting them into ES). We can then query for any ID we find in the
references, or for any string match, etc. Once we find e.g. ten documents that
have the document in question in their reference list, we can update the
reference index for each of these documents.
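A sketch of what a lookup against such a raw refs index could look like; the index name `fatcat_refs_raw` and the field names are made up for illustration:

```python
# Sketch only: find release idents whose raw references mention a DOI.
import requests


def find_citing_idents(doi, es="https://search.fatcat.wiki"):
    query = {
        "query": {"term": {"refs.doi": doi}},
        "size": 50,
        "_source": ["ident"],
    }
    resp = requests.get("{}/fatcat_refs_raw/_search".format(es), json=query)
    resp.raise_for_status()
    return [hit["_source"]["ident"] for hit in resp.json()["hits"]["hits"]]
```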
We could keep a (weekly) refs snapshot file around that would be used for
matching. The result would be e.g. the ten documents that refer to the document
in question. We can take their ids and update the document to establish the
link. The on-disk file (or files) should all be prepared, e.g. sorted by key,
so the lookup will be fast.
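With a key-sorted snapshot file (assumed layout: one "key&lt;TAB&gt;json" record per line), lookups can stay on disk; a sketch of a binary search over byte offsets:

```python
# Sketch only: binary search a key-sorted "key<TAB>json" file by byte
# offset, then scan forward to collect all records for that key.
import json


def lookup(path, key):
    results = []
    with open(path, "rb") as f:
        f.seek(0, 2)
        lo, hi = 0, f.tell()
        needle = key.encode("utf-8")
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid)
            f.readline()  # skip the (possibly partial) line we landed in
            line = f.readline()
            if not line or line.split(b"\t", 1)[0] >= needle:
                hi = mid
            else:
                lo = mid + 1
        f.seek(lo)
        if lo > 0:
            f.readline()  # realign to the next full line
        for line in f:
            k, _, doc = line.partition(b"\t")
            if k < needle:
                continue
            if k > needle:
                break
            results.append(json.loads(doc))
    return results
```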