proposals/2021-11-17_entity_mergers.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110


status: implemented

Entity Mergers
===============

One category of type of catalog metadata cleanup is merging multiple duplicate
entries into a single record. The fatcat catalog allows this via during the
duplicate entities into "redirect records" which point at the single merged
record.

This proposal briefly describes the process for doing bulk merges.


## External Identifier Duplicates

The easiest category of entity duplicates to discover is cases where multiple
entities have the same external (persistent) identifier. For example, releases
with the exact same DOI, containers with the same ISSN-L, or creators with the
same ORCiD. Files with the same SHA-1 hash is a similar issue. The catalog does
not block the creation of such entities, though it is assumed that editors and
bots will do their best to prevent creating duplicates, and that this is
checked and monitored via review bots (auto-annotation) and bulk quality
checks.

In these cases, it is simple enough to use the external identifier dumps (part
of the fatcat bulk exports), find duplicates by identifier, and create merge
requests.


## Merge Requests JSON Schema

Proposed JSON schema for bulk entity merging:

    entity_type: str, required. eg: "file"
    primary_id: str, optional, entity ident
    duplicate_ids: [str], required, entity idents
    evidence: dict, optional, merger/entity specific
        # evidence fields for external identifier dupes
        extid: str, the identifier value
        extid_type: str, eg "doi" or "sha1"

The merge request generation process might indicate which of the entities
should end up as the "primary", or it might leave that determination to the
merger itself. `primary_id` should not be set arbitrarily or randomly if there
is not a good reason for a specific entity to be the "primary" which others
redirect to.

The `primary_id` should not be included in `duplicate_ids`, but the merger code
will remove it if included accidentally.

The `evidence` fields are flexible. By default they will all be included as
top-level "edit extra" metadata on each individual entity redirected, but not
on the primary entity (if it gets updated).


## Merge Process and Semantics

The assumption is that all the entities indicated in `duplicate_ids` will be
redirected to the `primary_id`. Any metadata included in the duplicates which
is not included in the primary will be copied in to the primary, but existing
primary metadata fields will not be "clobbered" (overwritten) by duplicate
metadata. This includes top-level fields of the `extra` metadata dict, if
appropriate. If there is no unique metadata in the redirected entities, the
primary does not need to be updated and will not be.


## Work/Release Grouping and Merging

Work and Release entities are something of a special case.

Merging two release entities will result in all artifact entities (files,
filesets, webcaptures) being updated which previously pointed at the duplicate
entity to point to the primary entity. If the work entities associated with the
duplicate releases have no other releases associated with them, they also will
be merged (redirected) to the primary release's work entity.

"Grouping" releases is the same as merging their works. In this situation, the
number of distinct release entities stays the same, but the duplicates are
updated to be under the same work as the primary. This is initially implemented
by merging the work entities, and then updating *all* the releases under each
merged work towards the primary work identifier. No artifact entities need to
be updated in this scenario.

A currently planned option would be to pull a single release out of a group of
releases under a work, and point it to a new work. This would be a form of
"regrouping". For now this can only be achieved by updating the release
entities individually, not in a bulk/automated manner.


## Container Merging

Because many releases point to containers, it is not practical to update all
the releases at the same time as merging the containers. In the long run it is
good for the health of the catalog to have all the releases updated to point at
the the primary container, but these updates can be delayed.

To keep statistics and functionality working before release updates happen,
downstream users of release entities should "expand" container sub-entities and
use the "redirect" ident of the container entity instead of "ident", if the
"redirect" is set. For example, when linking in web interfaces, or when doing a
schema transform in the fatcat and scholar.archive.org search index.


## Background Reading

"The Lens MetaRecord and LensID: An open identifier system for aggregated
metadata and versioning of knowledge artefacts"
https://osf.io/preprints/lissa/t56yh/