diff --git a/proposals/2021-11-17_entity_mergers.md b/proposals/2021-11-17_entity_mergers.md
new file mode 100644
index 00000000..6c41ca58
--- /dev/null
+++ b/proposals/2021-11-17_entity_mergers.md
@@ -0,0 +1,110 @@
+
+status: implemented
+
+Entity Mergers
+===============
+
+One category of catalog metadata cleanup is merging multiple duplicate
+entries into a single record. The fatcat catalog supports this by turning the
+duplicate entities into "redirect records" which point at the single merged
+record.
+
+This proposal briefly describes the process for doing bulk merges.
+
+
+## External Identifier Duplicates
+
+The easiest category of entity duplicates to discover is cases where multiple
+entities have the same external (persistent) identifier. For example, releases
+with the exact same DOI, containers with the same ISSN-L, or creators with the
+same ORCiD. Files with the same SHA-1 hash are a similar case. The catalog does
+not block the creation of such entities, though it is assumed that editors and
+bots will do their best to prevent creating duplicates, and that this is
+checked and monitored via review bots (auto-annotation) and bulk quality
+checks.
+
+In these cases, it is simple enough to use the external identifier dumps (part
+of the fatcat bulk exports), find duplicates by identifier, and create merge
+requests.
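+
+As an illustration, a minimal sketch (not actual fatcat tooling) which
+generates merge requests in the schema described below, assuming a simple
+two-column TSV of external identifier and entity ident (the real bulk export
+format may differ):
+
+    import json
+    from collections import defaultdict
+
+    def generate_merge_requests(tsv_path, entity_type, extid_type):
+        """Group entity idents by external identifier and print one merge
+        request (as a JSON line) per identifier with duplicates."""
+        by_extid = defaultdict(list)
+        with open(tsv_path) as f:
+            for line in f:
+                extid, ident = line.rstrip("\n").split("\t")[:2]
+                by_extid[extid].append(ident)
+        for extid, idents in by_extid.items():
+            if len(idents) < 2:
+                continue
+            # no primary_id: leave that determination to the merger itself
+            print(json.dumps({
+                "entity_type": entity_type,
+                "duplicate_ids": idents,
+                "evidence": {"extid": extid, "extid_type": extid_type},
+            }))
+
+    # eg: generate_merge_requests("release_extid_dump.tsv", "release", "doi")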
+
+
+## Merge Request JSON Schema
+
+Proposed JSON schema for bulk entity merging:
+
+ entity_type: str, required. eg: "file"
+ primary_id: str, optional, entity ident
+ duplicate_ids: [str], required, entity idents
+ evidence: dict, optional, merger/entity specific
+ # evidence fields for external identifier dupes
+ extid: str, the identifier value
+ extid_type: str, eg "doi" or "sha1"
+
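+For example, a single merge request for two files sharing a SHA-1 hash might
+look like this (the idents and hash below are made up for illustration):
+
+    {
+        "entity_type": "file",
+        "primary_id": "grgkibbfdnhjxbio5sgte77jym",
+        "duplicate_ids": ["ymldjslvbbgvfnbzaaaogmchli"],
+        "evidence": {
+            "extid": "3f786850e387550fdab836ed7e6dc881de23001b",
+            "extid_type": "sha1"
+        }
+    }
+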
+The merge request generation process might indicate which of the entities
+should end up as the "primary", or it might leave that determination to the
+merger itself. `primary_id` should not be set arbitrarily or randomly; it
+should only be set when there is a good reason for a specific entity to be the
+"primary" which the others redirect to.
+
+The `primary_id` should not be included in `duplicate_ids`, but the merger code
+will remove it if included accidentally.
+
+The `evidence` fields are flexible. By default they will all be included as
+top-level "edit extra" metadata on each individual entity redirected, but not
+on the primary entity (if it gets updated).
+
+
+## Merge Process and Semantics
+
+The assumption is that all the entities indicated in `duplicate_ids` will be
+redirected to the `primary_id`. Any metadata included in the duplicates which
+is not included in the primary will be copied into the primary, but existing
+primary metadata fields will not be "clobbered" (overwritten) by duplicate
+metadata. This includes top-level fields of the `extra` metadata dict, if
+appropriate. If there is no unique metadata in the redirected entities, the
+primary does not need to be updated and will not be.
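+
+As a minimal sketch of the non-clobbering copy, treating entities as plain
+dicts (the actual merger code may differ in detail):
+
+    def copy_missing_fields(primary: dict, duplicate: dict) -> bool:
+        """Copy metadata from the duplicate into the primary only where the
+        primary has no existing value; returns True if the primary changed."""
+        modified = False
+        for key, value in duplicate.items():
+            if value is None or key == "extra":
+                continue
+            if primary.get(key) is None:
+                primary[key] = value
+                modified = True
+        # top-level 'extra' keys get the same non-clobbering treatment
+        extra = primary.get("extra") or {}
+        for k, v in (duplicate.get("extra") or {}).items():
+            if k not in extra:
+                extra[k] = v
+                primary["extra"] = extra
+                modified = True
+        return modified
+
+If this check finds no unique metadata in any duplicate, the primary entity is
+left untouched.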
+
+
+## Work/Release Grouping and Merging
+
+Work and Release entities are something of a special case.
+
+Merging two release entities will result in all artifact entities (files,
+filesets, webcaptures) which previously pointed at the duplicate entity being
+updated to point at the primary entity. If the work entities associated with
+the duplicate releases have no other releases associated with them, they will
+also be merged (redirected) to the primary release's work entity.
+
+"Grouping" releases is the same as merging their works. In this situation, the
+number of distinct release entities stays the same, but the duplicates are
+updated to be under the same work as the primary. This is initially implemented
+by merging the work entities, and then updating *all* the releases under each
+merged work to point at the primary work identifier. No artifact entities
+need to be updated in this scenario.
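+
+A rough in-memory sketch of those grouping semantics (plain dicts standing in
+for release entities; not the actual merger code):
+
+    def group_releases(releases: list, primary_id: str, duplicate_ids: list) -> set:
+        """Re-point the duplicates' works at the primary release's work,
+        returning the set of now-redundant work idents to redirect."""
+        by_id = {r["ident"]: r for r in releases}
+        primary_work = by_id[primary_id]["work_id"]
+        duplicate_works = {
+            by_id[i]["work_id"] for i in duplicate_ids
+            if by_id[i]["work_id"] != primary_work
+        }
+        # *all* releases under each duplicate work move to the primary work,
+        # not just the releases named in the merge request
+        for release in releases:
+            if release["work_id"] in duplicate_works:
+                release["work_id"] = primary_work
+        return duplicate_works  # these works get redirected to the primary work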
+
+A currently planned option would be to pull a single release out of a group of
+releases under a work, and point it to a new work. This would be a form of
+"regrouping". For now this can only be achieved by updating the release
+entities individually, not in a bulk/automated manner.
+
+
+## Container Merging
+
+Because many releases point to containers, it is not practical to update all
+the releases at the same time as merging the containers. In the long run it is
+good for the health of the catalog to have all the releases updated to point at
+the primary container, but these updates can be delayed.
+
+To keep statistics and functionality working before release updates happen,
+downstream users of release entities should "expand" container sub-entities and
+use the "redirect" ident of the container entity instead of "ident", if the
+"redirect" is set. For example, when linking in web interfaces, or when doing a
+schema transform for the fatcat and scholar.archive.org search indexes.
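+
+A minimal sketch of that logic, assuming the expanded container sub-entity is
+available as a dict with `ident` and (optional) `redirect` fields:
+
+    def effective_container_ident(release: dict):
+        """Return the container ident downstream code should use, preferring
+        the redirect target if the container has been merged."""
+        container = release.get("container") or {}
+        return container.get("redirect") or container.get("ident")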
+
+
+## Background Reading
+
+"The Lens MetaRecord and LensID: An open identifier system for aggregated
+metadata and versioning of knowledge artefacts"
+https://osf.io/preprints/lissa/t56yh/
+