status: implemented

Entity Mergers
==============

One category of catalog metadata cleanup is merging multiple duplicate
entries into a single record. The fatcat catalog allows this by turning the
duplicate entities into "redirect records" which point at the single merged
record.

This proposal briefly describes the process for doing bulk merges.


## External Identifier Duplicates

The easiest category of entity duplicates to discover is cases where multiple
entities have the same external (persistent) identifier: for example, releases
with the exact same DOI, containers with the same ISSN-L, or creators with the
same ORCID. Files with the same SHA-1 hash are a similar issue. The catalog
does not block the creation of such entities, though it is assumed that
editors and bots will do their best to prevent creating duplicates, and that
this is checked and monitored via review bots (auto-annotation) and bulk
quality checks.

In these cases, it is simple enough to use the external identifier dumps (part
of the fatcat bulk exports), find duplicates by identifier, and create merge
requests.


## Merge Requests JSON Schema

Proposed JSON schema for bulk entity merging:

    entity_type: str, required. eg: "file"
    primary_id: str, optional, entity ident
    duplicate_ids: [str], required, entity idents
    evidence: dict, optional, merger/entity specific
        # evidence fields for external identifier dupes
        extid: str, the identifier value
        extid_type: str, eg "doi" or "sha1"

The merge request generation process might indicate which of the entities
should end up as the "primary", or it might leave that determination to the
merger itself. `primary_id` should not be set arbitrarily or randomly; it
should only be set when there is a good reason for a specific entity to be
the "primary" which the others redirect to.

The `primary_id` should not be included in `duplicate_ids`, but the merger
code will remove it if included accidentally.

The `evidence` fields are flexible. By default they will all be included as
top-level "edit extra" metadata on each individual entity redirected, but not
on the primary entity (if it gets updated).


## Merge Process and Semantics

The assumption is that all the entities indicated in `duplicate_ids` will be
redirected to the `primary_id`. Any metadata included in the duplicates which
is not included in the primary will be copied into the primary, but existing
primary metadata fields will not be "clobbered" (overwritten) by duplicate
metadata. This includes top-level fields of the `extra` metadata dict, if
appropriate. If there is no unique metadata in the redirected entities, the
primary does not need to be updated and will not be.
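
To make these semantics concrete, here is a minimal sketch in Python. It
assumes entities are represented as plain dicts; the `merge_fields` helper is
hypothetical, not the actual fatcat merger implementation, and the idents and
SHA-1 value are placeholders:

    import json

    # A merge request following the proposed schema (placeholder idents
    # and hash, for illustration only)
    request = {
        "entity_type": "file",
        "primary_id": "hsmo6p4smrganpb3fndaj2lon4",
        "duplicate_ids": ["dcjsybyzdbg75burkzrlkdkore"],
        "evidence": {
            "extid": "3f242a192acc258bdfc851c6c0691027bd8c975c",
            "extid_type": "sha1",
        },
    }
    print(json.dumps(request))

    def merge_fields(primary: dict, duplicate: dict) -> dict:
        """Copy fields from a duplicate into the primary, without clobbering.

        Fields which are unset on the primary get filled in from the
        duplicate; existing primary values are always kept.
        """
        merged = dict(primary)
        for key, value in duplicate.items():
            if key == "extra":
                continue
            if merged.get(key) is None:
                merged[key] = value
        # top-level keys of the `extra` dict get the same non-clobber
        # treatment: duplicate keys are copied in, but primary keys win
        extra = dict(duplicate.get("extra") or {})
        extra.update(primary.get("extra") or {})
        if extra:
            merged["extra"] = extra
        return merged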
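Which top-level `extra` fields are actually safe to combine this way is
merger- and entity-specific, so a real merger would likely be more
conservative than this sketch; per the schema above, the `evidence` fields
would also be recorded as "edit extra" metadata on each redirected entity.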

## Work/Release Grouping and Merging

Work and Release entities are something of a special case.

Merging two release entities will result in all artifact entities (files,
filesets, webcaptures) which previously pointed at the duplicate entity being
updated to point at the primary entity. If the work entities associated with
the duplicate releases have no other releases associated with them, they will
also be merged (redirected) to the primary release's work entity.

"Grouping" releases is the same as merging their works. In this situation, the
number of distinct release entities stays the same, but the duplicates are
updated to be under the same work as the primary. This is initially implemented
by merging the work entities, and then updating *all* the releases under each
merged work to point at the primary work identifier. No artifact entities need
to be updated in this scenario.

A planned future option would be to pull a single release out of a group of
releases under a work and point it at a new work. This would be a form of
"regrouping". For now this can only be achieved by updating the release
entities individually, not in a bulk/automated manner.


## Container Merging

Because many releases point to containers, it is not practical to update all
the releases at the same time as merging the containers. In the long run it is
good for the health of the catalog to have all the releases updated to point
at the primary container, but these updates can be delayed.

To keep statistics and functionality working before release updates happen,
downstream users of release entities should "expand" container sub-entities
and use the "redirect" ident of the container entity instead of "ident", if
the "redirect" is set: for example, when linking in web interfaces, or when
doing a schema transform for the fatcat and scholar.archive.org search
indexes.


## Background Reading

"The Lens MetaRecord and LensID: An open identifier system for aggregated
metadata and versioning of knowledge artefacts"
https://osf.io/preprints/lissa/t56yh/