From 4397bf37b8aabf4df6139f56681ce112da669990 Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Thu, 28 Oct 2021 12:52:33 +0200
Subject: data quality: dataset spam issue

---
 notes/data_issues.md | 30 ++++++++++++++++++++++++------
 1 file changed, 24 insertions(+), 6 deletions(-)

(limited to 'notes')

diff --git a/notes/data_issues.md b/notes/data_issues.md
index 87a91b9..e450c8d 100644
--- a/notes/data_issues.md
+++ b/notes/data_issues.md
@@ -1,9 +1,25 @@
 # Data issues specifically in Citation Graph
 
+## Occurence Download
+
+* date: 2021-10-28
+* status: open
+* example: https://fatcat.wiki/release/ns4v2jvhgbhh7mbg45bjtpzway/refs-in
+
+Symptom: Many datasets pointing to a publication; e.g. all having "Occurrence Download" as title
+
+Possible mitigation:
+
+* [ ] extract all titles from fatcat
+* [ ] find most common titles, decide if it should be blacklisted for citation graph
+* [ ] keep blacklist of release ident to ignore in edges
+* [ ] filter refcat, remove edges with blacklisted id as source (and target)
+
 ## Repeated entries
 
-* 2020-04-19
-* https://qa.fatcat.wiki/release/lcarb5rg5vf3tk4hpvosja5sm4/outbound-refs
+* date: 2021-04-19
+* status: solved
+* example: https://fatcat.wiki/release/lcarb5rg5vf3tk4hpvosja5sm4/refs-out
 
 A DOI seems to be using the key, which leads to repeated entries.
 
@@ -13,8 +29,9 @@ A DOI seems to be using the key, which leads to repeated entries.
 
 ## Self references
 
-* 2020-04-19
-* https://qa.fatcat.wiki/release/3fcp4pk7nfamvkbjekqam24bfq/outbound-refs
+* date: 2021-04-19
+* status: solved
+* example: https://fatcat.wiki/release/3fcp4pk7nfamvkbjekqam24bfq/refs-out
 
 The source and target seem to be the same.
 
@@ -22,8 +39,9 @@ The source and target seem to be the same.
 
 ## Duplicated Edges
 
-* 2020-04-20
-* https://qa.fatcat.wiki/release/22222736evcc7kdn3bleua3fge/outbound-refs
+* date: 2021-04-20
+* status: solved
+* example: https://fatcat.wiki/release/22222736evcc7kdn3bleua3fge/refs-out
 * found 16/1M
 
 Source and target are the same, maybe DOI with ref key?
 
-- 
cgit v1.2.3
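
The mitigation checklist for the "Occurence Download" issue in the patch above (count titles, pick the ones to blacklist, collect their release idents, filter edges) could be sketched roughly as follows. This is only an illustration under assumptions: releases are taken to be dicts with `ident` and `title` keys, edges to be dicts with `source` and `target` keys, and the helper names `most_common_titles`, `blocked_idents` and `filter_edges` are hypothetical. None of this is the actual fatcat or refcat schema or code.

```python
from collections import Counter


def normalize(title):
    """Lowercase and trim a title so trivially different spellings match."""
    return (title or "").strip().lower()


def most_common_titles(releases, n=10):
    """Return the n most frequent normalized titles with their counts."""
    counts = Counter(
        normalize(r.get("title")) for r in releases if r.get("title")
    )
    return counts.most_common(n)


def blocked_idents(releases, blocked_titles):
    """Collect idents of releases whose normalized title is blocked."""
    blocked = {normalize(t) for t in blocked_titles}
    return {r["ident"] for r in releases if normalize(r.get("title")) in blocked}


def filter_edges(edges, blocked):
    """Keep only edges where neither source nor target ident is blocked."""
    return [
        e for e in edges
        if e["source"] not in blocked and e["target"] not in blocked
    ]


# Made-up sample data; the idents are placeholders, not real fatcat idents.
sample_releases = [
    {"ident": "a", "title": "Occurrence Download"},
    {"ident": "b", "title": "occurrence download "},
    {"ident": "c", "title": "Some paper"},
    {"ident": "d", "title": None},
]
sample_edges = [
    {"source": "a", "target": "c"},
    {"source": "c", "target": "d"},
    {"source": "d", "target": "b"},
]
```

Deciding which of the frequent titles to blacklist stays a manual step in this sketch: `most_common_titles` only produces candidates for review, and the reviewed list is then passed to `blocked_idents`.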