author: Martin Czygan <martin.czygan@gmail.com>, 2021-10-28 12:52:33 +0200
committer: Martin Czygan <martin.czygan@gmail.com>, 2021-10-28 12:52:33 +0200
commit: 4397bf37b8aabf4df6139f56681ce112da669990
tree: acd32d7637c046553bf0d711c1d070a30b0a7a2e /notes
parent: 7382b23e2d184c76e51be4245f8f4732e0c7679c
data quality: dataset spam issue
Diffstat (limited to 'notes')
-rw-r--r-- notes/data_issues.md 30
1 files changed, 24 insertions, 6 deletions
diff --git a/notes/data_issues.md b/notes/data_issues.md
index 87a91b9..e450c8d 100644
--- a/notes/data_issues.md
+++ b/notes/data_issues.md
@@ -1,9 +1,25 @@
# Data issues specifically in Citation Graph
+## Occurrence Download
+
+* date: 2021-10-28
+* status: open
+* example: https://fatcat.wiki/release/ns4v2jvhgbhh7mbg45bjtpzway/refs-in
+
+Symptom: Many datasets point to a single publication, e.g. all of them have "Occurrence Download" as title
+
+Possible mitigation:
+
+* [ ] extract all titles from fatcat
+* [ ] find most common titles, decide if it should be blacklisted for citation graph
+* [ ] keep blacklist of release ident to ignore in edges
+* [ ] filter refcat, remove edges with blacklisted id as source (and target)
+
## Repeated entries
-* 2020-04-19
-* https://qa.fatcat.wiki/release/lcarb5rg5vf3tk4hpvosja5sm4/outbound-refs
+* date: 2021-04-19
+* status: solved
+* example: https://fatcat.wiki/release/lcarb5rg5vf3tk4hpvosja5sm4/refs-out
A DOI seems to be using the key, which leads to repeated entries.
@@ -13,8 +29,9 @@ A DOI seems to be using the key, which leads to repeated entries.
## Self references
-* 2020-04-19
-* https://qa.fatcat.wiki/release/3fcp4pk7nfamvkbjekqam24bfq/outbound-refs
+* date: 2021-04-19
+* status: solved
+* example: https://fatcat.wiki/release/3fcp4pk7nfamvkbjekqam24bfq/refs-out
The source and target seem to be the same.
@@ -22,8 +39,9 @@ The source and target seem to be the same.
## Duplicated Edges
-* 2020-04-20
-* https://qa.fatcat.wiki/release/22222736evcc7kdn3bleua3fge/outbound-refs
+* date: 2021-04-20
+* status: solved
+* example: https://fatcat.wiki/release/22222736evcc7kdn3bleua3fge/refs-out
* found 16/1M
Source and target are the same, maybe DOI with ref key?
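
The mitigation steps listed under "Occurrence Download" (collect titles, find the most common ones, keep a blacklist of release idents, filter edges) could be sketched roughly as follows. This is only a minimal illustration, not the refcat implementation: the record fields `ident`, `title`, `source` and `target`, and all function names, are assumptions made for this example, and the real schema may differ.

```python
from collections import Counter


def normalize(title):
    # Hypothetical normalization: trim whitespace and lowercase.
    return (title or "").strip().lower()


def most_common_titles(releases, n=10):
    """Count normalized titles and return the n most frequent ones."""
    counts = Counter(normalize(r.get("title")) for r in releases if r.get("title"))
    return counts.most_common(n)


def blacklisted_idents(releases, blacklisted_titles):
    """Collect release idents whose title is on the title blacklist."""
    titles = {normalize(t) for t in blacklisted_titles}
    return {r["ident"] for r in releases if normalize(r.get("title")) in titles}


def filter_edges(edges, blacklist):
    """Drop edges whose source or target is a blacklisted release ident."""
    return [
        e for e in edges
        if e["source"] not in blacklist and e["target"] not in blacklist
    ]
```

A possible usage, under the same assumptions: run `most_common_titles` over all extracted titles, review the top entries by hand, pass the rejected titles to `blacklisted_idents`, and apply `filter_edges` to the citation edges. Whether to filter on the target as well as the source is left open in the notes; the sketch filters on both.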