author | Martin Czygan <martin.czygan@gmail.com> | 2021-10-28 12:52:33 +0200 |
---|---|---|
committer | Martin Czygan <martin.czygan@gmail.com> | 2021-10-28 12:52:33 +0200 |
commit | 4397bf37b8aabf4df6139f56681ce112da669990 | |
tree | acd32d7637c046553bf0d711c1d070a30b0a7a2e /notes | |
parent | 7382b23e2d184c76e51be4245f8f4732e0c7679c | |
data quality: dataset spam issue
Diffstat (limited to 'notes')
-rw-r--r-- | notes/data_issues.md | 30 |
1 files changed, 24 insertions, 6 deletions
diff --git a/notes/data_issues.md b/notes/data_issues.md
index 87a91b9..e450c8d 100644
--- a/notes/data_issues.md
+++ b/notes/data_issues.md
@@ -1,9 +1,25 @@
 # Data issues specifically in Citation Graph
 
+## Occurence Download
+
+* date: 2021-10-28
+* status: open
+* example: https://fatcat.wiki/release/ns4v2jvhgbhh7mbg45bjtpzway/refs-in
+
+Symptom: Many datasets pointing to a publication; e.g. all having "Occurrence Download" as title
+
+Possible mitigation:
+
+* [ ] extract all titles from fatcat
+* [ ] find most common titles, decide if it should be blacklisted for citation graph
+* [ ] keep blacklist of release ident to ignore in edges
+* [ ] filter refcat, remove edges with blacklisted id as source (and target)
+
 ## Repeated entries
 
-* 2020-04-19
-* https://qa.fatcat.wiki/release/lcarb5rg5vf3tk4hpvosja5sm4/outbound-refs
+* date: 2021-04-19
+* status: solved
+* example: https://fatcat.wiki/release/lcarb5rg5vf3tk4hpvosja5sm4/refs-out
 
 A DOI seems to be using the key, which leads to repeated entries.
 
@@ -13,8 +29,9 @@ A DOI seems to be using the key, which leads to repeated entries.
 
 ## Self references
 
-* 2020-04-19
-* https://qa.fatcat.wiki/release/3fcp4pk7nfamvkbjekqam24bfq/outbound-refs
+* date: 2021-04-19
+* status: solved
+* example: https://fatcat.wiki/release/3fcp4pk7nfamvkbjekqam24bfq/refs-out
 
 The source and target seem to be the same.
 
@@ -22,8 +39,9 @@ The source and target seem to be the same.
 
 ## Duplicated Edges
 
-* 2020-04-20
-* https://qa.fatcat.wiki/release/22222736evcc7kdn3bleua3fge/outbound-refs
+* date: 2021-04-20
+* status: solved
+* example: https://fatcat.wiki/release/22222736evcc7kdn3bleua3fge/refs-out
 * found 16/1M
 
 Source and target are the same, maybe DOI with ref key?
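
The mitigation checklist in the "Occurence Download" entry (count titles, pick blacklist candidates, collect release idents, filter edges) might look roughly like the sketch below. It is not the refcat implementation. The record shapes are assumptions made for illustration: releases are dicts with hypothetical `ident` and `title` keys, and edges are dicts with hypothetical `source` and `target` keys, which may not match the real fatcat or refcat schemas.

```python
from collections import Counter


def normalize(title):
    """Lowercase and trim a title; treat a missing title as empty."""
    return (title or "").strip().lower()


def most_common_titles(releases, n=10):
    """Return the n most frequent normalized titles as (title, count) pairs.

    Frequent titles are only candidates; deciding whether one should be
    blacklisted remains a manual step, as in the checklist.
    """
    counts = Counter(
        normalize(r.get("title")) for r in releases if normalize(r.get("title"))
    )
    return counts.most_common(n)


def blacklisted_idents(releases, titles):
    """Collect the idents (hypothetical 'ident' key) of releases with a blacklisted title."""
    wanted = {normalize(t) for t in titles}
    return {r["ident"] for r in releases if normalize(r.get("title")) in wanted}


def filter_edges(edges, blacklist):
    """Drop edges whose source or target (hypothetical keys) is blacklisted."""
    return [
        e
        for e in edges
        if e["source"] not in blacklist and e["target"] not in blacklist
    ]
```

A run would call `most_common_titles` first, review the result by hand, then pass the chosen titles to `blacklisted_idents` and the resulting set to `filter_edges`. For a full-size dump, the same steps would more likely run as streaming passes over the data than on in-memory lists.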