# Data issues specifically in Citation Graph

## Occurrence Download

* date: 2021-10-28
* status: open
* example: https://fatcat.wiki/release/ns4v2jvhgbhh7mbg45bjtpzway/refs-in

Symptom: Many datasets point to a single publication, e.g. all with the title "Occurrence Download".

Possible mitigation:

* [ ] extract all titles from fatcat
* [ ] find the most common titles and decide which should be blacklisted for the citation graph
* [ ] keep a blacklist of release idents to ignore in edges
* [ ] filter refcat, removing edges with a blacklisted id as source (and target)
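
The mitigation steps above could be sketched roughly as follows. This is only an illustration, not the actual refcat code: the function names, the `(ident, title)` release pairs and the `(source, target)` edge tuples are all hypothetical, and it assumes that an edge is dropped when either end is blacklisted.

```python
from collections import Counter


def normalize(title):
    """Lowercase and strip a title so trivial variants compare equal."""
    return title.strip().lower()


def most_common_titles(releases, n=10):
    """Return the n most frequent normalized titles with their counts.

    `releases` is an iterable of (ident, title) pairs (hypothetical shape).
    """
    counts = Counter(normalize(title) for _, title in releases if title)
    return counts.most_common(n)


def blacklisted_idents(releases, bad_titles):
    """Collect the release idents whose title is on the title blacklist."""
    bad = {normalize(t) for t in bad_titles}
    return {ident for ident, title in releases if title and normalize(title) in bad}


def filter_edges(edges, blacklist):
    """Drop edges whose source or target ident is blacklisted.

    `edges` is an iterable of (source_ident, target_ident) tuples.
    """
    return [(s, t) for s, t in edges if s not in blacklist and t not in blacklist]
```

Deciding which of the frequent titles actually go on the blacklist would stay a manual step; only the counting and the filtering are mechanical.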

## Repeated entries

* date: 2021-04-19
* status: solved
* example: https://fatcat.wiki/release/lcarb5rg5vf3tk4hpvosja5sm4/refs-out

A DOI seems to be used as the key, which leads to repeated entries.

> 2021-07-02: Solved, kind of. We get rid of various duplicates in a
> post-processing step. It would still be better to not generate these in the
> first place.

## Self references

* date: 2021-04-19
* status: solved
* example: https://fatcat.wiki/release/3fcp4pk7nfamvkbjekqam24bfq/refs-out

The source and target seem to be the same.

> 2021-07-02: Solved in post-processing, for now.
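
The notes do not show the post-processing step itself. A minimal sketch of such a filter, assuming edges are available as hypothetical `(source_ident, target_ident)` tuples, might look like this:

```python
def drop_self_references(edges):
    """Keep only edges whose source and target idents differ.

    `edges` is an iterable of (source_ident, target_ident) tuples;
    this shape is an assumption, not the refcat schema.
    """
    return [(source, target) for source, target in edges if source != target]
```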

## Duplicated Edges

* date: 2021-04-20
* status: solved
* example: https://fatcat.wiki/release/22222736evcc7kdn3bleua3fge/refs-out
* found 16/1M

Multiple edges share the same source and target, maybe a DOI with ref key?

> 2021-07-02: Solved in post-processing, for now.
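
As an illustration only (the real post-processing is not described here), duplicated edges could be removed with an order-preserving pass like the one below. The `(source_ident, target_ident)` tuple shape and the function name are assumptions.

```python
def dedupe_edges(edges):
    """Return edges with exact (source, target) repeats removed.

    The first occurrence of each pair is kept and input order is preserved.
    """
    seen = set()
    unique = []
    for edge in edges:
        if edge not in seen:
            seen.add(edge)
            unique.append(edge)
    return unique
```

For inputs too large to hold in memory, sorting the edge file and collapsing adjacent identical lines would be an alternative to the in-memory set.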