
## Containers: bad publisher strings

    fatcat-cli search container publisher:NULL --count
    # 131
    # update to empty string (?)

## Releases: very long titles


## Bad PDFs

    https://fatcat.wiki/file/ypoyxwqw5zexbamwtdmpavjjbi
    https://web.archive.org/web/20190305033128/http://pdfs.semanticscholar.org/ceb2/b47a7647c710cd8e2c1937395b5d4a3a0204.pdf
    sha1:ceb2b47a7647c710cd8e2c1937395b5d4a3a0204
    not actually even a PDF?

Should do a query of `file_meta` and/or `pdf_meta` from the sandcrawler DB,
joined against an updated `fatcat_file` table, look for mismatches, and then
remove or update the affected file entities on the fatcat side.
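A mismatch query along these lines could be mocked up as follows. This is a
sketch using SQLite with toy data; the real sandcrawler table schemas (and the
`mimetype` column name) are assumptions here:

```python
import sqlite3

# In-memory stand-ins for the sandcrawler `file_meta` table and a
# `fatcat_file` dump; the real schemas will differ.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE file_meta (sha1hex TEXT PRIMARY KEY, mimetype TEXT);
CREATE TABLE fatcat_file (sha1hex TEXT PRIMARY KEY, mimetype TEXT);
INSERT INTO file_meta VALUES
    ('ceb2b47a7647c710cd8e2c1937395b5d4a3a0204', 'text/html'),
    ('aaaa000000000000000000000000000000000000', 'application/pdf');
INSERT INTO fatcat_file VALUES
    ('ceb2b47a7647c710cd8e2c1937395b5d4a3a0204', 'application/pdf'),
    ('aaaa000000000000000000000000000000000000', 'application/pdf');
""")

# Files which fatcat thinks are PDFs but sandcrawler disagrees about
mismatches = db.execute("""
    SELECT ff.sha1hex, fm.mimetype, ff.mimetype
    FROM fatcat_file ff
    JOIN file_meta fm ON fm.sha1hex = ff.sha1hex
    WHERE fm.mimetype != ff.mimetype
""").fetchall()
for row in mismatches:
    print(row)
```

In practice this would run against the sandcrawler PostgreSQL instance, with
the resulting ident list fed into a fatcat cleanup script.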


## Partial PDFs

Look into `ieeexplore.ieee.org` PDFs; many may be partial (truncated) downloads.
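One cheap way to flag truncated PDFs before deeper inspection is to check for
the `%%EOF` marker near the end of the file. A heuristic sketch (this check is
an assumption, not an existing sandcrawler feature):

```python
def looks_complete(pdf_bytes: bytes) -> bool:
    """Heuristic: a well-formed PDF ends with an %%EOF marker,
    possibly followed by trailing whitespace."""
    tail = pdf_bytes[-1024:]  # the marker should be near the end
    return b"%%EOF" in tail

# A truncated download is usually cut off mid-stream:
print(looks_complete(b"%PDF-1.4 ... content ... %%EOF\n"))       # True
print(looks_complete(b"%PDF-1.4 ... content cut off mid-str"))   # False
```

This would miss PDFs truncated after an intermediate `%%EOF` (incremental
updates), so it is a first-pass filter only.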


## Invalid DOIs

We get a bunch of bogus DOIs from various sources. Eg, pubmed and doaj metadata
(and probably dblp).

It is not hard to verify individual DOIs, but doing so at scale is a bit harder.
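Obviously malformed DOIs can at least be screened offline with a syntax check
before any resolution attempt. A minimal sketch, using the common
`10.<registrant>/<suffix>` shape (a plausibility filter, not a full validator):

```python
import re

# DOIs start with the "10." directory indicator, a 4-9 digit registrant
# code, a slash, and a non-empty suffix (which may contain almost anything).
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def plausible_doi(doi: str) -> bool:
    return bool(DOI_PATTERN.match(doi.strip().lower()))

print(plausible_doi("10.3847/2041-8213/ab0c96"))   # True
print(plausible_doi("http://dx.doi.org/10.1234"))  # False: URL prefix, no suffix
```

DOIs which pass this filter would still need resolution against `doi.org` (or
a Crossref/DataCite dump) to catch registered-but-bogus cases.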

We could start by identifying bogus DOIs from failed ingests in sandcrawler-db,
then verifying and removing from fatcat. Need to ensure we aren't "looping" the
DOIs on the fatcat side (eg, re-importing).

Could also do random sampling across, eg, DOAJ containers, to identify
publishers which don't register DOIs, then verify all of them.

Also: deleted DOIs need handling.


## Likely Bogus Dates

If a release date is exactly 1970-01-01 (UNIX timestamp zero, a common default
for missing dates), set it to none.
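The cleanup rule above could be sketched as:

```python
from datetime import date

# 1970-01-01 is epoch zero: a strong hint that the upstream source had
# no real date and a default value leaked through.
EPOCH = date(1970, 1, 1)

def scrub_date(release_date):
    """Return None for the likely-bogus epoch default, else pass through."""
    if release_date == EPOCH:
        return None
    return release_date

print(scrub_date(date(1970, 1, 1)))   # None
print(scrub_date(date(2019, 3, 5)))   # 2019-03-05
```

Some works genuinely published on 1970-01-01 would be caught as false
positives, so this may be better limited to sources known to emit the default.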


## Forthcoming Articles

These entities are created when the DOI is registered, but perhaps shouldn't be?

    Forthcoming Article 2019   Astrophysical Journal Letters
    doi:10.3847/2041-8213/ab0c96


## File Slides

Many PDFs in fatcat that are associated with "paper" releases actually seem to be slide decks.

#### Sandcrawler SQL Exploration

    SELECT *
    FROM pdf_meta
    LEFT JOIN fatcat_file
        ON pdf_meta.sha1hex = fatcat_file.sha1hex
    WHERE
        status = 'success'
        AND page0_height < page0_width
        AND fatcat_file.sha1hex IS NOT NULL
    LIMIT 10;

    SELECT COUNT(*)
    FROM pdf_meta
    LEFT JOIN fatcat_file
        ON pdf_meta.sha1hex = fatcat_file.sha1hex
    WHERE
        status = 'success'
        AND page0_height < page0_width
        AND fatcat_file.sha1hex IS NOT NULL;
    # 199,126

#### Low-Code Cleanup Idea

1. do a SQL dump of file idents with this issue
2. use fatcat-cli to fetch the file entities, with releases expanded
3. use jq to filter to files with only one release associated
4. use jq to filter to files where the single release is a paper (eg, "article-journal") and maybe also has a `container_id`
5. use jq to modify the entities, setting `release_id` to null/empty, and setting `file_scope`
6. use `fatcat-cli` to update the file entities

This should fix many, though not all, such cases.
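Steps 3 through 5 could also be sketched in Python instead of jq. The field
names used here (`releases`, `release_type`, `container_id`, `release_ids`,
`file_scope`) and the `"slides"` scope value are assumptions about the
expanded entity JSON, not verified against the schema:

```python
def is_single_paper_file(entity: dict) -> bool:
    """Steps 3-4: exactly one associated release, and it is a
    journal article with a container."""
    releases = entity.get("releases") or []
    if len(releases) != 1:
        return False
    release = releases[0]
    return (release.get("release_type") == "article-journal"
            and release.get("container_id") is not None)

def detach_release(entity: dict) -> dict:
    """Step 5: unlink the release and mark the file as slides."""
    entity = dict(entity)
    entity["release_ids"] = []
    entity["file_scope"] = "slides"
    return entity

sample = {
    "ident": "ypoyxwqw5zexbamwtdmpavjjbi",
    "releases": [{"release_type": "article-journal", "container_id": "abc"}],
}
if is_single_paper_file(sample):
    sample = detach_release(sample)
print(sample["release_ids"], sample["file_scope"])  # [] slides
```

The modified entities would then go back through `fatcat-cli` for the actual
updates (step 6).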