## Containers: bad publisher strings

```
fatcat-cli search container publisher:NULL --count
# 131
# update to empty string (?)
```
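
A sketch of the actual fix-up, assuming we do want an empty string rather than
unsetting the field; the exact `fatcat-cli update` mutation syntax, and how to
dump the matching idents, should be checked against `fatcat-cli help` first:

```
# null_publisher_container_idents.txt: one container ident per line, dumped
# from the search above (or from the search index directly)
while read -r ident; do
    # assumption: "update" accepts field=value mutations; confirm the
    # specifier syntax and editgroup handling before running for real
    fatcat-cli update container "$ident" publisher=""
done < null_publisher_container_idents.txt
```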
## Releases: very long titles
## Bad PDFs
https://fatcat.wiki/file/ypoyxwqw5zexbamwtdmpavjjbi
https://web.archive.org/web/20190305033128/http://pdfs.semanticscholar.org/ceb2/b47a7647c710cd8e2c1937395b5d4a3a0204.pdf
sha1:ceb2b47a7647c710cd8e2c1937395b5d4a3a0204
Not actually even a PDF?

Should do a query of `file_meta` and/or `pdf_meta` from the sandcrawler DB,
joined against an updated `fatcat_file` table, and look for mismatches (eg,
files whose sniffed mimetype is not `application/pdf`), then remove or update
the corresponding file entities on the fatcat side.
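
For example, a first pass could just flag fatcat-linked files whose sniffed
mimetype is not `application/pdf`. A rough sketch; the database and column
names are assumptions to verify against the current sandcrawler schema:

```
# dump sha1 + sniffed mimetype for fatcat-linked files that are not PDFs;
# DB name and column names are assumptions
psql sandcrawler -c "
COPY (
    SELECT file_meta.sha1hex, file_meta.mimetype
    FROM file_meta
    JOIN fatcat_file ON file_meta.sha1hex = fatcat_file.sha1hex
    WHERE file_meta.mimetype IS NOT NULL
      AND file_meta.mimetype != 'application/pdf'
) TO STDOUT WITH CSV
" > fatcat_non_pdf_files.csv
```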
## Partial PDFs
Look into `ieeexplore.ieee.org` PDFs; many of these may be partial (truncated) downloads?
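
A cheap heuristic for spotting truncated files: a complete PDF should have a
`%%EOF` marker near the end. A sketch, over a local sample of suspect
downloads:

```
# flag PDFs that are missing the trailing %%EOF marker; not definitive, but a
# useful first-pass signal for partial/truncated downloads
for f in *.pdf; do
    if ! tail -c 2048 "$f" | grep -aq '%%EOF'; then
        echo "possibly truncated: $f"
    fi
done
```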
## Invalid DOIs
We get a bunch of bogus DOIs from various sources, eg, PubMed and DOAJ
metadata (and probably dblp as well).
It is not hard to verify individual DOIs, but doing so at scale is a bit harder.
We could start by identifying bogus DOIs from failed ingests in sandcrawler-db,
then verifying and removing from fatcat. Need to ensure we aren't "looping" the
DOIs on the fatcat side (eg, re-importing).
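
For the verification step, the doi.org handle API is a simple option:
`responseCode` 1 means the handle is registered, 100 means it is not found. A
sketch (no rate limiting or retries; DOIs with unusual characters may need
URL-encoding):

```
# check suspect DOIs against the doi.org handle API; keep the ones that do
# not resolve, for review and removal
while read -r doi; do
    code=$(curl -s "https://doi.org/api/handles/${doi}" | jq -r '.responseCode')
    if [ "$code" != "1" ]; then
        echo "$doi"
    fi
done < suspect_dois.txt > bogus_dois.txt
```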
Could also do random sampling across, eg, DOAJ containers, to identify
publishers which don't register DOIs, then verify all of them.
Also, deleted DOIs are a related case to handle.
## Likely Bogus Dates
If the release date is 1970-01-01 (UNIX timestamp zero), it is almost
certainly a bogus default; set it to none.
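
A sketch for finding these in a bulk metadata dump (dump file name as in the
public exports; adjust to whatever snapshot is on hand):

```
# scan a newline-delimited JSON release dump for the epoch-zero date and
# collect idents for review
zcat release_export_expanded.json.gz \
    | jq -r 'select(.release_date == "1970-01-01") | .ident' \
    > epoch_date_release_idents.txt
```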
## Forthcoming Articles
These entities are created when the DOI is registered, but perhaps shouldn't be?
Example: "Forthcoming Article", 2019, Astrophysical Journal Letters,
doi:10.3847/2041-8213/ab0c96
## File Slides
Many PDFs in fatcat that are associated with "paper" releases seem to actually be slide decks.
#### Sandcrawler SQL Exploration

```
SELECT *
FROM pdf_meta
LEFT JOIN fatcat_file
    ON pdf_meta.sha1hex = fatcat_file.sha1hex
WHERE
    status = 'success'
    AND page0_height < page0_width
    AND fatcat_file.sha1hex IS NOT NULL
LIMIT 10;

SELECT COUNT(*)
FROM pdf_meta
LEFT JOIN fatcat_file
    ON pdf_meta.sha1hex = fatcat_file.sha1hex
WHERE
    status = 'success'
    AND page0_height < page0_width
    AND fatcat_file.sha1hex IS NOT NULL;
-- 199,126
```
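
The count query above can be turned directly into the ident dump needed for
step 1 of the cleanup idea below; a sketch, assuming the sandcrawler
`fatcat_file` table exposes the fatcat file ident in an `ident` column (column
name should be verified):

```
# dump fatcat file idents for landscape-first-page PDFs (feeds step 1 below)
psql sandcrawler -c "
COPY (
    SELECT fatcat_file.ident
    FROM pdf_meta
    JOIN fatcat_file ON pdf_meta.sha1hex = fatcat_file.sha1hex
    WHERE pdf_meta.status = 'success'
      AND pdf_meta.page0_height < pdf_meta.page0_width
) TO STDOUT
" > file_idents.txt
```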
#### Low-Code Cleanup Idea
1. do a SQL dump of file idents with this issue
2. use fatcat-cli to fetch the file entities, with releases expanded
3. use jq to filter to files with only one release associated
4. use jq to filter to files where the single release is a paper (eg, "article-journal") and maybe also has a `container_id`
5. use jq to modify the entities, setting `release_ids` to an empty list, and setting `file_scope`
6. use `fatcat-cli` to update the file entities
This should fix many, though not all, such cases; a rough pipeline sketch follows.
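
A rough sketch of steps 2 through 5, using `curl` against the public API plus
`jq`; the `expand=releases` parameter and the
`releases`/`release_ids`/`release_type`/`container_id` field names are
believed to match the API schema but should be double-checked, as should the
scope field name:

```
# file_idents.txt: one file ident per line (step 1, from the SQL dump above);
# add an auth token and editgroup handling for the actual update step
while read -r ident; do
    curl -sf "https://api.fatcat.wiki/v0/file/${ident}?expand=releases"
done < file_idents.txt \
    | jq -c '
        # step 3: keep files linked to exactly one release
        select((.releases | length) == 1)
        # step 4: ... where that release is a journal article with a container
        | select(.releases[0].release_type == "article-journal"
                 and .releases[0].container_id != null)
        # step 5: unlink the release and mark the scope
        # (scope field name as in the note above; confirm against the schema)
        | .release_ids = [] | .file_scope = "slides"
        | del(.releases)
    ' > file_updates.json

# step 6: push file_updates.json back with fatcat-cli (or the API) inside a
# new editgroup for review
```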