blob: 39c84dd445e495f6dc970a16ded55fdf07c0bdbb (
plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
|
Want to update fatcat file entities with "full" file metadata for those which are missing it.
How many `file_meta` rows *still* don't have metadata?
SELECT COUNT(*) FROM file_meta WHERE sha256hex IS NULL;
=> 62962
First generate list of sha1hex from most recent bulk export which are missing
at least some metadata (based on missing sha256):
zcat file_hashes.tsv.gz | rg '\t\t' | cut -f3 | sort -u -S 4G | pv -l > fatcat_file_partial_sha1hex.tsv
=> 18.7M 0:05:46 [53.8k/s]
Then dump the entire sandcrawler `file_meta` table as TSV, with first column
sha1hex and second column JSON with all the file metadata fields:
COPY (
SELECT sha1hex, row_to_json(file_meta)
FROM file_meta
WHERE sha256hex IS NOT NULL
ORDER BY sha1hex ASC
)
TO '/grande/snapshots/file_meta_dump.tsv'
WITH NULL '';
Join/cut:
export LC_ALL=C
join -t$'\t' fatcat_file_partial_sha1hex.tsv /grande/snapshots/file_meta_dump.tsv | uniq -w 40 | cut -f2 | pv -l > fatcat_file_partial.file_meta.json
=> 18.1M 0:03:37 [83.2k/s]
Check counts:
cat fatcat_file_partial.file_meta.json | jq .sha1hex -r | sort -u -S 4G | wc -l
=> 18135313
zcat fatcat_file_partial.file_meta.json.gz | jq .mimetype -r | sort -S 4G | uniq -c | sort -nr
18103860 application/pdf
29977 application/octet-stream
876 text/html
199 application/postscript
171 application/gzip
84 text/plain
48 application/xml
38 application/vnd.ms-powerpoint
16 application/msword
8 application/vnd.openxmlformats-officedocument.wordprocessingml.document
6 image/jpeg
4 message/rfc822
4 application/zip
4 application/vnd.openxmlformats-officedocument.presentationml.presentation
3 text/x-tex
3 application/x-dosexec
2 application/x-tar
2 application/vnd.ms-tnef
1 video/mpeg
1 image/tiff
1 image/svg+xml
1 image/png
1 image/gif
1 audio/x-ape
1 application/vnd.ms-office
1 application/CDFV2-unknown
TODO: fatcat importer
|