path: root/notes/tasks/2020-08-20_file_meta.md

Goal: update fatcat file entities that are missing "full" file metadata.

How many `file_meta` rows *still* don't have metadata?

    SELECT COUNT(*) FROM file_meta WHERE sha256hex IS NULL;
    => 62962

First, generate a list of sha1hex values from the most recent bulk export for
files missing at least some metadata (based on missing sha256):

    zcat file_hashes.tsv.gz | rg '\t\t' | cut -f3 | sort -u -S 4G | pv -l > fatcat_file_partial_sha1hex.tsv
    => 18.7M 0:05:46 [53.8k/s]
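
To make the `'\t\t'` filter concrete: in a TSV, two adjacent tabs mean an empty field, so the match keeps any row with a blank column in the middle. Below is a minimal sketch on made-up rows. The column layout (ident, rev, sha1hex, sha256hex, md5hex) is an assumption; the notes only establish that sha1hex is the third field. The short hash values are fake, and `grep` stands in for `rg` so the sketch needs only standard tools.

```shell
# Hypothetical layout (an assumption): ident, rev, sha1hex, sha256hex, md5hex
tab="$(printf '\t')"
sample="$(mktemp)"
{
  printf 'id1\trev1\taaaa\tsha256a\tmd5a\n'   # complete row, should be skipped
  printf 'id2\trev2\tbbbb\t\tmd5b\n'          # sha256hex empty
  printf 'id3\trev3\tbbbb\t\tmd5c\n'          # same sha1hex again, also partial
} > "$sample"

# Keep rows containing an empty field (two adjacent tabs),
# then emit each sha1hex (third column) once.
partial="$(grep "${tab}${tab}" "$sample" | cut -f3 | sort -u)"
echo "$partial"
rm -f "$sample"
```

Note that this pattern matches an empty field in any interior column, not only sha256, and it would miss an empty first or last column.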

Then dump the sandcrawler `file_meta` table (only rows that have a sha256hex)
as TSV, with sha1hex in the first column and a JSON object holding all the file
metadata fields in the second:

    COPY (
      SELECT sha1hex, row_to_json(file_meta)
      FROM file_meta
      WHERE sha256hex IS NOT NULL
      ORDER BY sha1hex ASC
    )
    TO '/grande/snapshots/file_meta_dump.tsv'
    WITH NULL '';

Join/cut:

    export LC_ALL=C
    join -t$'\t' fatcat_file_partial_sha1hex.tsv /grande/snapshots/file_meta_dump.tsv | uniq -w 40 | cut -f2 | pv -l > fatcat_file_partial.file_meta.json
    => 18.1M 0:03:37 [83.2k/s]
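
The join step depends on both inputs being sorted on the key in the same collation, which is why `LC_ALL=C` is exported first. A small sketch of the same pipeline on synthetic data follows. The file names, the zero-padded 40-character keys, and the `n` field are all made up for illustration, and the duplicated key is there only to show what `uniq -w 40` does (it compares the first 40 characters, i.e. the sha1hex).

```shell
export LC_ALL=C
tab="$(printf '\t')"
work="$(mktemp -d)"

# Fake 40-character keys standing in for sha1hex values
k1="$(printf '%040d' 1)"; k2="$(printf '%040d' 2)"; k3="$(printf '%040d' 3)"

# Keys we want metadata for (sorted, unique); k3 has no row in the dump
printf '%s\n' "$k1" "$k3" > "$work/partial_sha1hex.tsv"

# Dump: sha1hex <TAB> JSON, sorted by sha1hex; k1 is duplicated on purpose
{
  printf '%s\t{"sha1hex":"%s","n":1}\n' "$k1" "$k1"
  printf '%s\t{"sha1hex":"%s","n":2}\n' "$k1" "$k1"
  printf '%s\t{"sha1hex":"%s","n":3}\n' "$k2" "$k2"
} > "$work/file_meta_dump.tsv"

# Inner join on the first field, keep one line per key, keep only the JSON
joined="$(join -t "$tab" "$work/partial_sha1hex.tsv" "$work/file_meta_dump.tsv" \
  | uniq -w 40 | cut -f2)"
echo "$joined"
rm -rf "$work"
```

Only the first row for `k1` survives: `k2` is not in the wanted list and `k3` has no row in the dump, so the inner join drops both.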

Check counts (the second command reads a gzipped copy, so the file was
presumably compressed in between):

    cat fatcat_file_partial.file_meta.json | jq .sha1hex -r | sort -u -S 4G | wc -l
    => 18135313

    zcat fatcat_file_partial.file_meta.json.gz | jq .mimetype -r | sort -S 4G | uniq -c | sort -nr
    18103860 application/pdf
      29977 application/octet-stream
        876 text/html
        199 application/postscript
        171 application/gzip
         84 text/plain
         48 application/xml
         38 application/vnd.ms-powerpoint
         16 application/msword
          8 application/vnd.openxmlformats-officedocument.wordprocessingml.document
          6 image/jpeg
          4 message/rfc822
          4 application/zip
          4 application/vnd.openxmlformats-officedocument.presentationml.presentation
          3 text/x-tex
          3 application/x-dosexec
          2 application/x-tar
          2 application/vnd.ms-tnef
          1 video/mpeg
          1 image/tiff
          1 image/svg+xml
          1 image/png
          1 image/gif
          1 audio/x-ape
          1 application/vnd.ms-office
          1 application/CDFV2-unknown
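
The joined output (about 18.1M lines) is smaller than the input key list (about 18.7M), so some sha1hex values had no matching `file_meta` row. A possible follow-up, not something these notes ran, is to list those keys with `join -v 1`, which prints lines from the first file that have no pairing in the second. The sketch below uses made-up three-letter keys and hypothetical file names.

```shell
export LC_ALL=C
tab="$(printf '\t')"
work="$(mktemp -d)"

# Wanted keys (sorted); "bbb" has no row in the dump
printf '%s\n' aaa bbb ccc > "$work/wanted.tsv"
printf 'aaa\t{}\nccc\t{}\n' > "$work/dump.tsv"

# -v 1: print only unpairable lines from file 1
unmatched="$(join -t "$tab" -v 1 "$work/wanted.tsv" "$work/dump.tsv")"
echo "$unmatched"
rm -rf "$work"
```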

TODO: fatcat importer