path: root/sql/stats/2020-01-31_supplement.txt
How many file_meta rows are still missing core metadata?

    SELECT COUNT(*) FROM file_meta WHERE sha256hex IS NULL;
    => 1,130,915

Great! Not many.

And how many of those are in petabox?

    SELECT COUNT(*)
    FROM file_meta
    LEFT JOIN petabox ON file_meta.sha1hex = petabox.sha1hex
    WHERE file_meta.sha256hex IS NULL
      AND file_meta.sha1hex IS NOT NULL;
    => 1,149,194

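Apparently almost all, though this count is an upper bound: it is higher than
the previous one, so some sha1hex values must match more than one petabox row,
and the LEFT JOIN also counts rows with no petabox match at all. The remainder
may be CDX fetch failures or something similar. So, these should be run on,
e.g., grobid2-vm.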
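
To separate files that really have a petabox copy from join duplicates and
unmatched rows, one option is an inner join with COUNT(DISTINCT ...). The
sketch below is only an illustration on a toy in-memory SQLite database, not a
query that was run against the real tables; the rows and the `item`/`path`
columns are made up for the example.

```python
import sqlite3

# Toy stand-in for the real tables; all rows here are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE file_meta (sha1hex TEXT PRIMARY KEY, sha256hex TEXT);
    CREATE TABLE petabox (item TEXT, path TEXT, sha1hex TEXT);
    -- 'aaa', 'bbb', 'ccc' lack sha256hex; 'ddd' is complete.
    INSERT INTO file_meta VALUES
        ('aaa', NULL), ('bbb', NULL), ('ccc', NULL), ('ddd', 'f00');
    -- 'aaa' appears in two petabox items, 'ccc' in none.
    INSERT INTO petabox VALUES
        ('item1', 'a.pdf', 'aaa'),
        ('item2', 'a-copy.pdf', 'aaa'),
        ('item3', 'b.pdf', 'bbb');
""")

# Same shape as the query in these notes: duplicates and unmatched rows
# are both counted.
left_join_rows = conn.execute("""
    SELECT COUNT(*)
    FROM file_meta
    LEFT JOIN petabox ON file_meta.sha1hex = petabox.sha1hex
    WHERE file_meta.sha256hex IS NULL
      AND file_meta.sha1hex IS NOT NULL
""").fetchone()[0]

# Inner join plus DISTINCT: each file with at least one petabox copy
# is counted exactly once.
matched_files = conn.execute("""
    SELECT COUNT(DISTINCT file_meta.sha1hex)
    FROM file_meta
    JOIN petabox ON file_meta.sha1hex = petabox.sha1hex
    WHERE file_meta.sha256hex IS NULL
""").fetchone()[0]

print(left_join_rows)  # 4: 'aaa' twice, 'bbb' once, unmatched 'ccc' once
print(matched_files)   # 2: 'aaa' and 'bbb'
```

On the toy data the LEFT JOIN count (4) exceeds the number of files missing
metadata (3), the same pattern as the two counts above.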

    COPY (
      SELECT row_to_json(petabox.*)
      FROM file_meta
      -- inner join: only dump rows that actually exist in petabox
      JOIN petabox ON file_meta.sha1hex = petabox.sha1hex
      WHERE file_meta.sha256hex IS NULL
        AND file_meta.sha1hex IS NOT NULL
    ) TO '/grande/snapshots/dump_grobid_petabox_todo.json';

Count of PDF files that GROBID processed and matched to a release (via
glutton), but which have no corresponding row in `fatcat_file` (note:
`fatcat_file` is out of date by a couple million files):

    SELECT COUNT(*) as total_count, COUNT(DISTINCT grobid.fatcat_release) as release_count
    FROM grobid
    LEFT JOIN fatcat_file ON grobid.sha1hex = fatcat_file.sha1hex
    WHERE fatcat_file.sha1hex IS NULL
      AND grobid.fatcat_release IS NOT NULL;

     total_count | release_count
    -------------+---------------
         5072452 |       4130405
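
The query above is an anti-join: the LEFT JOIN keeps every `grobid` row, and
the `fatcat_file.sha1hex IS NULL` filter then keeps only those with no match.
The total is larger than the distinct release count because several files can
be matched to the same release. A minimal sketch of the pattern on a toy
in-memory SQLite database (all rows are invented for illustration):

```python
import sqlite3

# Toy stand-in for the real tables; all rows here are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE grobid (sha1hex TEXT PRIMARY KEY, fatcat_release TEXT);
    CREATE TABLE fatcat_file (sha1hex TEXT PRIMARY KEY);
    INSERT INTO grobid VALUES
        ('a', 'rel1'),  -- not in fatcat_file
        ('b', 'rel1'),  -- not in fatcat_file, same release as 'a'
        ('c', 'rel2'),  -- already in fatcat_file
        ('d', NULL),    -- no release match, excluded by the filter
        ('e', 'rel3'),  -- already in fatcat_file
        ('f', 'rel4');  -- not in fatcat_file
    INSERT INTO fatcat_file VALUES ('c'), ('e');
""")

total_count, release_count = conn.execute("""
    SELECT COUNT(*) AS total_count,
           COUNT(DISTINCT grobid.fatcat_release) AS release_count
    FROM grobid
    LEFT JOIN fatcat_file ON grobid.sha1hex = fatcat_file.sha1hex
    WHERE fatcat_file.sha1hex IS NULL
      AND grobid.fatcat_release IS NOT NULL
""").fetchone()

print(total_count, release_count)  # 3 2: files a, b, f; releases rel1, rel4
```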