Notes from fall 2020 

## Prefix Counts

    # of conferences: 5,329
    # of journals: 1,724

    zcat dblp.xml.gz | rg "key=" | rg "mdate=" | cut -f3 -d' ' | cut -f2 -d'"' | pv -l > keys.txt
    => 8.00M

    cat keys.txt | cut -f1 -d/ | sort | uniq -c | sort -nr
    2764029 conf
    2640949 homepages
    2431614 journals
      77682 phd
      37402 books
      27830 reference
      19153 series
        555 tr
                tr/ibm/LILOG34
                tr/sql/X3H2-90-292
         16 persons
         15 www
                www/org/w3/TR/xquery
                www/org/mitre/future
          6 ms
          3 dblpnote

    cat keys.txt | cut -f1-2 -d/ | sort -u | cut -f1 -d/ | sort | uniq -c | sort -nr
       5138 conf
       1725 journals
        291 homepages
        125 phd
         96 series
         77 books
         60 reference
         16 persons
          9 tr
          6 ms
          3 dblpnote
          2 www
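
Both tallies above (keys per top-level type, and distinct second-level prefixes per type) can also be computed in a single pass. The following is only a sketch, assuming `keys.txt` holds one DBLP key per line as produced above; the function name `prefix_counts` is made up for illustration.

```shell
# Hypothetical helper (not in the original notes): for each top-level key
# type, print the number of keys and the number of distinct second-level
# prefixes (e.g. conf/sigmod), reading one DBLP key per line from stdin.
# Output columns are tab-separated: key count, prefix count, type.
prefix_counts() {
    awk -F/ '
        {
            keys[$1]++                      # keys per top-level type
            if (!(($1 "/" $2) in seen)) {   # first time this type/prefix pair appears
                seen[$1 "/" $2] = 1
                prefixes[$1]++
            }
        }
        END {
            for (t in keys) printf "%d\t%d\t%s\n", keys[t], prefixes[t], t
        }
    ' | sort -nr
}
```

Usage would be `prefix_counts < keys.txt`. This avoids sorting all 8M keys twice, at the cost of holding the set of distinct prefixes in memory.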

Fetch all the HTML:

    shuf prefixes.txt | pv -l | parallel -j1 wget -nc -q "https://dblp.org/db/{}/index.html" -O {}.html

Got blocked: the site expects only one request per minute. Delete the empty files left by failed downloads and try again with `-j1` instead of `-j4`:

    find . -empty -type f -delete

Roughly 500 pages fetched in 2:38
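
A sequential loop with an explicit delay is one way to stay under the rate limit and to make re-runs pick up where the last one stopped. This is a sketch only: the function names, the `DELAY_SECS` variable, and the 60-second default (taken from the "one per minute" remark above) are assumptions, not something the original run used.

```shell
# Hypothetical helpers (names and defaults are assumptions, not from the notes).

# Download one URL ($1) to one file ($2); kept separate so it can be swapped out.
fetch_one() {
    wget -q "$1" -O "$2"
}

# Fetch the index page for every prefix listed in file $1 into directory $2,
# one request at a time, sleeping DELAY_SECS (default 60) after each request.
fetch_prefixes() {
    prefix_file=$1
    out_dir=$2
    delay=${DELAY_SECS:-60}
    while IFS= read -r prefix; do
        [ -n "$prefix" ] || continue
        out="$out_dir/$prefix.html"
        # Non-empty file: already fetched on an earlier run, so skip it.
        [ -s "$out" ] && continue
        mkdir -p "$(dirname "$out")"
        # Remove the empty or partial file if the download fails.
        fetch_one "https://dblp.org/db/$prefix/index.html" "$out" || rm -f "$out"
        sleep "$delay"
    done < "$prefix_file"
}
```

Usage would be something like `fetch_prefixes prefixes.txt .`; failed downloads are removed as they happen, so a separate `find . -empty -type f -delete` pass should not be needed.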

TODO: wrap this script so it iterates over filenames, instead of one-per-call

## Dev Import Counts

    Counter({'total': 7953365, 'has-doi': 4277307, 'skip': 2953841, 'skip-key-type': 2640968, 'skip-arxiv-corr': 312872, 'skip-title': 1, 'insert': 0, 'update': 0, 'exists': 0})
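
A quick arithmetic check on the Counter line above: the three skip reasons add up to the `skip` total, and `skip-key-type` equals the combined `homepages`, `persons`, and `dblpnote` key counts from the prefix tally. That these are the skipped key types is an inference from the numbers, not something the notes state. The variable names below are made up for illustration.

```shell
# Counter values copied from the Counter line above.
total=7953365
has_doi=4277307
skip=2953841
skip_key_type=2640968
skip_arxiv_corr=312872
skip_title=1

# The three skip reasons should account for every skipped record.
skip_sum=$((skip_key_type + skip_arxiv_corr + skip_title))
echo "skip reasons sum: $skip_sum (reported skip: $skip)"

# Key counts from the prefix tally: homepages + persons + dblpnote.
# Assumption: these are the key types the importer skips.
non_release_keys=$((2640949 + 16 + 3))
echo "homepages+persons+dblpnote: $non_release_keys (skip-key-type: $skip_key_type)"

# Share of all records carrying a DOI, as a percentage with one decimal.
doi_pct=$(awk -v d="$has_doi" -v t="$total" 'BEGIN { printf "%.1f", 100 * d / t }')
echo "has-doi: $doi_pct%"
```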

Container imports:

    # blank database
    Counter({'total': 6954, 'insert': 6944, 'skip-update': 10, 'skip': 0, 'update': 0, 'exists': 0})

    # repeated
    Counter({'total': 6954, 'insert': 5325, 'skip-update': 1629, 'skip': 0, 'update': 0, 'exists': 0})

    # repeated with previous complete TSV file
    Counter({'total': 6954, 'skip-update': 6954, 'skip': 0, 'insert': 0, 'update': 0, 'exists': 0})


Release import command:

    ./fatcat_import.py dblp-release --dblp-container-map-file /data/dblp/all_dblp_containers.tsv /data/dblp/dblp.xml