## Prep

    2022-07-13 05:24:33 (177 KB/s) - ‘dblp.xml.gz’ saved [715701831/715701831]

    Counter({'total': 9186263, 'skip': 9186263, 'has-doi': 4960506, 'skip-key-type': 3037457, 'skip-arxiv-corr': 439104, 'skip-title': 1, 'insert': 0, 'update': 0, 'exists': 0})

    5.71M 3:37:38 [ 437 /s]
    7.48k 0:38:18 [3.25 /s]

## Container Import
Run on 2022-07-15, after a database backup/snapshot.

    export FATCAT_AUTH_WORKER_DBLP=[...]

    ./fatcat_import.py dblp-container --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt --dblp-container-map-file ../extra/dblp/existing_dblp_containers.tsv --dblp-container-map-output ../extra/dblp/all_dblp_containers.tsv ../extra/dblp/dblp_container_meta.json
    # Got 5310 existing dblp container mappings.
    # Counter({'total': 7471, 'exists': 7130, 'insert': 341, 'skip': 0, 'update': 0})

    wc -l existing_dblp_containers.tsv all_dblp_containers.tsv dblp_container_meta.json prefix_list.txt
        5310 existing_dblp_containers.tsv
       12782 all_dblp_containers.tsv
        7471 dblp_container_meta.json
        7476 prefix_list.txt

## Release Import

    export FATCAT_AUTH_WORKER_DBLP=[...]

    ./fatcat_import.py dblp-release --dblp-container-map-file ../extra/dblp/all_dblp_containers.tsv ../extra/dblp/dblp.xml
    # Got 7480 dblp container mappings.

    /1/srv/fatcat/src/python/fatcat_tools/importers/dblp_release.py:358: UserWarning: unexpected dblp ext_id match after lookup failed dblp=conf/gg/X90 ident=gfvkxubvsfdede7ps4af3oa34q
      warnings.warn(warn_str)
    /1/srv/fatcat/src/python/fatcat_tools/importers/dblp_release.py:358: UserWarning: unexpected dblp ext_id match after lookup failed dblp=conf/visalg/X88 ident=lvfyrd3lvva3hjuaaokzyoscmm
      warnings.warn(warn_str)
    /1/srv/fatcat/src/python/fatcat_tools/importers/dblp_release.py:358: UserWarning: unexpected dblp ext_id match after lookup failed dblp=conf/msr/PerumaANMO22 ident=2grlescl2bcpvd5yoc4npad3bm
      warnings.warn(warn_str)
    /1/srv/fatcat/src/python/fatcat_tools/importers/dblp_release.py:358: UserWarning: unexpected dblp ext_id match after lookup failed dblp=conf/dagstuhl/Brodlie97 ident=l6nh222fpjdzfotchu7vfjh6qu
      warnings.warn(warn_str)
    /1/srv/fatcat/src/python/fatcat_tools/importers/dblp_release.py:358: UserWarning: unexpected dblp ext_id match after lookup failed dblp=series/gidiss/2018 ident=x6t7ze4z55enrlq2dnac4qqbve

    Counter({'total': 9186263, 'exists': 5356574, 'has-doi': 4960506, 'skip': 3633039, 'skip-key-type': 3037457, 'skip-arxiv-corr': 439104, 'exists-fuzzy': 192376, 'skip-dblp-container-missing': 156477, 'insert': 4216, 'skip-arxiv': 53, 'skip-dblp-id-mismatch': 5, 'skip-title': 1, 'update': 0})

NOTE: had to re-try the import partway through, so these counts are not
accurate overall.

This seems like a large number of `skip-dblp-container-missing` records. Maybe
the container map file should have been re-generated differently?
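
As a rough coverage check, something like the following could list dblp key
prefixes that have no entry in the container map. This is only a sketch: it
assumes `prefix_list.txt` has one dblp container prefix per line and that the
first column of `all_dblp_containers.tsv` is the dblp prefix, neither of which
is verified here.

    # hedged sketch: dblp prefixes with no entry in the container map
    # (assumes one prefix per line in prefix_list.txt, and that column 1 of
    #  all_dblp_containers.tsv is the dblp container prefix)
    comm -23 \
        <(sort -u prefix_list.txt) \
        <(cut -f1 all_dblp_containers.tsv | sort -u) \
        | head -n20

If that list is long, it would go some way toward explaining the ~156k
`skip-dblp-container-missing` releases.
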
After this import there are 2,217,670 releases with a dblp ID, and 478,983 with
a dblp ID and no DOI.
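
One way to re-derive that second number from a release dump (rather than from
the database) would be a jq pass like the one below. This is a sketch: the
dump filename is a placeholder, and it assumes a newline-delimited JSON export
of release entities with the usual `ext_ids.dblp` / `ext_ids.doi` fields.

    # hedged sketch: count releases with a dblp ID but no DOI from a release
    # export (filename is a placeholder)
    zcat release_export_expanded.json.gz \
        | jq -r 'select(.ext_ids.dblp != null and .ext_ids.doi == null) | .ident' \
        | wc -l
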
## Sandcrawler Seedlist Generation
Almost none of the ~479k dblp releases with no DOI have an associated file.
This implies that no ingest has happened yet, even though the fatcat importer
does parse and filter the "fulltext" URLs out of dblp records.

    cat dblp_releases_partial.json | pipenv run ./dblp2ingestrequest.py - | pv -l | gzip > dblp_sandcrawler_ingest_requests.json.gz
    # 631k 0:02:39 [3.96k/s]
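
To sanity-check what `dblp2ingestrequest.py` actually emitted, it is worth
eyeballing a single record before enqueueing anything. The fields named in the
comment (`base_url`, `ingest_type`, `link_source`, a fatcat release ident) are
what the sandcrawler request schema is expected to carry, from memory rather
than verified against the script:

    # peek at one generated ingest request; expecting fields along the lines
    # of base_url, ingest_type, link_source, and a fatcat release ident
    zcat dblp_sandcrawler_ingest_requests.json.gz | head -n1 | jq .

A per-domain breakdown of the `base_url` hosts: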

    zcat dblp_sandcrawler_ingest_requests.json.gz | jq -r .base_url | cut -f3 -d/ | sort | uniq -c | sort -nr | head -n25

      43851 ceur-ws.org
      33638 aclanthology.org
      32077 aisel.aisnet.org
      31017 ieeexplore.ieee.org
      26426 dl.acm.org
      23817 hdl.handle.net
      22400 www.isca-speech.org
      20072 tel.archives-ouvertes.fr
      18609 www.aaai.org
      18244 eprint.iacr.org
      15720 ethos.bl.uk
      14727 nbn-resolving.org
      14470 proceedings.mlr.press
      14095 dl.gi.de
      12159 proceedings.neurips.cc
      10890 knowledge.amia.org
      10049 www.usenix.org
       9675 papers.nips.cc
       7541 subs.emis.de
       7396 openaccess.thecvf.com
       7345 mindmodeling.org
       6574 ojs.aaai.org
       5814 www.lrec-conf.org
       5773 search.ndltd.org
       5311 ijcai.org
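
Quite a few of the top hosts (ieeexplore.ieee.org and dl.acm.org in
particular) are unlikely to yield fulltext from a direct landing-page ingest.
A rough way to size that portion; the domain list here is an ad-hoc guess, not
a vetted blocklist:

    # rough count of requests pointing at hosts that probably won't yield
    # fulltext via direct ingest (ad-hoc domain list, not a vetted blocklist)
    zcat dblp_sandcrawler_ingest_requests.json.gz \
        | jq -r .base_url \
        | rg -c 'ieeexplore\.ieee\.org|dl\.acm\.org'
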
This is the first ingest, so let's do some sampling in the 'daily' queue:

    zcat dblp_sandcrawler_ingest_requests.json.gz | shuf -n100 | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc350.us.archive.org -t sandcrawler-prod.ingest-file-requests-daily -p -1

Looks like we can probably get away with doing these in the daily ingest queue
instead of bulk? Try a larger batch:

    zcat dblp_sandcrawler_ingest_requests.json.gz | shuf -n10000 | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc350.us.archive.org -t sandcrawler-prod.ingest-file-requests-daily -p -1

Nope, these are going to need bulk ingest and then follow-up crawling. Will do
a heritrix crawl along with the JALC and DOAJ stuff.

    zcat dblp_sandcrawler_ingest_requests.json.gz | rg -v "\\\\" | jq . -c | pv -l | kafkacat -P -b wbgrp-svc350.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
    # 631k 0:00:11 [54.0k/s]
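
As a quick spot check that the requests actually landed on the bulk topic, one
can consume a couple of messages back off the tail of the topic (same broker
and topic as above; `-o -2 -e` reads the last two messages per partition and
exits):

    # spot-check: read back the last couple of messages from the bulk topic
    kafkacat -C -b wbgrp-svc350.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -o -2 -e | jq -r .base_url
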
TODO:
x python or jq transform of JSON objects
x filter out german book/library URLs
x ensure fatcat importer will actually import dblp matches
x test with a small batch in daily or priority queue
- enqueue all in bulk mode, even if processed before? many were probably
  already attempted via MAG or OAI-PMH previously