## Prep

    2022-07-13 05:24:33 (177 KB/s) - ‘dblp.xml.gz’ saved [715701831/715701831]

    Counter({'total': 9186263, 'skip': 9186263, 'has-doi': 4960506, 'skip-key-type': 3037457, 'skip-arxiv-corr': 439104, 'skip-title': 1, 'insert': 0, 'update': 0, 'exists': 0})
    5.71M 3:37:38 [ 437 /s]
    7.48k 0:38:18 [3.25 /s]

## Container Import

Run 2022-07-15, after a database backup/snapshot.

    export FATCAT_AUTH_WORKER_DBLP=[...]

    ./fatcat_import.py dblp-container \
        --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt \
        --dblp-container-map-file ../extra/dblp/existing_dblp_containers.tsv \
        --dblp-container-map-output ../extra/dblp/all_dblp_containers.tsv \
        ../extra/dblp/dblp_container_meta.json
    # Got 5310 existing dblp container mappings.
    # Counter({'total': 7471, 'exists': 7130, 'insert': 341, 'skip': 0, 'update': 0})

    wc -l existing_dblp_containers.tsv all_dblp_containers.tsv dblp_container_meta.json prefix_list.txt
        5310 existing_dblp_containers.tsv
       12782 all_dblp_containers.tsv
        7471 dblp_container_meta.json
        7476 prefix_list.txt

## Release Import

    export FATCAT_AUTH_WORKER_DBLP=[...]

    ./fatcat_import.py dblp-release \
        --dblp-container-map-file ../extra/dblp/all_dblp_containers.tsv \
        ../extra/dblp/dblp.xml
    # Got 7480 dblp container mappings.

    /1/srv/fatcat/src/python/fatcat_tools/importers/dblp_release.py:358: UserWarning: unexpected dblp ext_id match after lookup failed dblp=conf/gg/X90 ident=gfvkxubvsfdede7ps4af3oa34q
      warnings.warn(warn_str)
    /1/srv/fatcat/src/python/fatcat_tools/importers/dblp_release.py:358: UserWarning: unexpected dblp ext_id match after lookup failed dblp=conf/visalg/X88 ident=lvfyrd3lvva3hjuaaokzyoscmm
      warnings.warn(warn_str)
    /1/srv/fatcat/src/python/fatcat_tools/importers/dblp_release.py:358: UserWarning: unexpected dblp ext_id match after lookup failed dblp=conf/msr/PerumaANMO22 ident=2grlescl2bcpvd5yoc4npad3bm
      warnings.warn(warn_str)
    /1/srv/fatcat/src/python/fatcat_tools/importers/dblp_release.py:358: UserWarning: unexpected dblp ext_id match after lookup failed dblp=conf/dagstuhl/Brodlie97 ident=l6nh222fpjdzfotchu7vfjh6qu
      warnings.warn(warn_str)
    /1/srv/fatcat/src/python/fatcat_tools/importers/dblp_release.py:358: UserWarning: unexpected dblp ext_id match after lookup failed dblp=series/gidiss/2018 ident=x6t7ze4z55enrlq2dnac4qqbve

    Counter({'total': 9186263, 'exists': 5356574, 'has-doi': 4960506, 'skip': 3633039, 'skip-key-type': 3037457, 'skip-arxiv-corr': 439104, 'exists-fuzzy': 192376, 'skip-dblp-container-missing': 156477, 'insert': 4216, 'skip-arxiv': 53, 'skip-dblp-id-mismatch': 5, 'skip-title': 1, 'update': 0})

NOTE: had to re-try in the middle, so these counts are not accurate overall.

Seems like a large number of `skip-dblp-container-missing`. Maybe that container map file should have been re-generated differently?

After this import there are 2,217,670 releases with a dblp ID, and 478,983 with a dblp ID and no DOI.

## Sandcrawler Seedlist Generation

Almost none of the ~479k dblp releases with no DOI have an associated file. This implies that no ingest has happened yet, even though the fatcat importer does parse and filter the "fulltext" URLs out of dblp records.
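For context, the release-to-ingest-request transform works roughly like the sketch below. This is not the actual `dblp2ingestrequest.py`: the location of fulltext URLs in the release JSON and all request fields other than `base_url` are assumptions for illustration.

    #!/usr/bin/env python3
    """
    Rough sketch of a release-to-ingest-request transform, NOT the actual
    dblp2ingestrequest.py. Assumes fulltext URLs are stashed under
    release['extra']['dblp']['fulltext_urls']; only 'base_url' is known from
    the jq queries below, the other request fields are guesses.
    """
    import json
    import sys

    def release_to_requests(release: dict):
        # one ingest request per candidate fulltext URL on the release
        for url in release.get("extra", {}).get("dblp", {}).get("fulltext_urls", []):
            yield {
                "base_url": url,
                "ingest_type": "pdf",
                "link_source": "dblp",
                "link_source_id": release.get("ext_ids", {}).get("dblp"),
                "fatcat": {"release_ident": release.get("ident")},
            }

    def main():
        for line in sys.stdin:
            if not line.strip():
                continue
            for request in release_to_requests(json.loads(line)):
                print(json.dumps(request, sort_keys=True))

    if __name__ == "__main__":
        main()

The actual generation run: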
    cat dblp_releases_partial.json \
        | pipenv run ./dblp2ingestrequest.py - \
        | pv -l \
        | gzip \
        > dblp_sandcrawler_ingest_requests.json.gz
    # 631k 0:02:39 [3.96k/s]

    zcat dblp_sandcrawler_ingest_requests.json.gz | jq -r .base_url | cut -f3 -d/ | sort | uniq -c | sort -nr | head -n25
      43851 ceur-ws.org
      33638 aclanthology.org
      32077 aisel.aisnet.org
      31017 ieeexplore.ieee.org
      26426 dl.acm.org
      23817 hdl.handle.net
      22400 www.isca-speech.org
      20072 tel.archives-ouvertes.fr
      18609 www.aaai.org
      18244 eprint.iacr.org
      15720 ethos.bl.uk
      14727 nbn-resolving.org
      14470 proceedings.mlr.press
      14095 dl.gi.de
      12159 proceedings.neurips.cc
      10890 knowledge.amia.org
      10049 www.usenix.org
       9675 papers.nips.cc
       7541 subs.emis.de
       7396 openaccess.thecvf.com
       7345 mindmodeling.org
       6574 ojs.aaai.org
       5814 www.lrec-conf.org
       5773 search.ndltd.org
       5311 ijcai.org

This is the first ingest, so let's do some sampling in the 'daily' queue:

    zcat dblp_sandcrawler_ingest_requests.json.gz \
        | shuf -n100 \
        | rg -v "\\\\" \
        | jq . -c \
        | kafkacat -P -b wbgrp-svc350.us.archive.org -t sandcrawler-prod.ingest-file-requests-daily -p -1

Looks like we can probably get away with doing these in the daily ingest queue, instead of bulk? Try a larger batch:

    zcat dblp_sandcrawler_ingest_requests.json.gz \
        | shuf -n10000 \
        | rg -v "\\\\" \
        | jq . -c \
        | kafkacat -P -b wbgrp-svc350.us.archive.org -t sandcrawler-prod.ingest-file-requests-daily -p -1

Nope, these are going to need bulk ingest, then follow-up crawling. Will heritrix crawl along with JALC and DOAJ stuff.

    zcat dblp_sandcrawler_ingest_requests.json.gz \
        | rg -v "\\\\" \
        | jq . -c \
        | pv -l \
        | kafkacat -P -b wbgrp-svc350.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
    # 631k 0:00:11 [54.0k/s]

TODO:

x python or jq transform of JSON objects
x filter out german book/library URLs (rough sketch below)
x ensure fatcat importer will actually import dblp matches
x test with a small batch in daily or priority queue
- enqueue all in bulk mode, even if processed before? many probably MAG or OAI-PMH previously
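For reference on the "filter out german book/library URLs" item above, a standalone filter over the generated request file might look like the following sketch. The domain skip-list is an assumed example, not the list actually used.

    #!/usr/bin/env python3
    """
    Sketch of a standalone URL filter over ingest request JSON lines (stdin to
    stdout). The domain skip-list below is an assumed example, not the actual
    list used for the dblp seedlist.
    """
    import json
    import sys
    from urllib.parse import urlparse

    # assumed examples of German library/book catalog hosts to skip
    DOMAIN_SKIPLIST = [
        "d-nb.info",
        "deutsche-digitale-bibliothek.de",
    ]

    def main():
        kept = skipped = 0
        for line in sys.stdin:
            if not line.strip():
                continue
            request = json.loads(line)
            host = urlparse(request["base_url"]).netloc.lower()
            if any(host == d or host.endswith("." + d) for d in DOMAIN_SKIPLIST):
                skipped += 1
                continue
            kept += 1
            print(json.dumps(request, sort_keys=True))
        print(f"kept={kept} skipped={skipped}", file=sys.stderr)

    if __name__ == "__main__":
        main()

Usage would be something like `zcat dblp_sandcrawler_ingest_requests.json.gz | ./filter_dblp_domains.py | ...` before enqueueing (script name hypothetical).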