commit    b12d4f0bde96bfe39df1cc94a993da4b25e53304
tree      bbd6d5db7cfa61d76b4f8bfabe80102cf4d07db7
parent    519733b77832ccbf97491a794e7f10884e39acdb
author    Bryan Newbold <bnewbold@robocracy.org>  2022-07-19 12:48:15 -0700
committer Bryan Newbold <bnewbold@robocracy.org>  2022-07-19 12:48:15 -0700
dblp import/update notes
 extra/bulk_edits/2022-07-13_dblp.md | 114
 1 file changed, 114 insertions(+), 0 deletions(-)
diff --git a/extra/bulk_edits/2022-07-13_dblp.md b/extra/bulk_edits/2022-07-13_dblp.md
new file mode 100644
index 00000000..25405132
--- /dev/null
+++ b/extra/bulk_edits/2022-07-13_dblp.md
@@ -0,0 +1,114 @@

## Prep

    2022-07-13 05:24:33 (177 KB/s) - ‘dblp.xml.gz’ saved [715701831/715701831]

    Counter({'total': 9186263, 'skip': 9186263, 'has-doi': 4960506, 'skip-key-type': 3037457, 'skip-arxiv-corr': 439104, 'skip-title': 1, 'insert': 0, 'update': 0, 'exists': 0})
    5.71M 3:37:38 [ 437 /s]

    7.48k 0:38:18 [3.25 /s]


## Container Import

Run 2022-07-15, after a database backup/snapshot.

    export FATCAT_AUTH_WORKER_DBLP=[...]
    ./fatcat_import.py dblp-container \
        --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt \
        --dblp-container-map-file ../extra/dblp/existing_dblp_containers.tsv \
        --dblp-container-map-output ../extra/dblp/all_dblp_containers.tsv \
        ../extra/dblp/dblp_container_meta.json
    # Got 5310 existing dblp container mappings.
    # Counter({'total': 7471, 'exists': 7130, 'insert': 341, 'skip': 0, 'update': 0})

    wc -l existing_dblp_containers.tsv all_dblp_containers.tsv dblp_container_meta.json prefix_list.txt
        5310 existing_dblp_containers.tsv
       12782 all_dblp_containers.tsv
        7471 dblp_container_meta.json
        7476 prefix_list.txt


## Release Import

    export FATCAT_AUTH_WORKER_DBLP=[...]
    ./fatcat_import.py dblp-release \
        --dblp-container-map-file ../extra/dblp/all_dblp_containers.tsv \
        ../extra/dblp/dblp.xml
    # Got 7480 dblp container mappings.

    /1/srv/fatcat/src/python/fatcat_tools/importers/dblp_release.py:358: UserWarning: unexpected dblp ext_id match after lookup failed dblp=conf/gg/X90 ident=gfvkxubvsfdede7ps4af3oa34q
      warnings.warn(warn_str)
    /1/srv/fatcat/src/python/fatcat_tools/importers/dblp_release.py:358: UserWarning: unexpected dblp ext_id match after lookup failed dblp=conf/visalg/X88 ident=lvfyrd3lvva3hjuaaokzyoscmm
      warnings.warn(warn_str)
    /1/srv/fatcat/src/python/fatcat_tools/importers/dblp_release.py:358: UserWarning: unexpected dblp ext_id match after lookup failed dblp=conf/msr/PerumaANMO22 ident=2grlescl2bcpvd5yoc4npad3bm
      warnings.warn(warn_str)
    /1/srv/fatcat/src/python/fatcat_tools/importers/dblp_release.py:358: UserWarning: unexpected dblp ext_id match after lookup failed dblp=conf/dagstuhl/Brodlie97 ident=l6nh222fpjdzfotchu7vfjh6qu
      warnings.warn(warn_str)
    /1/srv/fatcat/src/python/fatcat_tools/importers/dblp_release.py:358: UserWarning: unexpected dblp ext_id match after lookup failed dblp=series/gidiss/2018 ident=x6t7ze4z55enrlq2dnac4qqbve

    Counter({'total': 9186263, 'exists': 5356574, 'has-doi': 4960506, 'skip': 3633039, 'skip-key-type': 3037457, 'skip-arxiv-corr': 439104, 'exists-fuzzy': 192376, 'skip-dblp-container-missing': 156477, 'insert': 4216, 'skip-arxiv': 53, 'skip-dblp-id-mismatch': 5, 'skip-title': 1, 'update': 0})

NOTE: had to re-try in the middle, so these counts are not accurate overall.

There was a surprisingly large number of `skip-dblp-container-missing` skips;
maybe the container map file should have been re-generated differently?

After this import there are 2,217,670 releases with a dblp ID, and 478,983
with a dblp ID and no DOI.


## Sandcrawler Seedlist Generation

Almost none of the ~479k dblp releases with no DOI have an associated file.
This implies that no ingest has happened yet, even though the fatcat importer
does parse and filter the "fulltext" URLs out of dblp records.
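For reference, a minimal sketch of the kind of release-to-ingest-request
mapping that `dblp2ingestrequest.py` performs. This is an assumption-laden
sketch, not the actual script: it assumes the importer stashes the dblp
"ee" (electronic edition) URLs under the release's `extra['dblp']['ee']`,
and of the output fields only `base_url` is confirmed (by the `jq` query
below); the others follow general sandcrawler naming conventions.

    #!/usr/bin/env python3
    """Sketch: fatcat release entities (JSON lines on stdin) -> sandcrawler
    ingest requests (JSON lines on stdout). Field locations are assumptions,
    not verified against the real dblp2ingestrequest.py."""

    import json
    import sys

    def release_to_requests(release: dict):
        """Yield one ingest request per plausible fulltext URL."""
        dblp_id = (release.get("ext_ids") or {}).get("dblp")
        if not dblp_id:
            return
        # assumed location of the dblp 'ee' fulltext URLs on the release
        urls = ((release.get("extra") or {}).get("dblp") or {}).get("ee") or []
        for url in urls:
            if not url.startswith("http"):
                continue
            yield {
                "base_url": url,
                "ingest_type": "pdf",
                "link_source": "dblp",
                "link_source_id": dblp_id,
                "ext_ids": {"dblp": dblp_id},
            }

    if __name__ == "__main__":
        for line in sys.stdin:
            line = line.strip()
            if line:
                for req in release_to_requests(json.loads(line)):
                    print(json.dumps(req, sort_keys=True))

Generating the actual seedlist with the real script: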
    cat dblp_releases_partial.json | pipenv run ./dblp2ingestrequest.py - | pv -l | gzip > dblp_sandcrawler_ingest_requests.json.gz
    # 631k 0:02:39 [3.96k/s]

    zcat dblp_sandcrawler_ingest_requests.json.gz | jq -r .base_url | cut -f3 -d/ | sort | uniq -c | sort -nr | head -n25
      43851 ceur-ws.org
      33638 aclanthology.org
      32077 aisel.aisnet.org
      31017 ieeexplore.ieee.org
      26426 dl.acm.org
      23817 hdl.handle.net
      22400 www.isca-speech.org
      20072 tel.archives-ouvertes.fr
      18609 www.aaai.org
      18244 eprint.iacr.org
      15720 ethos.bl.uk
      14727 nbn-resolving.org
      14470 proceedings.mlr.press
      14095 dl.gi.de
      12159 proceedings.neurips.cc
      10890 knowledge.amia.org
      10049 www.usenix.org
       9675 papers.nips.cc
       7541 subs.emis.de
       7396 openaccess.thecvf.com
       7345 mindmodeling.org
       6574 ojs.aaai.org
       5814 www.lrec-conf.org
       5773 search.ndltd.org
       5311 ijcai.org

This is the first ingest of these URLs, so let's do some sampling in the
'daily' queue:

    zcat dblp_sandcrawler_ingest_requests.json.gz | shuf -n100 | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc350.us.archive.org -t sandcrawler-prod.ingest-file-requests-daily -p -1

Looks like we can probably get away with doing these in the daily ingest
queue, instead of bulk? Try a larger batch:

    zcat dblp_sandcrawler_ingest_requests.json.gz | shuf -n10000 | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc350.us.archive.org -t sandcrawler-prod.ingest-file-requests-daily -p -1

Nope, these are going to need bulk ingest and then follow-up crawling. Will
run a heritrix crawl along with the JALC and DOAJ stuff.

    zcat dblp_sandcrawler_ingest_requests.json.gz | rg -v "\\\\" | jq . -c | pv -l | kafkacat -P -b wbgrp-svc350.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
    # 631k 0:00:11 [54.0k/s]


TODO:
x python or jq transform of JSON objects
x filter out german book/library URLs (see the blocklist sketch below)
x ensure fatcat importer will actually import dblp matches
x test with a small batch in daily or priority queue
- enqueue all in bulk mode, even if processed before? many were probably ingested via MAG or OAI-PMH previously
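For the german book/library URL filtering checked off above, a minimal
sketch of how such a domain blocklist might work. The domains listed here
are illustrative placeholders, not the actual blocklist from
`dblp2ingestrequest.py`:

    from urllib.parse import urlparse

    # hypothetical example hosts to skip; the real list lives in the script
    DOMAIN_BLOCKLIST = [
        "d-nb.info",
        "www.deutsche-digitale-bibliothek.de",
    ]

    def url_is_blocked(url: str) -> bool:
        """True if the URL host matches, or is a subdomain of, a blocked domain."""
        host = urlparse(url).netloc.lower()
        return any(host == d or host.endswith("." + d) for d in DOMAIN_BLOCKLIST)

    assert url_is_blocked("https://d-nb.info/1089955243/34")
    assert not url_is_blocked("https://ceur-ws.org/Vol-1234/paper1.pdf")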