author Bryan Newbold <bnewbold@robocracy.org> 2020-12-17 13:55:57 -0800
committer Bryan Newbold <bnewbold@robocracy.org> 2020-12-17 23:03:08 -0800
commit 8c2394a5e0ae73c5d534bed30e339ab5004d11e1
tree 8e3f1564529abb5f3a3fd5890f1e222d787b581f /extra/dblp/README.md
parent 9451b3063c2d446748db74027c40c13ee69c24fb
dblp: script and notes on container metadata generation
Diffstat (limited to 'extra/dblp/README.md')
-rw-r--r-- extra/dblp/README.md 34
1 files changed, 34 insertions, 0 deletions
diff --git a/extra/dblp/README.md b/extra/dblp/README.md
new file mode 100644
index 00000000..d74f8bf9
--- /dev/null
+++ b/extra/dblp/README.md
@@ -0,0 +1,34 @@
+
+This file describes hacks used to import dblp container metadata.
+
+
+## Quick Bootstrap Commands
+
+Starting with a complete dblp.xml (and dblp.dtd) dump, do a dry-run transform
+and dump release entities in JSON; this takes some time:
+
+    ./fatcat_import.py dblp-release /data/dblp/dblp.xml --dump-json-mode > /data/dblp/dblp_releases.json
+
+Next extract the unique set of dblp identifier prefixes, which will be used as
+container identifiers:
+
+    cat /data/dblp/dblp_releases.json | jq -r ._dblp_prefix | grep -v ^null | sort -u > /data/dblp/prefix_list.txt
+
+Then fetch HTML documents from dblp.org for each prefix:
+
+    mkdir -p journals
+    mkdir -p conf
+    mkdir -p series
+
+    shuf /data/dblp/prefix_list.txt | pv -l | parallel -j1 wget -nc -q "https://dblp.org/db/{}/index.html" -O {}.html
+
+    # clean up any failed/empty files, then re-run the above parallel/wget command
+    find . -empty -type f -delete
+
+Using the Python script in this directory, extract metadata from these HTML documents:
+
+    fd -e html | ./dblp_html_extract.py | pv -l > dblp_container_meta.json
+
+This can be imported into fatcat using the dblp-container importer:
+
+    ./fatcat_import.py dblp-container --issn-map-file /data/issn/20200323.ISSN-to-ISSN-L.txt --dblp-container-map-file /data/dblp/existing_dblp_containers.tsv --dblp-container-map-output /data/dblp/all_dblp_containers.tsv dblp_container_meta.json
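
The prefix-extraction and fetch steps in the README can also be sketched in Python, which may help when checking what the shell pipeline is expected to produce. This is a minimal sketch, not part of the commit: it assumes the release dump holds one JSON object per line with an optional `_dblp_prefix` key (as the `jq` step implies), and the function names `unique_dblp_prefixes` and `fetch_plan` are hypothetical.

```python
import json
from typing import Iterable, List, Tuple


def unique_dblp_prefixes(lines: Iterable[str]) -> List[str]:
    """Return the sorted, de-duplicated `_dblp_prefix` values from JSON lines.

    Records with a missing, null, or empty prefix are skipped.
    """
    prefixes = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the dump
        record = json.loads(line)
        prefix = record.get("_dblp_prefix")
        if prefix:
            prefixes.add(prefix)
    return sorted(prefixes)


def fetch_plan(prefix: str) -> Tuple[str, str]:
    """Map a prefix to the (URL, local path) pair used by the wget step.

    The URL pattern and the `<prefix>.html` output path are taken from the
    README commands; nothing is downloaded here.
    """
    url = f"https://dblp.org/db/{prefix}/index.html"
    path = f"{prefix}.html"
    return url, path
```

To use it, pass an open file handle for the release dump to `unique_dblp_prefixes` and write each returned prefix on its own line. Python's `sorted` orders by code point, so the order may differ from `sort -u` under some locales, but the set of prefixes should be the same.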