author | Bryan Newbold <bnewbold@robocracy.org> | 2020-12-17 22:41:14 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@robocracy.org> | 2020-12-17 23:03:08 -0800 |
commit | 3031aa414932b39f38a6456df2a6f55f0e72dfbe (patch) | |
tree | 8e315544526fde307f9546ab5bf2119617d0f8db /extra/dblp | |
parent | 5c08b407f679674912605d1cece72f916370fe7a (diff) | |
download | fatcat-3031aa414932b39f38a6456df2a6f55f0e72dfbe.tar.gz fatcat-3031aa414932b39f38a6456df2a6f55f0e72dfbe.zip |
dblp: polish HTML scrape/extract pipeline
Diffstat (limited to 'extra/dblp')
-rw-r--r-- | extra/dblp/.gitignore | 3 | ||||
-rw-r--r-- | extra/dblp/Pipfile | 1 | ||||
-rw-r--r-- | extra/dblp/README.md | 15 |
3 files changed, 16 insertions, 3 deletions
diff --git a/extra/dblp/.gitignore b/extra/dblp/.gitignore
index 8847a157..a04dd76e 100644
--- a/extra/dblp/.gitignore
+++ b/extra/dblp/.gitignore
@@ -1,3 +1,6 @@
 conf/
 journals/
 series/
+Pipfile.lock
+*.json
+*.html
diff --git a/extra/dblp/Pipfile b/extra/dblp/Pipfile
index b9ba84f6..a191e76f 100644
--- a/extra/dblp/Pipfile
+++ b/extra/dblp/Pipfile
@@ -4,6 +4,7 @@ verify_ssl = true
 name = "pypi"
 
 [packages]
+selectolax = "*"
 
 [dev-packages]
 
diff --git a/extra/dblp/README.md b/extra/dblp/README.md
index d74f8bf9..f2fd02da 100644
--- a/extra/dblp/README.md
+++ b/extra/dblp/README.md
@@ -1,6 +1,12 @@
 
 This file describes hacks used to import dblp container metadata.
 
+As of December 2020 this is part of the dblp release metadata import pipeline:
+we must have conference and other non-ISSN containers created before running
+the release import. dblp does not publish container-level metadata in a
+structured format (eg, in their dumps), so scraping the HTML is unfortunately
+necessary.
+
 
 ## Quick Bootstrap Commands
 
@@ -12,9 +18,12 @@ and dump release entities in JSON; this takes some time:
 Next extract the unique set of dblp identifier prefixes, which will be used as
 container identifiers:
 
-    cat /data/dblp/dblp_releases.json | jq ._dblp_prefix | grep -v ^none | sort -u > /data/dblp/prefix_list.txt
+    cat /data/dblp/dblp_releases.json | jq ._dblp_prefix -r | grep -v ^null | sort -u > /data/dblp/prefix_list.txt
 
-Then fetch HTML documents from dblp.org for each prefix:
+Then fetch HTML documents from dblp.org for each prefix. Note that currently
+only single-level containers will download successfully, and only journals,
+conf, and series sections. Books, Tech Reports, etc may be nice to include in
+the future.
 
     mkdir -p journals
     mkdir -p conf
@@ -27,7 +36,7 @@ Then fetch HTML documents from dblp.org for each prefix:
 Using the python script in this directory, extract metadata from these HTML
 documents:
 
-    fd .html | ./dblp_html_extract.py | pv -l > dblp_container_meta.json
+    fd html conf/ journals/ series/ | ./dblp_html_extract.py | pv -l > dblp_container_meta.json
 
 This can be imported into fatcat using the dblp-container importer:
 
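
Note on the prefix-extraction change in the README hunk: jq prints a JSON null as `null`, and without `-r` it wraps strings in double quotes, so the old `grep -v ^none` filter could not match anything. The same step can be approximated in Python, which may be easier to test than the shell one-liner. This is only a minimal sketch and not part of the commit. It assumes the release dump holds one JSON object per line, and `unique_prefixes`, `is_single_level`, and `FETCHABLE_SECTIONS` are hypothetical names introduced here for illustration.

```python
import json

# Assumption: only these top-level dblp sections are fetched, following the
# README note about journals, conf, and series.
FETCHABLE_SECTIONS = {"journals", "conf", "series"}


def unique_prefixes(lines):
    """Return the sorted, de-duplicated `_dblp_prefix` values from JSON lines.

    Roughly mirrors `jq ._dblp_prefix -r | grep -v ^null | sort -u`:
    records whose prefix is missing or null are skipped.
    """
    prefixes = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the dump
        prefix = json.loads(line).get("_dblp_prefix")
        if prefix is not None:
            prefixes.add(prefix)
    return sorted(prefixes)


def is_single_level(prefix):
    """True for prefixes shaped like `<section>/<name>` in a fetchable section."""
    parts = prefix.split("/")
    return (
        len(parts) == 2
        and parts[0] in FETCHABLE_SECTIONS
        and bool(parts[1])
    )
```

Passing an open file handle for `dblp_releases.json` to `unique_prefixes` and keeping only the values for which `is_single_level` is true would yield a list comparable to `prefix_list.txt`, restricted to the containers the README says currently download successfully. Python's `sorted` orders by code point, so the ordering may differ from `sort -u` under some locales.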