aboutsummaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
authorBryan Newbold <bnewbold@archive.org>2020-04-06 10:17:12 -0700
committerBryan Newbold <bnewbold@archive.org>2020-04-06 10:17:12 -0700
commit853bf11d06615f8fef918566833baee2d1105227 (patch)
tree14d64c48ba9ba12f46477029ff11f757f9ab277e
parent6f556a2f4660c6d5d944be5346d48a451c73e61a (diff)
downloadfatcat-covid19-853bf11d06615f8fef918566833baee2d1105227.tar.gz
fatcat-covid19-853bf11d06615f8fef918566833baee2d1105227.zip
update wanfang fulltext scraping
-rw-r--r--extra/scrape/README.md40
1 files changed, 32 insertions, 8 deletions
diff --git a/extra/scrape/README.md b/extra/scrape/README.md
index 97bb6fe..cf9626f 100644
--- a/extra/scrape/README.md
+++ b/extra/scrape/README.md
@@ -1,10 +1,14 @@
+Need beautiful soup; we aren't using pipenv here:
+
+ sudo apt install python3-bs4 python3-lxml
## CNKI List
Base URL: <http://en.gzbd.cnki.net/GZBT/brief/Default.aspx>
2020-03-29: "Found 1914 articles"
+2020-04-06: "Found 2224 articles"
Uses JS to fetch tables, URLs look like:
@@ -12,11 +16,12 @@ Uses JS to fetch tables, URLs look like:
Fetch a bunch:
- seq 0 64 | parallel http get "http://en.gzbd.cnki.net/gzbt/request/otherhandler.ashx?action=gzbdFlag\&contentID=0\&orderStr=1\&page={}\&grouptype=undefined\&groupvalue=undefined" > cnki_tables.html
+ # bump this seq number based on number of articles divided by 30, round up
+ seq 0 77 | parallel http get "http://en.gzbd.cnki.net/gzbt/request/otherhandler.ashx?action=gzbdFlag\&contentID=0\&orderStr=1\&page={}\&grouptype=undefined\&groupvalue=undefined" > metadata/cnki_tables.`date -I`.html
Parse HTML snippets to JSON:
- ./parse_cnki_tables.py > cnki_metadata.json
+ ./extra/scrape/parse_cnki_tables.py metadata/cnki_tables.`date -I`.html > metadata/cnki_metadata.`date -I`.json
The `info_url` seems to work, but the direct PDF download links don't naively.
Maybe need to set a referer, something like that?
@@ -28,17 +33,36 @@ Maybe need to set a referer, something like that?
mark=34 文献速递 Literature Express
mark=38 中医药防治 Prevention and treatment of traditional Chinese medicine
- wget 'http://subject.med.wanfangdata.com.cn/Channel/7?mark=32' -O wanfang_guidance.2020-03-29.html
- wget 'http://subject.med.wanfangdata.com.cn/Channel/7?mark=34' -O wanfang_papers.2020-03-29.html
+ wget 'http://subject.med.wanfangdata.com.cn/Channel/7?mark=32' -O metadata/wanfang_guidance.`date -I`.html
+ wget 'http://subject.med.wanfangdata.com.cn/Channel/7?mark=34' -O metadata/wanfang_papers.`date -I`.html
- ./parse_wanfang_html.py wanfang_papers.2020-03-29.html > wanfang_papers.2020-03-29.json
- ./parse_wanfang_html.py wanfang_guidance.2020-03-29.html > wanfang_guidance.2020-03-29.json
+ ./extra/scrape/parse_wanfang_html.py metadata/wanfang_papers.`date -I`.html > metadata/wanfang_papers.`date -I`.json
+ ./extra/scrape/parse_wanfang_html.py metadata/wanfang_guidance.`date -I`.html > metadata/wanfang_guidance.`date -I`.json
Download PDFs (without clobbering existing):
- cat wanfang_papers.2020-03-29.json wanfang_guidance.2020-03-29.json | jq .url -r | parallel wget -P fulltext_wanfang --no-clobber {}
+ cat metadata/wanfang_papers.`date -I`.json metadata/wanfang_guidance.`date -I`.json | jq .url -r | shuf | parallel wget -P fulltext_wanfang --no-clobber {}
+
+Rename based on mimetype:
+
+ ./bin/fix_extensions.sh fulltext_wanfang
+
+What did we get?
- file fulltext_wanfang/* | cut -f2 -d' ' | sort | uniq -c
+ file fulltext_wanfang/* | awk '{print $2}' | sort | uniq -c
144 HTML
609 PDF
+2020-04-06:
+
+ wc -l metadata/wanfang_*.json
+ 200 metadata/wanfang_guidance.2020-04-06.json
+ 739 metadata/wanfang_papers.2020-04-06.json
+ 939 total
+
+ 144 HTML
+ 679 PDF
+
+ ls fulltext_wanfang | wc -l
+ 823
+