Diffstat (limited to 'scrape/README.md')
-rw-r--r-- | scrape/README.md | 36 |
1 files changed, 36 insertions, 0 deletions
diff --git a/scrape/README.md b/scrape/README.md
new file mode 100644
index 0000000..bf31fdb
--- /dev/null
+++ b/scrape/README.md
@@ -0,0 +1,36 @@
+
+
+## CNKI List
+
+Base URL: <http://en.gzbd.cnki.net/GZBT/brief/Default.aspx>
+
+2020-03-29: "Found 1914 articles"
+
+Uses JS to fetch tables, URLs look like:
+
+    http://en.gzbd.cnki.net/gzbt/request/otherhandler.ashx?action=gzbdFlag&contentID=0&orderStr=1&page=1&grouptype=undefined&groupvalue=undefined
+
+Fetch a bunch:
+
+    seq 0 64 | parallel http get "http://en.gzbd.cnki.net/gzbt/request/otherhandler.ashx?action=gzbdFlag\&contentID=0\&orderStr=1\&page={}\&grouptype=undefined\&groupvalue=undefined" > cnki_tables.html
+
+Parse HTML snippets to JSON:
+
+    ./parse_cnki_tables.py > cnki_metadata.json
+
+The `info_url` seems to work, but the direct PDF download links don't naively.
+Maybe need to set a referer, something like that?
+
+
+## Wanfang Data
+
+    mark=32  指南与共识  Guidelines and consensus
+    mark=34  文献速递  Literature Express
+    mark=38  中医药防治  Prevention and treatment of traditional Chinese medicine
+
+    wget 'http://subject.med.wanfangdata.com.cn/Channel/7?mark=32' -O wanfang_guidance.2020-03-29.html
+    wget 'http://subject.med.wanfangdata.com.cn/Channel/7?mark=34' -O wanfang_papers.2020-03-29.html
+
+    ./parse_wanfang_html.py wanfang_papers.2020-03-29.html > wanfang_papers.2020-03-29.json
+    ./parse_wanfang_html.py wanfang_guidance.2020-03-29.html > wanfang_guidance.2020-03-29.json
+
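
The README added in this diff guesses that the direct PDF links may need a referer. A minimal sketch of how that could be tried from Python is below. It only builds the paged table URL and a request object carrying a `Referer` header; the helper names `table_url` and `build_request` are hypothetical, and using the CNKI base URL as the referer is an assumption taken from the README's own guess, not something confirmed to work.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint and base page as they appear in the README.
BASE = "http://en.gzbd.cnki.net/gzbt/request/otherhandler.ashx"
# Assumption: the listing page is a plausible Referer value; untested.
REFERER = "http://en.gzbd.cnki.net/GZBT/brief/Default.aspx"


def table_url(page: int) -> str:
    """Build the table-fetch URL for one page (hypothetical helper)."""
    params = {
        "action": "gzbdFlag",
        "contentID": 0,
        "orderStr": 1,
        "page": page,
        "grouptype": "undefined",
        "groupvalue": "undefined",
    }
    # urlencode keeps dict insertion order, so this matches the README URL.
    return f"{BASE}?{urlencode(params)}"


def build_request(url: str, referer: str = REFERER) -> Request:
    """Wrap a URL in a request that sends a Referer header (hypothetical helper)."""
    return Request(url, headers={"Referer": referer})
```

To try it against a PDF link, something like `urllib.request.urlopen(build_request(pdf_url))` could be used, where `pdf_url` stands for a link taken from `cnki_metadata.json`. Whether the header is enough to make the download succeed is not known from the README.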