author: Bryan Newbold <bnewbold@archive.org> 2020-03-30 09:49:04 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2020-03-30 09:49:04 -0700
commit: 0a2c5e5c71d920cd2e7634040561a044d9e40d58
tree: 93d5d4be52b54dabcf28384b33ff6705fdc1323c /scrape/README.md
parent: 0cf608debcd672f9a3c54cb8d4ac1caf686ce2e3
update wanfang scrape
Diffstat (limited to 'scrape/README.md')
-rw-r--r-- scrape/README.md | 8
1 files changed, 8 insertions, 0 deletions
diff --git a/scrape/README.md b/scrape/README.md
index bf31fdb..97bb6fe 100644
--- a/scrape/README.md
+++ b/scrape/README.md
@@ -34,3 +34,11 @@ Maybe need to set a referer, something like that?
     ./parse_wanfang_html.py wanfang_papers.2020-03-29.html > wanfang_papers.2020-03-29.json
     ./parse_wanfang_html.py wanfang_guidance.2020-03-29.html > wanfang_guidance.2020-03-29.json
+Download PDFs (without clobbering existing):
+
+    cat wanfang_papers.2020-03-29.json wanfang_guidance.2020-03-29.json | jq .url -r | parallel wget -P fulltext_wanfang --no-clobber {}
+
+    file fulltext_wanfang/* | cut -f2 -d' ' | sort | uniq -c
+        144 HTML
+        609 PDF
+
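
The count above shows that 144 of the downloaded files are HTML rather than PDF. As a possible follow-up (not part of this commit), here is a minimal sketch for listing which files those are, so they can be inspected or re-fetched. The function name `list_non_pdf` and the output filename in the usage comment are hypothetical. The sketch checks the leading `%PDF` magic bytes rather than parsing `file` output, which is an assumption about what counts as a valid download.

```shell
# Hypothetical helper: print every regular file in a directory whose first
# four bytes are not the PDF magic string "%PDF".
list_non_pdf() {
    dir="$1"
    for f in "$dir"/*; do
        # Skip directories and the unexpanded glob if the directory is empty
        [ -f "$f" ] || continue
        # Read the first four bytes; drop NUL bytes so the shell can hold them
        magic=$(head -c 4 "$f" | tr -d '\0')
        if [ "$magic" != "%PDF" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Example usage (paths are assumptions):
#   list_non_pdf fulltext_wanfang > wanfang_non_pdf.txt
```

The resulting list could then be matched back against the `url` fields in the JSON files to see which requests returned an HTML page instead of the paper.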