allow <meta property=citation_pdf_url>

at least researchgate does this (!)
author: Bryan Newbold <bnewbold@archive.org> 2020-02-18 23:08:06 -0800
committer: Bryan Newbold <bnewbold@archive.org> 2020-02-18 23:08:09 -0800
commit: e6f2a585868b0277145659b9d653a0288f76f5b6 (patch)
tree: 418a3ce46fa0398a0776eca23c550cbf745edb4a /python
parent: 3d663242e2dc4128bd4613657870e8dd42cac570 (diff)
download: sandcrawler-e6f2a585868b0277145659b9d653a0288f76f5b6.tar.gz
sandcrawler-e6f2a585868b0277145659b9d653a0288f76f5b6.zip
1 files changed, 3 insertions, 0 deletions
diff --git a/python/sandcrawler/html.py b/python/sandcrawler/html.py
index e6f0f69..8e9eb1f 100644
--- a/python/sandcrawler/html.py
+++ b/python/sandcrawler/html.py
@@ -44,6 +44,9 @@ def extract_fulltext_url(html_url, html_body):
     meta = soup.find('meta', attrs={"name":"citation_pdf_url"})
     if not meta:
         meta = soup.find('meta', attrs={"name":"bepress_citation_pdf_url"})
+    if not meta:
+        # researchgate does this; maybe others also?
+        meta = soup.find('meta', attrs={"property":"citation_pdf_url"})
     # wiley has a weird almost-blank page we don't want to loop on
     if meta and not "://onlinelibrary.wiley.com/doi/pdf/" in html_url:
         url = meta['content'].strip()
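
The lookup order in this hunk can be sketched in isolation. This is a minimal illustration, not the sandcrawler implementation: it swaps BeautifulSoup for the standard-library `html.parser` so it runs without dependencies, and the names `MetaCollector`, `LOOKUP_ORDER`, and `find_citation_pdf_url` are hypothetical. It also leaves out the wiley exclusion and, unlike the original, skips a matching tag that has no `content` attribute rather than raising.

```python
from html.parser import HTMLParser

# Attribute/value pairs to try, in the same priority order as the hunk above.
LOOKUP_ORDER = [
    ("name", "citation_pdf_url"),
    ("name", "bepress_citation_pdf_url"),
    ("property", "citation_pdf_url"),  # the fallback this commit adds
]


class MetaCollector(HTMLParser):
    """Collects the attributes of every <meta> tag in document order."""

    def __init__(self):
        super().__init__()
        self.metas = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing <meta ... /> tags.
        if tag == "meta":
            self.metas.append(dict(attrs))


def find_citation_pdf_url(html_body):
    """Hypothetical helper: return the first matching PDF URL, or None."""
    parser = MetaCollector()
    parser.feed(html_body)
    for attr, value in LOOKUP_ORDER:
        for meta in parser.metas:
            content = meta.get("content")
            if meta.get(attr) == value and content:
                return content.strip()
    return None
```

Because the `property` variant is tried last, a page that carries both forms still resolves to the `name` variant, which matches the behavior before this commit.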