hotfix for html meta extract codepath

Didn't test last commit before pushing; bad Bryan!
author: Bryan Newbold <bnewbold@archive.org> 2020-05-03 19:38:19 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2020-05-03 19:38:22 -0700
commit: 748678bc88ea31a362ec5e896fd991b3c8dcbe58 (patch)
tree: e444a2f57564697c95bc2628a44bbd4068854139
parent: a61dd0c429b9e6d24987e14cd5d66057adb498da (diff)
download: sandcrawler-748678bc88ea31a362ec5e896fd991b3c8dcbe58.tar.gz sandcrawler-748678bc88ea31a362ec5e896fd991b3c8dcbe58.zip
1 file changed, 1 insertion, 1 deletion
diff --git a/python/sandcrawler/html.py b/python/sandcrawler/html.py
index 6e346e7..3eadc7b 100644
--- a/python/sandcrawler/html.py
+++ b/python/sandcrawler/html.py
@@ -55,7 +55,7 @@ def extract_fulltext_url(html_url, html_body):
         # researchgate does this; maybe others also?
         meta = soup.find('meta', attrs={"property":"citation_pdf_url"})
     # if tag is only partially populated
-    if not meta.get('content'):
+    if meta and not meta.get('content'):
         meta = None
     # wiley has a weird almost-blank page we don't want to loop on
     if meta and not "://onlinelibrary.wiley.com/doi/pdf/" in html_url:
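
Why the one-line change matters: when no matching <meta> tag exists, the lookup returns None, and calling .get() on None raises AttributeError. The sketch below reproduces that guard in isolation. It is not the sandcrawler code: it swaps BeautifulSoup for the standard library's html.parser so that it runs without dependencies, and the names _MetaFinder, find_meta and pdf_url_from_meta are hypothetical helpers invented for illustration.

```python
from html.parser import HTMLParser


class _MetaFinder(HTMLParser):
    """Hypothetical helper: records the attributes of the first matching <meta> tag."""

    def __init__(self, key, value):
        super().__init__()
        self.key = key
        self.value = value
        self.found = None  # stays None when no tag matches

    def handle_starttag(self, tag, attrs):
        if self.found is None and tag == "meta":
            attr_map = dict(attrs)
            if attr_map.get(self.key) == self.value:
                self.found = attr_map


def find_meta(html_body, key, value):
    """Return the attribute dict of the first matching <meta> tag, or None."""
    finder = _MetaFinder(key, value)
    finder.feed(html_body)
    return finder.found


def pdf_url_from_meta(html_body):
    """Return the citation_pdf_url content, or None if it is absent or empty."""
    meta = find_meta(html_body, "property", "citation_pdf_url")
    # Same shape as the patched line: check that meta exists before
    # calling .get() on it, then discard a tag with no usable content.
    if meta and not meta.get("content"):
        meta = None
    return meta["content"] if meta else None
```

Under these assumptions, a page with no citation_pdf_url tag, a tag with no content attribute, and a tag with an empty content attribute all yield None instead of raising.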