From 9592902ee082b9590d34db6b905bc57bdfeb3c00 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Tue, 10 Nov 2020 18:53:04 -0800
Subject: add implementation notes about HTML ingest

---
 notes/html_ingest_notes.md | 248 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 248 insertions(+)
 create mode 100644 notes/html_ingest_notes.md

(limited to 'notes')

diff --git a/notes/html_ingest_notes.md b/notes/html_ingest_notes.md
new file mode 100644
index 0000000..6bb876a
--- /dev/null
+++ b/notes/html_ingest_notes.md
@@ -0,0 +1,248 @@
+
+## Current Plan
+
+- selectolax to extract metadata and quickly filter (speed)
+    => eg, differentiate landing pages from fulltext
+    => also embed URLs?
+- trafilatura for fulltext body extract
+- no solution yet for reference parsing
+    => maybe trafilatura XML-TEI parsing, then GROBID?
+    => especially if DOI/identifier/URL is in the reference
+
+
+
+TODO:
+x print/wrap error condition better
+x serialize dates (pydantic)
+x CDX lookup "closest" to capture datetime (or by month)
+x firstmonday no extracted fulltext/XML
+x apply URL base fixup to fulltext URLs
+x XML alternative detection
+- basic ingest worker, kafka topics, persist workers, sql table, etc
+- ingest worker: landing page to actual fulltext (eg, OJS)
+- broken? https://betterexplained.com/articles/colorized-math-equations/
+
+Ponder:
+- CDX lookup older successful captures
+    http://www.altdevblogaday.com/2011/05/17/understanding-the-fourier-transform/
+    => optional filter by status? "reduce" by month/year?
+- detect scope heuristically
+    bepress_is_article_cover_page 1
+    citation_fulltext_world_readable "" (eg, distill)
+- non-success subresource fetches
+    https://www.europenowjournal.org/2020/10/11/a-social-history-of-early-rock-n-roll-in-germany-hamburg-from-burlesque-to-the-beatles-1956-1969/
+- redirects: keep start URL?
+
+Later:
+- XML URL extraction
+    https://www.scielo.br/scielo.php?script=sci_arttext&pid=S0100-19652002000200001&lng=en&nrm=iso&tlng=pt
+
+- selectolax bug? hangs: `css_first("meta['thing']")`
+- youtube embed
+    => download/include actual video file?
+- parse references in citation headers
+- try parsing references in HTML fulltext
+
+## Testing URLs
+
+- PLOS
+    https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0093949
+    TODO: "May 9, 2014"
+    TODO: appendix
+- peerj
+    https://peerj.com/articles/4375/
+- scielo
+    http://scielo.iics.una.py/scielo.php?script=sci_arttext&pid=S1683-98032020000200081&lng=en&nrm=iso&tlng=es
+    bunch of little icon .png, but ok
+    redirect of an image not saved in webcapture
+- wordpress
+    https://www.europenowjournal.org/2020/10/11/a-social-history-of-early-rock-n-roll-in-germany-hamburg-from-burlesque-to-the-beatles-1956-1969/
+    no HTML meta? hrm
+- old OJS
+    (pdf only) http://rjh.folium.ru/index.php/rjh/article/view/1511
+- new OJS
+    https://firstmonday.org/ojs/index.php/fm/article/view/10274/9729
+- plain HTML
+    http://journal.sjdm.org/12/12627/jdm12627.html
+- blogs/essays
+    http://symbolflux.com/lodessay/
+    https://betterexplained.com/articles/colorized-math-equations/
+    https://web.archive.org/web/20120418231513/http://www.altdevblogaday.com/2011/05/17/understanding-the-fourier-transform/
+    https://research.google.com/bigpicture/attacking-discrimination-in-ml/
+    http://www.econgraphs.org/
+- journal homepage (not fulltext)
+- OJS new landing page (not fulltext)
+- OJS old (not fulltext)
+    http://rjh.folium.ru/index.php/rjh/index
+    http://rjh.folium.ru/index.php/rjh/issue/view/106
+    http://rjh.folium.ru/index.php/rjh/article/view/382
+- distill
+    https://distill.pub/2020/bayesian-optimization/
+    https://distill.pub/2018/feature-wise-transformations/
+- youtube video embed
+    http://www.cond.org/persalog.html
+- youtube video direct?
+- github: project README?
+- wikipedia
+
+## Background Research
+
+- scrapy (?)
+- requests-html: can run javascript
+    => good for metadata extraction?
+- selectolax
+- scrapely: give HTML and extracted text, it builds the parser
+    => good for difficult one-off cases?
+- https://rushter.com/blog/python-fast-html-parser/
+- WET generation from WARC, a la common crawl
+- https://towardsdatascience.com/categorizing-world-wide-web-c130abd9b717
+
+Other random stuff:
+- distilBERT: most BERT accuracy, 0.4 factor latency (faster)?
+    https://medium.com/huggingface/distilbert-8cf3380435b5
+- htmldate: finds "date of publication" for a document
+- adblockparser
+    => good as a filter in HTML ingest
+- w3lib: utility library. unicode conversion; cleanups; etc
+- courlan: clean/normalize/sample large URL lists
+    => https://github.com/adbar/courlan
+
+### Main Text Extraction
+
+Things to try:
+
+- newspaper3k
+    => basic article extraction. lxml
+- trafilatura
+    => TEI-XML output!
+    => looks very promising
+    => falls back to readability and justext
+- python-readability
+    => improved vs newspaper?
+- dragnet
+- eatiht
+- jusText
+- inscriptis
+    => emphasis on shape/readability of text output? compare with lynx
+- Goose3
+    => metadata and article text
+- news-please
+    => very full-featured. built on scrapy, newspaper, readability
+    => can iterate over common crawl?
+- html2text
+    => actually HTML-to-markdown; no or little "boilerplate removal"
+- boilerpipe (Java)
+    boilerpipe3 (wrapper)
+    boilerpy3 (port)
+
+Comparisons and articles:
+
+- https://www.diffbot.com/benefits/comparison/
+- https://github.com/scrapinghub/article-extraction-benchmark
+    - https://github.com/scrapinghub/article-extraction-benchmark/releases/download/v1.0.0/paper-v1.0.0.pdf
+- https://github.com/rundimeco/waddle
+
+- https://moz.com/devblog/benchmarking-python-content-extraction-algorithms-dragnet-readability-goose-and-eatiht
+- https://hal.archives-ouvertes.fr/hal-02768510v3/document (fr; June 2020)
+    https://translate.google.com/translate?sl=auto&tl=en&u=https%3A%2F%2Fhal.archives-ouvertes.fr%2Fhal-02768510v3%2Fdocument
+- http://eprints.fri.uni-lj.si/1718/1/Kovacic-1.pdf (2012)
+- "Generic Web Content Extraction with Open-Source Software" (2020; trafilatura)
+- "Out-of-the-Box and Into the Ditch? Multilingual Evaluation of Generic Text Extraction Tools"
+    https://hal.archives-ouvertes.fr/hal-02732851/document
+    very on-topic
+- https://cloud.google.com/blog/products/gcp/problem-solving-with-ml-automatic-document-classification
+
+### Reference/Citation Extraction
+
+"Locating and parsing bibliographic references in HTML medical articles"
+https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2903768/
+
+cb2bib (in debian/ubuntu)
+
+
+### Metadata Extraction
+
+OJS 3.x seems to have `citation_fulltext_html_url`. Annoyingly, has an iframe.
+
+http://documents.clockss.org/index.php/LOCKSS:_Extracting_Bibliographic_Metadata
+
+https://blog.dshr.org/2013/04/talk-on-lockss-metadata-extraction-at.html
+
+"OXPath": declarative XPath extension for scraping metadata
+https://journal.code4lib.org/articles/13007
+
+
+## newspaper3k experimentation
+
+    import newspaper
+
+    import nltk
+    nltk.download('punkt')
+
+    # First Monday (OJS) fulltext
+    monday = newspaper.Article("https://firstmonday.org/ojs/index.php/fm/article/download/10274/9729?inline=1")
+    # => ugh, iframe
+    monday.download()
+    monday.parse() # several seconds
+
+    monday.title
+    # Surveillance, stigma and sociotechnical design for HIV
+    monday.text
+    # reasonable; similar to pdftotext?
+    monday.authors
+    # empty
+    monday.images
+    # reasonable?
+
+    nih = newspaper.Article('https://www.nlm.nih.gov/pubs/techbull/ja02/ja02_locatorplus_merge.html')
+    nih.download()
+    nih.parse()
+    nih.nlp()
+
+    nih.title
+    # Migration of Monographic Citations to LocatorPlus: Merge Project. NLM Technical Bulletin. Jul-Aug 2002
+    # duplicate journal name in title
+    nih.authors
+    # none
+    nih.text
+    # Ok. missing first character, weirdly
+
+    genders = newspaper.Article('https://web.archive.org/web/20141230080932id_/http://www.genders.org/g58/g58_fairlie.html')
+    genders.download()
+    genders.parse()
+
+    genders.title
+    # Presenting innovative theories in art, literature, history, music, TV and film.
+    # nope: this is title of the journal
+
+    genders.text
+    # Ok. includes title and author in the body.
+
+    dlib = newspaper.Article('http://www.dlib.org/dlib/may17/vanhyning/05vanhyning.html')
+    dlib.download()
+    dlib.parse()
+
+    dlib.title
+    # Transforming Libraries and Archives through Crowdsourcing
+    dlib.authors
+    # none
+    dlib.text
+    # some other junk, but main body there
+
+## trafilatura experimentation
+
+    trafilatura --json -u 'http://www.dlib.org/dlib/may17/vanhyning/05vanhyning.html' | jq .
+
+    trafilatura --xmltei -u 'http://www.dlib.org/dlib/may17/vanhyning/05vanhyning.html'
+
+Does not work with `first_monday_ojs_inline`?
+
+May need to test/compare more.
+
+Examples/bugs:
+
+    http://web.archive.org/web/20081120141035id_/http://www.mundanebehavior.org/issues/v5n1/jones.htm
+    poor title detection
+
+    generally, author detection not great.
+    not, apparently, using detection of dc.authors etc
-- 
cgit v1.2.3
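
Illustration, not part of the patch: the "detect scope heuristically" item under Ponder names two meta tags (`bepress_is_article_cover_page`, `citation_fulltext_world_readable`), and the Metadata Extraction notes mention `citation_fulltext_html_url` on OJS 3.x. A minimal sketch of that kind of check might look like the following. It uses the standard-library `html.parser` instead of selectolax so it runs without extra dependencies. The names `MetaTagCollector` and `guess_scope`, the scope labels, and the mapping from each tag to a label are all assumptions made for illustration, not something the notes specify.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name")
        if name and name not in self.meta:
            # keep the first value seen; a missing content attribute becomes ""
            self.meta[name] = attr.get("content") or ""


def guess_scope(html):
    """Hypothetical heuristic: label a page using only its meta tags."""
    collector = MetaTagCollector()
    collector.feed(html)
    meta = collector.meta
    # assumption: a bepress cover page flag of "1" means this is not the fulltext
    if meta.get("bepress_is_article_cover_page") == "1":
        return "landing-page"
    # assumption: the tag being present at all (even empty, as on distill)
    # signals that this page is the article itself
    if "citation_fulltext_world_readable" in meta:
        return "article-fulltext"
    # assumption: a pointer to an HTML fulltext URL means we are on a landing page
    if "citation_fulltext_html_url" in meta:
        return "landing-page"
    return "unknown"
```

The order of the checks is a design guess: the explicit cover-page flag wins over the weaker signals, and anything without these tags (eg, the wordpress case with no HTML meta) falls through to "unknown" rather than being forced into a category.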