# HTML Ingest Notes

## Current Plan

- selectolax to extract metadata and quickly filter (speed)
  => eg, differentiate landing pages from fulltext
  => also embed URLs?
- trafilatura for fulltext body extract
- no solution yet for reference parsing
  => maybe trafilatura XML-TEI parsing, then GROBID?
  => especially if DOI/identifier/URL is in the reference

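A minimal sketch of the fast metadata pass; stdlib `html.parser` stands in for selectolax here so the snippet is self-contained, and the meta-tag prefixes are a guess at what matters for landing-page vs fulltext filtering:

```python
# Sketch of the quick metadata pass. In practice selectolax would
# replace the stdlib parser; the prefix list below is an assumption.
from html.parser import HTMLParser

META_PREFIXES = ("citation_", "dc.", "og:", "bepress_")

class MetaCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        d = dict(attrs)
        name = (d.get("name") or d.get("property") or "").lower()
        if name.startswith(META_PREFIXES):
            self.meta[name] = d.get("content")

def quick_meta(html: str) -> dict:
    parser = MetaCollector()
    parser.feed(html)
    return parser.meta
```

The resulting dict is enough for cheap scope heuristics, eg checking for `bepress_is_article_cover_page` or an empty `citation_fulltext_world_readable`.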
TODO:
x print/wrap error condition better
x serialize dates (pydantic)
x CDX lookup "closest" to capture datetime (or by month)
x firstmonday no extracted fulltext/XML
x apply URL base fixup to fulltext URLs
x XML alternative detection
x basic ingest worker, kafka topics, persist workers, sql table, etc
- ingest worker: landing page to actual fulltext (eg, OJS)
- broken? https://betterexplained.com/articles/colorized-math-equations/

Ponder:
- CDX lookup older successful captures
  http://www.altdevblogaday.com/2011/05/17/understanding-the-fourier-transform/
  => optional filter by status? "reduce" by month/year?
- detect scope heuristically
  bepress_is_article_cover_page 1
  citation_fulltext_world_readable "" (eg, distill)
- non-success subresource fetches
  https://www.europenowjournal.org/2020/10/11/a-social-history-of-early-rock-n-roll-in-germany-hamburg-from-burlesque-to-the-beatles-1956-1969/
- redirects: keep start URL?

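The CDX lookups above could be built something like this (endpoint and parameters are the public wayback CDX API; `collapse=timestamp:6` buckets captures by month, and whether `closest`/`sort=closest` is honored may depend on the deployment):

```python
# Sketch: build a CDX API query that filters to successful captures
# and "reduces" to one capture per month (collapse on the first six
# timestamp digits, YYYYMM).
from typing import Optional
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(url: str, closest: Optional[str] = None) -> str:
    params = {
        "url": url,
        "output": "json",
        "filter": "statuscode:200",
        "collapse": "timestamp:6",  # one capture per YYYYMM
    }
    if closest:
        params["closest"] = closest
        params["sort"] = "closest"
    return CDX_ENDPOINT + "?" + urlencode(params)
```
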
Later:
- XML URL extraction
  https://www.scielo.br/scielo.php?script=sci_arttext&pid=S0100-19652002000200001&lng=en&nrm=iso&tlng=pt
  <a href="http://www.scielo.br/scieloOrg/php/articleXML.php?pid=S0100-19652002000200001&amp;lang=en" rel="nofollow" target="xml">
- selectolax bug? hangs: `css_first("meta['thing']")`
- youtube embed
  => download/include actual video file?
- parse references in citation headers
- try parsing references in HTML fulltext

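For the XML URL extraction item, the scielo anchor above suggests a simple scan; note the `articleXML` substring heuristic is a guess generalized from that one example:

```python
import re
from html import unescape

# Crude regex for an XML-alternative anchor like scielo's
# articleXML.php link; heuristic from a single observed example.
XML_LINK_RE = re.compile(r'<a[^>]+href="([^"]*articleXML[^"]*)"', re.I)

def find_xml_url(html: str):
    m = XML_LINK_RE.search(html)
    return unescape(m.group(1)) if m else None
```
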
## Testing URLs

- PLOS
  https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0093949
  TODO: "May 9, 2014"
  TODO: appendix
- peerj
  https://peerj.com/articles/4375/
- scielo
  http://scielo.iics.una.py/scielo.php?script=sci_arttext&pid=S1683-98032020000200081&lng=en&nrm=iso&tlng=es
  bunch of little icon .png files, but ok
  redirect of an image not saved in webcapture
- wordpress
  https://www.europenowjournal.org/2020/10/11/a-social-history-of-early-rock-n-roll-in-germany-hamburg-from-burlesque-to-the-beatles-1956-1969/
  no HTML meta? hrm
- old OJS
  (pdf only) http://rjh.folium.ru/index.php/rjh/article/view/1511
- new OJS
  https://firstmonday.org/ojs/index.php/fm/article/view/10274/9729
- plain HTML
  http://journal.sjdm.org/12/12627/jdm12627.html
- blogs/essays
  http://symbolflux.com/lodessay/
  https://betterexplained.com/articles/colorized-math-equations/
  https://web.archive.org/web/20120418231513/http://www.altdevblogaday.com/2011/05/17/understanding-the-fourier-transform/
  https://research.google.com/bigpicture/attacking-discrimination-in-ml/
  http://www.econgraphs.org/
- journal homepage (not fulltext)
- OJS new landing page (not fulltext)
- OJS old (not fulltext)
  http://rjh.folium.ru/index.php/rjh/index
  http://rjh.folium.ru/index.php/rjh/issue/view/106
  http://rjh.folium.ru/index.php/rjh/article/view/382
- distill
  https://distill.pub/2020/bayesian-optimization/
  https://distill.pub/2018/feature-wise-transformations/
- youtube video embed
  http://www.cond.org/persalog.html
- youtube video direct?
- github: project README?
- wikipedia

## Background Research

- scrapy (?)
- requests-html: can run javascript
  => good for metadata extraction?
- selectolax
- scrapely: give it HTML plus the extracted text, and it builds the parser
  => good for difficult one-off cases?
- https://rushter.com/blog/python-fast-html-parser/
- WET generation from WARC, a la Common Crawl
- https://towardsdatascience.com/categorizing-world-wide-web-c130abd9b717

Other random stuff:
- DistilBERT: most of BERT's accuracy at ~0.4x latency (faster)?
  https://medium.com/huggingface/distilbert-8cf3380435b5
- htmldate: finds "date of publication" for a document
- adblockparser
  => good as a filter in HTML ingest
- w3lib: utility library. unicode conversion; cleanups; etc
- courlan: clean/normalize/sample large URL lists
  => https://github.com/adbar/courlan

### Main Text Extraction

Things to try:

- newspaper3k
  => basic article extraction. lxml
- trafilatura
  => TEI-XML output!
  => looks very promising
  => falls back to readability and justext
- python-readability
  => improved vs newspaper?
- dragnet
- eatiht
- jusText
- inscriptis
  => emphasis on shape/readability of text output? compare with lynx
- Goose3
  => metadata and article text
- news-please
  => very full-featured. built on scrapy, newspaper, readability
  => can iterate over Common Crawl?
- html2text
  => actually HTML-to-markdown; no or little "boilerplate removal"
- boilerpipe (Java)
  boilerpipe3 (wrapper)
  boilerpy3 (port)

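One way to work through the list above: wrap each candidate as a plain `html -> text` callable and score it against a gold extraction. The wrappers and the Jaccard score here are assumptions for a quick harness, not any tool's real API:

```python
# Hypothetical side-by-side comparison harness for extractors.
from typing import Callable, Dict

def score(extracted: str, gold: str) -> float:
    """Crude token-overlap (Jaccard) between an extraction and gold text."""
    a, b = set(extracted.split()), set(gold.split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

def compare(extractors: Dict[str, Callable[[str], str]],
            html: str, gold: str) -> Dict[str, float]:
    # `or ""` guards extractors that return None on failure
    return {name: score(fn(html) or "", gold) for name, fn in extractors.items()}
```

Each real tool would plug in as, eg, `{"trafilatura": trafilatura.extract, ...}`.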
Comparisons and articles:

- https://www.diffbot.com/benefits/comparison/
- https://github.com/scrapinghub/article-extraction-benchmark
  - https://github.com/scrapinghub/article-extraction-benchmark/releases/download/v1.0.0/paper-v1.0.0.pdf
- https://github.com/rundimeco/waddle
- https://moz.com/devblog/benchmarking-python-content-extraction-algorithms-dragnet-readability-goose-and-eatiht
- https://hal.archives-ouvertes.fr/hal-02768510v3/document (French; June 2020)
  https://translate.google.com/translate?sl=auto&tl=en&u=https%3A%2F%2Fhal.archives-ouvertes.fr%2Fhal-02768510v3%2Fdocument
- http://eprints.fri.uni-lj.si/1718/1/Kovacic-1.pdf (2012)
- "Generic Web Content Extraction with Open-Source Software" (2020; trafilatura)
- "Out-of-the-Box and Into the Ditch? Multilingual Evaluation of Generic Text Extraction Tools"
  https://hal.archives-ouvertes.fr/hal-02732851/document
  very on-topic
- https://cloud.google.com/blog/products/gcp/problem-solving-with-ml-automatic-document-classification

### Reference/Citation Extraction

"Locating and parsing bibliographic references in HTML medical articles"
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2903768/

cb2bib (packaged in Debian/Ubuntu)

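Until full reference parsing lands, a cheap pass can pull DOIs straight out of reference strings (per the plan above: most useful when the DOI/identifier/URL is in the reference). The regex approximates the commonly recommended DOI pattern and over-matches trailing punctuation, hence the rstrip:

```python
import re

# Approximate DOI pattern: "10.", a 4-9 digit registrant, "/", suffix.
DOI_RE = re.compile(r'10\.\d{4,9}/[^\s"<>]+')

def find_dois(text: str) -> list:
    return [m.rstrip(".,;") for m in DOI_RE.findall(text)]
```
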
### Metadata Extraction

OJS 3.x seems to have `citation_fulltext_html_url`. Annoyingly, that page wraps the actual fulltext in an iframe.

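A sketch of unwrapping that iframe: fetch `citation_fulltext_html_url`, then pull the iframe `src` for the real body. Stdlib-only; the first-iframe assumption is based on the OJS pages seen so far:

```python
# Sketch: find the src of the first <iframe> on an OJS fulltext page.
from html.parser import HTMLParser

class IframeSrcFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        if tag == "iframe" and self.src is None:
            self.src = dict(attrs).get("src")

def find_iframe_src(html: str):
    parser = IframeSrcFinder()
    parser.feed(html)
    return parser.src
```

The returned URL would then be resolved against the page URL and fetched as the actual fulltext candidate.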
http://documents.clockss.org/index.php/LOCKSS:_Extracting_Bibliographic_Metadata

https://blog.dshr.org/2013/04/talk-on-lockss-metadata-extraction-at.html

"OXPath": declarative XPath extension for scraping metadata
https://journal.code4lib.org/articles/13007

## newspaper3k experimentation

    import newspaper

    import nltk
    nltk.download('punkt')

    # First Monday (OJS) fulltext
    monday = newspaper.Article("https://firstmonday.org/ojs/index.php/fm/article/download/10274/9729?inline=1")
    # => ugh, iframe
    monday.download()
    monday.parse()  # several seconds

    monday.title
    # Surveillance, stigma and sociotechnical design for HIV
    monday.text
    # reasonable; similar to pdftotext?
    monday.authors
    # empty
    monday.images
    # reasonable?

    nih = newspaper.Article('https://www.nlm.nih.gov/pubs/techbull/ja02/ja02_locatorplus_merge.html')
    nih.download()
    nih.parse()
    nih.nlp()

    nih.title
    # Migration of Monographic Citations to LocatorPlus: Merge Project. NLM Technical Bulletin. Jul-Aug 2002
    # duplicate journal name in title
    nih.authors
    # none
    nih.text
    # ok, but missing first character, weirdly

    genders = newspaper.Article('https://web.archive.org/web/20141230080932id_/http://www.genders.org/g58/g58_fairlie.html')
    genders.download()
    genders.parse()

    genders.title
    # Presenting innovative theories in art, literature, history, music, TV and film.
    # nope: this is the title of the journal

    genders.text
    # ok; includes title and author in the body

    dlib = newspaper.Article('http://www.dlib.org/dlib/may17/vanhyning/05vanhyning.html')
    dlib.download()
    dlib.parse()

    dlib.title
    # Transforming Libraries and Archives through Crowdsourcing
    dlib.authors
    # none
    dlib.text
    # some other junk, but main body there

## trafilatura experimentation

    trafilatura --json -u 'http://www.dlib.org/dlib/may17/vanhyning/05vanhyning.html' | jq .

    trafilatura --xmltei -u 'http://www.dlib.org/dlib/may17/vanhyning/05vanhyning.html'

Does not work with `first_monday_ojs_inline`?

May need to test/compare more.

Examples/bugs:

    http://web.archive.org/web/20081120141035id_/http://www.mundanebehavior.org/issues/v5n1/jones.htm
    poor title detection

Generally, author detection is not great; it does not appear to use dc.author and similar metadata.

## Prod Deployment Notes (2020-12-14)

Created `html_meta` table in `sandcrawler-db`.

Updated ansible roles to deploy persist and import workers, then ran the roles
and enabled:

- sandcrawler database (aitio)
  - sandcrawler-persist-ingest-file-worker@1: restarted
- blobs (wbgrp-svc169)
  - sandcrawler-persist-html-teixml-worker@1: started and enabled
  - sandcrawler-persist-xml-doc-worker@1: started and enabled
- fatcat prod worker (wbgrp-svc502)
  - fatcat-import-ingest-web-worker: started and enabled

Test some d-lib and First Monday ingests:

    # d-lib
    ./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org --ingest-type html --limit 50 container --container-id ugbiirfvufgcjkx33r3cmemcuu
    => Counter({'estimate': 803, 'ingest_request': 50, 'elasticsearch_release': 50, 'kafka': 50})

    # First Monday
    ./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org --ingest-type html --limit 50 container --container-id svz5ul6qozdjhjhk7d627avuja

Starting coverage:

    d-lib: 253 / 1056 preserved (https://fatcat.wiki/container/ugbiirfvufgcjkx33r3cmemcuu/coverage)

Initially, `fatcat-import-ingest-web-worker` is seeing these but doesn't seem
to be importing.

    # postgresql shell
    select sha1hex, updated, status, scope, has_teixml, has_thumbnail, word_count from html_meta;
    => initially has_teixml is false for all
    => fixed in an update

    # weed shell
    > fs.ls /buckets/sandcrawler/html_body
    [...]
    > fs.cat /buckets/sandcrawler/html_body/77/75/7775adf8c7e19151bbe887bfa08a575483291d7c.tei.xml
    [looks like fine TEI-XML]

Going to debug the import issue by dumping results to disk and importing manually
(best way to see counts):

    kafkacat -C -b wbgrp-svc284.us.archive.org:9092 -t sandcrawler-prod.ingest-file-results -o -10 | rg html | head -n10 | jq . -c > web_ingest_results.json

    export FATCAT_AUTH_WORKER_CRAWL=[...]
    ./fatcat_import.py ingest-web-results web_ingest_results.json
    => Counter({'total': 10, 'skip-update-disabled': 9, 'skip': 1, 'skip-hit': 1, 'insert': 0, 'update': 0, 'exists': 0})

    # did some patching (f7a75a01), then re-ran twice and got:
    => Counter({'total': 10, 'insert': 9, 'skip': 1, 'skip-hit': 1, 'update': 0, 'exists': 0})
    => Counter({'total': 10, 'exists': 9, 'skip': 1, 'skip-hit': 1, 'insert': 0, 'update': 0})

    # looks good!

Re-ingesting all of d-lib:

    ./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org --ingest-type html container --container-id ugbiirfvufgcjkx33r3cmemcuu
    => Expecting 803 release objects in search queries
    => Counter({'ingest_request': 803, 'elasticsearch_release': 803, 'estimate': 803, 'kafka': 803})

TODO:

- release ES transform isn't counting these as `in_ia` or preserved (code-only change)
- no indication in search results (ES schema change)
- ingest tool should probably look at `in_ia_html` or `in_ia_pdf` for PDF/XML queries (or a `types_in_ia` list?)