path: root/python/sandcrawler/html_metadata.py
Commit message | Author | Age | Files | Lines
* ingest: more generic OJS support, including pre-prints | Bryan Newbold | 2022-10-24 | 1 | -6/+22
* ingest: more generic PDF fulltext URL patterns | Bryan Newbold | 2022-10-24 | 1 | -0/+14
* html: worldscientific PDF URL extraction | Bryan Newbold | 2022-10-24 | 1 | -0/+16
* ingest: more PDF fulltext tricks | Bryan Newbold | 2022-07-20 | 1 | -0/+29
* ingest: more PDF fulltext URL patterns | Bryan Newbold | 2022-07-20 | 1 | -0/+42
* html: mangled JSON-in-URL pattern | Bryan Newbold | 2022-07-15 | 1 | -0/+1
* html: fulltext URL prefixes to skip; also fix broken pattern matching | Bryan Newbold | 2022-07-15 | 1 | -4/+19
* HTML ingest: most sub-resource patterns to skip | Bryan Newbold | 2022-07-15 | 1 | -1/+13
* ingest: random site PDF link pattern | Bryan Newbold | 2022-07-12 | 1 | -0/+7
* ingest: doaj.org article landing page access links | Bryan Newbold | 2022-07-12 | 1 | -0/+12
* sandcrawler: additional extracts, mostly OJS | Bryan Newbold | 2022-01-13 | 1 | -1/+23
* ingest: PDF pattern for integrityresjournals.org | Bryan Newbold | 2022-01-13 | 1 | -0/+8
* codespell typos in python (comments) | Bryan Newbold | 2021-11-24 | 1 | -1/+1
* html_meta: actual typo in code (CSS selector) caught by codespell | Bryan Newbold | 2021-11-24 | 1 | -1/+1
* make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 1 | -62/+71
* lint collection membership (last lint for now) | Bryan Newbold | 2021-10-26 | 1 | -9/+9
* more progress on type annotations and linting | Bryan Newbold | 2021-10-26 | 1 | -12/+13
* start handling trivial lint cleanups: unused imports, 'is None', etc | Bryan Newbold | 2021-10-26 | 1 | -3/+3
* make fmt | Bryan Newbold | 2021-10-26 | 1 | -21/+16
* python: isort all imports | Bryan Newbold | 2021-10-26 | 1 | -5/+4
* component ingest support for dataverse files (individual) | Bryan Newbold | 2021-10-15 | 1 | -13/+27
* pdf ingest: journals.uchicago.edu pattern | Bryan Newbold | 2021-10-11 | 1 | -0/+8
* ingest: basic 'component' and 'src' support | Bryan Newbold | 2021-10-04 | 1 | -0/+15
* yet more PDF URL patterns | Bryan Newbold | 2021-09-03 | 1 | -0/+48
* HTML ingest: several more PDF fulltext URL patterns | Bryan Newbold | 2021-09-03 | 1 | -0/+87
* HTML ingest: skip noisy print() statement | Bryan Newbold | 2021-09-03 | 1 | -1/+1
* HTML ingest: more meta-URI prefixes | Bryan Newbold | 2021-08-24 | 1 | -2/+8
* html ingest: skip 'about:blank' | Bryan Newbold | 2021-08-16 | 1 | -0/+3
* ingest PDF extraction updates | Bryan Newbold | 2021-05-21 | 1 | -0/+54
* html ingest: remove whitespace around relative URLs (eg, for d-lib) | Bryan Newbold | 2021-05-21 | 1 | -1/+1
* ingest: handle current degruyter PDF link pattern | Bryan Newbold | 2021-03-26 | 1 | -0/+8
* html: more conservative parsing of element attr | Bryan Newbold | 2020-11-20 | 1 | -2/+4
* html biblio: handle 'content not in attrs' case | Bryan Newbold | 2020-11-12 | 1 | -2/+2
* html: more adblock | Bryan Newbold | 2020-11-08 | 1 | -1/+3
* move fuzzy URL match method to misc | Bryan Newbold | 2020-11-08 | 1 | -0/+2
* move some PDF URL extraction into declarative format | Bryan Newbold | 2020-11-08 | 1 | -9/+149
* html: more extraction patterns; bugfix; skip more crossmark | Bryan Newbold | 2020-11-08 | 1 | -1/+24
* html: small ingest improvements | Bryan Newbold | 2020-11-08 | 1 | -0/+15
* html: pdf and html extract similar to XML | Bryan Newbold | 2020-11-06 | 1 | -20/+30
* initial implementation of HTML ingest in existing worker | Bryan Newbold | 2020-11-04 | 1 | -0/+5
* html: improve XML fulltext extraction for scielo | Bryan Newbold | 2020-11-03 | 1 | -4/+17
* html: some refactoring | Bryan Newbold | 2020-11-03 | 1 | -10/+40
* html: syntax fixes; resolve relative URLs; extract more XML fulltext URLs | Bryan Newbold | 2020-10-30 | 1 | -5/+12
* html: more ingest improvements | Bryan Newbold | 2020-10-30 | 1 | -0/+2
* html: more biblio selectors; resource extraction | Bryan Newbold | 2020-10-29 | 1 | -0/+102
* HTML meta: more from online hunting/research | Bryan Newbold | 2020-10-27 | 1 | -3/+54
* HTML metadata: fix type warnings | Bryan Newbold | 2020-10-27 | 1 | -1/+3
* start HTML metadata extraction code | Bryan Newbold | 2020-10-27 | 1 | -0/+230