sandcrawler - [no description]

	Commit message	Author	Age	Files	Lines
*	wrap up previous renaming work	Bryan Newbold	2021-10-15	1	-1/+1
\|
*	refactor and expand wall/block/cookie URL patterns	Bryan Newbold	2021-09-03	1	-0/+14
\|
*	move some PDF URL extraction into declarative format	Bryan Newbold	2020-11-08	2	-9/+3
\|
*	xml: re-encode XML docs into UTF-8 for persisting	Bryan Newbold	2020-11-03	2	-0/+354
\|
*	html: some refactoring	Bryan Newbold	2020-11-03	1	-1/+1
\|
*	html: syntax fixes; resolve relative URLs; extract more XML fulltext URLs	Bryan Newbold	2020-10-30	1	-7/+8
\|
*	html: work around firstmonday DOCTYPE issue	Bryan Newbold	2020-10-30	2	-0/+455
\|
*	tests: fix conditional on poppler version check	Bryan Newbold	2020-10-30	1	-1/+1
\|
*	improve test running and config	Bryan Newbold	2020-10-29	1	-0/+2
\|
*	html: more metadata tests	Bryan Newbold	2020-10-29	2	-0/+2453
\|
*	HTML metadata: fix type warnings	Bryan Newbold	2020-10-27	1	-1/+2
\|
*	start HTML metadata extraction code	Bryan Newbold	2020-10-27	5	-0/+2628
\|
*	check for simple URL patterns that are usually paywalls or loginwalls	Bryan Newbold	2020-08-11	1	-0/+18
\|
*	fix tests passing str as HTML	Bryan Newbold	2020-08-08	1	-3/+3
\|
*	another bad/non PDF test; catch correct error	Bryan Newbold	2020-06-25	1	-0/+5
\| \| \| \| \| \|	This test doesn't actually catch the error. I'm not sure why type checks don't discover the "LockedDocumentError not part of poppler" issue though.
*	pdfextract support in ingest worker	Bryan Newbold	2020-06-25	1	-0/+7
\|
*	fix tests for page0_height/width	Bryan Newbold	2020-06-25	1	-2/+2
\|
*	lint fixes	Bryan Newbold	2020-06-17	1	-1/+1
\|
*	rename pdf tools to pdfextract	Bryan Newbold	2020-06-17	1	-0/+0
\|
*	partial test coverage of pdf extract worker	Bryan Newbold	2020-06-17	1	-0/+61
\|
*	remove unused common.py	Bryan Newbold	2020-06-17	1	-40/+0
\|
*	url cleaning (canonicalization) for ingest base_url	Bryan Newbold	2020-03-10	1	-1/+7
\| \| \| \| \| \| \| \| \| \| \|	As mentioned in the comment, this first version does not rewrite the URL in the `base_url` field. If we did so, ingest_request rows would no longer SQL JOIN to ingest_file_result rows, which we wouldn't want. In the future, the behaviour should perhaps be to refuse to process URLs that aren't clean (e.g., if base_url != clean_url(base_url)) and return a 'bad-url' status or something similar. Then we would only accept clean URLs in both tables, and clear out all old/bad URLs with a cleanup script.
*	ingest: add URL blocklist feature	Bryan Newbold	2020-01-17	1	-0/+17
\| \| \| \|	And, temporarily, block zenodo and figshare.
*	clarify ingest result schema and semantics	Bryan Newbold	2020-01-15	2	-3/+21
\|
*	add postgrest checks to test mocks	Bryan Newbold	2020-01-14	1	-1/+9
\|
*	tests: don't use localhost as a responses mock host	Bryan Newbold	2020-01-14	2	-6/+6
\|
*	SPNv2 doesn't support FTP; add a live test for non-revisit FTP	Bryan Newbold	2020-01-14	1	-0/+16
\|
*	more ftp status 226 support	Bryan Newbold	2020-01-14	3	-3/+9
\|
*	add live tests for ftp, revisits	Bryan Newbold	2020-01-14	1	-1/+36
\|
*	more live tests (for regressions)	Bryan Newbold	2020-01-10	1	-0/+41
\|
*	refactor ingest to a loop, allowing multiple hops	Bryan Newbold	2020-01-09	1	-2/+9
\|
*	add (skipped) live tests for wayback services	Bryan Newbold	2020-01-09	1	-0/+73
\|
*	add ingest test file	Bryan Newbold	2020-01-09	1	-0/+120
\| \| \| \|	Forgot to commit earlier!
*	lots of progress on wayback refactoring	Bryan Newbold	2020-01-09	1	-1/+7
\| \| \| \| \| \|	- too much to list - canonical flags to control crawling - cdx_to_dict helper
*	location comes as a string, not list	Bryan Newbold	2020-01-09	1	-4/+4
\|
*	wrap up basic (locally testable) ingest refactor	Bryan Newbold	2020-01-09	1	-4/+48
\|
*	basic elife+plos extraction tests	Bryan Newbold	2020-01-09	3	-0/+4842
\| \| \| \| \|	Ripped out some HTML, but these could have been minimized even further to keep the repository from growing large.
*	fix grobid test (ISO-8859-1 encoding)	Bryan Newbold	2020-01-09	1	-6/+4
\| \| \| \|	Also changes for wayback refactor
*	fix grobid tests for new wayback refactors	Bryan Newbold	2020-01-09	2	-12/+14
\|
*	more wayback and SPN tests and fixes	Bryan Newbold	2020-01-09	2	-13/+67
\|
*	refactor CdxApiClient, add tests	Bryan Newbold	2020-01-08	1	-0/+110
\| \| \| \| \| \|	- always use auth token and get full CDX rows - simplify to "fetch" (exact url/dt match) and "lookup_best" methods - all redirect stuff will be moved to a higher level
*	refactor SavePaperNowClient and add test	Bryan Newbold	2020-01-07	1	-0/+160
\| \| \| \| \| \|	- response as a namedtuple - "remote" errors (aka, SPN API was HTTP 200 but returned error) aren't an exception
*	teixml2json test update for skipping null JSON keys	Bryan Newbold	2020-01-02	1	-10/+1
\|
*	grobid2json: language_code	Bryan Newbold	2019-10-04	1	-1/+2
\|
*	python tests for pusher classes	Bryan Newbold	2019-10-02	2	-0/+28
\|
*	add tests for affiliation extraction	Bryan Newbold	2019-10-02	2	-1/+25
\|
*	lots of grobid tool implementation (still WIP)	Bryan Newbold	2019-09-26	2	-7/+29
\|
*	test of GROBID client	Bryan Newbold	2019-09-25	1	-0/+53
\|
*	refactor old python hadoop code into new directory	Bryan Newbold	2019-09-25	4	-591/+0
\|
*	re-write parse_cdx_line for sandcrawler lib	Bryan Newbold	2019-09-25	1	-1/+31
\|