sandcrawler - [no description]

	Commit message	Author	Age	Files	Lines
*	shorten default HTTP backoff factor	Bryan Newbold	2022-07-13	1	-1/+1
\| \| \| \| \|	The existing factor was resulting in many-minute-long backoffs and Kafka timeouts.
*	ingest: random site PDF link pattern	Bryan Newbold	2022-07-12	1	-0/+7
\|
*	ingest: doaj.org article landing page access links	Bryan Newbold	2022-07-12	2	-1/+12
\|
*	ingest: IEEE domain is blocking us	Bryan Newbold	2022-07-07	1	-1/+2
\|
*	ingest: catch more ConnectionErrors (SPN, replay fetch, GROBID)	Bryan Newbold	2022-05-16	2	-4/+19
\|
*	ingest: skip arxiv.org DOIs, we already direct-ingest	Bryan Newbold	2022-05-11	1	-0/+1
\|
*	ingest spn2: fix tests	Bryan Newbold	2022-05-05	2	-1/+2
\|
*	ingest: more loginwall patterns	Bryan Newbold	2022-05-05	1	-0/+3
\|
*	SPNv2: several fixes for prod throughput	Bryan Newbold	2022-04-26	1	-11/+34
\| \| \| \| \| \| \| \| \| \|	Most importantly, for some API flags, if the value is not truthy, do not set the flag at all. Setting any flag was resulting in screenshots and outlinks actually getting created/captured, which was a huge slowdown. Also, check per-user SPNv2 slots available, using API, before requesting an actual capture.
*	make fmt	Bryan Newbold	2022-04-26	1	-2/+5
\|
*	block isiarticles.com from future PDF crawls	Bryan Newbold	2022-04-20	1	-0/+2
\|
*	ingest: drive.google.com ingest support	Bryan Newbold	2022-04-04	1	-0/+8
\|
*	filesets: fix archive.org path naming	Bryan Newbold	2022-03-29	1	-7/+8
\|
*	bugfix: sha1/md5 typo	Bryan Newbold	2022-03-23	1	-1/+1
\| \| \| \|	Caught this prepping to ingest into fatcat. Derp!
*	file ingest: don't 'backoff' on spn2 backoff error	Bryan Newbold	2022-03-22	2	-0/+8
\| \| \| \| \| \| \| \|	The intent of this is to try and get through the daily ingest requests faster, so we can loop and retry if needed. A 200 second delay, usually resulting in a kafka topic reshuffle, really slows things down. This will presumably result in a bunch of spn2-backoff status requests, but we can just retry those.
*	small lint/typo/fmt fixes	Bryan Newbold	2022-02-24	3	-5/+5
\|
*	another bad PDF sha1	Bryan Newbold	2022-02-23	1	-0/+1
\|
*	ingest: fix mistakenly commented except block (?)	Bryan Newbold	2022-02-18	1	-4/+3
\|
*	ingest: handle more fileset failure modes	Bryan Newbold	2022-02-18	2	-3/+30
\|
*	yet another bad PDF sha1	Bryan Newbold	2022-02-08	1	-0/+1
\|
*	sandcrawler: additional extracts, mostly OJS	Bryan Newbold	2022-01-13	1	-1/+23
\|
*	filesets: more figshare URL patterns	Bryan Newbold	2022-01-13	1	-0/+13
\|
*	fileset ingest: better verification of resources	Bryan Newbold	2022-01-13	1	-7/+23
\|
*	ingest: PDF pattern for integrityresjournals.org	Bryan Newbold	2022-01-13	1	-0/+8
\|
*	null-body -> empty-blob	Bryan Newbold	2022-01-13	3	-4/+8
\|
*	spn: handle blocked-url (etc) better	Bryan Newbold	2022-01-11	1	-0/+10
\|
*	filesets: handle weird figshare link-only case better	Bryan Newbold	2021-12-16	1	-1/+4
\|
*	lint ('not in')	Bryan Newbold	2021-12-15	1	-2/+2
\|
*	more fileset ingest tweaks	Bryan Newbold	2021-12-15	2	-0/+7
\|
*	fileset ingest: more requests timeouts, sessions	Bryan Newbold	2021-12-15	3	-37/+68
\|
*	fileset ingest: create tmp subdirectories if needed	Bryan Newbold	2021-12-15	1	-0/+5
\|
*	fileset ingest: configure IA session from env	Bryan Newbold	2021-12-15	1	-1/+6
\| \| \| \| \|	Note that this doesn't currently work for `upload()`, and as a work-around I created `~/.config/ia.ini` manually on the worker VM.
*	fileset ingest: actually use spn2 CLI flag	Bryan Newbold	2021-12-11	2	-3/+4
\|
*	grobid: set a maximum file size (256 MByte)	Bryan Newbold	2021-12-07	1	-0/+8
\|
*	codespell typos in python (comments)	Bryan Newbold	2021-11-24	4	-4/+4
\|
*	html_meta: actual typo in code (CSS selector) caught by codespell	Bryan Newbold	2021-11-24	1	-1/+1
\|
*	make fmt	Bryan Newbold	2021-11-16	1	-1/+1
\|
*	SPNv2: make 'resources' optional	Bryan Newbold	2021-11-16	1	-1/+1
\| \| \| \| \| \| \| \|	This was always present previously. A change was made to the SPNv2 API recently that borked it a bit, though in theory it should be present on new captures. I'm not seeing it for some captures, so pushing this workaround. It seems like we don't actually use this field anyway, at least for the ingest pipeline.
*	grobid: handle XML parsing errors, and have them recorded in sandcrawler-db	Bryan Newbold	2021-11-12	1	-1/+5
\|
*	ingest_file: more efficient GROBID metadata copy	Bryan Newbold	2021-11-12	1	-3/+3
\|
*	ingest: start re-processing GROBID with newer version	Bryan Newbold	2021-11-10	1	-2/+6
\|
*	simple persist worker/tool to backfill grobid_refs	Bryan Newbold	2021-11-10	1	-0/+40
\|
*	grobid: extract more metadata in document TEI-XML	Bryan Newbold	2021-11-10	1	-0/+5
\|
*	grobid: update 'TODO' comment based on review	Bryan Newbold	2021-11-04	1	-3/+0
\|
*	crossref grobid refs: another error case (ReadTimeout)	Bryan Newbold	2021-11-04	2	-5/+11
\| \| \| \| \|	With this last exception handled, was able to get through millions of rows of references, with only a few dozen errors (mostly invalid XML).
*	db (postgrest): actually use an HTTP session	Bryan Newbold	2021-11-04	1	-12/+24
\| \| \| \|	Not as important with GET as POST, I think, but still best practice.
*	grobid: use requests session	Bryan Newbold	2021-11-04	1	-3/+4
\| \| \| \| \| \|	This should fix an embarrassing bug with exhausting local ports: requests.exceptions.ConnectionError: HTTPConnectionPool(host='wbgrp-svc096.us.archive.org', port=8070): Max retries exceeded with url: /api/processCitationList (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8dfc24e250>: Failed to establish a new connection: [Errno 99] Cannot assign requested address'))
*	grobid crossref refs: try to handle HTTP 5xx and XML parse errors	Bryan Newbold	2021-11-04	2	-5/+33
\|
*	grobid: handle weird whitespace unstructured from crossref	Bryan Newbold	2021-11-04	1	-1/+10
\| \| \| \|	See also: https://github.com/kermitt2/grobid/issues/849
*	crossref persist: make GROBID ref parsing an option (not default)	Bryan Newbold	2021-11-04	1	-7/+16
\|
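
The most recent entry in the log shortens the default HTTP backoff factor because the old value led to many-minute-long backoffs. The sketch below illustrates why a small change in the factor matters under exponential backoff. It is not sandcrawler's code: the function name `backoff_delays`, the doubling schedule, and the example factors (3.0 and 0.5) and retry count are all assumptions chosen for illustration, since the log does not state the actual values.

```python
from typing import List, Optional


def backoff_delays(factor: float, retries: int, cap: Optional[float] = None) -> List[float]:
    """Return the sleep (in seconds) before each retry under a simple
    exponential schedule: factor * 2**attempt, optionally capped.

    This is a generic model of exponential backoff (an assumption), not
    the exact formula used by any particular HTTP library.
    """
    delays = []
    for attempt in range(retries):
        delay = factor * (2 ** attempt)
        if cap is not None:
            delay = min(delay, cap)
        delays.append(delay)
    return delays


if __name__ == "__main__":
    # Hypothetical factors: a "long" one and a "shortened" one.
    for factor in (3.0, 0.5):
        delays = backoff_delays(factor, retries=8)
        print(f"factor={factor}: delays={delays} total={sum(delays):.1f}s")
```

Because each delay doubles, the total wait grows roughly as the factor times two to the power of the retry count, so reducing the factor shrinks every delay in the schedule proportionally. A worker that sleeps for minutes inside a single request may also miss its Kafka consumer deadline, which is consistent with the timeouts the commit message mentions.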