sandcrawler - [no description]

	Commit message	Author	Age	Files	Lines
*	grobid: handle XML parsing errors, and have them recorded in sandcrawler-db	Bryan Newbold	2021-11-12	1	-1/+5
\|
*	ingest_file: more efficient GROBID metadata copy	Bryan Newbold	2021-11-12	1	-3/+3
\|
*	grobid_tool: helper to process a single file	Bryan Newbold	2021-11-10	1	-0/+15
\|
*	ingest: start re-processing GROBID with newer version	Bryan Newbold	2021-11-10	1	-2/+6
\|
*	simple persist worker/tool to backfill grobid_refs	Bryan Newbold	2021-11-10	2	-0/+62
\|
*	grobid: extract more metadata in document TEI-XML	Bryan Newbold	2021-11-10	1	-0/+5
\|
*	grobid: update 'TODO' comment based on review	Bryan Newbold	2021-11-04	1	-3/+0
\|
*	crossref grobid refs: another error case (ReadTimeout)	Bryan Newbold	2021-11-04	2	-5/+11
\| \| \| \| \|	With this last exception handled, was able to get through millions of rows of references, with only a few dozen errors (mostly invalid XML).
*	db (postgrest): actually use an HTTP session	Bryan Newbold	2021-11-04	1	-12/+24
\| \| \| \|	Not as important with GET as POST, I think, but still best practice.
*	grobid: use requests session	Bryan Newbold	2021-11-04	1	-3/+4
\| \| \| \| \| \|	This should fix an embarrassing bug with exhausting local ports: requests.exceptions.ConnectionError: HTTPConnectionPool(host='wbgrp-svc096.us.archive.org', port=8070): Max retries exceeded with url: /api/processCitationList (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8dfc24e250>: Failed to establish a new connection: [Errno 99] Cannot assign requested address'))
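A minimal sketch of the session-reuse pattern this commit describes, not the actual sandcrawler code. The class name `GrobidClientSketch`, the default host URL, the timeout, and the `citations` form field are assumptions for illustration; only the `/api/processCitationList` path comes from the error message above. Calling `requests.post()` directly opens a fresh connection per request, which can exhaust local ports under load, while a shared `requests.Session` reuses pooled connections.

```python
from typing import Sequence

import requests


class GrobidClientSketch:
    """Hypothetical client used only to illustrate session reuse."""

    def __init__(self, host_url: str = "http://localhost:8070") -> None:
        self.host_url = host_url
        # One Session per client: its connection pool keeps TCP connections
        # alive between requests instead of opening a new socket each time.
        self.session = requests.Session()

    def process_citation_list(self, citations: Sequence[str]) -> str:
        # Every call goes through the shared session, not requests.post().
        resp = self.session.post(
            self.host_url + "/api/processCitationList",
            data={"citations": list(citations)},  # assumed form field name
            timeout=10.0,
        )
        resp.raise_for_status()
        return resp.text
```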
*	grobid crossref refs: try to handle HTTP 5xx and XML parse errors	Bryan Newbold	2021-11-04	2	-5/+33
\|
*	grobid: handle weird whitespace unstructured from crossref	Bryan Newbold	2021-11-04	1	-1/+10
\| \| \| \|	See also: https://github.com/kermitt2/grobid/issues/849
*	crossref persist: batch size depends on whether parsing refs	Bryan Newbold	2021-11-04	2	-2/+8
\|
*	crossref persist: make GROBID ref parsing an option (not default)	Bryan Newbold	2021-11-04	3	-9/+33
\|
*	glue, utils, and worker code for crossref and grobid_refs	Bryan Newbold	2021-11-04	4	-5/+212
\|
*	iterated GROBID citation cleaning and processing	Bryan Newbold	2021-11-04	1	-27/+45
\| \| \| \|	Switched to using just 'key'/'id' for downstream matching.
*	grobid citations: first pass at cleaning unstructured	Bryan Newbold	2021-11-04	1	-2/+34
\|
*	initial crossref-refs via GROBID helper routine	Bryan Newbold	2021-11-04	7	-6/+839
\|
*	pipenv: bump grobid_tei_xml version to 0.1.2	Bryan Newbold	2021-11-04	2	-11/+11
\|
*	pdftrio client: use HTTP session for POSTs	Bryan Newbold	2021-11-03	1	-1/+1
\|
*	workers: use HTTP session for archive.org fetches	Bryan Newbold	2021-11-03	1	-3/+3
\|
*	IA (wayback): actually use an HTTP session for replay fetches	Bryan Newbold	2021-11-03	1	-2/+3
\| \| \| \| \| \| \| \|	I am embarrassed this wasn't actually the case already! Looks like I had even instantiated a session but wasn't using it. Hopefully this change, which adds extra retries and better backoff behavior, will improve sandcrawler ingest throughput.
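The "extra retries and better backoff" idea can be sketched with the standard `requests` plus `urllib3` retry adapter. The function name `requests_retry_session` and the specific retry count, backoff factor, and status codes are assumptions chosen for illustration, not values taken from the sandcrawler source.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def requests_retry_session(
    retries: int = 3,
    backoff_factor: float = 3.0,
    status_forcelist=(500, 502, 504),
) -> requests.Session:
    """Build a Session that retries failed requests with exponential backoff."""
    session = requests.Session()
    retry = Retry(
        total=retries,
        read=retries,
        connect=retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_forcelist,
    )
    adapter = HTTPAdapter(max_retries=retry)
    # Mount the adapter for both schemes so all fetches share the policy.
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```

The important part is then to call `session.get(...)` on the returned object everywhere; instantiating the session without routing requests through it has no effect.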
*	updates/corrections to old small.json GROBID metadata example file	Bryan Newbold	2021-10-27	1	-6/+1
\|
*	remove grobid2json helper file, replace with grobid_tei_xml	Bryan Newbold	2021-10-27	7	-224/+22
\|
*	small type annotation things from additional packages	Bryan Newbold	2021-10-27	2	-5/+14
\|
*	toolchain config updates	Bryan Newbold	2021-10-27	3	-10/+6
\|
*	make fmt (black 21.9b0)	Bryan Newbold	2021-10-27	57	-3126/+3991
\|
*	pipenv: flipflop from yapf back to black; more type packages; bump grobid_tei_xml	Bryan Newbold	2021-10-27	2	-27/+112
\|
*	fileset: refactor out tables of helpers	Bryan Newbold	2021-10-27	3	-21/+19
\| \| \| \| \| \| \|	Having these objects invoked in tables resulted in a whole bunch of objects (including children) getting initialized, which seems like the wrong thing to do. Defer this until the actual ingest fileset worker is initialized.
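A small sketch of the deferral described in this commit message, under assumed names (`ExampleHelper`, `HELPER_CLASSES`, `WorkerSketch` are all hypothetical and do not appear in the source). The lookup table holds classes rather than instances, so nothing is constructed at import time; instances are created only when the worker itself is initialized.

```python
class ExampleHelper:
    """Stand-in for a helper whose constructor is expensive."""

    instances = 0  # counts constructions, for demonstration only

    def __init__(self) -> None:
        ExampleHelper.instances += 1


# Eager variant (what the commit moves away from): building the table
# would construct every helper, and its children, at import time.
#   HELPER_TABLE = {"example": ExampleHelper()}

# Deferred variant: the table maps names to classes only.
HELPER_CLASSES = {"example": ExampleHelper}


class WorkerSketch:
    def __init__(self) -> None:
        # Helpers are instantiated here, when a worker actually exists.
        self.helpers = {name: cls() for name, cls in HELPER_CLASSES.items()}
```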
*	fix type annotations for petabox body fetch helper	Bryan Newbold	2021-10-26	5	-8/+11
\|
*	small type annotation hack	Bryan Newbold	2021-10-26	1	-1/+1
\|
*	fileset: fix field renaming bug (caught by mypy)	Bryan Newbold	2021-10-26	1	-2/+2
\|
*	fileset ingest: fix table name typo (via mypy)	Bryan Newbold	2021-10-26	1	-1/+1
\|
*	update 'XXX' notes from fileset ingest development	Bryan Newbold	2021-10-26	2	-9/+6
\|
*	bugfix: setting html_biblio on ingest results	Bryan Newbold	2021-10-26	2	-2/+2
\| \| \| \|	This was caught during lint cleanup
*	lint collection membership (last lint for now)	Bryan Newbold	2021-10-26	7	-32/+32
\|
*	commit updated flake8 lint configuration	Bryan Newbold	2021-10-26	1	-6/+10
\|
*	ingest fileset: fix silly import typo	Bryan Newbold	2021-10-26	1	-1/+1
\|
*	type annotations for persist workers; required some work	Bryan Newbold	2021-10-26	1	-66/+59
\| \| \| \| \|	Had to re-structure and filter things a bit. Should be better behavior, but there might be some small changes.
*	ingest file HTTP API: fixes from type checking	Bryan Newbold	2021-10-26	1	-3/+3
\| \| \| \| \|	This code is deprecated and should be removed anyway, but it is still interesting to see the fixes.
*	more progress on type annotations	Bryan Newbold	2021-10-26	8	-34/+55
\|
*	grobid: fix a bug with consolidate_mode header, exposed by type annotations	Bryan Newbold	2021-10-26	1	-1/+2
\|
*	grobid: type annotations	Bryan Newbold	2021-10-26	1	-9/+19
\|
*	type annotations on SandcrawlerWorker	Bryan Newbold	2021-10-26	1	-46/+57
\| \| \| \| \|	These annotations have a broad impact! Being conservative to start: Any-to-Any for process(), etc.
*	more progress on type annotations and linting	Bryan Newbold	2021-10-26	11	-55/+87
\|
*	live tests: FTP wayback replay now returns 200, not 226	Bryan Newbold	2021-10-26	1	-2/+2
\|
*	ia: more tweaks to delicate code to satisfy type checker	Bryan Newbold	2021-10-26	1	-10/+12
\| \| \| \| \|	Ran the 'live' wayback tests after this commit as a check, and they passed (once the FTP status code behavior change is fixed).
*	ia helpers: enforce max_redirects count correctly	Bryan Newbold	2021-10-26	1	-1/+1
\| \| \| \| \|	AKA, should run fetch even if max_redirects = 0; the first loop iteration is not a redirect.
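The off-by-one described here can be illustrated with a generic redirect loop. This is a sketch under assumptions, not the sandcrawler helper: the function name, the `fetch` callback signature (returning a status code and an optional redirect target), and the exception type are all hypothetical. The loop runs `max_redirects + 1` times because the first iteration is the initial fetch, not a redirect.

```python
from typing import Callable, Optional, Tuple

REDIRECT_CODES = (301, 302, 303, 307, 308)


def fetch_following_redirects(
    url: str,
    fetch: Callable[[str], Tuple[int, Optional[str]]],
    max_redirects: int = 3,
) -> Tuple[int, str]:
    """Fetch `url`, following at most `max_redirects` redirects."""
    # +1: with max_redirects=0 the URL is still fetched exactly once.
    for _ in range(max_redirects + 1):
        status, location = fetch(url)
        if status in REDIRECT_CODES and location:
            url = location
            continue
        return status, url
    raise RuntimeError("exceeded max_redirects")
```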
*	set CDX request params are str, not int or datetime	Bryan Newbold	2021-10-26	1	-3/+6
\| \| \| \|	This might be a bugfix, changing CDX lookup behavior?
*	bugfix: was setting 'from' parameter as a tuple, not a string	Bryan Newbold	2021-10-26	1	-1/+1
\|
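The two CDX parameter fixes above both come down to passing plain strings as request parameters. The sketch below is illustrative only: `cdx_params` is a hypothetical helper, and the stray trailing comma is just one common way a tuple ends up where a string was intended, not necessarily the cause of the original bug.

```python
import datetime
from typing import Dict, Optional


def cdx_params(
    url: str,
    from_dt: Optional[datetime.datetime] = None,
    limit: Optional[int] = None,
) -> Dict[str, str]:
    """Build CDX query parameters with every value converted to str."""
    params = {"url": url, "output": "json"}
    if from_dt is not None:
        # Format the timestamp explicitly instead of passing a datetime.
        params["from"] = from_dt.strftime("%Y%m%d%H%M%S")
    if limit is not None:
        params["limit"] = str(limit)
    return params


# A trailing comma silently turns a string into a one-element tuple:
accidental_tuple = "20211026000000",
intended_string = "20211026000000"
```

Type checkers such as mypy flag the tuple case when the parameter dict is annotated as `Dict[str, str]`.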