Commit message | Author | Age | Files | Lines
* make fmt | Bryan Newbold | 2021-11-16 | 1 | -1/+1
* SPNv2: make 'resources' optional | Bryan Newbold | 2021-11-16 | 1 | -1/+1
    This field was always present previously. A change was made to the SPNv2 API recently that borked it a bit, though in theory it should still be present on new captures. I'm not seeing it for some captures, so pushing this workaround. It seems like we don't actually use this field anyways, at least for the ingest pipeline.
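Since the field can now be absent, the defensive-read pattern implied here can be sketched like so (a hypothetical helper with illustrative names, not the actual sandcrawler code):

```python
# Hypothetical sketch: read the SPNv2 'resources' field defensively,
# since recent API responses sometimes omit it.
from typing import Any, Dict, List


def spn2_resources(status_body: Dict[str, Any]) -> List[str]:
    """Return the 'resources' list from an SPNv2 status response, or [] if absent."""
    resources = status_body.get("resources")
    if not isinstance(resources, list):
        return []
    return resources
```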
* grobid: handle XML parsing errors, and have them recorded in sandcrawler-db | Bryan Newbold | 2021-11-12 | 1 | -1/+5
* ingest_file: more efficient GROBID metadata copy | Bryan Newbold | 2021-11-12 | 1 | -3/+3
* wrap up crossref refs backfill notes | Bryan Newbold | 2021-11-10 | 1 | -0/+47
* grobid_tool: helper to process a single file | Bryan Newbold | 2021-11-10 | 1 | -0/+15
* ingest: start re-processing GROBID with newer version | Bryan Newbold | 2021-11-10 | 1 | -2/+6
* simple persist worker/tool to backfill grobid_refs | Bryan Newbold | 2021-11-10 | 2 | -0/+62
* grobid: extract more metadata in document TEI-XML | Bryan Newbold | 2021-11-10 | 1 | -0/+5
* grobid: update 'TODO' comment based on review | Bryan Newbold | 2021-11-04 | 1 | -3/+0
* update crossref/grobid refs generation notes | Bryan Newbold | 2021-11-04 | 1 | -4/+96
* crossref grobid refs: another error case (ReadTimeout) | Bryan Newbold | 2021-11-04 | 2 | -5/+11
    With this last exception handled, was able to get through millions of rows of references, with only a few dozen errors (mostly invalid XML).
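The kind of exception handling described can be sketched like this (function and status names are illustrative, not the actual sandcrawler code):

```python
# Illustrative sketch: wrap a GROBID citation-processing call so that
# timeouts and connection errors become status records instead of crashes.
from typing import Any, Callable, Dict

import requests


def call_with_error_handling(fn: Callable[[], Dict[str, Any]]) -> Dict[str, Any]:
    try:
        return fn()
    except requests.exceptions.ReadTimeout:
        # GROBID took too long; record the error and move on to the next row.
        return {"status": "error-timeout"}
    except requests.exceptions.ConnectionError:
        return {"status": "error-connect"}
```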
* db (postgrest): actually use an HTTP session | Bryan Newbold | 2021-11-04 | 1 | -12/+24
    Not as important with GET as POST, I think, but still best practice.
* grobid: use requests session | Bryan Newbold | 2021-11-04 | 1 | -3/+4
    This should fix an embarrassing bug with exhausting local ports:

        requests.exceptions.ConnectionError: HTTPConnectionPool(host='wbgrp-svc096.us.archive.org', port=8070): Max retries exceeded with url: /api/processCitationList (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8dfc24e250>: Failed to establish a new connection: [Errno 99] Cannot assign requested address'))
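The fix pattern, sketched (class and endpoint are from the log above; the rest is illustrative, not the actual sandcrawler code): reusing one `requests.Session` pools and keeps connections alive, instead of opening a fresh ephemeral port per request, which is what exhausts local ports under load.

```python
# Sketch of the session-reuse pattern for a GROBID-style HTTP API.
# Opening a new connection per call can exhaust ephemeral ports
# ("[Errno 99] Cannot assign requested address"); a Session reuses them.
import requests


class GrobidClient:
    def __init__(self, host_url: str):
        self.host_url = host_url
        # One Session per client: connection pooling and keep-alive.
        self.session = requests.Session()

    def process_citation_list(self, citations: list) -> requests.Response:
        return self.session.post(
            self.host_url + "/api/processCitationList",
            data={"citations": citations},
            timeout=60.0,
        )
```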
* grobid crossref refs: try to handle HTTP 5xx and XML parse errors | Bryan Newbold | 2021-11-04 | 2 | -5/+33
* grobid: handle weird whitespace unstructured from crossref | Bryan Newbold | 2021-11-04 | 1 | -1/+10
    See also: https://github.com/kermitt2/grobid/issues/849
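The whitespace handling involved can be sketched as follows (a hypothetical helper, not the exact sandcrawler code): unstructured citation strings from Crossref may contain newlines, tabs, or runs of spaces that confuse GROBID, so collapse them before submission.

```python
# Hypothetical helper: collapse weird whitespace in 'unstructured'
# citation strings from Crossref before sending them to GROBID.
import re


def clean_unstructured(raw: str) -> str:
    # Replace any run of whitespace (newlines, tabs, multiple spaces)
    # with a single space, and trim the ends.
    return re.sub(r"\s+", " ", raw).strip()
```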
* crossref persist: batch size depends on whether parsing refs | Bryan Newbold | 2021-11-04 | 2 | -2/+8
* sql: grobid_refs table JSON as 'JSON' not 'JSONB' | Bryan Newbold | 2021-11-04 | 2 | -3/+3
    I keep flip-flopping on this, but our disk usage is really large, and if 'JSON' is at all smaller than 'JSONB' in PostgreSQL, it is worth it.
* grobid refs backfill progress | Bryan Newbold | 2021-11-04 | 1 | -1/+43
* record SQL table sizes at start of crossref re-ingest | Bryan Newbold | 2021-11-04 | 1 | -0/+19
* start notes on crossref refs backfill | Bryan Newbold | 2021-11-04 | 1 | -0/+54
* crossref persist: make GROBID ref parsing an option (not default) | Bryan Newbold | 2021-11-04 | 3 | -9/+33
* add grobid_refs and crossref_with_refs to sandcrawler-db SQL schema | Bryan Newbold | 2021-11-04 | 1 | -0/+21
* glue, utils, and worker code for crossref and grobid_refs | Bryan Newbold | 2021-11-04 | 4 | -5/+212
* update grobid refs proposal | Bryan Newbold | 2021-11-04 | 1 | -10/+72
* iterated GROBID citation cleaning and processing | Bryan Newbold | 2021-11-04 | 1 | -27/+45
    Switched to using just 'key'/'id' for downstream matching.
* grobid citations: first pass at cleaning unstructured | Bryan Newbold | 2021-11-04 | 1 | -2/+34
* initial proposal for GROBID refs table and pipeline | Bryan Newbold | 2021-11-04 | 1 | -0/+63
* initial crossref-refs via GROBID helper routine | Bryan Newbold | 2021-11-04 | 7 | -6/+839
* pipenv: bump grobid_tei_xml version to 0.1.2 | Bryan Newbold | 2021-11-04 | 2 | -11/+11
* pdftrio client: use HTTP session for POSTs | Bryan Newbold | 2021-11-03 | 1 | -1/+1
* workers: use HTTP session for archive.org fetches | Bryan Newbold | 2021-11-03 | 1 | -3/+3
* IA (wayback): actually use an HTTP session for replay fetches | Bryan Newbold | 2021-11-03 | 1 | -2/+3
    I am embarrassed this wasn't actually the case already! It looks like I had even instantiated a session but wasn't using it. Hopefully this change, which adds extra retries and better backoff behavior, will improve sandcrawler ingest throughput.
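The retries-and-backoff part can be sketched like this (parameter values are illustrative, not the actual sandcrawler configuration; the `allowed_methods` keyword assumes urllib3 >= 1.26):

```python
# Illustrative sketch: a requests.Session with automatic retries and
# exponential backoff mounted for wayback/replay fetches.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(retries: int = 3, backoff_factor: float = 3.0) -> requests.Session:
    session = requests.Session()
    retry = Retry(
        total=retries,
        backoff_factor=backoff_factor,  # exponential backoff between attempts
        status_forcelist=[500, 502, 503, 504],  # retry on transient 5xx
        allowed_methods=["GET", "HEAD"],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```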
* SPN reingest: 6 hour minimum, 6 month max | Bryan Newbold | 2021-11-03 | 1 | -2/+2
* sql: fix typo in quarterly (not weekly) script | Bryan Newbold | 2021-11-03 | 1 | -1/+1
* sql: fixes to ingest_fileset_platform schema (from table creation) | Bryan Newbold | 2021-11-01 | 2 | -12/+12
* updates/corrections to old small.json GROBID metadata example file | Bryan Newbold | 2021-10-27 | 1 | -6/+1
* remove grobid2json helper file, replace with grobid_tei_xml | Bryan Newbold | 2021-10-27 | 7 | -224/+22
* small type annotation things from additional packages | Bryan Newbold | 2021-10-27 | 2 | -5/+14
* toolchain config updates | Bryan Newbold | 2021-10-27 | 3 | -10/+6
* make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 57 | -3126/+3991
* pipenv: flipflop from yapf back to black; more type packages; bump grobid_tei_xml | Bryan Newbold | 2021-10-27 | 2 | -27/+112
* fileset: refactor out tables of helpers | Bryan Newbold | 2021-10-27 | 3 | -21/+19
    Having these objects invoked in tables resulted in a whole bunch of objects (including children) getting initialized, which seems like the wrong thing to do. Defer this until the actual ingest fileset worker is initialized.
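The refactor described can be sketched abstractly (class and table names below are hypothetical stand-ins, not the actual sandcrawler code): keep a table of helper classes rather than instances, and construct the instances only when the worker itself is initialized.

```python
# Hypothetical sketch of the deferred-initialization pattern: a table of
# helper *classes* means nothing is constructed at module import time;
# instances (and any children they create) appear only in the worker.
from typing import Dict, Type


class DatasetPlatformHelper:
    platform_name = "base"


class DataverseHelper(DatasetPlatformHelper):
    platform_name = "dataverse"


class FigshareHelper(DatasetPlatformHelper):
    platform_name = "figshare"


# Classes only; importing this module instantiates nothing.
DATASET_PLATFORM_HELPER_TABLE: Dict[str, Type[DatasetPlatformHelper]] = {
    "dataverse": DataverseHelper,
    "figshare": FigshareHelper,
}


class IngestFilesetWorker:
    def __init__(self) -> None:
        # Helpers are initialized here, only when a worker actually exists.
        self.helpers = {
            name: cls() for name, cls in DATASET_PLATFORM_HELPER_TABLE.items()
        }
```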
* gitlab-ci: copy env var in to place for tests | Bryan Newbold | 2021-10-27 | 1 | -0/+1
* fix type annotations for petabox body fetch helper | Bryan Newbold | 2021-10-26 | 5 | -8/+11
* small type annotation hack | Bryan Newbold | 2021-10-26 | 1 | -1/+1
* fileset: fix field renaming bug (caught by mypy) | Bryan Newbold | 2021-10-26 | 1 | -2/+2
* fileset ingest: fix table name typo (via mypy) | Bryan Newbold | 2021-10-26 | 1 | -1/+1
* update 'XXX' notes from fileset ingest development | Bryan Newbold | 2021-10-26 | 2 | -9/+6
* bugfix: setting html_biblio on ingest results | Bryan Newbold | 2021-10-26 | 2 | -2/+2
    This was caught during lint cleanup.