sandcrawler - [no description]

	Commit message (Collapse)	Author	Age	Files	Lines
...
*	grobid_tool: helper to process a single file	Bryan Newbold	2021-11-10	1	-0/+15
\|
*	ingest: start re-processing GROBID with newer version	Bryan Newbold	2021-11-10	1	-2/+6
\|
*	simple persist worker/tool to backfill grobid_refs	Bryan Newbold	2021-11-10	2	-0/+62
\|
*	grobid: extract more metadata in document TEI-XML	Bryan Newbold	2021-11-10	1	-0/+5
\|
*	grobid: update 'TODO' comment based on review	Bryan Newbold	2021-11-04	1	-3/+0
\|
*	update crossref/grobid refs generation notes	Bryan Newbold	2021-11-04	1	-4/+96
\|
*	crossref grobid refs: another error case (ReadTimeout)	Bryan Newbold	2021-11-04	2	-5/+11
\| \| \| \| \|	With this last exception handled, was able to get through millions of rows of references, with only a few dozen errors (mostly invalid XML).
*	db (postgrest): actually use an HTTP session	Bryan Newbold	2021-11-04	1	-12/+24
\| \| \| \|	Not as important with GET as POST, I think, but still best practice.
*	grobid: use requests session	Bryan Newbold	2021-11-04	1	-3/+4
\| \| \| \| \| \|	This should fix an embarrassing bug with exhausting local ports: requests.exceptions.ConnectionError: HTTPConnectionPool(host='wbgrp-svc096.us.archive.org', port=8070): Max retries exceeded with url: /api/processCitationList (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8dfc24e250>: Failed to establish a new connection: [Errno 99] Cannot assign requested address'))
*	grobid crossref refs: try to handle HTTP 5xx and XML parse errors	Bryan Newbold	2021-11-04	2	-5/+33
\|
*	grobid: handle weird whitespace unstructured from crossref	Bryan Newbold	2021-11-04	1	-1/+10
\| \| \| \|	See also: https://github.com/kermitt2/grobid/issues/849
*	crossref persist: batch size depends on whether parsing refs	Bryan Newbold	2021-11-04	2	-2/+8
\|
*	sql: grobid_refs table JSON as 'JSON' not 'JSONB'	Bryan Newbold	2021-11-04	2	-3/+3
\| \| \| \| \|	I keep flip-flopping on this, but our disk usage is really large, and if 'JSON' is smaller than 'JSONB' in postgresql at all it is worth it.
*	grobid refs backfill progress	Bryan Newbold	2021-11-04	1	-1/+43
\|
*	record SQL table sizes at start of crossref re-ingest	Bryan Newbold	2021-11-04	1	-0/+19
\|
*	start notes on crossref refs backfill	Bryan Newbold	2021-11-04	1	-0/+54
\|
*	crossref persist: make GROBID ref parsing an option (not default)	Bryan Newbold	2021-11-04	3	-9/+33
\|
*	add grobid_refs and crossref_with_refs to sandcrawler-db SQL schema	Bryan Newbold	2021-11-04	1	-0/+21
\|
*	glue, utils, and worker code for crossref and grobid_refs	Bryan Newbold	2021-11-04	4	-5/+212
\|
*	update grobid refs proposal	Bryan Newbold	2021-11-04	1	-10/+72
\|
*	iterated GROBID citation cleaning and processing	Bryan Newbold	2021-11-04	1	-27/+45
\| \| \| \|	Switched to using just 'key'/'id' for downstream matching.
*	grobid citations: first pass at cleaning unstructured	Bryan Newbold	2021-11-04	1	-2/+34
\|
*	initial proposal for GROBID refs table and pipeline	Bryan Newbold	2021-11-04	1	-0/+63
\|
*	initial crossref-refs via GROBID helper routine	Bryan Newbold	2021-11-04	7	-6/+839
\|
*	pipenv: bump grobid_tei_xml version to 0.1.2	Bryan Newbold	2021-11-04	2	-11/+11
\|
*	pdftrio client: use HTTP session for POSTs	Bryan Newbold	2021-11-03	1	-1/+1
\|
*	workers: use HTTP session for archive.org fetches	Bryan Newbold	2021-11-03	1	-3/+3
\|
*	IA (wayback): actually use an HTTP session for replay fetches	Bryan Newbold	2021-11-03	1	-2/+3
\| \| \| \| \| \| \| \|	I am embarrassed this wasn't actually the case already! Looks like I had even instantiated a session but wasn't using it. Hopefully this change, which adds extra retries and better backoff behavior, will improve sandcrawler ingest throughput.
*	SPN reingest: 6 hour minimum, 6 month max	Bryan Newbold	2021-11-03	1	-2/+2
\|
*	sql: fix typo in quarterly (not weekly) script	Bryan Newbold	2021-11-03	1	-1/+1
\|
*	sql: fixes to ingest_fileset_platform schema (from table creation)	Bryan Newbold	2021-11-01	2	-12/+12
\|
*	updates/corrections to old small.json GROBID metadata example file	Bryan Newbold	2021-10-27	1	-6/+1
\|
*	remove grobid2json helper file, replace with grobid_tei_xml	Bryan Newbold	2021-10-27	7	-224/+22
\|
*	small type annotation things from additional packages	Bryan Newbold	2021-10-27	2	-5/+14
\|
*	toolchain config updates	Bryan Newbold	2021-10-27	3	-10/+6
\|
*	make fmt (black 21.9b0)	Bryan Newbold	2021-10-27	57	-3126/+3991
\|
*	pipenv: flipflop from yapf back to black; more type packages; bump grobid_tei_xml	Bryan Newbold	2021-10-27	2	-27/+112
*	fileset: refactor out tables of helpers	Bryan Newbold	2021-10-27	3	-21/+19
\| \| \| \| \| \| \|	Having these objects invoked in tables resulted in a whole bunch of objects (including children) getting initialized, which seems like the wrong thing to do. Defer this until the actual ingest fileset worker is initialized.
*	gitlab-ci: copy env var in to place for tests	Bryan Newbold	2021-10-27	1	-0/+1
\|
*	fix type annotations for petabox body fetch helper	Bryan Newbold	2021-10-26	5	-8/+11
\|
*	small type annotation hack	Bryan Newbold	2021-10-26	1	-1/+1
\|
*	fileset: fix field renaming bug (caught by mypy)	Bryan Newbold	2021-10-26	1	-2/+2
\|
*	fileset ingest: fix table name typo (via mypy)	Bryan Newbold	2021-10-26	1	-1/+1
\|
*	update 'XXX' notes from fileset ingest development	Bryan Newbold	2021-10-26	2	-9/+6
\|
*	bugfix: setting html_biblio on ingest results	Bryan Newbold	2021-10-26	2	-2/+2
\| \| \| \|	This was caught during lint cleanup
*	lint collection membership (last lint for now)	Bryan Newbold	2021-10-26	7	-32/+32
\|
*	commit updated flake8 lint configuration	Bryan Newbold	2021-10-26	1	-6/+10
\|
*	ingest fileset: fix silly import typo	Bryan Newbold	2021-10-26	1	-1/+1
\|
*	type annotations for persist workers; required some work	Bryan Newbold	2021-10-26	1	-66/+59
\| \| \| \| \|	Had to re-structure and filter things a bit. Should be better behavior, but there might be some small changes.
*	ingest file HTTP API: fixes from type checking	Bryan Newbold	2021-10-26	1	-3/+3
\| \| \| \| \|	This code is deprecated and should be removed anyways, but still interesting to see the fixes