sandcrawler - [no description]

	Commit message	Author	Age	Files	Lines
*	move top-level RFC to proposals dir	Bryan Newbold	2022-12-23	1	-0/+0
\|
*	update README for Dec 2022	Bryan Newbold	2022-12-23	1	-24/+36
\|
*	old notes on possible places to ingest from	Bryan Newbold	2022-12-23	1	-0/+15
\|
*	old notes on domains to ingest from	Bryan Newbold	2022-12-23	1	-0/+294
\|
*	notes: old examples	Bryan Newbold	2022-12-23	4	-0/+307
\|
*	old notes on dryad datasets	Bryan Newbold	2022-12-23	1	-0/+17
\|
*	commit 2022-11-23 table sizes	Bryan Newbold	2022-12-23	1	-0/+21
\|
*	bad pdf hash	Bryan Newbold	2022-12-16	1	-0/+1
\|
*	2022 OAI-PMH crawl notes update	Bryan Newbold	2022-11-23	1	-0/+48
\|
*	sql: Makefile for SQL dumps/uploads	Bryan Newbold	2022-11-23	1	-0/+35
\|
*	notes: manually request cleanups	Bryan Newbold	2022-11-21	1	-0/+132
\|
*	sandcrawler: try to handle weird CDX API response	Bryan Newbold	2022-11-01	1	-0/+5
\|	Hard to debug this because sentry is broken.
*	ingest: more generic OJS support, including pre-prints	Bryan Newbold	2022-10-24	1	-6/+22
\|	There were some '/article/view/' patterns which can also be, eg, '/preprint/view/'.
*	ingest: more generic PDF fulltext URL patterns	Bryan Newbold	2022-10-24	1	-0/+14
\|
*	ingest: another wall pattern, and check for walls in more places	Bryan Newbold	2022-10-24	1	-1/+14
\|
*	ingest: don't prefer WARC over SPN so strongly	Bryan Newbold	2022-10-24	1	-1/+2
\|	We generally prefer an older WARC record over an SPN record, because the lookup is easier. But, this was causing problems with repeated ingest, so demote it. We may want to make this more configurable in the future, so things like HTML sub-resource lookups or bulk ingest won't prefer random new SPN captures.
*	html: worldscientific PDF URL extraction	Bryan Newbold	2022-10-24	1	-0/+16
\|
*	html: pubpub platform detection	Bryan Newbold	2022-10-24	1	-0/+2
\|
*	OAI-PMH updates	Bryan Newbold	2022-10-07	3	-2/+391
\|
*	reingests: update scripts and SQL	Bryan Newbold	2022-10-03	7	-6/+127
\|
*	persist: skip huge URLs	Bryan Newbold	2022-09-28	1	-0/+4
\|	and fix some minor doc typos
*	filesets: handle unknown file sizes (mypy lint fix)	Bryan Newbold	2022-09-28	1	-1/+1
\|
*	update oai-pmh ingest request transform script	Bryan Newbold	2022-09-28	1	-2/+38
\|
*	pytest: suppress another deprecationwarning	Bryan Newbold	2022-09-14	1	-0/+1
\|
*	spn2: fix tests by not retrying on HTTP 500	Bryan Newbold	2022-09-14	1	-1/+3
\|
*	catch poppler 'ValueError' when parsing PDFs	Bryan Newbold	2022-09-14	1	-1/+2
\|	Seeing a spike in bad PDFs in the past week or so, while processing old failed ingests. Should really switch from poppler to muPDF.
*	bad PDF sha1	Bryan Newbold	2022-09-12	1	-0/+4
\|
*	bad PDF sha1	Bryan Newbold	2022-09-11	1	-0/+2
\|
*	another bad PDF sha1	Bryan Newbold	2022-09-09	1	-0/+1
\|
*	yet more bad PDF hashes	Bryan Newbold	2022-09-08	1	-0/+4
\|
*	pipenv: removed unused deps; re-lock deps	Bryan Newbold	2022-09-07	2	-783/+767
\|
*	sandcrawler SQL-based status (sept 2022)	Bryan Newbold	2022-09-07	1	-0/+438
\|
*	summer 2022 ingest notes	Bryan Newbold	2022-09-06	3	-0/+389
\|
*	html ingest: handle TEI-XML parse error	Bryan Newbold	2022-07-28	1	-1/+4
\|
*	yet another bad PDF sha1	Bryan Newbold	2022-07-27	1	-0/+1
\|
*	CDX: skip sha-256 digests	Bryan Newbold	2022-07-25	1	-1/+5
\|
*	yet another bad SHA1 PDF hash	Bryan Newbold	2022-07-24	1	-0/+1
\|
*	misc ingest fixes	Bryan Newbold	2022-07-21	1	-0/+831
\|
*	ingest: bump max-hops from 6 to 8	Bryan Newbold	2022-07-20	1	-1/+1
\|
*	ingest: more PDF fulltext tricks	Bryan Newbold	2022-07-20	2	-0/+36
\|
*	ingest: more PDF fulltext URL patterns	Bryan Newbold	2022-07-20	1	-0/+42
\|
*	doaj and unpaywall transforms: more domains to skip	Bryan Newbold	2022-07-20	2	-3/+1
\|
*	ingest: record bad GZIP transfer decode, instead of crashing (HTML)	Bryan Newbold	2022-07-18	1	-1/+4
\|
*	make fmt	Bryan Newbold	2022-07-18	1	-1/+0
\|
*	cdx: tweak CDX lookups and resolution (sort)	Bryan Newbold	2022-07-16	1	-4/+7
\|
*	html ingest: allow fuzzy CDX sha1 match based on encoding/not-encoding	Bryan Newbold	2022-07-16	1	-3/+10
\|
*	HTML: no longer extracting citation_pdf_url in main extract function	Bryan Newbold	2022-07-16	1	-24/+0
\|
*	html: mangled JSON-in-URL pattern	Bryan Newbold	2022-07-15	1	-0/+1
\|
*	html: remove old citation_pdf_url code path	Bryan Newbold	2022-07-15	1	-32/+1
\|	This code path doesn't check for 'skip' patterns, resulting in a bunch of bad CDX checks/errors
*	wayback: use same 5xx/4xx-allowing tricks for replay body fetch as for replay redirect	Bryan Newbold	2022-07-15	1	-7/+7
\|
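
The 2022-10-24 commit "ingest: more generic OJS support, including pre-prints" notes that some '/article/view/' URL patterns can also appear as '/preprint/view/'. A minimal sketch of that kind of generalization follows. It is not the repository's code: the names `OJS_VIEW_RE` and `is_ojs_view_url`, the numeric-ID assumption, and the example URLs are all hypothetical.

```python
import re

# Hypothetical pattern (an assumption, not taken from sandcrawler itself):
# accept either 'article' or 'preprint' before '/view/', followed by a numeric ID.
OJS_VIEW_RE = re.compile(r"/(article|preprint)/view/(\d+)")


def is_ojs_view_url(url: str) -> bool:
    """Return True if the URL looks like an OJS article or preprint view page."""
    return OJS_VIEW_RE.search(url) is not None


if __name__ == "__main__":
    # Example URLs are made up for illustration.
    for candidate in (
        "https://journal.example.org/index.php/abc/article/view/123",
        "https://journal.example.org/index.php/abc/preprint/view/45",
        "https://journal.example.org/index.php/abc/about",
    ):
        print(candidate, is_ojs_view_url(candidate))
```

Keeping both path segments in one alternation means a single pattern covers both cases. A real implementation would likely need further variants that this sketch does not attempt to guess.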