sandcrawler

	Commit message	Author	Age	Files	Lines
...
* \|	commit sept 2020 scielo ingest notes	Bryan Newbold	2020-12-08	1	-0/+21
\| \|
* \|	handle more wayback error conditions	Bryan Newbold	2020-11-20	1	-0/+6
\| \|
* \|	kafka docs for rolling back a consumer group	Bryan Newbold	2020-11-20	1	-0/+9
\| \|
* \|	html: more conservative parsing of element attr	Bryan Newbold	2020-11-20	1	-2/+4
\| \|
* \|	xml: catch parse error	Bryan Newbold	2020-11-19	1	-3/+8
\| \|
* \|	SQL: more ingest monitoring	Bryan Newbold	2020-11-16	3	-1/+660
\| \|
* \|	spn 'forbidden' status code	Bryan Newbold	2020-11-12	1	-1/+1
\| \|
* \|	html biblio: handle 'content not in attrs' case	Bryan Newbold	2020-11-12	1	-2/+2
\| \|
* \|	add implementation notes about HTML ingest	Bryan Newbold	2020-11-10	1	-0/+248
\| \|
* \|	fuzzy matching notes	Bryan Newbold	2020-11-10	1	-0/+148
\| \|
* \|	blacklist -> denylist	Bryan Newbold	2020-11-10	2	-9/+9
\| \|
* \|	pipenv: updates (mostly for trafilatura 0.6.0)	Bryan Newbold	2020-11-10	1	-25/+32
\| \|
* \|	DOAJ and HTML ingest tweaks from QA run	Bryan Newbold	2020-11-10	2	-3/+3
\| \|
* \|	html: handle more traf error cases	Bryan Newbold	2020-11-08	1	-2/+2
\| \|
* \|	html: more adblock	Bryan Newbold	2020-11-08	1	-1/+3
\| \|
* \|	ingest: small html_bibli typo	Bryan Newbold	2020-11-08	1	-1/+1
\| \|
* \|	html: most small platform tweaks	Bryan Newbold	2020-11-08	1	-5/+4
\| \|
* \|	move fuzzy URL match method to misc	Bryan Newbold	2020-11-08	3	-19/+20
\| \|
* \|	move some PDF URL extraction into declarative format	Bryan Newbold	2020-11-08	5	-143/+177
\| \|
* \|	basic DOAJ ingest request conversion script	Bryan Newbold	2020-11-08	1	-0/+139
\| \|
* \|	ingest: default to html_biblio for PDF URL extraction	Bryan Newbold	2020-11-08	1	-24/+17
\| \|
* \|	ingest: shorted scope+platform keys; use html_biblio extraction for PDFs	Bryan Newbold	2020-11-08	1	-15/+35
\| \|
* \|	html: more robust ingest; better platform and scope detection	Bryan Newbold	2020-11-08	1	-32/+96
\| \|
* \|	html: more extraction patterns; bugfix; skip more crossmark	Bryan Newbold	2020-11-08	1	-1/+24
\| \|
* \|	ingest html: return better status based on sniffed scope	Bryan Newbold	2020-11-08	1	-9/+31
\| \|
* \|	ingest tool: more ingest control args	Bryan Newbold	2020-11-08	1	-1/+10
\| \|
* \|	spn2-internal-server-error is a problem with remote server, not SPN2	Bryan Newbold	2020-11-08	1	-0/+2
\| \|
* \|	ingest: better non-full URL fixup	Bryan Newbold	2020-11-08	1	-4/+3
\| \|
* \|	html: small ingest improvements	Bryan Newbold	2020-11-08	2	-0/+19
\| \|
* \|	html: start improving scope detection	Bryan Newbold	2020-11-08	2	-5/+49
\| \|
* \|	ingest: retain html_biblio through hops; all ingest types	Bryan Newbold	2020-11-08	1	-1/+13
\| \|
* \|	ingest tool: flag for HTML quick mode (CDX-only)	Bryan Newbold	2020-11-08	2	-1/+6
\| \|
* \|	html: try to detect and mark XHTML (vs. HTML or XML)	Bryan Newbold	2020-11-08	2	-4/+6
\| \|
* \|	gen_file_metadata: allow empty/null bodies (if flag set)	Bryan Newbold	2020-11-08	2	-3/+5
\| \|	This is for HTML sub-resources, which can validly be empty (I think)
* \|	html: missing fetch is wayback-content-error, not wayback-error	Bryan Newbold	2020-11-08	1	-2/+2
\| \|
* \|	direct some more warnings to sys.stderr, not stdout	Bryan Newbold	2020-11-08	1	-2/+2
\| \|
* \|	html: handle no-capture for sub-resources	Bryan Newbold	2020-11-08	3	-9/+13
\| \|
* \|	ingest tool: consistency about ingest-type arg	Bryan Newbold	2020-11-08	1	-2/+2
\| \|
* \|	ingest: fix null-body case	Bryan Newbold	2020-11-08	2	-0/+6
\| \|	Broke this in earlier refactor.
* \|	remove unused pytype tool	Bryan Newbold	2020-11-06	3	-76/+25
\| \|	Having trouble getting this to install on Xenial, and we aren't even using it in tests/lint yet. Can revisit after Focal upgrade.
* \|	gitlab CI: upgrade pip (pip3) in environment	Bryan Newbold	2020-11-06	1	-2/+3
\| \|
* \|	many bad PDF sha1 from prod	Bryan Newbold	2020-11-06	1	-0/+36
\| \|
* \|	Merge branch 'bnewbold-html-ingest'	Bryan Newbold	2020-11-06	40	-549/+8227
\|\ \
\| * \|	html: update proposal (docs)	Bryan Newbold	2020-11-06	1	-19/+49
\| \| \|
\| * \|	html: catch and report exceptions at process_hit() stage	Bryan Newbold	2020-11-06	1	-4/+27
\| \| \|
\| * \|	html: pdf and html extract similar to XML	Bryan Newbold	2020-11-06	2	-22/+55
\| \| \|	Note that the primary PDF URL extraction path is a separate code path.
\| * \|	html: refactors/tweaks from testing	Bryan Newbold	2020-11-06	3	-17/+23
\| \| \|
\| * \|	ia: use newer gwb (petabox) loading class	Bryan Newbold	2020-11-04	1	-5/+8
\| \| \|	This fixes zstandard WARC reading.
\| * \|	pipenv: fix lock file; add zstandard; update wayback+gwb deps	Bryan Newbold	2020-11-04	2	-27/+1069
\| \| \|
\| * \|	persist: fix worker API/typing hacks (raw_key, key, key_str)	Bryan Newbold	2020-11-04	1	-9/+9
\| \| \|