index : sandcrawler
path: root/python/sandcrawler/ingest.py
| Commit message | Author | Age | Files | Lines |
|---|---|---|---|---|
| xml: catch parse error | Bryan Newbold | 2020-11-19 | 1 | -3/+8 |
| ingest: small html_bibli typo | Bryan Newbold | 2020-11-08 | 1 | -1/+1 |
| move some PDF URL extraction into declarative format | Bryan Newbold | 2020-11-08 | 1 | -9/+7 |
| ingest: default to html_biblio for PDF URL extraction | Bryan Newbold | 2020-11-08 | 1 | -24/+17 |
| ingest: shorted scope+platform keys; use html_biblio extraction for PDFs | Bryan Newbold | 2020-11-08 | 1 | -15/+35 |
| ingest html: return better status based on sniffed scope | Bryan Newbold | 2020-11-08 | 1 | -9/+31 |
| html: start improving scope detection | Bryan Newbold | 2020-11-08 | 1 | -1/+1 |
| ingest: retain html_biblio through hops; all ingest types | Bryan Newbold | 2020-11-08 | 1 | -1/+13 |
| ingest tool: flag for HTML quick mode (CDX-only) | Bryan Newbold | 2020-11-08 | 1 | -1/+2 |
| html: try to detect and mark XHTML (vs. HTML or XML) | Bryan Newbold | 2020-11-08 | 1 | -2/+2 |
| html: handle no-capture for sub-resources | Bryan Newbold | 2020-11-08 | 1 | -1/+5 |
| ingest: fix null-body case | Bryan Newbold | 2020-11-08 | 1 | -0/+4 |
| html: catch and report exceptions at process_hit() stage | Bryan Newbold | 2020-11-06 | 1 | -4/+27 |
| html: pdf and html extract similar to XML | Bryan Newbold | 2020-11-06 | 1 | -2/+25 |
| html: refactors/tweaks from testing | Bryan Newbold | 2020-11-06 | 1 | -4/+5 |
| html: actually publish HTML TEI-XML to body; fix dataflow though ingest a bit | Bryan Newbold | 2020-11-04 | 1 | -5/+25 |
| initial implementation of HTML ingest in existing worker | Bryan Newbold | 2020-11-04 | 1 | -5/+50 |
| small fixes from local testing for XML ingest | Bryan Newbold | 2020-11-03 | 1 | -1/+1 |
| xml: re-encode XML docs into UTF-8 for persisting | Bryan Newbold | 2020-11-03 | 1 | -1/+3 |
| ingest: handle publishing XML docs to kafka | Bryan Newbold | 2020-11-03 | 1 | -3/+21 |
| basic support for XML ingest in worker | Bryan Newbold | 2020-11-03 | 1 | -23/+40 |
| ingest: cleanups, typing, start generalizing to xml and html | Bryan Newbold | 2020-11-03 | 1 | -122/+118 |
| ingest: tweak debug printing alignment | Bryan Newbold | 2020-11-03 | 1 | -3/+3 |
| ingest: add more IA domains | Bryan Newbold | 2020-11-03 | 1 | -0/+2 |
| ingest: skip JSTOR DOI prefixes | Bryan Newbold | 2020-10-23 | 1 | -0/+3 |
| ingest: fix WaybackContentError typo | Bryan Newbold | 2020-10-21 | 1 | -1/+1 |
| ingest: add a check for blocked-cookie before trying PDF url extraction | Bryan Newbold | 2020-10-21 | 1 | -0/+11 |
| differential wayback-error from wayback-content-error | Bryan Newbold | 2020-10-21 | 1 | -1/+5 |
| ingest: add a cdx-error slowdown delay | Bryan Newbold | 2020-10-19 | 1 | -0/+3 |
| ingest: fix old_failure datetime | Bryan Newbold | 2020-10-19 | 1 | -1/+1 |
| ingest: try SPNv2 for no-capture and old failures | Bryan Newbold | 2020-10-19 | 1 | -1/+5 |
| ingest: disable soft404 and non-hit SPNv2 retries | Bryan Newbold | 2020-10-19 | 1 | -4/+5 |
| store no-capture URLs in terminal_url | Bryan Newbold | 2020-10-12 | 1 | -2/+2 |
| ingest: small bugfix to print pdfextract status on SUCCESS | Bryan Newbold | 2020-09-17 | 1 | -1/+1 |
| ingest: treat text/xml as XHTML in pdf ingest | Bryan Newbold | 2020-09-14 | 1 | -1/+1 |
| additional loginwall patterns | Bryan Newbold | 2020-08-11 | 1 | -0/+2 |
| ingest: actually use force_get flag with SPN | Bryan Newbold | 2020-08-11 | 1 | -0/+13 |
| check for simple URL patterns that are usually paywalls or loginwalls | Bryan Newbold | 2020-08-11 | 1 | -0/+11 |
| ingest: check for URL blocklist and cookie URL patterns on every hop | Bryan Newbold | 2020-08-11 | 1 | -0/+13 |
| refactor: force_get -> force_simple_get | Bryan Newbold | 2020-08-11 | 1 | -3/+3 |
| add hkvalidate.perfdrive.com to domain blocklist | Bryan Newbold | 2020-08-08 | 1 | -0/+3 |
| pdfextract support in ingest worker | Bryan Newbold | 2020-06-25 | 1 | -1/+35 |
| workers: refactor to pass key to process() | Bryan Newbold | 2020-06-17 | 1 | -2/+2 |
| ingest: don't 'want' non-PDF ingest | Bryan Newbold | 2020-04-30 | 1 | -0/+5 |
| timeout message implementation for GROBID and ingest workers | Bryan Newbold | 2020-04-27 | 1 | -0/+9 |
| ingest: block another large domain (and DOI prefix) | Bryan Newbold | 2020-03-27 | 1 | -0/+2 |
| ingest: clean_url() in more places | Bryan Newbold | 2020-03-23 | 1 | -0/+1 |
| implement (unused) force_get flag for SPN2 | Bryan Newbold | 2020-03-18 | 1 | -1/+15 |
| url cleaning (canonicalization) for ingest base_url | Bryan Newbold | 2020-03-10 | 1 | -2/+6 |
| ingest: make content-decoding more robust | Bryan Newbold | 2020-03-03 | 1 | -1/+2 |