Branch             Commit message                                 Author         Age
master             html ingest: handle TEI-XML parse error        Bryan Newbold  2 months
bnewbold-args      make hbase_table and zookeeper_hosts CLI args  Bryan Newbold  4 years
bnewbold-backfill  make hbase_table and zookeeper_hosts CLI args  Bryan Newbold  4 years
 
 
Age         Commit message                                                       Author         Files  Lines
2022-07-28  html ingest: handle TEI-XML parse error (HEAD, master)               Bryan Newbold  1      -1/+4
2022-07-27  yet another bad PDF sha1                                             Bryan Newbold  1      -0/+1
2022-07-25  CDX: skip sha-256 digests                                            Bryan Newbold  1      -1/+5
2022-07-24  yet another bad SHA1 PDF hash                                        Bryan Newbold  1      -0/+1
2022-07-21  misc ingest fixes                                                    Bryan Newbold  1      -0/+831
2022-07-20  ingest: bump max-hops from 6 to 8                                    Bryan Newbold  1      -1/+1
2022-07-20  ingest: more PDF fulltext tricks                                     Bryan Newbold  2      -0/+36
2022-07-20  ingest: more PDF fulltext URL patterns                               Bryan Newbold  1      -0/+42
2022-07-20  doaj and unpaywall transforms: more domains to skip                  Bryan Newbold  2      -3/+1
2022-07-18  ingest: record bad GZIP transfer decode, instead of crashing (HTML)  Bryan Newbold  1      -1/+4
[...]
 
Clone
git@git.bnewbold.net:sandcrawler
https://git.bnewbold.net/sandcrawler
git://git.bnewbold.net/sandcrawler