Commit message  Author  Age  Files  Lines
* spn: avoid 'None' job_id  Bryan Newbold  2021-10-11  1  -2/+2
|     Thanks Vanglis for reporting these. Not sure this commit fixes *all* instances of the problem.
* Merge branch 'bnewbold-backfill' into 'master'  bnewbold  2021-10-04  3  -0/+384
|\
| |     CDX Backfill (scalding version)
| |     See merge request webgroup/sandcrawler!12
| * temporary please option for scala backfill  Bryan Newbold  2018-07-24  1  -0/+22
| |
| * small CdxBackfillJob refactor (code quality)  Bryan Newbold  2018-07-24  1  -5/+5
| |
| * do sha1 pattern match correctly  Bryan Newbold  2018-07-24  2  -3/+18
| |
| * more PDF mimetypes; fix return refactor  Bryan Newbold  2018-07-24  1  -2/+5
| |
| * CdxBackfillJob: comment cleanup  Bryan Newbold  2018-07-24  1  -6/+0
| |
| * CdxBackfillJob: scalastyle  Bryan Newbold  2018-07-24  1  -22/+14
| |
| * address some (but not all) review comments  Bryan Newbold  2018-07-24  1  -20/+21
| |
| * reference TDsl note in docs  Bryan Newbold  2018-07-24  1  -0/+16
| |
| * fix CdxBackfillJob tests  Bryan Newbold  2018-07-24  2  -6/+13
| |
| * some scalastyle on CdxBackfillJob  Bryan Newbold  2018-07-24  1  -7/+8
| |
| * CdxBackfillJob: implement other fields  Bryan Newbold  2018-07-24  2  -19/+84
| |
| * CdxBackfillJob back to HBase; tests work  Bryan Newbold  2018-07-24  2  -15/+13
| |
| * variant of CdxBackfillJob that writes to TSV  Bryan Newbold  2018-07-24  2  -0/+286
| |     Has the same test failure ("java.lang.IndexOutOfBoundsException: Index: 1, Size: 1")
* | cdx_collection.py: minor lint issue  Bryan Newbold  2021-10-04  1  -1/+1
| |
* | ingest: basic 'component' and 'src' support  Bryan Newbold  2021-10-04  4  -20/+251
| |
* | old (2020) notes on pdfextract cleanup  Bryan Newbold  2021-10-04  1  -0/+74
| |
* | notes on dumping PDF URL lists for partners  Bryan Newbold  2021-10-04  1  -0/+66
| |
* | new SQL recent SPN request monitoring query  Bryan Newbold  2021-10-04  1  -0/+32
| |
* | html ingest: report dt with broken CDX records  Bryan Newbold  2021-10-04  1  -1/+1
| |
* | allow through unknown-scope HTML ingests, for possible SPN import  Bryan Newbold  2021-10-01  1  -11/+5
| |
* | html: fix logging of broken CDX URL  Bryan Newbold  2021-10-01  1  -1/+1
| |
* | ingest CDX lookup: weigh year+month of capture against in-petabox-or-not  Bryan Newbold  2021-09-30  1  -0/+1
| |     This is to try working around an issue where ingests fail because an SPN capture is much newer, but the old sorting preference ignored that.
| |     Note that the sorting logic is pretty busted anyways, and we should probably allow returning multiple matching files to try.
* | fix typo with spn_cdx_retry_sec arg  Bryan Newbold  2021-09-30  1  -1/+1
| |
* | tune SPN CDX retry/wait depending on mode (priority vs daily)  Bryan Newbold  2021-09-30  3  -3/+9
| |
* | refactor reingest scripts  Bryan Newbold  2021-09-30  6  -150/+90
| |
* | yet another bad PDF sha1  Bryan Newbold  2021-09-30  1  -0/+1
| |
* | new 'daily' and 'priority' ingest request topics  Bryan Newbold  2021-09-30  4  -5/+17
| |     The old ingest request queue was always getting lopsided, suspect because it was scaled up (additional partitions) at some point in the past, hoping new topics will fix this.
| |     New '-priority' queue is like '-bulk', but for smaller-volume SPN-like requests. Eg, interactive mode.
* | kafka: delete unused work-updates topic  Bryan Newbold  2021-09-13  1  -4/+0
| |
* | old HTML extractors: handle null tag  Bryan Newbold  2021-09-08  1  -8/+9
| |
* | ingest: more block patterns, for huge databases  Bryan Newbold  2021-09-08  1  -1/+4
| |
* | daily OA crawl improvements/notes  Bryan Newbold  2021-09-08  1  -0/+1021
| |
* | yet more PDF sha1 to skip  Bryan Newbold  2021-09-03  1  -0/+5
| |
* | yet more PDF URL patterns  Bryan Newbold  2021-09-03  1  -0/+48
| |
* | ingest: check URL blocklist again after redirects  Bryan Newbold  2021-09-03  1  -0/+7
| |
* | OAI-PMH patch and ingest improvement notes  Bryan Newbold  2021-09-03  2  -204/+1578
| |
* | commit old patch crawl notes (dec 2020)  Bryan Newbold  2021-09-03  1  -0/+1
| |
* | commit old arxiv ingest notes  Bryan Newbold  2021-09-03  1  -0/+12
| |
* | kafka re-balancing tweaks  Bryan Newbold  2021-09-03  1  -1/+2
| |
* | refactor and expand wall/block/cookie URL patterns  Bryan Newbold  2021-09-03  2  -6/+39
| |
* | HTML ingest: several more PDF fulltext URL patterns  Bryan Newbold  2021-09-03  1  -0/+87
| |
* | HTML ingest: skip noisy print() statement  Bryan Newbold  2021-09-03  1  -1/+1
| |
* | commit old patch notes (will rework)  Bryan Newbold  2021-09-03  1  -0/+110
| |
* | MAG post-crawl stats (5m+ new PDFs crawled successfully)  Bryan Newbold  2021-09-02  1  -0/+124
| |
* | HTML ingest: more meta-URI prefixes  Bryan Newbold  2021-08-24  1  -2/+8
| |
* | html ingest: detect some blog platforms, and allow lower wordcount threshold  Bryan Newbold  2021-08-16  1  -0/+6
| |
* | html ingest: detect domain homepage (no path) as special case  Bryan Newbold  2021-08-16  1  -0/+8
| |
* | html ingest: skip 'about:blank'  Bryan Newbold  2021-08-16  1  -0/+3
| |     Couldn't get adblock rule matcher to match this, for some reason. maybe a special case?
* | MAG and OAI-PMH crawl/processing notes  Bryan Newbold  2021-08-13  2  -0/+480
| |