Commit message (Author, Age, Files, Lines)
...
| * CdxBackfillJob: implement other fields (Bryan Newbold, 2018-07-24, 2 files, -19/+84)
| |
| * CdxBackfillJob back to HBase; tests work (Bryan Newbold, 2018-07-24, 2 files, -15/+13)
| |
| * variant of CdxBackfillJob that writes to TSV (Bryan Newbold, 2018-07-24, 2 files, -0/+286)
| |     Has the same test failure ("java.lang.IndexOutOfBoundsException: Index: 1, Size: 1")
* | cdx_collection.py: minor lint issue (Bryan Newbold, 2021-10-04, 1 file, -1/+1)
| |
* | ingest: basic 'component' and 'src' support (Bryan Newbold, 2021-10-04, 4 files, -20/+251)
| |
* | old (2020) notes on pdfextract cleanup (Bryan Newbold, 2021-10-04, 1 file, -0/+74)
| |
* | notes on dumping PDF URL lists for partners (Bryan Newbold, 2021-10-04, 1 file, -0/+66)
| |
* | new SQL recent SPN request monitoring query (Bryan Newbold, 2021-10-04, 1 file, -0/+32)
| |
* | html ingest: report dt with broken CDX records (Bryan Newbold, 2021-10-04, 1 file, -1/+1)
| |
* | allow through unknown-scope HTML ingests, for possible SPN import (Bryan Newbold, 2021-10-01, 1 file, -11/+5)
| |
* | html: fix logging of broken CDX URL (Bryan Newbold, 2021-10-01, 1 file, -1/+1)
| |
* | ingest CDX lookup: weigh year+month of capture against in-petabox-or-not (Bryan Newbold, 2021-09-30, 1 file, -0/+1)
| |     This is to try working around an issue where ingests fail because an SPN capture is much newer, but the old sorting preference ignored that. Note that the sorting logic is pretty busted anyways, and we should probably allow returning multiple matching files to try.
* | fix typo with spn_cdx_retry_sec arg (Bryan Newbold, 2021-09-30, 1 file, -1/+1)
| |
* | tune SPN CDX retry/wait depending on mode (priority vs daily) (Bryan Newbold, 2021-09-30, 3 files, -3/+9)
| |
* | refactor reingest scripts (Bryan Newbold, 2021-09-30, 6 files, -150/+90)
| |
* | yet another bad PDF sha1 (Bryan Newbold, 2021-09-30, 1 file, -0/+1)
| |
* | new 'daily' and 'priority' ingest request topics (Bryan Newbold, 2021-09-30, 4 files, -5/+17)
| |     The old ingest request queue was always getting lopsided, suspect because it was scaled up (additional partitions) at some point in the past, hoping new topics will fix this. New '-priority' queue is like '-bulk', but for smaller-volume SPN-like requests. Eg, interactive mode.
* | kafka: delete unused work-updates topic (Bryan Newbold, 2021-09-13, 1 file, -4/+0)
| |
* | old HTML extractors: handle null tag (Bryan Newbold, 2021-09-08, 1 file, -8/+9)
| |
* | ingest: more block patterns, for huge databases (Bryan Newbold, 2021-09-08, 1 file, -1/+4)
| |
* | daily OA crawl improvements/notes (Bryan Newbold, 2021-09-08, 1 file, -0/+1021)
| |
* | yet more PDF sha1 to skip (Bryan Newbold, 2021-09-03, 1 file, -0/+5)
| |
* | yet more PDF URL patterns (Bryan Newbold, 2021-09-03, 1 file, -0/+48)
| |
* | ingest: check URL blocklist again after redirects (Bryan Newbold, 2021-09-03, 1 file, -0/+7)
| |
* | OAI-PMH patch and ingest improvement notes (Bryan Newbold, 2021-09-03, 2 files, -204/+1578)
| |
* | commit old patch crawl notes (dec 2020) (Bryan Newbold, 2021-09-03, 1 file, -0/+1)
| |
* | commit old arxiv ingest notes (Bryan Newbold, 2021-09-03, 1 file, -0/+12)
| |
* | kafka re-balancing tweaks (Bryan Newbold, 2021-09-03, 1 file, -1/+2)
| |
* | refactor and expand wall/block/cookie URL patterns (Bryan Newbold, 2021-09-03, 2 files, -6/+39)
| |
* | HTML ingest: several more PDF fulltext URL patterns (Bryan Newbold, 2021-09-03, 1 file, -0/+87)
| |
* | HTML ingest: skip noisy print() statement (Bryan Newbold, 2021-09-03, 1 file, -1/+1)
| |
* | commit old patch notes (will rework) (Bryan Newbold, 2021-09-03, 1 file, -0/+110)
| |
* | MAG post-crawl stats (5m+ new PDFs crawled successfully) (Bryan Newbold, 2021-09-02, 1 file, -0/+124)
| |
* | HTML ingest: more meta-URI prefixes (Bryan Newbold, 2021-08-24, 1 file, -2/+8)
| |
* | html ingest: detect some blog platforms, and allow lower wordcount threshold (Bryan Newbold, 2021-08-16, 1 file, -0/+6)
| |
* | html ingest: detect domain homepage (no path) as special case (Bryan Newbold, 2021-08-16, 1 file, -0/+8)
| |
* | html ingest: skip 'about:blank' (Bryan Newbold, 2021-08-16, 1 file, -0/+3)
| |     Couldn't get adblock rule matcher to match this, for some reason. maybe a special case?
* | MAG and OAI-PMH crawl/processing notes (Bryan Newbold, 2021-08-13, 2 files, -0/+480)
| |
* | 2021-07 unpaywall crawl wrap-up notes (Bryan Newbold, 2021-07-30, 1 file, -12/+108)
| |
* | more bad PDF hashes (Bryan Newbold, 2021-07-26, 1 file, -0/+2)
| |
* | ingest: fix postgrest lookup bug (double get of GROBID) (Bryan Newbold, 2021-07-26, 1 file, -1/+1)
| |
* | reingest: skip spn2 'unknown' errors (Bryan Newbold, 2021-07-21, 2 files, -0/+2)
| |
* | more blocked-cookie patterns; fix old typo (Bryan Newbold, 2021-07-14, 1 file, -2/+2)
| |
* | unpaywall 2021-07 crawl partial notes (Bryan Newbold, 2021-07-14, 1 file, -0/+224)
| |
* | CI: wget used in pig CI scripts (Bryan Newbold, 2021-07-13, 1 file, -1/+1)
| |
* | CI: new sbt debian repository (Bryan Newbold, 2021-07-13, 1 file, -4/+3)
| |
* | Revert "CI: sbt bintray is gone, but ubuntu focal version should work" (Bryan Newbold, 2021-07-13, 1 file, -0/+5)
| |     This reverts commit 9aebb88aac2b620e15756b4e1531989d9d32cd43.
* | CI: sbt bintray is gone, but ubuntu focal version should work (Bryan Newbold, 2021-07-13, 1 file, -5/+0)
| |
* | another bad PDF sha1 (Bryan Newbold, 2021-07-13, 1 file, -0/+1)
| |
* | crawl: SPN2 non-200 success code path (Bryan Newbold, 2021-07-13, 1 file, -11/+25)
| |