Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | Merge branch 'bnewbold-backfill' into 'master' | bnewbold | 2021-10-04 | 3 | -0/+384 |
|\ | CDX Backfill (scalding version). See merge request webgroup/sandcrawler!12 | ||||
| * | temporary please option for scala backfill | Bryan Newbold | 2018-07-24 | 1 | -0/+22 |
| | | |||||
| * | small CdxBackfillJob refactor (code quality) | Bryan Newbold | 2018-07-24 | 1 | -5/+5 |
| | | |||||
| * | do sha1 pattern match correctly | Bryan Newbold | 2018-07-24 | 2 | -3/+18 |
| | | |||||
| * | more PDF mimetypes; fix return refactor | Bryan Newbold | 2018-07-24 | 1 | -2/+5 |
| | | |||||
| * | CdxBackfillJob: comment cleanup | Bryan Newbold | 2018-07-24 | 1 | -6/+0 |
| | | |||||
| * | CdxBackfillJob: scalastyle | Bryan Newbold | 2018-07-24 | 1 | -22/+14 |
| | | |||||
| * | address some (but not all) review comments | Bryan Newbold | 2018-07-24 | 1 | -20/+21 |
| | | |||||
| * | reference TDsl note in docs | Bryan Newbold | 2018-07-24 | 1 | -0/+16 |
| | | |||||
| * | fix CdxBackfillJob tests | Bryan Newbold | 2018-07-24 | 2 | -6/+13 |
| | | |||||
| * | some scalastyle on CdxBackfillJob | Bryan Newbold | 2018-07-24 | 1 | -7/+8 |
| | | |||||
| * | CdxBackfillJob: implement other fields | Bryan Newbold | 2018-07-24 | 2 | -19/+84 |
| | | |||||
| * | CdxBackfillJob back to HBase; tests work | Bryan Newbold | 2018-07-24 | 2 | -15/+13 |
| | | |||||
| * | variant of CdxBackfillJob that writes to TSV | Bryan Newbold | 2018-07-24 | 2 | -0/+286 |
| | | Has the same test failure ("java.lang.IndexOutOfBoundsException: Index: 1, Size: 1") | ||||
* | | cdx_collection.py: minor lint issue | Bryan Newbold | 2021-10-04 | 1 | -1/+1 |
| | | |||||
* | | ingest: basic 'component' and 'src' support | Bryan Newbold | 2021-10-04 | 4 | -20/+251 |
| | | |||||
* | | old (2020) notes on pdfextract cleanup | Bryan Newbold | 2021-10-04 | 1 | -0/+74 |
| | | |||||
* | | notes on dumping PDF URL lists for partners | Bryan Newbold | 2021-10-04 | 1 | -0/+66 |
| | | |||||
* | | new SQL recent SPN request monitoring query | Bryan Newbold | 2021-10-04 | 1 | -0/+32 |
| | | |||||
* | | html ingest: report dt with broken CDX records | Bryan Newbold | 2021-10-04 | 1 | -1/+1 |
| | | |||||
* | | allow through unknown-scope HTML ingests, for possible SPN import | Bryan Newbold | 2021-10-01 | 1 | -11/+5 |
| | | |||||
* | | html: fix logging of broken CDX URL | Bryan Newbold | 2021-10-01 | 1 | -1/+1 |
| | | |||||
* | | ingest CDX lookup: weigh year+month of capture against in-petabox-or-not | Bryan Newbold | 2021-09-30 | 1 | -0/+1 |
| | | This is to try working around an issue where ingests fail because an SPN capture is much newer, but the old sorting preference ignored that. Note that the sorting logic is pretty busted anyways, and we should probably allow returning multiple matching files to try. | ||||
* | | fix typo with spn_cdx_retry_sec arg | Bryan Newbold | 2021-09-30 | 1 | -1/+1 |
| | | |||||
* | | tune SPN CDX retry/wait depending on mode (priority vs daily) | Bryan Newbold | 2021-09-30 | 3 | -3/+9 |
| | | |||||
* | | refactor reingest scripts | Bryan Newbold | 2021-09-30 | 6 | -150/+90 |
| | | |||||
* | | yet another bad PDF sha1 | Bryan Newbold | 2021-09-30 | 1 | -0/+1 |
| | | |||||
* | | new 'daily' and 'priority' ingest request topics | Bryan Newbold | 2021-09-30 | 4 | -5/+17 |
| | | The old ingest request queue was always getting lopsided, suspect because it was scaled up (additional partitions) at some point in the past, hoping new topics will fix this. New '-priority' queue is like '-bulk', but for smaller-volume SPN-like requests. Eg, interactive mode. | ||||
* | | kafka: delete unused work-updates topic | Bryan Newbold | 2021-09-13 | 1 | -4/+0 |
| | | |||||
* | | old HTML extractors: handle null tag | Bryan Newbold | 2021-09-08 | 1 | -8/+9 |
| | | |||||
* | | ingest: more block patterns, for huge databases | Bryan Newbold | 2021-09-08 | 1 | -1/+4 |
| | | |||||
* | | daily OA crawl improvements/notes | Bryan Newbold | 2021-09-08 | 1 | -0/+1021 |
| | | |||||
* | | yet more PDF sha1 to skip | Bryan Newbold | 2021-09-03 | 1 | -0/+5 |
| | | |||||
* | | yet more PDF URL patterns | Bryan Newbold | 2021-09-03 | 1 | -0/+48 |
| | | |||||
* | | ingest: check URL blocklist again after redirects | Bryan Newbold | 2021-09-03 | 1 | -0/+7 |
| | | |||||
* | | OAI-PMH patch and ingest improvement notes | Bryan Newbold | 2021-09-03 | 2 | -204/+1578 |
| | | |||||
* | | commit old patch crawl notes (dec 2020) | Bryan Newbold | 2021-09-03 | 1 | -0/+1 |
| | | |||||
* | | commit old arxiv ingest notes | Bryan Newbold | 2021-09-03 | 1 | -0/+12 |
| | | |||||
* | | kafka re-balancing tweaks | Bryan Newbold | 2021-09-03 | 1 | -1/+2 |
| | | |||||
* | | refactor and expand wall/block/cookie URL patterns | Bryan Newbold | 2021-09-03 | 2 | -6/+39 |
| | | |||||
* | | HTML ingest: several more PDF fulltext URL patterns | Bryan Newbold | 2021-09-03 | 1 | -0/+87 |
| | | |||||
* | | HTML ingest: skip noisy print() statement | Bryan Newbold | 2021-09-03 | 1 | -1/+1 |
| | | |||||
* | | commit old patch notes (will rework) | Bryan Newbold | 2021-09-03 | 1 | -0/+110 |
| | | |||||
* | | MAG post-crawl stats (5m+ new PDFs crawled successfully) | Bryan Newbold | 2021-09-02 | 1 | -0/+124 |
| | | |||||
* | | HTML ingest: more meta-URI prefixes | Bryan Newbold | 2021-08-24 | 1 | -2/+8 |
| | | |||||
* | | html ingest: detect some blog platforms, and allow lower wordcount threshold | Bryan Newbold | 2021-08-16 | 1 | -0/+6 |
| | | |||||
* | | html ingest: detect domain homepage (no path) as special case | Bryan Newbold | 2021-08-16 | 1 | -0/+8 |
| | | |||||
* | | html ingest: skip 'about:blank' | Bryan Newbold | 2021-08-16 | 1 | -0/+3 |
| | | Couldn't get adblock rule matcher to match this, for some reason. maybe a special case? | ||||
* | | MAG and OAI-PMH crawl/processing notes | Bryan Newbold | 2021-08-13 | 2 | -0/+480 |
| | | |||||
* | | 2021-07 unpaywall crawl wrap-up notes | Bryan Newbold | 2021-07-30 | 1 | -12/+108 |
| | |