Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | off-by-one error in batch sizes | Bryan Newbold | 2019-09-26 | 1 | -1/+1 |
| | |||||
* | small improvements to GROBID tool | Bryan Newbold | 2019-09-26 | 2 | -2/+10 |
| | |||||
* | lots of grobid tool implementation (still WIP) | Bryan Newbold | 2019-09-26 | 8 | -16/+748 |
| | |||||
* | pylint as part of pytest; update lint config | Bryan Newbold | 2019-09-25 | 2 | -1/+17 |
| | |||||
* | pipfile update | Bryan Newbold | 2019-09-25 | 2 | -241/+70 |
| | remove hadoop stuff (mrjob, happybase, etc); add flask; add pytest-pylint plugin; reformat (automatic by newer pipenv) | ||||
* | test of GROBID client | Bryan Newbold | 2019-09-25 | 1 | -0/+53 |
| | |||||
* | move a bunch of random old scripts to subdir | Bryan Newbold | 2019-09-25 | 9 | -0/+0 |
| | |||||
* | get rid of old xml2json | Bryan Newbold | 2019-09-25 | 1 | -7/+0 |
| | |||||
* | refactor old python hadoop code into new directory | Bryan Newbold | 2019-09-25 | 10 | -1590/+0 |
| | |||||
* | re-write parse_cdx_line for sandcrawler lib | Bryan Newbold | 2019-09-25 | 3 | -2/+116 |
| | |||||
* | fix grobid2json test | Bryan Newbold | 2019-09-25 | 1 | -1/+4 |
| | For new extra fields | ||||
* | commit WIP on file ingest script | Bryan Newbold | 2019-09-23 | 1 | -0/+386 |
| | |||||
* | start refactoring sandcrawler python common code | Bryan Newbold | 2019-09-23 | 5 | -0/+131 |
| | |||||
* | update Pipfile with additional libraries | Bryan Newbold | 2019-09-23 | 2 | -77/+293 |
| | |||||
* | grobid2json: extract fatcat identifier | Bryan Newbold | 2019-09-20 | 1 | -1/+5 |
| | |||||
* | add filter_groupworks.py | Bryan Newbold | 2019-09-04 | 1 | -0/+144 |
| | For use with new release grouping/matching jobs. | ||||
* | large pipfile update | Bryan Newbold | 2019-09-04 | 1 | -375/+402 |
| | Covers some security changes, but might need to revert if this breaks things. Should make better use of version locking in the Pipfile to prevent unintentional large upgrades, especially when we don't have good test coverage in this repo. | ||||
* | ia_pdf_match.py bugfix | Bryan Newbold | 2019-07-07 | 1 | -4/+4 |
| | |||||
* | create deliver_gwb_to_disk.py | Bryan Newbold | 2019-07-07 | 1 | -0/+166 |
| | |||||
* | petabox journal files ingest updates | Bryan Newbold | 2019-06-20 | 1 | -0/+108 |
| | |||||
* | update grobid2json to include given_name/surname | Bryan Newbold | 2019-05-13 | 2 | -6/+10 |
| | |||||
* | deliver_dumpgrobid_to_s3: allow heritrix-style SHA-1 format | Bryan Newbold | 2019-05-10 | 1 | -0/+12 |
| | |||||
* | deliver_dumpgrobid_to_s3: storage class config | Bryan Newbold | 2019-05-09 | 1 | -1/+7 |
| | |||||
* | deliver_dumpgrobid_to_s3.py | Bryan Newbold | 2019-04-15 | 1 | -0/+106 |
| | |||||
* | python test fixes | Bryan Newbold | 2019-02-21 | 4 | -5/+8 |
| | |||||
* | backport GWB fetch improvements to extraction/kafka workers | Bryan Newbold | 2019-02-21 | 3 | -18/+50 |
| | *Really* need to refactor out these common methods into a base class. | ||||
* | don't print secret, and MRO pylint skip | Bryan Newbold | 2019-02-21 | 1 | -4/+6 |
| | |||||
* | update Pipfile | Bryan Newbold | 2019-02-21 | 2 | -266/+220 |
| | |||||
* | include file size in S3 uploads | Bryan Newbold | 2019-02-20 | 1 | -3/+3 |
| | |||||
* | delivery gwb counter tweaks | Bryan Newbold | 2019-02-20 | 1 | -2/+8 |
| | |||||
* | silly typo | Bryan Newbold | 2019-02-19 | 1 | -1/+1 |
| | |||||
* | fix empty blob errors | Bryan Newbold | 2019-02-19 | 1 | -1/+5 |
| | |||||
* | make PETABOX_WEBDATA_SECRET explicit | Bryan Newbold | 2019-02-19 | 1 | -1/+9 |
| | TODO: port this change to other workers; or better yet make GWB access a mixin or something | ||||
* | deliver python tweaks | Bryan Newbold | 2019-02-19 | 1 | -5/+8 |
| | |||||
* | add GWB-to-S3 delivery pipeline script | Bryan Newbold | 2019-02-19 | 2 | -0/+162 |
| | |||||
* | crank hbase GROBID worker memory usage down | Bryan Newbold | 2018-12-10 | 1 | -1/+1 |
| | |||||
* | increase message size (kafka-grobid-hbase) | Bryan Newbold | 2018-12-10 | 1 | -0/+2 |
| | |||||
* | add python-snappy dep | Bryan Newbold | 2018-12-10 | 2 | -84/+96 |
| | |||||
* | ah, right, it's more like extract/3sec, not 30sec | Bryan Newbold | 2018-12-03 | 1 | -4/+4 |
| | |||||
* | tweak grobid worker producer settings | Bryan Newbold | 2018-12-03 | 1 | -2/+2 |
| | Python CPU utilization shot way up; this is an attempt to bring it back down. | ||||
* | tweak kafka config significantly | Bryan Newbold | 2018-12-03 | 2 | -3/+18 |
| | |||||
* | more sentry tags when extracting | Bryan Newbold | 2018-12-03 | 1 | -1/+6 |
| | |||||
* | improvements to Kafka GROBID worker logging | Bryan Newbold | 2018-12-03 | 2 | -11/+22 |
| | |||||
* | work around kafka topic/group mistakes | Bryan Newbold | 2018-12-01 | 1 | -1/+1 |
| | |||||
* | fix error var typo | Bryan Newbold | 2018-11-27 | 1 | -1/+1 |
| | |||||
* | catch more wayback error types | Bryan Newbold | 2018-11-26 | 1 | -1/+11 |
| | |||||
* | fix ungrobid extraction tests | Bryan Newbold | 2018-11-22 | 1 | -2/+4 |
| | |||||
* | better default consumergroup name | Bryan Newbold | 2018-11-21 | 1 | -1/+1 |
| | |||||
* | many improvements to kafka HBase inserter | Bryan Newbold | 2018-11-21 | 1 | -29/+29 |
| | |||||
* | cherry-pick: correct HBase column filtering | Bryan Newbold | 2018-11-21 | 1 | -1/+1 |
| |