| | Commit message | Author | Age | Files | Lines |
---|---|---|---|---|---|
* | large pipfile update | Bryan Newbold | 2019-09-04 | 1 | -375/+402 |
| | Covers some security changes, but might need to be reverted if this breaks things. We should make better use of version locking in the Pipfile to prevent unintentional large upgrades, especially when we don't have good test coverage in this repo. | | | | |
* | ia_pdf_match.py bugfix | Bryan Newbold | 2019-07-07 | 1 | -4/+4 |
| | |||||
* | create deliver_gwb_to_disk.py | Bryan Newbold | 2019-07-07 | 1 | -0/+166 |
| | |||||
* | petabox journal files ingest updates | Bryan Newbold | 2019-06-20 | 1 | -0/+108 |
| | |||||
* | update grobid2json to include given_name/surname | Bryan Newbold | 2019-05-13 | 2 | -6/+10 |
| | |||||
* | deliver_dumpgrobid_to_s3: allow heritrix-style SHA-1 format | Bryan Newbold | 2019-05-10 | 1 | -0/+12 |
| | |||||
* | deliver_dumpgrobid_to_s3: storage class config | Bryan Newbold | 2019-05-09 | 1 | -1/+7 |
| | |||||
* | deliver_dumpgrobid_to_s3.py | Bryan Newbold | 2019-04-15 | 1 | -0/+106 |
| | |||||
* | python test fixes | Bryan Newbold | 2019-02-21 | 4 | -5/+8 |
| | |||||
* | backport GWB fetch improvements to extraction/kafka workers | Bryan Newbold | 2019-02-21 | 3 | -18/+50 |
| | *Really* need to refactor out these common methods into a base class. | | | | |
* | don't print secret, and MRO pylint skip | Bryan Newbold | 2019-02-21 | 1 | -4/+6 |
| | |||||
* | update Pipfile | Bryan Newbold | 2019-02-21 | 2 | -266/+220 |
| | |||||
* | include file size in S3 uploads | Bryan Newbold | 2019-02-20 | 1 | -3/+3 |
| | |||||
* | delivery gwb counter tweaks | Bryan Newbold | 2019-02-20 | 1 | -2/+8 |
| | |||||
* | silly typo | Bryan Newbold | 2019-02-19 | 1 | -1/+1 |
| | |||||
* | fix empty blob errors | Bryan Newbold | 2019-02-19 | 1 | -1/+5 |
| | |||||
* | make PETABOX_WEBDATA_SECRET explicit | Bryan Newbold | 2019-02-19 | 1 | -1/+9 |
| | TODO: port this change to other workers; or better yet make GWB access a mixin or something | | | | |
* | deliver python tweaks | Bryan Newbold | 2019-02-19 | 1 | -5/+8 |
| | |||||
* | add GWB-to-S3 delivery pipeline script | Bryan Newbold | 2019-02-19 | 2 | -0/+162 |
| | |||||
* | crank hbase GROBID worker memory usage down | Bryan Newbold | 2018-12-10 | 1 | -1/+1 |
| | |||||
* | increase message size (kafka-grobid-hbase) | Bryan Newbold | 2018-12-10 | 1 | -0/+2 |
| | |||||
* | add python-snappy dep | Bryan Newbold | 2018-12-10 | 2 | -84/+96 |
| | |||||
* | ah, right, it's more like extract/3sec, not 30sec | Bryan Newbold | 2018-12-03 | 1 | -4/+4 |
| | |||||
* | tweak grobid worker producer settings | Bryan Newbold | 2018-12-03 | 1 | -2/+2 |
| | Python CPU utilization shot way up; this is an attempt to bring it back down. | | | | |
* | tweak kafka config significantly | Bryan Newbold | 2018-12-03 | 2 | -3/+18 |
| | |||||
* | more sentry tags when extracting | Bryan Newbold | 2018-12-03 | 1 | -1/+6 |
| | |||||
* | improvements to Kafka GROBID worker logging | Bryan Newbold | 2018-12-03 | 2 | -11/+22 |
| | |||||
* | work around kafka topic/group mistakes | Bryan Newbold | 2018-12-01 | 1 | -1/+1 |
| | |||||
* | fix error var typo | Bryan Newbold | 2018-11-27 | 1 | -1/+1 |
| | |||||
* | catch more wayback error types | Bryan Newbold | 2018-11-26 | 1 | -1/+11 |
| | |||||
* | fix ungrobid extraction tests | Bryan Newbold | 2018-11-22 | 1 | -2/+4 |
| | |||||
* | better default consumergroup name | Bryan Newbold | 2018-11-21 | 1 | -1/+1 |
| | |||||
* | many improvements to kafka HBase inserter | Bryan Newbold | 2018-11-21 | 1 | -29/+29 |
| | |||||
* | cherry-pick: correct HBase column filtering | Bryan Newbold | 2018-11-21 | 1 | -1/+1 |
| | |||||
* | fixes to hbase worker | Bryan Newbold | 2018-11-21 | 1 | -1/+13 |
| | |||||
* | fix kafka grobid command line topic parsing | Bryan Newbold | 2018-11-21 | 2 | -3/+9 |
| | |||||
* | kafka_grobid_hbase (not 'ed') | Bryan Newbold | 2018-11-21 | 1 | -0/+0 |
| | |||||
* | kafka_grobid fixes and hbase WIP | Bryan Newbold | 2018-11-21 | 2 | -2/+179 |
| | |||||
* | small kafka_grobid tweaks | Bryan Newbold | 2018-11-21 | 1 | -1/+2 |
| | |||||
* | updated Pipfile.lock (VERY SLOW) | Bryan Newbold | 2018-11-21 | 1 | -548/+431 |
| | |||||
* | kafka_grobid tweaks for deployment; delay insert decision | Bryan Newbold | 2018-11-21 | 1 | -21/+9 |
| | |||||
* | initial work on kafka_grobid worker | Bryan Newbold | 2018-11-20 | 2 | -0/+296 |
| | |||||
* | one more lint ignore | Bryan Newbold | 2018-10-30 | 1 | -1/+1 |
| | |||||
* | squelch some more lint warnings | Bryan Newbold | 2018-10-30 | 1 | -1/+1 |
| | |||||
* | several bugs and lint issues in import_grobid_metadata | Bryan Newbold | 2018-10-30 | 1 | -9/+10 |
| | |||||
* | some progress on a crude grobid metadata filter | Bryan Newbold | 2018-09-26 | 2 | -7/+151 |
| | |||||
* | longtail grobid metadata parse/filter WIP | Bryan Newbold | 2018-09-22 | 3 | -0/+114 |
| | |||||
* | fix sha1/doi_list confusion in filter_scored_matches | Bryan Newbold | 2018-09-22 | 1 | -2/+2 |
| | |||||
* | pylint can be insufferable | Bryan Newbold | 2018-09-20 | 1 | -1/+1 |
| | |||||
* | gitignore in python dir | Bryan Newbold | 2018-09-18 | 1 | -0/+3 |