path: root/python
Commit message | Author | Age | Files | Lines
* test of GROBID client | Bryan Newbold | 2019-09-25 | 1 | -0/+53
* move a bunch of random old scripts to subdir | Bryan Newbold | 2019-09-25 | 9 | -0/+0
* get rid of old xml2json | Bryan Newbold | 2019-09-25 | 1 | -7/+0
* refactor old python hadoop code into new directory | Bryan Newbold | 2019-09-25 | 10 | -1590/+0
* re-write parse_cdx_line for sandcrawler lib | Bryan Newbold | 2019-09-25 | 3 | -2/+116
* fix test grobid2json test | Bryan Newbold | 2019-09-25 | 1 | -1/+4
  For new extra fields
* commit WIP on file ingest script | Bryan Newbold | 2019-09-23 | 1 | -0/+386
* start refactoring sandcrawler python common code | Bryan Newbold | 2019-09-23 | 5 | -0/+131
* update Pipfile with additional libraries | Bryan Newbold | 2019-09-23 | 2 | -77/+293
* grobid2json: extract fatcat identifier | Bryan Newbold | 2019-09-20 | 1 | -1/+5
* add filter_groupworks.py | Bryan Newbold | 2019-09-04 | 1 | -0/+144
  For use with new release grouping/matching jobs.
* large pipfile update | Bryan Newbold | 2019-09-04 | 1 | -375/+402
  Covers some security changes, but might need to revert if this breaks things. Should use version locking in Pipefile better to prevent unintentional large upgrades, especially when we don't have good test coverage in this repo.
* ia_pdf_match.py bugfix | Bryan Newbold | 2019-07-07 | 1 | -4/+4
* create deliver_gwb_to_disk.py | Bryan Newbold | 2019-07-07 | 1 | -0/+166
* petabox journal files ingest updates | Bryan Newbold | 2019-06-20 | 1 | -0/+108
* update grobid2json to include given_name/surname | Bryan Newbold | 2019-05-13 | 2 | -6/+10
* deliver_dumpgrobid_to_s3: allow heritrix-style SHA-1 format | Bryan Newbold | 2019-05-10 | 1 | -0/+12
* deliver_dumpgrobid_to_s3: storage class config | Bryan Newbold | 2019-05-09 | 1 | -1/+7
* deliver_dumpgrobid_to_s3.py | Bryan Newbold | 2019-04-15 | 1 | -0/+106
* python test fixes | Bryan Newbold | 2019-02-21 | 4 | -5/+8
* backport GWB fetch improvements to extraction/kafka workers | Bryan Newbold | 2019-02-21 | 3 | -18/+50
  *Really* need to refactor out these common methods into a base class.
* don't print secret, and MRO pylint skip | Bryan Newbold | 2019-02-21 | 1 | -4/+6
* update Pipefile | Bryan Newbold | 2019-02-21 | 2 | -266/+220
* include file size in S3 uploads | Bryan Newbold | 2019-02-20 | 1 | -3/+3
* delivery gwb counter tweaks | Bryan Newbold | 2019-02-20 | 1 | -2/+8
* silly typo | Bryan Newbold | 2019-02-19 | 1 | -1/+1
* fix empty blob errors | Bryan Newbold | 2019-02-19 | 1 | -1/+5
* make PETABOX_WEBDATA_SECRET explicit | Bryan Newbold | 2019-02-19 | 1 | -1/+9
  TODO: port this change to other workers; or better yet make GWB access a mixin or something
* deliver python tweaks | Bryan Newbold | 2019-02-19 | 1 | -5/+8
* add GWB-to-S3 delivery pipeline script | Bryan Newbold | 2019-02-19 | 2 | -0/+162
* crank hbase GROBID worker memory usage down | Bryan Newbold | 2018-12-10 | 1 | -1/+1
* increase message size (kafka-grobid-hbase) | Bryan Newbold | 2018-12-10 | 1 | -0/+2
* add python-snappy dep | Bryan Newbold | 2018-12-10 | 2 | -84/+96
* ah, right, it's more like extract/3sec, not 30sec | Bryan Newbold | 2018-12-03 | 1 | -4/+4
* tweak grobid worker producer settings | Bryan Newbold | 2018-12-03 | 1 | -2/+2
  Python CPU utilization shot way up; this is an attempt to bring it back down.
* tweak kafka config significantly | Bryan Newbold | 2018-12-03 | 2 | -3/+18
* more sentry tags when extracting | Bryan Newbold | 2018-12-03 | 1 | -1/+6
* improvements to Kafka GROBID worker logging | Bryan Newbold | 2018-12-03 | 2 | -11/+22
* work around kafka topic/group mistakes | Bryan Newbold | 2018-12-01 | 1 | -1/+1
* fix error var typo | Bryan Newbold | 2018-11-27 | 1 | -1/+1
* catch more wayback error types | Bryan Newbold | 2018-11-26 | 1 | -1/+11
* fix ungrobid extraction tests | Bryan Newbold | 2018-11-22 | 1 | -2/+4
* better default consumergroup name | Bryan Newbold | 2018-11-21 | 1 | -1/+1
* many improvements to kafka HBase inserter | Bryan Newbold | 2018-11-21 | 1 | -29/+29
* cherry-pick: correct HBase column filtering | Bryan Newbold | 2018-11-21 | 1 | -1/+1
* fixes to hbase worker | Bryan Newbold | 2018-11-21 | 1 | -1/+13
* fix kafka grobid command line topic parsing | Bryan Newbold | 2018-11-21 | 2 | -3/+9
* kafka_grobid_hbase (not 'ed') | Bryan Newbold | 2018-11-21 | 1 | -0/+0
* kafka_grobid fixes and hbase WIP | Bryan Newbold | 2018-11-21 | 2 | -2/+179
* small kafka_grobid tweaks | Bryan Newbold | 2018-11-21 | 1 | -1/+2