Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | extraction: do want content, not text | Bryan Newbold | 2018-08-21 | 1 | -1/+1 |
| | | | | XML can have non-unicode characters? Who knew. | ||||
* | extraction: status reporting tweaks | Bryan Newbold | 2018-08-21 | 1 | -5/+8 |
| | | | | | Improvements to how the extraction function in the extraction script reports status (in output, hbase, and counters) | ||||
* | monkey-patch SHA-1 blacklist | Bryan Newbold | 2018-07-05 | 1 | -0/+8 |
| | |||||
* | actually fix oversize inserts | Bryan Newbold | 2018-05-08 | 1 | -7/+7 |
| | |||||
* | XML size limit | Bryan Newbold | 2018-04-26 | 1 | -0/+6 |
| | |||||
* | force_existing flag for extraction | Bryan Newbold | 2018-04-19 | 1 | -1/+5 |
| | |||||
* | NLineInputFormat requires RawProtocol | Bryan Newbold | 2018-04-19 | 1 | -1/+2 |
| | | | | | Should make this a command line argument or something. Want one in HADOOP, the other for local/tests/inline/etc. | ||||
* | use NLineInputFormat so we can control split size | Bryan Newbold | 2018-04-11 | 1 | -0/+1 |
| | |||||
* | Merge branch 'bnewbold-sentry' | Bryan Newbold | 2018-04-10 | 1 | -0/+9 |
|\ | |||||
| * | prototype sentry integration | Bryan Newbold | 2018-04-10 | 1 | -0/+9 |
| | | |||||
* | | don't try to decode GROBID output | Bryan Newbold | 2018-04-11 | 1 | -2/+2 |
|/ | |||||
* | partially lint extraction_cdx_grobid.py | Bryan Newbold | 2018-04-10 | 1 | -8/+6 |
| | |||||
* | yet more test improvements | Bryan Newbold | 2018-04-10 | 1 | -4/+12 |
| | |||||
* | cleanup tests; add one for double-processing | Bryan Newbold | 2018-04-10 | 1 | -5/+5 |
| | |||||
* | wayback 404 test | Bryan Newbold | 2018-04-10 | 1 | -1/+2 |
| | |||||
* | extraction test fixes | Bryan Newbold | 2018-04-10 | 1 | -23/+27 |
| | |||||
* | bug fixes | Bryan Newbold | 2018-04-06 | 1 | -7/+14 |
| | |||||
* | renamed do_tei | Bryan Newbold | 2018-04-06 | 1 | -3/+3 |
| | |||||
* | temporarily skip pylint on extraction | Bryan Newbold | 2018-04-06 | 1 | -0/+3 |
| | |||||
* | small grobid2json test | Bryan Newbold | 2018-04-06 | 1 | -0/+1 |
| | |||||
* | make happybase mock injection slightly less horrible | Bryan Newbold | 2018-04-05 | 1 | -16/+12 |
| | |||||
* | progress on extractor | Bryan Newbold | 2018-04-05 | 1 | -37/+50 |
| | |||||
* | improve test coverage | Bryan Newbold | 2018-04-05 | 1 | -1/+1 |
| | |||||
* | refactor out some common code | Bryan Newbold | 2018-04-04 | 1 | -46/+10 |
| | |||||
* | extraction -> mapreduce | Bryan Newbold | 2018-04-04 | 1 | -0/+248 |