path: root/mapreduce/extraction_cdx_grobid.py
Commit message                                          Author         Age         Files  Lines
* extraction: do want content, not text                 Bryan Newbold  2018-08-21  1      -1/+1
|     XML can have non-unicode characters? Who knew.
* extraction: status reporting tweaks                   Bryan Newbold  2018-08-21  1      -5/+8
|     Improvements to how the extraction function in the extraction script reports status (in output, hbase, and counters)
* monkey-patch SHA-1 blacklist                          Bryan Newbold  2018-07-05  1      -0/+8
|
* actually fix oversize inserts                         Bryan Newbold  2018-05-08  1      -7/+7
|
* XML size limit                                        Bryan Newbold  2018-04-26  1      -0/+6
|
* force_existing flag for extraction                    Bryan Newbold  2018-04-19  1      -1/+5
|
* NLineInputFormat requires RawProtocol                 Bryan Newbold  2018-04-19  1      -1/+2
|     Should make this a command line argument or something. Want one in HADOOP, the other for local/tests/inline/etc.
* use NLineInputFormat so we can control split size     Bryan Newbold  2018-04-11  1      -0/+1
|
* Merge branch 'bnewbold-sentry'                        Bryan Newbold  2018-04-10  1      -0/+9
|\
| * prototype sentry integration                        Bryan Newbold  2018-04-10  1      -0/+9
| |
* | don't try to decode GROBID output                   Bryan Newbold  2018-04-11  1      -2/+2
|/
* partially lint extraction_cdx_grobid.py               Bryan Newbold  2018-04-10  1      -8/+6
|
* yet more test improvements                            Bryan Newbold  2018-04-10  1      -4/+12
|
* cleanup tests; add one for double-processing          Bryan Newbold  2018-04-10  1      -5/+5
|
* wayback 404 test                                      Bryan Newbold  2018-04-10  1      -1/+2
|
* extraction test fixes                                 Bryan Newbold  2018-04-10  1      -23/+27
|
* bug fixes                                             Bryan Newbold  2018-04-06  1      -7/+14
|
* renamed do_tei                                        Bryan Newbold  2018-04-06  1      -3/+3
|
* temporarily skip pylint on extraction                 Bryan Newbold  2018-04-06  1      -0/+3
|
* small grobid2json test                                Bryan Newbold  2018-04-06  1      -0/+1
|
* make happybase mock injection slightly less horrible  Bryan Newbold  2018-04-05  1      -16/+12
|
* progress on extractor                                 Bryan Newbold  2018-04-05  1      -37/+50
|
* improve test coverage                                 Bryan Newbold  2018-04-05  1      -1/+1
|
* refactor out some common code                         Bryan Newbold  2018-04-04  1      -46/+10
|
* extraction -> mapreduce                               Bryan Newbold  2018-04-04  1      -0/+248