path: root/python
Commit message [Author, Age, Files, Lines]
* add basic sandcrawler worker (kafka) [Bryan Newbold, 2019-11-13, 1 file, -0/+74]
|
* note that kafka_grobid.py is deprecated [Bryan Newbold, 2019-11-13, 1 file, -0/+3]
|
* rename FileIngestWorker [Bryan Newbold, 2019-11-13, 3 files, -10/+16]
|
* refactor consume_topic name out of make_kafka_consumer() [Bryan Newbold, 2019-11-13, 1 file, -5/+5]
|   Best to do this in wrapping code for full flexibility.
* more progress on file ingest [Bryan Newbold, 2019-11-13, 4 files, -17/+75]
|
* much progress on file ingest path [Bryan Newbold, 2019-10-22, 6 files, -335/+338]
|
* remove spurious debug print from grobid2json [Bryan Newbold, 2019-10-22, 1 file, -1/+1]
|
* we do actually want consolidateHeader=2, not 1 [Bryan Newbold, 2019-10-04, 2 files, -4/+4]
|
* remove any trailing newline [Bryan Newbold, 2019-10-04, 1 file, -2/+2]
|
* grobid: consolidateHeaders typo [Bryan Newbold, 2019-10-04, 1 file, -1/+1]
|
* grobid_tool: don't wrap multiprocess if we don't need to [Bryan Newbold, 2019-10-04, 1 file, -2/+4]
|
* disable citation consolidation by default [Bryan Newbold, 2019-10-04, 1 file, -1/+1]
|   with this consolidation enabled, the glutton_fatcat elasticsearch server
|   was totally pegged over 90% CPU with only 10 PDF worker threads; the
|   glutton load seemed to be the bottleneck even for this low degree of
|   parallelism. Disabled for now, will debug with GROBID/glutton folks.
* grobid-output-pg, not grobid-output-json [Bryan Newbold, 2019-10-04, 1 file, -4/+2]
|
* grobid_tool: don't always insert multi wrapper [Bryan Newbold, 2019-10-04, 1 file, -6/+13]
|
* grobid2json: language_code [Bryan Newbold, 2019-10-04, 2 files, -1/+7]
|
* fix GROBID POST flags [Bryan Newbold, 2019-10-04, 1 file, -1/+3]
|
* workers: better generic batch-size arg handling [Bryan Newbold, 2019-10-03, 1 file, -0/+6]
|
* handle GROBID fetch empty blob condition [Bryan Newbold, 2019-10-03, 1 file, -1/+2]
|
* grobid_affiliations fix from prod, and usage example [Bryan Newbold, 2019-10-02, 1 file, -0/+5]
|
* deliver_dumpgrobid_to_s3: typo fix from old prod [Bryan Newbold, 2019-10-02, 1 file, -3/+4]
|
* grobid affiliation extractor (script) [Bryan Newbold, 2019-10-02, 1 file, -0/+47]
|
* python tests for pusher classes [Bryan Newbold, 2019-10-02, 2 files, -0/+28]
|
* have grobidworker error status indicate issues instead of bailing [Bryan Newbold, 2019-10-02, 1 file, -4/+13]
|
* grobid_tool.py example usage in docstring [Bryan Newbold, 2019-10-02, 1 file, -0/+6]
|
* add tests for affiliation extraction [Bryan Newbold, 2019-10-02, 2 files, -1/+25]
|
* have grobid2json extract full names and affiliations [Bryan Newbold, 2019-10-02, 1 file, -5/+27]
|
* more counts and bugfixes in grobid_tool [Bryan Newbold, 2019-09-26, 3 files, -5/+7]
|
* off-by-one error in batch sizes [Bryan Newbold, 2019-09-26, 1 file, -1/+1]
|
* small improvements to GROBID tool [Bryan Newbold, 2019-09-26, 2 files, -2/+10]
|
* lots of grobid tool implementation (still WIP) [Bryan Newbold, 2019-09-26, 8 files, -16/+748]
|
* pylint as part of pytest; update lint config [Bryan Newbold, 2019-09-25, 2 files, -1/+17]
|
* pipfile update [Bryan Newbold, 2019-09-25, 2 files, -241/+70]
|   - remove hadoop stuff (mrjob, happybase, etc)
|   - add flask
|   - add pytest-pylint plugin
|   - reformat (automatic by newer pipenv)
* test of GROBID client [Bryan Newbold, 2019-09-25, 1 file, -0/+53]
|
* move a bunch of random old scripts to subdir [Bryan Newbold, 2019-09-25, 9 files, -0/+0]
|
* get rid of old xml2json [Bryan Newbold, 2019-09-25, 1 file, -7/+0]
|
* refactor old python hadoop code into new directory [Bryan Newbold, 2019-09-25, 10 files, -1590/+0]
|
* re-write parse_cdx_line for sandcrawler lib [Bryan Newbold, 2019-09-25, 3 files, -2/+116]
|
* fix test grobid2json test [Bryan Newbold, 2019-09-25, 1 file, -1/+4]
|   For new extra fields
* commit WIP on file ingest script [Bryan Newbold, 2019-09-23, 1 file, -0/+386]
|
* start refactoring sandcrawler python common code [Bryan Newbold, 2019-09-23, 5 files, -0/+131]
|
* update Pipfile with additional libraries [Bryan Newbold, 2019-09-23, 2 files, -77/+293]
|
* grobid2json: extract fatcat identifier [Bryan Newbold, 2019-09-20, 1 file, -1/+5]
|
* add filter_groupworks.py [Bryan Newbold, 2019-09-04, 1 file, -0/+144]
|   For use with new release grouping/matching jobs.
* large pipfile update [Bryan Newbold, 2019-09-04, 1 file, -375/+402]
|   Covers some security changes, but might need to revert if this breaks
|   things. Should use version locking in Pipfile better to prevent
|   unintentional large upgrades, especially when we don't have good test
|   coverage in this repo.
* ia_pdf_match.py bugfix [Bryan Newbold, 2019-07-07, 1 file, -4/+4]
|
* create deliver_gwb_to_disk.py [Bryan Newbold, 2019-07-07, 1 file, -0/+166]
|
* petabox journal files ingest updates [Bryan Newbold, 2019-06-20, 1 file, -0/+108]
|
* update grobid2json to include given_name/surname [Bryan Newbold, 2019-05-13, 2 files, -6/+10]
|
* deliver_dumpgrobid_to_s3: allow heritrix-style SHA-1 format [Bryan Newbold, 2019-05-10, 1 file, -0/+12]
|
* deliver_dumpgrobid_to_s3: storage class config [Bryan Newbold, 2019-05-09, 1 file, -1/+7]
|