Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | new/additional GWB CDX filter scripts | Bryan Newbold | 2019-10-17 | 7 | -0/+142 |
| | |||||
* | we do actually want consolidateHeader=2, not 1 | Bryan Newbold | 2019-10-04 | 2 | -4/+4 |
| | |||||
* | remove any trailing newline | Bryan Newbold | 2019-10-04 | 1 | -2/+2 |
| | |||||
* | grobid: consolidateHeaders typo | Bryan Newbold | 2019-10-04 | 1 | -1/+1 |
| | |||||
* | grobid_tool: don't wrap multiprocess if we don't need to | Bryan Newbold | 2019-10-04 | 1 | -2/+4 |
| | |||||
* | disable citation consolidation by default | Bryan Newbold | 2019-10-04 | 1 | -1/+1 |
| | With this consolidation enabled, the glutton_fatcat Elasticsearch server was pegged at over 90% CPU with only 10 PDF worker threads; the glutton load seemed to be the bottleneck even at this low degree of parallelism. Disabled for now; will debug with GROBID/glutton folks. | ||||
* | grobid-output-pg, not grobid-output-json | Bryan Newbold | 2019-10-04 | 1 | -4/+2 |
| | |||||
* | grobid_tool: don't always insert multi wrapper | Bryan Newbold | 2019-10-04 | 1 | -6/+13 |
| | |||||
* | grobid2json: language_code | Bryan Newbold | 2019-10-04 | 2 | -1/+7 |
| | |||||
* | fix GROBID POST flags | Bryan Newbold | 2019-10-04 | 1 | -1/+3 |
| | |||||
* | workers: better generic batch-size arg handling | Bryan Newbold | 2019-10-03 | 1 | -0/+6 |
| | |||||
* | handle GROBID fetch empty blob condition | Bryan Newbold | 2019-10-03 | 1 | -1/+2 |
| | |||||
* | update kafka topic listings | Bryan Newbold | 2019-10-03 | 1 | -21/+35 |
| | |||||
* | grobid_affiliations fix from prod, and usage example | Bryan Newbold | 2019-10-02 | 1 | -0/+5 |
| | |||||
* | deliver_dumpgrobid_to_s3: typo fix from old prod | Bryan Newbold | 2019-10-02 | 1 | -3/+4 |
| | |||||
* | grobid affiliation extractor (script) | Bryan Newbold | 2019-10-02 | 1 | -0/+47 |
| | |||||
* | python tests for pusher classes | Bryan Newbold | 2019-10-02 | 2 | -0/+28 |
| | |||||
* | have grobidworker error status indicate issues instead of bailing | Bryan Newbold | 2019-10-02 | 1 | -4/+13 |
| | |||||
* | grobid_tool.py example usage in docstring | Bryan Newbold | 2019-10-02 | 1 | -0/+6 |
| | |||||
* | add tests for affiliation extraction | Bryan Newbold | 2019-10-02 | 2 | -1/+25 |
| | |||||
* | have grobid2json extract full names and affiliations | Bryan Newbold | 2019-10-02 | 1 | -5/+27 |
| | |||||
* | more counts and bugfixes in grobid_tool | Bryan Newbold | 2019-09-26 | 3 | -5/+7 |
| | |||||
* | off-by-one error in batch sizes | Bryan Newbold | 2019-09-26 | 1 | -1/+1 |
| | |||||
* | small improvements to GROBID tool | Bryan Newbold | 2019-09-26 | 2 | -2/+10 |
| | |||||
* | lots of grobid tool implementation (still WIP) | Bryan Newbold | 2019-09-26 | 8 | -16/+748 |
| | |||||
* | gitlab CI: run both python and python_hadoop tests | Bryan Newbold | 2019-09-25 | 1 | -1/+6 |
| | |||||
* | pylint as part of pytest; update lint config | Bryan Newbold | 2019-09-25 | 2 | -1/+17 |
| | |||||
* | pipfile update | Bryan Newbold | 2019-09-25 | 2 | -241/+70 |
| | Remove hadoop stuff (mrjob, happybase, etc.); add flask; add pytest-pylint plugin; reformat (automatic by newer pipenv). | ||||
* | test of GROBID client | Bryan Newbold | 2019-09-25 | 1 | -0/+53 |
| | |||||
* | move a bunch of random old scripts to subdir | Bryan Newbold | 2019-09-25 | 9 | -0/+0 |
| | |||||
* | get rid of old xml2json | Bryan Newbold | 2019-09-25 | 1 | -7/+0 |
| | |||||
* | update README with new folders | Bryan Newbold | 2019-09-25 | 1 | -4/+10 |
| | |||||
* | point 'please' to python_hadoop | Bryan Newbold | 2019-09-25 | 1 | -4/+4 |
| | |||||
* | refactor old python hadoop code into new directory | Bryan Newbold | 2019-09-25 | 21 | -0/+3523 |
| | |||||
* | re-write parse_cdx_line for sandcrawler lib | Bryan Newbold | 2019-09-25 | 3 | -2/+116 |
| | |||||
* | fix test grobid2json test | Bryan Newbold | 2019-09-25 | 1 | -1/+4 |
| | For new extra fields. | ||||
* | commit WIP on file ingest script | Bryan Newbold | 2019-09-23 | 1 | -0/+386 |
| | |||||
* | rename postgrest directory sql | Bryan Newbold | 2019-09-23 | 9 | -0/+0 |
| | |||||
* | start refactoring sandcrawler python common code | Bryan Newbold | 2019-09-23 | 5 | -0/+131 |
| | |||||
* | update Pipfile with additional libraries | Bryan Newbold | 2019-09-23 | 2 | -77/+293 |
| | |||||
* | more matching examples | Bryan Newbold | 2019-09-20 | 1 | -0/+4 |
| | |||||
* | unpaywall blobs.fatcat.wiki backfill notes | Bryan Newbold | 2019-09-20 | 1 | -0/+59 |
| | |||||
* | grobid2json: extract fatcat identifier | Bryan Newbold | 2019-09-20 | 1 | -1/+5 |
| | |||||
* | fatcat-blobs nginx config example | Bryan Newbold | 2019-09-20 | 1 | -0/+51 |
| | |||||
* | old groupworks job log | Bryan Newbold | 2019-09-20 | 1 | -0/+8 |
| | |||||
* | update service docs | Bryan Newbold | 2019-09-20 | 3 | -1/+16 |
| | |||||
* | add filter_groupworks.py | Bryan Newbold | 2019-09-04 | 1 | -0/+144 |
| | For use with new release grouping/matching jobs. | ||||
* | large pipfile update | Bryan Newbold | 2019-09-04 | 1 | -375/+402 |
| | Covers some security changes, but might need to revert if this breaks things. Should make better use of version locking in the Pipfile to prevent unintentional large upgrades, especially when we don't have good test coverage in this repo. | ||||
* | GroupFatcatWorksSubsetJob | Bryan Newbold | 2019-08-26 | 3 | -0/+111 |
| | This is a hack-y variant of GroupFatcatWorksJob which allows setting different left and right sides of the join. The initial application is to re-run work merging with only longtail-oa works on the "left", with the goal of hard-merging these releases into existing releases with actual identifiers (instead of just grouping into works). As a refactor, the normal GroupFatcatWorksJob could just be this job with the same file passed as both left and right, though that requires twice as much JSON parsing/filtering. | ||||
* | update shadow sandcrawler schema | Bryan Newbold | 2019-08-26 | 1 | -3/+4 |
| |
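
The left/right join described in the GroupFatcatWorksSubsetJob commit message can be illustrated with a small in-memory sketch. This is not the actual job, which is not shown in this log. The `slug` join key, the `ident` and `title` fields, and the `group_left_right` helper are all hypothetical names chosen for illustration. The sketch only shows the idea that the "left" side can be a subset (for example, longtail-oa releases) matched against a larger "right" side, and that passing the same input as both sides gives an ordinary self-join.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Release = Dict[str, str]


def slug(title: str) -> str:
    """Hypothetical join key: lowercase the title and keep only alphanumerics."""
    return "".join(ch for ch in title.lower() if ch.isalnum())


def group_left_right(
    left: Iterable[Release], right: Iterable[Release]
) -> Iterator[Tuple[Release, Release]]:
    """Yield (left, right) pairs sharing a join key, skipping self-matches."""
    # Index the right side once, bucketed by join key.
    by_key: Dict[str, List[Release]] = defaultdict(list)
    for rel in right:
        by_key[slug(rel["title"])].append(rel)
    # Probe the index with each left-side release.
    for rel in left:
        for match in by_key.get(slug(rel["title"]), []):
            if match["ident"] != rel["ident"]:
                yield rel, match


# Assumed sample data: a one-item "left" subset and a larger "right" side.
longtail = [{"ident": "a1", "title": "Deep Sea Worms"}]
everything = longtail + [
    {"ident": "b2", "title": "Deep-sea worms"},
    {"ident": "c3", "title": "Something Else"},
]

subset_pairs = list(group_left_right(longtail, everything))
self_join_pairs = list(group_left_right(everything, everything))
```

With the subset on the left, each match is reported once, from the subset's point of view. The self-join reports every match in both directions and has to read and index the full input twice, which corresponds to the extra parsing cost the commit message mentions.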