Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | off-by-one error in batch sizes | Bryan Newbold | 2019-09-26 | 1 | -1/+1 |
| | |||||
* | small improvements to GROBID tool | Bryan Newbold | 2019-09-26 | 2 | -2/+10 |
| | |||||
* | lots of grobid tool implementation (still WIP) | Bryan Newbold | 2019-09-26 | 8 | -16/+748 |
| | |||||
* | gitlab CI: run both python and python_hadoop tests | Bryan Newbold | 2019-09-25 | 1 | -1/+6 |
| | |||||
* | pylint as part of pytest; update lint config | Bryan Newbold | 2019-09-25 | 2 | -1/+17 |
| | |||||
* | pipfile update | Bryan Newbold | 2019-09-25 | 2 | -241/+70 |
| | Remove hadoop stuff (mrjob, happybase, etc); add flask; add pytest-pylint plugin; reformat (automatic by newer pipenv). | ||||
* | test of GROBID client | Bryan Newbold | 2019-09-25 | 1 | -0/+53 |
| | |||||
* | move a bunch of random old scripts to subdir | Bryan Newbold | 2019-09-25 | 9 | -0/+0 |
| | |||||
* | get rid of old xml2json | Bryan Newbold | 2019-09-25 | 1 | -7/+0 |
| | |||||
* | update README with new folders | Bryan Newbold | 2019-09-25 | 1 | -4/+10 |
| | |||||
* | point 'please' to python_hadoop | Bryan Newbold | 2019-09-25 | 1 | -4/+4 |
| | |||||
* | refactor old python hadoop code into new directory | Bryan Newbold | 2019-09-25 | 21 | -0/+3523 |
| | |||||
* | re-write parse_cdx_line for sandcrawler lib | Bryan Newbold | 2019-09-25 | 3 | -2/+116 |
| | |||||
* | fix grobid2json test | Bryan Newbold | 2019-09-25 | 1 | -1/+4 |
| | For new extra fields. | ||||
* | commit WIP on file ingest script | Bryan Newbold | 2019-09-23 | 1 | -0/+386 |
| | |||||
* | rename postgrest directory sql | Bryan Newbold | 2019-09-23 | 9 | -0/+0 |
| | |||||
* | start refactoring sandcrawler python common code | Bryan Newbold | 2019-09-23 | 5 | -0/+131 |
| | |||||
* | update Pipfile with additional libraries | Bryan Newbold | 2019-09-23 | 2 | -77/+293 |
| | |||||
* | more matching examples | Bryan Newbold | 2019-09-20 | 1 | -0/+4 |
| | |||||
* | unpaywall blobs.fatcat.wiki backfill notes | Bryan Newbold | 2019-09-20 | 1 | -0/+59 |
| | |||||
* | grobid2json: extract fatcat identifier | Bryan Newbold | 2019-09-20 | 1 | -1/+5 |
| | |||||
* | fatcat-blobs nginx config example | Bryan Newbold | 2019-09-20 | 1 | -0/+51 |
| | |||||
* | old groupworks job log | Bryan Newbold | 2019-09-20 | 1 | -0/+8 |
| | |||||
* | update service docs | Bryan Newbold | 2019-09-20 | 3 | -1/+16 |
| | |||||
* | add filter_groupworks.py | Bryan Newbold | 2019-09-04 | 1 | -0/+144 |
| | For use with new release grouping/matching jobs. | ||||
* | large pipfile update | Bryan Newbold | 2019-09-04 | 1 | -375/+402 |
| | Covers some security changes, but might need to revert if this breaks things. Should make better use of version locking in the Pipfile to prevent unintentional large upgrades, especially when we don't have good test coverage in this repo. | ||||
* | GroupFatcatWorksSubsetJob | Bryan Newbold | 2019-08-26 | 3 | -0/+111 |
| | This is a hack-y variant of GroupFatcatWorksJob which allows setting different left and right sides of the join. The initial application is to re-run work merging with only longtail-oa works on the "left", with the goal of hard-merging these releases into existing releases with actual identifiers (instead of just grouping into works). As a refactor, the normal GroupFatcatWorksJob could just be this job with the same file passed as both left and right, though that requires twice as much JSON parsing/filtering. | ||||
* | update shadow sandcrawler schema | Bryan Newbold | 2019-08-26 | 1 | -3/+4 |
| | |||||
* | please command for groupworksfatcat | Bryan Newbold | 2019-08-10 | 2 | -1/+64 |
| | |||||
* | FatcatScorable and ScoreSelfFatcat job | Bryan Newbold | 2019-08-10 | 3 | -0/+334 |
| | |||||
* | add fatcat ident fields in prep for self-scoring job | Bryan Newbold | 2019-08-10 | 2 | -3/+24 |
| | |||||
* | postgrest backfill updates | Bryan Newbold | 2019-08-10 | 1 | -1/+19 |
| | |||||
* | sandcrawler HTTP nginx configs | Bryan Newbold | 2019-08-09 | 3 | -0/+153 |
| | |||||
* | move postgres/rest directory | Bryan Newbold | 2019-08-09 | 8 | -0/+0 |
| | |||||
* | SQL backfill notes and python scripts | Bryan Newbold | 2019-08-09 | 6 | -0/+506 |
| | |||||
* | more tweaks to sql schema | Bryan Newbold | 2019-08-09 | 1 | -1/+2 |
| | |||||
* | sandcrawler SQL schema more idempotent-ish | Bryan Newbold | 2019-08-08 | 1 | -8/+8 |
| | |||||
* | minio README | Bryan Newbold | 2019-08-08 | 1 | -0/+24 |
| | |||||
* | update sandcrawler_schema.sql | Bryan Newbold | 2019-08-08 | 1 | -7/+7 |
| | |||||
* | start of postgres/postgrest notes and schema | Bryan Newbold | 2019-08-01 | 2 | -0/+177 |
| | |||||
* | more kafka topics | Bryan Newbold | 2019-07-07 | 1 | -0/+9 |
| | |||||
* | ia_pdf_match.py bugfix | Bryan Newbold | 2019-07-07 | 1 | -4/+4 |
| | |||||
* | please: add staging config (commented out) | Bryan Newbold | 2019-07-07 | 1 | -0/+4 |
| | |||||
* | create deliver_gwb_to_disk.py | Bryan Newbold | 2019-07-07 | 1 | -0/+166 |
| | |||||
* | petabox journal files ingest updates | Bryan Newbold | 2019-06-20 | 2 | -0/+133 |
| | |||||
* | new release schema kafka topic | Bryan Newbold | 2019-05-24 | 1 | -2/+3 |
| | |||||
* | Merge remote-tracking branch 'github/master' | Bryan Newbold | 2019-05-13 | 0 | -0/+0 |
|\ | |||||
| * | more fatcat update topics | Bryan Newbold | 2019-03-04 | 1 | -0/+3 |
| | | |||||
* | | update grobid2json to include given_name/surname | Bryan Newbold | 2019-05-13 | 2 | -6/+10 |
| | | |||||
* | | deliver_dumpgrobid_to_s3: allow heritrix-style SHA-1 format | Bryan Newbold | 2019-05-10 | 1 | -0/+12 |
| | |