Commit message | Author | Age | Files | Lines | ||
---|---|---|---|---|---|---|
... | ||||||
* | add filter_groupworks.py | Bryan Newbold | 2019-09-04 | 1 | -0/+144 | |
| | | | | For use with new release grouping/matching jobs. | |||||
* | large pipfile update | Bryan Newbold | 2019-09-04 | 1 | -375/+402 | |
| | | | | | | | | Covers some security changes, but might need to be reverted if this breaks things. Should make better use of version locking in the Pipfile to prevent unintentional large upgrades, especially when we don't have good test coverage in this repo. | |||||
* | GroupFatcatWorksSubsetJob | Bryan Newbold | 2019-08-26 | 3 | -0/+111 | |
| | | | | | | | | | | | | This is a hack-y variant of GroupFatcatWorksJob which allows setting different left and right sides of the join. The initial application is to re-run work merging with only longtail-oa works on the "left", with the goal of hard-merging these releases into existing releases with actual identifiers (instead of just grouping into works). As a refactor, the normal GroupFatcatWorksJob could just be this with the same file passed as both left and right, though that requires twice as much JSON parsing/filtering. | |||||
* | update shadow sandcrawler schema | Bryan Newbold | 2019-08-26 | 1 | -3/+4 | |
| | ||||||
* | please command for groupworksfatcat | Bryan Newbold | 2019-08-10 | 2 | -1/+64 | |
| | ||||||
* | FatcatScorable and ScoreSelfFatcat job | Bryan Newbold | 2019-08-10 | 3 | -0/+334 | |
| | ||||||
* | add fatcat ident fields in prep for self-scoring job | Bryan Newbold | 2019-08-10 | 2 | -3/+24 | |
| | ||||||
* | postgrest backfill updates | Bryan Newbold | 2019-08-10 | 1 | -1/+19 | |
| | ||||||
* | sandcrawler HTTP nginx configs | Bryan Newbold | 2019-08-09 | 3 | -0/+153 | |
| | ||||||
* | move postgres/rest directory | Bryan Newbold | 2019-08-09 | 8 | -0/+0 | |
| | ||||||
* | SQL backfill notes and python scripts | Bryan Newbold | 2019-08-09 | 6 | -0/+506 | |
| | ||||||
* | more tweaks to sql schema | Bryan Newbold | 2019-08-09 | 1 | -1/+2 | |
| | ||||||
* | sandcrawler SQL schema more idempotent-ish | Bryan Newbold | 2019-08-08 | 1 | -8/+8 | |
| | ||||||
* | minio README | Bryan Newbold | 2019-08-08 | 1 | -0/+24 | |
| | ||||||
* | update sandcrawler_schema.sql | Bryan Newbold | 2019-08-08 | 1 | -7/+7 | |
| | ||||||
* | start of postgres/postgrest notes and schema | Bryan Newbold | 2019-08-01 | 2 | -0/+177 | |
| | ||||||
* | more kafka topics | Bryan Newbold | 2019-07-07 | 1 | -0/+9 | |
| | ||||||
* | ia_pdf_match.py bugfix | Bryan Newbold | 2019-07-07 | 1 | -4/+4 | |
| | ||||||
* | please: add staging config (commented out) | Bryan Newbold | 2019-07-07 | 1 | -0/+4 | |
| | ||||||
* | create deliver_gwb_to_disk.py | Bryan Newbold | 2019-07-07 | 1 | -0/+166 | |
| | ||||||
* | petabox journal files ingest updates | Bryan Newbold | 2019-06-20 | 2 | -0/+133 | |
| | ||||||
* | new release schema kafka topic | Bryan Newbold | 2019-05-24 | 1 | -2/+3 | |
| | ||||||
* | Merge remote-tracking branch 'github/master' | Bryan Newbold | 2019-05-13 | 0 | -0/+0 | |
|\ | ||||||
| * | more fatcat update topics | Bryan Newbold | 2019-03-04 | 1 | -0/+3 | |
| | | ||||||
* | | update grobid2json to include given_name/surname | Bryan Newbold | 2019-05-13 | 2 | -6/+10 | |
| | | ||||||
* | | deliver_dumpgrobid_to_s3: allow heritrix-style SHA-1 format | Bryan Newbold | 2019-05-10 | 1 | -0/+12 | |
| | | ||||||
* | | clearer CDX munge notes | Bryan Newbold | 2019-05-09 | 1 | -1/+1 | |
| | | ||||||
* | | deliver_dumpgrobid_to_s3: storage class config | Bryan Newbold | 2019-05-09 | 1 | -1/+7 | |
| | | ||||||
* | | deliver_dumpgrobid_to_s3.py | Bryan Newbold | 2019-04-15 | 1 | -0/+106 | |
| | | ||||||
* | | schema notes on deeper file metadata | Bryan Newbold | 2019-04-12 | 1 | -0/+8 | |
| | | ||||||
* | | update TODO | Bryan Newbold | 2019-04-12 | 1 | -1/+22 | |
| | | ||||||
* | | scalding dump-grobid-status-code job | Bryan Newbold | 2019-04-12 | 2 | -0/+58 | |
| | | ||||||
* | | add ojs and dspace as in-domain patterns to look for in heuristic CDX PDF filter | Bryan Newbold | 2019-04-12 | 1 | -1/+1 | |
| | | ||||||
* | | more fatcat update topics | Bryan Newbold | 2019-04-12 | 1 | -0/+3 | |
| | | ||||||
* | | set long timeout on HBaseStatusCountJob | Bryan Newbold | 2019-02-26 | 1 | -1/+3 | |
|/ | ||||||
* | python test fixes | Bryan Newbold | 2019-02-21 | 4 | -5/+8 | |
| | ||||||
* | backport GWB fetch improvements to extraction/kafka workers | Bryan Newbold | 2019-02-21 | 3 | -18/+50 | |
| | | | | *Really* need to refactor out these common methods into a base class. | |||||
* | don't print secret, and MRO pylint skip | Bryan Newbold | 2019-02-21 | 1 | -4/+6 | |
| | ||||||
* | update Pipfile | Bryan Newbold | 2019-02-21 | 2 | -266/+220 | |
| | ||||||
* | include file size in S3 uploads | Bryan Newbold | 2019-02-20 | 1 | -3/+3 | |
| | ||||||
* | delivery gwb counter tweaks | Bryan Newbold | 2019-02-20 | 1 | -2/+8 | |
| | ||||||
* | silly typo | Bryan Newbold | 2019-02-19 | 1 | -1/+1 | |
| | ||||||
* | fix empty blob errors | Bryan Newbold | 2019-02-19 | 1 | -1/+5 | |
| | ||||||
* | make PETABOX_WEBDATA_SECRET explicit | Bryan Newbold | 2019-02-19 | 1 | -1/+9 | |
| | | | | | TODO: port this change to other workers; or better yet make GWB access a mixin or something | |||||
* | deliver python tweaks | Bryan Newbold | 2019-02-19 | 1 | -5/+8 | |
| | ||||||
* | add GWB-to-S3 delivery pipeline script | Bryan Newbold | 2019-02-19 | 2 | -0/+162 | |
| | ||||||
* | give sort way more RAM by default | Bryan Newbold | 2019-02-01 | 3 | -6/+6 | |
| | ||||||
* | update (internal) journal-infra link | Bryan Newbold | 2019-01-03 | 1 | -1/+1 | |
| | ||||||
* | match_filter_enrich notes | Bryan Newbold | 2019-01-03 | 1 | -0/+12 | |
| | ||||||
* | remove old/redundant python CDX directory | Bryan Newbold | 2019-01-03 | 3 | -103/+0 | |
| | | | | This was code from Vinay; it lives on in git history. |