Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | re-write parse_cdx_line for sandcrawler lib | Bryan Newbold | 2019-09-25 | 3 | -2/+116 |
| | |||||
* | fix grobid2json test | Bryan Newbold | 2019-09-25 | 1 | -1/+4 |
| | | | | For new extra fields | ||||
* | commit WIP on file ingest script | Bryan Newbold | 2019-09-23 | 1 | -0/+386 |
| | |||||
* | rename postgrest directory sql | Bryan Newbold | 2019-09-23 | 9 | -0/+0 |
| | |||||
* | start refactoring sandcrawler python common code | Bryan Newbold | 2019-09-23 | 5 | -0/+131 |
| | |||||
* | update Pipfile with additional libraries | Bryan Newbold | 2019-09-23 | 2 | -77/+293 |
| | |||||
* | more matching examples | Bryan Newbold | 2019-09-20 | 1 | -0/+4 |
| | |||||
* | unpaywall blobs.fatcat.wiki backfill notes | Bryan Newbold | 2019-09-20 | 1 | -0/+59 |
| | |||||
* | grobid2json: extract fatcat identifier | Bryan Newbold | 2019-09-20 | 1 | -1/+5 |
| | |||||
* | fatcat-blobs nginx config example | Bryan Newbold | 2019-09-20 | 1 | -0/+51 |
| | |||||
* | old groupworks job log | Bryan Newbold | 2019-09-20 | 1 | -0/+8 |
| | |||||
* | update service docs | Bryan Newbold | 2019-09-20 | 3 | -1/+16 |
| | |||||
* | add filter_groupworks.py | Bryan Newbold | 2019-09-04 | 1 | -0/+144 |
| | | | | For use with new release grouping/matching jobs. | ||||
* | large pipfile update | Bryan Newbold | 2019-09-04 | 1 | -375/+402 |
| | | | | | | | | | Covers some security changes, but might need to revert if this breaks things. Should make better use of version locking in the Pipfile to prevent unintentional large upgrades, especially when we don't have good test coverage in this repo. | ||||
* | GroupFatcatWorksSubsetJob | Bryan Newbold | 2019-08-26 | 3 | -0/+111 |
| | | | | | | | | | | | | This is a hack-y variant of GroupFatcatWorksJob which allows setting different left and right sides of the join. The initial application is to re-run work merging with only longtail-oa works on the "left", with the goal of hard-merging these releases into existing releases with actual identifiers (instead of just grouping into works). As a refactor, the normal GroupFatcatWorksJob could just be this job with the same file passed as both left and right, though that requires twice as much JSON parsing/filtering. | ||||
* | update shadow sandcrawler schema | Bryan Newbold | 2019-08-26 | 1 | -3/+4 |
| | |||||
* | please command for groupworksfatcat | Bryan Newbold | 2019-08-10 | 2 | -1/+64 |
| | |||||
* | FatcatScorable and ScoreSelfFatcat job | Bryan Newbold | 2019-08-10 | 3 | -0/+334 |
| | |||||
* | add fatcat ident fields in prep for self-scoring job | Bryan Newbold | 2019-08-10 | 2 | -3/+24 |
| | |||||
* | postgrest backfill updates | Bryan Newbold | 2019-08-10 | 1 | -1/+19 |
| | |||||
* | sandcrawler HTTP nginx configs | Bryan Newbold | 2019-08-09 | 3 | -0/+153 |
| | |||||
* | move postgres/rest directory | Bryan Newbold | 2019-08-09 | 8 | -0/+0 |
| | |||||
* | SQL backfill notes and python scripts | Bryan Newbold | 2019-08-09 | 6 | -0/+506 |
| | |||||
* | more tweaks to sql schema | Bryan Newbold | 2019-08-09 | 1 | -1/+2 |
| | |||||
* | sandcrawler SQL schema more idempotent-ish | Bryan Newbold | 2019-08-08 | 1 | -8/+8 |
| | |||||
* | minio README | Bryan Newbold | 2019-08-08 | 1 | -0/+24 |
| | |||||
* | update sandcrawler_schema.sql | Bryan Newbold | 2019-08-08 | 1 | -7/+7 |
| | |||||
* | start of postgres/postgrest notes and schema | Bryan Newbold | 2019-08-01 | 2 | -0/+177 |
| | |||||
* | more kafka topics | Bryan Newbold | 2019-07-07 | 1 | -0/+9 |
| | |||||
* | ia_pdf_match.py bugfix | Bryan Newbold | 2019-07-07 | 1 | -4/+4 |
| | |||||
* | please: add staging config (commented out) | Bryan Newbold | 2019-07-07 | 1 | -0/+4 |
| | |||||
* | create deliver_gwb_to_disk.py | Bryan Newbold | 2019-07-07 | 1 | -0/+166 |
| | |||||
* | petabox journal files ingest updates | Bryan Newbold | 2019-06-20 | 2 | -0/+133 |
| | |||||
* | new release schema kafka topic | Bryan Newbold | 2019-05-24 | 1 | -2/+3 |
| | |||||
* | Merge remote-tracking branch 'github/master' | Bryan Newbold | 2019-05-13 | 0 | -0/+0 |
|\ | |||||
| * | more fatcat update topics | Bryan Newbold | 2019-03-04 | 1 | -0/+3 |
| | | |||||
* | | update grobid2json to include given_name/surname | Bryan Newbold | 2019-05-13 | 2 | -6/+10 |
| | | |||||
* | | deliver_dumpgrobid_to_s3: allow heritrix-style SHA-1 format | Bryan Newbold | 2019-05-10 | 1 | -0/+12 |
| | | |||||
* | | clearer CDX munge notes | Bryan Newbold | 2019-05-09 | 1 | -1/+1 |
| | | |||||
* | | deliver_dumpgrobid_to_s3: storage class config | Bryan Newbold | 2019-05-09 | 1 | -1/+7 |
| | | |||||
* | | deliver_dumpgrobid_to_s3.py | Bryan Newbold | 2019-04-15 | 1 | -0/+106 |
| | | |||||
* | | schema notes on deeper file metadata | Bryan Newbold | 2019-04-12 | 1 | -0/+8 |
| | | |||||
* | | update TODO | Bryan Newbold | 2019-04-12 | 1 | -1/+22 |
| | | |||||
* | | scalding dump-grobid-status-code job | Bryan Newbold | 2019-04-12 | 2 | -0/+58 |
| | | |||||
* | | add ojs and dspace as in-domain patterns to look for in heuristic CDX PDF filter | Bryan Newbold | 2019-04-12 | 1 | -1/+1 |
| | | |||||
* | | more fatcat update topics | Bryan Newbold | 2019-04-12 | 1 | -0/+3 |
| | | |||||
* | | set long timeout on HBaseStatusCountJob | Bryan Newbold | 2019-02-26 | 1 | -1/+3 |
|/ | |||||
* | python test fixes | Bryan Newbold | 2019-02-21 | 4 | -5/+8 |
| | |||||
* | backport GWB fetch improvements to extraction/kafka workers | Bryan Newbold | 2019-02-21 | 3 | -18/+50 |
| | | | | *Really* need to refactor out these common methods into a base class. | ||||
* | don't print secret, and MRO pylint skip | Bryan Newbold | 2019-02-21 | 1 | -4/+6 |
| |