Commit message (Author, Date, Files changed, Lines -removed/+added)
* add filter_groupworks.py (Bryan Newbold, 2019-09-04, 1 file, -0/+144)
|     For use with new release grouping/matching jobs.
* large pipfile update (Bryan Newbold, 2019-09-04, 1 file, -375/+402)
|     Covers some security changes, but might need to revert if this breaks
|     things. Should make better use of version locking in the Pipfile to
|     prevent unintentional large upgrades, especially when we don't have good
|     test coverage in this repo.
* GroupFatcatWorksSubsetJob (Bryan Newbold, 2019-08-26, 3 files, -0/+111)
|     This is a hack-y variant of GroupFatcatWorksJob which allows setting
|     different left and right sides of the join. The initial application is
|     to re-run work merging with only longtail-oa works on the "left", with
|     the goal of hard-merging these releases into existing releases with
|     actual identifiers (instead of just grouping into works). As a refactor,
|     the normal GroupFatcatWorksJob could just be this with the same file
|     passed as both left and right, though that requires twice as much JSON
|     parsing/filtering.
* update shadow sandcrawler schema (Bryan Newbold, 2019-08-26, 1 file, -3/+4)
|
* please command for groupworksfatcat (Bryan Newbold, 2019-08-10, 2 files, -1/+64)
|
* FatcatScorable and ScoreSelfFatcat job (Bryan Newbold, 2019-08-10, 3 files, -0/+334)
|
* add fatcat ident fields in prep for self-scoring job (Bryan Newbold, 2019-08-10, 2 files, -3/+24)
|
* postgrest backfill updates (Bryan Newbold, 2019-08-10, 1 file, -1/+19)
|
* sandcrawler HTTP nginx configs (Bryan Newbold, 2019-08-09, 3 files, -0/+153)
|
* move postgres/rest directory (Bryan Newbold, 2019-08-09, 8 files, -0/+0)
|
* SQL backfill notes and python scripts (Bryan Newbold, 2019-08-09, 6 files, -0/+506)
|
* more tweaks to sql schema (Bryan Newbold, 2019-08-09, 1 file, -1/+2)
|
* sandcrawler SQL schema more idempotent-ish (Bryan Newbold, 2019-08-08, 1 file, -8/+8)
|
* minio README (Bryan Newbold, 2019-08-08, 1 file, -0/+24)
|
* update sandcrawler_schema.sql (Bryan Newbold, 2019-08-08, 1 file, -7/+7)
|
* start of postgres/postgrest notes and schema (Bryan Newbold, 2019-08-01, 2 files, -0/+177)
|
* more kafka topics (Bryan Newbold, 2019-07-07, 1 file, -0/+9)
|
* ia_pdf_match.py bugfix (Bryan Newbold, 2019-07-07, 1 file, -4/+4)
|
* please: add staging config (commented out) (Bryan Newbold, 2019-07-07, 1 file, -0/+4)
|
* create deliver_gwb_to_disk.py (Bryan Newbold, 2019-07-07, 1 file, -0/+166)
|
* petabox journal files ingest updates (Bryan Newbold, 2019-06-20, 2 files, -0/+133)
|
* new release schema kafka topic (Bryan Newbold, 2019-05-24, 1 file, -2/+3)
|
* Merge remote-tracking branch 'github/master' (Bryan Newbold, 2019-05-13, 0 files, -0/+0)
|\
| * more fatcat update topics (Bryan Newbold, 2019-03-04, 1 file, -0/+3)
| |
* | update grobid2json to include given_name/surname (Bryan Newbold, 2019-05-13, 2 files, -6/+10)
| |
* | deliver_dumpgrobid_to_s3: allow heritrix-style SHA-1 format (Bryan Newbold, 2019-05-10, 1 file, -0/+12)
| |
* | clearer CDX munge notes (Bryan Newbold, 2019-05-09, 1 file, -1/+1)
| |
* | deliver_dumpgrobid_to_s3: storage class config (Bryan Newbold, 2019-05-09, 1 file, -1/+7)
| |
* | deliver_dumpgrobid_to_s3.py (Bryan Newbold, 2019-04-15, 1 file, -0/+106)
| |
* | schema notes on deeper file metadata (Bryan Newbold, 2019-04-12, 1 file, -0/+8)
| |
* | update TODO (Bryan Newbold, 2019-04-12, 1 file, -1/+22)
| |
* | scalding dump-grobid-status-code job (Bryan Newbold, 2019-04-12, 2 files, -0/+58)
| |
* | add ojs and dspace as in-domain patterns to look for in heuristic CDX PDF filter (Bryan Newbold, 2019-04-12, 1 file, -1/+1)
| |
* | more fatcat update topics (Bryan Newbold, 2019-04-12, 1 file, -0/+3)
| |
* | set long timeout on HBaseStatusCountJob (Bryan Newbold, 2019-02-26, 1 file, -1/+3)
|/
* python test fixes (Bryan Newbold, 2019-02-21, 4 files, -5/+8)
|
* backport GWB fetch improvements to extraction/kafka workers (Bryan Newbold, 2019-02-21, 3 files, -18/+50)
|     *Really* need to refactor out these common methods into a base class.
* don't print secret, and MRO pylint skip (Bryan Newbold, 2019-02-21, 1 file, -4/+6)
|
* update Pipefile (Bryan Newbold, 2019-02-21, 2 files, -266/+220)
|
* include file size in S3 uploads (Bryan Newbold, 2019-02-20, 1 file, -3/+3)
|
* delivery gwb counter tweaks (Bryan Newbold, 2019-02-20, 1 file, -2/+8)
|
* silly typo (Bryan Newbold, 2019-02-19, 1 file, -1/+1)
|
* fix empty blob errors (Bryan Newbold, 2019-02-19, 1 file, -1/+5)
|
* make PETABOX_WEBDATA_SECRET explicit (Bryan Newbold, 2019-02-19, 1 file, -1/+9)
|     TODO: port this change to other workers; or better yet make GWB access a
|     mixin or something
* deliver python tweaks (Bryan Newbold, 2019-02-19, 1 file, -5/+8)
|
* add GWB-to-S3 delivery pipeline script (Bryan Newbold, 2019-02-19, 2 files, -0/+162)
|
* give sort way more RAM by default (Bryan Newbold, 2019-02-01, 3 files, -6/+6)
|
* update (internal) journal-infra link (Bryan Newbold, 2019-01-03, 1 file, -1/+1)
|
* match_filter_enrich notes (Bryan Newbold, 2019-01-03, 1 file, -0/+12)
|
* remove old/redundant python CDX directory (Bryan Newbold, 2019-01-03, 3 files, -103/+0)
|     This was code from Vinay; it lives on in git history.