path: root/python/fatcat_ingest.py
Commit message | Author | Age | Files | Lines
* switch '!= None' to 'is not None' | Bryan Newbold | 2020-02-04 | 1 | -3/+3
    As reminded in code review, thanks Martin.
* allow-non-oa is a top-level flag, not sub-command | Bryan Newbold | 2020-02-04 | 1 | -3/+0
* ingest: add 'extid' and 'query' modes; filters; refactor | Bryan Newbold | 2020-02-04 | 1 | -38/+147
    This is a large refactor of the ingest script. It adds a number of filtering options (for all modes), and new modes for free-form queries or limiting to specific external identifiers.
* remove 'oa_only' feature from ingest transform | Bryan Newbold | 2020-01-28 | 1 | -1/+0
    Refactoring to move this filter elsewhere
* add missing sentry/raven tags | Bryan Newbold | 2020-01-10 | 1 | -2/+7
    Good to have exceptions tracked and stored even for commands run from the command line. But in particular the importer runs as a kafka worker and should be tracking exceptions.
* container_issnl, not issnl, for ES release query | Bryan Newbold | 2019-12-12 | 1 | -1/+1
    Caught by Martin in review; Thanks!
* improve argparse usage | Bryan Newbold | 2019-12-11 | 1 | -6/+4
    --fatcat-api-url is clearer than --host-url
    remove unimplemented --debug (copy/paste from webface argparse)
    use formatter which will display 'default' parameters with --help
    Thanks to Martin for pointing out the latter, which I've always wanted!
* simplify ES scroll deletion using param() | Bryan Newbold | 2019-12-11 | 1 | -29/+29
    This gets rid of some messy error-handling code by properly configuring the elasticsearch client to just not clean up scroll iterators when accessing the public (prod or qa) search interfaces. Leaving the scroll state around isn't ideal, so we still delete them if possible (eg, connecting directly to elasticsearch).
    Thanks to Martin for pointing out this solution in review.
* add ingest-container command (new CLI tool) | Bryan Newbold | 2019-12-10 | 1 | -0/+136
    The intent of this tool is to make it easy to enqueue ingest requests into kafka, to be processed by a worker pool and eventually end up inserted into fatcat (for ingest hits that pass various checks).
    As a specific example use-case, we have pretty good coverage of eLife (a prominent OA publisher), but have missed some publications in the past, and have a large gap for the year 2019: https://fatcat.wiki/container/en4qj5ijrbf5djxx7p5zzpjyoq/coverage
    This tool would make it trivial to enqueue all the missing releases to be crawled. Future variants on this tool could query for, eg, long-tail OA works.
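
Several of the entries above touch the script's command-line handling: the help formatter that shows defaults, the --fatcat-api-url option, allow-non-oa as a top-level flag, and the 'is not None' comparison. The sketch below illustrates those four points together. It is not the code from fatcat_ingest.py: the default URL, the --limit option, the describe_limit() helper, and the help strings are assumptions made up for illustration; only the option names --fatcat-api-url and --allow-non-oa and the 'query' mode name come from the log.

```python
import argparse


def build_parser():
    # ArgumentDefaultsHelpFormatter appends "(default: ...)" to every help
    # string, so --help shows each option's default value.
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument(
        '--fatcat-api-url',
        default="http://localhost:9411/v0",  # assumed default, illustration only
        help="fatcat API endpoint to connect to")
    # Top-level flag: it must be given before the sub-command name.
    parser.add_argument(
        '--allow-non-oa',
        action='store_true',
        help="also ingest releases that are not known to be open access")
    # Hypothetical filter option, used to show the None comparison below.
    parser.add_argument(
        '--limit',
        type=int,
        default=None,
        help="maximum number of ingest requests to enqueue")
    subparsers = parser.add_subparsers(dest='mode')
    sub_query = subparsers.add_parser('query', help="free-form search query")
    sub_query.add_argument('query', help="search query string")
    return parser


def describe_limit(limit):
    # 'is not None' is an identity test, so an explicit limit of 0 still
    # counts as "set"; a truthiness test would silently drop it, and
    # '!= None' can be overridden by a custom __eq__ or __ne__.
    if limit is not None:
        return "limit={}".format(limit)
    return "no limit"


if __name__ == '__main__':
    args = build_parser().parse_args()
    print(args.mode, args.fatcat_api_url, args.allow_non_oa,
          describe_limit(args.limit))
```

Because --allow-non-oa is attached to the top-level parser here, an invocation would put it ahead of the mode, for example `--allow-non-oa query "some query"`.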