path: root/python/fatcat_ingest.py
Commit message | Author | Age | Files | Lines
* container_issnl, not issnl, for ES release query | Bryan Newbold | 2019-12-12 | 1 | -1/+1
    Caught by Martin in review; Thanks!
* improve argparse usage | Bryan Newbold | 2019-12-11 | 1 | -6/+4
    --fatcat-api-url is clearer than --host-url
    remove unimplemented --debug (copy/paste from webface argparse)
    use formatter which will display 'default' parameters with --help
    Thanks to Martin for pointing out the latter, which I've always wanted!
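The formatter change in this commit can be illustrated with a minimal sketch using the standard library's `argparse.ArgumentDefaultsHelpFormatter`, which appends each option's default value to its `--help` text. The default URL and the help string below are assumptions for illustration, not values taken from the actual script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose --help output shows default values."""
    parser = argparse.ArgumentParser(
        prog="fatcat_ingest.py",
        # This formatter appends "(default: ...)" to every option
        # that has a help string.
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument(
        "--fatcat-api-url",
        # Assumed default, for illustration only.
        default="http://localhost:9411/v0",
        help="fatcat API endpoint to connect to",
    )
    return parser


# Render the help text without exiting, then parse an empty argument list.
help_text = build_parser().format_help()
args = build_parser().parse_args([])
print(help_text)
print(args.fatcat_api_url)
```

Only options that carry a `help=` string get the default appended, so options without help text are not annotated.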
* simplify ES scroll deletion using param() | Bryan Newbold | 2019-12-11 | 1 | -29/+29
    This gets rid of some messy error-handling code by properly configuring the elasticsearch client to not clean up scroll iterators when accessing the public (prod or qa) search interfaces. Leaving the scroll state around isn't ideal, so we still delete it if possible (e.g., when connecting directly to elasticsearch).
    Thanks to Martin for pointing out this solution in review.
* add ingest-container command (new CLI tool) | Bryan Newbold | 2019-12-10 | 1 | -0/+136
    The intent of this tool is to make it easy to enqueue ingest requests into kafka, to be processed by a worker pool and eventually end up inserted into fatcat (for ingest hits that pass various checks).
    As a specific example use-case, we have pretty good coverage of eLife (a prominent OA publisher), but have missed some publications in the past, and have a large gap for the year 2019: https://fatcat.wiki/container/en4qj5ijrbf5djxx7p5zzpjyoq/coverage
    This tool would make it trivial to enqueue all the missing releases to be crawled. Future variants on this tool could query for, e.g., long-tail OA works.
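The enqueue flow described in this commit might be sketched roughly as follows. Everything here is hypothetical: the request fields (`ingest_type`, `base_url`, `release_ident`), the `in_ia` flag, the topic name, and the `produce` callable standing in for a Kafka producer are illustrative assumptions, not the tool's actual schema or code.

```python
import json
from typing import Callable, Dict, Iterable


def release_to_request(release: Dict) -> Dict:
    """Build an ingest request from a release search hit.

    All field names are hypothetical, chosen for illustration.
    """
    return {
        "ingest_type": "pdf",
        "base_url": "https://doi.org/" + release["doi"],
        "release_ident": release["ident"],
    }


def enqueue_missing(
    releases: Iterable[Dict],
    produce: Callable[[str, bytes], None],
    topic: str = "ingest-requests",  # hypothetical topic name
) -> int:
    """Send one request per release that still lacks a preserved copy."""
    count = 0
    for release in releases:
        if release.get("in_ia"):
            continue  # already covered, nothing to crawl
        if not release.get("doi"):
            continue  # this sketch can only build a URL from a DOI
        payload = json.dumps(release_to_request(release), sort_keys=True)
        produce(topic, payload.encode("utf-8"))
        count += 1
    return count


# Usage with an in-memory stand-in for the producer.
sent = []
hits = [
    {"ident": "a", "doi": "10.1/x", "in_ia": False},
    {"ident": "b", "doi": "10.1/y", "in_ia": True},
    {"ident": "c"},
]
total = enqueue_missing(hits, lambda t, v: sent.append((t, v)))
print(total, sent)
```

Passing the producer in as a callable keeps the sketch independent of any particular Kafka client library; a real producer's send method could be wrapped to match the same two-argument shape.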