| Commit message | Author | Age | Files | Lines |
|
As reminded in code review, thanks Martin.
|
This is a large refactor of the ingest script. It adds a number of
filtering options (for all modes), and new modes for free-form queries
or limiting to specific external identifiers.
|
Refactoring to move this filter elsewhere
|
Good to have exceptions tracked and stored even for commands run from
the command line. But in particular the importer runs as a kafka worker
and should be tracking exceptions.
|
Caught by Martin in review; Thanks!
|
--fatcat-api-url is clearer than --host-url
remove unimplemented --debug (copy/paste from webface argparse)
use formatter which will display 'default' parameters with --help
Thanks to Martin for pointing out the latter, which I've always wanted!
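The help-formatter change can be sketched as follows. This is a minimal illustration, not the actual script: `build_parser`, the default URL, and the help string are assumptions made for the example. Only the `--fatcat-api-url` flag name comes from the commit message, and `argparse.ArgumentDefaultsHelpFormatter` is the standard-library formatter that appends default values to `--help` output.

```python
import argparse

# Assumed default for illustration only; the real script may use another value.
DEFAULT_API_URL = "http://localhost:9411/v0"


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose --help output shows each argument's default."""
    parser = argparse.ArgumentParser(
        # Appends "(default: ...)" to every argument that has a help string.
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument(
        "--fatcat-api-url",
        default=DEFAULT_API_URL,
        help="fatcat API endpoint to connect to",
    )
    return parser
```

Calling `build_parser().parse_args(["--help"])` would print the option together with its default. Note that the formatter only annotates arguments that have a non-empty `help` string.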
|
This gets rid of some messy error handling code by properly configuring
the elasticsearch client to just not clean up scroll iterators when
accessing the public (prod or qa) search interfaces.
Leaving the scroll state around isn't ideal, so we still delete them if
possible (eg, connecting directly to elasticsearch).
Thanks to Martin for pointing out this solution in review.
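One way to picture the scroll handling described above is a small helper that decides whether scroll state should be cleaned up, based on the search endpoint. This is only a sketch under assumptions: the commit message does not show the actual configuration, and `PUBLIC_SEARCH_HOSTS` and `scan_params` are hypothetical names. The resulting dict is meant to be passed as keyword arguments to a scan call such as `elasticsearch.helpers.scan`, which accepts a `clear_scroll` flag.

```python
from urllib.parse import urlparse

# Hypothetical set of public (prod and qa) search hosts, where deleting
# scroll state is assumed to be disallowed. Adjust to the real deployment.
PUBLIC_SEARCH_HOSTS = {"search.fatcat.wiki", "search.qa.fatcat.wiki"}


def scan_params(es_url: str) -> dict:
    """Return keyword arguments for a scroll/scan call against es_url.

    Scroll state is left in place on public search interfaces and
    cleared everywhere else (eg, a direct elasticsearch connection).
    """
    host = urlparse(es_url).hostname or ""
    return {"clear_scroll": host not in PUBLIC_SEARCH_HOSTS}
```

With a direct connection such as `http://localhost:9200` the helper keeps the default cleanup behavior, so scroll contexts are still deleted when that is possible.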
|
The intent of this tool is to make it easy to enqueue ingest requests into
kafka, to be processed by a worker pool and eventually end up inserted
into fatcat (for ingest hits that pass various checks).
As a specific example use-case, we have pretty good coverage of eLife (a
prominent OA publisher), but have missed some publications in the past,
and have a large gap for the year 2019:
https://fatcat.wiki/container/en4qj5ijrbf5djxx7p5zzpjyoq/coverage
This tool would make it trivial to enqueue all the missing releases to
be crawled.
Future variants on this tool could query for, eg, long-tail OA works.
|