| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
| |
This is mostly changing ingest_type from 'file' to 'pdf', and adding
'link_source'/'link_source_id', plus some small cleanups.
|
|
|
|
| |
We really should just use file_meta result or nothing.
|
|
|
|
| |
Also fix a spurious typo.
|
| |
|
| |
|
| |
|
|
|
|
| |
As a form of documentation
|
|
|
|
| |
Based on ingest-file-results importer
|
| |
|
|
|
|
|
| |
For use with bots that don't have admin privileges, or where human
follow-up review is desired.
|
|\
| |
| |
| |
| | |
container-ingest tool
See merge request webgroup/fatcat!8
|
| |
| |
| |
| | |
Caught by Martin in review; Thanks!
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
--fatcat-api-url is clearer than --host-url
remove unimplemented --debug (copy/paste from webface argparse)
use formatter which will display 'default' parameters with --help
Thanks to Martin for pointing out the latter, which I've always wanted!
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This gets rid of some messy error-handling code by properly configuring
the elasticsearch client to just not clean up scroll iterators when
accessing the public (prod or qa) search interfaces.
Leaving the scroll state around isn't ideal, so we still delete them if
possible (eg, connecting directly to elasticsearch).
Thanks to Martin for pointing out this solution in review.
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The intent of this tool is to make it easy to enqueue ingest requests into
kafka, to be processed by a worker pool and eventually end up inserted
into fatcat (for ingest hits that pass various checks).
As a specific example use-case, we have pretty good coverage of eLife (a
prominent OA publisher), but have missed some publications in the past,
and have a large gap for the year 2019:
https://fatcat.wiki/container/en4qj5ijrbf5djxx7p5zzpjyoq/coverage
This tool would make it trivial to enqueue all the missing releases to
be crawled.
Future variants on this tool could query for, eg, long-tail OA works.
|
| | |
|
| | |
|
| |
| |
| |
| |
| | |
These are low-level and high-level (respectively)
client wrappers for elasticsearch
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Use --fatcat-api-url instead of (ambiguous) --host-url for commands that
aren't deployed/running via systemd.
TODO: update the other --host-url usage, and either roll-out change
consistently or support the old arg as an alias during cut-over
Use argparse.ArgumentDefaultsHelpFormatter (thanks Martin!)
Add help messages for all sub-commands, both as documentation and as a
way to get argparse to print available commands in a more readable
format.
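A minimal sketch of the argparse setup described above, using only the standard library. The sub-command name `ingest-container`, the `--limit` option, and the default URL are hypothetical and chosen for illustration; only `--fatcat-api-url` and `argparse.ArgumentDefaultsHelpFormatter` come from the message itself.

```python
import argparse

# Assumed default for illustration only; not taken from the project.
ASSUMED_API_URL = "http://localhost:9411/v0"


def build_parser() -> argparse.ArgumentParser:
    # ArgumentDefaultsHelpFormatter appends "(default: ...)" to each help string.
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument(
        "--fatcat-api-url",
        default=ASSUMED_API_URL,
        help="fatcat API endpoint to connect to",
    )
    subparsers = parser.add_subparsers(dest="command")

    # Giving each sub-command a help= string makes argparse list it,
    # one per line, in the top-level --help output.
    ingest = subparsers.add_parser(
        "ingest-container",  # hypothetical sub-command name
        help="enqueue ingest requests for releases in a container",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    ingest.add_argument(
        "--limit",  # hypothetical option
        type=int,
        default=100,
        help="maximum number of releases to enqueue",
    )
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

The formatter has to be passed to each sub-parser separately; it is not inherited from the top-level parser.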
|
| | |
|
|/
|
|
|
| |
- don't start kafka image until zookeeper is running
- set very liberal "watermarks" for elasticsearch disk monitoring
|
| |
|
|
|
|
|
|
| |
This was causing 5xx errors in production and qa. Eg, at:
https://qa.fatcat.wiki/release/aaaaaaaaaaaaarceaaaaaaaaai/history
|
| |
|
| |
|
|\
| |
| |
| |
| | |
Basic mocked test for crossref harvester
See merge request webgroup/fatcat!7
|
| | |
|
| |
| |
| |
| |
| |
| |
| |
| | |
Producer creation/configuration should happen at __init__() time,
not in the 'daily' call.
This specific refactor was motivated by mocking out the producer in unit
tests.
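To illustrate why building the producer in __init__() helps testing, here is a rough sketch. The class name `HarvestWorker`, the `producer_factory` argument, the `fetch_date` helper, and the topic name are all hypothetical; the only assumption about the producer is that it exposes `produce(topic, value)` and `flush()` methods.

```python
from unittest import mock


class HarvestWorker:
    """Hypothetical harvester, not the project's actual class."""

    def __init__(self, producer_factory, topic):
        # The producer is created once, at construction time, so a test
        # can hand in a factory that returns a mock.
        self.producer = producer_factory()
        self.topic = topic

    def fetch_date(self, date_str):
        # Stand-in for the real HTTP API call; returns fake records.
        return [
            f"{date_str}-record-1".encode("utf-8"),
            f"{date_str}-record-2".encode("utf-8"),
        ]

    def daily(self, date_str):
        # Only publishes; no producer setup happens here any more.
        count = 0
        for record in self.fetch_date(date_str):
            self.producer.produce(self.topic, record)
            count += 1
        self.producer.flush()
        return count


def demo_with_mock():
    fake_producer = mock.Mock()
    worker = HarvestWorker(lambda: fake_producer, "example-topic")
    pushed = worker.daily("2019-12-01")
    return pushed, fake_producer.produce.call_count, fake_producer.flush.call_count
```

With this shape, a unit test never needs a running broker: it asserts on the calls recorded by the mock.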
|
| | |
|
|\ \
| |/
|/|
| |
| | |
increase max.message.bytes in container
See merge request webgroup/fatcat!5
|
|/
|
|
|
| |
While working on datacite, some messages were larger than the default of
1000012 bytes.
|
| |
|
|
|
|
|
|
| |
The more complete fix is to actually render the JATS to HTML and display
that. This is just to fix a nit with the most common case of XML tags in
abstracts.
|
|
|
|
|
| |
- allow overriding source filter whitelist (common case for CLI use)
- fix editgroup description env variable pass-through
|
|
|
|
|
|
|
|
| |
I thought this would filter for metadata updates to an existing DOI, but
actually "updates" are a type of DOI (eg, a retraction).
TODO: handle 'updates' field. Should both do a lookup and set work_ident
appropriately, and store in crossref-specific metadata.
|
|
|
|
|
|
|
|
|
|
|
|
| |
This isn't a fatcat rust requirement, but instead a diesel requirement,
via rust-smallvec, which in v1.0 uses the alloc crate:
https://github.com/servo/rust-smallvec/issues/73
I think the reason this came up now is that diesel-cli is an
application and doesn't have a Cargo.lock file, and the build was
updated. Using some binary mechanism to install these dependencies would
be more robust, but feels like a yak shave right now.
|
|
|
|
|
| |
Apparently the rust:1.34-stretch image is gone from docker hub, and this
was causing CI errors.
|
|
|
|
|
|
|
| |
Test in previous commit.
This fixes a user-reported 500 error when creating a file with
SHA1/SHA256/MD5 hashes in upper-case.
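The fix itself is not described in this message. One plausible approach, sketched here purely as an assumption, is to lower-case hex digests before validating them, so upper-case input is accepted and not left to fail further down. The function name `normalize_hash` and the length table are hypothetical.

```python
import re

# Hex digest lengths for the three hash types named above.
HASH_LENGTHS = {"md5": 32, "sha1": 40, "sha256": 64}


def normalize_hash(kind: str, value: str) -> str:
    """Return the digest in lower-case hex, or raise ValueError."""
    if kind not in HASH_LENGTHS:
        raise ValueError(f"unknown hash type: {kind}")
    cleaned = value.strip().lower()
    # After lower-casing, the digest must be exactly N hex characters.
    pattern = r"[0-9a-f]{%d}" % HASH_LENGTHS[kind]
    if not re.fullmatch(pattern, cleaned):
        raise ValueError(f"not a valid {kind} digest: {value!r}")
    return cleaned


if __name__ == "__main__":
    print(normalize_hash("md5", "D41D8CD98F00B204E9800998ECF8427E"))
```

Raising a ValueError for malformed input lets the caller turn it into a client error response, not an unhandled server error.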
|
| |
|
| |
|
| |
|
|
|
|
| |
As opposed to sandcrawler-bot
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Initially was going to create a new worker to consume from the release
update channel, but couldn't get the edit context ("is this a new
release, or an update to an existing one?") from that channel.
Currently there is a flag in source code to control whether we only do
OA releases or all releases. Starting with OA only, to ramp up slowly, but
should probably default to all, and make this a config flag. Should
probably also have a config flag to control this entire feature.
Tested locally in dev.
|