...
| * guide fix: code and db uses release_stage  [Martin Czygan, 2019-12-17, 1 file, -2/+2]
|/
* Merge branch 'martin-importers-common-print-stderr' into 'master'  [bnewbold, 2019-12-16, 1 file, -2/+2]
|\    write diagnostic messages to stderr
| |   See merge request webgroup/fatcat!10
| * write diagnostic messages to stderr  [Martin Czygan, 2019-12-16, 1 file, -2/+2]
|/    During debugging, it can be helpful to keep stdout (e.g. processing results) and diagnostic messages separate.
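As an aside, the stdout/stderr split described above comes down to a one-argument change in Python. The sketch below uses hypothetical function and record names rather than the actual importer code:

```python
import sys

def process_record(record):
    # Illustrative placeholder for an importer's per-record work.
    return {"status": "ok", "record": record}

def run(records):
    for record in records:
        result = process_record(record)
        # Machine-readable results go to stdout, so they can be piped onward.
        print(result["status"])
        # Human-oriented diagnostics go to stderr, keeping stdout clean.
        print(f"processed: {record}", file=sys.stderr)

if __name__ == "__main__":
    run(["rec-1", "rec-2"])
```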
* Merge branch 'martin-importers-common-doc-fix' into 'master'  [Martin Czygan, 2019-12-14, 1 file, -13/+10]
|\    Update EntityImporter docstring.
| |   See merge request webgroup/fatcat!9
| * complete parse_record docstring  [Martin Czygan, 2019-12-14, 1 file, -0/+6]
| |
| * Update EntityImporter docstring.  [Martin Czygan, 2019-12-13, 1 file, -13/+4]
| |   I believe the required method is `parse_record`, not `parse`.
* | add ingest import file collision protection  [Bryan Newbold, 2019-12-13, 1 file, -0/+6]
| |   The common case is the same URL being submitted repeatedly during testing. This is only within-editgroup, and per importer (eg, won't work across spn importer "submitted" editgroups), but is better than nothing.
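A minimal sketch of the within-editgroup protection described above, assuming a set of already-seen URLs that resets when the editgroup is flushed; class and method names are hypothetical, not the real importer code:

```python
class IngestFileImporterSketch:
    """Sketch of per-editgroup duplicate-URL protection (hypothetical names)."""

    def __init__(self):
        self._urls_in_editgroup = set()

    def want(self, request):
        url = request.get("base_url")
        if url in self._urls_in_editgroup:
            # Same URL already queued in this editgroup; skip it.
            return False
        self._urls_in_editgroup.add(url)
        return True

    def flush_editgroup(self):
        # Once the editgroup is submitted, the protection resets, which is why
        # duplicates can still occur across editgroups (or across importers).
        self._urls_in_editgroup.clear()
```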
* | fix spn kafka topic env var  [Bryan Newbold, 2019-12-13, 1 file, -1/+1]
| |
* | update ingest request schema  [Bryan Newbold, 2019-12-13, 5 files, -16/+44]
| |   This is mostly changing ingest_type from 'file' to 'pdf', and adding 'link_source'/'link_source_id', plus some small cleanups.
* | remove default mimetype from ingest-file importer  [Bryan Newbold, 2019-12-13, 1 file, -2/+1]
| |   We really should just use the file_meta result or nothing.
* | revert accidentally committed test timing  [Bryan Newbold, 2019-12-13, 1 file, -2/+2]
| |   Also fix a spurious typo.
* | ensure importer description arg isn't clobbered  [Bryan Newbold, 2019-12-12, 3 files, -5/+5]
| |
* | tweaks to ingest-file transform  [Bryan Newbold, 2019-12-12, 1 file, -13/+7]
| |
* | initial 'Save Paper Now' web form  [Bryan Newbold, 2019-12-12, 7 files, -2/+228]
| |
* | more auth token vars in example.env  [Bryan Newbold, 2019-12-12, 1 file, -0/+6]
| |   As a form of documentation
* | savepapernow result importer  [Bryan Newbold, 2019-12-12, 3 files, -4/+89]
| |   Based on ingest-file-results importer
* | flush importer editgroups every few minutes  [Bryan Newbold, 2019-12-12, 1 file, -5/+20]
| |
* | EntityImporter: submit (not accept) mode  [Bryan Newbold, 2019-12-12, 1 file, -2/+14]
|/    For use with bots that don't have admin privileges, or where human follow-up review is desired.
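A hedged sketch of what a submit-versus-accept switch can look like; the `api` object and its method names here are placeholders, not the real fatcat client API:

```python
class EntityImporterSketch:
    """Sketch of a submit-vs-accept mode flag (placeholder API names)."""

    def __init__(self, api, submit_mode=False):
        self.api = api
        # submit_mode=True leaves editgroups pending human review instead of
        # auto-accepting them (useful for bots without admin privileges).
        self.submit_mode = submit_mode

    def finish_editgroup(self, editgroup_id):
        if self.submit_mode:
            self.api.submit_editgroup(editgroup_id)
        else:
            self.api.accept_editgroup(editgroup_id)
```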
* Merge branch 'bnewbold-ingest-oa-container' into 'master'  [bnewbold, 2019-12-12, 6 files, -3/+181]
|\    container-ingest tool
| |   See merge request webgroup/fatcat!8
| * container_issnl, not issnl, for ES release query  [Bryan Newbold, 2019-12-12, 1 file, -1/+1]
| |   Caught by Martin in review; thanks!
| * improve argparse usage  [Bryan Newbold, 2019-12-11, 1 file, -6/+4]
| |   --fatcat-api-url is clearer than --host-url.
| |   Remove unimplemented --debug (copy/paste from webface argparse).
| |   Use a formatter which will display 'default' parameters with --help. Thanks to Martin for pointing out the latter, which I've always wanted!
| * simplify ES scroll deletion using param()  [Bryan Newbold, 2019-12-11, 1 file, -29/+29]
| |   This gets rid of some messy error handling code by properly configuring the elasticsearch client to just not clean up scroll iterators when accessing the public (prod or qa) search interfaces. Leaving the scroll state around isn't ideal, so we still delete them if possible (eg, when connecting directly to elasticsearch). Thanks to Martin for pointing out this solution in review.
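For illustration, this is roughly how elasticsearch-dsl's params() hook can pass clear_scroll=False through to the scan helper; the endpoint, index name, and query below are assumptions, not the actual tool's code:

```python
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

# Assumed public, read-only endpoint and index name; adjust for a real setup.
client = Elasticsearch(["https://search.fatcat.wiki"])

search = (
    Search(using=client, index="fatcat_release")
    .query("term", container_id="en4qj5ijrbf5djxx7p5zzpjyoq")
    # params() forwards extra keyword arguments to the underlying scan call;
    # clear_scroll=False skips the scroll-cleanup request, which a public
    # read-only endpoint may refuse anyway.
    .params(clear_scroll=False)
)

for hit in search.scan():
    print(hit.meta.id)
```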
| * add ingest-container command (new CLI tool)  [Bryan Newbold, 2019-12-10, 1 file, -0/+136]
| |   The intent of this tool is to make it easy to enqueue ingest requests into kafka, to be processed by a worker pool and eventually end up inserted into fatcat (for ingest hits that pass various checks).
| |   As a specific example use-case, we have pretty good coverage of eLife (a prominent OA publisher), but have missed some publications in the past, and have a large gap for the year 2019: https://fatcat.wiki/container/en4qj5ijrbf5djxx7p5zzpjyoq/coverage
| |   This tool would make it trivial to enqueue all the missing releases to be crawled. Future variants on this tool could query for, eg, long-tail OA works.
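A sketch of the enqueue step under stated assumptions: the broker address, topic name, and exact request layout below are illustrative, loosely following the ingest request schema fields mentioned in nearby commits:

```python
import json
from confluent_kafka import Producer

# Hypothetical broker address, for illustration only.
producer = Producer({"bootstrap.servers": "localhost:9092"})

def enqueue_ingest_request(release_ident, base_url):
    # Field names loosely follow the ingest request schema described above;
    # treat the exact layout as an assumption.
    request = {
        "ingest_type": "pdf",
        "ingest_request_source": "fatcat-ingest-container",
        "base_url": base_url,
        "fatcat": {"release_ident": release_ident},
    }
    producer.produce(
        "sandcrawler-prod.ingest-file-requests",  # hypothetical topic name
        json.dumps(request).encode("utf-8"),
    )

enqueue_ingest_request("aaaaaaaaaaaaarceaaaaaaaaai", "https://example.com/paper.pdf")
producer.flush()
```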
| * factor out some basic kafka helpers  [Bryan Newbold, 2019-12-10, 2 files, -0/+23]
| |
| * add another ingest request source to whitelist  [Bryan Newbold, 2019-12-10, 1 file, -2/+5]
| |
| * pipenv: add elasticsearch and elasticsearch-dsl libraries  [Bryan Newbold, 2019-12-10, 2 files, -1/+19]
| |   These are low-level and high-level (respectively) client wrappers for elasticsearch.
* | improve argparse usage  [Bryan Newbold, 2019-12-11, 10 files, -78/+95]
| |   Use --fatcat-api-url instead of (ambiguous) --host-url for commands that aren't deployed/running via systemd.
| |   TODO: update the other --host-url usage, and either roll out the change consistently or support the old arg as an alias during cut-over.
| |   Use argparse.ArgumentDefaultsHelpFormatter (thanks Martin!)
| |   Add help messages for all sub-commands, both as documentation and as a way to get argparse to print available commands in a more readable format.
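The formatter change is a one-line argparse option. Here is a minimal, self-contained example; the flag name mirrors the commit, while the default URL and sub-command are illustrative:

```python
import argparse

parser = argparse.ArgumentParser(
    description="example fatcat-style CLI",
    # Displays each argument's default value in --help output.
    formatter_class=argparse.ArgumentDefaultsHelpFormatter,
)
parser.add_argument(
    "--fatcat-api-url",
    default="http://localhost:9411/v0",  # assumed default, for illustration
    help="fatcat API endpoint to use",
)
subparsers = parser.add_subparsers(dest="command")
subparsers.add_parser("ingest-container", help="enqueue ingest requests for one container")

args = parser.parse_args([])
print(args.fatcat_api_url)
```

Running with --help would then list both the sub-commands and the default value of --fatcat-api-url.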
* | add kafka-pixy to docker-compose file  [Bryan Newbold, 2019-12-10, 1 file, -0/+8]
| |
* | tweaks to docker-compose image  [Bryan Newbold, 2019-12-10, 1 file, -0/+5]
|/    - don't start kafka image until zookeeper is running
|     - set very liberal "watermarks" for elasticsearch disk monitoring
* another schema update idea (containers)  [Bryan Newbold, 2019-12-09, 1 file, -0/+1]
|
* fix delete release history view  [Bryan Newbold, 2019-12-09, 1 file, -1/+1]
|     This was causing 5xx errors in production and qa. Eg, at: https://qa.fatcat.wiki/release/aaaaaaaaaaaaarceaaaaaaaaai/history
* regression test for deleted entity history view  [Bryan Newbold, 2019-12-09, 1 file, -0/+25]
|
* add missing underline in deleted entity web view  [Bryan Newbold, 2019-12-09, 1 file, -1/+1]
|
* Merge branch 'bnewbold-crossref-harvest-test' into 'master'  [Martin Czygan, 2019-12-09, 5 files, -22/+82]
|\    Basic mocked test for crossref harvester
| |   See merge request webgroup/fatcat!7
| * add basic test for crossref harvest API call  [Bryan Newbold, 2019-12-06, 2 files, -0/+46]
| |
| * refactor kafka producer in crossref harvester  [Bryan Newbold, 2019-12-06, 1 file, -21/+26]
| |   Producer creation/configuration should happen at __init__() time, not in the 'daily' call. This specific refactor was motivated by mocking out the producer in unit tests.
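A sketch of the refactor's shape, with illustrative names: the producer is built (or injected) once in __init__, so tests can pass a mock instead of patching inside daily(). This is not the real harvester code:

```python
from confluent_kafka import Producer

class HarvesterSketch:
    """Sketch of moving producer setup into __init__ (illustrative names)."""

    def __init__(self, kafka_config, producer=None):
        # Creating the producer once here, or accepting an injected one, makes
        # it straightforward to substitute a mock in unit tests.
        self.producer = producer or Producer(kafka_config)

    def daily(self, records, topic):
        for record in records:
            self.producer.produce(topic, record.encode("utf-8"))
        self.producer.flush()
```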
| * add pytest-mock helper library to dev deps  [Bryan Newbold, 2019-12-06, 2 files, -1/+10]
| |
* | Merge branch 'martin-increase-docker-kafka-message-size' into 'master'  [bnewbold, 2019-12-06, 1 file, -0/+1]
|\ \  increase max.message.bytes in container
| |   See merge request webgroup/fatcat!5
| * increase max.message.bytes in container  [Martin Czygan, 2019-12-05, 1 file, -0/+1]
|/    While working on datacite, some messages were larger than the default of 1000012 bytes.
* improve previous commit (JATS abstract hack)  [Bryan Newbold, 2019-12-03, 1 file, -4/+6]
|
* hack: remove enclosing JATS XML tags around abstracts  [Bryan Newbold, 2019-12-03, 1 file, -1/+7]
|     The more complete fix is to actually render the JATS to HTML and display that. This is just to fix a nit with the most common case of XML tags in abstracts.
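A rough sketch of this kind of tag-stripping hack, using a hypothetical helper and regex; the real fix would render JATS to HTML rather than strip tags:

```python
import re

# Strip a single enclosing <jats:...> open/close tag pair from an abstract
# string. Hypothetical helper, not the actual fatcat implementation.
JATS_TAG = re.compile(r"^\s*<jats:[^>]+>\s*|\s*</jats:[^>]+>\s*$")

def strip_enclosing_jats(abstract: str) -> str:
    return JATS_TAG.sub("", abstract)

print(strip_enclosing_jats("<jats:p>An abstract about metadata.</jats:p>"))
# -> "An abstract about metadata."
```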
* tweaks to file ingest importer  [Bryan Newbold, 2019-12-03, 2 files, -3/+10]
|     - allow overriding source filter whitelist (common case for CLI use)
|     - fix editgroup description env variable pass-through
* crossref is_update isn't what I thought  [Bryan Newbold, 2019-12-03, 1 file, -6/+2]
|     I thought this would filter for metadata updates to an existing DOI, but actually "updates" are a type of DOI (eg, a retraction).
|     TODO: handle 'updates' field. Should both do a lookup and set work_ident appropriately, and store in crossref-specific metadata.
* bump required rust to 1.36  [Bryan Newbold, 2019-12-03, 2 files, -2/+2]
|     This isn't a fatcat rust requirement, but instead a diesel requirement, via rust-smallvec, which in v1.0 uses the alloc crate: https://github.com/servo/rust-smallvec/issues/73
|     I think the reason this came up now is that diesel-cli is an application and doesn't have a Cargo.lock file, and the build was updated. Using some binary mechanism to install these dependencies would be more robust, but feels like a yak shave right now.
* update gitlab-ci to rust 1.34  [Bryan Newbold, 2019-12-03, 1 file, -1/+1]
|     Apparently the rust:1.34-stretch image is gone from docker hub, and this was causing CI errors.
* make file edit form hash values case insensitive  [Bryan Newbold, 2019-12-02, 1 file, -0/+3]
|     Test in previous commit. This fixes a user-reported 500 error when creating a file with SHA1/SHA256/MD5 hashes in upper-case.
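A minimal sketch of case normalization in a form-cleaning helper, with a hypothetical function name; the same idea applies to SHA-256 and MD5 fields:

```python
import re

SHA1_RE = re.compile(r"^[0-9a-f]{40}$")

def clean_sha1(raw: str) -> str:
    """Hypothetical form-cleaning helper: normalize to lower-case before
    validating, so upper-case hashes submitted by users don't cause errors."""
    value = raw.strip().lower()
    if not SHA1_RE.match(value):
        raise ValueError("not a valid SHA-1 hex digest")
    return value

print(clean_sha1("  DA39A3EE5E6B4B0D3255BFEF95601890AFD80709  "))
```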
* add regression test for upper-case SHA-1 form submit  [Bryan Newbold, 2019-12-02, 1 file, -0/+10]
|
* re-order ingest want() for better stats  [Bryan Newbold, 2019-11-15, 1 file, -7/+10]
|
* project -> ingest_request_source  [Bryan Newbold, 2019-11-15, 3 files, -9/+9]
|
* have ingest-file-results importer operate as crawl-bot  [Bryan Newbold, 2019-11-15, 1 file, -1/+1]
|     As opposed to sandcrawler-bot