Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | add basic MedlineDate year parsing | Bryan Newbold | 2019-12-23 | 1 | -0/+11 |
| | |||||
* | fix spn/ingest importer duplication check | Bryan Newbold | 2019-12-22 | 1 | -6/+8 |
| | | | | | | Check was happening after the `return True` by mistake, allowing duplicates in SPN editgroups, and potentially in ingest request editgroups as well. | ||||
* | write diagnostic messages to stderr | Martin Czygan | 2019-12-16 | 1 | -2/+2 |
| | | | | | During debugging, it can be helpful to keep stdout (e.g. processing results) and diagnostic messages separate. | ||||
* | Merge branch 'martin-importers-common-doc-fix' into 'master' | Martin Czygan | 2019-12-14 | 1 | -13/+10 |
|\ | | | | | | | | | Update EntityImporter docstring. See merge request webgroup/fatcat!9 | ||||
| * | complete parse_record docstring | Martin Czygan | 2019-12-14 | 1 | -0/+6 |
| | | |||||
| * | Update EntityImporter docstring. | Martin Czygan | 2019-12-13 | 1 | -13/+4 |
| | | | | | | | | I believe the required method is `parse_record`, not `parse`. | ||||
* | | add ingest import file collision protection | Bryan Newbold | 2019-12-13 | 1 | -0/+6 |
| | | | | | | | | | | | | | | | | The common case is the same URL being submitted repeatedly during testing. This is only within-editgroup, and per importer (eg, won't work across spn importer "submitted" editgroups), but is better than nothing. | ||||
* | | update ingest request schema | Bryan Newbold | 2019-12-13 | 3 | -8/+30 |
| | | | | | | | | | | This is mostly changing ingest_type from 'file' to 'pdf', and adding 'link_source'/'link_source_id', plus some small cleanups. | ||||
* | | remove default mimetype from ingest-file importer | Bryan Newbold | 2019-12-13 | 1 | -2/+1 |
| | | | | | | | | We really should just use file_meta result or nothing. | ||||
* | | revert accidentally committed test timing | Bryan Newbold | 2019-12-13 | 1 | -2/+2 |
| | | | | | | | | Also fix a spurious typo. | ||||
* | | ensure importer description arg isn't clobbered | Bryan Newbold | 2019-12-12 | 3 | -5/+5 |
| | | |||||
* | | tweaks to ingest-file transform | Bryan Newbold | 2019-12-12 | 1 | -13/+7 |
| | | |||||
* | | savepapernow result importer | Bryan Newbold | 2019-12-12 | 2 | -4/+65 |
| | | | | | | | | Based on ingest-file-results importer | ||||
* | | flush importer editgroups every few minutes | Bryan Newbold | 2019-12-12 | 1 | -5/+20 |
| | | |||||
* | | EntityImporter: submit (not accept) mode | Bryan Newbold | 2019-12-12 | 1 | -2/+14 |
|/ | | | | | For use with bots that don't have admin privileges, or where human follow-up review is desired. | ||||
* | factor out some basic kafka helpers | Bryan Newbold | 2019-12-10 | 2 | -0/+23 |
| | |||||
* | add another ingest request source to whitelist | Bryan Newbold | 2019-12-10 | 1 | -2/+5 |
| | |||||
* | refactor kafka producer in crossref harvester | Bryan Newbold | 2019-12-06 | 1 | -21/+26 |
| | | | | | | | | producer creation/configuration should be happening at __init__() time, not in the 'daily' call. This specific refactor was motivated by mocking out the producer in unit tests. | ||||
* | tweaks to file ingest importer | Bryan Newbold | 2019-12-03 | 1 | -3/+4 |
| | | | | | - allow overriding source filter whitelist (common case for CLI use) - fix editgroup description env variable pass-through | ||||
* | crossref is_update isn't what I thought | Bryan Newbold | 2019-12-03 | 1 | -6/+2 |
| | | | | | | | | I thought this would filter for metadata updates to an existing DOI, but actually "updates" are a type of DOI (eg, a retraction). TODO: handle 'updates' field. Should both do a lookup and set work_ident appropriately, and store in crossref-specific metadata. | ||||
* | re-order ingest want() for better stats | Bryan Newbold | 2019-11-15 | 1 | -7/+10 |
| | |||||
* | project -> ingest_request_source | Bryan Newbold | 2019-11-15 | 3 | -9/+9 |
| | |||||
* | fix release.pmcid typo | Bryan Newbold | 2019-11-15 | 1 | -2/+2 |
| | |||||
* | ingest importer fixes | Bryan Newbold | 2019-11-15 | 1 | -3/+4 |
| | |||||
* | more ingest importer comments and counts | Bryan Newbold | 2019-11-15 | 2 | -2/+29 |
| | |||||
* | crude support for 'sandcrawler' kafka topics | Bryan Newbold | 2019-11-15 | 1 | -2/+3 |
| | |||||
* | ingest file result importer | Bryan Newbold | 2019-11-15 | 2 | -2/+135 |
| | |||||
* | add ingest request feature to entity_updates worker | Bryan Newbold | 2019-11-15 | 1 | -4/+20 |
| | | | | | | | | | | | | | Initially was going to create a new worker to consume from the release update channel, but couldn't get the edit context ("is this a new release, or update to an existing") from that context. Currently there is a flag in source code to control whether we only do OA releases or all releases. Starting with OA only to start slow, but should probably default to all, and make this a config flag. Should probably also have a config flag to control this entire feature. Tested locally in dev. | ||||
* | add ingest request transform (and test) | Bryan Newbold | 2019-11-15 | 2 | -0/+67 |
| | |||||
* | crossref: accurate blank title counts | Bryan Newbold | 2019-11-05 | 1 | -0/+1 |
| | |||||
* | crossref: component type | Bryan Newbold | 2019-11-04 | 1 | -1/+3 |
| | |||||
* | crossref: count why skip happened | Bryan Newbold | 2019-11-04 | 1 | -1/+7 |
| | | | | | | Might skip based on release type (eg container, not a paper/release), or missing title, or other reasons. Over 7 million DOIs are getting skipped, curious why. | ||||
* | crossref: don't skip on short/null subtitle | Bryan Newbold | 2019-11-04 | 1 | -1/+1 |
| | | | | This was a bug. Should only set subtitle blank, not skip the import. | ||||
* | file cleanup tweaks to actually run | Bryan Newbold | 2019-10-08 | 2 | -5/+4 |
| | |||||
* | refactor duplicated b32_hex function in importers | Bryan Newbold | 2019-10-08 | 3 | -21/+11 |
| | |||||
* | dict wrapper for entity_from_json() | Bryan Newbold | 2019-10-08 | 2 | -3/+7 |
| | |||||
* | new cleanup python tool/framework | Bryan Newbold | 2019-10-08 | 4 | -0/+241 |
| | |||||
* | review/fix all confluent-kafka produce code | Bryan Newbold | 2019-09-20 | 6 | -27/+75 |
| | |||||
* | small fixes to confluent-kafka importers/workers | Bryan Newbold | 2019-09-20 | 6 | -24/+67 |
| | | | | | | | | - decrease default changelog pipeline to 5.0sec - fix missing KafkaException harvester imports - more confluent-kafka tweaks - updates to kafka consumer configs - bump elastic updates consumergroup (again) | ||||
* | convert pipeline workers from pykafka to confluent-kafka | Bryan Newbold | 2019-09-20 | 3 | -125/+230 |
| | |||||
* | small kafka tweaks for robustness | Bryan Newbold | 2019-09-20 | 2 | -0/+5 |
| | |||||
* | convert importers to confluent-kafka library | Bryan Newbold | 2019-09-20 | 1 | -19/+71 |
| | |||||
* | bump max message size to ~20 MBytes | Bryan Newbold | 2019-09-20 | 2 | -0/+2 |
| | |||||
* | fixes to confluent-kafka harvesters | Bryan Newbold | 2019-09-20 | 3 | -20/+21 |
| | |||||
* | first draft harvesters using confluent-kafka | Bryan Newbold | 2019-09-20 | 3 | -48/+104 |
| | |||||
* | handle more external identifiers in python | Bryan Newbold | 2019-09-18 | 1 | -14/+97 |
| | | | | | This makes it possible to, eg, paste an arxiv identifier or SHA-1 hash in the general search box and do a quick lookup. | ||||
* | refactor all python source for client lib name | Bryan Newbold | 2019-09-05 | 21 | -121/+121 |
| | |||||
* | fix Importer editgroup_extra pass-through | Bryan Newbold | 2019-09-05 | 1 | -2/+1 |
| | |||||
* | comment clarifying container.ident in ES release transform | Bryan Newbold | 2019-09-03 | 1 | -0/+2 |
| | |||||
* | file rel: social -> academicsocial | Bryan Newbold | 2019-09-03 | 1 | -2/+2 |
| |
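
The duplication fix from 2019-12-22 and the within-editgroup collision protection from 2019-12-13 both come down to where the "already seen" check sits relative to the final `return True`. The following is a minimal sketch of that idea, not the actual fatcat importer code: the class `DedupingImporter`, its methods `try_update` and `flush_editgroup`, and the use of a `url` field as the collision key are all hypothetical names chosen for illustration.

```python
class DedupingImporter:
    """Hypothetical stand-in for an importer; not fatcat's EntityImporter."""

    def __init__(self):
        # Keys already queued in the current (not yet flushed) editgroup.
        self._pending_keys = set()
        self.counts = {"insert": 0, "skip-duplicate": 0}

    def try_update(self, record):
        """Return True if the record should be inserted, False to skip it."""
        key = record["url"]  # assumed collision key, for illustration only
        # The check must run before the final `return True`. Placed after
        # it, the check is unreachable and duplicates slip through.
        if key in self._pending_keys:
            self.counts["skip-duplicate"] += 1
            return False
        self._pending_keys.add(key)
        self.counts["insert"] += 1
        return True

    def flush_editgroup(self):
        """Pretend to submit the batch, then forget the keys seen so far."""
        self._pending_keys.clear()


importer = DedupingImporter()
urls = [
    "https://example.com/a.pdf",
    "https://example.com/a.pdf",  # repeated submission, as during testing
    "https://example.com/b.pdf",
]
results = [importer.try_update({"url": u}) for u in urls]
```

Because the set is cleared on flush, the protection in this sketch only applies within a single editgroup and a single importer instance, which mirrors the limitation the commit message itself describes.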