Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | update fuzzy helper to pass 'reason' through to import code | Bryan Newbold | 2020-12-17 | 1 | -3/+3 |
| | | | | | The motivation for this change is to enable passing the 'reason' through to edit extra metadata, in cases where we merge or cluster releases. | ||||
* | add fuzzy matching helper to importer base class | Bryan Newbold | 2020-12-16 | 1 | -2/+62 |
| | | | | Using fuzzycat. Add basic test coverage. | ||||
* | more python normalizers, and move from importer common | Bryan Newbold | 2020-11-19 | 1 | -154/+4 |
| | | | | | | | | | | | | Moved several normalizer helpers out of fatcat_tools.importers.common to fatcat_tools.normal. Copied language name and country name parser helpers from chocula repository (built on existing pycountry helper library). Have not gone through and refactored other importers to point to these helpers yet; that should be a separate PR when this branch is merged. Current changes are backwards compatible via re-imports. | ||||
* | remove spurious print statement | Bryan Newbold | 2020-09-03 | 1 | -1/+0 |
| | |||||
* | generic file entity clean-ups as part of file_meta importer | Bryan Newbold | 2020-09-02 | 1 | -0/+47 |
| | |||||
* | simple lint (flake8) fixes over python codebase | Bryan Newbold | 2020-07-23 | 1 | -1/+0 |
| | | | | | | These should not have any behavior changes, though a number of exception catches are now more general, and there may be long-tail exceptions getting thrown in these statements. | ||||
* | lint (flake8) tool python files | Bryan Newbold | 2020-07-01 | 1 | -13/+13 |
| | |||||
* | importers: clarify handling of ApiException | Bryan Newbold | 2020-05-22 | 1 | -4/+8 |
| | | | | | | | | One of these (in ingest importer pipeline) is an actual bug, the others are just changing the syntax to be more explicit/conservative. The ingest importer bug seems to have resulted in some bad file match imports; scale of impact is unknown. | ||||
* | consistently use raw string prefix for regex | Bryan Newbold | 2020-04-17 | 1 | -1/+1 |
| | |||||
* | Merge pull request #53 from EdwardBetts/spelling | bnewbold | 2020-03-27 | 1 | -1/+1 |
|\ | | | | | Correct spelling mistakes | ||||
| * | Correct spelling mistakes | Edward Betts | 2020-03-27 | 1 | -1/+1 |
| | | |||||
* | | Merge branch 'martin-kafka-bs4-import' into 'master' | Martin Czygan | 2020-03-10 | 1 | -0/+65 |
|\ \ | |/ |/| | | | | | pubmed and arxiv harvest preparations. See merge request webgroup/fatcat!28 | ||||
| * | common: use smaller batch size since XML parsing may be slow | Martin Czygan | 2020-03-10 | 1 | -1/+1 |
| | | | | | | | | | | | | | | | | Address kafka tradeoff between long and short time-outs. Shorter time-outs would facilitate "consumer group re-balances and other consumer group state changes [...] in a reasonable human time-frame." | ||||
| * | pubmed ftp harvest and KafkaBs4XmlPusher | Martin Czygan | 2020-02-19 | 1 | -0/+65 |
| | | | | | | | | | | | | | | * add PubmedFTPWorker * utils are currently stored alongside pubmed (e.g. ftpretr, xmlstream) but may live elsewhere, as they are more generic * add KafkaBs4XmlPusher | ||||
* | | add some more domain/rel URL mappings | Bryan Newbold | 2020-02-22 | 1 | -0/+9 |
|/ | |||||
* | fix KafkaError worker reporting for partition errors | Bryan Newbold | 2020-01-29 | 1 | -1/+1 |
| | |||||
* | importers: control update behavior with more-standard flag | Bryan Newbold | 2020-01-06 | 1 | -0/+1 |
| | |||||
* | write diagnostic messages to stderr | Martin Czygan | 2019-12-16 | 1 | -2/+2 |
| | | | | | During debugging, it can be helpful to keep stdout (e.g. processing results) and diagnostic messages separate. | ||||
* | Merge branch 'martin-importers-common-doc-fix' into 'master' | Martin Czygan | 2019-12-14 | 1 | -13/+10 |
|\ | | | | | | | | | Update EntityImporter docstring. See merge request webgroup/fatcat!9 | ||||
| * | complete parse_record docstring | Martin Czygan | 2019-12-14 | 1 | -0/+6 |
| | | |||||
| * | Update EntityImporter docstring. | Martin Czygan | 2019-12-13 | 1 | -13/+4 |
| | | | | | | | | I believe the required method is `parse_record`, not `parse`. | ||||
* | | revert accidentally committed test timing | Bryan Newbold | 2019-12-13 | 1 | -2/+2 |
| | | | | | | | | Also fix a spurious typo. | ||||
* | | ensure importer description arg isn't clobbered | Bryan Newbold | 2019-12-12 | 1 | -1/+3 |
| | | |||||
* | | flush importer editgroups every few minutes | Bryan Newbold | 2019-12-12 | 1 | -5/+20 |
| | | |||||
* | | EntityImporter: submit (not accept) mode | Bryan Newbold | 2019-12-12 | 1 | -2/+14 |
|/ | | | | | For use with bots that don't have admin privileges, or where human follow-up review is desired. | ||||
* | crude support for 'sandcrawler' kafka topics | Bryan Newbold | 2019-11-15 | 1 | -2/+3 |
| | |||||
* | refactor duplicated b32_hex function in importers | Bryan Newbold | 2019-10-08 | 1 | -0/+9 |
| | |||||
* | review/fix all confluent-kafka produce code | Bryan Newbold | 2019-09-20 | 1 | -1/+0 |
| | |||||
* | small fixes to confluent-kafka importers/workers | Bryan Newbold | 2019-09-20 | 1 | -10/+24 |
| | | | | | | | | - decrease default changelog pipeline to 5.0sec - fix missing KafkaException harvester imports - more confluent-kafka tweaks - updates to kafka consumer configs - bump elastic updates consumergroup (again) | ||||
* | small kafka tweaks for robustness | Bryan Newbold | 2019-09-20 | 1 | -0/+3 |
| | |||||
* | convert importers to confluent-kafka library | Bryan Newbold | 2019-09-20 | 1 | -19/+71 |
| | |||||
* | refactor all python source for client lib name | Bryan Newbold | 2019-09-05 | 1 | -3/+3 |
| | |||||
* | fix Importer editgroup_extra pass-through | Bryan Newbold | 2019-09-05 | 1 | -2/+1 |
| | |||||
* | file rel: social -> academicsocial | Bryan Newbold | 2019-09-03 | 1 | -2/+2 |
| | |||||
* | better importer 'total' counting | Bryan Newbold | 2019-09-03 | 1 | -4/+2 |
| | |||||
* | make importer extid lookups faster by hiding | Bryan Newbold | 2019-05-29 | 1 | -2/+2 |
| | |||||
* | is_cjk() handles kanji better | Bryan Newbold | 2019-05-29 | 1 | -4/+6 |
| | |||||
* | faster LargeFile XML importer for PubMed | Bryan Newbold | 2019-05-29 | 1 | -0/+50 |
| | |||||
* | more MARC languages, and less verbose reporting | Bryan Newbold | 2019-05-24 | 1 | -3/+14 |
| | |||||
* | missing MARC/ISO languages | Bryan Newbold | 2019-05-22 | 1 | -0/+2 |
| | |||||
* | Gaelic! | Bryan Newbold | 2019-05-22 | 1 | -0/+3 |
| | |||||
* | creative importer for bulk JSTOR imports | Bryan Newbold | 2019-05-22 | 1 | -0/+22 |
| | |||||
* | bs4 XML parse cleanup | Bryan Newbold | 2019-05-22 | 1 | -0/+2 |
| | |||||
* | JALC bulk file importer | Bryan Newbold | 2019-05-21 | 1 | -0/+20 |
| | |||||
* | updates to pubmed importer | Bryan Newbold | 2019-05-21 | 1 | -1/+20 |
| | |||||
* | tweaks to new imports/tests | Bryan Newbold | 2019-05-21 | 1 | -5/+78 |
| | |||||
* | initial flesh out of JALC parser | Bryan Newbold | 2019-05-21 | 1 | -0/+36 |
| | |||||
* | python impl | Bryan Newbold | 2019-05-14 | 1 | -3/+3 |
| | |||||
* | add limits to match importers | Bryan Newbold | 2019-04-23 | 1 | -0/+3 |
| | |||||
* | archive.org isn't really a repository | Bryan Newbold | 2019-04-22 | 1 | -1/+3 |
| |