Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | tweaks to file ingest importer | Bryan Newbold | 2019-12-03 | 1 | -3/+4 |
| | | | | | Allow overriding source filter whitelist (common case for CLI use); fix editgroup description env variable pass-through | ||||
* | crossref is_update isn't what I thought | Bryan Newbold | 2019-12-03 | 1 | -6/+2 |
| | | | | | | | | I thought this would filter for metadata updates to an existing DOI, but actually "updates" are a type of DOI (eg, a retraction). TODO: handle 'updates' field. Should both do a lookup and set work_ident appropriately, and store in crossref-specific metadata. | ||||
* | re-order ingest want() for better stats | Bryan Newbold | 2019-11-15 | 1 | -7/+10 |
| | |||||
* | project -> ingest_request_source | Bryan Newbold | 2019-11-15 | 3 | -9/+9 |
| | |||||
* | fix release.pmcid typo | Bryan Newbold | 2019-11-15 | 1 | -2/+2 |
| | |||||
* | ingest importer fixes | Bryan Newbold | 2019-11-15 | 1 | -3/+4 |
| | |||||
* | more ingest importer comments and counts | Bryan Newbold | 2019-11-15 | 2 | -2/+29 |
| | |||||
* | crude support for 'sandcrawler' kafka topics | Bryan Newbold | 2019-11-15 | 1 | -2/+3 |
| | |||||
* | ingest file result importer | Bryan Newbold | 2019-11-15 | 2 | -2/+135 |
| | |||||
* | add ingest request feature to entity_updates worker | Bryan Newbold | 2019-11-15 | 1 | -4/+20 |
| | | | | | | | | | | | | | Initially was going to create a new worker to consume from the release update channel, but couldn't get the edit context ("is this a new release, or update to an existing") from that context. Currently there is a flag in source code to control whether we only do OA releases or all releases. Starting with OA only to start slow, but should probably default to all, and make this a config flag. Should probably also have a config flag to control this entire feature. Tested locally in dev. | ||||
* | add ingest request transform (and test) | Bryan Newbold | 2019-11-15 | 2 | -0/+67 |
| | |||||
* | crossref: accurate blank title counts | Bryan Newbold | 2019-11-05 | 1 | -0/+1 |
| | |||||
* | crossref: component type | Bryan Newbold | 2019-11-04 | 1 | -1/+3 |
| | |||||
* | crossref: count why skip happened | Bryan Newbold | 2019-11-04 | 1 | -1/+7 |
| | | | | | | Might skip based on release type (eg container, not a paper/release), or missing title, or other reasons. Over 7 million DOIs are getting skipped, curious why. | ||||
* | crossref: don't skip on short/null subtitle | Bryan Newbold | 2019-11-04 | 1 | -1/+1 |
| | | | | This was a bug. Should only set subtitle blank, not skip the import. | ||||
* | file cleanup tweaks to actually run | Bryan Newbold | 2019-10-08 | 2 | -5/+4 |
| | |||||
* | refactor duplicated b32_hex function in importers | Bryan Newbold | 2019-10-08 | 3 | -21/+11 |
| | |||||
* | dict wrapper for entity_from_json() | Bryan Newbold | 2019-10-08 | 2 | -3/+7 |
| | |||||
* | new cleanup python tool/framework | Bryan Newbold | 2019-10-08 | 4 | -0/+241 |
| | |||||
* | review/fix all confluent-kafka produce code | Bryan Newbold | 2019-09-20 | 6 | -27/+75 |
| | |||||
* | small fixes to confluent-kafka importers/workers | Bryan Newbold | 2019-09-20 | 6 | -24/+67 |
| | | | | | | | | Decrease default changelog pipeline to 5.0 sec; fix missing KafkaException harvester imports; more confluent-kafka tweaks; updates to kafka consumer configs; bump elastic updates consumergroup (again) | ||||
* | convert pipeline workers from pykafka to confluent-kafka | Bryan Newbold | 2019-09-20 | 3 | -125/+230 |
| | |||||
* | small kafka tweaks for robustness | Bryan Newbold | 2019-09-20 | 2 | -0/+5 |
| | |||||
* | convert importers to confluent-kafka library | Bryan Newbold | 2019-09-20 | 1 | -19/+71 |
| | |||||
* | bump max message size to ~20 MBytes | Bryan Newbold | 2019-09-20 | 2 | -0/+2 |
| | |||||
* | fixes to confluent-kafka harvesters | Bryan Newbold | 2019-09-20 | 3 | -20/+21 |
| | |||||
* | first draft harvesters using confluent-kafka | Bryan Newbold | 2019-09-20 | 3 | -48/+104 |
| | |||||
* | handle more external identifiers in python | Bryan Newbold | 2019-09-18 | 1 | -14/+97 |
| | | | | | This makes it possible to, eg, paste an arxiv identifier or SHA-1 hash in the general search box and do a quick lookup. | ||||
* | refactor all python source for client lib name | Bryan Newbold | 2019-09-05 | 21 | -121/+121 |
| | |||||
* | fix Importer editgroup_extra pass-through | Bryan Newbold | 2019-09-05 | 1 | -2/+1 |
| | |||||
* | comment clarifying container.ident in ES release transform | Bryan Newbold | 2019-09-03 | 1 | -0/+2 |
| | |||||
* | file rel: social -> academicsocial | Bryan Newbold | 2019-09-03 | 1 | -2/+2 |
| | |||||
* | fix previous fix (need tests) | Bryan Newbold | 2019-09-03 | 1 | -2/+2 |
| | |||||
* | fix typo bug in container ES transform | Bryan Newbold | 2019-09-03 | 1 | -2/+2 |
| | |||||
* | last chocula import behavior tweaks | Bryan Newbold | 2019-09-03 | 1 | -3/+21 |
| | |||||
* | more careful chocula import counts; don't re-update empty URLs | Bryan Newbold | 2019-09-03 | 1 | -2/+6 |
| | |||||
* | better importer 'total' counting | Bryan Newbold | 2019-09-03 | 1 | -4/+2 |
| | |||||
* | chocula importer: include DOAJ updates | Bryan Newbold | 2019-09-03 | 1 | -2/+2 |
| | |||||
* | use EZB and szczepanski as OA signals (ES) | Bryan Newbold | 2019-09-03 | 1 | -0/+12 |
| | |||||
* | improvements to chocula importer | Bryan Newbold | 2019-09-03 | 1 | -1/+7 |
| | |||||
* | implement ChoculaImporter | Bryan Newbold | 2019-09-03 | 2 | -0/+137 |
| | |||||
* | improvements to wayback_static importer | Bryan Newbold | 2019-08-22 | 1 | -6/+29 |
| | |||||
* | start new ES container worker kafka group | Bryan Newbold | 2019-07-31 | 1 | -0/+2 |
| | | | | | | | | The previous group seems to have gotten corrupted; my hypothesis is that this is due to pykafka being somewhat flaky, and am planning to move to librdkafka anyways. Re-indexing all the containers is pretty small/easy, so starting a new consumer group works fine in this case; release indexer would be a bigger problem. | ||||
* | crossref: allow 'name' fallback (for groups, etc) | Bryan Newbold | 2019-06-24 | 1 | -1/+1 |
| | |||||
* | add inflight edit protection to matched importer | Bryan Newbold | 2019-06-24 | 1 | -1/+8 |
| | |||||
* | fix typo; do arxiv-specific match import hack | Bryan Newbold | 2019-06-24 | 1 | -3/+14 |
| | |||||
* | fix syntax in existing.url cleanup | Bryan Newbold | 2019-06-24 | 1 | -1/+1 |
| | |||||
* | fix existing updater | Bryan Newbold | 2019-06-24 | 1 | -2/+3 |
| | |||||
* | add minimal file URL cleanups to matched importer | Bryan Newbold | 2019-06-24 | 1 | -0/+8 |
| | |||||
* | matched importer: urls, not url | Bryan Newbold | 2019-06-24 | 1 | -1/+1 |
| | | | | | | This matches the docs in the header. Previous matched imports were using 'cdx' objects with no 'dt' key, but this makes more sense. As far as I know the old 'url' code path was never actually used (or tested, derp). |