| Commit message (Collapse) | Author | Age | Files | Lines |
| |
This was used during initial bulk imports, but is no longer used and
could create serious metadata problems if used accidentally.
In retrospect, it also made metadata provenance less transparent, and
may have done more harm than good overall.
|
| |
While these changes are more delicate than simple lint changes, this
specific batch of edits and annotations was *relatively* simple, and
resulted in few code changes other than function signature additions.
|
| |
|
| |
|
| |
|
| |
|
|
|
|
| |
Behavior and motivation are described in the kafka json import comment.
|
|\
| |
| | |
Correct spelling mistakes
|
| | |
|
|\ \
| |/
|/|
| |
| | |
pubmed and arxiv harvest preparations
See merge request webgroup/fatcat!28
|
| | |
|
| |
| |
| |
| |
| | |
* regenerate map in continuous mode
* add tests
|
| | |
* add PubmedFTPWorker
* utils are currently stored alongside pubmed (e.g. ftpretr, xmlstream)
but may live elsewhere, as they are more generic
* add KafkaBs4XmlPusher
|
| | |
|
|/ |
| |
My current understanding is that consumer group names should be
one-to-one with topic names. I previously thought offsets were stored
under a {topic, group} key, but they seem to be mixed, and having too many
workers in the same group is bad. In particular, we don't want
cross-talk or load between QA and prod.
All these topics are caught up in prod, so deploying this change and
restarting workers should be safe.
This commit does not update the elasticsearch or entity updates workers.
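A minimal sketch of the naming idea above, assuming the group name is derived from the full topic name so that each topic (and therefore each environment) gets its own group. The helper `group_for_topic`, the worker label, and the example topic names are hypothetical and not taken from the fatcat code:

```python
def group_for_topic(topic: str, worker: str) -> str:
    """Derive a consumer group name that is one-to-one with a topic.

    Hypothetical helper: because the topic name is part of the group
    name, two topics (for example a QA and a prod topic) never share
    a consumer group.
    """
    if not topic or not worker:
        raise ValueError("topic and worker must be non-empty")
    return "{}-{}".format(worker, topic)


# Example topic names are placeholders, not the real fatcat topics.
prod_group = group_for_topic("fatcat-prod.example-topic", "import")
qa_group = group_for_topic("fatcat-qa.example-topic", "import")
```

With this scheme, QA and prod workers end up in different groups as long as their topic names differ.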
|
|
|
|
| |
Should have run tests before pushing!
|
| |
It is good to have exceptions tracked and stored even for commands run from
the command line. In particular, the importer runs as a kafka worker
and should be tracking exceptions.
|
|\
| | |
Pipfile.lock is broken.
* martin-datacite-import: (68 commits)
datacite: pass in doi into factored out method
datacite: reformat test cases and use jq . --sort-keys
datacite: factor out contributor handling
datacite: catch type mismatch in language detection
datacite: adjust tests for release_month
datacite: name extra.month, extra.release_month
datacite: mark additional files as stub
datacite: CCDC are entries, mostly
datacite: use more specific release_type, if possible
datacite: ignore certain names
datacite: over 3% records have the same title: stub
datacite: fill a few more release_type gaps
datacite: adding datacite-specific extra metadata
datacite: apply pylint suggestions
datacite: fix typos
datacite: set release_stage to published by default
datacite: month field should be top-level
datacite: include month in extra
datacite: indicate mismatched file in test
datacite: clean abstracts, use unknown value tokens
...
|
| | |
|
| |
| |
| |
| | |
Estimated time for a single call is on the order of 50 ms.
|
| | |
|
| | |
|
| | |
The current version successfully imported a random sample of 100,000 records
(0.5%) from datacite.
The --debug (write JSON to stdout) and --insert-log-file (log batch
before committing to db) flags are temporarily added to help with debugging.
Add a few unit tests.
Some edge cases:
a) Keys that exist but have no value require a slightly awkward idiom:
```
titles = attributes.get('titles', []) or []
```
b) There can be 0, 1, or more (first one wins) titles.
c) Date handling is probably not ideal. Datacite has a potentially
fine-grained list of dates.
The test case (tests/files/datacite_sample.jsonl) refers to
https://ssl.fao.org/glis/doi/10.18730/8DYM9, which has date (main
descriptor) 1986. The datacite record contains: 2017 (publicationYear,
probably the year of record creation with reference system), 1978-06-03
(collected, e.g. experimental sample), 1986 ("Accepted"). The online
version of the resource shows yet another date (2019-06-05 10:14:43 by
WIEWS update).
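Edge cases a) and b) could be sketched roughly as follows. This is an illustration only, not the importer's actual code: the helper name `first_title` is hypothetical, and it assumes each entry in `titles` is a dict with a `title` key. It also skips empty entries, which goes slightly beyond a strict "first one wins".

```
from typing import Any, Dict, Optional


def first_title(attributes: Dict[str, Any]) -> Optional[str]:
    """Return the first usable title, or None if there is none."""
    # `or []` covers the case where the key exists but its value is None;
    # the default passed to .get() only applies when the key is missing.
    titles = attributes.get('titles', []) or []
    for entry in titles:
        # Assumption: each entry looks like {"title": "..."}.
        title = (entry or {}).get('title')
        if isinstance(title, str) and title.strip():
            return title.strip()
    return None
```

A record with `"titles": null` and a record without the key at all are both treated as having no title.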
|
| | |
* contributors, title, date, publisher, container, license
Field and value analysis via https://github.com/miku/indigo.
|
|/ |
|
|
|
|
| |
Based on ingest-file-results importer
|
| |
Use --fatcat-api-url instead of (ambiguous) --host-url for commands that
aren't deployed/running via systemd.
TODO: update the other --host-url usage, and either roll out the change
consistently or support the old arg as an alias during cut-over.
Use argparse.ArgumentDefaultsHelpFormatter (thanks Martin!)
Add help messages for all sub-commands, both as documentation and as a
way to get argparse to print available commands in a more readable
format.
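A rough sketch of how these argparse pieces fit together, including the "old arg as an alias" option from the TODO. The sub-command name `example-import`, the `--batch-size` flag, and the default URL are placeholders for illustration, not the actual fatcat CLI:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # ArgumentDefaultsHelpFormatter appends "(default: ...)" to help text.
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    # Both spellings store into the same destination, so the old flag
    # keeps working during a cut-over. The default URL is a placeholder.
    parser.add_argument(
        '--fatcat-api-url', '--host-url',
        dest='fatcat_api_url',
        default='http://localhost:9411/v0',
        help='API endpoint to connect to (--host-url is a legacy alias)')
    subparsers = parser.add_subparsers(dest='command')
    # A help= string on each sub-command makes it show up, with a
    # description, in the top-level --help listing.
    sub = subparsers.add_parser(
        'example-import',
        help='hypothetical sub-command used only in this sketch',
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    sub.add_argument(
        '--batch-size', type=int, default=50,
        help='number of records per batch')
    return parser


args = build_parser().parse_args(
    ['--host-url', 'http://example.invalid/v0', 'example-import'])
```

Running the parser with `--help` then lists each sub-command next to its help string, with defaults shown for every option.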
|
|
|
|
|
| |
- allow overriding source filter whitelist (common case for CLI use)
- fix editgroup description env variable pass-through
|
|
|
|
| |
As opposed to sandcrawler-bot
|
| |
- decrease default changelog pipeline to 5.0sec
- fix missing KafkaException harvester imports
- more confluent-kafka tweaks
- updates to kafka consumer configs
- bump elastic updates consumergroup (again)
|