path: root/python/fatcat_import.py
Commit message  Author  Age  Files  Lines
* fix trivial one-character typo in fatcat_import.py  Bryan Newbold  2020-01-17  1  -1/+1
|     Should have run tests before pushing!
* actually control pubmed updates with a flag  Bryan Newbold  2020-01-17  1  -0/+4
|
* add missing sentry/raven tags  Bryan Newbold  2020-01-10  1  -0/+6
|     Good to have exceptions tracked and stored even for commands run from the command line. But in particular the importer runs as a kafka worker and should be tracking exceptions.
* Merge branch 'martin-datacite-import'  Martin Czygan  2020-01-08  1  -0/+43
|\      Pipfile.lock is broken.
| |
| |     * martin-datacite-import: (68 commits)
| |       datacite: pass in doi into factored out method
| |       datacite: reformat test cases and use jq . --sort-keys
| |       datacite: factor out contributor handling
| |       datacite: catch type mismatch in language detection
| |       datacite: adjust tests for release_month
| |       datacite: name extra.month, extra.release_month
| |       datacite: mark additional files as stub
| |       datacite: CCDC are entries, mostly
| |       datacite: use more specific release_type, if possible
| |       datacite: ignore certain names
| |       datacite: over 3% records have the same title: stub
| |       datacite: fill a few more release_type gaps
| |       datacite: adding datacite-specific extra metadata
| |       datacite: apply pylint suggestions
| |       datacite: fix typos
| |       datacite: set release_stage to published by default
| |       datacite: month field should be top-level
| |       datacite: include month in extra
| |       datacite: indicate mismatched file in test
| |       datacite: clean abstracts, use unknown value tokens
| |       ...
| * datacite: fix typos  Martin Czygan  2020-01-07  1  -1/+1
| |
| * datacite: remove --lang-detect flag  Martin Czygan  2020-01-03  1  -4/+0
| |     Estimated time for a single call is on the order of 50ms.
| * datacite: use specific auth var  Martin Czygan  2019-12-28  1  -1/+1
| |
| * datacite: add missing --extid-map-file flag  Martin Czygan  2019-12-28  1  -0/+4
| |
| * improve datacite field mapping and import  Martin Czygan  2019-12-28  1  -1/+14
| |     Current version successfully imported a random sample of 100000 records (0.5%) from datacite.
| |
| |     The --debug (write JSON to stdout) and --insert-log-file (log batch before committing to db) flags are temporarily added to help debugging.
| |
| |     Add a few unit tests.
| |
| |     Some edge cases:
| |
| |     a) Existing keys without a value require a slightly awkward idiom:
| |
| |     ```
| |     titles = attributes.get('titles', []) or []
| |     ```
| |
| |     b) There can be 0, 1, or more titles (first one wins).
| |
| |     c) Date handling is probably not ideal. Datacite has a potentially fine-grained list of dates.
| |
| |     The test case (tests/files/datacite_sample.jsonl) refers to https://ssl.fao.org/glis/doi/10.18730/8DYM9, which has date (main descriptor) 1986. The datacite record contains: 2017 (publicationYear, probably the year of record creation with reference system), 1978-06-03 (collected, e.g. experimental sample), 1986 ("Accepted"). The online version of the resource knows even one more date (2019-06-05 10:14:43 by WIEWS update).
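
Edge cases a) and b) can be illustrated with a minimal sketch. The helper name `first_title` and the assumption that each entry in `titles` is a dict with a `title` key are illustrative only and are not taken from the importer itself; skipping empty entries before picking the first title is likewise an assumption.

```python
def first_title(attributes):
    """Return the first usable title of a record, or None (hypothetical helper)."""
    # The key may be missing, or present with an explicit null value.
    # attributes.get('titles', []) alone returns None in the second case,
    # hence the trailing "or []".
    titles = attributes.get('titles', []) or []
    for entry in titles:
        # Assumption: each entry is a dict like {"title": "..."}.
        title = (entry or {}).get('title')
        if title:
            # There can be 0, 1, or more titles: the first one wins.
            return title.strip()
    return None


if __name__ == '__main__':
    print(first_title({'titles': None}))
    print(first_title({'titles': [{'title': 'First'}, {'title': 'Second'}]}))
```

Without the `or []`, a record carrying `"titles": null` would make the loop fail with a `TypeError`, which is the case the commit message calls awkward.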
| * datacite: importer skeleton  Martin Czygan  2019-12-28  1  -0/+30
| |     * contributors, title, date, publisher, container, license
| |
| |     Field and value analysis via https://github.com/miku/indigo.
* | importers: control update behavior with more-standard flag  Bryan Newbold  2020-01-06  1  -1/+5
|/
* savepapernow result importer  Bryan Newbold  2019-12-12  1  -0/+24
|     Based on ingest-file-results importer
* improve argparse usage  Bryan Newbold  2019-12-11  1  -18/+30
|     Use --fatcat-api-url instead of (ambiguous) --host-url for commands that aren't deployed/running via systemd.
|
|     TODO: update the other --host-url usage, and either roll out the change consistently or support the old arg as an alias during cut-over.
|
|     Use argparse.ArgumentDefaultsHelpFormatter (thanks Martin!)
|
|     Add help messages for all sub-commands, both as documentation and as a way to get argparse to print available commands in a more readable format.
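
The argparse changes described in this commit message might look roughly like the sketch below. It is not the actual contents of fatcat_import.py: the default URL, the sub-command name `example-import`, and its `json_file` argument are placeholders, and accepting `--host-url` as an alias is only one of the two options the TODO mentions.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        # Appends "(default: ...)" to the help text of every documented option.
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    # Both spellings store into the same destination, so the old
    # --host-url keeps working during a cut-over (assumed approach).
    parser.add_argument('--fatcat-api-url', '--host-url',
        dest='fatcat_api_url',
        default='http://localhost:9411/v0',  # placeholder default
        help='fatcat API endpoint to connect to')
    parser.add_argument('--batch-size',
        type=int,
        default=50,
        help='size of batch to send')
    subparsers = parser.add_subparsers(dest='command')

    # A help= string on each sub-command makes the top-level --help
    # list the available commands with a one-line description.
    sub = subparsers.add_parser('example-import',
        help='hypothetical importer reading JSON lines from a file',
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    sub.add_argument('json_file',
        help='input file to import from')
    return parser


if __name__ == '__main__':
    build_parser().print_help()
```

Running the sketch prints the top-level help, where each option shows its default value and the sub-command appears with its description.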
* tweaks to file ingest importer  Bryan Newbold  2019-12-03  1  -0/+6
|     - allow overriding source filter whitelist (common case for CLI use)
|     - fix editgroup description env variable pass-through
* have ingest-file-results importer operate as crawl-bot  Bryan Newbold  2019-11-15  1  -1/+1
|     As opposed to sandcrawler-bot
* better ingest-file-results import name  Bryan Newbold  2019-11-15  1  -1/+1
|
* ingest file result importer  Bryan Newbold  2019-11-15  1  -0/+34
|
* small fixes to confluent-kafka importers/workers  Bryan Newbold  2019-09-20  1  -1/+1
|     - decrease default changelog pipeline to 5.0sec
|     - fix missing KafkaException harvester imports
|     - more confluent-kafka tweaks
|     - updates to kafka consumer configs
|     - bump elastic updates consumergroup (again)
* convert importers to confluent-kafka library  Bryan Newbold  2019-09-20  1  -2/+3
|
* start chocula importer  Bryan Newbold  2019-09-03  1  -0/+14
|
* support extids in matched importer  Bryan Newbold  2019-06-20  1  -0/+4
|
* faster LargeFile XML importer for PubMed  Bryan Newbold  2019-05-29  1  -1/+1
|
* make pubmed ref lookups configurable  Bryan Newbold  2019-05-22  1  -1/+8
|
* creative importer for bulk JSTOR imports  Bryan Newbold  2019-05-22  1  -0/+18
|
* pubmed importer command and tweaks  Bryan Newbold  2019-05-22  1  -0/+25
|
* arxiv importer robustification and CLI impl  Bryan Newbold  2019-05-21  1  -0/+21
|
* JALC bulk file importer  Bryan Newbold  2019-05-21  1  -0/+21
|
* fix default mimetype (impacted pre-1923 files)  Bryan Newbold  2019-05-15  1  -1/+5
|
* editgroup description override  Bryan Newbold  2019-04-22  1  -1/+11
|
* minor arabesque tweaks  Bryan Newbold  2019-04-18  1  -12/+22
|
* arabesque importer using crawl-bot creds  Bryan Newbold  2019-04-18  1  -1/+1
|
* arabesque import tweaks  Bryan Newbold  2019-04-18  1  -0/+4
|
* early version of arabesque importer  Bryan Newbold  2019-04-12  1  -0/+28
|
* importer for CDL/DASH dat pilot dweb datasets  Bryan Newbold  2019-03-19  1  -1/+29
|
* new importer: wayback_static  Bryan Newbold  2019-03-19  1  -0/+48
|
* reduce default import batch size to 50  Bryan Newbold  2019-01-29  1  -1/+1
|
* batch size as a general import param  Bryan Newbold  2019-01-28  1  -13/+4
|
* add missing bezerk-mode flag to GROBID import  Bryan Newbold  2019-01-28  1  -3/+8
|
* fix typo in crossref importer  Bryan Newbold  2019-01-28  1  -1/+1
|
* update journal meta import/transform  Bryan Newbold  2019-01-25  1  -3/+3
|
* more import script fixes  Bryan Newbold  2019-01-23  1  -1/+4
|
* update importer script  Bryan Newbold  2019-01-23  1  -33/+24
|
* pubmed+datacite tokens; no journal,grobid,matched tokens  Bryan Newbold  2019-01-22  1  -2/+2
|
* issn => journal_metadata in several places  Bryan Newbold  2019-01-17  1  -9/+9
|
* start refactoring API object passing  Bryan Newbold  2019-01-08  1  -13/+36
|
* crossref importer checks for existing DOIs  Bryan Newbold  2018-11-21  1  -3/+7
|
* correct kafka topic names  Bryan Newbold  2018-11-20  1  -1/+1
|
* start supporting kafka importers  Bryan Newbold  2018-11-19  1  -3/+17
|     A nice feature would be some/any log output as to progress.
* bunch of pylint cleanup  Bryan Newbold  2018-11-15  1  -1/+1
|
* large refactor of python names/paths  Bryan Newbold  2018-11-15  1  -39/+37
|     - Add __init__.py files for fatcat_tools submodules, and use them in imports
|     - Add a bunch of comments to files.
|     - rename a number of classes and functions to be less verbose