fatcat - [no description]

	Commit message	Author	Age	Files	Lines
*	fix trivial one-character typo in fatcat_import.py	Bryan Newbold	2020-01-17	1	-1/+1
\| \| \| \|	Should have run tests before pushing!
*	actually control pubmed updates with a flag	Bryan Newbold	2020-01-17	1	-0/+4
\|
*	add missing sentry/raven tags	Bryan Newbold	2020-01-10	1	-0/+6
\| \| \| \| \| \|	Good to have exceptions tracked and stored even for commands run from the command line. But in particular the importer runs as a kafka worker and should be tracking exceptions.
*	Merge branch 'martin-datacite-import'	Martin Czygan	2020-01-08	1	-0/+43
\|\	Pipfile.lock is broken.
\| \|	* martin-datacite-import: (68 commits)
\| \|	  datacite: pass in doi into factored out method
\| \|	  datacite: reformat test cases and use jq . --sort-keys
\| \|	  datacite: factor out contributor handling
\| \|	  datacite: catch type mismatch in language detection
\| \|	  datacite: adjust tests for release_month
\| \|	  datacite: name extra.month, extra.release_month
\| \|	  datacite: mark additional files as stub
\| \|	  datacite: CCDC are entries, mostly
\| \|	  datacite: use more specific release_type, if possible
\| \|	  datacite: ignore certain names
\| \|	  datacite: over 3% records have the same title: stub
\| \|	  datacite: fill a few more release_type gaps
\| \|	  datacite: adding datacite-specific extra metadata
\| \|	  datacite: apply pylint suggestions
\| \|	  datacite: fix typos
\| \|	  datacite: set release_stage to published by default
\| \|	  datacite: month field should be top-level
\| \|	  datacite: include month in extra
\| \|	  datacite: indicate mismatched file in test
\| \|	  datacite: clean abstracts, use unknown value tokens
\| \|	  ...
\| *	datacite: fix typos	Martin Czygan	2020-01-07	1	-1/+1
\| \|
\| *	datacite: remove --lang-detect flag	Martin Czygan	2020-01-03	1	-4/+0
\| \| \| \| \| \| \| \|	Estimated time for a single call is in the order of 50ms.
\| *	datacite: use specific auth var	Martin Czygan	2019-12-28	1	-1/+1
\| \|
\| *	datacite: add missing --extid-map-file flag	Martin Czygan	2019-12-28	1	-0/+4
\| \|
\| *	improve datacite field mapping and import	Martin Czygan	2019-12-28	1	-1/+14
\| \|	Current version successfully imported a random sample of 100000 records (0.5%) from datacite.
\| \|	The --debug (write JSON to stdout) and --insert-log-file (log batch before committing to db) flags are temporarily added to help debugging.
\| \|	Add a few unit tests.
\| \|	Some edge cases:
\| \|	a) Existing keys without a value require a slightly awkward idiom: ``` titles = attributes.get('titles', []) or [] ```
\| \|	b) There can be 0, 1, or more titles (the first one wins).
\| \|	c) Date handling is probably not ideal. Datacite has a potentially fine-grained list of dates. The test case (tests/files/datacite_sample.jsonl) refers to https://ssl.fao.org/glis/doi/10.18730/8DYM9, which has date (main descriptor) 1986. The datacite record contains: 2017 (publicationYear, probably the year of record creation with reference system), 1978-06-03 (collected, e.g. experimental sample), 1986 ("Accepted"). The online version of the resource lists yet another date (2019-06-05 10:14:43 by WIEWS update).
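Edge cases a) and b) from the commit message above can be illustrated with a short sketch. This is not the importer's actual code: the record layout (an `attributes` dict whose `titles` value is a list of dicts with a `title` key) and the `first_title` helper are assumptions made for illustration only.

```python
def first_title(attributes):
    """Return the first usable title string, or None.

    Hypothetical helper; the real importer may structure this differently.
    """
    # dict.get() only falls back to its default when the key is absent.
    # A key that is present but set to None still yields None, so the
    # trailing `or []` is needed to always end up with a list.
    titles = attributes.get('titles', []) or []
    for entry in titles:
        # Assumption: each entry is a dict carrying a 'title' key.
        title = (entry or {}).get('title')
        if title:
            return title  # edge case b): the first title wins
    return None


if __name__ == '__main__':
    print(first_title({}))                                    # key missing
    print(first_title({'titles': None}))                      # key present, no value
    print(first_title({'titles': [{'title': 'A'}, {'title': 'B'}]}))
```

Without the `or []`, the second call would raise a `TypeError` when the loop tries to iterate over `None`.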
\| *	datacite: importer skeleton	Martin Czygan	2019-12-28	1	-0/+30
\| \|	* contributors, title, date, publisher, container, license
\| \|	Field and value analysis via https://github.com/miku/indigo.
* \|	importers: control update behavior with more-standard flag	Bryan Newbold	2020-01-06	1	-1/+5
\|/
*	savepapernow result importer	Bryan Newbold	2019-12-12	1	-0/+24
\| \| \| \|	Based on ingest-file-results importer
*	improve argparse usage	Bryan Newbold	2019-12-11	1	-18/+30
\|	Use --fatcat-api-url instead of (ambiguous) --host-url for commands that aren't deployed/running via systemd.
\|	TODO: update the other --host-url usage, and either roll out the change consistently or support the old arg as an alias during cut-over.
\|	Use argparse.ArgumentDefaultsHelpFormatter (thanks Martin!)
\|	Add help messages for all sub-commands, both as documentation and as a way to get argparse to print available commands in a more readable format.
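The argparse changes described in the commit message above might look roughly like the following. This is a minimal sketch, not the contents of fatcat_import.py: the `example-import` sub-command, the default URL, and the help strings are placeholders, and keeping `--host-url` as an alias is only one of the two options the TODO mentions.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        # Appends "(default: ...)" to every option that has a help string.
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    # Both spellings map to the same destination; argparse derives the
    # attribute name (fatcat_api_url) from the first long option string.
    parser.add_argument('--fatcat-api-url', '--host-url',
        default="http://localhost:9411/v0",  # placeholder default (assumption)
        help="fatcat API endpoint to connect to")
    subparsers = parser.add_subparsers(dest='command')

    # A help= string on each sub-command makes the top-level --help output
    # list the available commands with a one-line description.
    sub = subparsers.add_parser('example-import',
        help="hypothetical sub-command, for illustration only",
        # Sub-parsers do not inherit the parent's formatter class.
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    sub.add_argument('--batch-size', type=int, default=50,
        help="number of entities per editgroup batch")
    return parser


if __name__ == '__main__':
    args = build_parser().parse_args(
        ['--host-url', 'http://example.org/v0', 'example-import'])
    print(args.fatcat_api_url, args.command, args.batch_size)
```

Accepting the old flag as an alias keeps existing invocations working during the cut-over, at the cost of leaving the ambiguous name visible in the help output.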
*	tweaks to file ingest importer	Bryan Newbold	2019-12-03	1	-0/+6
\|	- allow overriding source filter whitelist (common case for CLI use)
\|	- fix editgroup description env variable pass-through
*	have ingest-file-results importer operate as crawl-bot	Bryan Newbold	2019-11-15	1	-1/+1
\| \| \| \|	As opposed to sandcrawler-bot
*	better ingest-file-results import name	Bryan Newbold	2019-11-15	1	-1/+1
\|
*	ingest file result importer	Bryan Newbold	2019-11-15	1	-0/+34
\|
*	small fixes to confluent-kafka importers/workers	Bryan Newbold	2019-09-20	1	-1/+1
\|	- decrease default changelog pipeline to 5.0sec
\|	- fix missing KafkaException harvester imports
\|	- more confluent-kafka tweaks
\|	- updates to kafka consumer configs
\|	- bump elastic updates consumergroup (again)
*	convert importers to confluent-kafka library	Bryan Newbold	2019-09-20	1	-2/+3
\|
*	start chocula importer	Bryan Newbold	2019-09-03	1	-0/+14
\|
*	support extids in matched importer	Bryan Newbold	2019-06-20	1	-0/+4
\|
*	faster LargeFile XML importer for PubMed	Bryan Newbold	2019-05-29	1	-1/+1
\|
*	make pubmed ref lookups configurable	Bryan Newbold	2019-05-22	1	-1/+8
\|
*	creative importer for bulk JSTOR imports	Bryan Newbold	2019-05-22	1	-0/+18
\|
*	pubmed importer command and tweaks	Bryan Newbold	2019-05-22	1	-0/+25
\|
*	arxiv importer robustification and CLI impl	Bryan Newbold	2019-05-21	1	-0/+21
\|
*	JALC bulk file importer	Bryan Newbold	2019-05-21	1	-0/+21
\|
*	fix default mimetype (impacted pre-1923 files)	Bryan Newbold	2019-05-15	1	-1/+5
\|
*	editgroup description override	Bryan Newbold	2019-04-22	1	-1/+11
\|
*	minor arabesque tweaks	Bryan Newbold	2019-04-18	1	-12/+22
\|
*	arabesque importer using crawl-bot creds	Bryan Newbold	2019-04-18	1	-1/+1
\|
*	arabesque import tweaks	Bryan Newbold	2019-04-18	1	-0/+4
\|
*	early version of arabesque importer	Bryan Newbold	2019-04-12	1	-0/+28
\|
*	importer for CDL/DASH dat pilot dweb datasets	Bryan Newbold	2019-03-19	1	-1/+29
\|
*	new importer: wayback_static	Bryan Newbold	2019-03-19	1	-0/+48
\|
*	reduce default import batch size to 50	Bryan Newbold	2019-01-29	1	-1/+1
\|
*	batch size as a general import param	Bryan Newbold	2019-01-28	1	-13/+4
\|
*	add missing bezerk-mode flag to GROBID import	Bryan Newbold	2019-01-28	1	-3/+8
\|
*	fix typo in crossref importer	Bryan Newbold	2019-01-28	1	-1/+1
\|
*	update journal meta import/transform	Bryan Newbold	2019-01-25	1	-3/+3
\|
*	more import script fixes	Bryan Newbold	2019-01-23	1	-1/+4
\|
*	update importer script	Bryan Newbold	2019-01-23	1	-33/+24
\|
*	pubmed+datacite tokens; no journal,grobid,matched tokens	Bryan Newbold	2019-01-22	1	-2/+2
\|
*	issn => journal_metadata in several places	Bryan Newbold	2019-01-17	1	-9/+9
\|
*	start refactoring API object passing	Bryan Newbold	2019-01-08	1	-13/+36
\|
*	crossref importer checks for existing DOIs	Bryan Newbold	2018-11-21	1	-3/+7
\|
*	correct kafka topic names	Bryan Newbold	2018-11-20	1	-1/+1
\|
*	start supporting kafka importers	Bryan Newbold	2018-11-19	1	-3/+17
\| \| \| \|	A nice feature would be some/any log output as to progress.
*	bunch of pylint cleanup	Bryan Newbold	2018-11-15	1	-1/+1
\|
*	large refactor of python names/paths	Bryan Newbold	2018-11-15	1	-39/+37
\|	- Add __init__.py files for fatcat_tools submodules, and use them in imports
\|	- Add a bunch of comments to files.
\|	- Rename a number of classes and functions to be less verbose