fatcat - [no description]

	Commit message	Author	Age	Files	Lines
...
* \|	pytest: explicitly indicate all in-scope test files	Bryan Newbold	2020-01-03	1	-3/+1
\|/ \| \| \| \| \| \| \| \| \| \|	The purpose of this change is to fix test errors that occur when pytest tries to recursively update assertion statements in all dependent packages. The reason pytest does this is to add pretty printing, which is nice, but probably shouldn't be done in all dependency libraries. This fixes test problems with both CSL (citeproc_styles) and dateparser (when actually imported in code, which currently on master does not happen).
*	orcid: skip non-person ORCID records	Bryan Newbold	2019-12-26	1	-0/+4
\|
*	datacite: fix harvest test	Martin Czygan	2019-12-27	1	-1/+1
\| \| \| \| \| \|	Produced messages should match: jq '.data\|length' tests/files/datacite_api.json
*	datacite: add simple test and fixture for datacite api interaction	Martin Czygan	2019-12-27	2	-0/+46
\|
*	datacite: extend range search query	Martin Czygan	2019-12-27	1	-1/+1
\| \| \| \| \|	The bracket syntax is inclusive. See also: https://www.elastic.co/guide/en/elasticsearch/reference/7.5/query-dsl-query-string-query.html#_ranges
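A minimal sketch of what an inclusive range query for one day could look like. In Elasticsearch query-string syntax, square brackets include both endpoints and curly braces exclude them. The field name `updated` and the helper name are assumptions for illustration, not taken from the harvester code.

```python
from datetime import date


def day_range_query(day: date, field: str = "updated") -> str:
    """Build a query-string range covering one full UTC day.

    Square brackets make both endpoints inclusive, so the first and
    last millisecond of the day are part of the range.
    """
    d = day.isoformat()
    return f"{field}:[{d}T00:00:00.000Z TO {d}T23:59:59.999Z]"


if __name__ == "__main__":
    print(day_range_query(date(2019, 12, 27)))
```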
*	avoid usage of short links	Martin Czygan	2019-12-27	1	-2/+2
\|
*	Datacite API v2 throws HTTP 400, which we currently cannot recover from.	Martin Czygan	2019-12-27	1	-0/+4
\| \| \| \| \| \| \| \| \| \|	As a first iteration, just mark the daily batch complete and continue. The occasional HTTP 400 issue has been reported as https://github.com/datacite/datacite/issues/897. A possible improvement would be to shrink the window, so losses will be smaller.
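The "mark the batch complete and continue" behavior can be sketched roughly as below. This is an illustration only: `BadRequest`, `harvest_days`, and the injected `fetch_day` callable are hypothetical names, and the real harvester's state handling is not shown in this log.

```python
import sys


class BadRequest(Exception):
    """Stand-in for an HTTP 400 response from the API (hypothetical)."""


def harvest_days(days, fetch_day):
    """Fetch each day; on a 400, log the loss and still mark the day done."""
    completed = []
    for day in days:
        try:
            fetch_day(day)
        except BadRequest as exc:
            # Unrecoverable for now: records for this day are lost.
            print(f"skipping {day}: {exc}", file=sys.stderr)
        # Marked complete either way, so the harvester does not retry forever.
        completed.append(day)
    return completed
```

Shrinking the window, as the message suggests, would amount to passing smaller units than whole days so that a single failure loses fewer records.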
*	datacite: update documentation, add links to issues	Martin Czygan	2019-12-27	1	-10/+5
\|
*	datacite: use v2 of the API (flaky)	Martin Czygan	2019-12-27	1	-5/+28
\| \| \| \| \| \| \| \| \|	Update parameters for datacite API v2. Works fine, but there are occasional HTTP 400 responses when using the cursor API (daily updates can exceed the 10000 record limit for search queries). The HTTP 400 issue is not solved yet, but reported to datacite as https://github.com/datacite/datacite/issues/897.
*	transform ingests via pmc/pmcid, not pubmed/pmid	Bryan Newbold	2019-12-24	1	-4/+4
\|
*	allow arabesque backfill ingests for some source types	Bryan Newbold	2019-12-24	1	-0/+5
\|
*	make chocula URL updates more conservative	Bryan Newbold	2019-12-24	1	-5/+5
\|
*	pubmed: if doing update, also do subtitle schema update	Bryan Newbold	2019-12-23	1	-1/+9
\|
*	doi parsing fixes	Bryan Newbold	2019-12-23	1	-0/+7
\| \| \| \| \| \| \| \| \| \|	Replace emdash with regular dash. Replace double slash after partner ID with single slash. This conversion seems to be done by crossref automatically on lookup. I tried several examples, using doi.org resolver and Crossref API lookup. Note that there are a number of fatcat entities with '//' in the DOI.
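The two replacements described above could look roughly like this. It is a sketch under assumptions: the function name and the example DOIs are made up, and the actual cleaner may apply further rules not described in the commit message.

```python
import re
from typing import Optional

# Matches a DOI prefix ("10." plus registrant digits) followed by "//".
_DOUBLE_SLASH = re.compile(r"^(10\.\d+)//")


def normalize_doi(raw: Optional[str]) -> Optional[str]:
    """Apply the two fixes: em dash -> hyphen, '//' after prefix -> '/'."""
    if not raw:
        return None
    doi = raw.strip()
    doi = doi.replace("\u2014", "-")      # em dash to regular dash
    doi = _DOUBLE_SLASH.sub(r"\1/", doi)  # collapse slash after the prefix
    return doi


if __name__ == "__main__":
    print(normalize_doi("10.1234//example.5678"))
```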
*	pubmed: improve warning and stderr formatting	Bryan Newbold	2019-12-23	1	-5/+6
\|
*	pubmed: use standard identifier cleaners	Bryan Newbold	2019-12-23	1	-17/+14
\|
*	pubmed: remove unused extid mapping code	Bryan Newbold	2019-12-23	1	-29/+0
\|
*	pubmed: do reference lookups by default	Bryan Newbold	2019-12-23	1	-1/+1
\|
*	normalizers: clean_pmid(), and handle nulls in all other cleaners	Bryan Newbold	2019-12-23	1	-0/+31
\|
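A null-tolerant identifier cleaner in the spirit of `clean_pmid()` might look like the following. The validation rule (digits only) is an assumption; the real normalizer may accept or reject other forms.

```python
import re
from typing import Optional

_PMID_DIGITS = re.compile(r"\d+")


def clean_pmid(raw: Optional[str]) -> Optional[str]:
    """Return a stripped PMID, or None for null or malformed input."""
    if not raw:
        # Handles both None and the empty string without raising.
        return None
    raw = raw.strip()
    if _PMID_DIGITS.fullmatch(raw):
        return raw
    return None
```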
*	pubmed: null doi parsing check	Bryan Newbold	2019-12-23	1	-1/+1
\|
*	add basic MedlineDate year parsing	Bryan Newbold	2019-12-23	1	-0/+11
\|
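MedlineDate values are free-form strings such as "1998 Dec-1999 Jan" or "2000 Spring". One basic way to pull out a year, assuming the year always leads the string, is sketched below; the function name is hypothetical and the importer's actual parsing may differ.

```python
import re
from typing import Optional

_LEADING_YEAR = re.compile(r"^\s*(\d{4})\b")


def medline_date_year(medline_date: Optional[str]) -> Optional[int]:
    """Extract a leading four-digit year from a MedlineDate string."""
    if not medline_date:
        return None
    match = _LEADING_YEAR.match(medline_date)
    return int(match.group(1)) if match else None
```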
*	add regression test for medlinedate -> year parsing	Bryan Newbold	2019-12-23	2	-0/+102
\|
*	fix spn/ingest importer duplication check	Bryan Newbold	2019-12-22	1	-6/+8
\| \| \| \| \| \|	Check was happening after the `return True` by mistake, allowing duplicates in SPN editgroups, and potentially in ingest request editgroups as well.
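The ordering issue can be illustrated with a small sketch: the duplicate check has to run before any path that returns `True`, otherwise it is unreachable. The class, method, and `base_url` key below are illustrative assumptions, not the importer's real interface.

```python
class IngestImporterSketch:
    """Tracks URLs already accepted into the current editgroup."""

    def __init__(self):
        self._seen_urls = set()

    def want(self, request: dict) -> bool:
        url = request.get("base_url")
        if not url:
            return False
        # Duplicate check comes first; placed after `return True`
        # it would never execute and duplicates would slip through.
        if url in self._seen_urls:
            return False
        self._seen_urls.add(url)
        return True
```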
*	datacite release links and metadata expansion	Bryan Newbold	2019-12-20	2	-9/+13
\|	Small ergonomic changes for datacite releases:
\|	- add a link to live/current datacite metadata (like we do for Crossref)
\|	- expand "extra" metadata fields under 'datacite' dict in metadata view
*	spn: include link_source/link_source_id in ingest request	Bryan Newbold	2019-12-20	1	-0/+2
\|
*	pipenv: update deps	Bryan Newbold	2019-12-17	2	-11/+55
\| \| \| \| \| \| \| \|	loginpass patches got accepted upstream a while back, so don't need to pin to a git version. ipython 7.10 seems to have problems installing, so restricting to earlier 6.x versions.
*	pipenv: restrict pytest<5.0.0	Bryan Newbold	2019-12-17	2	-5/+13
\|	This prevents a test exception that presents like:
\|	tests/transform_csl.py:46:
\|	_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
\|	fatcat_tools/transforms/csl.py:204: in citeproc_csl
\|	    style_path = get_style_filepath(style)
\|	.venv/lib/python3.5/site-packages/citeproc_styles/__init__.py:74: in get_style_filepath
\|	    if resource_exists(__name__, independent_style):
\|	.venv/lib/python3.5/site-packages/pkg_resources/__init__.py:1134: in resource_exists
\|	    return get_provider(package_or_requirement).has_resource(resource_name)
\|	.venv/lib/python3.5/site-packages/pkg_resources/__init__.py:1404: in has_resource
\|	    return self._has(self._fn(self.module_path, resource_name))
\|	_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
\|	self = <pkg_resources.NullProvider object at 0x7f4f38c0bb00>
\|	path = '/home/bnewbold/code/fatcat/python/.venv/lib/python3.5/site-packages/citeproc_styles/styles/bibtex.csl'
\|	    def _has(self, path):
\|	        raise NotImplementedError(
\|	>           "Can't perform this operation for unregistered loader type"
\|	        )
\|	E       NotImplementedError: Can't perform this operation for unregistered loader type
*	pipenv: update Pipfile and Pipfile.lock	Bryan Newbold	2019-12-17	2	-286/+318
\| \| \| \| \| \|	This is still manually tweaked. I believe I've bisected the source of the CSL/citeproc_style import error to the upgrade of the 'pytest' module. This commit upgrades all packages except pytest.
*	pipfile: add langcodes and dateparser dependencies	Bryan Newbold	2019-12-17	2	-1/+44
\|
*	write diagnostic messages to stderr	Martin Czygan	2019-12-16	1	-2/+2
\| \| \| \| \|	During debugging, it can be helpful to keep stdout (e.g. processing results) and diagnostic messages separate.
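A short sketch of the separation described above: results go to stdout, diagnostics to stderr, so the two can be redirected independently (for example `script.py > results.txt 2> debug.log`). The `process` function and its record handling are invented for illustration.

```python
import sys


def process(records, out=None, err=None):
    """Write processing results to `out` and diagnostics to `err`."""
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err
    for record in records:
        if not record.strip():
            # Diagnostic message: kept out of the result stream.
            print("skipping empty record", file=err)
            continue
        print(record.upper(), file=out)
```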
*	Merge branch 'martin-importers-common-doc-fix' into 'master'	Martin Czygan	2019-12-14	1	-13/+10
\|\ \| \| \| \| \| \| \| \|	Update EntityImporter docstring. See merge request webgroup/fatcat!9
\| *	complete parse_record docstring	Martin Czygan	2019-12-14	1	-0/+6
\| \|
\| *	Update EntityImporter docstring.	Martin Czygan	2019-12-13	1	-13/+4
\| \| \| \| \| \| \| \|	I believe the required method is `parse_record`, not `parse`.
* \|	add ingest import file collision protection	Bryan Newbold	2019-12-13	1	-0/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The common case is the same URL being submitted repeatedly during testing. This is only within-editgroup, and per importer (eg, won't work across spn importer "submitted" editgroups), but is better than nothing.
* \|	fix spn kafka topic env var	Bryan Newbold	2019-12-13	1	-1/+1
\| \|
* \|	update ingest request schema	Bryan Newbold	2019-12-13	5	-16/+44
\| \| \| \| \| \| \| \| \| \|	This is mostly changing ingest_type from 'file' to 'pdf', and adding 'link_source'/'link_source_id', plus some small cleanups.
* \|	remove default mimetype from ingest-file importer	Bryan Newbold	2019-12-13	1	-2/+1
\| \| \| \| \| \| \| \|	We really should just use file_meta result or nothing.
* \|	revert accidentally committed test timing	Bryan Newbold	2019-12-13	1	-2/+2
\| \| \| \| \| \| \| \|	Also fix a spurious typo.
* \|	ensure importer description arg isn't clobbered	Bryan Newbold	2019-12-12	3	-5/+5
\| \|
* \|	tweaks to ingest-file transform	Bryan Newbold	2019-12-12	1	-13/+7
\| \|
* \|	initial 'Save Paper Now' web form	Bryan Newbold	2019-12-12	7	-2/+228
\| \|
* \|	more auth token vars in example.env	Bryan Newbold	2019-12-12	1	-0/+6
\| \| \| \| \| \| \| \|	As a form of documentation
* \|	savepapernow result importer	Bryan Newbold	2019-12-12	3	-4/+89
\| \| \| \| \| \| \| \|	Based on ingest-file-results importer
* \|	flush importer editgroups every few minutes	Bryan Newbold	2019-12-12	1	-5/+20
\| \|
* \|	EntityImporter: submit (not accept) mode	Bryan Newbold	2019-12-12	1	-2/+14
\|/ \| \| \| \|	For use with bots that don't have admin privileges, or where human follow-up review is desired.
*	Merge branch 'bnewbold-ingest-oa-container' into 'master'	bnewbold	2019-12-12	6	-3/+181
\|\ \| \| \| \| \| \| \| \|	container-ingest tool See merge request webgroup/fatcat!8
\| *	container_issnl, not issnl, for ES release query	Bryan Newbold	2019-12-12	1	-1/+1
\| \| \| \| \| \| \| \|	Caught by Martin in review; Thanks!
\| *	improve argparse usage	Bryan Newbold	2019-12-11	1	-6/+4
\| \|	--fatcat-api-url is clearer than --host-url
\| \|	remove unimplemented --debug (copy/paste from webface argparse)
\| \|	use formatter which will display 'default' parameters with --help
\| \|	Thanks to Martin for pointing out the latter, which I've always wanted!
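The standard-library formatter that shows defaults in `--help` output is `argparse.ArgumentDefaultsHelpFormatter`; whether this commit uses exactly that class is an assumption. The sketch below wires it up with the `--fatcat-api-url` flag named above; the default URL is a placeholder.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser whose --help output lists each argument's default value."""
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument(
        "--fatcat-api-url",
        default="http://localhost:9411/v0",  # placeholder default
        help="fatcat API endpoint to connect to",
    )
    return parser


if __name__ == "__main__":
    build_parser().print_help()
```

The formatter only appends a default to arguments that have a `help` string, so arguments without help text are not annotated.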
\| *	simplify ES scroll deletion using param()	Bryan Newbold	2019-12-11	1	-29/+29
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This gets rid of some messy error handling code by properly configuring the elasticsearch client to just not clean up scroll iterators when accessing the public (prod or qa) search interfaces. Leaving the scroll state around isn't ideal, so we still delete them if possible (eg, connecting directly to elasticsearch). Thanks to Martin for pointing out this solution in review.
\| *	add ingest-container command (new CLI tool)	Bryan Newbold	2019-12-10	1	-0/+136
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The intent of this tool is to make it easy to enqueue ingest requests into kafka, to be processed by a worker pool and eventually end up inserted into fatcat (for ingest hits that pass various checks). As a specific example use-case, we have pretty good coverage of eLife (a prominent OA publisher), but have missed some publications in the past, and have a large gap for the year 2019: https://fatcat.wiki/container/en4qj5ijrbf5djxx7p5zzpjyoq/coverage This tool would make it trivial to enqueue all the missing releases to be crawled. Future variants on this tool could query for, eg, long-tail OA works.