path: root/python
Commit message  (Author, Age, Files, Lines)
* pubmed: if doing update, also do subtitle schema update  (Bryan Newbold, 2019-12-23, 1 file, -1/+9)
|
* doi parsing fixes  (Bryan Newbold, 2019-12-23, 1 file, -0/+7)
|     Replace emdash with regular dash.
|
|     Replace double slash after partner ID with single slash. This conversion
|     seems to be done by crossref automatically on lookup. I tried several
|     examples, using doi.org resolver and Crossref API lookup.
|
|     Note that there are a number of fatcat entities with '//' in the DOI.
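The two DOI replacements described above can be sketched roughly as follows. This is only an illustration under assumptions: `normalize_doi_sketch` is a hypothetical name, not the cleaner actually changed in this commit, and "partner ID" is read here as the leading `10.NNNN` DOI prefix.

```python
import re
from typing import Optional


def normalize_doi_sketch(raw: Optional[str]) -> Optional[str]:
    """Hypothetical helper; not fatcat's actual DOI cleaner."""
    if not raw:
        return None
    doi = raw.strip()
    # Replace an em dash (U+2014) with a regular ASCII dash.
    doi = doi.replace("\u2014", "-")
    # Collapse a double slash directly after the prefix (assumed to be
    # what the commit calls the "partner ID") into a single slash.
    doi = re.sub(r"^(10\.\d+)//", r"\1/", doi)
    return doi
```

Only a double slash immediately after the prefix is collapsed; any later `//` in the suffix is left alone, since the commit message does not say how those should be handled.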
* pubmed: improve warning and stderr formatting  (Bryan Newbold, 2019-12-23, 1 file, -5/+6)
|
* pubmed: use standard identifier cleaners  (Bryan Newbold, 2019-12-23, 1 file, -17/+14)
|
* pubmed: remove unused extid mapping code  (Bryan Newbold, 2019-12-23, 1 file, -29/+0)
|
* pubmed: do reference lookups by default  (Bryan Newbold, 2019-12-23, 1 file, -1/+1)
|
* normalizers: clean_pmid(), and handle nulls in all other cleaners  (Bryan Newbold, 2019-12-23, 1 file, -0/+31)
|
* pubmed: null doi parsing check  (Bryan Newbold, 2019-12-23, 1 file, -1/+1)
|
* add basic MedlineDate year parsing  (Bryan Newbold, 2019-12-23, 1 file, -0/+11)
|
* add regression test for medlinedate -> year parsing  (Bryan Newbold, 2019-12-23, 2 files, -0/+102)
|
* fix spn/ingest importer duplication check  (Bryan Newbold, 2019-12-22, 1 file, -6/+8)
|     Check was happening after the `return True` by mistake, allowing
|     duplicates in SPN editgroups, and potentially in ingest request
|     editgroups as well.
* datacite release links and metadata expansion  (Bryan Newbold, 2019-12-20, 2 files, -9/+13)
|     Small ergonomic changes for datacite releases:
|
|     - add a link to live/current datacite metadata (like we do for Crossref)
|     - expand "extra" metadata fields under 'datacite' dict in metadata view
* spn: include link_source/link_source_id in ingest request  (Bryan Newbold, 2019-12-20, 1 file, -0/+2)
|
* pipenv: update deps  (Bryan Newbold, 2019-12-17, 2 files, -11/+55)
|     loginpass patches got accepted upstream a while back, so don't need to
|     pin to a git version
|
|     ipython 7.10 seems to have problems installing, so restricting to
|     earlier 6.x versions
* pipenv: restrict pytest<5.0.0  (Bryan Newbold, 2019-12-17, 2 files, -5/+13)
|     This prevents a test exception that presents like:
|
|     tests/transform_csl.py:46:
|     _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
|     fatcat_tools/transforms/csl.py:204: in citeproc_csl
|         style_path = get_style_filepath(style)
|     .venv/lib/python3.5/site-packages/citeproc_styles/__init__.py:74: in get_style_filepath
|         if resource_exists(__name__, independent_style):
|     .venv/lib/python3.5/site-packages/pkg_resources/__init__.py:1134: in resource_exists
|         return get_provider(package_or_requirement).has_resource(resource_name)
|     .venv/lib/python3.5/site-packages/pkg_resources/__init__.py:1404: in has_resource
|         return self._has(self._fn(self.module_path, resource_name))
|     _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
|
|     self = <pkg_resources.NullProvider object at 0x7f4f38c0bb00>
|     path = '/home/bnewbold/code/fatcat/python/.venv/lib/python3.5/site-packages/citeproc_styles/styles/bibtex.csl'
|
|         def _has(self, path):
|             raise NotImplementedError(
|     >           "Can't perform this operation for unregistered loader type"
|             )
|     E       NotImplementedError: Can't perform this operation for unregistered loader type
* pipenv: update Pipfile and Pipfile.lock  (Bryan Newbold, 2019-12-17, 2 files, -286/+318)
|     This is still manually tweaked.
|
|     I believe I've bisected the source of the CSL/citeproc_style import
|     error to the upgrade of the 'pytest' module.
|
|     This commit upgrades all packages except pytest.
* pipfile: add langcodes and dateparser dependencies  (Bryan Newbold, 2019-12-17, 2 files, -1/+44)
|
* write diagnostic messages to stderr  (Martin Czygan, 2019-12-16, 1 file, -2/+2)
|     During debugging, it can be helpful to keep stdout (e.g. processing
|     results) and diagnostic messages separate.
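A minimal sketch of the stdout/stderr split described above. The function name and the record handling are made up for illustration and are not taken from the changed file.

```python
import sys


def process_records_sketch(records):
    """Hypothetical loop: results go to stdout, diagnostics to stderr."""
    for rec in records:
        if not rec:
            # Diagnostic message: kept off stdout so it never mixes
            # with the processing results.
            print("skipping empty record", file=sys.stderr)
            continue
        # Processing result: written to stdout.
        print(rec.upper())
```

With this split, something like `python script.py > results.txt` captures only the results while the diagnostics still show up in the terminal.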
* Merge branch 'martin-importers-common-doc-fix' into 'master'  (Martin Czygan, 2019-12-14, 1 file, -13/+10)
|\
| |   Update EntityImporter docstring.
| |
| |   See merge request webgroup/fatcat!9
| * complete parse_record docstring  (Martin Czygan, 2019-12-14, 1 file, -0/+6)
| |
| * Update EntityImporter docstring.  (Martin Czygan, 2019-12-13, 1 file, -13/+4)
| |   I believe the required method is `parse_record`, not `parse`.
* | add ingest import file collision protection  (Bryan Newbold, 2019-12-13, 1 file, -0/+6)
| |   The common case is the same URL being submitted repeatedly during
| |   testing. This is only within-editgroup, and per importer (eg, won't
| |   work across spn importer "submitted" editgroups), but is better than
| |   nothing.
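One way to picture the within-editgroup, per-importer protection described above is a set of already-seen keys that is reset whenever the current editgroup is flushed. The class, its method names, and the choice of key are assumptions for illustration only, not the importer's actual code.

```python
class CollisionGuardSketch:
    """Hypothetical per-importer guard against repeats in one editgroup."""

    def __init__(self):
        # Keys already queued in the current editgroup (the key could be
        # a file hash or URL; the commit message does not say which).
        self._seen = set()
        self.pending = []

    def try_insert(self, key, entity):
        # Do the duplicate check before anything else, so a repeat can
        # never reach the "accepted" path.
        if key in self._seen:
            return False
        self._seen.add(key)
        self.pending.append(entity)
        return True

    def flush_editgroup(self):
        # State is scoped to a single editgroup, so it is cleared here.
        batch, self.pending = self.pending, []
        self._seen.clear()
        return batch
```

Because the set lives on one importer instance and is cleared on flush, it cannot catch repeats across importers or across editgroups, which matches the limitation noted in the commit message.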
* | fix spn kafka topic env var  (Bryan Newbold, 2019-12-13, 1 file, -1/+1)
| |
* | update ingest request schema  (Bryan Newbold, 2019-12-13, 5 files, -16/+44)
| |   This is mostly changing ingest_type from 'file' to 'pdf', and adding
| |   'link_source'/'link_source_id', plus some small cleanups.
* | remove default mimetype from ingest-file importer  (Bryan Newbold, 2019-12-13, 1 file, -2/+1)
| |   We really should just use file_meta result or nothing.
* | revert accidentally committed test timing  (Bryan Newbold, 2019-12-13, 1 file, -2/+2)
| |   Also fix a spurious typo.
* | ensure importer description arg isn't clobbered  (Bryan Newbold, 2019-12-12, 3 files, -5/+5)
| |
* | tweaks to ingest-file transform  (Bryan Newbold, 2019-12-12, 1 file, -13/+7)
| |
* | initial 'Save Paper Now' web form  (Bryan Newbold, 2019-12-12, 7 files, -2/+228)
| |
* | more auth token vars in example.env  (Bryan Newbold, 2019-12-12, 1 file, -0/+6)
| |   As a form of documentation
* | savepapernow result importer  (Bryan Newbold, 2019-12-12, 3 files, -4/+89)
| |   Based on ingest-file-results importer
* | flush importer editgroups every few minutes  (Bryan Newbold, 2019-12-12, 1 file, -5/+20)
| |
* | EntityImporter: submit (not accept) mode  (Bryan Newbold, 2019-12-12, 1 file, -2/+14)
|/
|     For use with bots that don't have admin privileges, or where human
|     follow-up review is desired.
* Merge branch 'bnewbold-ingest-oa-container' into 'master'  (bnewbold, 2019-12-12, 6 files, -3/+181)
|\
| |   container-ingest tool
| |
| |   See merge request webgroup/fatcat!8
| * container_issnl, not issnl, for ES release query  (Bryan Newbold, 2019-12-12, 1 file, -1/+1)
| |   Caught by Martin in review; Thanks!
| * improve argparse usage  (Bryan Newbold, 2019-12-11, 1 file, -6/+4)
| |   --fatcat-api-url is clearer than --host-url
| |
| |   remove unimplemented --debug (copy/paste from webface argparse)
| |
| |   use formatter which will display 'default' parameters with --help
| |
| |   Thanks to Martin for pointing out the latter, which I've always wanted!
| * simplify ES scroll deletion using param()  (Bryan Newbold, 2019-12-11, 1 file, -29/+29)
| |   This gets rid of some messy error handling code by properly configuring
| |   the elasticsearch client to just not clean up scroll iterators when
| |   accessing the public (prod or qa) search interfaces.
| |
| |   Leaving the scroll state around isn't ideal, so we still delete them if
| |   possible (eg, connecting directly to elasticsearch).
| |
| |   Thanks to Martin for pointing out this solution in review.
| * add ingest-container command (new CLI tool)  (Bryan Newbold, 2019-12-10, 1 file, -0/+136)
| |   The intent of this tool is to make it easy to enqueue ingest requests
| |   into kafka, to be processed by a worker pool and eventually end up
| |   inserted into fatcat (for ingest hits that pass various checks).
| |
| |   As a specific example use-case, we have pretty good coverage of eLife (a
| |   prominent OA publisher), but have missed some publications in the past,
| |   and have a large gap for the year 2019:
| |
| |   https://fatcat.wiki/container/en4qj5ijrbf5djxx7p5zzpjyoq/coverage
| |
| |   This tool would make it trivial to enqueue all the missing releases to
| |   be crawled.
| |
| |   Future variants on this tool could query for, eg, long-tail OA works.
| * factor out some basic kafka helpers  (Bryan Newbold, 2019-12-10, 2 files, -0/+23)
| |
| * add another ingest request source to whitelist  (Bryan Newbold, 2019-12-10, 1 file, -2/+5)
| |
| * pipenv: add elasticsearch and elasticsearch-dsl libraries  (Bryan Newbold, 2019-12-10, 2 files, -1/+19)
| |   These are low-level and high-level (respectively) client wrappers for
| |   elasticsearch
* | improve argparse usage  (Bryan Newbold, 2019-12-11, 10 files, -78/+95)
|/
|     Use --fatcat-api-url instead of (ambiguous) --host-url for commands that
|     aren't deployed/running via systemd.
|
|     TODO: update the other --host-url usage, and either roll-out change
|     consistently or support the old arg as an alias during cut-over
|
|     Use argparse.ArgumentDefaultsHelpFormatter (thanks Martin!)
|
|     Add help messages for all sub-commands, both as documentation and as a
|     way to get argparse to print available commands in a more readable
|     format.
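The argparse changes listed above might look roughly like this. It is a sketch under assumptions: the default URL and the `example-import` sub-command are placeholders invented for illustration, and accepting `--host-url` as an alias reflects only the TODO in the message, not necessarily what was merged.

```python
import argparse


def build_parser_sketch():
    # ArgumentDefaultsHelpFormatter appends "(default: ...)" to the help
    # text of every argument that has a help string.
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    # Both spellings are accepted; argparse derives dest from the first
    # long option, so the value lands in args.fatcat_api_url.
    parser.add_argument(
        "--fatcat-api-url", "--host-url",
        default="http://localhost:9411/v0",  # placeholder default
        help="fatcat API endpoint to connect to")
    subparsers = parser.add_subparsers(dest="command")
    # A help= string on each sub-command makes argparse list it, with a
    # description, in the top-level --help output.
    subparsers.add_parser(
        "example-import",  # hypothetical sub-command name
        help="placeholder sub-command, shown with this help text",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    return parser
```

Running `build_parser_sketch().print_help()` shows the default value next to `--fatcat-api-url` and lists `example-import` with its one-line description.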
* fix delete release history view  (Bryan Newbold, 2019-12-09, 1 file, -1/+1)
|     This was causing 5xx errors in production and qa. Eg, at:
|     https://qa.fatcat.wiki/release/aaaaaaaaaaaaarceaaaaaaaaai/history
* regression test for deleted entity history view  (Bryan Newbold, 2019-12-09, 1 file, -0/+25)
|
* add missing underline in deleted entity web view  (Bryan Newbold, 2019-12-09, 1 file, -1/+1)
|
* add basic test for crossref harvest API call  (Bryan Newbold, 2019-12-06, 2 files, -0/+46)
|
* refactor kafka producer in crossref harvester  (Bryan Newbold, 2019-12-06, 1 file, -21/+26)
|     producer creation/configuration should be happening in __init__() time,
|     not 'daily' call.
|
|     This specific refactor motivated by mocking out the producer in unit
|     tests.
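The motivation above can be illustrated with a small sketch: when the producer is created once in `__init__()`, a test can swap in a mock before any per-day work runs. The class, its `producer_factory` parameter, and `fetch_date` are hypothetical names, and the `produce`/`flush` calls are assumed to mirror a Kafka-style producer interface rather than copied from the harvester.

```python
from unittest import mock


class HarvesterSketch:
    """Hypothetical harvester; not the actual crossref harvester class."""

    def __init__(self, produce_topic, producer_factory):
        self.produce_topic = produce_topic
        # Producer is created and configured once, at construction time,
        # rather than on every per-day call.
        self.producer = producer_factory()

    def fetch_date(self, date, records):
        # 'records' stands in for whatever the API returned for 'date'.
        for rec in records:
            self.producer.produce(self.produce_topic, rec.encode("utf-8"))
        self.producer.flush()


def check_producer_created_once():
    factory = mock.Mock()
    harvester = HarvesterSketch("example-topic", factory)
    harvester.fetch_date("2019-12-05", ["a", "b"])
    harvester.fetch_date("2019-12-06", ["c"])
    # One producer for the harvester's lifetime, three messages sent.
    assert factory.call_count == 1
    assert factory.return_value.produce.call_count == 3
```

Passing the factory in is one possible design; patching the producer class with `mock.patch` or the pytest-mock `mocker` fixture would serve the same purpose.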
* add pytest-mock helper library to dev deps  (Bryan Newbold, 2019-12-06, 2 files, -1/+10)
|
* improve previous commit (JATS abstract hack)  (Bryan Newbold, 2019-12-03, 1 file, -4/+6)
|
* hack: remove enclosing JATS XML tags around abstracts  (Bryan Newbold, 2019-12-03, 1 file, -1/+7)
|     The more complete fix is to actually render the JATS to HTML and display
|     that. This is just to fix a nit with the most common case of XML tags in
|     abstracts.
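A minimal sketch of this kind of hack, assuming the "most common case" is an abstract wrapped in a single `<jats:p>` element. The function name and the exact tag are assumptions; the commit message does not spell them out.

```python
def strip_enclosing_jats_sketch(abstract: str) -> str:
    """Hypothetical helper: drop one enclosing <jats:p>...</jats:p> pair."""
    open_tag, close_tag = "<jats:p>", "</jats:p>"
    text = abstract.strip()
    if text.startswith(open_tag) and text.endswith(close_tag):
        inner = text[len(open_tag):-len(close_tag)]
        # Only unwrap the simple single-paragraph case; anything with
        # further <jats:p> tags inside is returned untouched.
        if open_tag not in inner:
            return inner.strip()
    return abstract
```

Abstracts with several paragraphs or other JATS markup pass through unchanged, which is where rendering the JATS to HTML would be needed.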