fatcat - [no description]

	Commit message	Author	Age	Files	Lines
...
* \|	rename HarvestState.next() to HarvestState.next_span()	Bryan Newbold	2020-05-26	1	-2/+2
\|/	"span" is short for "timespan" to harvest; there may be a better name to use. The motivation is to work around a pylint error that .next() was not callable. This might be a bug in pylint, but .next() is also a very generic name.
*	HACK: try to squelch pylint in CI	Bryan Newbold	2020-05-26	1	-2/+2
\|	Gitlab CI is showing lint errors like:
\|
\|	=================================== FAILURES ===================================
\|	_______________________ [pylint] tests/harvest_state.py ________________________
\|	E: 19,11: hs.next is not callable (not-callable)
\|	E: 33,11: hs.next is not callable (not-callable)
\|	E: 19,11: hs.next is not callable (not-callable)
\|	[...]
\|
\|	This is confusing as we use pipenv with a lock, so I should see the exact same errors locally. This commit is a hack to try and fix this and unbreak builds until we can debug further.
*	Merge remote-tracking branch 'github/master'	Bryan Newbold	2020-05-22	1	-5/+5
\|\
\| *	Identity is not the same thing as equality in Python	Christian Clauss	2020-05-14	1	-5/+5
\| \|
* \|	datacite: fix type error	Martin Czygan	2020-04-22	3	-1/+77
\| \|	Up to now, we expected the description to be a string or list. Add handling for int as well. First appeared: Apr 22 19:58:39.
* \|	datacite: fix a raw name constraint violation	Martin Czygan	2020-04-20	3	-1/+78
\|/	It was possible that contribs got added which had no raw name. One example would be a name consisting of whitespace only. This fix adds a final check for this case.
*	crossref: switch from index-date to update-date	Bryan Newbold	2020-03-30	1	-1/+1
\|	This goes against what the API docs recommend, but we are currently far behind on updates and need to catch up. Other than what the docs say, this seems to be consistent with the behavior we want.
*	Merge pull request #53 from EdwardBetts/spelling	bnewbold	2020-03-27	2	-3/+3
\|\	Correct spelling mistakes
\| *	Correct spelling mistakes	Edward Betts	2020-03-27	2	-3/+3
\| \|
* \|	Merge branch 'bnewbold-400-bad-revisions' into 'master'	Martin Czygan	2020-03-26	1	-0/+2
\|\ \	catch ApiValueError in some generic API calls
\| \| \|	See merge request webgroup/fatcat!35
\| * \|	catch ApiValueError in some generic API calls	Bryan Newbold	2020-03-25	1	-0/+2
\| \| \|	The motivation for this change is to handle bogus revision IDs in URLs, which were causing 500 errors instead of 400 errors. E.g.: https://qa.fatcat.wiki/file/rev/5d5d5162-b676-4f0a-968f-e19dadeaf96e%2B2019-11-27%2B13:49:51%2B0%2B6 I have no idea where these URLs are actually coming from, but they should be 4xx, not 5xx. Investigating made me realize there is a whole category of ApiValueError exceptions we were not catching and should have been.
* \| \|	improve citeproc/CSL web interface	Bryan Newbold	2020-03-25	2	-13/+53
\|/ /	This tries to show the citeproc (BibTeX, MLA, CSL-JSON) options for more releases, and not show the links when they would break. The primary motivation here is to work around two exceptions being thrown in prod every day (according to sentry): "KeyError: 'role'" and "ValueError: CLS requries some surname (family name)". I'm guessing these are mostly coming from crawlers following the citeproc links on release landing pages.
* \|	pubmed: handle multiple ReferenceList	Bryan Newbold	2020-03-20	2	-0/+218
\| \|	This resolves a situation noticed in prod where we were only importing/updating a single reference per article. Includes a regression test.
* \|	Merge branch 'martin-kafka-bs4-import' into 'master'	Martin Czygan	2020-03-10	3	-0/+80
\|\ \ \| \|/ \|/\|	pubmed and arxiv harvest preparations
\| \|	See merge request webgroup/fatcat!28
\| *	pubmed: move mapping generation out of fetch_date	Martin Czygan	2020-03-10	1	-0/+2
\| \|	* fetch_date will fail on missing mapping
\| \|	* adjust tests (test will require access to pubmed ftp)
\| *	more pubmed adjustments	Martin Czygan	2020-02-22	3	-0/+78
\| \|	* regenerate map in continuous mode
\| \|	* add tests
* \|	Merge branch 'bnewbold-elastic-v03b'	Bryan Newbold	2020-02-26	4	-4/+61
\|\ \
\| * \|	ES updates: fix tests to accept archive.org in host/domain	Bryan Newbold	2020-02-14	1	-2/+3
\| \| \|
\| * \|	ES releases: host/domain fixes	Bryan Newbold	2020-01-31	1	-0/+3
\| \| \|
\| * \|	implement host+domain parsing for file ES transform	Bryan Newbold	2020-01-30	1	-4/+3
\| \| \|
\| * \|	fix ES file schema plural field names	Bryan Newbold	2020-01-29	1	-1/+1
\| \| \|
\| * \|	actually implement changelog transform	Bryan Newbold	2020-01-29	1	-1/+23
\| \| \|
\| * \|	fix some transform bugs, add some tests	Bryan Newbold	2020-01-29	4	-5/+16
\| \| \|
\| * \|	first implementation of ES file schema	Bryan Newbold	2020-01-29	1	-2/+23
\| \| \|	Includes a trivial test and transform, but not any workers or doc updates.
* \| \|	shadow import: more filtering of file_meta fields	Bryan Newbold	2020-02-13	2	-18/+18
\| \| \|
* \| \|	basic shadow importer	Bryan Newbold	2020-02-13	2	-0/+71
\| \|/ \|/\|
* \|	datacite: add exception for https://www.micropublication.org/	Martin Czygan	2020-01-31	1	-1/+2
\| \|
* \|	datacite: improve date handling and minor tweak	Martin Czygan	2020-01-30	3	-2/+111
\|/	Records from https://www.micropublication.org/ did not have a date in FC, although the raw data contained date strings: they were not using the finer-grained "attributes.date" but "attributes.published" and/or "attributes.publicationYear". Support for those fields has been added, including a test case. During this test (#30) a processing gap for names became clear (an author may have "given_name" and "surname", but no "name"). This bug has been fixed, too.
*	do not normalize "en dash" in DOI	Martin Czygan	2020-01-17	1	-1/+1
\|	Technically, [...] DOI names may incorporate any printable characters from the Universal Character Set (UCS-2), of ISO/IEC 10646, which is the character set defined by Unicode (https://www.doi.org/doi_handbook/2_Numbering.html#2.5.1). For mostly QA reasons, we currently treat a DOI with an "en dash" as invalid.
*	ingest: improve tests, support old ingest results	Bryan Newbold	2020-01-15	3	-1/+18
\|
*	datacite: add entry to license slug map	Martin Czygan	2020-01-09	1	-0/+1
\|
*	datacite: ignore known unknown values in resourceType*	Martin Czygan	2020-01-09	3	-1/+95
\|
*	datacite: abstracts may be strings or list of strings	Martin Czygan	2020-01-09	5	-1/+187
\|
*	datacite: improve license_slug handling	Martin Czygan	2020-01-09	3	-2/+33
\|
*	datacite: add 'Unknown' to blacklist	Martin Czygan	2020-01-09	1	-7/+1
\|
*	datacite: get rid of schemaVersion	Martin Czygan	2020-01-09	17	-32/+14
\|
*	datacite: reformat test cases and use jq . --sort-keys	Martin Czygan	2020-01-08	54	-2299/+2301
\|
*	datacite: factor out contributor handling	Martin Czygan	2020-01-08	5	-2/+107
\|	Use values from:
\|	* attributes.creators[]
\|	* attributes.contributors[]
*	datacite: adjust tests for release_month	Martin Czygan	2020-01-08	12	-12/+12
\|
*	datacite: mark additional files as stub	Martin Czygan	2020-01-08	3	-1/+73
\|
*	datacite: CCDC are entries, mostly	Martin Czygan	2020-01-08	1	-1/+1
\|
*	datacite: adding datacite-specific extra metadata	Martin Czygan	2020-01-07	30	-1468/+1570
\|	* attributes.metadataVersion
\|	* attributes.schemaVersion
\|	* attributes.version (source dependent values, follows suggestions in https://schema.datacite.org/meta/kernel-4.3/doc/DataCite-MetadataKernel_v4.3.pdf#page=26, but values vary)
\|	Furthermore:
\|	* attributes.types.resourceTypeGeneral
\|	* attributes.types.resourceType
*	datacite: month field should be top-level	Martin Czygan	2020-01-06	11	-14/+14
\|
*	datacite: include month in extra	Martin Czygan	2020-01-06	11	-11/+13
\|	> include release_month as a top-level extra field [...] to auto-populate the schema field from that
*	datacite: indicate mismatched file in test	Martin Czygan	2020-01-06	1	-1/+1
\|
*	datacite: clean abstracts, use unknown value tokens	Martin Czygan	2020-01-06	3	-3/+3
\|	Datacite defines placeholders for unknown values:
\|	* https://support.datacite.org/docs/schema-values-unknown-information-v43
\|	Clean abstracts.
*	datacite: always include "datacite" key in extra	Martin Czygan	2020-01-04	14	-26/+26
\|	> always include extra values for the respective DOI registrars (datacite, crossref, jalc), even if they are empty ({}), to be used as a flag so we know which DOI registrar supplied the metadata.
*	datacite: use normal.clean_doi	Martin Czygan	2020-01-03	1	-4/+0
\|
*	datacite: parse_datacite_dates returns month	Martin Czygan	2020-01-03	1	-7/+16
\|	As [...] we will soon add support for release_month field in the release schema.
*	datacite: prepare release_month (stub)	Martin Czygan	2020-01-03	1	-14/+14
\|
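
Several of the datacite fixes in this log (description may be a string, list, or int; abstracts may be a string or a list of strings; contribs whose raw name is whitespace only) come down to normalizing loosely typed input before it reaches the schema. The following is a minimal sketch of that kind of normalization, not the importer's actual code: the helper names as_text_list and clean_raw_name are hypothetical, and the real fatcat importer may handle these cases differently.

```python
from typing import Any, List, Optional


def as_text_list(value: Any) -> List[str]:
    """Coerce a str, int, or list of those into a list of non-empty strings.

    Hypothetical helper for illustration; not part of fatcat.
    """
    if value is None:
        return []
    items = value if isinstance(value, list) else [value]
    texts = []
    for item in items:
        # bool is a subclass of int in Python, so exclude it explicitly;
        # dicts and other unexpected shapes are skipped as well.
        if isinstance(item, bool) or not isinstance(item, (str, int)):
            continue
        text = str(item).strip()
        if text:
            texts.append(text)
    return texts


def clean_raw_name(raw_name: Optional[str]) -> Optional[str]:
    """Return the stripped name, or None if only whitespace remains.

    Hypothetical helper for illustration; not part of fatcat.
    """
    if raw_name is None:
        return None
    name = raw_name.strip()
    return name or None
```

With helpers like these, a caller can treat every description or abstract field as a list of strings and drop any contributor for which clean_raw_name returns None, which is the kind of "final check" the raw name fix describes.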