fatcat - [no description]

	Commit message	Author	Age	Files	Lines
...
*	remove 'import *' from fatcat_tools (for transforms)	Bryan Newbold	2021-11-02	1	-2/+2
\|
*	small python tweaks for annotations, imports	Bryan Newbold	2021-11-02	3	-3/+7
\|
*	try some type annotations	Bryan Newbold	2021-11-02	4	-70/+79
\|
*	reviewer: add annotations required by mypy	Bryan Newbold	2021-11-02	1	-2/+3
\|
*	fix missing variable in fileset ingest	Bryan Newbold	2021-11-02	1	-2/+1
\|
*	Merge branch 'bnewbold-import-fileset'	Bryan Newbold	2021-11-02	5	-4/+350
\|\
\| *	WIP: more fileset ingest	Bryan Newbold	2021-10-18	1	-13/+21
\| \|
\| *	WIP: rel fixes	Bryan Newbold	2021-10-14	1	-6/+6
\| \|
\| *	fileset ingest small tweaks	Bryan Newbold	2021-10-14	1	-21/+36
\| \|
\| *	initial implementation of fileset ingest importers	Bryan Newbold	2021-10-14	2	-3/+224
\| \|
\| *	ingest: handle datasets, components, other ingest types	Bryan Newbold	2021-10-14	1	-1/+15
\| \|
\| *	generic fileset importer class, with test coverage	Bryan Newbold	2021-10-14	3	-0/+88
\| \|
* \|	Merge branch 'bnewbold-match-get'	Bryan Newbold	2021-11-02	1	-3/+9
\|\ \
\| * \|	access: populate thumbnail_url for PDFs	Bryan Newbold	2021-10-18	1	-3/+9
\| \|/
* /	pubmed: switch default http site to retrieve update files	Martin Czygan	2021-10-15	1	-2/+4
\|/	Proxy started to throw: "dial tcp: lookup ftp.ncbi.nlm.nih.gov on [::1]:53: read udp [::1]:45178->[::1]:53: read: connection refused". NIH has an HTTP version of its own, so try to use that.
*	dblp import: basic support for handles as identifiers	Bryan Newbold	2021-10-13	1	-1/+5
\|
*	python: normalization/validation support for handle identifiers (hdl)	Bryan Newbold	2021-10-13	1	-0/+33
\|
*	dblp import: fix typos in identifier parsing	Bryan Newbold	2021-10-13	1	-2/+1
\|
*	python: partial importer utilization of new schema changes	Bryan Newbold	2021-10-13	3	-6/+18
\|
*	python: implement ES schema changes	Bryan Newbold	2021-10-13	1	-4/+17
\|
*	Merge branch 'bnewbold-ingest-tweaks' into 'master'	bnewbold	2021-10-02	3	-39/+106
\|\	ingest importer behavior tweaks. See merge request webgroup/fatcat!120
\| *	kafka import: optional 'force-flush' mode for some importers	Bryan Newbold	2021-10-01	1	-0/+13
\| \|	Behavior and motivation described in the kafka json import comment.
\| *	new SPN web (html) importer	Bryan Newbold	2021-10-01	2	-27/+81
\| \|
\| *	ingest importer behavior tweaks	Bryan Newbold	2021-10-01	1	-8/+8
\| \|	- change order of 'want()' checks, so that result counts are clearer
\| \|	- don't require GROBID success for file imports with SPN
\| *	importer common: more verbose logging (with counts)	Bryan Newbold	2021-10-01	1	-4/+4
\| \|
* \|	datacite: skip empty abstracts	Martin Czygan	2021-10-01	1	-1/+4
\|/	Do not add abstracts where `clean` results in the empty string - this violates a constraint: `either abstract_sha1 or content is required`
*	pubmed: workaround a networking issue	Martin Czygan	2021-09-09	1	-24/+21
\|	use an http proxy (https://github.com/miku/ftpup) to fetch files from FTP, keep some retry logic; also, hardcoding the proxy path as this should be a temporary workaround
*	pubmed: add option to ftp download with lftp	Martin Czygan	2021-09-08	1	-2/+31
\|	lftp is a classic command line ftp client, and we hope that its retry capabilities are enough of a workaround for the current networking issue
*	pubmed harvester: add basic retry logic	Martin Czygan	2021-08-20	1	-8/+21
\|	Related to a previous issue with seemingly random EOFError from FTP connections, this patch wraps the "ftpretr" helper function with a basic retry. Refs: fatcat-workers/issues/92151, fatcat-workers/issues/91102
*	refs: default to not consolidating works	Bryan Newbold	2021-08-06	1	-1/+1
\|	We don't handle counts for consolidated refs yet, so just don't consolidate. This should fix, e.g., "Showing 1-18 of 19" type UX confusion, with the trade-off that some works will be duplicated in inbound ref tables.
*	refs: lint fixes	Bryan Newbold	2021-07-27	1	-0/+1
\|
*	refs: support for wikipedia outbound refs, and display in tables	Bryan Newbold	2021-07-27	1	-2/+2
\|
*	refs: generalize web endpoints; JSON content negotiation; openlibrary inbound view; etc	Bryan Newbold	2021-07-23	2	-22/+57
\|
*	refs: small refactors/tweaks	Bryan Newbold	2021-07-23	1	-11/+17
\|
*	remove unused imports (lint)	Bryan Newbold	2021-07-23	2	-3/+2
\|
*	pylint: skip pydantic import check (dynamic/extensions)	Bryan Newbold	2021-07-23	1	-8/+2
\|
*	refs: refactor web paths; enrich refs as generic; remove old refs link	Bryan Newbold	2021-07-23	1	-50/+35
\|
*	refs fetch: add some hacks; sort hits	Bryan Newbold	2021-07-23	1	-6/+16
\|
*	fixes for newer ref index	Bryan Newbold	2021-07-23	1	-1/+1
\|
*	references: refactor to point to access_options transform; comment out CSL fields	Bryan Newbold	2021-07-23	1	-57/+8
\|
*	partial access options transform for releases	Bryan Newbold	2021-07-23	1	-0/+58
\|
*	initial inbound/outbound reference query helpers	Bryan Newbold	2021-07-23	1	-0/+450
\|
*	pubmed: update docs	Martin Czygan	2021-07-17	1	-2/+3
\|
*	pubmed: do not fail when accessing missing file	Martin Czygan	2021-07-17	1	-2/+8
\|	After a sync gap (e.g. 06/07 2021), the harvester wanted to fetch a file that was no longer on the server. Do not fail in this case; we'll need to backfill missing records via a full data dump.
*	pubmed: reconnect on error	Martin Czygan	2021-07-16	1	-4/+30
\|	ftp retrieval would run but fail with EOFError on /pubmed/updatefiles/pubmed21n1328_stats.html - not able to find the root cause; using a fresh client, the exact same file would work just fine. So when we retry, we reconnect on failure. Refs: sentry #91102.
*	more consistent and defensive lower-casing of DOIs	Bryan Newbold	2021-06-23	3	-3/+8
\|	After noticing more upper/lower ambiguity in production. In particular, we have some old ingest requests in sandcrawler DB, which get re-submitted/re-tried, which have capitalized DOIs in the link source id field.
*	datacite: more careful title string access; fixes sentry #88350	Martin Czygan	2021-06-11	1	-1/+1
\|	Caused by a partial "title entry without title" coming first (e.g. one holding just a language, like: {'lang': 'da'}).
*	clean_doi() should lower-case returned DOI	Bryan Newbold	2021-06-07	1	-1/+4
\|	Code in a number of places (including Pubmed importer) assumed that this was already lower-casing DOIs, resulting in some broken metadata getting created. See also: https://github.com/internetarchive/fatcat/issues/83 This is just the first step of mitigation.
*	ingest: swap ingest and file checks, to result in clearer stats/counts of skipping	Bryan Newbold	2021-06-03	1	-2/+2
\|
*	ingest: don't accept mag and s2 URLs	Bryan Newbold	2021-06-03	1	-4/+4
\|
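
Note on the DOI commits above (2021-06-07 and 2021-06-23): both messages describe lower-casing DOIs so that callers never see mixed-case variants of the same identifier. A minimal sketch of that idea follows. It is not fatcat's actual clean_doi(): the function name normalize_doi_sketch, the prefix list, and the shape check are all assumptions made for illustration.

```python
from typing import Optional

# Assumed set of prefixes to strip; chosen for illustration only.
_ASSUMED_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi_sketch(raw: Optional[str]) -> Optional[str]:
    """Hypothetical helper: return a lower-cased DOI, or None if it looks invalid."""
    if not raw:
        return None
    doi = raw.strip()
    # Drop a leading resolver URL or "doi:" scheme, matched case-insensitively.
    for prefix in _ASSUMED_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Very loose shape check: "10.<registrant>/<suffix>".
    if not doi.startswith("10.") or "/" not in doi:
        return None
    # The point of the commits: always return the lower-cased form.
    return doi.lower()
```

Lower-casing at the single point where DOIs are cleaned means importers that compare or look up DOIs do not each need their own case handling, which is the kind of assumption the 2021-06-07 message says was being made.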