fatcat - [no description]

	Commit message	Author	Age	Files	Lines
*	entity updates: don't try to ingest arxiv DOIs (for now)	Bryan Newbold	2022-02-28	1	-0/+2
\|
*	entity worker: expand creators in release entities	Bryan Newbold	2021-12-15	1	-1/+1
\|
*	typing: add assertions to fatcat_tool code to make type assumptions explicit	Bryan Newbold	2021-11-03	1	-0/+1
\|
*	typing: add annotations to remaining fatcat_tools code	Bryan Newbold	2021-11-03	1	-17/+26
\| \| \| \| \|	Again, these are just annotations, no changes made to get type checks to pass
*	fmt (black): fatcat_tools/	Bryan Newbold	2021-11-02	1	-118/+138
\|
*	python: isort everything	Bryan Newbold	2021-11-02	1	-1/+2
\|
*	changelog worker: fix file/fileset typo, caught by lint	Bryan Newbold	2021-05-25	1	-1/+1
\| \| \| \| \|	This typo would have resulted in some releases not being re-indexed into search.
*	entity update worker: treat fileset and webcapture updates like file updates	Bryan Newbold	2020-12-16	1	-3/+25
\| \| \| \| \| \| \| \| \|	When webcapture or fileset entities are updated, then the release entities associated with them also need to be updated (and work entities, recursively). A TODO is to handle the case where a release_id is removed as well as added, and reprocess the releases in that case as well.
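The TODO in the message above (reprocess releases that were removed from an entity as well as those added) can be sketched as a set union over the old and new revisions. This is a minimal illustration, not the worker's actual code; the function and parameter names are hypothetical.

```python
from typing import Iterable, Set


def releases_to_reprocess(
    old_release_ids: Iterable[str], new_release_ids: Iterable[str]
) -> Set[str]:
    """Return every release touched by a file/fileset/webcapture update.

    Hypothetical helper: taking the union covers release_ids that were
    removed from the entity as well as those that were added or kept.
    """
    return set(old_release_ids) | set(new_release_ids)
```

A caller would then re-index each returned release, and its work, once.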
*	entity updates: don't ingest JSTOR DOI prefixes	Bryan Newbold	2020-10-23	1	-0/+2
\|
*	entity updater: new work update feed (ident and changelog metadata only)	Bryan Newbold	2020-10-16	1	-2/+24
\|
*	ingest: default to crawl protocols.io DOIs	Bryan Newbold	2020-09-10	1	-0/+2
\|
*	entity updater: handle doi=None case better	Bryan Newbold	2020-08-14	1	-1/+1
\|
*	entity updater: es['publisher_type'] not always set	Bryan Newbold	2020-08-14	1	-1/+1
\| \| \| \|	This is a small bugfix for a production issue.
*	entity update: change big5 ingest behavior	Bryan Newbold	2020-08-11	1	-9/+15
\| \| \| \| \| \| \| \| \|	In addition to changing the OA default, this was the main intended behavior change in this group of commits: we want to make fewer ingest attempts that we expect to fail, but default to an ingest/crawl attempt when we are uncertain. This is because there is a long tail of journals that register DOIs and are de facto OA (fulltext is available), but we don't have metadata indicating them as such.
*	entity update: default to ingest non-OA works	Bryan Newbold	2020-08-11	1	-9/+10
\|
*	entity update: skip ingest of figshare+zenodo 'group' DOIs	Bryan Newbold	2020-08-11	1	-0/+15
\|
*	update crawl blocklist for SPNv2 requests which mostly fail	Bryan Newbold	2020-08-10	1	-2/+10
\|
*	lint (flake8) tool python files	Bryan Newbold	2020-07-01	1	-1/+0
\|
*	changelog: limit types	Martin Czygan	2020-04-16	1	-5/+1
\| \| \| \| \|	Exclude partial documents (e.g. abstract), overly generic types (component, entry), and HTML blog posts.
*	changelog: extend release_types considered documents	Martin Czygan	2020-04-16	1	-10/+19
\|	According to release_rev.release_type, we have 29 values:
\|	
\|	fatcat_prod=# select release_type, count(release_type) from release_rev group by release_type;
\|	   release_type    \|   count
\|	-------------------+-----------
\|	 abstract          \|      2264
\|	 article           \|   6371076
\|	 article-journal   \| 101083841
\|	 article-newspaper \|     17062
\|	 book              \|   1676941
\|	 chapter           \|  13914854
\|	 component         \|     58990
\|	 dataset           \|   6860325
\|	 editorial         \|    133573
\|	 entry             \|   1628487
\|	 graphic           \|   1809471
\|	 interview         \|     19898
\|	 legal_case        \|      3581
\|	 legislation       \|      1626
\|	 letter            \|    275119
\|	 paper-conference  \|   6074669
\|	 peer_review       \|     30581
\|	 post              \|    245807
\|	 post-weblog       \|       135
\|	 report            \|   1010699
\|	 retraction        \|      1292
\|	 review-book       \|     96219
\|	 software          \|       316
\|	 song              \|     24027
\|	 speech            \|      4263
\|	 standard          \|    312364
\|	 stub              \|   1036813
\|	 thesis            \|    414397
\|	                   \|         0
\|	(29 rows)
*	ingest: more DOI patterns to treat as OA	Bryan Newbold	2020-03-28	1	-0/+26
\| \| \| \| \| \| \|	These are journal/publisher patterns which we suspect to actually be OA based on the large quantity of papers that crawl successfully. The better long-term solution will be to flag containers in some way as OA (or "should crawl"), but this is a good short-term solution.
*	ingest: always try some lancet journals	Bryan Newbold	2020-03-19	1	-0/+3
\|
*	entity worker: ingest more releases	Bryan Newbold	2020-02-22	1	-1/+37
\| \| \| \| \| \| \| \|	If release is a dataset or image, don't do a pdf ingest request. If release is a datacite DOI, and release_type is a "document", crawl regardless of is_oa detection. This is mostly to crawl repositories (institutional or subject).
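The rules in the message above can be read as a small decision function. The sketch below is an assumption-laden illustration, not the worker's code: the function name, the `doi_registrar` parameter, the contents of `DOCUMENT_TYPES`, and the final fallback to the OA flag are all hypothetical.

```python
from typing import Optional

# Hypothetical subset; the log does not list which release types count as "documents".
DOCUMENT_TYPES = {"article-journal", "paper-conference", "report", "thesis", "book", "chapter"}


def want_pdf_ingest(
    release_type: Optional[str],
    doi_registrar: Optional[str],
    is_oa: Optional[bool],
) -> bool:
    """Decide whether to emit a PDF ingest request for a release."""
    # Datasets and images are not expected to have a PDF fulltext.
    if release_type in ("dataset", "image"):
        return False
    # Datacite DOIs for document-like releases are crawled regardless of OA
    # detection, mostly to reach institutional and subject repositories.
    if doi_registrar == "datacite" and release_type in DOCUMENT_TYPES:
        return True
    # Assumed fallback: rely on the OA flag, treating "unknown" as not OA.
    return bool(is_oa)
```

Later commits in this log (2020-08-11) change the default for uncertain cases, so the fallback line is the part most likely to differ from any given revision.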
*	always crawl researchgate DOIs	Bryan Newbold	2020-02-18	1	-0/+2
\| \| \| \|	Now that ingest is fixed
*	add acceptlist override for biorxiv/medrxiv	Bryan Newbold	2020-02-10	1	-2/+12
\|
*	fix KafkaError worker reporting for partition errors	Bryan Newbold	2020-01-29	1	-1/+1
\|
*	additional DOI prefix filters	Bryan Newbold	2020-01-28	1	-0/+8
\| \| \| \|	From Martin, thanks.
*	apply ingest request filtering in entity worker	Bryan Newbold	2020-01-28	1	-3/+34
\| \| \| \| \| \| \|	`ingest_oa_only` behavior, and other filters, now handled in the entity update worker, instead of in the transform function. Also add a DOI prefix blocklist feature.
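A DOI prefix blocklist of the kind described above can be sketched as a simple prefix match. The prefixes below are placeholders and the function name is hypothetical; the real list lives in the worker source and is not reproduced in this log.

```python
from typing import Iterable, Optional

# Placeholder prefixes for illustration only; not the project's actual blocklist.
EXAMPLE_BLOCKLIST = ["10.0000/", "10.99999/blocked."]


def doi_is_blocked(doi: Optional[str], blocklist: Iterable[str] = EXAMPLE_BLOCKLIST) -> bool:
    """Return True if the DOI starts with any blocklisted prefix."""
    if not doi:
        # Releases without a DOI cannot match a DOI prefix.
        return False
    doi = doi.lower()  # DOIs are case-insensitive, so compare in lowercase
    return any(doi.startswith(prefix) for prefix in blocklist)
```

Handling `None` up front matches the spirit of the later "handle doi=None case better" fix in this log.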
*	update ingest request schema	Bryan Newbold	2019-12-13	1	-1/+1
\| \| \| \| \|	This is mostly changing ingest_type from 'file' to 'pdf', and adding 'link_source'/'link_source_id', plus some small cleanups.
*	project -> ingest_request_source	Bryan Newbold	2019-11-15	1	-1/+1
\|
*	add ingest request feature to entity_updates worker	Bryan Newbold	2019-11-15	1	-4/+20
\| \| \| \| \| \| \| \| \| \| \| \| \|	Initially was going to create a new worker to consume from the release update channel, but couldn't get the edit context ("is this a new release, or an update to an existing one?") from that channel. Currently there is a flag in source code to control whether we only do OA releases or all releases. Beginning with OA only to start slow, but should probably default to all, and make this a config flag. Should probably also have a config flag to control this entire feature. Tested locally in dev.
*	review/fix all confluent-kafka produce code	Bryan Newbold	2019-09-20	1	-4/+12
\|
*	small fixes to confluent-kafka importers/workers	Bryan Newbold	2019-09-20	1	-4/+10
\|	- decrease default changelog pipeline to 5.0sec
\|	- fix missing KafkaException harvester imports
\|	- more confluent-kafka tweaks
\|	- updates to kafka consumer configs
\|	- bump elastic updates consumergroup (again)
*	convert pipeline workers from pykafka to confluent-kafka	Bryan Newbold	2019-09-20	1	-67/+116
\|
*	fix typo in typo	Bryan Newbold	2019-06-24	1	-1/+1
\|
*	fix typo in changelog worker	Bryan Newbold	2019-06-24	1	-1/+1
\|
*	more links on new homepage	Bryan Newbold	2019-06-19	1	-1/+1
\| \| \| \| \|	matching produce sizes. may want to tweak this config in the future for throughput.
*	fix and workaround container entities in release topic	Bryan Newbold	2019-05-30	1	-2/+2
\|
*	file and container update kafka topics	Bryan Newbold	2019-05-30	1	-54/+69
\|
*	update elastic for releases when files added	Bryan Newbold	2019-05-30	1	-1/+36
\| \| \| \|	A bunch of remaining TODOs here
*	10 MByte default Kafka produce (workers)	Bryan Newbold	2019-03-06	1	-2/+6
\|
*	bunch of lint/whitespace cleanups	Bryan Newbold	2019-02-22	1	-2/+1
\|
*	include filesets and webcaptures in exports	Bryan Newbold	2019-01-18	1	-1/+1
\|
*	workers do API-passing (not URI-passing)	Bryan Newbold	2019-01-08	1	-4/+4
\|
*	not as strong a todo (timestamps)	Bryan Newbold	2018-11-19	1	-1/+1
\|
*	bunch of pylint cleanup	Bryan Newbold	2018-11-15	1	-1/+1
\|
*	large refactor of python names/paths	Bryan Newbold	2018-11-15	1	-3/+4
\|	- Add __init__.py files for fatcat_tools submodules, and use them in imports
\|	- Add a bunch of comments to files.
\|	- rename a number of classes and functions to be less verbose
*	fix worker code	Bryan Newbold	2018-11-14	1	-2/+3
\|
*	most_recent_message as reusable function	Bryan Newbold	2018-11-14	1	-26/+1
\|
*	switch to auto consumer offset updates	Bryan Newbold	2018-11-13	1	-1/+6
\| \| \| \| \| \|	This is the classic/correct way to do consumer group updates for higher throughput, when "at least once" semantics are acceptable (as they are here; double processing should be safe/fine).
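As a rough illustration of the trade-off described above: automatic offset commits happen in the background on a timer, so a crash can cause some messages to be delivered again, which is harmless when processing is idempotent. The sketch uses librdkafka-style configuration keys as accepted by confluent-kafka; this is an assumption for illustration, since the workers at the time of this commit used pykafka (the conversion to confluent-kafka appears later in this log). The function names and the in-memory `index` are hypothetical.

```python
from typing import Any, Dict


def consumer_config(bootstrap_servers: str, group_id: str) -> Dict[str, Any]:
    """Build a consumer config that commits offsets automatically."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        # Offsets are committed periodically in the background rather than
        # after each message, trading exactly-once for throughput.
        "enable.auto.commit": True,
        "auto.commit.interval.ms": 5000,
        "auto.offset.reset": "earliest",
    }


def index_update(index: Dict[str, dict], message: dict) -> None:
    """Idempotent upsert keyed by entity ident; replays overwrite in place."""
    index[message["ident"]] = message
```

Because `index_update` is an upsert, handling the same message twice leaves the index in the same state, which is what makes "at least once" delivery acceptable.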