fatcat - [no description]

	Commit message	Author	Age	Files	Lines
*	python: isort everything	Bryan Newbold	2021-11-02	3	-8/+17
\|
*	lint: simple, safe inline lint fixes	Bryan Newbold	2021-11-02	1	-5/+5
\| \| \| \|	'==' vs 'is'; 'not a in b' vs 'a not in b'; etc
*	entity transforms: add basic type annotations	Bryan Newbold	2021-11-02	1	-7/+19
\|
*	re-fmt all the fatcat_tools __init__ files for readability	Bryan Newbold	2021-11-02	1	-4/+14
\|
*	small python tweaks for annotations, imports	Bryan Newbold	2021-11-02	1	-1/+1
\|
*	try some type annotations	Bryan Newbold	2021-11-02	1	-6/+6
\|
*	Merge branch 'bnewbold-import-fileset'	Bryan Newbold	2021-11-02	1	-1/+15
\|\
\| *	ingest: handle datasets, components, other ingest types	Bryan Newbold	2021-10-14	1	-1/+15
\| \|
* \|	access: populate thumbnail_url for PDFs	Bryan Newbold	2021-10-18	1	-3/+9
\|/
*	python: implement ES schema changes	Bryan Newbold	2021-10-13	1	-4/+17
\|
*	refs: generalize web endpoints; JSON content negotiation; openlibrary inbound view; etc	Bryan Newbold	2021-07-23	1	-0/+2
\|
*	remove unused imports (lint)	Bryan Newbold	2021-07-23	1	-1/+1
\|
*	partial access options transform for releases	Bryan Newbold	2021-07-23	1	-0/+58
\|
*	more consistent and defensive lower-casing of DOIs	Bryan Newbold	2021-06-23	1	-2/+2
\| \| \| \| \| \| \|	After noticing more upper/lower ambiguity in production. In particular, we have some old ingest requests in sandcrawler DB, which get re-submitted/re-tried, which have capitalized DOIs in the link source id field.
*	small python lint fixes (no behavior change)	Bryan Newbold	2021-05-25	1	-1/+1
\|
*	ingest: add per-container ingest type overrides	Bryan Newbold	2021-05-21	1	-1/+17
\|
*	transforms: fix 'display_ame' typo	Bryan Newbold	2021-04-19	1	-2/+2
\|
*	prefer contrib.creator.display_name over contrib.raw_name	Bryan Newbold	2021-04-12	2	-4/+7
\| \| \| \| \| \| \| \|	These will be getting updates from ORCID and are usually more complete and more correct for display, attribution, and search purposes. Might need to tweak fuzzycat code to handle multiple names at the verification stage.
*	ES schema updates: doc_index_ts as a str, not datetime	Bryan Newbold	2021-04-06	1	-4/+4
\| \| \| \| \|	The schema is a timestamp, but python needs to serialize as JSON, and doesn't do datetime automatically.
*	container search schema: preservation stats, new fields	Bryan Newbold	2021-04-06	1	-2/+18
\| \| \| \|	Includes transform code updates and partial test coverage.
*	release ES: add discipline field	Bryan Newbold	2021-04-06	1	-0/+2
\|
*	ES schemas: add doc_index_ts to all mappings	Bryan Newbold	2021-04-06	1	-0/+4
\|
*	elasticsearch: simple new dblp and doaj fields	Bryan Newbold	2021-01-20	1	-0/+4
\|
*	bug fix: is_preserved should always be bool	Bryan Newbold	2020-12-17	1	-2/+2
\|
*	fix indentation	Bryan Newbold	2020-12-16	1	-2/+2
\|
*	have release elasticsearch transform count webcaptures and filesets towards preservation	Bryan Newbold	2020-12-16	1	-26/+57
\| \| \| \| \| \| \| \| \| \| \| \| \|	These are simple/partial changes to have webcaptures and filesets show up in 'preservation', 'in_ia', and 'in_web' ES schema flags. A longer-term TODO is to update the ES schema to have more granular analytic flags. Also includes a small generalization refactor for URL object parsing into preservation status, shared across file+fileset+webcapture entity types (all have similar URL objects with url+rel fields).
*	small release_to_elasticsearch refactors	Bryan Newbold	2020-12-16	1	-7/+12
\| \| \| \| \| \| \|	These should have almost no change in behavior, but improve code quality. The one behavior change is counting ftp URLs as "in_web"
*	refactor release_to_elasticsearch transform	Bryan Newbold	2020-12-16	1	-131/+148
\| \| \| \| \| \| \| \| \| \| \| \|	This method was huge and monolithic. This commit splits out the content and container specific sections into helper functions to make it more manageable. This involved refactoring to make many flags ("is_" and "in_") part of the output dict through the entire code path, allowing simple update() calls on the dict. Noting that in the future we should refactor to use a type-annotated class for the elasticsearch output object, perhaps something auto-generated from the ES schema itself (JSON files).
*	if a release has DOAJ article id, count as OA	Bryan Newbold	2020-11-19	1	-0/+3
\|
*	ingest tool: support for setting ingest type	Bryan Newbold	2020-11-06	1	-6/+6
\|
*	elastic transform: more preservation keepers	Bryan Newbold	2020-10-08	1	-1/+2
\|
*	release ES transform tweaks	Bryan Newbold	2020-08-07	1	-3/+5
\| \| \| \| \| \| \| \|	pass-through publisher_type from container extra metadata (ES field already existed; this is from newer chocula metadata); count arxiv and PMCID papers which haven't been crawled (by IA) as "dark", not "bright"
*	basic toml transform helper	Bryan Newbold	2020-07-30	2	-4/+20
\|
*	simplify in_kbart check statement	Bryan Newbold	2020-07-23	1	-1/+1
\| \| \| \|	Thanks @martin
*	make in_kbart transform inclusive of last year	Bryan Newbold	2020-07-23	1	-0/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Frequently when looking at preservation coverage of journals, the current year shows as "un-preserved" when in fact there is robust KBART (keepers, eg CLOCKSS/Portico) coverage. This is partially because we don't update containers with KBART year spans very frequently (which is on us), and partially because KBART reports are often a bit out of date (eg, they don't show coverage for the current year). For that matter, they probably take a few months to update the previous year as well, but that is a larger time span to fudge over. This patch means we will count Portico/LOCKSS/etc coverage for "last year" as coverage of publications dated "this year". Note that for this to be effective/correct, it is assumed that we will update containers with coverage year spans at least once a year, and that we will re-index all releases at least once a year.
*	lint (flake8) tool python files	Bryan Newbold	2020-07-01	4	-18/+10
\|
*	ES schema: add best_url to file schema	Bryan Newbold	2020-06-04	1	-0/+12
\| \| \| \| \| \| \| \| \|	This will increase index size (URLs are often long in our corpus, and we have many file entities), but seems worth it. Initially added `ia_url` as a second field, guaranteed to always be an *.archive.org URL, but `best_url` defaults to that anyways so didn't seem worthwhile.
*	improve citeproc/CSL web interface	Bryan Newbold	2020-03-25	1	-6/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This tries to show the citeproc (bibtex, MLA, CSL-JSON) options for more releases, and not show the links when they would break. The primary motivation here is to work around two exceptions being thrown in prod every day (according to sentry): `KeyError: 'role'` and `ValueError: CLS requries some surname (family name)`. I'm guessing these are mostly coming from crawlers following the citeproc links on release landing pages.
*	Merge branch 'bnewbold-elastic-v03b'	Bryan Newbold	2020-02-26	2	-46/+198
\|\
\| *	improve is_oa flag accuracy	Bryan Newbold	2020-02-26	1	-8/+4
\| \| \| \| \| \| \| \| \| \| \| \|	Particularly, the ezb=green match seems mostly incorrect. Note that pmcid being assigned could still be in an embargo window?
\| *	ES container last tweaks	Bryan Newbold	2020-02-26	1	-0/+3
\| \|
\| *	ES release: last minor tweaks	Bryan Newbold	2020-02-26	1	-2/+2
\| \|
\| *	ES files: don't remove archive.org domains/hosts	Bryan Newbold	2020-02-07	1	-5/+0
\| \|
\| *	ES releases: host/domain fixes	Bryan Newbold	2020-01-31	1	-2/+2
\| \|
\| *	fix release es transform missing 'issue'	Bryan Newbold	2020-01-30	1	-0/+1
\| \|
\| *	add upper-case work-around from kibana map join	Bryan Newbold	2020-01-30	1	-0/+1
\| \|
\| *	tweak file ES archive.org domain tracking	Bryan Newbold	2020-01-30	1	-0/+6
\| \|
\| *	implement host+domain parsing for file ES transform	Bryan Newbold	2020-01-30	1	-9/+5
\| \|
\| *	fix ES file schema plural field names	Bryan Newbold	2020-01-29	1	-4/+3
\| \|
\| *	elastic schema fixes	Bryan Newbold	2020-01-29	1	-0/+5
\| \|