fatcat - [no description]

	Commit message	Author	Age	Files	Lines
*	ingest fileset fixes, and some test coverage	Bryan Newbold	2022-03-23	2	-13/+30
\|
*	dataset ingest: JSON object fixes	Bryan Newbold	2022-03-22	1	-5/+5
\|
*	Merge branch 'bnewbold-container-web' into 'master'	bnewbold	2022-03-10	3	-2/+185
\|\ \| \| \| \| \| \| \| \|	container web interface improvements. See merge request webgroup/fatcat!140
\| *	move container_status ES query code from fatcat_web to fatcat_tools	Bryan Newbold	2022-02-09	3	-2/+185
\| \| \| \| \| \| \| \| \| \| \| \|	The main motivation is to never have fatcat_tools import from fatcat_web, only vice versa. Some code in fatcat_tools needs container stats, so starting with that code path (plus some generic helpers).
* \|	entity updates: don't try to ingest arxiv DOIs (for now)	Bryan Newbold	2022-02-28	1	-0/+2
\| \|
* \|	datacite importer: skip container_id for some repository sources	Bryan Newbold	2022-02-09	1	-0/+34
\|/
*	doaj importer: TODO note to skip some larger publishers	Bryan Newbold	2022-02-09	1	-0/+4
\|
*	container ES transform: include old extra.issne/p fields	Bryan Newbold	2022-02-03	1	-1/+4
\| \| \| \| \|	These were removed prematurely. Not all containers have been updated to use these fields yet.
*	Merge branch 'bnewbold-file-es' into 'master'	bnewbold	2022-01-21	3	-4/+38
\|\ \| \| \| \| \| \| \| \|	File entity elasticsearch index worker. See merge request webgroup/fatcat!136
\| *	entity worker: expand creators in release entities	Bryan Newbold	2021-12-15	1	-1/+1
\| \|
\| *	small default config typo fixes for elasticsearch workers	Bryan Newbold	2021-12-15	1	-2/+2
\| \|
\| *	file elasticsearch index worker	Bryan Newbold	2021-12-15	2	-1/+35
\| \|
* \|	crossref importer: skip affiliations lacking 'name'	Bryan Newbold	2021-12-15	1	-0/+3
\|/ \| \| \|	Relatedly, we should start handling ROR affiliations in contribs soon.
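The skip described in the crossref importer commit above can be sketched roughly as follows. This is a minimal illustration, not the importer's actual code: the function name `affiliation_names` is hypothetical, and the only assumption about the input is that a contributor record carries an `affiliation` list of objects that may or may not have a `name` key.

```python
from typing import Any, Dict, List


def affiliation_names(contrib: Dict[str, Any]) -> List[str]:
    """Collect usable affiliation names from one contributor record.

    Hypothetical helper: entries without a non-empty 'name' are skipped
    instead of raising a KeyError.
    """
    names: List[str] = []
    for aff in contrib.get("affiliation") or []:
        # Tolerate entries that are not dicts or that lack 'name'
        name = aff.get("name") if isinstance(aff, dict) else None
        if not isinstance(name, str) or not name.strip():
            continue
        names.append(name.strip())
    return names
```

An entry that only carries an identifier (for example a ROR id, as the commit note anticipates) is simply ignored by this sketch.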
*	mergers: fix typo in env var name	Bryan Newbold	2021-12-07	3	-3/+3
\|
*	ES container schema: add 'sim_pubid' and `ia_sim_collection` fields	Bryan Newbold	2021-12-03	1	-0/+2
\|
*	ES transform: remove prototype microfilm links	Bryan Newbold	2021-12-03	1	-20/+0
\| \| \| \|	This ended up being a feature in scholar.archive.org, not fatcat.
*	chocula importer: handle not-upper-case ISSNs	Bryan Newbold	2021-11-30	1	-2/+6
\|
*	chocula importer: handle broken ISSNs in extra metadata	Bryan Newbold	2021-11-30	1	-2/+7
\|
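The two ISSN fixes above (not-upper-case ISSNs and broken ISSNs in extra metadata) suggest a normalize-then-validate step. The sketch below is one way that could look; `normalize_issn` and `clean_extra_issns` are hypothetical names, and the format check (four digits, hyphen, three digits, then a digit or `X`) is an assumption about what counts as "broken", not the importer's actual rule. The `issne` and `issnp` keys are taken from the commit messages in this log.

```python
import re
from typing import Any, Dict, Optional

# NNNN-NNNC, where the final check character may be a digit or 'X'
ISSN_PATTERN = re.compile(r"^\d{4}-\d{3}[\dX]$")


def normalize_issn(raw: Any) -> Optional[str]:
    """Upper-case and validate an ISSN; return None if it looks broken."""
    if not isinstance(raw, str):
        return None
    issn = raw.strip().upper()
    if not ISSN_PATTERN.match(issn):
        return None
    return issn


def clean_extra_issns(extra: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of 'extra' with issne/issnp normalized or dropped."""
    cleaned = dict(extra)
    for key in ("issne", "issnp"):
        if key in cleaned:
            issn = normalize_issn(cleaned[key])
            if issn is None:
                # Drop broken values rather than propagate them
                del cleaned[key]
            else:
                cleaned[key] = issn
    return cleaned
```

This sketch checks only the shape of the string; it does not verify the ISSN check digit.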
*	chocula importer: tweak counting, conditions for doing updates	Bryan Newbold	2021-11-30	1	-15/+7
\|
*	chocula importer: move issne/issnp 'extra' to top-level fields if doing updates	Bryan Newbold	2021-11-30	1	-0/+6
\|
*	chocula: don't do name cleanups in importer	Bryan Newbold	2021-11-30	1	-8/+2
\| \| \| \|	This kind of cleanup should be done in 'chocula' instead.
*	container merger: fix bug with filtering by release count	Bryan Newbold	2021-11-30	1	-13/+15
\| \| \| \| \|	Also apply the "human edit" and "release count" checks only to the dupe (to-be-redirected) idents.
*	release merger: same editgroup_id fixes as for file and container mergers	Bryan Newbold	2021-11-24	1	-1/+5
\|
*	container merger: fixes from QA testing	Bryan Newbold	2021-11-24	1	-8/+13
\|
*	mergers: don't try to accept empty editgroups in dry-run-mode	Bryan Newbold	2021-11-24	1	-2/+4
\|
*	ES release transform: handle redirected containers better	Bryan Newbold	2021-11-24	1	-1/+1
\| \| \| \| \|	Despite the inline comment, we were not actually grabbing the "redirected" ident correctly, meaning some counts would not be accurate.
*	container merger: defer allocation of editgroup_id; and dummy code path	Bryan Newbold	2021-11-24	1	-1/+5
\|
*	initial implementation of container merger	Bryan Newbold	2021-11-24	1	-0/+237
\|
*	file merger: allocate editgroup id later in 'merge' process	Bryan Newbold	2021-11-24	1	-1/+5
\| \| \| \| \|	The motivation is to avoid creating empty editgroups in dry-run mode, and when all entities are "skipped"
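The deferred allocation described above can be illustrated with a small sketch. Everything here is hypothetical (the class name, the `create_editgroup` callable, and the simplified `merge` signature); the point is only that the editgroup is created on first real use, so dry-run mode and all-skipped batches never create an empty one.

```python
from typing import Callable, List, Optional


class LazyEditgroupMerger:
    """Hypothetical merger that allocates its editgroup lazily."""

    def __init__(self, create_editgroup: Callable[[], str], dry_run: bool = False):
        self._create_editgroup = create_editgroup
        self.dry_run = dry_run
        self.editgroup_id: Optional[str] = None

    def _ensure_editgroup(self) -> str:
        # Allocate only once, and only when an edit is about to be made
        if self.editgroup_id is None:
            self.editgroup_id = self._create_editgroup()
        return self.editgroup_id

    def merge(self, primary: str, dupes: List[str]) -> int:
        """Return the number of entities that would be (or were) redirected."""
        to_redirect = [d for d in dupes if d != primary]
        if not to_redirect:
            return 0  # everything skipped: no editgroup needed
        if self.dry_run:
            return len(to_redirect)  # report only: no editgroup needed
        editgroup_id = self._ensure_editgroup()
        # A real merger would submit redirect edits under editgroup_id here.
        assert editgroup_id is not None
        return len(to_redirect)
```

With this shape, a later "accept editgroup" step can check `editgroup_id is None` and skip itself, which matches the dry-run fix mentioned elsewhere in this log.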
*	Merge branch 'bnewbold-mergers' into 'master'	bnewbold	2021-11-25	4	-0/+640
\|\ \| \| \| \| \| \| \| \|	entity mergers framework. See merge request webgroup/fatcat!133
\| *	mergers common: remove inaccurate comment	Bryan Newbold	2021-11-24	1	-2/+0
\| \| \| \| \| \| \| \|	Caught in review, thanks miku
\| *	file merger: add content_scope to list of merged fields	Bryan Newbold	2021-11-24	1	-1/+1
\| \|
\| *	release merger: some progress, but also disable (not complete)	Bryan Newbold	2021-11-23	1	-12/+72
\| \|
\| *	file merges: fixes from testing in QA	Bryan Newbold	2021-11-23	1	-14/+23
\| \|
\| *	mergers: small tweaks	Bryan Newbold	2021-11-23	2	-3/+3
\| \|
\| *	mergers: remove entity mergers from __init__ (to work around warning)	Bryan Newbold	2021-11-23	1	-2/+0
\| \|
\| *	initial file merger, with tests	Bryan Newbold	2021-11-23	1	-0/+228
\| \|
\| *	mergers: fmt, lint, refactors	Bryan Newbold	2021-11-23	3	-96/+200
\| \| \| \| \| \| \| \| \| \|	This old merger code is from an old branch and needed significant cleanup.
\| *	first iteration of mergers	Bryan Newbold	2021-11-23	3	-0/+243
\| \|
* \|	codespell fixes in python code (comments)	Bryan Newbold	2021-11-24	2	-3/+3
\|/
*	content_scope: include in file ES schema and transform	Bryan Newbold	2021-11-17	1	-0/+1
\|
*	Merge branch 'bnewbold-import-refactors' into 'master'	bnewbold	2021-11-11	18	-1462/+811
\|\
\| \|	import refactors and deprecations
\| \|
\| \|	Some of these are from old stale branches (the datacite subject metadata patch), but most are from yesterday and today. Sort of a hodge-podge, but the general theme is getting around to deferred cleanups and refactors specific to importer code before making some behavioral changes.
\| \|
\| \|	The Datacite-specific stuff could use review here.
\| \|
\| \|	Remove unused/deprecated/dead code:
\| \|	- cdl_dash_dat and wayback_static importers, which were for specific early example entities and have been superseded by other importers
\| \|	- "extid map" sqlite3 feature from several importers, was only used for initial bulk imports (and maybe should not have been used)
\| \|
\| \|	Refactors:
\| \|	- moved a number of large datastructures out of importer code and into a dedicated static file (`biblio_lookup_tables.py`). Didn't move all, just the ones that were either generic or very large (making it hard to read code)
\| \|	- shuffled around relative imports and some function names ("clean_str" vs. "clean")
\| \|
\| \|	Some actual behavioral changes:
\| \|	- remove some Datacite-specific license slugs
\| \|	- stop trying to fix double-slashes in DOIs, that was causing more harm than help (some DOIs do actually have double-slashes!)
\| \|	- remove some excess metadata from datacite 'extra' fields
\| *	improve lookup_license_slug helper and lookup table	Bryan Newbold	2021-11-10	2	-56/+62
\| \|
\| *	refactor importer metadata tables into separate file; move some helpers around	Bryan Newbold	2021-11-10	10	-702/+682
\| \|	- MAX_ABSTRACT_LENGTH set in a single place (importer common)
\| \|	- merge datacite license slug table into common table, removing some TDM-specific licenses (which do not apply in the context of preserving the full work)
\| *	importers: refactor imports of clean() and other normalization helpers	Bryan Newbold	2021-11-10	12	-95/+104
\| \|
\| *	remove cdl_dash_dat and wayback_static importers	Bryan Newbold	2021-11-10	3	-510/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Cleaning out dead code. These importers were used to create demonstration fileset and webcapture entities early in development. They have been replaced by the fileset and webcapture ingest importers.
\| *	datacite import: store less subject metadata	Bryan Newbold	2021-11-10	1	-1/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Many of these 'subject' objects have the equivalent of several lines of text, with complex URLs that don't compress well. I think it is fine we have included these thus far instead of parsing more deeply, but going forward I don't think this nested 'extra' metadata is worth the database space.
\| *	importers: use clean_doi() in many more (all?) importers	Bryan Newbold	2021-11-09	6	-12/+29
\| \|
\| *	clean_doi: stop mutating double-slash DOIs, except for 10.1037 prefix	Bryan Newbold	2021-11-09	1	-1/+2
\| \|
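The clean_doi change above (leave double-slash DOIs alone, except under the 10.1037 prefix) could be sketched as below. This is an illustration under stated assumptions, not the actual `clean_doi()` from fatcat: the function name `clean_doi_sketch`, the lower-casing, the list of stripped URL prefixes, and the validation regex are all choices made for this example.

```python
import re
from typing import Optional

# Loose shape check: "10.", a registrant code, a slash, then a non-empty suffix
DOI_PATTERN = re.compile(r"^10\.\d{3,6}/\S+$")

# Common resolver/scheme prefixes to strip (assumed list, for illustration)
URL_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def clean_doi_sketch(raw: Optional[str]) -> Optional[str]:
    """Normalize a DOI string, or return None if it does not look like a DOI."""
    if not raw:
        return None
    doi = raw.strip().lower()
    for prefix in URL_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Only the 10.1037 prefix has its leading double slash collapsed;
    # other DOIs may legitimately contain "//" and are left untouched.
    if doi.startswith("10.1037//"):
        doi = "10.1037/" + doi[len("10.1037//"):]
    if not DOI_PATTERN.match(doi):
        return None
    return doi
```

Keeping the special case narrow reflects the reasoning in the merge message: some DOIs really do contain double slashes, so a blanket rewrite does more harm than good.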
\| *	remove deprecated extid sqlite3 lookup table feature from importers	Bryan Newbold	2021-11-09	3	-160/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This was used during initial bulk imports, but is no longer used and could create serious metadata problems if used accidentally. In retrospect, it also made metadata provenance less transparent, and may have done more harm than good overall.