fatcat - [no description]

	Commit message	Author	Age	Files	Lines
*	ES: rename fatcat_ref.json to ref_schema.json for consistency; add to README	Bryan Newbold	2021-04-08	2	-1/+4
\|
*	release ES schema: fix typo with shard/replica configuration	Bryan Newbold	2021-04-08	1	-1/+1
\|
*	sitemaps: filter to releases with PDF fulltext (for now)	Bryan Newbold	2021-04-07	1	-0/+2
\|
*	container search schema: preservation stats, new fields	Bryan Newbold	2021-04-06	1	-8/+9
\|	Includes transform code updates and partial test coverage.
*	release ES: add discipline field	Bryan Newbold	2021-04-06	1	-0/+1
\|
*	ES schemas: add doc_index_ts to all mappings	Bryan Newbold	2021-04-06	5	-0/+9
\|
*	elasticsearch schema, docs, docker: update from ES 6.x to ES 7.x	Bryan Newbold	2021-04-06	7	-125/+24
\|	Including removing index document names (use '_doc' instead during transition)
*	add es draft schema for references	Martin Czygan	2021-03-30	1	-0/+106
\|
*	SQL dump timing note	Bryan Newbold	2021-03-10	1	-0/+3
\|
*	sql dump recent timing note	Bryan Newbold	2021-03-08	1	-1/+2
\|
*	elasticsearch: simple new dblp and doaj fields	Bryan Newbold	2021-01-20	1	-0/+3
\|
*	Merge branch 'bnewbold-ci-cleanups' into 'master'	bnewbold	2021-01-05	1	-5/+11
\|\	GitLab CI and docker base image cleanups. See merge request webgroup/fatcat!94
\| *	docker xenial: use get-pipenv.py to install pipenv et al	Bryan Newbold	2020-12-22	1	-5/+6
\| \|
\| *	docker xenial: switch to rust 1.43.0	Bryan Newbold	2020-12-22	1	-1/+1
\| \|
\| *	docker xenial: include python3.8	Bryan Newbold	2020-12-22	1	-1/+6
\| \|
* \|	update stats (post DOAJ and dblp imports)	Bryan Newbold	2020-12-29	2	-0/+47
\| \|
* \|	DOAJ import notes, and SQL/stats update	Bryan Newbold	2020-12-23	4	-0/+94
\|/
*	dblp: polish HTML scrape/extract pipeline	Bryan Newbold	2020-12-17	3	-3/+16
\|
*	dblp: script and notes on container metadata generation	Bryan Newbold	2020-12-17	4	-0/+134
\|
*	Merge pull request #65 from ibnesayeed/patch-1	bnewbold	2020-12-17	1	-1/+1
\|\	Improve status counting efficiency
\| *	Improve status counting efficiency	Sawood Alam	2020-12-17	1	-1/+1
\| \|	When the input is large with a small number of unique items to be counted, counting as we go is a linear and more efficient approach than sorting and unique counting.
* \|	Revert "docker xenial base image: include python3.8"	Bryan Newbold	2020-12-11	1	-6/+1
\| \|	This reverts commit 91628426678a635f26cf992dbd5caedb4a3ae24b.
* \|	docker xenial base image: include python3.8	Bryan Newbold	2020-12-11	1	-1/+6
\| \|
* \|	docker: how to push to dockerhub	Bryan Newbold	2020-12-11	1	-0/+4
\|/
*	update database/table stats	Bryan Newbold	2020-10-12	2	-0/+48
\|
*	update stats snapshot	Bryan Newbold	2020-09-03	2	-0/+47
\|
*	sitemap fixes from testing	Bryan Newbold	2020-08-19	3	-4/+15
\|
*	iterate on sitemap generation	Bryan Newbold	2020-08-19	6	-7/+119
\|
*	initial sitemap.xml notes/template	Bryan Newbold	2020-08-19	2	-0/+29
\|
*	include releases_by_work in ident tarball	Bryan Newbold	2020-08-04	1	-1/+2
\|
*	update SQL dump docs with group-by-work command (by default)	Bryan Newbold	2020-08-04	1	-1/+1
\|
*	WIP: sorted release ident dumps	Bryan Newbold	2020-08-04	1	-0/+16
\|
*	update table/database size stats	Bryan Newbold	2020-07-22	2	-0/+48
\|
*	commit example of an elasticsearch SQL query	Bryan Newbold	2020-07-01	1	-0/+8
\|
*	commit old README about bulk downloads	Bryan Newbold	2020-07-01	1	-0/+40
\|
*	ES schema: add best_url to file schema	Bryan Newbold	2020-06-04	1	-0/+1
\|	This will increase index size (URLs are often long in our corpus, and we have many file entities), but seems worth it. Initially added `ia_url` as a second field, guaranteed to always be an *.archive.org URL, but `best_url` defaults to that anyway, so it didn't seem worthwhile.
*	sql: really don't double-dump requests	Bryan Newbold	2020-05-26	1	-1/+0
\|	I guess we were dumping 3 times originally; already had an earlier commit that removed one row from this README (that I copypaste to CLI every time)
*	2020-05-26 prod database size and stats	Bryan Newbold	2020-05-26	2	-0/+48
\|
*	update prod stats	Bryan Newbold	2020-04-17	7	-0/+149
\|
*	Add missing packages to Dockerfile and CI file	Bryan Newbold	2020-04-16	1	-1/+1
\|
*	test-base Dockerfile	Bryan Newbold	2020-04-16	2	-0/+51
\|	Used to create bnewbold/fatcat-test-base image
*	update bulk export instructions	Bryan Newbold	2020-04-07	1	-4/+2
\|	- don't do expanded and regular release dumps
\|	- default to sqldump_public for item name (as that is common-case)
*	sql_dumps: stop doing redundant release dumps	Bryan Newbold	2020-04-01	1	-1/+3
\|
*	bulk exports README different from SQL README	Bryan Newbold	2020-03-17	1	-1/+1
\|
*	ES README: really need to limit to 1k esbulk batches	Bryan Newbold	2020-02-26	1	-3/+3
\|
*	Merge branch 'bnewbold-elastic-v03b'	Bryan Newbold	2020-02-26	5	-61/+203
\|\
\| *	update ES transform README	Bryan Newbold	2020-02-26	1	-2/+3
\| \|	- smaller batch sizes to prevent esbulk errors
\| \|	- file transform/index
\| *	ES container last tweaks	Bryan Newbold	2020-02-26	1	-3/+4
\| \|
\| *	ES release: last minor tweaks	Bryan Newbold	2020-02-26	1	-3/+5
\| \|
\| *	release schema: do doc_value on DOIs	Bryan Newbold	2020-02-13	1	-1/+1
\| \|	Because DOIs are pseudo-structured (prefix, and often structure within the publisher-controlled area), I suspect we will in fact want to do analytics over these strings.