fatcat - [no description]

	Commit message	Author	Age	Files	Lines
*	ISSN-L dupes check: output all matches	Bryan Newbold	2021-11-17	1	-1/+1
\|
*	sitemap generation improvements	Bryan Newbold	2021-11-10	2	-1/+2
\|
*	elasticsearch schema changes	Bryan Newbold	2021-10-13	2	-3/+13
\|
*	update stats	Bryan Newbold	2021-10-11	3	-0/+48
\|
*	sql_dumps: set collection at upload time	Bryan Newbold	2021-09-02	1	-2/+5
\|
*	prod stats snapshot	Bryan Newbold	2021-08-06	4	-0/+47
\|
*	stats snapshot (2021-06-23)	Bryan Newbold	2021-06-23	2	-0/+47
\|
*	SQL dumps: more pigz (vs. gzip) for speed	Bryan Newbold	2021-06-17	1	-2/+2
\|
*	fatcat_ref ES schema: more doc_values; source_year not source_release_year	Bryan Newbold	2021-06-17	1	-5/+2
\|
*	update dblp pre-import notes and pipenv python version (3.8)	Bryan Newbold	2021-06-03	2	-6/+11
\|
*	elasticsearch ref schema: 6 shards, not 12	Bryan Newbold	2021-05-18	1	-1/+1
\|
*	fix 'colected' typos	Bryan Newbold	2021-04-13	1	-1/+1
\|	Thanks for the catch martin
*	update elasticsearch bootstrap indexing notes	Bryan Newbold	2021-04-09	1	-8/+16
\|
*	ES: rename fatcat_ref.json to ref_schema.json for consistency; add to README	Bryan Newbold	2021-04-08	2	-1/+4
\|
*	release ES schema: fix typo with shard/replica configuration	Bryan Newbold	2021-04-08	1	-1/+1
\|
*	sitemaps: filter to releases with PDF fulltext (for now)	Bryan Newbold	2021-04-07	1	-0/+2
\|
*	container search schema: preservation stats, new fields	Bryan Newbold	2021-04-06	1	-8/+9
\|	Includes transform code updates and partial test coverage.
*	release ES: add discipline field	Bryan Newbold	2021-04-06	1	-0/+1
\|
*	ES schemas: add doc_index_ts to all mappings	Bryan Newbold	2021-04-06	5	-0/+9
\|
*	elasticsearch schema, docs, docker: update from ES 6.x to ES 7.x	Bryan Newbold	2021-04-06	7	-125/+24
\|	Including removing index document names (use '_doc' instead during transition)
*	add es draft schema for references	Martin Czygan	2021-03-30	1	-0/+106
\|
*	SQL dump timing note	Bryan Newbold	2021-03-10	1	-0/+3
\|
*	sql dump recent timing note	Bryan Newbold	2021-03-08	1	-1/+2
\|
*	elasticsearch: simple new dblp and doaj fields	Bryan Newbold	2021-01-20	1	-0/+3
\|
*	Merge branch 'bnewbold-ci-cleanups' into 'master'	bnewbold	2021-01-05	1	-5/+11
\|\	Gitlab CI and docker base image cleanups. See merge request webgroup/fatcat!94
\| *	docker xenial: use get-pipenv.py to install pipenv et al	Bryan Newbold	2020-12-22	1	-5/+6
\| \|
\| *	docker xenial: switch to rust 1.43.0	Bryan Newbold	2020-12-22	1	-1/+1
\| \|
\| *	docker xenial: include python3.8	Bryan Newbold	2020-12-22	1	-1/+6
\| \|
* \|	update stats (post DOAJ and dblp imports)	Bryan Newbold	2020-12-29	2	-0/+47
\| \|
* \|	DOAJ import notes, and SQL/stats update	Bryan Newbold	2020-12-23	4	-0/+94
\|/
*	dblp: polish HTML scrape/extract pipeline	Bryan Newbold	2020-12-17	3	-3/+16
\|
*	dblp: script and notes on container metadata generation	Bryan Newbold	2020-12-17	4	-0/+134
\|
*	Merge pull request #65 from ibnesayeed/patch-1	bnewbold	2020-12-17	1	-1/+1
\|\	Improve status counting efficiency
\| *	Improve status counting efficiency	Sawood Alam	2020-12-17	1	-1/+1
\| \|	When the input is large but contains only a small number of unique items to be counted, counting as we go is linear and more efficient than sorting and then counting unique items.
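The trade-off described in this commit message can be sketched in Python. This is an illustration only, not the code changed by the commit: the function names and the sample status strings are hypothetical, and the original change may well have been made in a different language or tool.

```python
from collections import Counter
from itertools import groupby


def count_by_sorting(items):
    """Sort first, then count runs of equal items: O(n log n) time."""
    return {key: sum(1 for _ in run) for key, run in groupby(sorted(items))}


def count_as_we_go(items):
    """Single pass with a hash map: O(n) time, memory grows only with
    the number of unique items, not with the input size."""
    return dict(Counter(items))


# Hypothetical sample input: many rows, few distinct status values.
statuses = ["success", "no-capture", "success", "error", "success"]

# Both approaches produce the same counts; only the cost differs.
assert count_by_sorting(statuses) == count_as_we_go(statuses)
```

Because `Counter` accepts any iterable, the single-pass version also works on a stream (for example, lines read from a file) without holding the whole input in memory, which is where the saving matters for large inputs.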
* \|	Revert "docker xenial base image: include python3.8"	Bryan Newbold	2020-12-11	1	-6/+1
\| \|	This reverts commit 91628426678a635f26cf992dbd5caedb4a3ae24b.
* \|	docker xenial base image: include python3.8	Bryan Newbold	2020-12-11	1	-1/+6
\| \|
* \|	docker: how to push to dockerhub	Bryan Newbold	2020-12-11	1	-0/+4
\|/
*	update database/table stats	Bryan Newbold	2020-10-12	2	-0/+48
\|
*	update stats snapshot	Bryan Newbold	2020-09-03	2	-0/+47
\|
*	sitemap fixes from testing	Bryan Newbold	2020-08-19	3	-4/+15
\|
*	iterate on sitemap generation	Bryan Newbold	2020-08-19	6	-7/+119
\|
*	initial sitemap.xml notes/template	Bryan Newbold	2020-08-19	2	-0/+29
\|
*	include releases_by_work in ident tarball	Bryan Newbold	2020-08-04	1	-1/+2
\|
*	update SQL dump docs with group-by-work command (by default)	Bryan Newbold	2020-08-04	1	-1/+1
\|
*	WIP: sorted release ident dumps	Bryan Newbold	2020-08-04	1	-0/+16
\|
*	update table/database size stats	Bryan Newbold	2020-07-22	2	-0/+48
\|
*	commit example of an elasticsearch SQL query	Bryan Newbold	2020-07-01	1	-0/+8
\|
*	commit old README about bulk downloads	Bryan Newbold	2020-07-01	1	-0/+40
\|
*	ES schema: add best_url to file schema	Bryan Newbold	2020-06-04	1	-0/+1
\|	This will increase index size (URLs are often long in our corpus, and we have many file entities), but seems worth it. Initially added `ia_url` as a second field, guaranteed to always be an *.archive.org URL, but `best_url` defaults to that anyways so didn't seem worthwhile.
*	sql: really don't double-dump requests	Bryan Newbold	2020-05-26	1	-1/+0
\|	I guess we were dumping 3 times originally; already had an earlier commit that removed one row from this README (that I copypaste to CLI every time)