fatcat - [no description]

	Commit message (Collapse)	Author	Age	Files	Lines
...
* \|	Merge branch 'martin-kafka-bs4-import' into 'master'	Martin Czygan	2020-03-10	10	-43/+428
\|\ \	pubmed and arxiv harvest preparations
\| \| \|	See merge request webgroup/fatcat!28
\| * \|	common: use smaller batch size since XML parsing may be slow	Martin Czygan	2020-03-10	1	-1/+1
\| \| \|	Address kafka tradeoff between long and short time-outs. Shorter time-outs would facilitate
\| \| \|	> consumer group re-balances and other consumer group state changes [...] in a reasonable human time-frame.
\| * \|	pubmed: log to stderr	Martin Czygan	2020-03-10	1	-1/+1
\| \| \|
\| * \|	pubmed: move mapping generation out of fetch_date	Martin Czygan	2020-03-10	2	-7/+10
\| \| \|	* fetch_date will fail on missing mapping
\| \| \|	* adjust tests (test will require access to pubmed ftp)
\| * \|	harvest: fix imports from HarvestPubmedWorker cleanup	Martin Czygan	2020-03-10	2	-4/+4
\| \| \|
\| * \|	pubmed: citations is a bit more precise	Martin Czygan	2020-03-09	1	-1/+1
\| \| \|	> Each day, NLM produces update files that include new, revised and deleted citations.
\| \| \|	-- ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/README.txt
\| * \|	pubmed: we sync from FTP	Martin Czygan	2020-03-09	1	-1/+1
\| \| \|
\| * \|	oaipmh: HarvestPubmedWorker obsoleted by PubmedFTPWorker	Martin Czygan	2020-03-09	1	-34/+0
\| \| \|
\| * \|	fatcat_import: address potential hanging, if stdin is empty	Martin Czygan	2020-03-09	1	-0/+2
\| \| \|
\| * \|	more pubmed adjustments	Martin Czygan	2020-02-22	6	-71/+197
\| \| \|	* regenerate map in continuous mode
\| \| \|	* add tests
\| * \|	pubmed ftp: fix url	Martin Czygan	2020-02-19	1	-4/+6
\| \| \|
\| * \|	pubmed ftp harvest and KafkaBs4XmlPusher	Martin Czygan	2020-02-19	6	-21/+307
\| \| \|	* add PubmedFTPWorker
\| \| \|	* utils are currently stored alongside pubmed (e.g. ftpretr, xmlstream) but may live elsewhere, as they are more generic
\| \| \|	* add KafkaBs4XmlPusher
* \| \|	add --force-crawl flag to ingest tool	Bryan Newbold	2020-03-02	1	-0/+5
\| \|/
\|/\|
* \|	pipenv: lock authlib to less than v0.13; rebuild lock file	Bryan Newbold	2020-02-28	2	-112/+109
\| \|
* \|	ES README: really need to limit to 1k esbulk batches	Bryan Newbold	2020-02-26	1	-3/+3
\| \|
* \|	Merge branch 'bnewbold-elastic-v03b'	Bryan Newbold	2020-02-26	16	-257/+674
\|\ \
\| * \|	improve is_oa flag accuracy	Bryan Newbold	2020-02-26	2	-10/+6
\| \| \|	Particularly, the ezb=green match seems mostly incorrect. Note that pmcid being assigned could still be in an embargo window?
\| * \|	update ES transform README	Bryan Newbold	2020-02-26	1	-2/+3
\| \| \|	- smaller batch sizes to prevent esbulk errors
\| \| \|	- file transform/index
\| * \|	fix fatcat_transform state filters	Bryan Newbold	2020-02-26	1	-4/+4
\| \| \|
\| * \|	bulk ES transform: skip non-active entities	Bryan Newbold	2020-02-26	1	-0/+8
\| \| \|
\| * \|	ES container last tweaks	Bryan Newbold	2020-02-26	2	-3/+7
\| \| \|
\| * \|	ES release: last minor tweaks	Bryan Newbold	2020-02-26	2	-5/+7
\| \| \|
\| * \|	ES updates: fix tests to accept archive.org in host/domain	Bryan Newbold	2020-02-14	1	-2/+3
\| \| \|
\| * \|	release schema: do doc_value on DOIs	Bryan Newbold	2020-02-13	1	-1/+1
\| \| \|	Because DOIs are pseudo-structured (prefix, and often structure within the publisher-controlled area), I suspect we will in fact be wanting to do analytics over these strings.
\| * \|	ES files: don't remove archive.org domains/hosts	Bryan Newbold	2020-02-07	1	-5/+0
\| \| \|
\| * \|	ES release: actually do want doc_values for work_id	Bryan Newbold	2020-02-05	1	-1/+1
\| \| \|	Eg, for fast "unique count"
\| * \|	fix axiv/arxiv typo in release schema	Bryan Newbold	2020-02-04	1	-1/+1
\| \| \|
\| * \|	ES release schema: fix typo	Bryan Newbold	2020-01-31	1	-1/+1
\| \| \|
\| * \|	ES releases: host/domain fixes	Bryan Newbold	2020-01-31	2	-2/+5
\| \| \|
\| * \|	pipenv: lock zipp version to work around python3.6 requirement	Bryan Newbold	2020-01-30	2	-7/+20
\| \| \|
\| * \|	fix release es transform missing 'issue'	Bryan Newbold	2020-01-30	1	-0/+1
\| \| \|
\| * \|	fix json typos in changelog schema	Bryan Newbold	2020-01-30	1	-2/+2
\| \| \|
\| * \|	add upper-case work-around from kibana map join	Bryan Newbold	2020-01-30	2	-0/+2
\| \| \|
\| * \|	JSON typo in release mapping	Bryan Newbold	2020-01-30	1	-1/+0
\| \| \|
\| * \|	ES schemas: make keywords case-insensitive by default	Bryan Newbold	2020-01-30	4	-66/+115
\| \| \|	But not applying asciifolding; don't see any need to do so?
\| * \|	tweak file ES archive.org domain tracking	Bryan Newbold	2020-01-30	2	-0/+7
\| \| \|
\| * \|	implement host+domain parsing for file ES transform	Bryan Newbold	2020-01-30	2	-13/+8
\| \| \|
\| * \|	pipenv: add tldextract (url parser) and update deps	Bryan Newbold	2020-01-30	2	-136/+159
\| \| \|
\| * \|	fix ES file schema plural field names	Bryan Newbold	2020-01-29	2	-5/+4
\| \| \|
\| * \|	new biblio-only general search	Bryan Newbold	2020-01-29	1	-2/+2
\| \| \|	The other fields are now "copy_to" the merged biblio field.
\| * \|	elastic schema fixes	Bryan Newbold	2020-01-29	3	-7/+12
\| \| \|
\| * \|	add country to v03b release schema	Bryan Newbold	2020-01-29	2	-0/+3
\| \| \|
\| * \|	update ES docs and proposal	Bryan Newbold	2020-01-29	2	-4/+6
\| \| \|
\| * \|	actually implement changelog transform	Bryan Newbold	2020-01-29	3	-19/+78
\| \| \|
\| * \|	fix some transform bugs, add some tests	Bryan Newbold	2020-01-29	6	-13/+48
\| \| \|
\| * \|	ES release schema updates	Bryan Newbold	2020-01-29	2	-28/+122
\| \| \|
\| * \|	container ES schema changes	Bryan Newbold	2020-01-29	2	-29/+38
\| \| \|
\| * \|	first implementation of ES file schema	Bryan Newbold	2020-01-29	4	-3/+115
\| \| \|	Includes a trivial test and transform, but not any workers or doc updates.
* \| \|	Merge branch 'bnewbold-more-ingest' into 'master'	bnewbold	2020-02-25	1	-1/+37
\|\ \ \	entity worker: ingest more Datacite releases; filter some out
\| \| \| \|	See merge request webgroup/fatcat!29
\| * \| \|	entity worker: ingest more releases	Bryan Newbold	2020-02-22	1	-1/+37
\| \| \| \|	If release is a dataset or image, don't do a pdf ingest request. If release is a datacite DOI, and release_type is a "document", crawl regardless of is_oa detection. This is mostly to crawl repositories (institutional or subject).
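
The filtering rule described in the "entity worker: ingest more releases" commit can be sketched roughly as follows. This is a minimal illustration, not the code from that commit: the function name `want_ingest`, the parameter names, and the contents of `DOCUMENT_TYPES` are all hypothetical, and the final fallback to the `is_oa` flag is an assumption inferred from the phrase "regardless of is_oa detection".

```python
from typing import Optional

# Hypothetical type sets; the actual lists used in fatcat are not shown in this log.
NON_PDF_TYPES = {"dataset", "image"}
DOCUMENT_TYPES = {"article-journal", "report", "thesis", "book", "chapter"}


def want_ingest(release_type: Optional[str],
                doi_registrar: Optional[str],
                is_oa: bool) -> bool:
    """Decide whether to emit a PDF ingest request (illustrative only)."""
    # Datasets and images are never sent for PDF ingest.
    if release_type in NON_PDF_TYPES:
        return False
    # Datacite DOIs that look like documents are crawled regardless of is_oa.
    if doi_registrar == "datacite" and release_type in DOCUMENT_TYPES:
        return True
    # Assumption: all other releases fall back to the is_oa flag.
    return is_oa
```

Under these assumptions, `want_ingest("report", "datacite", False)` would request a crawl, while `want_ingest("dataset", "datacite", True)` would not.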