Commit message | Author | Age | Files | Lines
* cleanup unused code in fatcat_harvest.py | Bryan Newbold | 2020-03-23 | 1 | -7/+0
* jalc: avoid meaningless pages values | Bryan Newbold | 2020-03-23 | 1 | -4/+8
* Merge branch 'bnewbold-datacite-year-limits' into 'master' | Martin Czygan | 2020-03-23 | 1 | -0/+7
|\
| * datacite: add year sanity restrictions | bnewbold | 2020-03-23 | 1 | -0/+7
|/
* notes on arxiv+pubmed backfill | Bryan Newbold | 2020-03-20 | 1 | -0/+37
* pubmed: handle multiple ReferenceList | Bryan Newbold | 2020-03-20 | 3 | -1/+222
* pubmed: update many more metadata fields | Bryan Newbold | 2020-03-19 | 1 | -0/+22
* crossref: skip stub OUP title | Bryan Newbold | 2020-03-19 | 1 | -0/+8
* ingest: always try some lancet journals | Bryan Newbold | 2020-03-19 | 1 | -0/+3
* Merge branch 'martin-lookup-by-identifier-issn-link' into 'master' | bnewbold | 2020-03-18 | 1 | -4/+3
|\
| * container lookup: link to issn portal search | Martin Czygan | 2020-03-18 | 1 | -4/+3
|/
* Merge branch 'bnewbold-update-stats' into 'master' | Martin Czygan | 2020-03-18 | 1 | -3/+3
|\
| * update front-page stats | Bryan Newbold | 2020-03-17 | 1 | -3/+3
|/
* bulk exports README different from SQL README | Bryan Newbold | 2020-03-17 | 1 | -1/+1
* Merge branch 'martin-kafka-bs4-import' into 'master' | Martin Czygan | 2020-03-10 | 10 | -43/+428
|\
| * common: use smaller batch size since XML parsing may be slow | Martin Czygan | 2020-03-10 | 1 | -1/+1
| * pubmed: log to stderr | Martin Czygan | 2020-03-10 | 1 | -1/+1
| * pubmed: move mapping generation out of fetch_date | Martin Czygan | 2020-03-10 | 2 | -7/+10
| * harvest: fix imports from HarvestPubmedWorker cleanup | Martin Czygan | 2020-03-10 | 2 | -4/+4
| * pubmed: citations is a bit more precise | Martin Czygan | 2020-03-09 | 1 | -1/+1
| * pubmed: we sync from FTP | Martin Czygan | 2020-03-09 | 1 | -1/+1
| * oaipmh: HarvestPubmedWorker obsoleted by PubmedFTPWorker | Martin Czygan | 2020-03-09 | 1 | -34/+0
| * fatcat_import: address potential hanging, if stdin is empty | Martin Czygan | 2020-03-09 | 1 | -0/+2
| * more pubmed adjustments | Martin Czygan | 2020-02-22 | 6 | -71/+197
| * pubmed ftp: fix url | Martin Czygan | 2020-02-19 | 1 | -4/+6
| * pubmed ftp harvest and KafkaBs4XmlPusher | Martin Czygan | 2020-02-19 | 6 | -21/+307
* | add --force-crawl flag to ingest tool | Bryan Newbold | 2020-03-02 | 1 | -0/+5
* | pipenv: lock authlib to less than v0.13; rebuild lock file | Bryan Newbold | 2020-02-28 | 2 | -112/+109
* | ES README: really need to limit to 1k esbulk batches | Bryan Newbold | 2020-02-26 | 1 | -3/+3
* | Merge branch 'bnewbold-elastic-v03b' | Bryan Newbold | 2020-02-26 | 16 | -257/+674
|\ \
| * | improve is_oa flag accuracy | Bryan Newbold | 2020-02-26 | 2 | -10/+6
| * | update ES transform README | Bryan Newbold | 2020-02-26 | 1 | -2/+3
| * | fix fatcat_transform state filters | Bryan Newbold | 2020-02-26 | 1 | -4/+4
| * | bulk ES transform: skip non-active entities | Bryan Newbold | 2020-02-26 | 1 | -0/+8
| * | ES container last tweaks | Bryan Newbold | 2020-02-26 | 2 | -3/+7
| * | ES release: last minor tweaks | Bryan Newbold | 2020-02-26 | 2 | -5/+7
| * | ES updates: fix tests to accept archive.org in host/domain | Bryan Newbold | 2020-02-14 | 1 | -2/+3
| * | release schema: do doc_value on DOIs | Bryan Newbold | 2020-02-13 | 1 | -1/+1
| * | ES files: don't remove archive.org domains/hosts | Bryan Newbold | 2020-02-07 | 1 | -5/+0
| * | ES release: actually do want doc_values for work_id | Bryan Newbold | 2020-02-05 | 1 | -1/+1
| * | fix axiv/arxiv typo in release schema | Bryan Newbold | 2020-02-04 | 1 | -1/+1
| * | ES release schema: fix typo | Bryan Newbold | 2020-01-31 | 1 | -1/+1
| * | ES releases: host/domain fixes | Bryan Newbold | 2020-01-31 | 2 | -2/+5
| * | pipenv: lock zipp version to work around python3.6 requirement | Bryan Newbold | 2020-01-30 | 2 | -7/+20
| * | fix release es transform missing 'issue' | Bryan Newbold | 2020-01-30 | 1 | -0/+1
| * | fix json typos in changelog schema | Bryan Newbold | 2020-01-30 | 1 | -2/+2
| * | add upper-case work-around from kibana map join | Bryan Newbold | 2020-01-30 | 2 | -0/+2
| * | JSON typo in release mapping | Bryan Newbold | 2020-01-30 | 1 | -1/+0
| * | ES schemas: make keywords case-insensitive by default | Bryan Newbold | 2020-01-30 | 4 | -66/+115
| * | tweak file ES archive.org domain tracking | Bryan Newbold | 2020-01-30 | 2 | -0/+7