path: root/extra
Commit message (Author, Age, Files, Lines)
* include releases_by_work in ident tarball (Bryan Newbold, 2020-08-04, 1 file, -1/+2)
|
* update SQL dump docs with group-by-work command (by default) (Bryan Newbold, 2020-08-04, 1 file, -1/+1)
|
* WIP: sorted release ident dumps (Bryan Newbold, 2020-08-04, 1 file, -0/+16)
|
* update table/database size stats (Bryan Newbold, 2020-07-22, 2 files, -0/+48)
|
* commit example of an elasticsearch SQL query (Bryan Newbold, 2020-07-01, 1 file, -0/+8)
|
* commit old README about bulk downloads (Bryan Newbold, 2020-07-01, 1 file, -0/+40)
|
* ES schema: add best_url to file schema (Bryan Newbold, 2020-06-04, 1 file, -0/+1)
|   This will increase index size (URLs are often long in our corpus, and we
|   have many file entities), but seems worth it.
|   Initially added `ia_url` as a second field, guaranteed to always be an
|   *.archive.org URL, but `best_url` defaults to that anyways so didn't seem
|   worthwhile.
* sql: really don't double-dump requests (Bryan Newbold, 2020-05-26, 1 file, -1/+0)
|   I guess we were dumping 3 times originally; already had an earlier commit
|   that removed one row from this README (that I copypaste to CLI every time)
* 2020-05-26 prod database size and stats (Bryan Newbold, 2020-05-26, 2 files, -0/+48)
|
* update prod stats (Bryan Newbold, 2020-04-17, 7 files, -0/+149)
|
* Add missing packages to Dockerfile and CI file (Bryan Newbold, 2020-04-16, 1 file, -1/+1)
|
* test-base Dockerfile (Bryan Newbold, 2020-04-16, 2 files, -0/+51)
|   Used to create bnewbold/fatcat-test-base image
* update bulk export instructions (Bryan Newbold, 2020-04-07, 1 file, -4/+2)
|   - don't do expanded and regular release dumps
|   - default to sqldump_public for item name (as that is common-case)
* sql_dumps: stop doing redundant release dumps (Bryan Newbold, 2020-04-01, 1 file, -1/+3)
|
* bulk exports README different from SQL README (Bryan Newbold, 2020-03-17, 1 file, -1/+1)
|
* ES README: really need to limit to 1k esbulk batches (Bryan Newbold, 2020-02-26, 1 file, -3/+3)
|
* Merge branch 'bnewbold-elastic-v03b' (Bryan Newbold, 2020-02-26, 5 files, -61/+203)
|\
| * update ES transform README (Bryan Newbold, 2020-02-26, 1 file, -2/+3)
| |   - smaller batch sizes to prevent esbulk errors
| |   - file transform/index
| * ES container last tweaks (Bryan Newbold, 2020-02-26, 1 file, -3/+4)
| |
| * ES release: last minor tweaks (Bryan Newbold, 2020-02-26, 1 file, -3/+5)
| |
| * release schema: do doc_value on DOIs (Bryan Newbold, 2020-02-13, 1 file, -1/+1)
| |   Because DOIs are pseudo-structured (prefix, and often structure within
| |   the publisher-controlled area), I suspect we will in fact be wanting to
| |   do analytics over these strings.
| * ES release: actually do want doc_values for work_id (Bryan Newbold, 2020-02-05, 1 file, -1/+1)
| |   Eg, for fast "unique count"
| * fix axiv/arxiv typo in release schema (Bryan Newbold, 2020-02-04, 1 file, -1/+1)
| |
| * ES release schema: fix typo (Bryan Newbold, 2020-01-31, 1 file, -1/+1)
| |
| * fix json typos in changelog schema (Bryan Newbold, 2020-01-30, 1 file, -2/+2)
| |
| * add upper-case work-around from kibana map join (Bryan Newbold, 2020-01-30, 1 file, -0/+1)
| |
| * JSON typo in release mapping (Bryan Newbold, 2020-01-30, 1 file, -1/+0)
| |
| * ES schemas: make keywords case-insensitive by default (Bryan Newbold, 2020-01-30, 4 files, -66/+115)
| |   But not applying asciifolding; don't see any need to do so?
| * tweak file ES archive.org domain tracking (Bryan Newbold, 2020-01-30, 1 file, -0/+1)
| |
| * elastic schema fixes (Bryan Newbold, 2020-01-29, 2 files, -7/+7)
| |
| * add country to v03b release schema (Bryan Newbold, 2020-01-29, 1 file, -0/+1)
| |
| * update ES docs and proposal (Bryan Newbold, 2020-01-29, 1 file, -0/+2)
| |
| * actually implement changelog transform (Bryan Newbold, 2020-01-29, 1 file, -1/+10)
| |
| * ES release schema updates (Bryan Newbold, 2020-01-29, 1 file, -23/+46)
| |
| * container ES schema changes (Bryan Newbold, 2020-01-29, 1 file, -13/+20)
| |
| * first implementation of ES file schema (Bryan Newbold, 2020-01-29, 1 file, -0/+46)
| |   Includes a trivial test and transform, but not any workers or doc updates.
* | table size snapshots (Bryan Newbold, 2020-02-19, 2 files, -0/+47)
|/
* stats: remove internal PG table sizes from old dumps (Bryan Newbold, 2020-01-19, 2 files, -292/+0)
|   For ease of reading and comparison
* update stats and table sizes (Bryan Newbold, 2020-01-19, 4 files, -0/+96)
|
* sql table size script: shorter output (Bryan Newbold, 2020-01-15, 1 file, -0/+1)
|   This skips postgres-internal tables in size output
* 2019-01-07 status update (Bryan Newbold, 2020-01-07, 2 files, -0/+36)
|
* DB loads take a long time now (Bryan Newbold, 2019-12-21, 1 file, -1/+1)
|
* add 2019-12-20 stats (Bryan Newbold, 2019-12-20, 2 files, -0/+148)
|
* add kafka-pixy to docker-compose file (Bryan Newbold, 2019-12-10, 1 file, -0/+8)
|
* tweaks to docker-compose image (Bryan Newbold, 2019-12-10, 1 file, -0/+5)
|   - don't start kafka image until zookeeper is running
|   - set very liberal "watermarks" for elasticsearch disk monitoring
* increase max.message.bytes in container (Martin Czygan, 2019-12-05, 1 file, -0/+1)
|   While working on datacite, some message were larger than the default of
|   1000012 bytes.
* export raw affiliation strings for analysis (Bryan Newbold, 2019-10-03, 1 file, -0/+17)
|
* docker-compose: kafka 2.0, and -dev topic names (Bryan Newbold, 2019-09-20, 1 file, -3/+2)
|
* document release publish process [v0.3.1] (Bryan Newbold, 2019-09-18, 1 file, -0/+48)
|
* create new collection just for fatcat exports (Bryan Newbold, 2019-09-09, 1 file, -1/+1)
|