Commit message | Author | Age | Files | Lines | ||
---|---|---|---|---|---|---|
... | ||||||
* | include releases_by_work in ident tarball | Bryan Newbold | 2020-08-04 | 1 | -1/+2 | |
| | ||||||
* | update SQL dump docs with group-by-work command (by default) | Bryan Newbold | 2020-08-04 | 1 | -1/+1 | |
| | ||||||
* | WIP: sorted release ident dumps | Bryan Newbold | 2020-08-04 | 1 | -0/+16 | |
| | ||||||
* | update table/database size stats | Bryan Newbold | 2020-07-22 | 2 | -0/+48 | |
| | ||||||
* | commit example of an elasticsearch SQL query | Bryan Newbold | 2020-07-01 | 1 | -0/+8 | |
| | ||||||
* | commit old README about bulk downloads | Bryan Newbold | 2020-07-01 | 1 | -0/+40 | |
| | ||||||
* | ES schema: add best_url to file schema | Bryan Newbold | 2020-06-04 | 1 | -0/+1 | |
| | This will increase index size (URLs are often long in our corpus, and we have many file entities), but seems worth it. Initially added `ia_url` as a second field, guaranteed to always be an *.archive.org URL, but `best_url` defaults to that anyways so didn't seem worthwhile. | |||||
* | sql: really don't double-dump requests | Bryan Newbold | 2020-05-26 | 1 | -1/+0 | |
| | I guess we were dumping 3 times originally; already had an earlier commit that removed one row from this README (that I copypaste to CLI every time) | |||||
* | 2020-05-26 prod database size and stats | Bryan Newbold | 2020-05-26 | 2 | -0/+48 | |
| | ||||||
* | update prod stats | Bryan Newbold | 2020-04-17 | 7 | -0/+149 | |
| | ||||||
* | Add missing packages to Dockerfile and CI file | Bryan Newbold | 2020-04-16 | 1 | -1/+1 | |
| | ||||||
* | test-base Dockerfile | Bryan Newbold | 2020-04-16 | 2 | -0/+51 | |
| | Used to create bnewbold/fatcat-test-base image | |||||
* | update bulk export instructions | Bryan Newbold | 2020-04-07 | 1 | -4/+2 | |
| | - don't do expanded and regular release dumps - default to sqldump_public for item name (as that is common-case) | |||||
* | sql_dumps: stop doing redundant release dumps | Bryan Newbold | 2020-04-01 | 1 | -1/+3 | |
| | ||||||
* | bulk exports README different from SQL README | Bryan Newbold | 2020-03-17 | 1 | -1/+1 | |
| | ||||||
* | ES README: really need to limit to 1k esbulk batches | Bryan Newbold | 2020-02-26 | 1 | -3/+3 | |
| | ||||||
* | Merge branch 'bnewbold-elastic-v03b' | Bryan Newbold | 2020-02-26 | 5 | -61/+203 | |
|\ | ||||||
| * | update ES transform README | Bryan Newbold | 2020-02-26 | 1 | -2/+3 | |
| | | - smaller batch sizes to prevent esbulk errors - file transform/index | |||||
| * | ES container last tweaks | Bryan Newbold | 2020-02-26 | 1 | -3/+4 | |
| | | ||||||
| * | ES release: last minor tweaks | Bryan Newbold | 2020-02-26 | 1 | -3/+5 | |
| | | ||||||
| * | release schema: do doc_value on DOIs | Bryan Newbold | 2020-02-13 | 1 | -1/+1 | |
| | | Because DOIs are pseudo-structured (prefix, and often structure within the publisher-controlled area), I suspect we will in fact be wanting to do analytics over these strings. | |||||
| * | ES release: actually do want doc_values for work_id | Bryan Newbold | 2020-02-05 | 1 | -1/+1 | |
| | | Eg, for fast "unique count" | |||||
| * | fix axiv/arxiv typo in release schema | Bryan Newbold | 2020-02-04 | 1 | -1/+1 | |
| | | ||||||
| * | ES release schema: fix typo | Bryan Newbold | 2020-01-31 | 1 | -1/+1 | |
| | | ||||||
| * | fix json typos in changelog schema | Bryan Newbold | 2020-01-30 | 1 | -2/+2 | |
| | | ||||||
| * | add upper-case work-around from kibana map join | Bryan Newbold | 2020-01-30 | 1 | -0/+1 | |
| | | ||||||
| * | JSON typo in release mapping | Bryan Newbold | 2020-01-30 | 1 | -1/+0 | |
| | | ||||||
| * | ES schemas: make keywords case-insensitive by default | Bryan Newbold | 2020-01-30 | 4 | -66/+115 | |
| | | But not applying asciifolding; don't see any need to do so? | |||||
| * | tweak file ES archive.org domain tracking | Bryan Newbold | 2020-01-30 | 1 | -0/+1 | |
| | | ||||||
| * | elastic schema fixes | Bryan Newbold | 2020-01-29 | 2 | -7/+7 | |
| | | ||||||
| * | add country to v03b release schema | Bryan Newbold | 2020-01-29 | 1 | -0/+1 | |
| | | ||||||
| * | update ES docs and proposal | Bryan Newbold | 2020-01-29 | 1 | -0/+2 | |
| | | ||||||
| * | actually implement changelog transform | Bryan Newbold | 2020-01-29 | 1 | -1/+10 | |
| | | ||||||
| * | ES release schema updates | Bryan Newbold | 2020-01-29 | 1 | -23/+46 | |
| | | ||||||
| * | container ES schema changes | Bryan Newbold | 2020-01-29 | 1 | -13/+20 | |
| | | ||||||
| * | first implementation of ES file schema | Bryan Newbold | 2020-01-29 | 1 | -0/+46 | |
| | | Includes a trivial test and transform, but not any workers or doc updates. | |||||
* | | table size snapshots | Bryan Newbold | 2020-02-19 | 2 | -0/+47 | |
|/ | ||||||
* | stats: remove internal PG table sizes from old dumps | Bryan Newbold | 2020-01-19 | 2 | -292/+0 | |
| | For ease of reading and comparison | |||||
* | update stats and table sizes | Bryan Newbold | 2020-01-19 | 4 | -0/+96 | |
| | ||||||
* | sql table size script: shorter output | Bryan Newbold | 2020-01-15 | 1 | -0/+1 | |
| | This skips postgres-internal tables in size output | |||||
* | 2019-01-07 status update | Bryan Newbold | 2020-01-07 | 2 | -0/+36 | |
| | ||||||
* | DB loads take a long time now | Bryan Newbold | 2019-12-21 | 1 | -1/+1 | |
| | ||||||
* | add 2019-12-20 stats | Bryan Newbold | 2019-12-20 | 2 | -0/+148 | |
| | ||||||
* | add kafka-pixy to docker-compose file | Bryan Newbold | 2019-12-10 | 1 | -0/+8 | |
| | ||||||
* | tweaks to docker-compose image | Bryan Newbold | 2019-12-10 | 1 | -0/+5 | |
| | - don't start kafka image until zookeeper is running - set very liberal "watermarks" for elasticsearch disk monitoring | |||||
* | increase max.message.bytes in container | Martin Czygan | 2019-12-05 | 1 | -0/+1 | |
| | While working on datacite, some messages were larger than the default of 1000012 bytes. | |||||
* | export raw affiliation strings for analysis | Bryan Newbold | 2019-10-03 | 1 | -0/+17 | |
| | ||||||
* | docker-compose: kafka 2.0, and -dev topic names | Bryan Newbold | 2019-09-20 | 1 | -3/+2 | |
| | ||||||
* | document release publish process v0.3.1 | Bryan Newbold | 2019-09-18 | 1 | -0/+48 | |
| | ||||||
* | create new collection just for fatcat exports | Bryan Newbold | 2019-09-09 | 1 | -1/+1 | |
| |