path: root/extra
Commit message (Author, Age, Files, Lines)
* release ES schema: fix typo with shard/replica configuration (Bryan Newbold, 2021-04-08, 1 file, -1/+1)
|
* sitemaps: filter to releases with PDF fulltext (for now) (Bryan Newbold, 2021-04-07, 1 file, -0/+2)
|
* container search schema: preservation stats, new fields (Bryan Newbold, 2021-04-06, 1 file, -8/+9)
|     Includes transform code updates and partial test coverage.
* release ES: add discipline field (Bryan Newbold, 2021-04-06, 1 file, -0/+1)
|
* ES schemas: add doc_index_ts to all mappings (Bryan Newbold, 2021-04-06, 5 files, -0/+9)
|
* elasticsearch schema, docs, docker: update from ES 6.x to ES 7.x (Bryan Newbold, 2021-04-06, 7 files, -125/+24)
|     Including removing index document names (use '_doc' instead during transition)
* add es draft schema for references (Martin Czygan, 2021-03-30, 1 file, -0/+106)
|
* SQL dump timing note (Bryan Newbold, 2021-03-10, 1 file, -0/+3)
|
* sql dump recent timing note (Bryan Newbold, 2021-03-08, 1 file, -1/+2)
|
* elasticsearch: simple new dblp and doaj fields (Bryan Newbold, 2021-01-20, 1 file, -0/+3)
|
* Merge branch 'bnewbold-ci-cleanups' into 'master' (bnewbold, 2021-01-05, 1 file, -5/+11)
|\
| |   Gitlab CI and docker base image cleanups
| |   See merge request webgroup/fatcat!94
| * docker xenial: use get-pipenv.py to install pipenv et al (Bryan Newbold, 2020-12-22, 1 file, -5/+6)
| |
| * docker xenial: switch to rust 1.43.0 (Bryan Newbold, 2020-12-22, 1 file, -1/+1)
| |
| * docker xenial: include python3.8 (Bryan Newbold, 2020-12-22, 1 file, -1/+6)
| |
* | update stats (post DOAJ and dblp imports) (Bryan Newbold, 2020-12-29, 2 files, -0/+47)
| |
* | DOAJ import notes, and SQL/stats update (Bryan Newbold, 2020-12-23, 4 files, -0/+94)
|/
* dblp: polish HTML scrape/extract pipeline (Bryan Newbold, 2020-12-17, 3 files, -3/+16)
|
* dblp: script and notes on container metadata generation (Bryan Newbold, 2020-12-17, 4 files, -0/+134)
|
* Merge pull request #65 from ibnesayeed/patch-1 (bnewbold, 2020-12-17, 1 file, -1/+1)
|\
| |   Improve status counting efficiency
| * Improve status counting efficiency (Sawood Alam, 2020-12-17, 1 file, -1/+1)
| |   When the input is large with a small number of unique items to be counted then counting as we go would be linear and more efficient approach than sorting and unique counting.
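The reasoning in the commit message above can be illustrated with a minimal Python sketch. The diff itself is not shown in this log, so this is not the code from that commit: the function names `count_streaming` and `count_sorted` and the sample status values are hypothetical. The first function counts as it goes in a single pass (linear in the input size, with memory proportional to the number of unique items); the second sorts first and then counts runs of equal items, which costs O(n log n).

```python
from collections import Counter
from itertools import groupby
from typing import Dict, Iterable


def count_streaming(items: Iterable[str]) -> Dict[str, int]:
    """Count occurrences in one pass; memory grows with unique items only."""
    counts: Counter = Counter()
    for item in items:
        counts[item] += 1
    return dict(counts)


def count_sorted(items: Iterable[str]) -> Dict[str, int]:
    """Sort everything first, then count each run of equal items."""
    return {key: sum(1 for _ in group) for key, group in groupby(sorted(items))}


# Hypothetical status values, used only for illustration.
statuses = ["success", "no-capture", "success", "error", "success"]

# Both approaches agree on the result; only the cost differs.
assert count_streaming(statuses) == count_sorted(statuses)
```

Both functions return the same mapping, so the streaming variant can replace the sorted variant wherever the order of the output does not matter.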
* | Revert "docker xenial base image: include python3.8" (Bryan Newbold, 2020-12-11, 1 file, -6/+1)
| |   This reverts commit 91628426678a635f26cf992dbd5caedb4a3ae24b.
* | docker xenial base image: include python3.8 (Bryan Newbold, 2020-12-11, 1 file, -1/+6)
| |
* | docker: how to push to dockerhub (Bryan Newbold, 2020-12-11, 1 file, -0/+4)
|/
* update database/table stats (Bryan Newbold, 2020-10-12, 2 files, -0/+48)
|
* update stats snapshot (Bryan Newbold, 2020-09-03, 2 files, -0/+47)
|
* sitemap fixes from testing (Bryan Newbold, 2020-08-19, 3 files, -4/+15)
|
* iterate on sitemap generation (Bryan Newbold, 2020-08-19, 6 files, -7/+119)
|
* initial sitemap.xml notes/template (Bryan Newbold, 2020-08-19, 2 files, -0/+29)
|
* include releases_by_work in ident tarball (Bryan Newbold, 2020-08-04, 1 file, -1/+2)
|
* update SQL dump docs with group-by-work command (by default) (Bryan Newbold, 2020-08-04, 1 file, -1/+1)
|
* WIP: sorted release ident dumps (Bryan Newbold, 2020-08-04, 1 file, -0/+16)
|
* update table/database size stats (Bryan Newbold, 2020-07-22, 2 files, -0/+48)
|
* commit example of an elasticsearch SQL query (Bryan Newbold, 2020-07-01, 1 file, -0/+8)
|
* commit old README about bulk downloads (Bryan Newbold, 2020-07-01, 1 file, -0/+40)
|
* ES schema: add best_url to file schema (Bryan Newbold, 2020-06-04, 1 file, -0/+1)
|     This will increase index size (URLs are often long in our corpus, and we have many file entities), but seems worth it.
|     Initially added `ia_url` as a second field, guaranteed to always be an *.archive.org URL, but `best_url` defaults to that anyways so didn't seem worthwhile.
* sql: really don't double-dump requests (Bryan Newbold, 2020-05-26, 1 file, -1/+0)
|     I guess we were dumping 3 times originally; already had an earlier commit that removed one row from this README (that I copypaste to CLI every time)
* 2020-05-26 prod database size and stats (Bryan Newbold, 2020-05-26, 2 files, -0/+48)
|
* update prod stats (Bryan Newbold, 2020-04-17, 7 files, -0/+149)
|
* Add missing packages to Dockerfile and CI file (Bryan Newbold, 2020-04-16, 1 file, -1/+1)
|
* test-base Dockerfile (Bryan Newbold, 2020-04-16, 2 files, -0/+51)
|     Used to create bnewbold/fatcat-test-base image
* update bulk export instructions (Bryan Newbold, 2020-04-07, 1 file, -4/+2)
|     - don't do expanded and regular release dumps
|     - default to sqldump_public for item name (as that is common-case)
* sql_dumps: stop doing redundant release dumps (Bryan Newbold, 2020-04-01, 1 file, -1/+3)
|
* bulk exports README different from SQL README (Bryan Newbold, 2020-03-17, 1 file, -1/+1)
|
* ES README: really need to limit to 1k esbulk batches (Bryan Newbold, 2020-02-26, 1 file, -3/+3)
|
* Merge branch 'bnewbold-elastic-v03b' (Bryan Newbold, 2020-02-26, 5 files, -61/+203)
|\
| * update ES transform README (Bryan Newbold, 2020-02-26, 1 file, -2/+3)
| |   - smaller batch sizes to prevent esbulk errors
| |   - file transform/index
| * ES container last tweaks (Bryan Newbold, 2020-02-26, 1 file, -3/+4)
| |
| * ES release: last minor tweaks (Bryan Newbold, 2020-02-26, 1 file, -3/+5)
| |
| * release schema: do doc_value on DOIs (Bryan Newbold, 2020-02-13, 1 file, -1/+1)
| |   Because DOIs are pseudo-structured (prefix, and often structure within the publisher-controlled area), I suspect we will in fact be wanting to do analytics over these strings.
| * ES release: actually do want doc_values for work_id (Bryan Newbold, 2020-02-05, 1 file, -1/+1)
| |   Eg, for fast "unique count"