path: root/extra
Commit message    Author    Age    Files    Lines
* chocula update notes    Bryan Newbold    2021-11-30    1    -0/+61
|
* container ISSN-L dedupe notes    Bryan Newbold    2021-11-30    1    -0/+198
|
* add stats (before re-indexing), and rename old files for consistency    Bryan Newbold    2021-11-30    6    -0/+47
|
* cleanups: springer 'page-one' sample PDFs    Bryan Newbold    2021-11-29    2    -0/+129
|
* cleanups: truncated wayback PDFs from common crawl    Bryan Newbold    2021-11-29    2    -0/+292
|
* update to truncated wayback timestamp issue    Bryan Newbold    2021-11-29    1    -0/+24
|
* update to file short wayback timestamp cleanup    Bryan Newbold    2021-11-29    2    -1/+30
|
* commit old 2021-11-11 stats file    Bryan Newbold    2021-11-29    1    -0/+1
|
* clean up extra/ folder a bit    Bryan Newbold    2021-11-29    11    -24/+0
|
* move notes/bulk_edits/ to extra/bulk_edits/    Bryan Newbold    2021-11-29    23    -0/+1743
|
* move 'cleanups' directory from notes to extra/    Bryan Newbold    2021-11-29    11    -0/+1306
|
* codespell fixes to various other docs    Bryan Newbold    2021-11-24    3    -4/+4
|
* content_scope: include in file ES schema and transform    Bryan Newbold    2021-11-17    1    -0/+1
|
* ISSN-L dupes check: output all matches    Bryan Newbold    2021-11-17    1    -1/+1
|
* sitemap generation improvements    Bryan Newbold    2021-11-10    2    -1/+2
|
* elasticsearch schema changes    Bryan Newbold    2021-10-13    2    -3/+13
|
* update stats    Bryan Newbold    2021-10-11    3    -0/+48
|
* sql_dumps: set collection at upload time    Bryan Newbold    2021-09-02    1    -2/+5
|
* prod stats snapshot    Bryan Newbold    2021-08-06    4    -0/+47
|
* stats snapshot (2021-06-23)    Bryan Newbold    2021-06-23    2    -0/+47
|
* SQL dumps: more pigz (vs. gzip) for speed    Bryan Newbold    2021-06-17    1    -2/+2
|
* fatcat_ref ES schema: more doc_values; source_year not source_release_year    Bryan Newbold    2021-06-17    1    -5/+2
|
* update dblp pre-import notes and pipenv python version (3.8)    Bryan Newbold    2021-06-03    2    -6/+11
|
* elasticsearch ref schema: 6 shards, not 12    Bryan Newbold    2021-05-18    1    -1/+1
|
* fix 'colected' typos    Bryan Newbold    2021-04-13    1    -1/+1
|     Thanks for the catch martin
* update elasticsearch bootstrap indexing notes    Bryan Newbold    2021-04-09    1    -8/+16
|
* ES: rename fatcat_ref.json to ref_schema.json for consistency; add to README    Bryan Newbold    2021-04-08    2    -1/+4
|
* release ES schema: fix typo with shard/replica configuration    Bryan Newbold    2021-04-08    1    -1/+1
|
* sitemaps: filter to releases with PDF fulltext (for now)    Bryan Newbold    2021-04-07    1    -0/+2
|
* container search schema: preservation stats, new fields    Bryan Newbold    2021-04-06    1    -8/+9
|     Includes transform code updates and partial test coverage.
* release ES: add discipline field    Bryan Newbold    2021-04-06    1    -0/+1
|
* ES schemas: add doc_index_ts to all mappings    Bryan Newbold    2021-04-06    5    -0/+9
|
* elasticsearch schema, docs, docker: update from ES 6.x to ES 7.x    Bryan Newbold    2021-04-06    7    -125/+24
|     Including removing index document names (use '_doc' instead during transition)
* add es draft schema for references    Martin Czygan    2021-03-30    1    -0/+106
|
* SQL dump timing note    Bryan Newbold    2021-03-10    1    -0/+3
|
* sql dump recent timing note    Bryan Newbold    2021-03-08    1    -1/+2
|
* elasticsearch: simple new dblp and doaj fields    Bryan Newbold    2021-01-20    1    -0/+3
|
* Merge branch 'bnewbold-ci-cleanups' into 'master'    bnewbold    2021-01-05    1    -5/+11
|\
| |     Gitlab CI and docker base image cleanups
| |     See merge request webgroup/fatcat!94
| * docker xenial: use get-pipenv.py to install pipenv et al    Bryan Newbold    2020-12-22    1    -5/+6
| |
| * docker xenial: switch to rust 1.43.0    Bryan Newbold    2020-12-22    1    -1/+1
| |
| * docker xenial: include python3.8    Bryan Newbold    2020-12-22    1    -1/+6
| |
* | update stats (post DOAJ and dblp imports)    Bryan Newbold    2020-12-29    2    -0/+47
| |
* | DOAJ import notes, and SQL/stats update    Bryan Newbold    2020-12-23    4    -0/+94
|/
* dblp: polish HTML scrape/extract pipeline    Bryan Newbold    2020-12-17    3    -3/+16
|
* dblp: script and notes on container metadata generation    Bryan Newbold    2020-12-17    4    -0/+134
|
* Merge pull request #65 from ibnesayeed/patch-1    bnewbold    2020-12-17    1    -1/+1
|\
| |     Improve status counting efficiency
| * Improve status counting efficiency    Sawood Alam    2020-12-17    1    -1/+1
| |     When the input is large with a small number of unique items to be counted then counting as we go would be linear and more efficient approach than sorting and unique counting.
* | Revert "docker xenial base image: include python3.8"    Bryan Newbold    2020-12-11    1    -6/+1
| |     This reverts commit 91628426678a635f26cf992dbd5caedb4a3ae24b.
* | docker xenial base image: include python3.8    Bryan Newbold    2020-12-11    1    -1/+6
| |
* | docker: how to push to dockerhub    Bryan Newbold    2020-12-11    1    -0/+4
|/