| | Commit message | Author | Age | Files | Lines |
---|---|---|---|---|---|
* | some weekly crawl numbers (not very helpful) | Bryan Newbold | 2022-05-03 | 1 | -0/+191 |
* | switch default kafka-broker host from wbgrp-svc263 to wbgrp-svc350 | Bryan Newbold | 2022-05-03 | 4 | -4/+4 |
* | April 2022 sandcrawler DB stats | Bryan Newbold | 2022-04-27 | 1 | -0/+432 |
* | sql: add source/created index on ingest_request table | Bryan Newbold | 2022-04-04 | 1 | -0/+1 |
* | sql: fix reingest query missing type on LEFT JOIN; wrap in read-only transaction | Bryan Newbold | 2022-04-04 | 5 | -5/+27 |
* | sql: script to reingest recent spn2 lookup failure in bulk mode | Bryan Newbold | 2022-02-08 | 5 | -18/+71 |
* | 2021-12-02 database table size stats | Bryan Newbold | 2021-12-07 | 1 | -0/+22 |
* | sandcrawler SQL dump and upload updates | Bryan Newbold | 2021-12-07 | 1 | -4/+12 |
* | update fatcat_file SQL table schema, and add backfill notes | Bryan Newbold | 2021-12-07 | 1 | -1/+3 |
* | update fatcat_file SQL table schema, and add backfill notes | Bryan Newbold | 2021-12-01 | 1 | -0/+13 |
* | sandcrawler SQL stats | Bryan Newbold | 2021-11-27 | 2 | -12/+425 |
* | sql: grobid_refs table JSON as 'JSON' not 'JSONB' | Bryan Newbold | 2021-11-04 | 1 | -1/+1 |
| | I keep flip-flopping on this, but our disk usage is really large, and if 'JSON' is smaller than 'JSONB' in postgresql at all it is worth it. | | | | |
* | record SQL table sizes at start of crossref re-ingest | Bryan Newbold | 2021-11-04 | 1 | -0/+19 |
* | add grobid_refs and crossref_with_refs to sandcrawler-db SQL schema | Bryan Newbold | 2021-11-04 | 1 | -0/+21 |
* | SPN reingest: 6 hour minimum, 6 month max | Bryan Newbold | 2021-11-03 | 1 | -2/+2 |
* | sql: fix typo in quarterly (not weekly) script | Bryan Newbold | 2021-11-03 | 1 | -1/+1 |
* | sql: fixes to ingest_fileset_platform schema (from table creation) | Bryan Newbold | 2021-11-01 | 1 | -6/+6 |
* | commit old ingest domain summary | Bryan Newbold | 2021-10-15 | 1 | -0/+345 |
* | sql fileset ingest table iteration | Bryan Newbold | 2021-10-15 | 1 | -12/+11 |
* | sql: initial ingest fileset table | Bryan Newbold | 2021-10-15 | 1 | -0/+38 |
* | sql: fix typo in CHECK statement | Bryan Newbold | 2021-10-15 | 1 | -1/+1 |
* | new SQL recent SPN request monitoring query | Bryan Newbold | 2021-10-04 | 1 | -0/+32 |
* | refactor reingest scripts | Bryan Newbold | 2021-09-30 | 6 | -150/+90 |
* | new 'daily' and 'priority' ingest request topics | Bryan Newbold | 2021-09-30 | 2 | -2/+2 |
| | The old ingest request queue was always getting lopsided, suspect because it was scaled up (additional partitions) at some point in the past, hoping new topics will fix this. New '-priority' queue is like '-bulk', but for smaller-volume SPN-like requests. Eg, interactive mode. | | | | |
* | reingest: skip spn2 'unknown' errors | Bryan Newbold | 2021-07-21 | 2 | -0/+2 |
* | crossref DB proposal, and include in SQL schema | Bryan Newbold | 2021-06-02 | 1 | -0/+7 |
* | sql: do periodically retry spn2-wayback-error | Bryan Newbold | 2021-04-27 | 2 | -2/+0 |
* | reingest scripts to run as sandcrawler | Bryan Newbold | 2021-04-09 | 2 | -12/+12 |
* | sql: notes on sql restore | Bryan Newbold | 2021-04-09 | 1 | -0/+9 |
* | sql: update paths to work with svc506 machine | Bryan Newbold | 2021-04-09 | 12 | -49/+49 |
* | sql: before/after pg13 table size stats | Bryan Newbold | 2021-04-09 | 2 | -1/+43 |
* | sql: update periodic retry/reingest scripts | Bryan Newbold | 2021-04-09 | 4 | -6/+14 |
* | SQL snapshot doc update | Bryan Newbold | 2021-04-07 | 1 | -2/+5 |
* | 2021-04-07 sandcrawler DB stats | Bryan Newbold | 2021-04-07 | 1 | -0/+428 |
* | SQL: more ingest monitoring | Bryan Newbold | 2020-11-16 | 3 | -1/+660 |
* | tweak html_meta SQL schema | Bryan Newbold | 2020-11-03 | 1 | -2/+2 |
* | SQL: unmatched glutton query (old) | Bryan Newbold | 2020-11-03 | 1 | -0/+19 |
* | monitoring: past-7-days summary query | Bryan Newbold | 2020-11-03 | 1 | -0/+26 |
* | html: start on SQL table | Bryan Newbold | 2020-11-03 | 1 | -0/+15 |
* | SQL: update weekly/quarterly ingest retry scripts | Bryan Newbold | 2020-10-21 | 5 | -18/+119 |
* | sql stats: larger limits (more complete lists) | Bryan Newbold | 2020-10-21 | 1 | -8/+8 |
* | update SQL ingest monitoring commands to be past-month by default | Bryan Newbold | 2020-10-17 | 1 | -5/+5 |
* | dump_file_meta helper | Bryan Newbold | 2020-10-01 | 1 | -0/+12 |
* | updated sandcrawler-db stats | Bryan Newbold | 2020-09-15 | 2 | -6/+346 |
* | WIP weekly re-ingest script | Bryan Newbold | 2020-08-17 | 2 | -0/+97 |
* | grobid+pdftext missing catch-up commands | Bryan Newbold | 2020-08-05 | 4 | -10/+49 |
* | commit stats from a couple weeks back | Bryan Newbold | 2020-08-05 | 1 | -0/+347 |
* | sql stats commands updates | Bryan Newbold | 2020-08-05 | 1 | -2/+2 |
* | commented special modes for dump_unextracted_pdf.sql | Bryan Newbold | 2020-06-25 | 1 | -1/+4 |
* | pdftrio SQL queries | Bryan Newbold | 2020-06-25 | 1 | -0/+65 |