aboutsummaryrefslogtreecommitdiffstats
path: root/sql/README.md
diff options
context:
space:
mode:
Diffstat (limited to 'sql/README.md')
-rw-r--r--sql/README.md42
1 files changed, 42 insertions, 0 deletions
diff --git a/sql/README.md b/sql/README.md
index b171614..e488006 100644
--- a/sql/README.md
+++ b/sql/README.md
@@ -5,6 +5,21 @@ No primary storage of anything in this table. Everything should be rapidly
re-creatable from dumps, kafka topics (compressed), CDX, petabox metadata, etc.
This is a secondary view on all of that.
+## Create Database and User
+
+Create system user with your username like:
+
+ sudo su postgres
+ createuser -s bnewbold
+
+Create database using `diesel` tool (see fatcat rust docs for install notes):
+
+ # DANGER: will delete/recreate entire database
+ diesel database reset
+
+In the future would probably be better to create a real role/password and
+supply these via `DATABASE_URL` env variable.
+
## Schema
schema/database name is 'sandcrawler'
@@ -124,3 +139,30 @@ Questions we might want to answer
http get :3030/cdx?url=eq.https://coleccionables.mercadolibre.com.ar/arduino-pdf_Installments_NoInterest_BestSellers_YES
http get :3030/file_meta?sha1hex=eq.120582c855a7cc3c70a8527c560d7f27e6027278
+
+## Full SQL Database Dumps
+
+Run a dump in compressed, postgres custom format, not including `crossref` table (which is large and redundant):
+
+ export DATESLUG="`date +%Y-%m-%d.%H%M%S`"
+ time sudo -u postgres pg_dump --verbose --format=custom --exclude-table-data=crossref sandcrawler > sandcrawler_full_dbdump_${DATESLUG}.pgdump
+
+As of 2021-12-03, this process runs for about 6 hours and the compressed
+snapshot is 102 GBytes (compared with 940GB database disk consumption,
+including crossref).
+
+Then, upload to petabox as a backup:
+
+ ia upload sandcrawler_full_dbdump_YYYY-MM-DD -m mediatype:data -m collection:webgroup-internal-backups -m title:"Sandcrawler SQL Dump (YYYY-MM-DD)" sandcrawler_full_dbdump_${DATESLUG}.pgdump
+
+
+## SQL Database Restore
+
+To restore a dump (which will delete local database content, if any):
+
+ sudo su postgres
+ createuser --no-login web_anon
+ createuser -s sandcrawler
+ time pg_restore --jobs=4 --verbose --clean --if-exists --create --exit-on-error -d postgres sandcrawler_full_dbdump_2021-04-08.003952.pgdump
+
+Took about 2.5 hours.