aboutsummaryrefslogtreecommitdiffstats
path: root/notes/database_dumps_backups.txt
blob: 60d4bba07ef1da97b1e8d9550164fb1b81e3e6a3 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53

## Dumps and Backups

There are a few different database dump formats folks might want:

- raw native database backups, for disaster recovery (would include
  volatile/unsupported schema details, user API credentials, full history,
  in-process edits, comments, etc)
- a sanitized version of the above: roughly per-table dumps of the full state
  of the database. Could use per-table SQL expressions with sub-queries to pull
  in small tables ("partial transform") and export JSON for each table; would
  be extra work to maintain, so not pursuing for now.
- full history, full public schema exports, in a form that might be used to
  mirror or enitrely fork the project. Propose supplying the full "changelog"
  in API schema format, in a single file to capture all entity history, without
  "hydrating" any inter-entity references. Rely on separate dumps of
  non-entity, non-versioned tables (editors, abstracts, etc). Note that a
  variant of this could use the public interface, in particular to do
  incremental updates (though that wouldn't capture schema changes).
- transformed exports of the current state of the database (aka, without
  history). Useful for data analysis, search engines, etc. Propose supplying
  just the Release table in a fully "hydrated" state to start. Unclear if
  should be on a work or release basis; will go with release for now. Harder to
  do using public interface because of the need for transaction locking.

## Full Postgres Backup

Backing up the entire database using `pg_dump`, with parallelism 1 (use more on
larger machine with fast disks; try 4 or 8?), assuming the database name is
'fatcat', and the current user has access:

    pg_dump -j1 -Fd -f test-dump fatcat

## Identifier Dumps

The `extras/quick_dump.sql` script will dump abstracts and identifiers as TSV
files to `/tmp/`. Pretty quick; takes about 15 GB of disk space (uncompressed).

## Releases Export

    # simple command
    ./fatcat_export.py releases /tmp/fatcat_ident_releases.tsv /tmp/releases-dump.json

    # usual command
    time ./fatcat_export.py releases /tmp/fatcat_ident_releases.tsv - | pv -l | wc

## Changelog Export

    # simple command
    ./fatcat_export.py changelog /tmp/changelog-dump.json

    # usual command
    time ./fatcat_export.py changelog - | pv -l | wc