
## HOWTO: Ident Table Snapshots

This will take somewhere around 15-25 GB of disk space on the database server
(under `/tmp`). It would probably be better to stream this transaction over a
network connection (saving database disk I/O), but I can't figure out how to do
that with plain SQL (multiple table dumps in a single session), so it would
need to be a custom client.

    ./ident_table_snapshot.sh

Or, in production:

    sudo su postgres
    DATABASE_URL=fatcat_prod ./ident_table_snapshot.sh /srv/fatcat/snapshots/
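
For reference, the core of `ident_table_snapshot.sh` is a single `REPEATABLE
READ` transaction which `COPY`s each ident table out under `/tmp`, so that all
the dumps reference the same consistent snapshot. A minimal sketch, assuming
table names like `release_ident` (check the actual script and SQL schema):

    psql fatcat <<'EOF'
    -- one REPEATABLE READ transaction, so every table sees the same snapshot
    -- (server-side COPY TO a file requires superuser, e.g. the postgres user)
    BEGIN ISOLATION LEVEL REPEATABLE READ;
    COPY release_ident   TO '/tmp/fatcat_ident_releases.tsv';
    COPY creator_ident   TO '/tmp/fatcat_ident_creators.tsv';
    COPY container_ident TO '/tmp/fatcat_ident_containers.tsv';
    -- ...and so on for file, fileset, webcapture, and work idents
    ROLLBACK;
    EOF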

## HOWTO: Entity Dumps

First create the entity ident table dumps (above). Note that *most* of the
metadata will be pinned to the consistent ident dump snapshot transaction
(based on revision references), but "expanded" entities will be fetched at
their most recent version, which may have been updated since the snapshot.
This mostly impacts expanded releases (containers, files).

Dump locally to stdout, e.g.:

    # local/development
    cat /tmp/fatcat_ident_releases.tsv | ./target/debug/fatcat-export releases

Or, in production:

    # production, as 'fatcat' user, in /srv/fatcat/src/rust:
    cat /srv/fatcat/snapshots/fatcat_ident_releases_by_work.tsv | ./target/release/fatcat-export releasebywork --expand files,filesets,webcaptures,container -j8 | pigz > /srv/fatcat/snapshots/release_export_expanded.json.gz
    cat /srv/fatcat/snapshots/fatcat_ident_creators.tsv | ./target/release/fatcat-export creator -j8 | pigz > /srv/fatcat/snapshots/creator_export.json.gz
    cat /srv/fatcat/snapshots/fatcat_ident_containers.tsv | ./target/release/fatcat-export container -j8 | pigz > /srv/fatcat/snapshots/container_export.json.gz
    cat /srv/fatcat/snapshots/fatcat_ident_files.tsv | ./target/release/fatcat-export file -j8 | pigz > /srv/fatcat/snapshots/file_export.json.gz
    cat /srv/fatcat/snapshots/fatcat_ident_filesets.tsv | ./target/release/fatcat-export fileset -j8 | pigz > /srv/fatcat/snapshots/fileset_export.json.gz
    cat /srv/fatcat/snapshots/fatcat_ident_webcaptures.tsv | ./target/release/fatcat-export webcapture -j8 | pigz > /srv/fatcat/snapshots/webcapture_export.json.gz

As of March 2021, running all of these entity dumps serially takes almost 40
hours, which is pretty slow.
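
A rough sanity check after a dump run is to compare line counts between an
ident snapshot and the corresponding export (counts can differ somewhat due to
redirect and deleted idents):

    wc -l /srv/fatcat/snapshots/fatcat_ident_creators.tsv
    zcat /srv/fatcat/snapshots/creator_export.json.gz | wc -l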

## HOWTO: Dump abstracts, release identifiers, file hashes, etc

These are run as regular old commands, and can run across the network in a
couple of different ways. We might not want database ports open to the network
(even cluster/VPN); on the other hand we could probably do SSH port forwarding
anyway (see the sketch below).

    # Locally, or client running on a remote machine
    psql fatcat < dump_abstracts.sql | egrep -v ^BEGIN$ | egrep -v ^ROLLBACK$ | pv -l | pigz > abstracts.json.gz

    # Run on database server, write to file on remote host
    psql fatcat < dump_abstracts.sql | egrep -v ^BEGIN$ | egrep -v ^ROLLBACK$ | pv -l | pigz | ssh user@host 'cat > abstracts.json.gz'
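
The SSH port forwarding variant would look something like the following sketch
(host name, local port, and postgres client auth config are assumptions):

    # forward a local port to PostgreSQL on the database server
    ssh -N -L 5433:localhost:5432 user@db-host &

    # then run the same dump command against the forwarded port
    psql --host localhost --port 5433 fatcat < dump_abstracts.sql | egrep -v ^BEGIN$ | egrep -v ^ROLLBACK$ | pv -l | pigz > abstracts.json.gz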

In production:

    sudo -u postgres psql fatcat_prod < dump_abstracts.sql | egrep -v ^BEGIN$ | egrep -v ^ROLLBACK$ | pv -l | pigz > /srv/fatcat/snapshots/abstracts.json.gz
    sudo -u postgres psql fatcat_prod < dump_file_hashes.sql | egrep -v ^BEGIN$ | egrep -v ^ROLLBACK$ | pv -l | pigz > /srv/fatcat/snapshots/file_hashes.tsv.gz
    sudo -u postgres psql fatcat_prod < dump_release_extid.sql | egrep -v ^BEGIN$ | egrep -v ^ROLLBACK$ | pv -l | pigz > /srv/fatcat/snapshots/release_extid.tsv.gz

## HOWTO: Full ("private") database backup

    export DATESLUG="`date +%Y-%m-%d.%H%M%S`"
    time sudo -u postgres pg_dump --verbose --format=tar fatcat_prod | pigz > /srv/fatcat/snapshots/fatcat_full_dbdump_${DATESLUG}.tar.gz

In production (as of 2019-04-24, with a 283 GByte PostgreSQL database), this
takes 1h40m. As of 2021-03-06, with a 705.19 GByte PostgreSQL database (172
GByte compressed dump), it takes 2h40m.

NOTE: by using the "directory" export format (along with `--file`) instead of
"tar", it would be possible to dump in parallel. However, this would put
additional load on both the database and the underlying disk, and could also
cause issues with users/permissions.
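
Such a parallel directory-format dump would look roughly like the following
sketch (job count and output path are assumptions; the postgres user would
need write access to the output directory):

    export DATESLUG="`date +%Y-%m-%d.%H%M%S`"
    # --format=directory supports --jobs for parallel dumping
    time sudo -u postgres pg_dump --verbose --format=directory --jobs=4 --file=/srv/fatcat/snapshots/fatcat_full_dbdump_${DATESLUG} fatcat_prod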

## HOWTO: Restore full database backup

To restore, CAREFULLY, run:

    sudo -u postgres pg_restore --clean --if-exists --create --exit-on-error --dbname=postgres DUMP_FILE.tar

Parallel restore (`--jobs=N`) would speed this up, but it only works with
custom or directory format archives, not tar.

Or, in production:

    sudo su postgres
    time zcat fatcat_full_dbdump_2020-02-02.022209.tar.gz  | pg_restore --exit-on-error --clean --if-exists --dbname fatcat_prod

In QA (as of 2019-12-13 dump), this takes about 8 hours.

To just inspect a dump:

    zcat DUMP_FILE.tar.gz | pg_restore -l
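
For example, to see which tables have data included in the archive:

    zcat DUMP_FILE.tar.gz | pg_restore -l | grep "TABLE DATA"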

## HOWTO: Public database dump

This dump will contain all tables in the backend schema, except for "private"
authentication tables. For local or non-production machines, you might need to
replace the `fatcat_prod` database name.

Note that prior to 2022, public dumps used `--format=tar`, but that resulted
in many temporary files being written to disk, causing unnecessary load on the
server.

    # TODO: for production, probably want consistent serialization mode
    export DATESLUG="`date +%Y-%m-%d.%H%M%S`"
    sudo -u postgres pg_dump --verbose --format=custom --exclude-table-data=auth_oidc fatcat_prod > /srv/fatcat/snapshots/fatcat_public_dbdump_${DATESLUG}.pgdump

Can also run using the remote/SSH options above.
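
One way to double-check that no private table data ended up in a public dump
is to grep the archive's table of contents; the table name below matches the
`--exclude-table-data` flag above, and there should be no "TABLE DATA" entry
for it:

    pg_restore -l /srv/fatcat/snapshots/fatcat_public_dbdump_${DATESLUG}.pgdump | grep auth_oidc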

## Uploading to Internet Archive

The `ia_exports_item_readme.md` and `ia_sqldump_item_readme.md` files should
be uploaded as the item `README.md`, as appropriate:

    ia upload fatcat_bulk_exports_YYYY-MM-DD ia_exports_item_readme.md --remote-name=README.md -m collection:fatcat_snapshots_and_exports -m mediatype:data -m creator:"Internet Archive Web Group" -m date:YYYY-MM-DD -m title:"Fatcat Bulk Metadata Exports (YYYY-MM-DD)"
    ia upload fatcat_sqldump_public_YYYY-MM-DD ia_sqldump_item_readme.md --remote-name=README.md -m collection:fatcat_snapshots_and_exports -m mediatype:data -m creator:"Internet Archive Web Group" -m date:YYYY-MM-DD -m title:"Fatcat Public Database Snapshot (YYYY-MM-DD)"

Then upload the content:

    ia upload fatcat_bulk_exports_YYYY-MM-DD --no-derive *.gz

Uploads should use `--no-derive` to save cluster time.
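
The sqldump item content would presumably be uploaded the same way (exactly
which dump files belong in that item is an assumption here):

    ia upload fatcat_sqldump_public_YYYY-MM-DD --no-derive fatcat_public_dbdump_*.pgdump abstracts.json.gz file_hashes.tsv.gz release_extid.tsv.gz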

Metadata should be set as:

- item name: `fatcat_bulk_exports_YYYY-MM-DD` or `fatcat_sqldump_public_YYYY-MM-DD` (or sometimes `fatcat_sqldump_full`)
- collection: `fatcat_snapshots_and_exports`
- creator: `Internet Archive Web Group`
- date: the date the dump started (UTC)
- title: "Fatcat Bulk Metadata Exports (YYYY-MM-DD)" or "Fatcat Public Database Snapshot (YYYY-MM-DD)"
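
After upload, item metadata can be reviewed (or corrected) with the `ia` tool:

    # fetch current item metadata as JSON
    ia metadata fatcat_sqldump_public_YYYY-MM-DD

    # fix a metadata field after the fact
    ia metadata fatcat_sqldump_public_YYYY-MM-DD --modify="title:Fatcat Public Database Snapshot (YYYY-MM-DD)"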

## HOWTO: Upload refcat to archive.org

    ia upload refcat_YYYY-MM-DD -m date:YYYY-MM-DD -m collection:fatcat_snapshots_and_exports -m mediatype:data -m title:"refcat Citation Graph Dataset" daterefcat-brefcombined-YYYY-MM-DD.json.zst