# Bulk Exports

There are several types of bulk exports and database dumps folks might be
interested in:

- raw, native-format database backups: for disaster recovery (would include
  volatile/unsupported schema details, user API credentials, full history,
  in-process edits, comments, etc)
- a sanitized version of the above: roughly per-table dumps of the full state
  of the database. Could use per-table SQL expressions with sub-queries to pull
  in small tables ("partial transform") and export JSON for each table; would
  be extra work to maintain, so not pursuing for now.
- full history, full public schema exports, in a form that might be used to
  mirror or entirely fork the project. Propose supplying the full "changelog"
  in API schema format, in a single file to capture all entity history, without
  "hydrating" any inter-entity references. Rely on separate dumps of
  non-entity, non-versioned tables (editors, abstracts, etc). Note that a
  variant of this could use the public interface, in particular to do
  incremental updates (though that wouldn't capture schema changes).
- transformed exports of the current state of the database (aka, without
  history). Useful for data analysis, search engines, etc. Propose supplying
  just the Release table in a fully "hydrated" state to start. Unclear if this
  should be on a work or release basis; will go with release for now. Harder to
  do using the public interface because of the need for transaction locking.

## Identifier Snapshots

One form of bulk export is a fast, consistent (single database transaction)
snapshot of all "live" entity identifiers and their current revisions. This
snapshot can be used by non-blocking background scripts to generate full bulk
exports that will be consistent.

These exports are generated by the `./extra/sql_dumps/ident_table_snapshot.sh`
script, run on a primary database machine, and result in a single tarball,
which gets uploaded to archive.org. The format is TSV (tab-separated). Unlike
all other dumps and public formats, the fatcat identifiers in these dumps are
in raw UUID format (not base32-encoded).

A variant of these dumps additionally includes external identifiers, resulting
in files that map, e.g., (release ID, DOI, PubMed identifiers, Wikidata QID).
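
As a rough illustration, a downstream consumer might convert the raw UUIDs in a
snapshot row back into the base32-encoded form used elsewhere. This is only a
sketch: the column layout and the encoding details (lowercase RFC 4648 base32
of the 16 UUID bytes, with padding stripped) are assumptions here, not a
specification.

```python
import base64
import uuid

def uuid_to_fcid(raw_uuid: str) -> str:
    """Convert a raw UUID (as found in the snapshot TSV) into the
    26-character lowercase base32 form used in the public API.
    The exact encoding choice here is an assumption."""
    b32 = base64.b32encode(uuid.UUID(raw_uuid).bytes)
    return b32.decode("ascii").lower().rstrip("=")

# Hypothetical snapshot row: identifier UUID and revision UUID, tab-separated
line = "86daea5b-1b6b-432a-bb67-ea97795f80fe\t7d97e98f-8af7-465a-a661-83a2b9b6e83b\n"
ident, rev = line.rstrip("\n").split("\t")[:2]
print(uuid_to_fcid(ident))
```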

## Abstract Table Dumps

The `./extra/sql_dumps/dump_abstracts.sql` file, when run from the primary
database machine, outputs all raw abstract strings in JSON format,
one-object-per-line.
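
Since each line is a standalone JSON object, the dump can be consumed with a
simple streaming reader. A minimal sketch, with the field names ("sha1",
"content") assumed rather than taken from the actual dump schema:

```python
import json
import sys

# Stream a one-object-per-line abstracts dump and index content by hash.
# The field names below are assumptions about the dump layout.
abstracts_by_hash = {}
with open(sys.argv[1], "r") as f:
    for line in f:
        obj = json.loads(line)
        abstracts_by_hash[obj["sha1"]] = obj["content"]

print(f"loaded {len(abstracts_by_hash)} abstracts")
```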

Abstracts are immutable and referenced by hash in the database, so the
consistency of these dumps is not as much of a concern as with other exports.
See the [Policy](./policy.md) page for more context around abstract exports.

## "Expanded" Entity Dumps

Using the above identifier snapshots, the `fatcat-export` script outputs
single-entity-per-line JSON files with the same schema as the HTTP API. The
most useful version of these for most users is the "expanded" release export,
which includes container and file metadata.

These exports are compressed and uploaded to archive.org.
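
A hedged sketch of working with one of these dumps: stream the (assumed
gzip-compressed) newline-delimited JSON file and count releases that carry file
metadata. The filename and the use of a "files" field reflect assumptions about
the expanded release export, not a documented contract.

```python
import gzip
import json

count = 0
with_files = 0
# One expanded release entity per line, matching the HTTP API schema
with gzip.open("release_export_expanded.json.gz", "rt") as f:
    for line in f:
        release = json.loads(line)
        count += 1
        if release.get("files"):
            with_files += 1

print(f"{with_files} of {count} releases have file metadata")
```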

## Changelog Entity Dumps

A final export type is changelog dumps. Currently these are implemented in
Python, and anybody can create them. They contain JSON,
one-line-per-changelog-entry, with the full list of entity edits and editgroup
metadata for the given changelog entry. Changelog history is immutable; this
script works by iterating up the (monotonic) changelog counter until it
encounters a 404.
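
A minimal sketch of that approach, assuming the v0 API layout with a
`/changelog/{index}` endpoint (the hostname and exact path here are
assumptions):

```python
import json
import requests

API = "https://api.fatcat.wiki/v0"

def dump_changelog(start_index=1, path="changelog_dump.json"):
    """Walk the monotonic changelog index upwards, writing one JSON entry
    per line, until the API returns a 404 for a not-yet-existing index."""
    index = start_index
    with open(path, "w") as out:
        while True:
            resp = requests.get(f"{API}/changelog/{index}")
            if resp.status_code == 404:
                break
            resp.raise_for_status()
            out.write(json.dumps(resp.json()) + "\n")
            index += 1

if __name__ == "__main__":
    dump_changelog()
```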