Diffstat (limited to 'guide/src/bulk_exports.md')
-rw-r--r--  guide/src/bulk_exports.md  50
1 file changed, 48 insertions, 2 deletions
diff --git a/guide/src/bulk_exports.md b/guide/src/bulk_exports.md
index 0aac4475..21cb8226 100644
--- a/guide/src/bulk_exports.md
+++ b/guide/src/bulk_exports.md
@@ -1,8 +1,9 @@
# Bulk Exports
-There are a few different database dump formats folks might want:
+There are several types of bulk exports and database dumps folks might be
+interested in:
-- raw native database backups, for disaster recovery (would include
+- raw, native-format database backups: for disaster recovery (would include
volatile/unsupported schema details, user API credentials, full history,
in-process edits, comments, etc)
- a sanitized version of the above: roughly per-table dumps of the full state
@@ -21,3 +22,48 @@ There are a few different database dump formats folks might want:
just the Release table in a fully "hydrated" state to start. Unclear if
should be on a work or release basis; will go with release for now. Harder to
do using public interface because of the need for transaction locking.
+
+## Identifier Snapshots
+
+One form of bulk export is a fast, consistent (single database transaction)
+snapshot of all "live" entity identifiers and their current revisions. This
+snapshot can be used by non-blocking background scripts to generate full bulk
+exports that will be consistent.
+
+These exports are generated by the `./extra/sql_dumps/ident_table_snapshot.sh`
+script, run on a primary database machine, and result in a single tarball,
+which gets uploaded to archive.org. The format is TSV (tab-separated). Unlike
+all other dumps and public formats, the fatcat identifiers in these dumps are
+in raw UUID format (not base32-encoded).
+
+A variant of these dumps also includes external identifiers, producing files
+that map, e.g., (release ID, DOI, PubMed identifiers, Wikidata QID).
+
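Because the snapshot files carry raw UUIDs, they usually need converting before being joined against other dumps. The sketch below assumes that the base32 form is the lowercase, unpadded RFC 4648 encoding of the 16 UUID bytes; that detail is not stated on this page, and the helper names `uuid_to_fcid` and `fcid_to_uuid` are hypothetical.

```python
import base64
import uuid


def uuid_to_fcid(raw: str) -> str:
    """Encode a raw UUID string as a 26-character lowercase base32 identifier.

    Assumption: identifiers are unpadded, lowercase RFC 4648 base32.
    """
    # 16 bytes encode to 32 base32 characters, 6 of which are '=' padding.
    encoded = base64.b32encode(uuid.UUID(raw).bytes).decode("ascii")
    return encoded.rstrip("=").lower()


def fcid_to_uuid(fcid: str) -> str:
    """Decode a base32 identifier back to a canonical UUID string."""
    # Restore the padding stripped during encoding so b32decode accepts it.
    padded = fcid.upper() + "=" * (-len(fcid) % 8)
    return str(uuid.UUID(bytes=base64.b32decode(padded)))
```

Under that assumption the conversion is lossless in both directions, so a snapshot row can be mapped to the public form and back.
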
+## Abstract Table Dumps
+
+The `./extra/sql_dumps/dump_abstracts.sql` file, when run from the primary
+database machine, outputs all raw abstract strings in JSON format,
+one-object-per-line.
+
+Abstracts are immutable and referenced by hash in the database, so the
+consistency of these dumps is not as much of a concern as with other exports.
+See the [Policy](./policy.md) page for more context around abstract exports.
+
+## "Expanded" Entity Dumps
+
+Using the above identifier snapshots, the `fatcat-export` script outputs
+single-entity-per-line JSON files with the same schema as the HTTP API. For
+most users, the most useful of these are the "expanded" release exports, which
+include container and file metadata.
+
+These exports are compressed and uploaded to archive.org.
+
+## Changelog Entity Dumps
+
+A final export type is the changelog dump. Currently these are implemented in
+Python, and anybody can create them. They contain JSON,
+one-line-per-changelog-entry, with the full list of entity edits and editgroup
+metadata for the given changelog entry. Changelog history is immutable; the
+export script works by iterating up the (monotonic) changelog counter until it
+encounters a 404.
+
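The changelog loop described above can be sketched roughly as follows. This is not the actual export script: `fetch` is a hypothetical callable that returns the parsed entry for an index, or `None` where the real script would see a 404, and starting the counter at 1 is an assumption.

```python
import json
from typing import Callable, Iterator, Optional, TextIO


def iter_changelog(
    fetch: Callable[[int], Optional[dict]], start: int = 1
) -> Iterator[dict]:
    """Yield changelog entries in order until the first missing index."""
    index = start
    while True:
        entry = fetch(index)
        if entry is None:  # stands in for an HTTP 404 from the API
            return
        yield entry
        index += 1


def dump_changelog(
    fetch: Callable[[int], Optional[dict]], out: TextIO, start: int = 1
) -> int:
    """Write one JSON object per line; return the number of entries written."""
    count = 0
    for entry in iter_changelog(fetch, start):
        out.write(json.dumps(entry, sort_keys=True) + "\n")
        count += 1
    return count
```

Since entries are immutable and the counter is monotonic, an interrupted run could resume by passing the last written index plus one as `start`.
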