Diffstat (limited to 'guide/src/bulk_exports.md')
-rw-r--r-- | guide/src/bulk_exports.md | 50 |
1 files changed, 48 insertions, 2 deletions
diff --git a/guide/src/bulk_exports.md b/guide/src/bulk_exports.md
index 0aac4475..21cb8226 100644
--- a/guide/src/bulk_exports.md
+++ b/guide/src/bulk_exports.md
@@ -1,8 +1,9 @@
 # Bulk Exports
 
-There are a few different database dump formats folks might want:
+There are several types of bulk exports and database dumps folks might be
+interested in:
 
-- raw native database backups, for disaster recovery (would include
+- raw, native-format database backups: for disaster recovery (would include
   volatile/unsupported schema details, user API credentials, full history,
   in-process edits, comments, etc)
 - a sanitized version of the above: roughly per-table dumps of the full state
@@ -21,3 +22,48 @@ There are a few different database dump formats folks might want:
   just the Release table in a fully "hydrated" state to start. Unclear if should
   be on a work or release basis; will go with release for now. Harder to do using
   public interface because of the need for transaction locking.
+
+## Identifier Snapshots
+
+One form of bulk export is a fast, consistent (single database transaction)
+snapshot of all "live" entity identifiers and their current revisions. This
+snapshot can be used by non-blocking background scripts to generate full bulk
+exports that will be consistent.
+
+These exports are generated by the `./extra/sql_dumps/ident_table_snapshot.sh`
+script, run on a primary database machine, and result in a single tarball,
+which gets uploaded to archive.org. The format is TSV (tab-separated). Unlike
+all other dumps and public formats, the fatcat identifiers in these dumps are
+in raw UUID format (not base32-encoded).
+
+A variant of these dumps is to include external identifiers, resulting in files
+that map, eg, (release ID, DOI, PubMed identifiers, Wikidata QID).
+
+## Abstract Table Dumps
+
+The `./extra/sql_dumps/dump_abstracts.sql` file, when run from the primary
+database machine, outputs all raw abstract strings in JSON format,
+one-object-per-line.
+
+Abstracts are immutable and referenced by hash in the database, so the
+consistency of these dumps is not as much of a concern as with other exports.
+See the [Policy](./policy.md) page for more context around abstract exports.
+
+## "Expanded" Entity Dumps
+
+Using the above identifier snapshots, the `fatcat-export` script outputs
+single-entity-per-line JSON files with the same schema as the HTTP API. The
+most useful version of these for most users are the "expanded" (including
+container and file metadata) release exports.
+
+These exports are compressed and uploaded to archive.org.
+
+## Changelog Entity Dumps
+
+A final export type are changelog dumps. Currently these are implemented in
+python, and anybody can create them. They contain JSON,
+one-line-per-changelog-entry, with the full list of entity edits and editgroup
+metadata for the given changelog entry. Changelog history is immutable; this
+script works by iterating up the (monotonic) changelog counter until it
+encounters a 404.
+
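The changelog dump loop described in the added "Changelog Entity Dumps" text (iterate up the monotonic counter until a 404, writing one JSON object per line) could be sketched roughly as follows. This is not the project's actual script: `iter_changelog`, `dump_changelog`, and the `fetch_entry` callable are hypothetical names, `fetch_entry` is assumed to return `None` where the real API would answer 404, and the starting index of 1 is also an assumption.

```python
import json
from typing import Callable, Iterator, Optional, TextIO


def iter_changelog(
    fetch_entry: Callable[[int], Optional[dict]],
    start: int = 1,  # assumption: the first changelog index is 1
) -> Iterator[dict]:
    """Yield changelog entries from `start` upward until one is missing.

    `fetch_entry` is a hypothetical callable standing in for an HTTP
    request; it returns None where the API would respond with a 404.
    """
    index = start
    while True:
        entry = fetch_entry(index)
        if entry is None:
            # First missing index: the counter is monotonic, so we are done.
            return
        yield entry
        index += 1


def dump_changelog(
    fetch_entry: Callable[[int], Optional[dict]],
    out: TextIO,
    start: int = 1,
) -> int:
    """Write one JSON object per line to `out`; return the number written."""
    count = 0
    for entry in iter_changelog(fetch_entry, start):
        out.write(json.dumps(entry, sort_keys=True) + "\n")
        count += 1
    return count
```

Because changelog history is immutable, a dump made this way can presumably be resumed later by passing the last written index plus one as `start`, rather than re-fetching earlier entries.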