large progress on guide

Don't have enough time to complete and copy-edit this now though.
author: Bryan Newbold <bnewbold@robocracy.org> 2018-09-21 12:33:35 -0700
committer: Bryan Newbold <bnewbold@robocracy.org> 2018-09-21 12:34:06 -0700
commit: 1915c7b885641a34191efeee2cc8525a6ad27b9f (patch)
tree: c26b8a772d8e79689b0b7bf6498590d517717ece /guide/src/bulk_exports.md
parent: a1e5acf125decc0f2af28beca43e91b4085cc3d9 (diff)
download: fatcat-1915c7b885641a34191efeee2cc8525a6ad27b9f.tar.gz
fatcat-1915c7b885641a34191efeee2cc8525a6ad27b9f.zip
1 files changed, 48 insertions, 2 deletions
diff --git a/guide/src/bulk_exports.md b/guide/src/bulk_exports.md
index 0aac4475..21cb8226 100644
--- a/guide/src/bulk_exports.md
+++ b/guide/src/bulk_exports.md
@@ -1,8 +1,9 @@
 # Bulk Exports
 
-There are a few different database dump formats folks might want:
+There are several types of bulk exports and database dumps folks might be
+interested in:
 
-- raw native database backups, for disaster recovery (would include
+- raw, native-format database backups: for disaster recovery (would include
   volatile/unsupported schema details, user API credentials, full history,
   in-process edits, comments, etc)
 - a sanitized version of the above: roughly per-table dumps of the full state
@@ -21,3 +22,48 @@ There are a few different database dump formats folks might want:
   just the Release table in a fully "hydrated" state to start. Unclear if
   should be on a work or release basis; will go with release for now. Harder to
   do using public interface because of the need for transaction locking.
+
+## Identifier Snapshots
+
+One form of bulk export is a fast, consistent (single database transaction)
+snapshot of all "live" entity identifiers and their current revisions. This
+snapshot can be used by non-blocking background scripts to generate full bulk
+exports that will be consistent.
+
+These exports are generated by the `./extra/sql_dumps/ident_table_snapshot.sh`
+script, run on a primary database machine, and result in a single tarball,
+which gets uploaded to archive.org. The format is TSV (tab-separated). Unlike
+all other dumps and public formats, the fatcat identifiers in these dumps are
+in raw UUID format (not base32-encoded).
+
+A variant of these dumps includes external identifiers, resulting in files
+that map, e.g., (release ID, DOI, PubMed identifiers, Wikidata QID).
+
+## Abstract Table Dumps
+
+The `./extra/sql_dumps/dump_abstracts.sql` file, when run from the primary
+database machine, outputs all raw abstract strings in JSON format,
+one-object-per-line.
+
+Abstracts are immutable and referenced by hash in the database, so the
+consistency of these dumps is not as much of a concern as with other exports.
+See the [Policy](./policy.md) page for more context around abstract exports.
+
+## "Expanded" Entity Dumps
+
+Using the above identifier snapshots, the `fatcat-export` script outputs
+single-entity-per-line JSON files with the same schema as the HTTP API. For
+most users, the most useful of these is the "expanded" release export, which
+includes container and file metadata.
+
+These exports are compressed and uploaded to archive.org.
+
+## Changelog Entity Dumps
+
+A final export type is the changelog dump. Currently these are implemented in
+Python, and anybody can create them. They contain JSON,
+one-line-per-changelog-entry, with the full list of entity edits and editgroup
+metadata for the given changelog entry. Changelog history is immutable; the
+export script works by iterating up the (monotonic) changelog counter until it
+encounters a 404.
+
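The "Identifier Snapshots" section in the diff above notes that snapshot files carry raw UUIDs, while other dumps use base32-encoded identifiers. A minimal sketch of converting between the two forms follows. The exact encoding (RFC 4648 alphabet, lowercased, padding stripped) is an assumption rather than something the diff states, and the function names `uuid_to_ident` and `ident_to_uuid` are hypothetical.

```python
import base64
import uuid


def uuid_to_ident(raw):
    """Encode a UUID string as lowercase, unpadded base32.

    Assumption: identifiers use the RFC 4648 base32 alphabet with the
    trailing '=' padding removed; the diff does not spell this out.
    """
    encoded = base64.b32encode(uuid.UUID(raw).bytes).decode("ascii")
    return encoded.rstrip("=").lower()


def ident_to_uuid(ident):
    """Decode a lowercase, unpadded base32 identifier back to a UUID string."""
    # Restore the padding that b32decode requires (length multiple of 8).
    padded = ident.upper() + "=" * (-len(ident) % 8)
    return str(uuid.UUID(bytes=base64.b32decode(padded)))
```

A 16-byte UUID encodes to 26 characters under this scheme, so a snapshot row could be rewritten by applying `uuid_to_ident` to its identifier column.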
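The "Changelog Entity Dumps" section in the diff above describes iterating up the changelog counter until a 404 is returned. That loop can be sketched roughly as below. This is not the actual export script: the base URL, the `/changelog/{index}` path, the starting index of 1, and all function names are assumptions made for illustration.

```python
import json
import urllib.error
import urllib.request

# Assumption: placeholder base URL; substitute the real API endpoint.
API_BASE = "https://api.example.org/v0"


def fetch_entry(index, api_base=API_BASE):
    """Return the changelog entry at `index` as a dict, or None on HTTP 404."""
    # Assumption: the URL layout here is hypothetical.
    url = f"{api_base}/changelog/{index}"
    try:
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise  # any other HTTP error is unexpected; do not treat it as the end


def iter_changelog(fetch=fetch_entry, start=1):
    """Yield entries in counter order, stopping at the first missing index."""
    index = start
    while True:
        entry = fetch(index)
        if entry is None:
            return
        yield entry
        index += 1


def dump_changelog(out, fetch=fetch_entry, start=1):
    """Write one JSON object per line to `out`; return the number written."""
    count = 0
    for entry in iter_changelog(fetch, start):
        out.write(json.dumps(entry, sort_keys=True) + "\n")
        count += 1
    return count
```

Because changelog history is immutable, a later run could pass a larger `start` to resume where a previous dump stopped. The `fetch` parameter is injectable so the loop can be exercised without network access.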