Diffstat (limited to 'guide/src/bulk_exports.md')
-rw-r--r--  guide/src/bulk_exports.md  114
1 file changed, 64 insertions, 50 deletions
diff --git a/guide/src/bulk_exports.md b/guide/src/bulk_exports.md
index 3a9badcb..052b667e 100644
--- a/guide/src/bulk_exports.md
+++ b/guide/src/bulk_exports.md
@@ -3,67 +3,81 @@
There are several types of bulk exports and database dumps folks might be
interested in:
-- raw, native-format database backups: for disaster recovery (would include
- volatile/unsupported schema details, user API credentials, full history,
- in-process edits, comments, etc)
-- a sanitized version of the above: roughly per-table dumps of the full state
- of the database. Could use per-table SQL expressions with sub-queries to pull
- in small tables ("partial transform") and export JSON for each table; would
- be extra work to maintain, so not pursuing for now.
-- full history, full public schema exports, in a form that might be used to
- mirror or entirely fork the project. Propose supplying the full "changelog"
- in API schema format, in a single file to capture all entity history, without
- "hydrating" any inter-entity references. Rely on separate dumps of
- non-entity, non-versioned tables (editors, abstracts, etc). Note that a
- variant of this could use the public interface, in particular to do
- incremental updates (though that wouldn't capture schema changes).
-- transformed exports of the current state of the database (aka, without
- history). Useful for data analysis, search engines, etc. Propose supplying
- just the Release table in a fully "hydrated" state to start. Unclear if
- should be on a work or release basis; will go with release for now. Harder to
- do using public interface because of the need for transaction locking.
+- complete database dumps
+- changelog history with all entity revisions and edit metadata
+- identifier snapshot tables
+- entity exports
+
+All exports and dumps get uploaded to the Internet Archive under the
+[bibliographic metadata](https://archive.org/details/ia_biblio_metadata)
+collection.
+
+## Complete Database Dumps
+
+The simplest and most complete bulk export. Useful for disaster recovery,
+mirroring, or forking the entire service. The internal database schema is not
+stable, so these dumps are not as useful for longitudinal analysis. They
+include edits-in-progress, deleted entities, old revisions, etc, which are
+potentially difficult or impossible to fetch through the API.
+
+Public copies may have some tables redacted (eg, API credentials).
+
+Dumps are in PostgreSQL `pg_dump` "tar" binary format, and can be restored
+locally with the `pg_restore` command. See `./extra/sql_dumps/` for commands
+and details. Dumps are on the order of 100 GBytes (compressed) and will grow
+over time.
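+
+As a rough sketch (not the project's documented procedure; see
+`./extra/sql_dumps/` for that), a local restore with standard PostgreSQL
+client tools might look like the following, with placeholder file and
+database names:
+
+```python
+# Hypothetical example: restore a complete dump into a fresh local database.
+# The dump filename and database name are placeholders, not real artifact names.
+import subprocess
+
+# create an empty local database to restore into
+subprocess.run(["createdb", "fatcat_restore"], check=True)
+
+# pg_restore understands the pg_dump "tar" archive format directly
+subprocess.run(
+    ["pg_restore", "--dbname=fatcat_restore", "--no-owner", "fatcat_dump.tar"],
+    check=True,
+)
+```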
+
+## Changelog History
+
+These are currently unimplemented; they would involve "hydrating" sub-entities
+into changelog exports. Useful for some mirrors, and for analysis that needs to
+track provenance information. The format would be the public API schema (JSON).
+
+All information in these dumps should be possible to fetch via the public API,
+including on a feed/streaming basis using the sequential changelog index. All
+information is also contained in the database dumps.
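+
+As a sketch of the feed/streaming approach, a client could walk the
+sequential changelog index over the public API until it hits a 404; the base
+URL and endpoint path below are assumptions for illustration:
+
+```python
+# Sketch: fetch changelog entries in order over the public API, stopping at
+# the first 404 (the changelog index is monotonic). Not official client code.
+import json
+import requests
+
+API_BASE = "https://api.fatcat.wiki/v0"  # assumed public API base URL
+
+def iter_changelog(start_index=1):
+    index = start_index
+    while True:
+        resp = requests.get(f"{API_BASE}/changelog/{index}", timeout=30)
+        if resp.status_code == 404:
+            return  # past the end of the changelog
+        resp.raise_for_status()
+        yield resp.json()
+        index += 1
+
+for entry in iter_changelog():
+    print(json.dumps(entry))
+```
+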
## Identifier Snapshots
-One form of bulk export is a fast, consistent (single database transaction)
-snapshot of all "live" entity identifiers and their current revisions. This
-snapshot can be used by non-blocking background scripts to generate full bulk
-exports that will be consistent.
+Many of the other dump formats are very large. To save time and bandwidth, a
+few simple snapshot tables can be exported directly in TSV format. Because
+these tables can be dumped in single SQL transactions, they are consistent
+point-in-time snapshots.
-These exports are generated by the `./extra/sql_dumps/ident_table_snapshot.sh`
-script, run on a primary database machine, and result in a single tarball,
-which gets uploaded to archive.org. The format is TSV (tab-separated). Unlike
-all other dumps and public formats, the fatcat identifiers in these dumps are
-in raw UUID format (not base32-encoded).
+One format consists of per-entity identifier/revision tables. These contain
+active, deleted, and redirected identifiers, with revision and redirect
+references, and are used to generate the entity dumps below.
-A variant of these dumps is to include external identifiers, resulting in files
-that map, eg, (release ID, DOI, PubMed identifiers, Wikidata QID).
+Other tables contain external identifier mappings or file hashes.
-## Abstract Table Dumps
+Release abstracts can be dumped in their own table (JSON format), allowing them
+to be included only by reference from other dumps. The copyright status and
+usage restrictions on abstracts are different from other catalog content; see
+the [policy](./policy.md) page for more context. Abstracts are immutable and
+referenced by hash in the database, so the consistency of these dumps is not as
+much of a concern as with other exports.
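+
+For example, a consumer might load this table into a local lookup keyed by
+hash, so that other dumps which reference abstracts by hash can be joined
+back together; the filename and field names below are assumptions, not a
+documented contract:
+
+```python
+# Sketch: build an in-memory hash -> abstract lookup from a JSON-lines dump.
+# "abstracts.json", "sha1", and "content" are illustrative names only.
+import json
+
+abstracts = {}
+with open("abstracts.json") as f:
+    for line in f:
+        row = json.loads(line)
+        abstracts[row["sha1"]] = row["content"]
+
+print(len(abstracts), "abstracts loaded")
+```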
-The `./extra/sql_dumps/dump_abstracts.sql` file, when run from the primary
-database machine, outputs all raw abstract strings in JSON format,
-one-object-per-line.
+Unlike all other dumps and public formats, the Fatcat identifiers in these
+dumps are in raw UUID format (not base32-encoded), though this may be fixed in
+the future.
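+
+A sketch of handling this on the consumer side, reading one of the TSV
+snapshots and converting raw UUIDs to the base32 form used by the API and web
+interface (the column layout and the exact encoding convention are
+assumptions here):
+
+```python
+# Sketch: read an identifier snapshot TSV and convert raw UUIDs to the
+# lowercase, unpadded base32 form used elsewhere. Filename and column order
+# are illustrative only.
+import base64
+import csv
+import uuid
+
+def uuid_to_fcid(raw: str) -> str:
+    # assumed encoding: RFC 4648 base32 of the 16 UUID bytes, lowercased,
+    # with trailing "=" padding stripped
+    return base64.b32encode(uuid.UUID(raw).bytes).decode("ascii").lower().rstrip("=")
+
+with open("release_ident_snapshot.tsv") as f:
+    for row in csv.reader(f, delimiter="\t"):
+        raw_ident, rev_id = row[0], row[1]
+        print(uuid_to_fcid(raw_ident), rev_id)
+```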
-Abstracts are immutable and referenced by hash in the database, so the
-consistency of these dumps is not as much of a concern as with other exports.
-See the [Policy](./policy.md) page for more context around abstract exports.
+See `./extra/sql_dumps/` for scripts and details. Dumps are on the order of a
+couple GBytes each (compressed).
-## "Expanded" Entity Dumps
+## Entity Exports
-Using the above identifier snapshots, the `fatcat-export` script outputs
-single-entity-per-line JSON files with the same schema as the HTTP API. The
-most useful version of these for most users are the "expanded" (including
-container and file metadata) release exports.
+Using the above identifier snapshots, the Rust `fatcat-export` program outputs
+single-entity-per-line JSON files with the same schema as the HTTP API. These
+might contain the default fields, or be in "expanded" format containing
+sub-entities for each record.
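+
+A sketch of consuming one of these exports, streaming one JSON entity per
+line from a compressed dump; the filename and the specific fields accessed
+are assumptions:
+
+```python
+# Sketch: stream a gzip-compressed, one-JSON-object-per-line entity export
+# without loading it all into memory. Names are illustrative only.
+import gzip
+import json
+
+with gzip.open("release_export_expanded.json.gz", "rt") as f:
+    for line in f:
+        release = json.loads(line)
+        print(release.get("ident"), release.get("title"))
+```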
-These exports are compressed and uploaded to archive.org.
+Only "active" entities are included (not deleted, work-in-progress, or
+redirected entities).
-## Changelog Entity Dumps
+The `./rust/README.export.md` file has more context.
-A final export type are changelog dumps. Currently these are implemented in
-python, and anybody can create them. They contain JSON,
-one-line-per-changelog-entry, with the full list of entity edits and editgroup
-metadata for the given changelog entry. Changelog history is immutable; this
-script works by iterating up the (monotonic) changelog counter until it
-encounters a 404.
+These dumps can be quite large when expanded (over 100 GBytes compressed), but
+they do not include history, so they will not grow as fast as other exports
+over time. Not all entity types are dumped at the moment; if you would like
+specific dumps, get in touch!