-rw-r--r-- | guide/src/bulk_exports.md | 114 |
1 files changed, 64 insertions, 50 deletions
diff --git a/guide/src/bulk_exports.md b/guide/src/bulk_exports.md
index 3a9badcb..052b667e 100644
--- a/guide/src/bulk_exports.md
+++ b/guide/src/bulk_exports.md
@@ -3,67 +3,81 @@
 There are several types of bulk exports and database dumps folks might be
 interested in:
 
-- raw, native-format database backups: for disaster recovery (would include
-  volatile/unsupported schema details, user API credentials, full history,
-  in-process edits, comments, etc)
-- a sanitized version of the above: roughly per-table dumps of the full state
-  of the database. Could use per-table SQL expressions with sub-queries to pull
-  in small tables ("partial transform") and export JSON for each table; would
-  be extra work to maintain, so not pursuing for now.
-- full history, full public schema exports, in a form that might be used to
-  mirror or entirely fork the project. Propose supplying the full "changelog"
-  in API schema format, in a single file to capture all entity history, without
-  "hydrating" any inter-entity references. Rely on separate dumps of
-  non-entity, non-versioned tables (editors, abstracts, etc). Note that a
-  variant of this could use the public interface, in particular to do
-  incremental updates (though that wouldn't capture schema changes).
-- transformed exports of the current state of the database (aka, without
-  history). Useful for data analysis, search engines, etc. Propose supplying
-  just the Release table in a fully "hydrated" state to start. Unclear if
-  should be on a work or release basis; will go with release for now. Harder to
-  do using public interface because of the need for transaction locking.
+- complete database dumps
+- changelog history with all entity revisions and edit metadata
+- identifier snapshot tables
+- entity exports
+
+All exports and dumps get uploaded to the Internet Archive under the
+[bibliographic metadata](https://archive.org/details/ia_biblio_metadata)
+collection.
+
+## Complete Database Dumps
+
+The most simple and complete bulk export. Useful for disaster recovery,
+mirroring, or forking the entire service. The internal database schema is not
+stable, so not as useful for longitudinal analysis. These dumps will include
+edits-in-progress, deleted entities, old revisions, etc, which are potentially
+difficult or impossible to fetch through the API.
+
+Public copies may have some tables redacted (eg, API credentials).
+
+Dumps are in PostgreSQL `pg_dump` "tar" binary format, and can be restored
+locally with the `pg_restore` command. See `./extra/sql_dumps/` for commands
+and details. Dumps are on the order of 100 GBytes (compressed) and will grow
+over time.
+
+## Changelog History
+
+These are currently unimplemented; would involve "hydrating" sub-entities into
+changelog exports. Useful for some mirrors, and analysis that needs to track
+provenance information. Format would be the public API schema (JSON).
+
+All information in these dumps should be possible to fetch via the public API,
+including on a feed/streaming basis using the sequential changelog index. All
+information is also contained in the database dumps.
 
 ## Identifier Snapshots
 
-One form of bulk export is a fast, consistent (single database transaction)
-snapshot of all "live" entity identifiers and their current revisions. This
-snapshot can be used by non-blocking background scripts to generate full bulk
-exports that will be consistent.
+Many of the other dump formats are very large. To save time and bandwidth, a
+few simple snapshot tables can be exported directly in TSV format. Because
+these tables can be dumped in single SQL transactions, they are consistent
+point-in-time snapshots.
 
-These exports are generated by the `./extra/sql_dumps/ident_table_snapshot.sh`
-script, run on a primary database machine, and result in a single tarball,
-which gets uploaded to archive.org. The format is TSV (tab-separated). Unlike
-all other dumps and public formats, the fatcat identifiers in these dumps are
-in raw UUID format (not base32-encoded).
+One format is per-entity identifier/revision tables. These contain active,
+deleted, and redirected identifiers, with revision and redirect references, and
+are used to generate the entity dumps below.
 
-A variant of these dumps is to include external identifiers, resulting in files
-that map, eg, (release ID, DOI, PubMed identifiers, Wikidata QID).
+Other tables contain external identifier mappings or file hashes.
 
-## Abstract Table Dumps
+Release abstracts can be dumped in their own table (JSON format), allowing them
+to be included only by reference from other dumps. The copyright status and
+usage restrictions on abstracts are different from other catalog content; see
+the [policy](./policy.md) page for more context. Abstracts are immutable and
+referenced by hash in the database, so the consistency of these dumps is not as
+much of a concern as with other exports.
 
-The `./extra/sql_dumps/dump_abstracts.sql` file, when run from the primary
-database machine, outputs all raw abstract strings in JSON format,
-one-object-per-line.
+Unlike all other dumps and public formats, the Fatcat identifiers in these
+dumps are in raw UUID format (not base32-encoded), though this may be fixed in
+the future.
 
-Abstracts are immutable and referenced by hash in the database, so the
-consistency of these dumps is not as much of a concern as with other exports.
-See the [Policy](./policy.md) page for more context around abstract exports.
+See `./extra/sql_dumps/` for scripts and details. Dumps are on the order of a
+couple GBytes each (compressed).
 
-## "Expanded" Entity Dumps
+## Entity Exports
 
-Using the above identifier snapshots, the `fatcat-export` script outputs
-single-entity-per-line JSON files with the same schema as the HTTP API. The
-most useful version of these for most users are the "expanded" (including
-container and file metadata) release exports.
+Using the above identifier snapshots, the Rust `fatcat-export` program outputs
+single-entity-per-line JSON files with the same schema as the HTTP API. These
+might contain the default fields, or be in "expanded" format containing
+sub-entities for each record.
 
-These exports are compressed and uploaded to archive.org.
+Only "active" entities are included (not deleted, work-in-progress, or
+redirected entities).
 
-## Changelog Entity Dumps
+The `./rust/README.export.md` file has more context.
 
-A final export type are changelog dumps. Currently these are implemented in
-python, and anybody can create them. They contain JSON,
-one-line-per-changelog-entry, with the full list of entity edits and editgroup
-metadata for the given changelog entry. Changelog history is immutable; this
-script works by iterating up the (monotonic) changelog counter until it
-encounters a 404.
+These dumps can be quite large when expanded (over 100 GBytes compressed), but
+do not include history so will not grow as fast as other exports over time. Not
+all entity types are dumped at the moment; if you would like specific dumps get
+in touch!
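
The guide text in this diff notes that identifier snapshot tables carry raw UUIDs rather than the base32-encoded identifiers used elsewhere. As a rough sketch of how a consumer might bridge the two forms, the Python below assumes the public identifier is the lowercase, unpadded RFC 4648 base32 encoding of the 16 UUID bytes (26 characters). That encoding is an assumption, not something this diff confirms, and the function names `uuid_to_ident` and `ident_to_uuid` are hypothetical.

```python
import base64
import uuid


def uuid_to_ident(raw_uuid: str) -> str:
    """Encode a raw UUID string as a 26-character lowercase base32 identifier.

    Assumption: identifiers are unpadded RFC 4648 base32 of the UUID bytes.
    """
    raw_bytes = uuid.UUID(raw_uuid).bytes  # always 16 bytes
    encoded = base64.b32encode(raw_bytes)  # 26 data chars + 6 '=' padding chars
    return encoded[:26].lower().decode("ascii")


def ident_to_uuid(ident: str) -> str:
    """Decode a 26-character base32 identifier back to a canonical UUID string."""
    if len(ident) != 26:
        raise ValueError("expected a 26-character identifier")
    padded = ident.upper() + "======"  # restore the padding base32 requires
    return str(uuid.UUID(bytes=base64.b32decode(padded)))
```

Under that assumption, applying `uuid_to_ident` to the identifier column of a snapshot TSV row would give values comparable with the identifiers in the JSON entity exports.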