author | Bryan Newbold <bnewbold@robocracy.org> | 2022-04-07 14:44:01 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@robocracy.org> | 2022-04-07 14:44:01 -0700 |
commit | ede98644a89afd15d903061e0998dbd08851df6d (patch) | |
tree | 17c54c5764adb2f5d67aa750174f635e0fb1cdc8 /guide | |
parent | 2ef72e0c769e94401568ab42def30ddb5268fa98 (diff) | |
parent | 0aaa2a839d7a14716ee1a84b730203a7953dc5e0 (diff) | |
Merge branch 'bnewbold-dataset-ingest-fixes'
Diffstat (limited to 'guide')
-rw-r--r-- | guide/src/entity_file.md | 4 |
-rw-r--r-- | guide/src/entity_fileset.md | 25 |
2 files changed, 20 insertions, 9 deletions
diff --git a/guide/src/entity_file.md b/guide/src/entity_file.md
index 84d9eac4..6a11e945 100644
--- a/guide/src/entity_file.md
+++ b/guide/src/entity_file.md
@@ -26,6 +26,10 @@
   many articles), and that a release will often have multiple files (differing
   only by watermarks, or different digitizations of the same printed work, or
   variant MIME/media types of the same published work).
+- `extra` (object with string keys): additional metadata about this file
+    - `path`: filename, with optional path prefix. path must be "relative", not
+      "absolute", and should use UNIX-style forward slashes, not Windows-style
+      backward slashes
 
 #### URL `rel` Vocabulary
 
diff --git a/guide/src/entity_fileset.md b/guide/src/entity_fileset.md
index 6083a09d..8a9ea832 100644
--- a/guide/src/entity_fileset.md
+++ b/guide/src/entity_fileset.md
@@ -10,16 +10,17 @@
     - `sha1` (string): SHA-1 hash in lower-case hex
     - `sha256` (string): SHA-256 hash in lower-case hex
     - `mimetype` (string): Content type in MIME type schema
-    - `extra` (object): any extra metadata about this specific file
-        - `original_url`: live web canonical URL to download this file (optional)
-        - `webarchive_url`: web archive capture of this file (optional)
-        - `platform_id`: platform-specific identifier for this file
+    - `extra` (object): any extra metadata about this specific file. all are
+      optional
+        - `original_url`: live web canonical URL to download this file
+        - `webarchive_url`: web archive capture of this file
 - `urls`: An array of "typed" URLs. Order is not meaningful, and may not be
-  preserved.
+  preserved. These are URLs for the entire fileset, not individual files.
     - `url` (string, required):
       Eg: "https://example.edu/~frau/prcding.pdf".
     - `rel` (string, required):
-      Eg: "webarchive".
+      Eg: "archive-base", "webarchive".
+
 - `release_ids` (array of string identifiers): references to `release` entities
 - `content_scope` (string): for situations where the fileset does not simply
   contain the full representation of a work (eg, all files in dataset, for a
@@ -27,12 +28,18 @@
   vocabulary as File entity.
 - `extra` (object with string keys): additional metadata about this group of
   files, including upstream platform-specific metadata and identifiers
+    - `platform_id`: platform-specific identifier for this fileset
 
 #### URL `rel` types
 
-- `repository`: URL of a live-web landing page or other location where content can be
-  found. May not be machine-reachable.
-- `webarchive`: web archive version of `repository`
+Any ending in "-base" implies that a file path (from the manifest) can be
+appended to the "base" URL to get a file download URL. Any "bundle" implies a
+direct link to an archive or "bundle" (like `.zip` or `.tar`) which contains
+all the files in this fileset
+
+- `repository` or `platform` or `web`: URL of a live-web landing page or other
+  location where content can be found. May or may not be machine-reachable.
+- `webarchive`: web archive version of `repository` landing page
 - `repository-bundle`: direct URL to a live-web "archive" file, such as `.zip`,
   which contains all of the individual files in this fileset
 - `webarchive-bundle`: web archive version of `repository-bundle`
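
The two rules this change documents (manifest paths must be relative with forward slashes, and a `rel` ending in "-base" means a manifest path can be appended to the URL) can be illustrated with a rough Python sketch. This is not code from the fatcat codebase: the function names `is_relative_unix_path` and `file_download_urls`, the plain-dict fileset shape, the `example.org` URLs, and the choice to join with a single `/` and percent-encode the path are all assumptions made for illustration.

```python
from urllib.parse import quote


def is_relative_unix_path(path: str) -> bool:
    """Check the documented path rules: relative, UNIX-style forward slashes.

    Hypothetical helper; it only covers the two rules stated in the guide.
    """
    if not path or path.startswith("/"):
        return False  # empty or absolute path
    if "\\" in path:
        return False  # Windows-style backward slash
    return True


def file_download_urls(fileset: dict) -> list:
    """Build per-file URLs from every "-base" URL and the manifest.

    Assumption: the base URL and the path are joined with exactly one "/".
    """
    result = []
    for entry in fileset.get("urls", []):
        if not entry["rel"].endswith("-base"):
            continue  # bundles and landing pages are not per-file bases
        base = entry["url"].rstrip("/")
        for item in fileset.get("manifest", []):
            if is_relative_unix_path(item["path"]):
                # quote() keeps "/" unescaped by default
                result.append(base + "/" + quote(item["path"]))
    return result


example = {
    "manifest": [
        {"path": "data/table 1.csv", "size": 120},
        {"path": "/etc/passwd", "size": 1},  # absolute: skipped
    ],
    "urls": [
        {"url": "https://example.org/dataset/", "rel": "archive-base"},
        {"url": "https://example.org/dataset.zip", "rel": "repository-bundle"},
    ],
}
print(file_download_urls(example))
```

Only the `archive-base` entry contributes URLs here; the `repository-bundle` entry points at a single archive containing all files, so nothing is appended to it.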