summaryrefslogtreecommitdiffstats
path: root/notes/fileset_webcapture.txt
blob: e13222d9b88216c90b32a57b957598a957f09a6b (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29

## fileset

Constraints:
- limit to 200 files per set, to start. This to work around very large metadata
  sizes and >1 MByte JSON API blobs
- must have a complete manifest for at least one hash type (of md5, sha1,
  sha256)

Could end up separating manifest into a separate redirect, like abstracts, to
reduce database size. Could also store as a single giant JSONB blob, like
planned for citations, to get better compression. These denormlization steps
can happen later as performance/resource optimizations.

Would like to handle things like git repositories of code, git-annex datasets,
dat archives, and torrents. Some options:

- store git URL + commit in release metadata, with no file/fileset. This ties
  release with a specific version well, but breaks semantics of data model
  (artifact/metadata separation)
- store a full file manifest (or just the important files) and full URLs; maybe
  version/commit as extra but not in URL?
- store "stub" FileSet with no manifest, git version/commit in extra (or as a
  new column?), and locations in URL list

## webcapture

Constraints
- also limit to 200 lines, same as with fileset