author: Bryan Newbold <bnewbold@robocracy.org> 2018-12-20 16:28:37 -0800
committer: Bryan Newbold <bnewbold@robocracy.org> 2018-12-24 16:27:01 -0800
commit: 36ce50acc6f44d9f504531571eea8755d723acc5
tree: 804fba2793ddd0866d3edd9dd3f63d14bca8d2ff /notes
parent: 219ea171eed66882d38c352aadd950fed1305d77
notes on fileset/webcapture
Diffstat (limited to 'notes')
-rw-r--r-- notes/fileset_webcapture.txt | 29
1 files changed, 29 insertions, 0 deletions
diff --git a/notes/fileset_webcapture.txt b/notes/fileset_webcapture.txt
new file mode 100644
index 00000000..e13222d9
--- /dev/null
+++ b/notes/fileset_webcapture.txt
@@ -0,0 +1,29 @@
+
+## fileset
+
+Constraints:
+- limit to 200 files per set, to start. This is to work around very large
+ metadata sizes and >1 MByte JSON API blobs
+- must have a complete manifest for at least one hash type (of md5, sha1,
+ sha256)
+
+Could end up separating manifest into a separate redirect, like abstracts, to
+reduce database size. Could also store as a single giant JSONB blob, like
+planned for citations, to get better compression. These denormalization steps
+can happen later as performance/resource optimizations.
+
+Would like to handle things like git repositories of code, git-annex datasets,
+dat archives, and torrents. Some options:
+
+- store git URL + commit in release metadata, with no file/fileset. This ties
+ release with a specific version well, but breaks semantics of data model
+ (artifact/metadata separation)
+- store a full file manifest (or just the important files) and full URLs; maybe
+ version/commit as extra but not in URL?
+- store "stub" FileSet with no manifest, git version/commit in extra (or as a
+ new column?), and locations in URL list
+
+## webcapture
+
+Constraints:
+- also limit to 200 lines, same as with fileset
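
The two fileset constraints in the notes above (at most 200 files per set, and a complete manifest for at least one of md5, sha1, sha256) could be checked roughly as follows. This is a minimal sketch and not part of the commit: the function name `check_fileset_manifest`, the dict-per-file manifest shape, and treating an empty manifest as incomplete are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

MAX_FILES = 200  # starting limit stated in the notes
HASH_TYPES = ("md5", "sha1", "sha256")


def check_fileset_manifest(manifest: List[Dict[str, Optional[str]]]) -> List[str]:
    """Return a list of constraint violations; an empty list means the manifest passes.

    Hypothetical helper: each manifest entry is assumed to be a dict that may
    carry "md5", "sha1", and/or "sha256" keys alongside other fields.
    """
    problems = []

    # Constraint 1: cap the number of files per set.
    if len(manifest) > MAX_FILES:
        problems.append(f"too many files: {len(manifest)} > {MAX_FILES}")

    # Constraint 2: at least one hash type must be present for every file.
    # An empty manifest is treated as having no complete hash type (assumption).
    complete = [
        h for h in HASH_TYPES
        if manifest and all(entry.get(h) for entry in manifest)
    ]
    if not complete:
        problems.append("no hash type is present for every file")

    return problems
```

A manifest where every entry has a sha1 passes even if md5 and sha256 are only partially filled in; a manifest where each file has a different single hash type does not.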