author: Bryan Newbold <bnewbold@archive.org> 2020-05-31 15:39:56 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2020-05-31 15:39:56 -0700
commit: 661da10e73b9ec95a0411d1e54b999f6696c3152
tree: 7d60fa4bd9073cad1e3d1a3a6ff033496a93dc59
parent: 52c8fd028789f80dfd8460dcbaca07df17dd56e9
more notes in README
-rw-r--r-- README.md 75
1 files changed, 62 insertions, 13 deletions
diff --git a/README.md b/README.md
index 1de9277..b2abfd7 100644
--- a/README.md
+++ b/README.md
@@ -2,26 +2,43 @@
aft: ASCII Framed Tables
===========================
-Like TSV, but using ASCII control characters to distinguish headers, fields, and rows.
-
-Store tabular data in files, or for simple structured data transfer between
-programs and over networks.
+Tabular text files (like CSV or TSV), but using special ASCII characters to
+distinguish header, rows, and fields. For storing tabular data in files, or for
+simple structured data transfer between programs and over networks.
file suffix: `.aft`
-mimetype (speculative): `text/aft`
+mimetype (speculative): `text/x-aft`
## Format
+Format is:
+
+- start header: `0x01` byte (ANSI "Start of Header")
+- header rows, with same width/structure as table (see below)
+ - first row, required, is column names. if blank name, column is unnamed.
+ duplicate names are ambiguous: pass-through by some tools, forbidden by
+ `check`
+ - second row indicates column types (see below)
+ - third row indicates current sort order
+- start body: `0x02` byte (ANSI "Start of Text")
+- rows are lines ending with `0x1D` (ANSI "Group Separator"), immediately
+ followed by a newline. TBD how strict about including the newline. Other
+ newlines in the text do not indicate a new row, they are part of the cell
+ ("record" or "unit") value
+- within a row, columns are separated by `0x1E` ("Record Separator"). Within
+ columns, records can optionally have additional structure, separated by
+ `0x1F` ("Unit Separator")
+
The special characters are:
-- `0x01 1 SOH`: "Start of Header"
-- `0x02 2 STX`: "Start of Text"
-- `0x03 3 ETX`: "End of Text"
-- `0x1C 28 FS`: "File Separator"
-- `0x1D 29 GS`: "Group Separator"
-- `0x1E 30 RS`: "Record Separator"
-- `0x1F 31 US`: "Unit Separator"
+- `0x01 1 \001 SOH ^A`: "Start of Header"
+- `0x02 2 \002 STX ^B`: "Start of Text"
+- `0x03 3 \003 ETX ^C`: "End of Text"
+- `0x1C 28 \034 FS ^\`: "File Separator"
+- `0x1D 29 \035 GS ^]`: "Group Separator"
+- `0x1E 30 \036 RS ^^`: "Record Separator"
+- `0x1F 31 \037 US ^_`: "Unit Separator"
Tools and file readers check the start of stream. If it is `0x01`, can treat as
`aft`; otherwise treat as regular text. `0x02` terminates header section and
@@ -59,8 +76,40 @@ Encoding is that of JSON or TOML. Values can be:
- date, timestamp
- null
-## Examples
+## Usage Examples
cargo run --bin aft-demo | cat -v
cargo run --bin aft-demo | cut -f1 -d$'\036' | cat -v
+## Use Cases
+
+- (structured) logging
+- (structured) command output
+- data munging pipelines and intermediate formats
+- datastore imports and exports
+- REST API output (via content negotiation)
+- efficient
+- sorted static lookup tables
+
+## Ideas
+
+- sub-types for columns. eg, "uint+bytes"
+- sort order
+
+## Similar To
+
+- Hadoop formats?
+- "column families" in datastores
+- CDX (web archiving), particularly CDXJ (JSON-ish)
+
+## Open Questions
+
+- should windows-style terminators be allowed? or strictly
+ single-newline-character after the Group Separator?
+- should SOH and STX characters be followed by newline? optionally? this may
+ make some command-line stuff easier, but adds parsing complexity (how to
+ represent leading whitespace in first column?)
+- what about newlines in values? should there be a mode where we use record or
+ group separator instead of newline? single mode is much better, but
+ compatibility with existing UNIX-y newline-oriented stuff also compelling. Hrm.
+- should first line in file starting with `#` be ignored? for future extensibility
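
The framing rules added under "Format" in this commit can be sketched in Rust as follows. This is a minimal illustration, not the code of the `aft` crate: the function names (`encode`, `decode`, and the row helpers) are hypothetical. It also makes assumptions on points the README leaves open: no newline follows SOH or STX, every row ends with exactly GS plus a single `\n`, only the required column-name header row is written, and cell values never contain the control bytes themselves.

```rust
// Hypothetical sketch of the framing described in the README; not the aft crate's API.
const SOH: &str = "\x01"; // start of header
const STX: &str = "\x02"; // start of body
const ROW_END: &str = "\x1D\n"; // Group Separator immediately followed by a newline
const RS: &str = "\x1E"; // Record Separator between columns

/// Join columns with RS and terminate every row with GS + newline.
fn encode_rows(rows: &[Vec<&str>]) -> String {
    rows.iter()
        .map(|row| format!("{}{}", row.join(RS), ROW_END))
        .collect()
}

/// SOH, header rows, STX, body rows (assumes no newline after SOH or STX).
fn encode(header: &[Vec<&str>], body: &[Vec<&str>]) -> String {
    format!("{}{}{}{}", SOH, encode_rows(header), STX, encode_rows(body))
}

/// Split one section into rows and columns. A bare newline inside a cell
/// is not preceded by GS, so it stays part of the cell value.
fn decode_rows(section: &str) -> Vec<Vec<String>> {
    match section.strip_suffix(ROW_END) {
        Some(rows) => rows
            .split(ROW_END)
            .map(|row| row.split(RS).map(String::from).collect())
            .collect(),
        None => Vec::new(), // empty section, or missing final terminator
    }
}

/// Returns None when the input does not start with SOH (treat it as regular
/// text) or when no STX is found.
fn decode(input: &str) -> Option<(Vec<Vec<String>>, Vec<Vec<String>>)> {
    let rest = input.strip_prefix(SOH)?;
    let (header, body) = rest.split_once(STX)?;
    Some((decode_rows(header), decode_rows(body)))
}

fn main() {
    let header = vec![vec!["name", "note"]];
    let body = vec![vec!["a", "line one\nline two"], vec!["b", ""]];
    let encoded = encode(&header, &body);
    let (h, b) = decode(&encoded).expect("input should start with SOH and contain STX");
    println!("{} header row(s), {} body row(s)", h.len(), b.len());
    println!("{:?}", b[0]);
}
```

Treating GS plus newline as a single terminator is what lets the multi-line cell survive the round trip. A stricter or looser newline policy, as raised under "Open Questions", would only change `ROW_END` and the handling in `decode_rows`.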