From 661da10e73b9ec95a0411d1e54b999f6696c3152 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Sun, 31 May 2020 15:39:56 -0700
Subject: more notes in README

---
 README.md | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 62 insertions(+), 13 deletions(-)

diff --git a/README.md b/README.md
index 1de9277..b2abfd7 100644
--- a/README.md
+++ b/README.md
@@ -2,26 +2,43 @@
 aft: ASCII Framed Tables
 ===========================

-Like TSV, but using ASCII control characters to distinguish headers, fields, and rows.
-
-Store tabular data in files, or for simple structured data transfer between
-programs and over networks.
+Tabular text files (like CSV or TSV), but using special ASCII characters to
+distinguish header, rows, and fields. For storing tabular data in files, or for
+simple structured data transfer between programs and over networks.

 file suffix: `.aft`

-mimetype (speculative): `text/aft`
+mimetype (speculative): `text/x-aft`

 ## Format

+Format is:
+
+- start header: `0x01` byte (ANSI "Start of Header")
+- header rows, with same width/structure as table (see below)
+  - first row, required, is column names. if blank name, column is unnamed.
+    duplicate names are ambiguous: pass-through by some tools, forbidden by
+    `check`
+  - second row indicates column types (see below)
+  - third row indicates current sort order
+- start body: `0x02` byte (ANSI "Start of Text")
+- rows are lines ending with `0x1D` (ANSI "Group Separator"), immediately
+  followed by a newline. TBD how strict about including the newline. Other
+  newlines in the text do not indicate a new row, they are part of the cell
+  ("record" or "unit") value
+- within a row, columns are separated by `0x1E` ("Record Separator"). Within
+  columns, records can optionally have additional structure, separated by
+  `0x1F` ("Unit Separator")
+
 The special characters are:

-- `0x01 1 SOH`: "Start of Header"
-- `0x02 2 STX`: "Start of Text"
-- `0x03 3 ETX`: "End of Text"
-- `0x1C 28 FS`: "File Separator"
-- `0x1D 29 GS`: "Group Separator"
-- `0x1E 30 RS`: "Record Separator"
-- `0x1F 31 US`: "Unit Separator"
+- `0x01 1 \001 SOH ^A`: "Start of Header"
+- `0x02 2 \002 STX ^B`: "Start of Text"
+- `0x03 3 \003 ETX ^C`: "End of Text"
+- `0x1C 28 \034 FS ^\`: "File Separator"
+- `0x1D 29 \035 GS ^]`: "Group Separator"
+- `0x1E 30 \036 RS ^^`: "Record Separator"
+- `0x1F 31 \037 US ^_`: "Unit Separator"

 Tools and file readers check the start of stream. If it is `0x01`, can treat as
 `aft`; otherwise treat as regular text. `0x02` terminates header section and
@@ -59,8 +76,40 @@ Encoding is that of JSON or TOML. Values can be:
 - date, timestamp
 - null

-## Examples
+## Usage Examples

     cargo run --bin aft-demo | cat -v
     cargo run --bin aft-demo | cut -f1 -d$'\036' | cat -v

+## Use Cases
+
+- (structured) logging
+- (structured) command output
+- data munging pipelines and intermediate formats
+- datastore imports and exports
+- REST API output (via content negotiation)
+- efficient
+- sorted static lookup tables
+
+## Ideas
+
+- sub-types for columns. eg, "uint+bytes"
+- sort order
+
+## Similar To
+
+- Hadoop formats?
+- "column families" in datastores
+- CDX (web archiving), particularly CDXJ (JSON-ish)
+
+## Open Questions
+
+- should windows-style terminators be allowed? or strictly
+  single-newline-character after the Group Separator?
+- should SOH and STX characters be followed by newline? optionally? this may
+  make some command-line stuff easier, but adds parsing complexity (how to
+  represent leading whitespace in first column?)
+- what about newlines in values? should there be a mode where we use record or
+  group separator instead of newline? single mode is much better, but
+  compatibility with existing UNIX-y newline-oriented stuff also compelling. Hrm.
+- should first line in file starting with `#` be ignored? for future extensibility
--
cgit v1.2.3
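
Review note: the framing described under "Format" in this patch can be sketched in a few lines of Python. This is an illustrative reading of the notes, not the project's implementation. The function names `encode_aft` and `parse_aft` are hypothetical, and several points the README leaves open are treated here as assumptions: header rows are terminated the same way as body rows (`0x1D` followed by a single `\n`), no newline follows `0x01` or `0x02`, and cell values never contain the `0x1D` + newline pair.

```python
# Control bytes named in the README's "special characters" list.
SOH, STX = b"\x01", b"\x02"                 # start of header, start of body
GS, RS, US = b"\x1d", b"\x1e", b"\x1f"      # row end, column sep, sub-field sep
ROW_END = GS + b"\n"                        # assumption: exactly one "\n" after GS


def encode_aft(header_rows, body_rows):
    """Serialize rows (lists of bytes cells) into one aft byte string."""
    def encode_rows(rows):
        return b"".join(RS.join(cells) + ROW_END for cells in rows)

    return SOH + encode_rows(header_rows) + STX + encode_rows(body_rows)


def parse_aft(data):
    """Split an aft byte string into (header_rows, body_rows)."""
    if not data.startswith(SOH):
        # Per the README, streams not starting with 0x01 are regular text.
        raise ValueError("not an aft stream: missing leading 0x01")
    header_part, sep, body_part = data[1:].partition(STX)
    if not sep:
        raise ValueError("missing 0x02 (start of body)")

    def split_rows(chunk):
        rows = chunk.split(ROW_END)
        if rows and rows[-1] == b"":
            rows.pop()  # drop the empty piece after the final terminator
        return [row.split(RS) for row in rows]

    return split_rows(header_part), split_rows(body_part)


# Three header rows: names, types, sort order. The type name "str" is a
# placeholder; the README's actual type list is not shown in this diff.
header = [[b"name", b"note"], [b"str", b"str"], [b"", b""]]
# A bare newline inside a cell stays part of the cell; US adds sub-structure.
body = [[b"a", b"line1\nline2"], [b"b", b"x" + US + b"y"]]
stream = encode_aft(header, body)
```

Splitting on the two-byte `0x1D` + newline pair rather than on newlines alone is what lets a bare newline survive inside a cell, which is the behaviour the "rows are lines ending with `0x1D`" bullet describes. The open questions about Windows-style terminators and a newline after SOH/STX would change `ROW_END` and the header split respectively.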