aft: ASCII Framed Tables

Tabular text files, like CSV or TSV, but using ASCII control characters to delimit the header, rows, and fields. Intended for storing tabular data in files, and for simple structured data transfer between programs and over networks.

file suffix: .aft

mimetype (speculative): text/x-aft

Format

Format is:

start header: 0x01 byte (ASCII "Start of Heading")
header rows, with same width/structure as table (see below)
- first row, required, is column names. If a name is blank, the column is unnamed. Duplicate names are ambiguous: passed through by some tools, forbidden by check
- second row indicates column types (see below)
- third row indicates current sort order
start body: 0x02 byte (ANSI "Start of Text")
rows are lines ending with 0x1D (ASCII "Group Separator"), immediately followed by a newline. TBD how strict about including the newline. Other newlines in the text do not indicate a new row; they are part of the cell ("record" or "unit") value
within a row, columns are separated by 0x1E ("Record Separator"). Within columns, records can optionally have additional structure, separated by 0x1F ("Unit Separator")
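
As a rough illustration of the layout above, here is a minimal encoder sketch in Rust. The names `push_row` and `encode_aft` are hypothetical, and the sketch assumes no newline directly after the 0x01 and 0x02 bytes (see Open Questions) and does no validation of cell contents.

```rust
// Control bytes named in the format description.
const SOH: u8 = 0x01; // starts the header section
const STX: u8 = 0x02; // ends the header, starts the body
const GS: u8 = 0x1D; // ends a row (followed by '\n')
const RS: u8 = 0x1E; // separates columns within a row

/// Append one row: cells joined by RS, terminated by GS and a newline.
fn push_row(out: &mut Vec<u8>, cells: &[&str]) {
    for (i, cell) in cells.iter().enumerate() {
        if i > 0 {
            out.push(RS);
        }
        out.extend_from_slice(cell.as_bytes());
    }
    out.push(GS);
    out.push(b'\n');
}

/// Hypothetical helper: header rows and body rows share the same row
/// structure, so both go through push_row.
/// Assumption: no newline is written after SOH or STX.
fn encode_aft(header: &[Vec<&str>], body: &[Vec<&str>]) -> Vec<u8> {
    let mut out = vec![SOH];
    for row in header {
        push_row(&mut out, row);
    }
    out.push(STX);
    for row in body {
        push_row(&mut out, row);
    }
    out
}
```

Calling `encode_aft(&[vec!["id", "name"]], &[vec!["1", "ada"]])` yields a stream with one header row (column names only) and one body row.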

The special characters are:

0x01 1 \001 SOH ^A: "Start of Heading"
0x02 2 \002 STX ^B: "Start of Text"
0x03 3 \003 ETX ^C: "End of Text"
0x1C 28 \034 FS ^\: "File Separator"
0x1D 29 \035 GS ^]: "Group Separator"
0x1E 30 \036 RS ^^: "Record Separator"
0x1F 31 \037 US ^_: "Unit Separator"

Tools and file readers check the start of the stream. If the first byte is 0x01, they can treat the input as aft; otherwise they treat it as regular text. 0x02 terminates the header section and starts the contents. Header records could somehow include:

column names
column types (default is string)
indicate sort order (columns, asc/desc)
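
The start-of-stream check described above (0x01 means aft, anything else is regular text) might be sketched as follows. `StreamKind` and `sniff` are made-up names for illustration; the sketch peeks at the first byte without consuming it, so the reader can still be handed to whichever parser is chosen.

```rust
use std::io::{self, BufRead};

/// How a reader might decide to treat an input stream (illustrative names).
#[derive(Debug, PartialEq)]
enum StreamKind {
    Aft,
    PlainText,
}

/// Peek at the first buffered byte; nothing is consumed from the reader.
/// An empty stream is treated as plain text (an assumption, not specified).
fn sniff<R: BufRead>(reader: &mut R) -> io::Result<StreamKind> {
    let buf = reader.fill_buf()?;
    if buf.first() == Some(&0x01u8) {
        Ok(StreamKind::Aft)
    } else {
        Ok(StreamKind::PlainText)
    }
}
```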

General structure of tables is:

units: for arrays of values in a cell (no other structure) => could be "array" (variable-length, single type) or "tuple" (fixed-length, names and types for each slot)
records: for cells in 2-dimensional grid
groups: rows of table (or, use newline, so regular tools work?)
file: unused; could be "sheets"? sort of redundant with "end of text" character
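
To make the groups / records / units nesting concrete, here is a small decoding sketch that splits a table body into rows, cells, and units. The function name `split_body` is hypothetical. It assumes the input is the bytes after 0x02, and it tolerates a missing newline after the Group Separator, since strictness there is still TBD.

```rust
const GS: u8 = 0x1D; // group separator: ends a row
const RS: u8 = 0x1E; // record separator: between cells
const US: u8 = 0x1F; // unit separator: between values inside a cell

/// Split a table body into rows -> cells -> units, borrowing from the input.
/// Newlines inside a cell are kept; only a newline right after GS is skipped.
fn split_body(body: &[u8]) -> Vec<Vec<Vec<&[u8]>>> {
    let mut rows = Vec::new();
    let mut rest = body;
    while let Some(end) = rest.iter().position(|&b| b == GS) {
        let row = &rest[..end];
        let cells: Vec<Vec<&[u8]>> = row
            .split(|&b| b == RS)
            .map(|cell| cell.split(|&b| b == US).collect::<Vec<_>>())
            .collect();
        rows.push(cells);
        rest = &rest[end + 1..];
        // Skip the newline that follows the group separator, if present.
        if rest.first() == Some(&b'\n') {
            rest = &rest[1..];
        }
    }
    // Bytes after the last GS (an unterminated row) are ignored here.
    rows
}
```

A cell without any Unit Separator comes back as a single-element list, so callers can treat every cell uniformly.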

Comments and annotations are not supported.

Default strings must not contain any of the special characters. The code doing the encoding must enforce this, which requires a full pass over every value. In this mode, a string that includes a special character cannot be stored and must raise an error. For UTF-8 strings which may include the characters, an escaped column type is allowed, which permits arbitrary JSON-style character escapes (like '\u001e'). The downside of this mode is that conversion and processing may be slower.
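
A sketch of both modes, under stated assumptions: `check_plain` is the full-pass check for default strings, and `escape` is one possible encoding for the escaped column type. The exact escaping rules are not settled above; this version assumes special characters become `\u00XX` and that backslashes are doubled so decoding stays unambiguous. All function names are hypothetical.

```rust
/// True for the bytes the format reserves: SOH, STX, ETX, FS, GS, RS, US.
fn is_special(b: u8) -> bool {
    matches!(b, 0x01..=0x03 | 0x1C..=0x1F)
}

/// Default string mode: reject any value containing a reserved byte.
/// On failure, returns the byte offset of the first offending character.
fn check_plain(value: &str) -> Result<(), usize> {
    match value.bytes().position(is_special) {
        Some(offset) => Err(offset),
        None => Ok(()),
    }
}

/// Escaped mode (assumed rules): reserved characters become \u00XX and
/// a literal backslash is written as two backslashes.
fn escape(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for ch in value.chars() {
        match ch {
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x80 && is_special(c as u8) => {
                out.push_str(&format!("\\u{:04x}", c as u32));
            }
            c => out.push(c),
        }
    }
    out
}
```

Checking bytes rather than characters is safe for UTF-8 input, because bytes below 0x80 never appear inside a multi-byte sequence.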

Extras

Text editor support: coloring, easy entry of special chars, highlighting errors, tab-separation.

Unfortunately, my terminal (xterm) interprets Ctrl-V Ctrl-_ as "resize smaller".

Column Types

Encoding is that of JSON or TOML. Values can be:

string (UTF-8) (default if not specified)
escaped byte-string, JSON-style
base64 encoded bytes
integer number
floating point number
boolean
date, timestamp (ISO format)
null
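
As a sketch of how a reader might apply column types to cell text, the following covers a subset of the list above (string, integer, float, boolean, null). Everything here is an assumption for illustration: the type and function names are invented, the spelling of type names in the header row is not defined by this document, and null is assumed to be written as the literal `null` as in JSON. Escaped strings, base64, and dates are left out.

```rust
/// Parsed cell values for a subset of the listed types (illustrative names).
#[derive(Debug, PartialEq)]
enum Value {
    Str(String),
    Int(i64),
    Float(f64),
    Bool(bool),
    Null,
}

/// Column types handled by this sketch; Str is the default.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnType {
    Str,
    Int,
    Float,
    Bool,
}

/// Interpret one cell according to its column type.
/// Assumption: non-string columns spell a missing value as "null".
fn parse_cell(ty: ColumnType, text: &str) -> Result<Value, String> {
    if ty != ColumnType::Str && text == "null" {
        return Ok(Value::Null);
    }
    match ty {
        ColumnType::Str => Ok(Value::Str(text.to_string())),
        ColumnType::Int => text.parse().map(Value::Int).map_err(|e| format!("{e}")),
        ColumnType::Float => text.parse().map(Value::Float).map_err(|e| format!("{e}")),
        ColumnType::Bool => match text {
            "true" => Ok(Value::Bool(true)),
            "false" => Ok(Value::Bool(false)),
            other => Err(format!("not a boolean: {other:?}")),
        },
    }
}
```

Note that Rust's float parser accepts a few spellings (such as `inf`) that JSON does not, so a stricter reader would need its own number grammar.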

Usage Examples

cargo run --bin aft-demo | cat -v
cargo run --bin aft-demo | cut -f1 -d$'\036' | cat -v

Use Cases

(structured) logging
(structured) command output
data munging pipelines and intermediate formats
datastore imports and exports
REST API output (via content negotiation)
efficient
sorted static lookup tables

Ideas

sub-types for columns. eg, "uint+bytes"
sort order

Similar To

Hadoop formats?
"column families" in datastores
CDX (web archiving), particularly CDXJ (JSON-ish)

Open Questions

should windows-style terminators be allowed? or strictly single-newline-character after the Group Separator?
should SOH and STX characters be followed by newline? optionally? this may make some command-line stuff easier, but adds parsing complexity (how to represent leading whitespace in first column?)
what about newlines in values? should there be a mode where we use record or group separator instead of newline? single mode is much better, but compatibility with existing UNIX-y newline-oriented stuff also compelling. Hrm.
should first line in file starting with # be ignored? for future extensibility
should the default mode be escaped strings, or strings with no special characters allowed? may not matter for the common case of no special characters.
should additional constraints be placed on header encoding? eg, that they must be escaped or not have any ASCII special characters. this might make parsing to determine validity and mode easier
should there be fast ASCII-only, no special characters mode? would that provide even more speed? daft?