aft: ASCII Framed Tables
Tabular text files (like CSV or TSV), but using special ASCII characters to distinguish header, rows, and fields. For storing tabular data in files, or for simple structured data transfer between programs and over networks.
file suffix: .aft
mimetype (speculative): text/x-aft
Format
Format is:
- start header: 0x01 byte (ANSI "Start of Header")
- header rows, with same width/structure as table (see below)
  - first row, required, is column names. if blank name, column is unnamed. duplicate names are ambiguous: pass-through by some tools, forbidden by check
  - second row indicates column types (see below)
  - third row indicates current sort order
- start body: 0x02 byte (ANSI "Start of Text")
- rows are lines ending with 0x1D (ANSI "Group Separator"), immediately followed by a newline. TBD how strict about including the newline. Other newlines in the text do not indicate a new row, they are part of the cell ("record" or "unit") value
- within a row, columns are separated by 0x1E ("Record Separator"). Within columns, records can optionally have additional structure, separated by 0x1F ("Unit Separator")
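As a rough illustration of the layout described above, here is a minimal Python encoder sketch. It rests on several assumptions the text leaves open: no newline follows the SOH or STX bytes, every row (header rows included) ends with GS plus a single "\n", all three header rows are written, and the type and sort-order spellings used in the usage note are invented. The function names `encode_row` and `encode_table` are hypothetical.

```python
# Control bytes named in the format description.
SOH, STX, GS, RS = b"\x01", b"\x02", b"\x1d", b"\x1e"


def encode_row(cells):
    """Join cells with RS and terminate the row with GS + newline."""
    return RS.join(c.encode("utf-8") for c in cells) + GS + b"\n"


def encode_table(names, types, sort_order, rows):
    """Emit SOH, three header rows, STX, then the body rows.

    Assumption: no newline after SOH or STX (an open question in the format).
    """
    out = [SOH]
    out += [encode_row(h) for h in (names, types, sort_order)]
    out.append(STX)
    out += [encode_row(r) for r in rows]
    return b"".join(out)
```

For example, `encode_table(["id", "name"], ["integer", "string"], ["asc", ""], [["1", "a\nb"]])` produces a single body row whose second cell contains an embedded newline; only the GS byte marks the end of the row. This sketch does no validation of cell contents.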
The special characters are:
- 0x01 1 \001 SOH ^A: "Start of Header"
- 0x02 2 \002 STX ^B: "Start of Text"
- 0x03 3 \003 ETX ^C: "End of Text"
- 0x1C 28 \034 FS ^\: "File Separator"
- 0x1D 29 \035 GS ^]: "Group Separator"
- 0x1E 30 \036 RS ^^: "Record Separator"
- 0x1F 31 \037 US ^_: "Unit Separator"
Tools and file readers check the start of stream. If it is 0x01, can treat as aft; otherwise treat as regular text. 0x02 terminates header section and starts contents. Header records could somehow include:
- column names
- column types (default is string)
- indicate sort order (columns, asc/desc)
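The start-of-stream check and the header/body split described above might look like the following Python sketch. It assumes no newline follows SOH or STX, that every row ends with GS plus exactly one "\n", and that cell values never contain the special bytes (the default string mode). The names `split_rows` and `parse` are hypothetical.

```python
SOH, STX, GS, RS = b"\x01", b"\x02", b"\x1d", b"\x1e"


def split_rows(block):
    """Split on GS + newline; bare newlines stay inside cell values."""
    rows = block.split(GS + b"\n")
    if rows and rows[-1] == b"":
        rows.pop()  # drop the empty piece after the final row terminator
    return [[cell.decode("utf-8") for cell in row.split(RS)] for row in rows]


def parse(data):
    """Return (header_rows, body_rows), or None if the stream is not aft."""
    if not data.startswith(SOH):
        return None  # caller should treat the stream as regular text
    header, sep, body = data[1:].partition(STX)
    if not sep:
        raise ValueError("missing STX (0x02) after header")
    return split_rows(header), split_rows(body)
```

The first header row would then hold the column names, with the type and sort-order rows following if present. Interpreting those rows is left out because their encoding is not settled in the text.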
General structure of tables is:
- units: for arrays of values in a cell (no other structure) => could be "array" (variable-length, single type) or "tuple" (fixed-length, names and types for each slot)
- records: for cells in 2-dimensional grid
- groups: rows of table (or, use newline, so regular tools work?)
- file: unused; could be "sheets"? sort of redundant with "end of text" character
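To make the "array" versus "tuple" reading of units concrete, here is a small Python sketch. The `slots` parameter (a list of name and converter pairs) and both function names are hypothetical; the text does not say how slot names and types would be declared.

```python
US = "\x1f"  # Unit Separator


def split_units(cell):
    """Split one cell into its unit values (the "array" reading)."""
    return cell.split(US)


def as_tuple(cell, slots):
    """Read a cell as a fixed-length tuple with named, typed slots.

    `slots` is a hypothetical list of (name, converter) pairs.
    """
    units = split_units(cell)
    if len(units) != len(slots):
        raise ValueError(f"expected {len(slots)} units, got {len(units)}")
    return {name: conv(unit) for (name, conv), unit in zip(slots, units)}
```

One consequence of this simple approach: an empty cell splits into one empty unit, so an empty array and an array holding a single empty string cannot be told apart without an extra convention.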
Comments and annotations are not supported.
Default strings must not contain any of the special characters. The code doing the encoding must enforce this, which requires a full pass over every value. In this mode, a string that includes a special character cannot be stored, and the encoder must raise an error. For UTF-8 strings which may include these characters, an escaped column type is allowed, which permits arbitrary JSON-style character escapes (like \u0001). The downside of this mode is that conversion and processing may be slower.
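The two string modes could be sketched in Python as follows. This assumes the "special characters" are exactly the seven control characters listed earlier, and that an escaped cell is the body of a JSON string without its surrounding quotes; neither detail is fixed by the text. The function names are hypothetical. Python's standard `json` module is used here because it escapes all control characters below 0x20.

```python
import json

# The seven control characters listed as special in this document.
SPECIAL = frozenset("\x01\x02\x03\x1c\x1d\x1e\x1f")


def encode_default(value):
    """Default string mode: reject any value containing a special character."""
    bad = SPECIAL.intersection(value)
    if bad:
        raise ValueError(f"special character(s) in value: {sorted(bad)!r}")
    return value


def encode_escaped(value):
    """Escaped mode: JSON-style escapes, minus the surrounding quotes."""
    return json.dumps(value, ensure_ascii=False)[1:-1]


def decode_escaped(text):
    """Inverse of encode_escaped."""
    return json.loads('"' + text + '"')
```

Note that the escaped form also rewrites newlines, quotes, and backslashes, which is part of why this mode costs more than the plain scan in `encode_default`.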
Extras
Text editor support: coloring, easy entry of special chars, highlighting errors, tab-separation.
Unfortunately, my terminal (xterm) interprets Ctrl-V Ctrl-_ as "resize smaller".
Column Types
Encoding is that of JSON or TOML. Values can be:
- string (UTF-8) (default if not specified)
- escaped byte-string, JSON-style
- base64 encoded bytes
- integer number
- floating point number
- boolean
- date, timestamp (ISO format)
- null
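A reader might map column types to converters roughly as in the Python sketch below. The type names used as dictionary keys ("integer", "float", "boolean", "base64", "timestamp") are invented for illustration, since the text does not say how types are spelled in the header row. The escaped byte-string type and null are omitted because their exact encoding is not specified.

```python
import base64
from datetime import datetime


def parse_bool(text):
    """Accept only the JSON/TOML spellings of booleans."""
    if text not in ("true", "false"):
        raise ValueError(f"not a boolean: {text!r}")
    return text == "true"


# Hypothetical type names mapped to converter functions.
CONVERTERS = {
    "string": str,
    "integer": int,
    "float": float,
    "boolean": parse_bool,
    "base64": base64.b64decode,
    "timestamp": datetime.fromisoformat,
}


def convert_cell(text, type_name=""):
    """Convert one cell; a blank or missing type defaults to string."""
    return CONVERTERS[type_name or "string"](text)


def convert_row(cells, types):
    """Convert a body row using the types from the second header row."""
    return [convert_cell(c, t) for c, t in zip(cells, types)]
```

An unknown type name raises `KeyError` in this sketch; a real tool might prefer to pass such columns through as strings.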
Usage Examples
cargo run --bin aft-demo | cat -v
cargo run --bin aft-demo | cut -f1 -d$'\036' | cat -v
Use Cases
- (structured) logging
- (structured) command output
- data munging pipelines and intermediate formats
- datastore imports and exports
- REST API output (via content negotiation)
- efficient
- sorted static lookup tables
Ideas
- sub-types for columns. eg, "uint+bytes"
- sort order
Similar To
- Hadoop formats?
- "column families" in datastores
- CDX (web archiving), particularly CDXJ (JSON-ish)
Open Questions
- should windows-style terminators be allowed? or strictly single-newline-character after the Group Separator?
- should SOH and STX characters be followed by newline? optionally? this may make some command-line stuff easier, but adds parsing complexity (how to represent leading whitespace in first column?)
- what about newlines in values? should there be a mode where we use record or group separator instead of newline? single mode is much better, but compatibility with existing UNIX-y newline-oriented stuff also compelling. Hrm.
- should first line in file starting with # be ignored? for future extensibility
- should the default mode be escaped strings, or strings with no special characters allowed? may not matter for the common case of no special characters.
- should additional constraints be placed on header encoding? eg, that they must be escaped or not have any ASCII special characters. this might make parsing to determine validity and mode easier
- should there be fast ASCII-only, no special characters mode? would that provide even more speed? daft?