aft: ASCII Framed Tables
===========================

Tabular text files (like CSV or TSV), but using special ASCII characters to
distinguish header, rows, and fields. For storing tabular data in files, or for
simple structured data transfer between programs and over networks.

file suffix: `.aft`

mimetype (speculative): `text/x-aft`

## Format

Format is:

- start header: `0x01` byte (ANSI "Start of Header")
- header rows, with same width/structure as table (see below)
    - first row, required, is column names. if a name is blank, the column is
      unnamed. duplicate names are ambiguous: passed through by some tools,
      forbidden by `check`
    - second row indicates column types (see below)
    - third row indicates current sort order
- start body: `0x02` byte (ANSI "Start of Text")
- rows are lines ending with `0x1D` (ANSI "Group Separator"), immediately
  followed by a newline. TBD how strict about including the newline. Other
  newlines in the text do not indicate a new row; they are part of the cell
  ("record" or "unit") value
- within a row, columns are separated by `0x1E` ("Record Separator"). Within
  columns, records can optionally have additional structure, separated by
  `0x1F` ("Unit Separator")
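
As a rough illustration of the layout above, here is a minimal Rust sketch that writes a table with a names-only header. The helper names `push_row` and `encode_table` are hypothetical and not part of any existing `aft` API. The sketch assumes header rows use the same `0x1E`/`0x1D` framing as body rows, and it does no validation of cell contents.

```rust
// Control bytes named in the format description.
const SOH: u8 = 0x01; // start of header
const STX: u8 = 0x02; // start of body
const GS: u8 = 0x1D; // end of row (followed by a newline)
const RS: u8 = 0x1E; // separates columns within a row

/// Append one row: cells joined by RS, terminated by GS and a newline.
fn push_row(out: &mut Vec<u8>, cells: &[&str]) {
    for (i, cell) in cells.iter().enumerate() {
        if i > 0 {
            out.push(RS);
        }
        out.extend_from_slice(cell.as_bytes());
    }
    out.push(GS);
    out.push(b'\n');
}

/// Hypothetical helper: encode a names-only header followed by body rows.
/// Assumes header rows are framed exactly like body rows.
fn encode_table(names: &[&str], rows: &[Vec<&str>]) -> Vec<u8> {
    let mut out = vec![SOH];
    push_row(&mut out, names);
    out.push(STX);
    for row in rows {
        push_row(&mut out, row);
    }
    out
}
```

Calling `encode_table(&["id", "name"], &[vec!["1", "ada"]])` yields `SOH`, the header row, `STX`, and then one body row. A real encoder would also need to reject or escape special characters inside cells.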

The special characters are:

- `0x01  1 \001 SOH ^A`: "Start of Header"
- `0x02  2 \002 STX ^B`: "Start of Text"
- `0x03  3 \003 ETX ^C`: "End of Text"
- `0x1C 28 \034  FS ^\`: "File Separator"
- `0x1D 29 \035  GS ^]`: "Group Separator"
- `0x1E 30 \036  RS ^^`: "Record Separator"
- `0x1F 31 \037  US ^_`: "Unit Separator"

Tools and file readers check the first byte of the stream. If it is `0x01`,
they can treat the stream as `aft`; otherwise they treat it as regular text.
`0x02` terminates the header section and starts the contents. Header records
could somehow include:

- column names
- column types (default is string)
- indicate sort order (columns, asc/desc)
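
The detection rule described here could look roughly like the following Rust sketch. The function names `is_aft` and `split_header` are made up for illustration, and the sketch assumes no raw `0x02` byte appears inside header values.

```rust
const SOH: u8 = 0x01; // start of header
const STX: u8 = 0x02; // start of body

/// True if the stream should be treated as aft rather than regular text.
fn is_aft(data: &[u8]) -> bool {
    data.first() == Some(&SOH)
}

/// Split the input into (header bytes, body bytes) at the first STX.
/// Returns None if the input does not start with SOH or contains no STX.
/// Assumption: header values never contain a raw STX byte.
fn split_header(data: &[u8]) -> Option<(&[u8], &[u8])> {
    if !is_aft(data) {
        return None;
    }
    let stx = data.iter().position(|&b| b == STX)?;
    Some((&data[1..stx], &data[stx + 1..]))
}
```

A tool can call `is_aft` on the first byte it reads and fall back to plain-text handling when it returns `false`.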

General structure of tables is:

- units: for arrays of values in a cell (no other structure)
    => could be "array" (variable-length, single type) or "tuple" (fixed-length, names and types for each slot)
- records: for cells in 2-dimensional grid
- groups: rows of table (or, use newline, so regular tools work?)
- file: unused; could be "sheets"? sort of redundant with "end of text" character
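
To make the nesting of groups, records, and units concrete, here is a small Rust sketch that splits a table body into rows, cells, and units. `parse_body` is a hypothetical name. The sketch assumes rows end with `0x1D` plus an optional single newline, and it treats every cell as a list of units, so a plain cell comes back as a one-element list.

```rust
const GS: u8 = 0x1D; // group separator: ends a row
const RS: u8 = 0x1E; // record separator: between cells
const US: u8 = 0x1F; // unit separator: between units inside a cell

/// Split a body into rows -> cells -> units.
/// Invalid UTF-8 is replaced lossily; a strict reader would return an error.
fn parse_body(body: &[u8]) -> Vec<Vec<Vec<String>>> {
    let mut rows = Vec::new();
    let mut rest = body;
    while let Some(end) = rest.iter().position(|&b| b == GS) {
        let row = &rest[..end];
        let cells: Vec<Vec<String>> = row
            .split(|&b| b == RS)
            .map(|cell| {
                cell.split(|&b| b == US)
                    .map(|unit| String::from_utf8_lossy(unit).into_owned())
                    .collect::<Vec<String>>()
            })
            .collect();
        rows.push(cells);
        // Skip the GS and the newline after it, if present.
        rest = &rest[end + 1..];
        if rest.first() == Some(&b'\n') {
            rest = &rest[1..];
        }
    }
    rows
}
```

Because rows are delimited by `0x1D` rather than by the newline, a newline inside a cell survives as part of the value.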

Comments and annotations are not supported.

Default strings must not contain any of the special characters. This must be
enforced by the code doing the encoding, and requires a full pass over every
value. In this mode, strings which include the special characters cannot be
encoded and must raise an error. For UTF-8 strings which may include the
characters, an escaped column type is allowed, which permits arbitrary
JSON-style character escapes (like `\u0001`). The downside of this mode is that
conversion and processing may be slower.
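
The two modes might be implemented along these lines. This is a sketch only: `check_plain` and `escape` are hypothetical names, and the exact escaping rules (here, backslashes plus the seven special characters) are an assumption, since the format does not pin them down.

```rust
/// The seven special bytes listed in this README.
const SPECIAL: [u8; 7] = [0x01, 0x02, 0x03, 0x1C, 0x1D, 0x1E, 0x1F];

/// Default mode: one full pass over the value, erroring on any special byte.
/// All special bytes are below 0x80, so a byte scan is safe for UTF-8.
fn check_plain(value: &str) -> Result<(), String> {
    match value.bytes().position(|b| SPECIAL.contains(&b)) {
        Some(i) => Err(format!(
            "special byte 0x{:02X} at offset {}",
            value.as_bytes()[i],
            i
        )),
        None => Ok(()),
    }
}

/// Escaped mode (assumed rules): backslashes are doubled and special
/// characters become JSON-style \uXXXX escapes.
fn escape(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for c in value.chars() {
        if c == '\\' {
            out.push_str("\\\\");
        } else if c.is_ascii() && SPECIAL.contains(&(c as u8)) {
            out.push_str(&format!("\\u{:04x}", c as u32));
        } else {
            out.push(c);
        }
    }
    out
}
```

Under these assumptions the output of `escape` always passes `check_plain`, so an escaped column can be written with the same framing code as a default column.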

## Extras

Text editor support: coloring, easy entry of special chars, highlighting
errors, tab-separation.

Unfortunately, my terminal (xterm) interprets Ctrl-V Ctrl-_ as "resize smaller".

## Column Types

Encoding is that of JSON or TOML. Values can be:

- string (UTF-8) (default if not specified)
- escaped byte-string, JSON-style
- base64 encoded bytes
- integer number
- floating point number
- boolean
- date, timestamp (ISO format)
- null
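
As a sketch of how a reader might decode cells by column type, the following Rust code handles a subset of the types above. Everything here is illustrative: the `ColumnType` and `Value` enums and `parse_cell` are hypothetical, and how types are spelled in the header row and how null is written in a cell are not specified by this README. Base64 and date columns are left out for brevity.

```rust
/// A subset of the column types listed above (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnType {
    Str,
    Integer,
    Float,
    Boolean,
}

#[derive(Debug, PartialEq)]
enum Value {
    Null,
    Str(String),
    Integer(i64),
    Float(f64),
    Boolean(bool),
}

/// Decode one cell using JSON-style literals.
/// Assumption: in non-string columns an empty cell or the literal `null`
/// means null; string columns keep the raw text unchanged.
fn parse_cell(ty: ColumnType, raw: &str) -> Result<Value, String> {
    if ty != ColumnType::Str && (raw.is_empty() || raw == "null") {
        return Ok(Value::Null);
    }
    match ty {
        ColumnType::Str => Ok(Value::Str(raw.to_string())),
        ColumnType::Integer => raw
            .parse::<i64>()
            .map(Value::Integer)
            .map_err(|e| format!("{:?}: {}", raw, e)),
        ColumnType::Float => raw
            .parse::<f64>()
            .map(Value::Float)
            .map_err(|e| format!("{:?}: {}", raw, e)),
        ColumnType::Boolean => raw
            .parse::<bool>()
            .map(Value::Boolean)
            .map_err(|e| format!("{:?}: {}", raw, e)),
    }
}
```

Rust's `f64` parser is more permissive than JSON (it accepts `inf` and `NaN`), so a strict implementation would need an extra check.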

## Usage Examples

    cargo run --bin aft-demo | cat -v
    cargo run --bin aft-demo | cut -f1 -d$'\036' | cat -v

## Use Cases

- (structured) logging
- (structured) command output
- data munging pipelines and intermediate formats
- datastore imports and exports
- REST API output (via content negotiation)
- efficient 
- sorted static lookup tables

## Ideas

- sub-types for columns. eg, "uint+bytes"
- sort order

## Similar To

- Hadoop formats?
- "column families" in datastores
- CDX (web archiving), particularly CDXJ (JSON-ish)

## Open Questions

- should windows-style terminators be allowed? or strictly
  single-newline-character after the Group Separator?
- should SOH and STX characters be followed by newline? optionally? this may
  make some command-line stuff easier, but adds parsing complexity (how to
  represent leading whitespace in first column?)
- what about newlines in values? should there be a mode where we use record or
  group separator instead of newline? single mode is much better, but
  compatibility with existing UNIX-y newline-oriented stuff also compelling. Hrm.
- should first line in file starting with `#` be ignored? for future extensibility
- should the default mode be escaped strings, or strings with no special
  characters allowed? may not matter for the common case of no special characters.
- should additional constraints be placed on header encoding? eg, that they
  must be escaped or not have any ASCII special characters. this might make
  parsing to determine validity and mode easier
- should there be a fast ASCII-only mode with no special characters? would that
  provide even more speed? `daft`?