From 661da10e73b9ec95a0411d1e54b999f6696c3152 Mon Sep 17 00:00:00 2001
From: Bryan Newbold <bnewbold@archive.org>
Date: Sun, 31 May 2020 15:39:56 -0700
Subject: more notes in README

---
 README.md | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 62 insertions(+), 13 deletions(-)

diff --git a/README.md b/README.md
index 1de9277..b2abfd7 100644
--- a/README.md
+++ b/README.md
@@ -2,26 +2,43 @@
 aft: ASCII Framed Tables
 ===========================
 
-Like TSV, but using ASCII control characters to distinguish headers, fields, and rows.
-
-Store tabular data in files, or for simple structured data transfer between
-programs and over networks.
+Tabular text files (like CSV or TSV), but using special ASCII characters to
+distinguish header, rows, and fields. For storing tabular data in files, or for
+simple structured data transfer between programs and over networks.
 
 file suffix: `.aft`
 
-mimetype (speculative): `text/aft`
+mimetype (speculative): `text/x-aft`
 
 ## Format
 
+Format is:
+
+- start header: `0x01` byte (ANSI "Start of Heading")
+- header rows, with same width/structure as table (see below)
+    - first row, required, is column names. if blank name, column is unnamed.
+      duplicate names are ambiguous: pass-through by some tools, forbidden by
+      `check`
+    - second row indicates column types (see below)
+    - third row indicates current sort order
+- start body: `0x02` byte (ANSI "Start of Text")
+- rows are lines ending with `0x1D` (ANSI "Group Separator"), immediately
+  followed by a newline. TBD how strict about including the newline. Other
+  newlines in the text do not indicate a new row; they are part of the cell
+  ("record" or "unit") value
+- within a row, columns are separated by `0x1E` ("Record Separator"). Within
+  columns, records can optionally have additional structure, separated by
+  `0x1F` ("Unit Separator")
+
 The special characters are:
 
-- `0x01  1 SOH`: "Start of Header"
-- `0x02  2 STX`: "Start of Text"
-- `0x03  3 ETX`: "End of Text"
-- `0x1C 28  FS`: "File Separator"
-- `0x1D 29  GS`: "Group Separator"
-- `0x1E 30  RS`: "Record Separator"
-- `0x1F 31  US`: "Unit Separator"
+- `0x01  1 \001 SOH ^A`: "Start of Heading"
+- `0x02  2 \002 STX ^B`: "Start of Text"
+- `0x03  3 \003 ETX ^C`: "End of Text"
+- `0x1C 28 \034  FS ^\`: "File Separator"
+- `0x1D 29 \035  GS ^]`: "Group Separator"
+- `0x1E 30 \036  RS ^^`: "Record Separator"
+- `0x1F 31 \037  US ^_`: "Unit Separator"
 
 Tools and file readers check the start of stream. If it is `0x01`, can treat as
 `aft`; otherwise treat as regular text. `0x02` terminates header section and
@@ -59,8 +76,40 @@ Encoding is that of JSON or TOML. Values can be:
 - date, timestamp
 - null
 
-## Examples
+## Usage Examples
 
     cargo run --bin aft-demo | cat -v
     cargo run --bin aft-demo | cut -f1 -d$'\036' | cat -v
 
+## Use Cases
+
+- (structured) logging
+- (structured) command output
+- data munging pipelines and intermediate formats
+- datastore imports and exports
+- REST API output (via content negotiation)
+- efficient 
+- sorted static lookup tables
+
+## Ideas
+
+- sub-types for columns. eg, "uint+bytes"
+- sort order
+
+## Similar To
+
+- Hadoop formats?
+- "column families" in datastores
+- CDX (web archiving), particularly CDXJ (JSON-ish)
+
+## Open Questions
+
+- should windows-style terminators be allowed? or strictly
+  single-newline-character after the Group Separator?
+- should SOH and STX characters be followed by newline? optionally? this may
+  make some command-line stuff easier, but adds parsing complexity (how to
+  represent leading whitespace in first column?)
+- what about newlines in values? should there be a mode where we use record or
+  group separator instead of newline? single mode is much better, but
+  compatibility with existing UNIX-y newline-oriented stuff also compelling. Hrm.
+- should first line in file starting with `#` be ignored? for future extensibility
-- 
cgit v1.2.3