author Thomas Bushnell, BSG <tb@debian.org> 2007-12-28 16:25:32 -0800
committer Bryan Newbold <bnewbold@robocracy.org> 2017-02-20 00:05:39 -0800
commit d8ae23691ed6392b7f320f5fa7d4dd78ae52c10e
tree b20b8bc02e854c4c86d39ee22a0638a8b06e01af /xml-parse.txi
parent edd1ebef3ad774e7cbcc2f5918d555bfb0b44091
parent 64f037d91e0c9296dcaef9a0ff3eb33b19a2ed34
Import Debian changes 3a5-1 debian/3a5-1
slib (3a5-1) unstable; urgency=low
* New upstream release.
* slib.texi (Library Catalogs): Repeat change from 3a3-3.
* Makefile: Repeat $(htmldir)slib_toc.html changes from 3a2-1.
* guile.init: (library-vicinity): Repeat change from 3a4-2.
* debian/rules (binary-indep): Don't hide .init files in a separate subdirectory, thus conforming better to the usual slib practice. Put a symlink in place to ease transitions. (Closes: #407370).
Diffstat (limited to 'xml-parse.txi')
-rw-r--r-- xml-parse.txi 1010
1 files changed, 1010 insertions, 0 deletions
diff --git a/xml-parse.txi b/xml-parse.txi
new file mode 100644
index 0000000..365d914
--- /dev/null
+++ b/xml-parse.txi
@@ -0,0 +1,1010 @@
+@code{(require 'xml-parse)} or @code{(require 'ssax)}
+
+@noindent
+The XML standard document referred to in this module is@*
+@url{http://www.w3.org/TR/1998/REC-xml-19980210.html}.
+
+@noindent
+The present framework fully supports the XML Namespaces
+Recommendation@*
+@url{http://www.w3.org/TR/REC-xml-names}.
+
+@subsection String Glue
+
+
+@defun ssax:reverse-collect-str list-of-frags
+
+
+Given the list of fragments (some of which are text strings),
+reverse the list and concatenate adjacent text strings. If
+LIST-OF-FRAGS has zero or one element, the result of the procedure
+is @code{equal?} to its argument.
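+
+A minimal sketch of the behaviour described above; the fragment
+list is made up for illustration, and the non-string fragment
+@code{(br)} stands for any SXML node:
+
+@example
+(require 'xml-parse)
+
+;; Fragments are accumulated most-recent-first, so the list is
+;; reversed and the adjacent strings "b" and "c" are joined.
+(ssax:reverse-collect-str '("c" "b" (br) "a"))
+   @result{} ("a" (br) "bc")
+@end example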
+@end defun
+
+
+@defun ssax:reverse-collect-str-drop-ws list-of-frags
+
+
+Given the list of fragments (some of which are text strings),
+reverse the list and concatenate adjacent text strings while
+dropping "insignificant" whitespace, that is, whitespace in front of,
+behind, and between elements. Whitespace that is part of
+character data is not affected.
+
+Use this procedure to "intelligently" drop "insignificant"
+whitespace in the parsed SXML. If strict compliance with the
+XML Recommendation regarding whitespace is desired, use the
+@code{ssax:reverse-collect-str} procedure instead.
+@end defun
+
+@subsection Character and Token Functions
+
+The following functions either skip, or build and return tokens,
+according to inclusion or delimiting semantics. The list of
+characters to expect, include, or to break at may vary from one
+invocation of a function to another. This allows the functions to
+easily parse even context-sensitive languages.
+
+Exceptions are mentioned specifically. The list of expected
+characters (characters to skip until, or break-characters) may
+include an EOF "character", which is coded as the symbol @code{*eof*}.
+
+The input stream to parse is specified as a PORT, which is the last
+argument.
+
+
+@defun ssax:assert-current-char char-list string port
+
+
+Reads a character from the @var{port} and looks it up in the
+@var{char-list} of expected characters. If the character read is
+found among the expected characters, it is returned. Otherwise, the
+procedure writes a message using @var{string} as a comment
+and quits.
+@end defun
+
+
+@defun ssax:skip-while char-list port
+
+
+Reads characters from the @var{port} and disregards them, as long as they
+are mentioned in the @var{char-list}. The first character (which may be EOF)
+peeked from the stream that is @emph{not} a member of the @var{char-list} is
+returned.
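+
+For instance, assuming @code{call-with-input-string} from
+@code{(require 'string-port)} is available, leading blanks can be
+skipped as sketched here (the input string is invented for
+illustration):
+
+@example
+(require 'xml-parse)
+(require 'string-port)
+
+(call-with-input-string "   x"
+  (lambda (port)
+    ;; Consume the three spaces; #\x is peeked, not consumed,
+    ;; so the following read-char sees it again.
+    (let ((c (ssax:skip-while '(#\space) port)))
+      (list c (read-char port)))))
+   @result{} (#\x #\x)
+@end example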
+@end defun
+
+
+@defun ssax:init-buffer
+
+
+Returns an initial buffer for @code{ssax:next-token*} procedures.
+@code{ssax:init-buffer} may allocate a new buffer at each invocation.
+@end defun
+
+
+@defun ssax:next-token prefix-char-list break-char-list comment-string port
+
+
+Skips any number of the prefix characters (members of the @var{prefix-char-list}), if
+any, and reads the sequence of characters up to (but not including)
+a break character, one of the @var{break-char-list}.
+
+The string of characters thus read is returned. The break character
+is left on the input stream. @var{break-char-list} may include the symbol @code{*eof*};
+otherwise, EOF is fatal, generating an error message including a
+specified @var{comment-string}.
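+
+A short sketch of delimiting semantics, assuming
+@code{call-with-input-string} from @code{(require 'string-port)};
+the input and the comment string are made up for illustration:
+
+@example
+(require 'xml-parse)
+(require 'string-port)
+
+(call-with-input-string "  hello world"
+  (lambda (port)
+    ;; Skip leading spaces, then read up to the next space.
+    ;; Listing *eof* as a break character makes EOF non-fatal.
+    (ssax:next-token '(#\space) '(#\space *eof*)
+                     "reading a word" port)))
+   @result{} "hello"
+@end example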
+@end defun
+
+@noindent
+@code{ssax:next-token-of} is similar to @code{ssax:next-token}
+except that it implements an inclusion rather than delimiting
+semantics.
+
+
+@defun ssax:next-token-of inc-charset port
+
+
+Reads characters from the @var{port} that belong to the list of characters
+@var{inc-charset}. The reading stops at the first character which is not a member
+of the set. This character is left on the stream. All the read
+characters are returned in a string.
+
+
+@defunx ssax:next-token-of pred port
+
+Reads characters from the @var{port} for which @var{pred} (a procedure of
+one argument) returns non-#f. The reading stops at the first
+character for which @var{pred} returns #f. That character is left
+on the stream. All the results of evaluating @var{pred} up to #f
+are returned in a string.
+
+@var{pred} is a procedure that takes one argument (a character or
+the EOF object) and returns a character or #f. The returned
+character does not have to be the same as the input argument to the
+@var{pred}. For example,
+
+@example
+(ssax:next-token-of (lambda (c)
+ (cond ((eof-object? c) #f)
+ ((char-alphabetic? c) (char-downcase c))
+ (else #f)))
+ (current-input-port))
+@end example
+
+will try to read an alphabetic token from the current input port,
+and return it in lower case.
+@end defun
+
+
+@defun ssax:read-string len port
+
+
+Reads @var{len} characters from the @var{port}, and returns them in a string. If
+EOF is encountered before @var{len} characters are read, a shorter string
+will be returned.
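+
+The following sketch shows both the normal case and the short read
+at EOF, assuming @code{call-with-input-string} from
+@code{(require 'string-port)}; the input is invented for
+illustration:
+
+@example
+(require 'xml-parse)
+(require 'string-port)
+
+(call-with-input-string "abcdef"
+  (lambda (port)
+    ;; The second call reaches EOF after only two characters.
+    (let* ((head (ssax:read-string 4 port))
+           (tail (ssax:read-string 4 port)))
+      (list head tail))))
+   @result{} ("abcd" "ef")
+@end example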
+@end defun
+
+@subsection Data Types
+
+@table @code
+
+@item TAG-KIND
+
+A symbol @samp{START}, @samp{END}, @samp{PI}, @samp{DECL},
+@samp{COMMENT}, @samp{CDSECT}, or @samp{ENTITY-REF} that identifies
+a markup token.
+
+@item UNRES-NAME
+
+a name (called GI in the XML Recommendation) as given in an XML
+document for a markup token: start-tag, PI target, attribute name.
+If a GI is an NCName, UNRES-NAME is this NCName converted into a
+Scheme symbol. If a GI is a QName, @samp{UNRES-NAME} is a pair of
+symbols: @code{(@var{PREFIX} . @var{LOCALPART})}.
+
+@item RES-NAME
+
+An expanded name, a resolved version of an @samp{UNRES-NAME}. For
+an element or an attribute name with a non-empty namespace URI,
+@samp{RES-NAME} is a pair of symbols,
+@code{(@var{URI-SYMB} . @var{LOCALPART})}.
+Otherwise, it's a single symbol.
+
+@item ELEM-CONTENT-MODEL
+
+A symbol:
+@table @samp
+@item ANY
+anything goes, expect an END tag.
+@item EMPTY-TAG
+no content, and no END-tag is coming
+@item EMPTY
+no content, expect the END-tag as the next token
+@item PCDATA
+expect character data only, and no children elements
+@item MIXED
+@item ELEM-CONTENT
+@end table
+
+@item URI-SYMB
+
+A symbol representing a namespace URI -- or another symbol chosen by
+the user to represent the URI. In the former case, @code{URI-SYMB} is
+created by %-quoting of bad URI characters and converting the
+resulting string into a symbol.
+
+@item NAMESPACES
+
+A list representing namespaces in effect. An element of the list
+has one of the following forms:
+
+@table @code
+
+@item (@var{prefix} @var{uri-symb} . @var{uri-symb}) or
+
+@item (@var{prefix} @var{user-prefix} . @var{uri-symb})
+@var{user-prefix} is a symbol chosen by the user to represent the URI.
+
+@item (#f @var{user-prefix} . @var{uri-symb})
+Specification of the user-chosen prefix and a URI-SYMBOL.
+
+@item (*DEFAULT* @var{user-prefix} . @var{uri-symb})
+Declaration of the default namespace
+
+@item (*DEFAULT* #f . #f)
+Un-declaration of the default namespace. This notation
+represents overriding of the previous declaration.
+
+@end table
+
+A NAMESPACES list may contain several elements for the same @var{prefix}.
+The one closest to the beginning of the list takes effect.
+
+@item ATTLIST
+
+An ordered collection of (@var{NAME} . @var{VALUE}) pairs, where
+@var{NAME} is a RES-NAME or an UNRES-NAME. The collection is an ADT.
+
+@item STR-HANDLER
+
+A procedure of three arguments: @var{string1} @var{string2}
+@var{seed} returning a new @var{seed}. The procedure is supposed to
+handle a chunk of character data @var{string1} followed by a chunk
+of character data @var{string2}. @var{string2} is a short string,
+often @samp{"\n"} and even @samp{""}.
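+
+As an illustration, a handler that accumulates character data in a
+list kept in the seed might look like this (the name
+@code{collect-char-data} is hypothetical, not part of the library):
+
+@example
+;; Push both chunks onto the seed, most recent first.
+;; An empty STRING2 is skipped.
+(define (collect-char-data string1 string2 seed)
+  (if (string=? string2 "")
+      (cons string1 seed)
+      (cons string2 (cons string1 seed))))
+
+(collect-char-data "abc" "" '())
+   @result{} ("abc")
+@end example
+
+The resulting list is in reverse document order, which is the form
+@code{ssax:reverse-collect-str} expects.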
+
+@item ENTITIES
+An assoc list of pairs:
+@lisp
+ (@var{named-entity-name} . @var{named-entity-body})
+@end lisp
+
+where @var{named-entity-name} is a symbol under which the entity was
+declared, @var{named-entity-body} is either a string, or (for an
+external entity) a thunk that will return an input port (from which
+the entity can be read). @var{named-entity-body} may also be #f.
+This is an indication that a @var{named-entity-name} is currently
+being expanded. A reference to this @var{named-entity-name} will be
+an error: violation of the WFC nonrecursion.
+
+@item XML-TOKEN
+
+This record represents markup, which, according to the XML
+Recommendation, "takes the form of start-tags, end-tags,
+empty-element tags, entity references, character references,
+comments, CDATA section delimiters, document type declarations, and
+processing instructions."
+
+@table @asis
+@item kind
+a TAG-KIND
+@item head
+an UNRES-NAME. For XML-TOKENs of kinds 'COMMENT and 'CDSECT, the
+head is #f.
+@end table
+
+For example,
+@example
+<P> => kind=START, head=P
+</P> => kind=END, head=P
+<BR/> => kind=EMPTY-EL, head=BR
+<!DOCTYPE OMF ...> => kind=DECL, head=DOCTYPE
+<?xml version="1.0"?> => kind=PI, head=xml
+&my-ent; => kind=ENTITY-REF, head=my-ent
+@end example
+
+Character references are not represented by xml-tokens as these
+references are transparently resolved into the corresponding
+characters.
+
+@item XML-DECL
+
+The record represents a datatype of an XML document: the list of
+declared elements and their attributes, declared notations, list of
+replacement strings or loading procedures for parsed general
+entities, etc. Normally an XML-DECL record is created from a DTD or
+an XML Schema, although it can be created and filled in in many
+other ways (e.g., loaded from a file).
+
+@table @var
+@item elems
+an (assoc) list of decl-elem or #f. The latter instructs
+the parser to do no validation of elements and attributes.
+
+@item decl-elem
+declaration of one element:
+
+@code{(@var{elem-name} @var{elem-content} @var{decl-attrs})}
+
+@var{elem-name} is an UNRES-NAME for the element.
+
+@var{elem-content} is an ELEM-CONTENT-MODEL.
+
+@var{decl-attrs} is an @code{ATTLIST}, of
+@code{(@var{attr-name} . @var{value})} associations.
+
+This element can declare a user procedure to handle parsing of an
+element (e.g., to do a custom validation, or to build a hash of IDs
+as they're encountered).
+
+@item decl-attr
+an element of an @code{ATTLIST}, declaration of one attribute:
+
+@code{(@var{attr-name} @var{content-type} @var{use-type} @var{default-value})}
+
+@var{attr-name} is an UNRES-NAME for the declared attribute.
+
+@var{content-type} is a symbol: @code{CDATA}, @code{NMTOKEN},
+@code{NMTOKENS}, @dots{} or a list of strings for the enumerated
+type.
+
+@var{use-type} is a symbol: @code{REQUIRED}, @code{IMPLIED}, or
+@code{FIXED}.
+
+@var{default-value} is a string for the default value, or #f if not
+given.
+
+@end table
+
+@end table
+
+@subsection Low-Level Parsers and Scanners
+
+@noindent
+These procedures deal with primitive lexical units (Names,
+whitespace, tags) and with pieces of more generic productions.
+Most of these parsers must be called in the appropriate context. For
+example, @code{ssax:complete-start-tag} must be called only when the
+start-tag has been detected and its GI has been read.
+
+
+@defun ssax:skip-s port
+
+
+Skip the S (whitespace) production as defined by
+@example
+[3] S ::= (#x20 | #x09 | #x0D | #x0A)+
+@end example
+
+@code{ssax:skip-s} returns the first non-whitespace character it encounters while
+scanning the @var{port}. This character is left on the input stream.
+@end defun
+
+
+@defun ssax:read-ncname port
+
+
+Read an NCName starting from the current position in the @var{port} and
+return it as a symbol.
+
+@example
+[4] NameChar ::= Letter | Digit | '.' | '-' | '_' | ':'
+ | CombiningChar | Extender
+[5] Name ::= (Letter | '_' | ':') (NameChar)*
+@end example
+
+This code supports the XML Namespace Recommendation REC-xml-names,
+which modifies the above productions as follows:
+
+@example
+[4] NCNameChar ::= Letter | Digit | '.' | '-' | '_'
+ | CombiningChar | Extender
+[5] NCName ::= (Letter | '_') (NCNameChar)*
+@end example
+
+As REC-xml-names says,
+
+@quotation
+"An XML document conforms to this specification if all other tokens
+[other than element types and attribute names] in the document which
+are required, for XML conformance, to match the XML production for
+Name, match this specification's production for NCName."
+@end quotation
+
+Element types and attribute names must match the production QName,
+defined below.
+@end defun
+
+
+@defun ssax:read-qname port
+
+
+Read a (namespace-) Qualified Name, QName, from the current position
+in @var{port}; and return an UNRES-NAME.
+
+From REC-xml-names:
+@example
+[6] QName ::= (Prefix ':')? LocalPart
+[7] Prefix ::= NCName
+[8] LocalPart ::= NCName
+@end example
+@end defun
+
+
+@defun ssax:read-markup-token port
+
+
+This procedure starts parsing of a markup token. The current
+position in the stream must be @samp{<}. This procedure scans
+enough of the input stream to figure out what kind of a markup token
+it is seeing. The procedure returns an XML-TOKEN structure
+describing the token. Note that, in general, reading of the current
+markup is not finished. In particular, no attributes of the start-tag
+token are scanned.
+
+Here is a detailed breakdown of the return values and the position in
+the @var{port} when each value is returned:
+
+@table @asis
+
+@item PI-token
+
+Only the PI target is read. To finish the Processing-Instruction and
+disregard it, call @code{ssax:skip-pi}. @code{ssax:read-attributes}
+may be useful as well (for PIs whose content is attribute-value
+pairs).
+
+@item END-token
+
+The end tag is read completely; the current position is right after
+the terminating @samp{>} character.
+
+@item COMMENT
+
+The comment is read and skipped completely. The current position is right after
+@samp{-->} that terminates the comment.
+
+@item CDSECT
+
+The current position is right after @samp{<![CDATA[}. Use
+@code{ssax:read-cdata-body} to read the rest.
+
+@item DECL
+
+We have read the keyword (the one that follows @samp{<!})
+identifying this declaration markup. The current position is after
+the keyword (usually a whitespace character).
+
+@item START-token
+
+We have read the keyword (GI) of this start tag. No attributes are
+scanned yet. We don't know if this tag has empty content either.
+Use @code{ssax:complete-start-tag} to finish parsing of the token.
+
+@end table
+@end defun
+
+
+@defun ssax:skip-pi port
+
+
+The current position is inside a PI. Skip to the end of the PI.
+@end defun
+
+
+@defun ssax:read-pi-body-as-string port
+
+
+The current position is right after reading the PITarget. We read
+the body of the PI and return it as a string. The port will point to
+the character right after the @samp{?>} combination that terminates the PI.
+
+@example
+[16] PI ::= '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'
+@end example
+@end defun
+
+
+@defun ssax:skip-internal-dtd port
+
+
+The current position in the port is inside an internal DTD subset
+(e.g., after reading the @samp{#\[} that begins an internal DTD subset).
+Skip until the @samp{]>} combination that terminates this DTD.
+@end defun
+
+
+@defun ssax:read-cdata-body port str-handler seed
+
+
+This procedure must be called after we have read a string
+@samp{<![CDATA[} that begins a CDATA section. The current position
+must be the first position of the CDATA body. This function reads
+@emph{lines} of the CDATA body and passes them to a @var{str-handler}, a character
+data consumer.
+
+@var{str-handler} is a procedure taking arguments: @var{string1}, @var{string2},
+and @var{seed}. The first @var{string1} argument to @var{str-handler} never
+contains a newline; the second @var{string2} argument often will.
+On the first invocation of @var{str-handler}, @var{seed} is the one passed to @code{ssax:read-cdata-body} as the
+third argument. The result of this first invocation will be passed
+as the @var{seed} argument to the second invocation of the line
+consumer, and so on. The result of the last invocation of the @var{str-handler} is
+returned by the @code{ssax:read-cdata-body}. Note a similarity to the fundamental @dfn{fold}
+@cindex fold
+iterator.
+
+Within a CDATA section all characters are taken at their face value,
+with three exceptions:
+@itemize @bullet
+@item
+CR, LF, and CRLF are treated as line delimiters, and passed
+as a single @samp{#\newline} to @var{str-handler}
+
+@item
+@samp{]]>} combination is the end of the CDATA section.
+@samp{&gt;} is treated as an embedded @samp{>} character.
+
+@item
+@samp{&lt;} and @samp{&amp;} are not specially recognized (and are
+not expanded)!
+
+@end itemize
+@end defun
+
+
+@defun ssax:read-char-ref port
+
+
+@example
+[66] CharRef ::= '&#' [0-9]+ ';'
+ | '&#x' [0-9a-fA-F]+ ';'
+@end example
+
+This procedure must be called after we have read @samp{&#} that
+introduces a char reference. The procedure reads this reference and
+returns the corresponding char. The current position in PORT will
+be after the @samp{;} that terminates the char reference.
+
+Faults detected:@*
+WFC: XML-Spec.html#wf-Legalchar
+
+According to Section @cite{4.1 Character and Entity References}
+of the XML Recommendation:
+
+@quotation
+"[Definition: A character reference refers to a specific character
+in the ISO/IEC 10646 character set, for example one not directly
+accessible from available input devices.]"
+@end quotation
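+
+A small sketch, assuming @code{call-with-input-string} from
+@code{(require 'string-port)} and an implementation whose
+characters cover ASCII. The port content stands for the input that
+remains after @samp{&#} has already been consumed:
+
+@example
+(require 'xml-parse)
+(require 'string-port)
+
+;; Remainder of the hexadecimal reference &#x41;
+(call-with-input-string "x41;"
+  (lambda (port) (ssax:read-char-ref port)))
+   @result{} #\A
+@end example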
+
+@c Therefore, we use a @code{ucscode->char} function to convert a
+@c character code into the character -- *regardless* of the current
+@c character encoding of the input stream.
+@end defun
+
+
+@defun ssax:handle-parsed-entity port name entities content-handler str-handler seed
+
+
+Expands and handles a parsed-entity reference.
+
+@var{name} is a symbol, the name of the parsed entity to expand.
+@c entities - see ENTITIES
+@var{content-handler} is a procedure of arguments @var{port}, @var{entities}, and
+@var{seed} that returns a seed.
+@var{str-handler} is called if the entity in question is a pre-declared entity.
+
+@code{ssax:handle-parsed-entity} returns the result returned by @var{content-handler} or @var{str-handler}.
+
+Faults detected:@*
+WFC: XML-Spec.html#wf-entdeclared@*
+WFC: XML-Spec.html#norecursion
+@end defun
+
+
+@defun attlist-add attlist name-value
+
+
+Add a @var{name-value} pair to the existing @var{attlist}, preserving its sorted ascending
+order; and return the new list. Return #f if a pair with the same
+name already exists in @var{attlist}.
+@end defun
+
+
+@defun attlist-remove-top attlist
+
+
+Given a non-null @var{attlist}, return a pair of values: the top and the rest.
+@end defun
+
+
+@defun ssax:read-attributes port entities
+
+
+This procedure reads and parses a production @dfn{Attribute}.
+@cindex Attribute
+
+@example
+[41] Attribute ::= Name Eq AttValue
+[10] AttValue ::= '"' ([^<&"] | Reference)* '"'
+ | "'" ([^<&'] | Reference)* "'"
+[25] Eq ::= S? '=' S?
+@end example
+
+The procedure returns an ATTLIST, of Name (as UNRES-NAME), Value (as
+string) pairs. The current character on the @var{port} is a non-whitespace
+character that is not an NCName-starting character.
+
+Note the following rules to keep in mind when reading an
+@dfn{AttValue}:
+@cindex AttValue
+@quotation
+Before the value of an attribute is passed to the application or
+checked for validity, the XML processor must normalize it as
+follows:
+
+@itemize @bullet
+@item
+A character reference is processed by appending the referenced
+character to the attribute value.
+
+@item
+An entity reference is processed by recursively processing the
+replacement text of the entity. The named entities @samp{amp},
+@samp{lt}, @samp{gt}, @samp{quot}, and @samp{apos} are pre-declared.
+
+@item
+A whitespace character (#x20, #x0D, #x0A, #x09) is processed by
+appending #x20 to the normalized value, except that only a single
+#x20 is appended for a "#x0D#x0A" sequence that is part of an
+external parsed entity or the literal entity value of an internal
+parsed entity.
+
+@item
+Other characters are processed by appending them to the normalized
+value.
+
+@end itemize
+
+@end quotation
+
+Faults detected:@*
+WFC: XML-Spec.html#CleanAttrVals@*
+WFC: XML-Spec.html#uniqattspec
+@end defun
+
+
+@defun ssax:resolve-name port unres-name namespaces apply-default-ns?
+
+
+Convert an @var{unres-name} to a RES-NAME, given the appropriate @var{namespaces} declarations.
+The last parameter, @var{apply-default-ns?}, determines if the default namespace applies
+(for instance, it does not for attribute names).
+
+Per REC-xml-names/#nsc-NSDeclared, the "xml" prefix is considered
+pre-declared and bound to the namespace name
+"http://www.w3.org/XML/1998/namespace".
+
+@code{ssax:resolve-name} tests for the namespace constraints:@*
+@url{http://www.w3.org/TR/REC-xml-names/#nsc-NSDeclared}
+@end defun
+
+
+@defun ssax:complete-start-tag tag port elems entities namespaces
+
+
+Complete parsing of a start-tag markup. @code{ssax:complete-start-tag} must be called after the
+start tag token has been read. @var{tag} is an UNRES-NAME. @var{elems} is an
+instance of the ELEMS slot of XML-DECL; it can be #f to tell the
+function to do @emph{no} validation of elements and their
+attributes.
+
+@code{ssax:complete-start-tag} returns several values:
+@itemize @bullet
+@item ELEM-GI:
+a RES-NAME.
+@item ATTRIBUTES:
+element's attributes, an ATTLIST of (RES-NAME . STRING) pairs.
+The list does NOT include xmlns attributes.
+@item NAMESPACES:
+the input list of namespaces amended with namespace
+(re-)declarations contained within the start-tag under parsing
+@item ELEM-CONTENT-MODEL
+@end itemize
+
+On exit, the current position in @var{port} will be the first character
+after @samp{>} that terminates the start-tag markup.
+
+Faults detected:@*
+VC: XML-Spec.html#enum@*
+VC: XML-Spec.html#RequiredAttr@*
+VC: XML-Spec.html#FixedAttr@*
+VC: XML-Spec.html#ValueType@*
+WFC: XML-Spec.html#uniqattspec (after namespaces prefixes are resolved)@*
+VC: XML-Spec.html#elementvalid@*
+WFC: REC-xml-names/#dt-NSName
+
+@emph{Note}: although the XML Recommendation does not explicitly say it,
+xmlns and xmlns: attributes don't have to be declared (although they
+can be declared, to specify their default value).
+@end defun
+
+
+@defun ssax:read-external-id port
+
+
+Parses an ExternalID production:
+
+@example
+[75] ExternalID ::= 'SYSTEM' S SystemLiteral
+ | 'PUBLIC' S PubidLiteral S SystemLiteral
+[11] SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'")
+[12] PubidLiteral ::= '"' PubidChar* '"'
+ | "'" (PubidChar - "'")* "'"
+[13] PubidChar ::= #x20 | #x0D | #x0A | [a-zA-Z0-9]
+ | [-'()+,./:=?;!*#@@$_%]
+@end example
+
+Call @code{ssax:read-external-id} when an ExternalID is expected; that is, the current
+character must be either #\S or #\P that starts correspondingly a
+SYSTEM or PUBLIC token. @code{ssax:read-external-id} returns the @var{SystemLiteral} as a
+string. A @var{PubidLiteral} is disregarded if present.
+@end defun
+
+@subsection Mid-Level Parsers and Scanners
+
+@noindent
+These procedures parse productions corresponding to the whole
+(document) entity or its higher-level pieces (prolog, root element,
+etc).
+
+
+@defun ssax:scan-misc port
+
+
+Scan the Misc production in the context:
+
+@example
+[1] document ::= prolog element Misc*
+[22] prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?
+[27] Misc ::= Comment | PI | S
+@end example
+
+Call @code{ssax:scan-misc} in the prolog or epilog contexts. In these contexts,
+whitespace is completely ignored. The return value from @code{ssax:scan-misc} is
+either a PI-token, a DECL-token, a START token, or *EOF*. Comments
+are ignored and not reported.
+@end defun
+
+
+@defun ssax:read-char-data port expect-eof? str-handler iseed
+
+
+Read the character content of an XML document or an XML element.
+
+@example
+[43] content ::=
+(element | CharData | Reference | CDSect | PI | Comment)*
+@end example
+
+To be more precise, @code{ssax:read-char-data} reads CharData, expands CDSect and character
+entities, and skips comments. @code{ssax:read-char-data} stops at a named reference, EOF,
+at the beginning of a PI, or a start/end tag.
+
+@var{expect-eof?} is a boolean indicating if EOF is normal; i.e., the character
+data may be terminated by the EOF. EOF is normal while processing a
+parsed entity.
+
+@var{iseed} is an argument passed to the first invocation of @var{str-handler}.
+
+@code{ssax:read-char-data} returns two results: @var{seed} and @var{token}. The @var{seed}
+is the result of the last invocation of @var{str-handler}, or the original @var{iseed} if @var{str-handler}
+was never called.
+
+@var{token} can be either an eof-object (this can happen only if @var{expect-eof?}
+was #t), or:
+@itemize @bullet
+
+@item
+an xml-token describing a START tag or an END-tag;
+For a start token, the caller has to finish reading it.
+
+@item
+an xml-token describing the beginning of a PI. It's up to an
+application to read or skip through the rest of this PI;
+
+@item
+an xml-token describing a named entity reference.
+
+@end itemize
+
+CDATA sections and character references are expanded inline and
+never returned. Comments are silently disregarded.
+
+As the XML Recommendation requires, all whitespace in character data
+must be preserved. However, a CR character (#x0D) must be
+disregarded if it appears before a LF character (#x0A), or replaced
+by a #x0A character otherwise. See Secs. 2.10 and 2.11 of the XML
+Recommendation. See also the canonical XML Recommendation.
+@end defun
+
+
+@defun ssax:assert-token token kind gi error-cont
+
+
+Make sure that @var{token} is of the anticipated @var{kind} and has the anticipated @var{gi}. Note
+that the @var{gi} argument may actually be a pair of two symbols:
+the namespace URI or the prefix, and the local name. If the assertion
+fails, @var{error-cont} is called with three arguments: @var{token}, @var{kind}, and @var{gi}. The
+result of @var{error-cont} is returned.
+@end defun
+
+@subsection High-level Parsers
+
+These procedures instantiate an SSAX parser. A user can
+instantiate the parser to do full validation, no validation,
+or any particular validation. The user specifies which PIs the
+parser should report, and what to do with the parsed
+character and element data. The latter handlers determine whether
+parsing follows a SAX or a DOM model.
+
+
+@defun ssax:make-pi-parser my-pi-handlers
+
+
+Create a parser to parse and process one Processing Instruction (PI).
+
+@var{my-pi-handlers} is an association list of pairs
+@code{(@var{pi-tag} . @var{pi-handler})} where @var{pi-tag} is an
+NCName symbol, the PI target; and @var{pi-handler} is a procedure
+taking arguments @var{port}, @var{pi-tag}, and @var{seed}.
+
+@var{pi-handler} should read the rest of the PI up to and including
+the combination @samp{?>} that terminates the PI. The handler
+should return a new seed. One of the @var{pi-tag}s may be the
+symbol @code{*DEFAULT*}. The corresponding handler will handle PIs
+that no other handler will. If the *DEFAULT* @var{pi-tag} is not
+specified, @code{ssax:make-pi-parser} will assume the default handler that skips the body of
+the PI.
+
+@code{ssax:make-pi-parser} returns a procedure of arguments @var{port}, @var{pi-tag}, and
+@var{seed}; that will parse the current PI according to @var{my-pi-handlers}.
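+
+A sketch of one possible handler list, assuming the procedural
+interface documented here. The seed is taken to be a list, and the
+name @code{my-pi-parser} is hypothetical:
+
+@example
+(require 'xml-parse)
+
+;; The *DEFAULT* handler records every PI as (pi-tag . body)
+;; instead of skipping it.  ssax:read-pi-body-as-string
+;; consumes the input through the terminating ?> combination.
+(define my-pi-parser
+  (ssax:make-pi-parser
+   (list
+    (cons '*DEFAULT*
+          (lambda (port pi-tag seed)
+            (cons (cons pi-tag
+                        (ssax:read-pi-body-as-string port))
+                  seed))))))
+@end example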
+@end defun
+
+
+@defun ssax:make-elem-parser my-new-level-seed my-finish-element my-char-data-handler my-pi-handlers
+
+
+Create a parser to parse and process one element, including its
+character content or children elements. The parser is typically
+applied to the root element of a document.
+
+@table @asis
+
+@item @var{my-new-level-seed}
+is a procedure taking arguments:
+
+@var{elem-gi} @var{attributes} @var{namespaces} @var{expected-content} @var{seed}
+
+where @var{elem-gi} is a RES-NAME of the element about to be
+processed.
+
+@var{my-new-level-seed} is to generate the seed to be passed to handlers that process the
+content of the element.
+
+@item @var{my-finish-element}
+is a procedure taking arguments:
+
+@var{elem-gi} @var{attributes} @var{namespaces} @var{parent-seed} @var{seed}
+
+@var{my-finish-element} is called when parsing of @var{elem-gi} is finished.
+The @var{seed} is the result from the last content parser (or
+from @var{my-new-level-seed} if the element has empty content).
+@var{parent-seed} is the same seed as was passed to @var{my-new-level-seed}.
+@var{my-finish-element} is to generate a seed that will be the result
+of the element parser.
+
+@item @var{my-char-data-handler}
+is a STR-HANDLER as described in Data Types above.
+
+@item @var{my-pi-handlers}
+is as described for @code{ssax:make-pi-parser} above.
+
+@end table
+
+The generated parser is a procedure taking arguments:
+
+@var{start-tag-head} @var{port} @var{elems} @var{entities} @var{namespaces} @var{preserve-ws?} @var{seed}
+
+The procedure must be called after the start tag token has been
+read. @var{start-tag-head} is an UNRES-NAME from the start-element
+tag. @var{elems} is an instance of the ELEMS slot of XML-DECL.
+
+Faults detected:@*
+VC: XML-Spec.html#elementvalid@*
+WFC: XML-Spec.html#GIMatch
+@end defun
+
+
+@defun ssax:make-parser user-handler-tag user-handler @dots{}
+
+
+Create an XML parser, an instance of the XML parsing framework.
+This will be a SAX, a DOM, or a specialized parser depending on the
+supplied user-handlers.
+
+@code{ssax:make-parser} takes an even number of arguments; @var{user-handler-tag} is a symbol that identifies
+a procedure (or association list for @code{PROCESSING-INSTRUCTIONS})
+(@var{user-handler}) that follows the tag. Given below are tags and signatures of
+the corresponding procedures. Not all tags have to be specified.
+If some are omitted, reasonable defaults will apply.
+
+@table @samp
+
+@item DOCTYPE
+handler-procedure: @var{port} @var{docname} @var{systemid} @var{internal-subset?} @var{seed}
+
+If @var{internal-subset?} is #t, the current position in the port is
+right after we have read @samp{[} that begins the internal DTD
+subset. We must finish reading of this subset before we return (or
+must call @code{ssax:skip-internal-dtd} if we aren't interested in
+reading it). @var{port} at exit must be at the first symbol after
+the whole DOCTYPE declaration.
+
+The handler-procedure must generate four values:
+@quotation
+@var{elems} @var{entities} @var{namespaces} @var{seed}
+@end quotation
+
+@var{elems} is as defined for the ELEMS slot of XML-DECL. It may be
+#f to switch off validation. @var{namespaces} will typically
+contain @var{user-prefix}es for selected @var{uri-symb}s. The
+default handler-procedure skips the internal subset, if any, and
+returns @code{(values #f '() '() seed)}.
+
+@item UNDECL-ROOT
+procedure: @var{elem-gi} @var{seed}
+
+where @var{elem-gi} is an UNRES-NAME of the root element. This
+procedure is called when an XML document under parsing contains
+@emph{no} DOCTYPE declaration.
+
+The handler-procedure, as a DOCTYPE handler procedure above,
+must generate four values:
+@quotation
+@var{elems} @var{entities} @var{namespaces} @var{seed}
+@end quotation
+
+The default handler-procedure returns @code{(values #f '() '() seed)}.
+
+@item DECL-ROOT
+procedure: @var{elem-gi} @var{seed}
+
+where @var{elem-gi} is an UNRES-NAME of the root element. This
+procedure is called when an XML document under parsing does contain
+a DOCTYPE declaration. The handler-procedure must generate a new
+@var{seed} (and verify that the name of the root element matches the
+doctype, if the handler so wishes). The default handler-procedure
+is the identity function.
+
+@item NEW-LEVEL-SEED
+procedure: see ssax:make-elem-parser, my-new-level-seed
+
+@item FINISH-ELEMENT
+procedure: see ssax:make-elem-parser, my-finish-element
+
+@item CHAR-DATA-HANDLER
+procedure: see ssax:make-elem-parser, my-char-data-handler
+
+@item PROCESSING-INSTRUCTIONS
+association list as is passed to @code{ssax:make-pi-parser}.
+The default value is @code{'()}.
+
+@end table
+
+The generated parser is a procedure of arguments @var{port} and
+@var{seed}.
+
+This procedure parses the document prolog and then exits to an
+element parser (created by @code{ssax:make-elem-parser}) to handle
+the rest.
+
+@example
+[1] document ::= prolog element Misc*
+[22] prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?
+[27] Misc ::= Comment | PI | S
+[28] doctypedecl ::= '<!DOCTYPE' S Name (S ExternalID)? S?
+ ('[' (markupdecl | PEReference | S)* ']' S?)? '>'
+[29] markupdecl ::= elementdecl | AttlistDecl
+ | EntityDecl
+ | NotationDecl | PI
+ | Comment
+@end example
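+
+A minimal sketch of a SAX-style instantiation that counts elements,
+assuming the handler tags are passed as quoted symbols and that
+@code{call-with-input-string} from @code{(require 'string-port)} is
+available. The name @code{count-elements} and the sample document
+are made up for illustration:
+
+@example
+(require 'xml-parse)
+(require 'string-port)
+
+;; The seed is a running count of finished elements.
+(define count-elements
+  (ssax:make-parser
+   'NEW-LEVEL-SEED
+   (lambda (elem-gi attributes namespaces expected-content seed)
+     seed)                     ; children continue the same count
+   'FINISH-ELEMENT
+   (lambda (elem-gi attributes namespaces parent-seed seed)
+     (+ 1 seed))               ; count this element
+   'CHAR-DATA-HANDLER
+   (lambda (string1 string2 seed)
+     seed)))                   ; character data is ignored
+
+(call-with-input-string "<a><b/><c>text</c></a>"
+  (lambda (port) (count-elements port 0)))
+   @result{} 3
+@end example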
+@end defun
+
+@subsection Parsing XML to SXML
+
+
+@defun ssax:xml->sxml port namespace-prefix-assig
+
+
+This is an instance of the SSAX parser that returns an SXML
+representation of the XML document to be read from @var{port}. @var{namespace-prefix-assig} is a list
+of pairs @code{(@var{user-prefix} . @var{uri-string})} that assign
+@var{user-prefix}es to certain namespaces identified by particular
+@var{uri-string}s. It may be an empty list. @code{ssax:xml->sxml} returns an SXML
+tree. On return, the port points to the first character after the root
+element.
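+
+A brief usage sketch, assuming @code{call-with-input-string} from
+@code{(require 'string-port)}; the sample document is invented for
+illustration, and no namespace prefixes are assigned:
+
+@example
+(require 'xml-parse)
+(require 'string-port)
+
+(call-with-input-string "<a><b>text</b></a>"
+  (lambda (port)
+    (ssax:xml->sxml port '())))
+   @result{} (*TOP* (a (b "text")))
+@end example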
+@end defun
+