diff options
Diffstat (limited to 'xml-parse.txi')
-rw-r--r-- | xml-parse.txi | 1010 |
1 files changed, 1010 insertions, 0 deletions
diff --git a/xml-parse.txi b/xml-parse.txi new file mode 100644 index 0000000..365d914 --- /dev/null +++ b/xml-parse.txi @@ -0,0 +1,1010 @@ +@code{(require 'xml-parse)} or @code{(require 'ssax)} + +@noindent +The XML standard document referred to in this module is@* +@url{http://www.w3.org/TR/1998/REC-xml-19980210.html}. + +@noindent +The present frameworks fully supports the XML Namespaces +Recommendation@* +@url{http://www.w3.org/TR/REC-xml-names}. + +@subsection String Glue + + +@defun ssax:reverse-collect-str list-of-frags + + +Given the list of fragments (some of which are text strings), +reverse the list and concatenate adjacent text strings. If +LIST-OF-FRAGS has zero or one element, the result of the procedure +is @code{equal?} to its argument. +@end defun + + +@defun ssax:reverse-collect-str-drop-ws list-of-frags + + +Given the list of fragments (some of which are text strings), +reverse the list and concatenate adjacent text strings while +dropping "unsignificant" whitespace, that is, whitespace in front, +behind and between elements. The whitespace that is included in +character data is not affected. + +Use this procedure to "intelligently" drop "insignificant" +whitespace in the parsed SXML. If the strict compliance with the +XML Recommendation regarding the whitespace is desired, use the +@code{ssax:reverse-collect-str} procedure instead. +@end defun + +@subsection Character and Token Functions + +The following functions either skip, or build and return tokens, +according to inclusion or delimiting semantics. The list of +characters to expect, include, or to break at may vary from one +invocation of a function to another. This allows the functions to +easily parse even context-sensitive languages. + +Exceptions are mentioned specifically. The list of expected +characters (characters to skip until, or break-characters) may +include an EOF "character", which is coded as symbol *eof* + +The input stream to parse is specified as a PORT, which is the last +argument. + + +@defun ssax:assert-current-char char-list string port + + +Reads a character from the @var{port} and looks it up in the +@var{char-list} of expected characters. If the read character was +found among expected, it is returned. Otherwise, the +procedure writes a message using @var{string} as a comment +and quits. +@end defun + + +@defun ssax:skip-while char-list port + + +Reads characters from the @var{port} and disregards them, as long as they +are mentioned in the @var{char-list}. The first character (which may be EOF) +peeked from the stream that is @emph{not} a member of the @var{char-list} is +returned. +@end defun + + +@defun ssax:init-buffer + + +Returns an initial buffer for @code{ssax:next-token*} procedures. +@code{ssax:init-buffer} may allocate a new buffer at each invocation. +@end defun + + +@defun ssax:next-token prefix-char-list break-char-list comment-string port + + +Skips any number of the prefix characters (members of the @var{prefix-char-list}), if +any, and reads the sequence of characters up to (but not including) +a break character, one of the @var{break-char-list}. + +The string of characters thus read is returned. The break character +is left on the input stream. @var{break-char-list} may include the symbol @code{*eof*}; +otherwise, EOF is fatal, generating an error message including a +specified @var{comment-string}. +@end defun + +@noindent +@code{ssax:next-token-of} is similar to @code{ssax:next-token} +except that it implements an inclusion rather than delimiting +semantics. + + +@defun ssax:next-token-of inc-charset port + + +Reads characters from the @var{port} that belong to the list of characters +@var{inc-charset}. The reading stops at the first character which is not a member +of the set. This character is left on the stream. All the read +characters are returned in a string. + + +@defunx ssax:next-token-of pred port + +Reads characters from the @var{port} for which @var{pred} (a procedure of +one argument) returns non-#f. The reading stops at the first +character for which @var{pred} returns #f. That character is left +on the stream. All the results of evaluating of @var{pred} up to #f +are returned in a string. + +@var{pred} is a procedure that takes one argument (a character or +the EOF object) and returns a character or #f. The returned +character does not have to be the same as the input argument to the +@var{pred}. For example, + +@example +(ssax:next-token-of (lambda (c) + (cond ((eof-object? c) #f) + ((char-alphabetic? c) (char-downcase c)) + (else #f))) + (current-input-port)) +@end example + +will try to read an alphabetic token from the current input port, +and return it in lower case. +@end defun + + +@defun ssax:read-string len port + + +Reads @var{len} characters from the @var{port}, and returns them in a string. If +EOF is encountered before @var{len} characters are read, a shorter string +will be returned. +@end defun + +@subsection Data Types + +@table @code + +@item TAG-KIND + +A symbol @samp{START}, @samp{END}, @samp{PI}, @samp{DECL}, +@samp{COMMENT}, @samp{CDSECT}, or @samp{ENTITY-REF} that identifies +a markup token + +@item UNRES-NAME + +a name (called GI in the XML Recommendation) as given in an XML +document for a markup token: start-tag, PI target, attribute name. +If a GI is an NCName, UNRES-NAME is this NCName converted into a +Scheme symbol. If a GI is a QName, @samp{UNRES-NAME} is a pair of +symbols: @code{(@var{PREFIX} . @var{LOCALPART})}. + +@item RES-NAME + +An expanded name, a resolved version of an @samp{UNRES-NAME}. For +an element or an attribute name with a non-empty namespace URI, +@samp{RES-NAME} is a pair of symbols, +@code{(@var{URI-SYMB} . @var{LOCALPART})}. +Otherwise, it's a single symbol. + +@item ELEM-CONTENT-MODEL + +A symbol: +@table @samp +@item ANY +anything goes, expect an END tag. +@item EMPTY-TAG +no content, and no END-tag is coming +@item EMPTY +no content, expect the END-tag as the next token +@item PCDATA +expect character data only, and no children elements +@item MIXED +@item ELEM-CONTENT +@end table + +@item URI-SYMB + +A symbol representing a namespace URI -- or other symbol chosen by +the user to represent URI. In the former case, @code{URI-SYMB} is +created by %-quoting of bad URI characters and converting the +resulting string into a symbol. + +@item NAMESPACES + +A list representing namespaces in effect. An element of the list +has one of the following forms: + +@table @code + +@item (@var{prefix} @var{uri-symb} . @var{uri-symb}) or + +@item (@var{prefix} @var{user-prefix} . @var{uri-symb}) +@var{user-prefix} is a symbol chosen by the user to represent the URI. + +@item (#f @var{user-prefix} . @var{uri-symb}) +Specification of the user-chosen prefix and a URI-SYMBOL. + +@item (*DEFAULT* @var{user-prefix} . @var{uri-symb}) +Declaration of the default namespace + +@item (*DEFAULT* #f . #f) +Un-declaration of the default namespace. This notation +represents overriding of the previous declaration + +@end table + +A NAMESPACES list may contain several elements for the same @var{prefix}. +The one closest to the beginning of the list takes effect. + +@item ATTLIST + +An ordered collection of (@var{NAME} . @var{VALUE}) pairs, where +@var{NAME} is a RES-NAME or an UNRES-NAME. The collection is an ADT. + +@item STR-HANDLER + +A procedure of three arguments: @var{string1} @var{string2} +@var{seed} returning a new @var{seed}. The procedure is supposed to +handle a chunk of character data @var{string1} followed by a chunk +of character data @var{string2}. @var{string2} is a short string, +often @samp{"\n"} and even @samp{""}. + +@item ENTITIES +An assoc list of pairs: +@lisp + (@var{named-entity-name} . @var{named-entity-body}) +@end lisp + +where @var{named-entity-name} is a symbol under which the entity was +declared, @var{named-entity-body} is either a string, or (for an +external entity) a thunk that will return an input port (from which +the entity can be read). @var{named-entity-body} may also be #f. +This is an indication that a @var{named-entity-name} is currently +being expanded. A reference to this @var{named-entity-name} will be +an error: violation of the WFC nonrecursion. + +@item XML-TOKEN + +This record represents a markup, which is, according to the XML +Recommendation, "takes the form of start-tags, end-tags, +empty-element tags, entity references, character references, +comments, CDATA section delimiters, document type declarations, and +processing instructions." + +@table @asis +@item kind +a TAG-KIND +@item head +an UNRES-NAME. For XML-TOKENs of kinds 'COMMENT and 'CDSECT, the +head is #f. +@end table + +For example, +@example +<P> => kind=START, head=P +</P> => kind=END, head=P +<BR/> => kind=EMPTY-EL, head=BR +<!DOCTYPE OMF ...> => kind=DECL, head=DOCTYPE +<?xml version="1.0"?> => kind=PI, head=xml +&my-ent; => kind=ENTITY-REF, head=my-ent +@end example + +Character references are not represented by xml-tokens as these +references are transparently resolved into the corresponding +characters. + +@item XML-DECL + +The record represents a datatype of an XML document: the list of +declared elements and their attributes, declared notations, list of +replacement strings or loading procedures for parsed general +entities, etc. Normally an XML-DECL record is created from a DTD or +an XML Schema, although it can be created and filled in in many +other ways (e.g., loaded from a file). + +@table @var +@item elems +an (assoc) list of decl-elem or #f. The latter instructs +the parser to do no validation of elements and attributes. + +@item decl-elem +declaration of one element: + +@code{(@var{elem-name} @var{elem-content} @var{decl-attrs})} + +@var{elem-name} is an UNRES-NAME for the element. + +@var{elem-content} is an ELEM-CONTENT-MODEL. + +@var{decl-attrs} is an @code{ATTLIST}, of +@code{(@var{attr-name} . @var{value})} associations. + +This element can declare a user procedure to handle parsing of an +element (e.g., to do a custom validation, or to build a hash of IDs +as they're encountered). + +@item decl-attr +an element of an @code{ATTLIST}, declaration of one attribute: + +@code{(@var{attr-name} @var{content-type} @var{use-type} @var{default-value})} + +@var{attr-name} is an UNRES-NAME for the declared attribute. + +@var{content-type} is a symbol: @code{CDATA}, @code{NMTOKEN}, +@code{NMTOKENS}, @dots{} or a list of strings for the enumerated +type. + +@var{use-type} is a symbol: @code{REQUIRED}, @code{IMPLIED}, or +@code{FIXED}. + +@var{default-value} is a string for the default value, or #f if not +given. + +@end table + +@end table + +@subsection Low-Level Parsers and Scanners + +@noindent +These procedures deal with primitive lexical units (Names, +whitespaces, tags) and with pieces of more generic productions. +Most of these parsers must be called in appropriate context. For +example, @code{ssax:complete-start-tag} must be called only when the +start-tag has been detected and its GI has been read. + + +@defun ssax:skip-s port + + +Skip the S (whitespace) production as defined by +@example +[3] S ::= (#x20 | #x09 | #x0D | #x0A) +@end example + +@code{ssax:skip-s} returns the first not-whitespace character it encounters while +scanning the @var{port}. This character is left on the input stream. +@end defun + + +@defun ssax:read-ncname port + + +Read a NCName starting from the current position in the @var{port} and +return it as a symbol. + +@example +[4] NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' + | CombiningChar | Extender +[5] Name ::= (Letter | '_' | ':') (NameChar)* +@end example + +This code supports the XML Namespace Recommendation REC-xml-names, +which modifies the above productions as follows: + +@example +[4] NCNameChar ::= Letter | Digit | '.' | '-' | '_' + | CombiningChar | Extender +[5] NCName ::= (Letter | '_') (NCNameChar)* +@end example + +As the Rec-xml-names says, + +@quotation +"An XML document conforms to this specification if all other tokens +[other than element types and attribute names] in the document which +are required, for XML conformance, to match the XML production for +Name, match this specification's production for NCName." +@end quotation + +Element types and attribute names must match the production QName, +defined below. +@end defun + + +@defun ssax:read-qname port + + +Read a (namespace-) Qualified Name, QName, from the current position +in @var{port}; and return an UNRES-NAME. + +From REC-xml-names: +@example +[6] QName ::= (Prefix ':')? LocalPart +[7] Prefix ::= NCName +[8] LocalPart ::= NCName +@end example +@end defun + + +@defun ssax:read-markup-token port + + +This procedure starts parsing of a markup token. The current +position in the stream must be @samp{<}. This procedure scans +enough of the input stream to figure out what kind of a markup token +it is seeing. The procedure returns an XML-TOKEN structure +describing the token. Note, generally reading of the current markup +is not finished! In particular, no attributes of the start-tag +token are scanned. + +Here's a detailed break out of the return values and the position in +the PORT when that particular value is returned: + +@table @asis + +@item PI-token + +only PI-target is read. To finish the Processing-Instruction and +disregard it, call @code{ssax:skip-pi}. @code{ssax:read-attributes} +may be useful as well (for PIs whose content is attribute-value +pairs). + +@item END-token + +The end tag is read completely; the current position is right after +the terminating @samp{>} character. + +@item COMMENT + +is read and skipped completely. The current position is right after +@samp{-->} that terminates the comment. + +@item CDSECT + +The current position is right after @samp{<!CDATA[}. Use +@code{ssax:read-cdata-body} to read the rest. + +@item DECL + +We have read the keyword (the one that follows @samp{<!}) +identifying this declaration markup. The current position is after +the keyword (usually a whitespace character) + +@item START-token + +We have read the keyword (GI) of this start tag. No attributes are +scanned yet. We don't know if this tag has an empty content either. +Use @code{ssax:complete-start-tag} to finish parsing of the token. + +@end table +@end defun + + +@defun ssax:skip-pi port + + +The current position is inside a PI. Skip till the rest of the PI +@end defun + + +@defun ssax:read-pi-body-as-string port + + +The current position is right after reading the PITarget. We read +the body of PI and return is as a string. The port will point to +the character right after @samp{?>} combination that terminates PI. + +@example +[16] PI ::= '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>' +@end example +@end defun + + +@defun ssax:skip-internal-dtd port + + +The current pos in the port is inside an internal DTD subset (e.g., +after reading @samp{#\[} that begins an internal DTD subset) Skip +until the @samp{]>} combination that terminates this DTD. +@end defun + + +@defun ssax:read-cdata-body port str-handler seed + + +This procedure must be called after we have read a string +@samp{<![CDATA[} that begins a CDATA section. The current position +must be the first position of the CDATA body. This function reads +@emph{lines} of the CDATA body and passes them to a @var{str-handler}, a character +data consumer. + +@var{str-handler} is a procedure taking arguments: @var{string1}, @var{string2}, +and @var{seed}. The first @var{string1} argument to @var{str-handler} never +contains a newline; the second @var{string2} argument often will. +On the first invocation of @var{str-handler}, @var{seed} is the one passed to @code{ssax:read-cdata-body} as the +third argument. The result of this first invocation will be passed +as the @var{seed} argument to the second invocation of the line +consumer, and so on. The result of the last invocation of the @var{str-handler} is +returned by the @code{ssax:read-cdata-body}. Note a similarity to the fundamental @dfn{fold} +@cindex fold +iterator. + +Within a CDATA section all characters are taken at their face value, +with three exceptions: +@itemize @bullet +@item +CR, LF, and CRLF are treated as line delimiters, and passed +as a single @samp{#\newline} to @var{str-handler} + +@item +@samp{]]>} combination is the end of the CDATA section. +@samp{>} is treated as an embedded @samp{>} character. + +@item +@samp{<} and @samp{&} are not specially recognized (and are +not expanded)! + +@end itemize +@end defun + + +@defun ssax:read-char-ref port + + +@example +[66] CharRef ::= '&#' [0-9]+ ';' + | '&#x' [0-9a-fA-F]+ ';' +@end example + +This procedure must be called after we we have read @samp{&#} that +introduces a char reference. The procedure reads this reference and +returns the corresponding char. The current position in PORT will +be after the @samp{;} that terminates the char reference. + +Faults detected:@* +WFC: XML-Spec.html#wf-Legalchar + +According to Section @cite{4.1 Character and Entity References} +of the XML Recommendation: + +@quotation +"[Definition: A character reference refers to a specific character +in the ISO/IEC 10646 character set, for example one not directly +accessible from available input devices.]" +@end quotation + +@c Therefore, we use a @code{ucscode->char} function to convert a +@c character code into the character -- *regardless* of the current +@c character encoding of the input stream. +@end defun + + +@defun ssax:handle-parsed-entity port name entities content-handler str-handler seed + + +Expands and handles a parsed-entity reference. + +@var{name} is a symbol, the name of the parsed entity to expand. +@c entities - see ENTITIES +@var{content-handler} is a procedure of arguments @var{port}, @var{entities}, and +@var{seed} that returns a seed. +@var{str-handler} is called if the entity in question is a pre-declared entity. + +@code{ssax:handle-parsed-entity} returns the result returned by @var{content-handler} or @var{str-handler}. + +Faults detected:@* +WFC: XML-Spec.html#wf-entdeclared@* +WFC: XML-Spec.html#norecursion +@end defun + + +@defun attlist-add attlist name-value + + +Add a @var{name-value} pair to the existing @var{attlist}, preserving its sorted ascending +order; and return the new list. Return #f if a pair with the same +name already exists in @var{attlist} +@end defun + + +@defun attlist-remove-top attlist + + +Given an non-null @var{attlist}, return a pair of values: the top and the rest. +@end defun + + +@defun ssax:read-attributes port entities + + +This procedure reads and parses a production @dfn{Attribute}. +@cindex Attribute + +@example +[41] Attribute ::= Name Eq AttValue +[10] AttValue ::= '"' ([^<&"] | Reference)* '"' + | "'" ([^<&'] | Reference)* "'" +[25] Eq ::= S? '=' S? +@end example + +The procedure returns an ATTLIST, of Name (as UNRES-NAME), Value (as +string) pairs. The current character on the @var{port} is a non-whitespace +character that is not an NCName-starting character. + +Note the following rules to keep in mind when reading an +@dfn{AttValue}: +@cindex AttValue +@quotation +Before the value of an attribute is passed to the application or +checked for validity, the XML processor must normalize it as +follows: + +@itemize @bullet +@item +A character reference is processed by appending the referenced +character to the attribute value. + +@item +An entity reference is processed by recursively processing the +replacement text of the entity. The named entities @samp{amp}, +@samp{lt}, @samp{gt}, @samp{quot}, and @samp{apos} are pre-declared. + +@item +A whitespace character (#x20, #x0D, #x0A, #x09) is processed by +appending #x20 to the normalized value, except that only a single +#x20 is appended for a "#x0D#x0A" sequence that is part of an +external parsed entity or the literal entity value of an internal +parsed entity. + +@item +Other characters are processed by appending them to the normalized +value. + +@end itemize + +@end quotation + +Faults detected:@* +WFC: XML-Spec.html#CleanAttrVals@* +WFC: XML-Spec.html#uniqattspec +@end defun + + +@defun ssax:resolve-name port unres-name namespaces apply-default-ns? + + +Convert an @var{unres-name} to a RES-NAME, given the appropriate @var{namespaces} declarations. +The last parameter, @var{apply-default-ns?}, determines if the default namespace applies +(for instance, it does not for attribute names). + +Per REC-xml-names/#nsc-NSDeclared, the "xml" prefix is considered +pre-declared and bound to the namespace name +"http://www.w3.org/XML/1998/namespace". + +@code{ssax:resolve-name} tests for the namespace constraints:@* +@url{http://www.w3.org/TR/REC-xml-names/#nsc-NSDeclared} +@end defun + + +@defun ssax:complete-start-tag tag port elems entities namespaces + + +Complete parsing of a start-tag markup. @code{ssax:complete-start-tag} must be called after the +start tag token has been read. @var{tag} is an UNRES-NAME. @var{elems} is an +instance of the ELEMS slot of XML-DECL; it can be #f to tell the +function to do @emph{no} validation of elements and their +attributes. + +@code{ssax:complete-start-tag} returns several values: +@itemize @bullet +@item ELEM-GI: +a RES-NAME. +@item ATTRIBUTES: +element's attributes, an ATTLIST of (RES-NAME . STRING) pairs. +The list does NOT include xmlns attributes. +@item NAMESPACES: +the input list of namespaces amended with namespace +(re-)declarations contained within the start-tag under parsing +@item ELEM-CONTENT-MODEL +@end itemize + +On exit, the current position in @var{port} will be the first character +after @samp{>} that terminates the start-tag markup. + +Faults detected:@* +VC: XML-Spec.html#enum@* +VC: XML-Spec.html#RequiredAttr@* +VC: XML-Spec.html#FixedAttr@* +VC: XML-Spec.html#ValueType@* +WFC: XML-Spec.html#uniqattspec (after namespaces prefixes are resolved)@* +VC: XML-Spec.html#elementvalid@* +WFC: REC-xml-names/#dt-NSName + +@emph{Note}: although XML Recommendation does not explicitly say it, +xmlns and xmlns: attributes don't have to be declared (although they +can be declared, to specify their default value). +@end defun + + +@defun ssax:read-external-id port + + +Parses an ExternalID production: + +@example +[75] ExternalID ::= 'SYSTEM' S SystemLiteral + | 'PUBLIC' S PubidLiteral S SystemLiteral +[11] SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'") +[12] PubidLiteral ::= '"' PubidChar* '"' + | "'" (PubidChar - "'")* "'" +[13] PubidChar ::= #x20 | #x0D | #x0A | [a-zA-Z0-9] + | [-'()+,./:=?;!*#@@$_%] +@end example + +Call @code{ssax:read-external-id} when an ExternalID is expected; that is, the current +character must be either #\S or #\P that starts correspondingly a +SYSTEM or PUBLIC token. @code{ssax:read-external-id} returns the @var{SystemLiteral} as a +string. A @var{PubidLiteral} is disregarded if present. +@end defun + +@subsection Mid-Level Parsers and Scanners + +@noindent +These procedures parse productions corresponding to the whole +(document) entity or its higher-level pieces (prolog, root element, +etc). + + +@defun ssax:scan-misc port + + +Scan the Misc production in the context: + +@example +[1] document ::= prolog element Misc* +[22] prolog ::= XMLDecl? Misc* (doctypedec l Misc*)? +[27] Misc ::= Comment | PI | S +@end example + +Call @code{ssax:scan-misc} in the prolog or epilog contexts. In these contexts, +whitespaces are completely ignored. The return value from @code{ssax:scan-misc} is +either a PI-token, a DECL-token, a START token, or *EOF*. Comments +are ignored and not reported. +@end defun + + +@defun ssax:read-char-data port expect-eof? str-handler iseed + + +Read the character content of an XML document or an XML element. + +@example +[43] content ::= +(element | CharData | Reference | CDSect | PI | Comment)* +@end example + +To be more precise, @code{ssax:read-char-data} reads CharData, expands CDSect and character +entities, and skips comments. @code{ssax:read-char-data} stops at a named reference, EOF, +at the beginning of a PI, or a start/end tag. + +@var{expect-eof?} is a boolean indicating if EOF is normal; i.e., the character +data may be terminated by the EOF. EOF is normal while processing a +parsed entity. + +@var{iseed} is an argument passed to the first invocation of @var{str-handler}. + +@code{ssax:read-char-data} returns two results: @var{seed} and @var{token}. The @var{seed} +is the result of the last invocation of @var{str-handler}, or the original @var{iseed} if @var{str-handler} +was never called. + +@var{token} can be either an eof-object (this can happen only if @var{expect-eof?} +was #t), or: +@itemize @bullet + +@item +an xml-token describing a START tag or an END-tag; +For a start token, the caller has to finish reading it. + +@item +an xml-token describing the beginning of a PI. It's up to an +application to read or skip through the rest of this PI; + +@item +an xml-token describing a named entity reference. + +@end itemize + +CDATA sections and character references are expanded inline and +never returned. Comments are silently disregarded. + +As the XML Recommendation requires, all whitespace in character data +must be preserved. However, a CR character (#x0D) must be +disregarded if it appears before a LF character (#x0A), or replaced +by a #x0A character otherwise. See Secs. 2.10 and 2.11 of the XML +Recommendation. See also the canonical XML Recommendation. +@end defun + + +@defun ssax:assert-token token kind gi error-cont + + +Make sure that @var{token} is of anticipated @var{kind} and has anticipated @var{gi}. Note +that the @var{gi} argument may actually be a pair of two symbols, +Namespace-URI or the prefix, and of the localname. If the assertion +fails, @var{error-cont} is evaluated by passing it three arguments: @var{token} @var{kind} @var{gi}. The +result of @var{error-cont} is returned. +@end defun + +@subsection High-level Parsers + +These procedures are to instantiate a SSAX parser. A user can +instantiate the parser to do the full validation, or no validation, +or any particular validation. The user specifies which PI he wants +to be notified about. The user tells what to do with the parsed +character and element data. The latter handlers determine if the +parsing follows a SAX or a DOM model. + + +@defun ssax:make-pi-parser my-pi-handlers + + +Create a parser to parse and process one Processing Element (PI). + +@var{my-pi-handlers} is an association list of pairs +@code{(@var{pi-tag} . @var{pi-handler})} where @var{pi-tag} is an +NCName symbol, the PI target; and @var{pi-handler} is a procedure +taking arguments @var{port}, @var{pi-tag}, and @var{seed}. + +@var{pi-handler} should read the rest of the PI up to and including +the combination @samp{?>} that terminates the PI. The handler +should return a new seed. One of the @var{pi-tag}s may be the +symbol @code{*DEFAULT*}. The corresponding handler will handle PIs +that no other handler will. If the *DEFAULT* @var{pi-tag} is not +specified, @code{ssax:make-pi-parser} will assume the default handler that skips the body of +the PI. + +@code{ssax:make-pi-parser} returns a procedure of arguments @var{port}, @var{pi-tag}, and +@var{seed}; that will parse the current PI according to @var{my-pi-handlers}. +@end defun + + +@defun ssax:make-elem-parser my-new-level-seed my-finish-element my-char-data-handler my-pi-handlers + + +Create a parser to parse and process one element, including its +character content or children elements. The parser is typically +applied to the root element of a document. + +@table @asis + +@item @var{my-new-level-seed} +is a procedure taking arguments: + +@var{elem-gi} @var{attributes} @var{namespaces} @var{expected-content} @var{seed} + +where @var{elem-gi} is a RES-NAME of the element about to be +processed. + +@var{my-new-level-seed} is to generate the seed to be passed to handlers that process the +content of the element. + +@item @var{my-finish-element} +is a procedure taking arguments: + +@var{elem-gi} @var{attributes} @var{namespaces} @var{parent-seed} @var{seed} + +@var{my-finish-element} is called when parsing of @var{elem-gi} is finished. +The @var{seed} is the result from the last content parser (or +from @var{my-new-level-seed} if the element has the empty content). +@var{parent-seed} is the same seed as was passed to @var{my-new-level-seed}. +@var{my-finish-element} is to generate a seed that will be the result +of the element parser. + +@item @var{my-char-data-handler} +is a STR-HANDLER as described in Data Types above. + +@item @var{my-pi-handlers} +is as described for @code{ssax:make-pi-handler} above. + +@end table + +The generated parser is a procedure taking arguments: + +@var{start-tag-head} @var{port} @var{elems} @var{entities} @var{namespaces} @var{preserve-ws?} @var{seed} + +The procedure must be called after the start tag token has been +read. @var{start-tag-head} is an UNRES-NAME from the start-element +tag. ELEMS is an instance of ELEMS slot of XML-DECL. + +Faults detected:@* +VC: XML-Spec.html#elementvalid@* +WFC: XML-Spec.html#GIMatch +@end defun + + +@defun ssax:make-parser user-handler-tag user-handler @dots{} + + +Create an XML parser, an instance of the XML parsing framework. +This will be a SAX, a DOM, or a specialized parser depending on the +supplied user-handlers. + +@code{ssax:make-parser} takes an even number of arguments; @var{user-handler-tag} is a symbol that identifies +a procedure (or association list for @code{PROCESSING-INSTRUCTIONS}) +(@var{user-handler}) that follows the tag. Given below are tags and signatures of +the corresponding procedures. Not all tags have to be specified. +If some are omitted, reasonable defaults will apply. + +@table @samp + +@item DOCTYPE +handler-procedure: @var{port} @var{docname} @var{systemid} @var{internal-subset?} @var{seed} + +If @var{internal-subset?} is #t, the current position in the port is +right after we have read @samp{[} that begins the internal DTD +subset. We must finish reading of this subset before we return (or +must call @code{skip-internal-dtd} if we aren't interested in +reading it). @var{port} at exit must be at the first symbol after +the whole DOCTYPE declaration. + +The handler-procedure must generate four values: +@quotation +@var{elems} @var{entities} @var{namespaces} @var{seed} +@end quotation + +@var{elems} is as defined for the ELEMS slot of XML-DECL. It may be +#f to switch off validation. @var{namespaces} will typically +contain @var{user-prefix}es for selected @var{uri-symb}s. The +default handler-procedure skips the internal subset, if any, and +returns @code{(values #f '() '() seed)}. + +@item UNDECL-ROOT +procedure: @var{elem-gi} @var{seed} + +where @var{elem-gi} is an UNRES-NAME of the root element. This +procedure is called when an XML document under parsing contains +@emph{no} DOCTYPE declaration. + +The handler-procedure, as a DOCTYPE handler procedure above, +must generate four values: +@quotation +@var{elems} @var{entities} @var{namespaces} @var{seed} +@end quotation + +The default handler-procedure returns (values #f '() '() seed) + +@item DECL-ROOT +procedure: @var{elem-gi} @var{seed} + +where @var{elem-gi} is an UNRES-NAME of the root element. This +procedure is called when an XML document under parsing does contains +the DOCTYPE declaration. The handler-procedure must generate a new +@var{seed} (and verify that the name of the root element matches the +doctype, if the handler so wishes). The default handler-procedure +is the identity function. + +@item NEW-LEVEL-SEED +procedure: see ssax:make-elem-parser, my-new-level-seed + +@item FINISH-ELEMENT +procedure: see ssax:make-elem-parser, my-finish-element + +@item CHAR-DATA-HANDLER +procedure: see ssax:make-elem-parser, my-char-data-handler + +@item PROCESSING-INSTRUCTIONS +association list as is passed to @code{ssax:make-pi-parser}. +The default value is '() + +@end table + +The generated parser is a procedure of arguments @var{port} and +@var{seed}. + +This procedure parses the document prolog and then exits to an +element parser (created by @code{ssax:make-elem-parser}) to handle +the rest. + +@example +[1] document ::= prolog element Misc* +[22] prolog ::= XMLDecl? Misc* (doctypedec | Misc*)? +[27] Misc ::= Comment | PI | S +[28] doctypedecl ::= '<!DOCTYPE' S Name (S ExternalID)? S? + ('[' (markupdecl | PEReference | S)* ']' S?)? '>' +[29] markupdecl ::= elementdecl | AttlistDecl + | EntityDecl + | NotationDecl | PI + | Comment +@end example +@end defun + +@subsection Parsing XML to SXML + + +@defun ssax:xml->sxml port namespace-prefix-assig + + +This is an instance of the SSAX parser that returns an SXML +representation of the XML document to be read from @var{port}. @var{namespace-prefix-assig} is a list +of @code{(@var{user-prefix} . @var{uri-string})} that assigns +@var{user-prefix}es to certain namespaces identified by particular +@var{uri-string}s. It may be an empty list. @code{ssax:xml->sxml} returns an SXML +tree. The port points out to the first character after the root +element. +@end defun + |