diff options
author | bnewbold <bnewbold@eta.mit.edu> | 2009-02-11 14:24:16 -0500 |
---|---|---|
committer | bnewbold <bnewbold@eta.mit.edu> | 2009-02-11 14:24:16 -0500 |
commit | 2b7db5b23df55e3a1ac0639494bea750d0797c9d (patch) | |
tree | 700983f54296f16984629f25a395e751531f515f /ps01_grep | |
parent | 40f5708a4d382f358a854b336a9e8db9ffb3fccc (diff) | |
download | 6.945-2b7db5b23df55e3a1ac0639494bea750d0797c9d.tar.gz 6.945-2b7db5b23df55e3a1ac0639494bea750d0797c9d.zip |
first pset work
Diffstat (limited to 'ps01_grep')
-rw-r--r-- | ps01_grep/posix_standard.html | 1109 | ||||
-rw-r--r-- | ps01_grep/ps01_bnewbold.txt | 345 | ||||
-rw-r--r-- | ps01_grep/ps01_work.scm | 331 |
3 files changed, 1785 insertions, 0 deletions
diff --git a/ps01_grep/posix_standard.html b/ps01_grep/posix_standard.html new file mode 100644 index 0000000..6c1aeb2 --- /dev/null +++ b/ps01_grep/posix_standard.html @@ -0,0 +1,1109 @@ +<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> +<html> +<head> +<meta name="generator" content="HTML Tidy, see www.w3.org"> +<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> +<link type="text/css" rel="stylesheet" href="style.css"> +<!-- Generated by The Open Group's rhtm tool v1.2.1 --> +<!-- Copyright (c) 2001-2004 IEEE and The Open Group, All Rights Reserved --> +<title>Regular Expressions</title> +</head> +<body bgcolor="white"> + +<basefont size="3"> <!--header start--> +<center><font size="2">The Open Group Base Specifications Issue 6<br> +IEEE Std 1003.1, 2004 Edition<br> +Copyright © 2001-2004 The IEEE and The Open Group, All Rights reserved.</font></center> + +<!--header end--> +<hr size="2" noshade> +<h2><a name="tag_09"> 9. </a>Regular Expressions</h2> + +<p>Regular Expressions (REs) provide a mechanism to select specific strings from a set of character strings.</p> + +<p>Regular expressions are a context-independent syntax that can represent a wide variety of character sets and character set +orderings, where these character sets are interpreted according to the current locale. While many regular expressions can be +interpreted differently depending on the current locale, many features, such as character class expressions, provide for contextual +invariance across locales.</p> + +<p>The Basic Regular Expression (BRE) notation and construction rules in <a href="#tag_09_03">Basic Regular Expressions</a> shall +apply to most utilities supporting regular expressions. Some utilities, instead, support the Extended Regular Expressions (ERE) +described in <a href="#tag_09_04">Extended Regular Expressions</a>; any exceptions for both cases are noted in the descriptions of +the specific utilities using regular expressions. Both BREs and EREs are supported by the Regular Expression Matching interface in +the System Interfaces volume of IEEE Std 1003.1-2001 under <a href="../functions/regcomp.html"><i>regcomp</i>()</a>, <a +href="../functions/regexec.html"><i>regexec</i>()</a>, and related functions.</p> + +<h3><a name="tag_09_01"> 9.1 </a>Regular Expression Definitions</h3> + +<p>For the purposes of this section, the following definitions shall apply:</p> + +<h4><a name="tag_09_01_01"></a>entire regular expression</h4> + +<p>The concatenated set of one or more BREs or EREs that make up the pattern specified for string selection.</p> + +<h4><a name="tag_09_01_02"></a>matched</h4> + +<p>A sequence of zero or more characters shall be said to be matched by a BRE or ERE when the characters in the sequence correspond +to a sequence of characters defined by the pattern.</p> + +<p>Matching shall be based on the bit pattern used for encoding the character, not on the graphic representation of the character. +This means that if a character set contains two or more encodings for a graphic symbol, or if the strings searched contain text +encoded in more than one codeset, no attempt is made to search for any other representation of the encoded symbol. If that is +required, the user can specify equivalence classes containing all variations of the desired graphic symbol.</p> + +<p>The search for a matching sequence starts at the beginning of a string and stops when the first sequence matching the expression +is found, where "first" is defined to mean "begins earliest in the string". If the pattern permits a variable number of +matching characters and thus there is more than one such sequence starting at that point, the longest such sequence is matched. For +example, the BRE <tt>"bb*"</tt> matches the second to fourth characters of the string <tt>"abbbc"</tt>, and the ERE +<tt>"(wee|week)(knights|night)"</tt> matches all ten characters of the string <tt>"weeknights"</tt>.</p> + +<p>Consistent with the whole match being the longest of the leftmost matches, each subpattern, from left to right, shall match the +longest possible string. For this purpose, a null string shall be considered to be longer than no match at all. For example, +matching the BRE <tt>"\(.*\).*"</tt> against <tt>"abcdef"</tt>, the subexpression <tt>"(\1)"</tt> is <tt>"abcdef"</tt>, and +matching the BRE <tt>"\(a*\)*"</tt> against <tt>"bc"</tt>, the subexpression <tt>"(\1)"</tt> is the null string.</p> + +<p>When a multi-character collating element in a bracket expression (see <a href="#tag_09_03_05">RE Bracket Expression</a>) is +involved, the longest sequence shall be measured in characters consumed from the string to be matched; that is, the collating +element counts not as one element, but as the number of characters it matches.</p> + +<h4><a name="tag_09_01_03"></a>BRE (ERE) matching a single character</h4> + +<p>A BRE or ERE that shall match either a single character or a single collating element.</p> + +<p>Only a BRE or ERE of this type that includes a bracket expression (see <a href="#tag_09_03_05">RE Bracket Expression</a>) can +match a collating element.</p> + +<h4><a name="tag_09_01_04"></a>BRE (ERE) matching multiple characters</h4> + +<p>A BRE or ERE that shall match a concatenation of single characters or collating elements.</p> + +<p>Such a BRE or ERE is made up from a BRE (ERE) matching a single character and BRE (ERE) special characters.</p> + +<h4><a name="tag_09_01_05"></a>invalid</h4> + +<p>This section uses the term "invalid" for certain constructs or conditions. Invalid REs shall cause the utility or function +using the RE to generate an error condition. When invalid is not used, violations of the specified syntax or semantics for REs +produce undefined results: this may entail an error, enabling an extended syntax for that RE, or using the construct in error as +literal characters to be matched. For example, the BRE construct <tt>"\{1,2,3\}"</tt> does not comply with the grammar. A +conforming application cannot rely on it producing an error nor matching the literal characters <tt>"\{1,2,3\}"</tt>.</p> + +<h3><a name="tag_09_02"> 9.2 </a>Regular Expression General Requirements</h3> + +<p>The requirements in this section shall apply to both basic and extended regular expressions.</p> + +<p>The use of regular expressions is generally associated with text processing. REs (BREs and EREs) operate on text strings; that +is, zero or more characters followed by an end-of-string delimiter (typically NUL). Some utilities employing regular expressions +limit the processing to lines; that is, zero or more characters followed by a <newline>. In the regular expression processing +described in IEEE Std 1003.1-2001, the <newline> is regarded as an ordinary character and both a period and a +non-matching list can match one. The Shell and Utilities volume of IEEE Std 1003.1-2001 specifies within the individual +descriptions of those standard utilities employing regular expressions whether they permit matching of <newline>s; if not +stated otherwise, the use of literal <newline>s or any escape sequence equivalent produces undefined results. Those utilities +(like <a href="../utilities/grep.html"><i>grep</i></a>) that do not allow <newline>s to match are responsible for eliminating +any <newline> from strings before matching against the RE. The <a href="../functions/regcomp.html"><i>regcomp</i>()</a> +function in the System Interfaces volume of IEEE Std 1003.1-2001, however, can provide support for such processing +without violating the rules of this section.</p> + +<p>The interfaces specified in IEEE Std 1003.1-2001 do not permit the inclusion of a NUL character in an RE or in the +string to be matched. If during the operation of a standard utility a NUL is included in the text designated to be matched, that +NUL may designate the end of the text string for the purposes of matching.</p> + +<p>When a standard utility or function that uses regular expressions specifies that pattern matching shall be performed without +regard to the case (uppercase or lowercase) of either data or patterns, then when each character in the string is matched against +the pattern, not only the character, but also its case counterpart (if any), shall be matched. This definition of case-insensitive +processing is intended to allow matching of multi-character collating elements as well as characters, as each character in the +string is matched using both its cases. For example, in a locale where <tt>"Ch"</tt> is a multi-character collating element and +where a matching list expression matches such elements, the RE <tt>"[[.Ch.]]"</tt> when matched against the string <tt>"char"</tt> +is in reality matched against <tt>"ch"</tt>, <tt>"Ch"</tt>, <tt>"cH"</tt>, and <tt>"CH"</tt>.</p> + +<p>The implementation shall support any regular expression that does not exceed 256 bytes in length.</p> + +<h3><a name="tag_09_03"> 9.3 </a>Basic Regular Expressions</h3> + +<h4><a name="tag_09_03_01"> 9.3.1 </a>BREs Matching a Single Character or Collating Element</h4> + +<p>A BRE ordinary character, a special character preceded by a backslash, or a period shall match a single character. A bracket +expression shall match a single character or a single collating element.</p> + +<h4><a name="tag_09_03_02"> 9.3.2 </a>BRE Ordinary Characters</h4> + +<p>An ordinary character is a BRE that matches itself: any character in the supported character set, except for the BRE special +characters listed in <a href="#tag_09_03_03">BRE Special Characters</a>.</p> + +<p>The interpretation of an ordinary character preceded by a backslash ( <tt>'\'</tt> ) is undefined, except for:</p> + +<ul> +<li> +<p>The characters <tt>')'</tt>, <tt>'('</tt>, <tt>'{'</tt>, and <tt>'}'</tt></p> +</li> + +<li> +<p>The digits 1 to 9 inclusive (see <a href="#tag_09_03_06">BREs Matching Multiple Characters</a>)</p> +</li> + +<li> +<p>A character inside a bracket expression</p> +</li> +</ul> + +<h4><a name="tag_09_03_03"> 9.3.3 </a>BRE Special Characters</h4> + +<p>A BRE special character has special properties in certain contexts. Outside those contexts, or when preceded by a backslash, +such a character is a BRE that matches the special character itself. The BRE special characters and the contexts in which they have +their special meaning are as follows:</p> + +<dl compact> +<dt><tt>.[\</tt></dt> + +<dd>The period, left-bracket, and backslash shall be special except when used in a bracket expression (see <a href= +"#tag_09_03_05">RE Bracket Expression</a>). An expression containing a <tt>'['</tt> that is not preceded by a backslash and is not +part of a bracket expression produces undefined results.</dd> + +<dt><tt>*</tt></dt> + +<dd>The asterisk shall be special except when used: + +<ul> +<li> +<p>In a bracket expression</p> +</li> + +<li> +<p>As the first character of an entire BRE (after an initial <tt>'^'</tt>, if any)</p> +</li> + +<li> +<p>As the first character of a subexpression (after an initial <tt>'^'</tt>, if any); see <a href="#tag_09_03_06">BREs Matching +Multiple Characters</a></p> +</li> +</ul> +</dd> + +<dt><tt>^</tt></dt> + +<dd>The circumflex shall be special when used as: + +<ul> +<li> +<p>An anchor (see <a href="#tag_09_03_08">BRE Expression Anchoring</a>)</p> +</li> + +<li> +<p>The first character of a bracket expression (see <a href="#tag_09_03_05">RE Bracket Expression</a>)</p> +</li> +</ul> +</dd> + +<dt><tt>$</tt></dt> + +<dd>The dollar sign shall be special when used as an anchor.</dd> +</dl> + +<h4><a name="tag_09_03_04"> 9.3.4 </a>Periods in BREs</h4> + +<p>A period ( <tt>'.'</tt> ), when used outside a bracket expression, is a BRE that shall match any character in the supported +character set except NUL.</p> + +<h4><a name="tag_09_03_05"> 9.3.5 </a>RE Bracket Expression</h4> + +<p>A bracket expression (an expression enclosed in square brackets, <tt>"[]"</tt> ) is an RE that shall match a single collating +element contained in the non-empty set of collating elements represented by the bracket expression.</p> + +<p>The following rules and definitions apply to bracket expressions:</p> + +<ol> +<li> +<p>A bracket expression is either a matching list expression or a non-matching list expression. It consists of one or more +expressions: collating elements, collating symbols, equivalence classes, character classes, or range expressions. The right-bracket +( <tt>']'</tt> ) shall lose its special meaning and represent itself in a bracket expression if it occurs first in the list (after +an initial circumflex ( <tt>'^'</tt> ), if any). Otherwise, it shall terminate the bracket expression, unless it appears in a +collating symbol (such as <tt>"[.].]"</tt> ) or is the ending right-bracket for a collating symbol, equivalence class, or character +class. The special characters <tt>'.'</tt>, <tt>'*'</tt>, <tt>'['</tt>, and <tt>'\'</tt> (period, asterisk, left-bracket, and +backslash, respectively) shall lose their special meaning within a bracket expression.</p> + +<p>The character sequences <tt>"[."</tt>, <tt>"[="</tt>, and <tt>"[:"</tt> (left-bracket followed by a period, equals-sign, or +colon) shall be special inside a bracket expression and are used to delimit collating symbols, equivalence class expressions, and +character class expressions. These symbols shall be followed by a valid expression and the matching terminating sequence +<tt>".]"</tt>, <tt>"=]"</tt>, or <tt>":]"</tt>, as described in the following items.</p> +</li> + +<li> +<p>A matching list expression specifies a list that shall match any single-character collating element in any of the expressions +represented in the list. The first character in the list shall not be the circumflex; for example, <tt>"[abc]"</tt> is an RE that +matches any of the characters <tt>'a'</tt>, <tt>'b'</tt>, or <tt>'c'</tt>. It is unspecified whether a matching list expression +matches a multi-character collating element that is matched by one of the expressions.</p> +</li> + +<li> +<p>A non-matching list expression begins with a circumflex ( <tt>'^'</tt> ), and specifies a list that shall match any +single-character collating element except for the expressions represented in the list after the leading circumflex. For example, +<tt>"[^abc]"</tt> is an RE that matches any character except the characters <tt>'a'</tt>, <tt>'b'</tt>, or <tt>'c'</tt>. It is +unspecified whether a non-matching list expression matches a multi-character collating element that is not matched by any of the +expressions. The circumflex shall have this special meaning only when it occurs first in the list, immediately following the +left-bracket.</p> +</li> + +<li> +<p>A collating symbol is a collating element enclosed within bracket-period ( <tt>"[."</tt> and <tt>".]"</tt> ) delimiters. +Collating elements are defined as described in <a href="xbd_chap07.html#tag_07_03_02_04"><i>Collation Order</i></a>. Conforming +applications shall represent multi-character collating elements as collating symbols when it is necessary to distinguish them from +a list of the individual characters that make up the multi-character collating element. For example, if the string <tt>"ch"</tt> is +a collating element defined using the line:</p> + +<blockquote> +<pre> +<tt>collating-element <ch-digraph> from "<c><h>" +</tt> +</pre> +</blockquote> + +<p>in the locale definition, the expression <tt>"[[.ch.]]"</tt> shall be treated as an RE containing the collating symbol +<tt>'ch'</tt>, while <tt>"[ch]"</tt> shall be treated as an RE matching <tt>'c'</tt> or <tt>'h'</tt>. Collating symbols are +recognized only inside bracket expressions. If the string is not a collating element in the current locale, the expression is +invalid.</p> +</li> + +<li> +<p>An equivalence class expression shall represent the set of collating elements belonging to an equivalence class, as described in +<a href="xbd_chap07.html#tag_07_03_02_04"><i>Collation Order</i></a>. Only primary equivalence classes shall be recognized. The +class shall be expressed by enclosing any one of the collating elements in the equivalence class within bracket-equal ( +<tt>"[="</tt> and <tt>"=]"</tt> ) delimiters. For example, if <tt>'a'</tt>, <tt>'à'</tt>, and <tt>'â'</tt> belong to +the same equivalence class, then <tt>"[[=a=]b]"</tt>, <tt>"[[=à=]b]"</tt>, and <tt>"[[=â=]b]"</tt> are each +equivalent to <tt>"[aàâb]"</tt>. If the collating element does not belong to an equivalence class, the equivalence +class expression shall be treated as a collating symbol.</p> +</li> + +<li> +<p>A character class expression shall represent the union of two sets:</p> + +<ol type="a"> +<li> +<p>The set of single-character collating elements whose characters belong to the character class, as defined in the <i>LC_CTYPE</i> +category in the current locale.</p> +</li> + +<li> +<p>An unspecified set of multi-character collating elements.</p> +</li> +</ol> + +<p>All character classes specified in the current locale shall be recognized. A character class expression is expressed as a +character class name enclosed within bracket-colon ( <tt>"[:"</tt> and <tt>":]"</tt> ) delimiters.</p> + +<p>The following character class expressions shall be supported in all locales:</p> + +<blockquote> +<pre> +<tt>[:alnum:] [:cntrl:] [:lower:] [:space:] +[:alpha:] [:digit:] [:print:] [:upper:] +[:blank:] [:graph:] [:punct:] [:xdigit:] +</tt> +</pre> +</blockquote> + +<p>In addition, character class expressions of the form:</p> + +<blockquote> +<pre> +<tt>[:</tt><i>name</i><tt>:] +</tt> +</pre> +</blockquote> + +<p>are recognized in those locales where the <i>name</i> keyword has been given a <b>charclass</b> definition in the +<i>LC_CTYPE</i> category.</p> +</li> + +<li> +<p>In the POSIX locale, a range expression represents the set of collating elements that fall between two elements in the collation +sequence, inclusive. In other locales, a range expression has unspecified behavior: strictly conforming applications shall not rely +on whether the range expression is valid, or on the set of collating elements matched. A range expression shall be expressed as the +starting point and the ending point separated by a hyphen ( <tt>'-'</tt> ).</p> + +<p>In the following, all examples assume the POSIX locale.</p> + +<p>The starting range point and the ending range point shall be a collating element or collating symbol. An equivalence class +expression used as a starting or ending point of a range expression produces unspecified results. An equivalence class can be used +portably within a bracket expression, but only outside the range. If the represented set of collating elements is empty, it is +unspecified whether the expression matches nothing, or is treated as invalid.</p> + +<p>The interpretation of range expressions where the ending range point is also the starting range point of a subsequent range +expression (for example, <tt>"[a-m-o]"</tt> ) is undefined.</p> + +<p>The hyphen character shall be treated as itself if it occurs first (after an initial <tt>'^'</tt>, if any) or last in the list, +or as an ending range point in a range expression. As examples, the expressions <tt>"[-ac]"</tt> and <tt>"[ac-]"</tt> are +equivalent and match any of the characters <tt>'a'</tt>, <tt>'c'</tt>, or <tt>'-'</tt> ; <tt>"[^-ac]"</tt> and <tt>"[^ac-]"</tt> +are equivalent and match any characters except <tt>'a'</tt>, <tt>'c'</tt>, or <tt>'-'</tt> ; the expression <tt>"[%--]"</tt> +matches any of the characters between <tt>'%'</tt> and <tt>'-'</tt> inclusive; the expression <tt>"[--@]"</tt> matches any of the +characters between <tt>'-'</tt> and <tt>'@'</tt> inclusive; and the expression <tt>"[a--@]"</tt> is either invalid or equivalent to +<tt>'@'</tt>, because the letter <tt>'a'</tt> follows the symbol <tt>'-'</tt> in the POSIX locale. To use a hyphen as the starting +range point, it shall either come first in the bracket expression or be specified as a collating symbol; for example, +<tt>"[][.-.]-0]"</tt>, which matches either a right bracket or any character or collating element that collates between hyphen and +0, inclusive.</p> + +<p>If a bracket expression specifies both <tt>'-'</tt> and <tt>']'</tt>, the <tt>']'</tt> shall be placed first (after the +<tt>'^'</tt>, if any) and the <tt>'-'</tt> last within the bracket expression.</p> +</li> +</ol> + +<h4><a name="tag_09_03_06"> 9.3.6 </a>BREs Matching Multiple Characters</h4> + +<p>The following rules can be used to construct BREs matching multiple characters from BREs matching a single character:</p> + +<ol> +<li> +<p>The concatenation of BREs shall match the concatenation of the strings matched by each component of the BRE.</p> +</li> + +<li> +<p>A subexpression can be defined within a BRE by enclosing it between the character pairs <tt>"\("</tt> and <tt>"\)"</tt>. Such a +subexpression shall match whatever it would have matched without the <tt>"\("</tt> and <tt>"\)"</tt>, except that anchoring within +subexpressions is optional behavior; see <a href="#tag_09_03_08">BRE Expression Anchoring</a>. Subexpressions can be arbitrarily +nested.</p> +</li> + +<li> +<p>The back-reference expression <tt>'\n'</tt> shall match the same (possibly empty) string of characters as was matched by a +subexpression enclosed between <tt>"\("</tt> and <tt>"\)"</tt> preceding the <tt>'\n'</tt>. The character <tt>'n'</tt> shall be a +digit from 1 through 9, specifying the <i>n</i>th subexpression (the one that begins with the <i>n</i>th <tt>"\("</tt> from the +beginning of the pattern and ends with the corresponding paired <tt>"\)"</tt> ). The expression is invalid if less than <i>n</i> +subexpressions precede the <tt>'\n'</tt>. For example, the expression <tt>"\(.*\)\1$"</tt> matches a line consisting of two +adjacent appearances of the same string, and the expression <tt>"\(a\)*\1"</tt> fails to match <tt>'a'</tt>. When the referenced +subexpression matched more than one string, the back-referenced expression shall refer to the last matched string. If the +subexpression referenced by the back-reference matches more than one string because of an asterisk ( <tt>'*'</tt> ) or an interval +expression (see item (5)), the back-reference shall match the last (rightmost) of these strings.</p> +</li> + +<li> +<p>When a BRE matching a single character, a subexpression, or a back-reference is followed by the special character asterisk ( +<tt>'*'</tt> ), together with that asterisk it shall match what zero or more consecutive occurrences of the BRE would match. For +example, <tt>"[ab]*"</tt> and <tt>"[ab][ab]"</tt> are equivalent when matching the string <tt>"ab"</tt>.</p> +</li> + +<li> +<p>When a BRE matching a single character, a subexpression, or a back-reference is followed by an interval expression of the format +<tt>"\{m\}"</tt>, <tt>"\{m,\}"</tt>, or <tt>"\{m,n\}"</tt>, together with that interval expression it shall match what repeated +consecutive occurrences of the BRE would match. The values of <i>m</i> and <i>n</i> are decimal integers in the range 0 <= +<i>m</i><= <i>n</i><= {RE_DUP_MAX}, where <i>m</i> specifies the exact or minimum number of occurrences and <i>n</i> +specifies the maximum number of occurrences. The expression <tt>"\{m\}"</tt> shall match exactly <i>m</i> occurrences of the +preceding BRE, <tt>"\{m,\}"</tt> shall match at least <i>m</i> occurrences, and <tt>"\{m,n\}"</tt> shall match any number of +occurrences between <i>m</i> and <i>n</i>, inclusive.</p> + +<p>For example, in the string <tt>"abababccccccd"</tt> the BRE <tt>"c\{3\}"</tt> is matched by characters seven to nine, the BRE +<tt>"\(ab\)\{4,\}"</tt> is not matched at all, and the BRE <tt>"c\{1,3\}d"</tt> is matched by characters ten to thirteen.</p> +</li> +</ol> + +<p>The behavior of multiple adjacent duplication symbols ( <tt>'*'</tt> and intervals) produces undefined results.</p> + +<p>A subexpression repeated by an asterisk ( <tt>'*'</tt> ) or an interval expression shall not match a null expression unless this +is the only match for the repetition or it is necessary to satisfy the exact or minimum number of occurrences for the interval +expression.</p> + +<h4><a name="tag_09_03_07"> 9.3.7 </a>BRE Precedence</h4> + +<p>The order of precedence shall be as shown in the following table:</p> + +<center> +<table border="1" cellpadding="3" align="center"> +<tr valign="top"> +<th colspan="2" align="center"> +<p class="tent"><b>BRE Precedence (from high to low)</b></p> +</th> +</tr> + +<tr valign="top"> +<td align="left"> +<p class="tent">Collation-related bracket symbols</p> +</td> +<td align="left"> +<p class="tent">[==] [::] [..]</p> +</td> +</tr> + +<tr valign="top"> +<td align="left"> +<p class="tent">Escaped characters</p> +</td> +<td align="left"> +<p class="tent">\<special character></p> +</td> +</tr> + +<tr valign="top"> +<td align="left"> +<p class="tent">Bracket expression</p> +</td> +<td align="left"> +<p class="tent">[]</p> +</td> +</tr> + +<tr valign="top"> +<td align="left"> +<p class="tent">Subexpressions/back-references</p> +</td> +<td align="left"> +<p class="tent">\(\) \n</p> +</td> +</tr> + +<tr valign="top"> +<td align="left"> +<p class="tent">Single-character-BRE duplication</p> +</td> +<td align="left"> +<p class="tent">* \{m,n\}</p> +</td> +</tr> + +<tr valign="top"> +<td align="left"> +<p class="tent">Concatenation</p> +</td> +<td align="left"> +<p class="tent"> </p> +</td> +</tr> + +<tr valign="top"> +<td align="left"> +<p class="tent">Anchoring</p> +</td> +<td align="left"> +<p class="tent">^ $</p> +</td> +</tr> +</table> +</center> + +<h4><a name="tag_09_03_08"> 9.3.8 </a>BRE Expression Anchoring</h4> + +<p>A BRE can be limited to matching strings that begin or end a line; this is called "anchoring". The circumflex and dollar sign +special characters shall be considered BRE anchors in the following contexts:</p> + +<ol> +<li> +<p>A circumflex ( <tt>'^'</tt> ) shall be an anchor when used as the first character of an entire BRE. The implementation may treat +the circumflex as an anchor when used as the first character of a subexpression. The circumflex shall anchor the expression (or +optionally subexpression) to the beginning of a string; only sequences starting at the first character of a string shall be matched +by the BRE. For example, the BRE <tt>"^ab"</tt> matches <tt>"ab"</tt> in the string <tt>"abcdef"</tt>, but fails to match in the +string <tt>"cdefab"</tt>. The BRE <tt>"\(^ab\)"</tt> may match the former string. A portable BRE shall escape a leading circumflex +in a subexpression to match a literal circumflex.</p> +</li> + +<li> +<p>A dollar sign ( <tt>'$'</tt> ) shall be an anchor when used as the last character of an entire BRE. The implementation may treat +a dollar sign as an anchor when used as the last character of a subexpression. The dollar sign shall anchor the expression (or +optionally subexpression) to the end of the string being matched; the dollar sign can be said to match the end-of-string following +the last character.</p> +</li> + +<li> +<p>A BRE anchored by both <tt>'^'</tt> and <tt>'$'</tt> shall match only an entire string. For example, the BRE <tt>"^abcdef$"</tt> +matches strings consisting only of <tt>"abcdef"</tt>.</p> +</li> +</ol> + +<h3><a name="tag_09_04"> 9.4 </a>Extended Regular Expressions</h3> + +<p>The extended regular expression (ERE) notation and construction rules shall apply to utilities defined as using extended regular +expressions; any exceptions to the following rules are noted in the descriptions of the specific utilities using EREs.</p> + +<h4><a name="tag_09_04_01"> 9.4.1 </a>EREs Matching a Single Character or Collating Element</h4> + +<p>An ERE ordinary character, a special character preceded by a backslash, or a period shall match a single character. A bracket +expression shall match a single character or a single collating element. An ERE matching a single character enclosed in parentheses +shall match the same as the ERE without parentheses would have matched.</p> + +<h4><a name="tag_09_04_02"> 9.4.2 </a>ERE Ordinary Characters</h4> + +<p>An ordinary character is an ERE that matches itself. An ordinary character is any character in the supported character set, +except for the ERE special characters listed in <a href="#tag_09_04_03">ERE Special Characters</a>. The interpretation of an +ordinary character preceded by a backslash ( <tt>'\'</tt> ) is undefined.</p> + +<h4><a name="tag_09_04_03"> 9.4.3 </a>ERE Special Characters</h4> + +<p>An ERE special character has special properties in certain contexts. Outside those contexts, or when preceded by a backslash, +such a character shall be an ERE that matches the special character itself. The extended regular expression special characters and +the contexts in which they shall have their special meaning are as follows:</p> + +<dl compact> +<dt><tt>.[\(</tt></dt> + +<dd>The period, left-bracket, backslash, and left-parenthesis shall be special except when used in a bracket expression (see <a +href="#tag_09_03_05">RE Bracket Expression</a>). Outside a bracket expression, a left-parenthesis immediately followed by a +right-parenthesis produces undefined results.</dd> + +<dt><tt>)</tt></dt> + +<dd>The right-parenthesis shall be special when matched with a preceding left-parenthesis, both outside a bracket expression.</dd> + +<dt><tt>*+?{</tt></dt> + +<dd>The asterisk, plus-sign, question-mark, and left-brace shall be special except when used in a bracket expression (see <a href= +"#tag_09_03_05">RE Bracket Expression</a>). Any of the following uses produce undefined results: + +<ul> +<li> +<p>If these characters appear first in an ERE, or immediately following a vertical-line, circumflex, or left-parenthesis</p> +</li> + +<li> +<p>If a left-brace is not part of a valid interval expression (see <a href="#tag_09_04_06">EREs Matching Multiple Characters</a> +)</p> +</li> +</ul> +</dd> + +<dt><tt>|</tt></dt> + +<dd>The vertical-line is special except when used in a bracket expression (see <a href="#tag_09_03_05">RE Bracket Expression</a>). +A vertical-line appearing first or last in an ERE, or immediately following a vertical-line or a left-parenthesis, or immediately +preceding a right-parenthesis, produces undefined results.</dd> + +<dt><tt>^</tt></dt> + +<dd>The circumflex shall be special when used as: + +<ul> +<li> +<p>An anchor (see <a href="#tag_09_04_09">ERE Expression Anchoring</a>)</p> +</li> + +<li> +<p>The first character of a bracket expression (see <a href="#tag_09_03_05">RE Bracket Expression</a>)</p> +</li> +</ul> +</dd> + +<dt><tt>$</tt></dt> + +<dd>The dollar sign shall be special when used as an anchor.</dd> +</dl> + +<h4><a name="tag_09_04_04"> 9.4.4 </a>Periods in EREs</h4> + +<p>A period ( <tt>'.'</tt> ), when used outside a bracket expression, is an ERE that shall match any character in the supported +character set except NUL.</p> + +<h4><a name="tag_09_04_05"> 9.4.5 </a>ERE Bracket Expression</h4> + +<p>The rules for ERE Bracket Expressions are the same as for Basic Regular Expressions; see <a href="#tag_09_03_05">RE Bracket +Expression</a>.</p> + +<h4><a name="tag_09_04_06"> 9.4.6 </a>EREs Matching Multiple Characters</h4> + +<p>The following rules shall be used to construct EREs matching multiple characters from EREs matching a single character:</p> + +<ol> +<li> +<p>A concatenation of EREs shall match the concatenation of the character sequences matched by each component of the ERE. A +concatenation of EREs enclosed in parentheses shall match whatever the concatenation without the parentheses matches. For example, +both the ERE <tt>"cd"</tt> and the ERE <tt>"(cd)"</tt> are matched by the third and fourth character of the string +<tt>"abcdefabcdef"</tt>.</p> +</li> + +<li> +<p>When an ERE matching a single character or an ERE enclosed in parentheses is followed by the special character plus-sign ( +<tt>'+'</tt> ), together with that plus-sign it shall match what one or more consecutive occurrences of the ERE would match. For +example, the ERE <tt>"b+(bc)"</tt> matches the fourth to seventh characters in the string <tt>"acabbbcde"</tt>. And, +<tt>"[ab]+"</tt> and <tt>"[ab][ab]*"</tt> are equivalent.</p> +</li> + +<li> +<p>When an ERE matching a single character or an ERE enclosed in parentheses is followed by the special character asterisk ( +<tt>'*'</tt> ), together with that asterisk it shall match what zero or more consecutive occurrences of the ERE would match. For +example, the ERE <tt>"b*c"</tt> matches the first character in the string <tt>"cabbbcde"</tt>, and the ERE <tt>"b*cd"</tt> matches +the third to seventh characters in the string <tt>"cabbbcdebbbbbbcdbc"</tt>. And, <tt>"[ab]*"</tt> and <tt>"[ab][ab]"</tt> are +equivalent when matching the string <tt>"ab"</tt>.</p> +</li> + +<li> +<p>When an ERE matching a single character or an ERE enclosed in parentheses is followed by the special character question-mark ( +<tt>'?'</tt> ), together with that question-mark it shall match what zero or one consecutive occurrences of the ERE would match. +For example, the ERE <tt>"b?c"</tt> matches the second character in the string <tt>"acabbbcde"</tt>.</p> +</li> + +<li> +<p>When an ERE matching a single character or an ERE enclosed in parentheses is followed by an interval expression of the format +<tt>"{m}"</tt>, <tt>"{m,}"</tt>, or <tt>"{m,n}"</tt>, together with that interval expression it shall match what repeated +consecutive occurrences of the ERE would match. The values of <i>m</i> and <i>n</i> are decimal integers in the range 0 <= +<i>m</i><= <i>n</i><= {RE_DUP_MAX}, where <i>m</i> specifies the exact or minimum number of occurrences and <i>n</i> +specifies the maximum number of occurrences. The expression <tt>"{m}"</tt> matches exactly <i>m</i> occurrences of the preceding +ERE, <tt>"{m,}"</tt> matches at least <i>m</i> occurrences, and <tt>"{m,n}"</tt> matches any number of occurrences between <i>m</i> +and <i>n</i>, inclusive.</p> + +<p>For example, in the string <tt>"abababccccccd"</tt> the ERE <tt>"c{3}"</tt> is matched by characters seven to nine and the ERE +<tt>"(ab){2,}"</tt> is matched by characters one to six.</p> +</li> +</ol> + +<p>The behavior of multiple adjacent duplication symbols ( <tt>'+'</tt>, <tt>'*'</tt>, <tt>'?'</tt>, and intervals) produces +undefined results.</p> + +<p>An ERE matching a single character repeated by an <tt>'*'</tt>, <tt>'?'</tt>, or an interval expression shall not match a null +expression unless this is the only match for the repetition or it is necessary to satisfy the exact or minimum number of +occurrences for the interval expression.</p> + +<h4><a name="tag_09_04_07"> 9.4.7 </a>ERE Alternation</h4> + +<p>Two EREs separated by the special character vertical-line ( <tt>'|'</tt> ) shall match a string that is matched by either. For +example, the ERE <tt>"a((bc)|d)"</tt> matches the string <tt>"abc"</tt> and the string <tt>"ad"</tt>. Single characters, or +expressions matching single characters, separated by the vertical bar and enclosed in parentheses, shall be treated as an ERE +matching a single character.</p> + +<h4><a name="tag_09_04_08"> 9.4.8 </a>ERE Precedence</h4> + +<p>The order of precedence shall be as shown in the following table:</p> + +<center> +<table border="1" cellpadding="3" align="center"> +<tr valign="top"> +<th colspan="2" align="center"> +<p class="tent"><b>ERE Precedence (from high to low)</b></p> +</th> +</tr> + +<tr valign="top"> +<td align="left"> +<p class="tent">Collation-related bracket symbols</p> +</td> +<td align="left"> +<p class="tent">[==] [::] [..]</p> +</td> +</tr> + +<tr valign="top"> +<td align="left"> +<p class="tent">Escaped characters</p> +</td> +<td align="left"> +<p class="tent">\<special character></p> +</td> +</tr> + +<tr valign="top"> +<td align="left"> +<p class="tent">Bracket expression</p> +</td> +<td align="left"> +<p class="tent">[]</p> +</td> +</tr> + +<tr valign="top"> +<td align="left"> +<p class="tent">Grouping</p> +</td> +<td align="left"> +<p class="tent">()</p> +</td> +</tr> + +<tr valign="top"> +<td align="left"> +<p class="tent">Single-character-ERE duplication</p> +</td> +<td align="left"> +<p class="tent">* + ? {m,n}</p> +</td> +</tr> + +<tr valign="top"> +<td align="left"> +<p class="tent">Concatenation</p> +</td> +<td align="left"> +<p class="tent"> </p> +</td> +</tr> + +<tr valign="top"> +<td align="left"> +<p class="tent">Anchoring</p> +</td> +<td align="left"> +<p class="tent">^ $</p> +</td> +</tr> + +<tr valign="top"> +<td align="left"> +<p class="tent">Alternation</p> +</td> +<td align="left"> +<p class="tent">|</p> +</td> +</tr> +</table> +</center> + +<p>For example, the ERE <tt>"abba|cde"</tt> matches either the string <tt>"abba"</tt> or the string <tt>"cde"</tt> (rather than the +string <tt>"abbade"</tt> or <tt>"abbcde"</tt>, because concatenation has a higher order of precedence than alternation).</p> + +<h4><a name="tag_09_04_09"> 9.4.9 </a>ERE Expression Anchoring</h4> + +<p>An ERE can be limited to matching strings that begin or end a line; this is called "anchoring". The circumflex and dollar sign +special characters shall be considered ERE anchors when used anywhere outside a bracket expression. This shall have the following +effects:</p> + +<ol> +<li> +<p>A circumflex ( <tt>'^'</tt> ) outside a bracket expression shall anchor the expression or subexpression it begins to the +beginning of a string; such an expression or subexpression can match only a sequence starting at the first character of a string. +For example, the EREs <tt>"^ab"</tt> and <tt>"(^ab)"</tt> match <tt>"ab"</tt> in the string <tt>"abcdef"</tt>, but fail to match +in the string <tt>"cdefab"</tt>, and the ERE <tt>"a^b"</tt> is valid, but can never match because the <tt>'a'</tt> prevents the +expression <tt>"^b"</tt> from matching starting at the first character.</p> +</li> + +<li> +<p>A dollar sign ( <tt>'$'</tt> ) outside a bracket expression shall anchor the expression or subexpression it ends to the end of a +string; such an expression or subexpression can match only a sequence ending at the last character of a string. For example, the +EREs <tt>"ef$"</tt> and <tt>"(ef$)"</tt> match <tt>"ef"</tt> in the string <tt>"abcdef"</tt>, but fail to match in the string +<tt>"cdefab"</tt>, and the ERE <tt>"e$f"</tt> is valid, but can never match because the <tt>'f'</tt> prevents the expression +<tt>"e$"</tt> from matching ending at the last character.</p> +</li> +</ol> + +<h3><a name="tag_09_05"> 9.5 </a>Regular Expression Grammar</h3> + +<p>Grammars describing the syntax of both basic and extended regular expressions are presented in this section. The grammar takes +precedence over the text. See the Shell and Utilities volume of IEEE Std 1003.1-2001, <a href= +"../utilities/xcu_chap01.html#tag_01_10">Section 1.10, Grammar Conventions</a>.</p> + +<h4><a name="tag_09_05_01"> 9.5.1 </a>BRE/ERE Grammar Lexical Conventions</h4> + +<p>The lexical conventions for regular expressions are as described in this section.</p> + +<p>Except as noted, the longest possible token or delimiter beginning at a given point is recognized.</p> + +<p>The following tokens are processed (in addition to those string constants shown in the grammar):</p> + +<dl compact> +<dt><b>COLL_ELEM_SINGLE</b></dt> + +<dd> +Any single-character collating element, unless it is a <b>META_CHAR</b>.</dd> + +<dt><b>COLL_ELEM_MULTI</b></dt> + +<dd> +Any multi-character collating element.</dd> + +<dt><b>BACKREF</b></dt> + +<dd>Applicable only to basic regular expressions. The character string consisting of <tt>'\'</tt> followed by a single-digit +numeral, <tt>'1'</tt> to <tt>'9'</tt>.</dd> + +<dt><b>DUP_COUNT</b></dt> + +<dd>Represents a numeric constant. It shall be an integer in the range 0 <= <b>DUP_COUNT</b> <= {RE_DUP_MAX}. This token is +only recognized when the context of the grammar requires it. At all other times, digits not preceded by <tt>'\'</tt> are treated as +<b>ORD_CHAR</b>.</dd> + +<dt><b>META_CHAR</b></dt> + +<dd>One of the characters: + +<dl compact> +<dt><tt>^</tt></dt> + +<dd>When found first in a bracket expression</dd> + +<dt><tt>-</tt></dt> + +<dd>When found anywhere but first (after an initial <tt>'^'</tt>, if any) or last in a bracket expression, or as the ending range +point in a range expression</dd> + +<dt><tt>]</tt></dt> + +<dd>When found anywhere but first (after an initial <tt>'^'</tt>, if any) in a bracket expression</dd> +</dl> +</dd> + +<dt><b>L_ANCHOR</b></dt> + +<dd>Applicable only to basic regular expressions. The character <tt>'^'</tt> when it appears as the first character of a basic +regular expression and when not <b>QUOTED_CHAR</b>. The <tt>'^'</tt> may be recognized as an anchor elsewhere; see <a href= +"#tag_09_03_08">BRE Expression Anchoring</a>.</dd> + +<dt><b>ORD_CHAR</b></dt> + +<dd>A character, other than one of the special characters in <b>SPEC_CHAR</b>.</dd> + +<dt><b>QUOTED_CHAR</b></dt> + +<dd>In a BRE, one of the character sequences: + +<blockquote> +<pre> +<tt>\^ \. \* \[ \$ \\ +</tt> +</pre> +</blockquote> + +<p>In an ERE, one of the character sequences:</p> + +<blockquote> +<pre> +<tt>\^ \. \[ \$ \( \) \| +\* \+ \? \{ \\ +</tt> +</pre> +</blockquote> +</dd> + +<dt><b>R_ANCHOR</b></dt> + +<dd>(Applicable only to basic regular expressions.) The character <tt>'$'</tt> when it appears as the last character of a basic +regular expression and when not <b>QUOTED_CHAR</b>. The <tt>'$'</tt> may be recognized as an anchor elsewhere; see <a href= +"#tag_09_03_08">BRE Expression Anchoring</a>.</dd> + +<dt><b>SPEC_CHAR</b></dt> + +<dd>For basic regular expressions, one of the following special characters: + +<dl compact> +<dt><tt>.</tt></dt> + +<dd>Anywhere outside bracket expressions</dd> + +<dt><tt>\</tt></dt> + +<dd>Anywhere outside bracket expressions</dd> + +<dt><tt>[</tt></dt> + +<dd>Anywhere outside bracket expressions</dd> + +<dt><tt>^</tt></dt> + +<dd>When used as an anchor (see <a href="#tag_09_03_08">BRE Expression Anchoring</a>) or when first in a bracket expression</dd> + +<dt><tt>$</tt></dt> + +<dd>When used as an anchor</dd> + +<dt><tt>*</tt></dt> + +<dd>Anywhere except first in an entire RE, anywhere in a bracket expression, directly following <tt>"\("</tt>, directly following +an anchoring <tt>'^'</tt></dd> +</dl> + +<p>For extended regular expressions, shall be one of the following special characters found anywhere outside bracket +expressions:</p> + +<blockquote> +<pre> +<tt>^ . [ $ ( ) | +* + ? { \ +</tt> +</pre> +</blockquote> + +<p>(The close-parenthesis shall be considered special in this context only if matched with a preceding open-parenthesis.)</p> +</dd> +</dl> + +<h4><a name="tag_09_05_02"> 9.5.2 </a>RE and Bracket Expression Grammar</h4> + +<p>This section presents the grammar for basic regular expressions, including the bracket expression grammar that is common to both +BREs and EREs.</p> + +<pre> +<tt>%token ORD_CHAR QUOTED_CHAR DUP_COUNT +<br> +%token BACKREF L_ANCHOR R_ANCHOR +<br> +%token Back_open_paren Back_close_paren +/* '\(' '\)' */ +<br> +%token Back_open_brace Back_close_brace +/* '\{' '\}' */ +<br> +/* The following tokens are for the Bracket Expression + grammar common to both REs and EREs. */ +<br> +%token COLL_ELEM_SINGLE COLL_ELEM_MULTI META_CHAR +<br> +%token Open_equal Equal_close Open_dot Dot_close Open_colon Colon_close +/* '[=' '=]' '[.' '.]' '[:' ':]' */ +<br> +%token class_name +/* class_name is a keyword to the LC_CTYPE locale category */ +/* (representing a character class) in the current locale */ +/* and is only recognized between [: and :] */ +<br> +%start basic_reg_exp +%% +<br> +/* -------------------------------------------- + Basic Regular Expression + -------------------------------------------- +*/ +basic_reg_exp : RE_expression + | L_ANCHOR + | R_ANCHOR + | L_ANCHOR R_ANCHOR + | L_ANCHOR RE_expression + | RE_expression R_ANCHOR + | L_ANCHOR RE_expression R_ANCHOR + ; +RE_expression : simple_RE + | RE_expression simple_RE + ; +simple_RE : nondupl_RE + | nondupl_RE RE_dupl_symbol + ; +nondupl_RE : one_char_or_coll_elem_RE + | Back_open_paren RE_expression Back_close_paren + | BACKREF + ; +one_char_or_coll_elem_RE : ORD_CHAR + | QUOTED_CHAR + | '.' + | bracket_expression + ; +RE_dupl_symbol : '*' + | Back_open_brace DUP_COUNT Back_close_brace + | Back_open_brace DUP_COUNT ',' Back_close_brace + | Back_open_brace DUP_COUNT ',' DUP_COUNT Back_close_brace + ; +<br> +/* -------------------------------------------- + Bracket Expression + ------------------------------------------- +*/ +bracket_expression : '[' matching_list ']' + | '[' nonmatching_list ']' + ; +matching_list : bracket_list + ; +nonmatching_list : '^' bracket_list + ; +bracket_list : follow_list + | follow_list '-' + ; +follow_list : expression_term + | follow_list expression_term + ; +expression_term : single_expression + | range_expression + ; +single_expression : end_range + | character_class + | equivalence_class + ; +range_expression : start_range end_range + | start_range '-' + ; +start_range : end_range '-' + ; +end_range : COLL_ELEM_SINGLE + | collating_symbol + ; +collating_symbol : Open_dot COLL_ELEM_SINGLE Dot_close + | Open_dot COLL_ELEM_MULTI Dot_close + | Open_dot META_CHAR Dot_close + ; +equivalence_class : Open_equal COLL_ELEM_SINGLE Equal_close + | Open_equal COLL_ELEM_MULTI Equal_close + ; +character_class : Open_colon class_name Colon_close + ; +</tt> +</pre> + +<p>The BRE grammar does not permit <b>L_ANCHOR</b> or <b>R_ANCHOR</b> inside <tt>"\("</tt> and <tt>"\)"</tt> (which implies that +<tt>'^'</tt> and <tt>'$'</tt> are ordinary characters). This reflects the semantic limits on the application, as noted in <a href= +"#tag_09_03_08">BRE Expression Anchoring</a>. Implementations are permitted to extend the language to interpret <tt>'^'</tt> and +<tt>'$'</tt> as anchors in these locations, and as such, conforming applications cannot use unescaped <tt>'^'</tt> and <tt>'$'</tt> +in positions inside <tt>"\("</tt> and <tt>"\)"</tt> that might be interpreted as anchors.</p> + +<h4><a name="tag_09_05_03"> 9.5.3 </a>ERE Grammar</h4> + +<p>This section presents the grammar for extended regular expressions, excluding the bracket expression grammar. <basefont size= +"2"></p> + +<dl> +<dt><b>Note:</b></dt> + +<dd>The bracket expression grammar and the associated <b>%token</b> lines are identical between BREs and EREs. It has been omitted +from the ERE section to avoid unnecessary editorial duplication.</dd> +</dl> + +<basefont size="3"> + +<pre> +<tt>%token ORD_CHAR QUOTED_CHAR DUP_COUNT +%start extended_reg_exp +%% +<br> +/* -------------------------------------------- + Extended Regular Expression + -------------------------------------------- +*/ +extended_reg_exp : ERE_branch + | extended_reg_exp '|' ERE_branch + ; +ERE_branch : ERE_expression + | ERE_branch ERE_expression + ; +ERE_expression : one_char_or_coll_elem_ERE + | '^' + | '$' + | '(' extended_reg_exp ')' + | ERE_expression ERE_dupl_symbol + ; +one_char_or_coll_elem_ERE : ORD_CHAR + | QUOTED_CHAR + | '.' + | bracket_expression + ; +ERE_dupl_symbol : '*' + | '+' + | '?' + | '{' DUP_COUNT '}' + | '{' DUP_COUNT ',' '}' + | '{' DUP_COUNT ',' DUP_COUNT '}' + ; +</tt> +</pre> + +<p>The ERE grammar does not permit several constructs that previous sections specify as having undefined results:</p> + +<ul> +<li> +<p><b>ORD_CHAR</b> preceded by <tt>'\'</tt></p> +</li> + +<li> +<p>One or more <i>ERE_dupl_symbol</i>s appearing first in an ERE, or immediately following <tt>'|'</tt>, <tt>'^'</tt>, or +<tt>'('</tt></p> +</li> + +<li> +<p><tt>'{'</tt> not part of a valid <i>ERE_dupl_symbol</i></p> +</li> + +<li> +<p><tt>'|'</tt> appearing first or last in an ERE, or immediately following <tt>'|'</tt> or <tt>'('</tt>, or immediately preceding +<tt>')'</tt></p> +</li> +</ul> + +<p>Implementations are permitted to extend the language to allow these. Conforming applications cannot use such constructs.</p> + +<hr size="2" noshade> +<center><font size="2"><!--footer start--> +UNIX ® is a registered Trademark of The Open Group.<br> +POSIX ® is a registered Trademark of The IEEE.<br> +[ <a href="../mindex.html">Main Index</a> | <a href="../basedefs/contents.html">XBD</a> | <a href= +"../utilities/contents.html">XCU</a> | <a href="../functions/contents.html">XSH</a> | <a href="../xrat/contents.html">XRAT</a> +]</font></center> + +<!--footer end--> +<hr size="2" noshade> +</body> +</html> + diff --git a/ps01_grep/ps01_bnewbold.txt b/ps01_grep/ps01_bnewbold.txt new file mode 100644 index 0000000..52979e7 --- /dev/null +++ b/ps01_grep/ps01_bnewbold.txt @@ -0,0 +1,345 @@ +;;; 6.945 Problem Set #1 +;;; 02/10/2009 +;;; Bryan Newbold <bnewbold@mit.edu> + + +Problem 1.1: Warmup +--------------- + +(define (r:+ pattern) + (r:repeat 1 #f pattern)) + +(define (r:* pattern) + (r:repeat 0 #f pattern)) + +(pp (r:grep (r:seq (r:quote "dog") + (r:+ (r:quote "cat"))) + "tests.txt")) + +;("[09]. catdogcat" "[11]. dogdogcatdogdog" "[13]. acatdogdogcats") + +(pp (r:grep (r:seq (r:quote "cat") + (r:+ (r:char-from (string->char-set "ogd"))) + (r:quote "cat")) + "tests.txt")) + +;("[09]. catdogcat" "[13]. acatdogdogcats") + +(pp (r:grep (r:seq " " + (r:+ (r:quote "cat")) + (r:eol)) + "tests.txt")) +; #f + +(pp (r:grep (r:* (r:quote "something that isn't there")) + "tests.txt")) +; matches all + +(pp (r:grep (r:repeat 2 3 (r:* (r:char-from (string->char-set "tca")))) + "tests.txt")) +; matches all: nothing repeated several times + +(pp (r:grep (r:seq (r:quote "dog") + (r:+ (r:quote "cat")) + (r:quote "dog")) + "tests.txt")) + +Problem 1.2: The Proposals +--------------- + +a) This would be the same as asking Louis Reasoner to explain why his +suggestion is wrong; the problem lies within r:repeat, so re-calling r:repeat +does not sidestep the problem. + +b) Bonnie's method is computationally more efficient because checking up to the +minimum number of repetitions is only executed once; after that successive +checks are made for additional repetitions. From a straight-forward +perspective, Bonnie's approach also uses significantly fewer characters of +final regex code: this could potentially (but unlikely) become important if +large expressions were searched for with a large range of possible repetitions. +Most importantly, Bonnie's method is more readable and brings the +implementation more inline with the underlying philosophy of regular +expressions (in my opinion). + +c) Ben's solution is the one that an experienced regex-er would think of first +and is the closest conceptual analog to r:repeat in the actual specification. +The code is compact and should be the most readable and explicit, though we'll +see what kind of syntaxual issues crop up... + +d) + +(define (r:repeat min max expr) + (if (not (exact-nonnegative-integer? min)) + (error "Min must be non-negative integer:" min)) + (if max + (begin + (if (not (exact-nonnegative-integer? max)) + (error "Max must be non-negative integer (or null):" max)) + (if (not (<= min max)) + (error "Min not less than max:" min max)))) + (cond + ((not max) (r:seq expr (string-append "\\{" (number->string min) ",\\}"))) + ((= max min) (r:seq expr (string-append "\\{" (number->string min) "\\}"))) + (else (r:seq expr (string-append "\\{" + (number->string min) + "," + (number->string max) + "\\}"))))) +#| tests: +(pp (r:grep (r:repeat 2 3 (r:quote "cat")) + "tests.txt")) +(pp (r:grep (r:repeat 2 #f (r:quote "cat")) + "tests.txt")) +(pp (r:grep (r:repeat 1 1 (r:quote "cat")) + "tests.txt")) +(pp (r:grep (r:repeat 4 3 (r:quote "cat")) + "tests.txt")) +(r:repeat 2 3 (r:quote "cat")) +|# + + +Problem 1.3: Optimization +-------------- +I simply removed the default parentheses around all expressions and implemented +a new r:subexp wrapper which puts parentheses around expressions. Then I had +r:alt and r:repeat wrap their expressions in r:subexp. This results in /some/ +redundant nesting but is a huge reduction. + +(define (r:subexp . exprs) + (string-append "\\(" (apply string-append exprs) "\\)")) + +(define (r:seq . exprs) + (apply string-append exprs)) + +(r:seq " " (r:subexp "sdf") "dddd") + +(define (r:repeat min max expr) + (if (not (exact-nonnegative-integer? min)) + (error "Min must be non-negative integer:" min)) + (if max + (begin + (if (not (exact-nonnegative-integer? max)) + (error "Max must be non-negative integer (or null):" max)) + (if (not (<= min max)) + (error "Min not less than max:" min max)))) + (cond + ((not max) (r:seq (r:subexp expr) + (string-append "\\{" (number->string min) ",\\}"))) + ((= max min) (r:seq (r:subexp expr) + (string-append "\\{" (number->string min) "\\}"))) + (else (r:seq (r:subexp expr) + (string-append "\\{" + (number->string min) + "," + (number->string max) + "\\}"))))) + + +Problem 1.4: Back-references +-------------- +I only implemented the most crude form of back-expressions. A better +implementation would include a method of naming subexpressions and refering +back to them by name: the implementation would do the appropriate counting to +find the correct backreference number for a given named subexpression. + +(define (r:backref n) + (string-append "\\" (number->string n))) + +Problem 1.5: Ugh! +------------- +a) Differences between BREs and EREs (I referenced the table at +http://www.greenend.org.uk/rjk/2002/06/regexp.html): + +* Some characters do not need escaping backslashes: mutiple matching brackets + ('\{' becomes '{'), subexpressions ('\(' becomes '(') +* The alternation operator '|' is added +* The '+' operator is added meaning one or more matches +* The '?' operator is added meaning one or zero matches + +b) The basic method I chose is to have each r: procedure return a list +containing two lambda procedures: the first will return the regular grep +version of the expression, while the second will return the extended grep +version. The r:grep procedure selects the first lambda and r:egrep selects the +second, and this decision cascades down recursively to the atomic expressions. + +My implementation is crude; the r-grep: and r-egrep: procedures should be +internal to the primary r: procedures (using let), and the whole process could +be implemented as a funtion. + +c) (not, uh, fully tested...) + +; the following are only the expressions which had to be rewritten... + +(define (r-grep:seq . exprs) + (let ((choose-grep (lambda (x) + (cond + ((string? x) x) + ((list? x) (car x)))))) + (apply string-append (map choose-grep exprs)))) + +(r-grep:seq "a" "b" '("grep" "egrep")) +; abgrep + +(define (r-egrep:seq . exprs) + (let ((choose-egrep (lambda (x) + (cond + ((string? x) x) + ((list? x) (cadr x)))))) + (apply string-append (map choose-egrep exprs)))) + +#| test: +(r-grep:seq "a" "b" '("grep" "egrep")) +; "abgrep" +(r-egrep:seq "a" "b" '("grep" "egrep")) +; "abegrep" +|# + +(define (r:seq . exprs) + (cons (lambda () (apply r-grep:seq exprs)) + (cons (lambda () (apply r-egrep:seq exprs)) '()))) + +#| test: +(r:seq "a" "b" '("grep" "egrep")) +;Value: (#[compound-procedure 14] #[compound-procedure 15]) +((car (r:seq "a" "b" '("grep" "egrep")))) +; "abgrep" +((cadr (r:seq "a" "b" '("grep" "egrep")))) +; "abegrep" +|# + +(define (r-egrep:alt . exprs) + (let ((choose-egrep (lambda (x) + (cond + ((string? x) x) + ((list? x) ((cadr x))))))) + (if (pair? exprs) + (apply r-egrep:seq + (cons (choose-egrep (car exprs)) + (append-map (lambda (expr) + (list "|" (choose-egrep expr))) + (cdr exprs)))) + (r-egrep:seq)))) + +(define (r-grep:alt . exprs) + (let ((choose-grep (lambda (x) + (cond + ((string? x) x) + ((list? x) ((car x))))))) + (if (pair? exprs) + (apply r-grep:seq + (cons (choose-grep (car exprs)) + (append-map (lambda (expr) + (list "\\|" (choose-grep expr))) + (cdr exprs)))) + (r-grep:seq)))) + +(define (r:alt . exprs) + (cons (lambda () (apply r-grep:alt exprs)) + (cons (lambda () (apply r-egrep:alt exprs)) '()))) + +#| test: +(r-egrep:alt "a" (r:dot) "c") +;Value: "a|.|c" +(r:alt "asdf" (r:dot)) +;Value: (#[compound-procedure 16] #[compound-procedure 17]) +((car (r:alt "asdf" (r:dot)))) +;Value: "asdf\\|." +((cadr (r:alt "asdf" (r:dot)))) +;Value: "asdf|." +|# + +(define (r-grep:repeat min max expr) + (let ((choose-grep (lambda (x) + (cond + ((string? x) x) + ((list? x) ((car x))))))) + (if (not (exact-nonnegative-integer? min)) + (error "Min must be non-negative integer:" min)) + (if max + (begin + (if (not (exact-nonnegative-integer? max)) + (error "Max must be non-negative integer (or null):" max)) + (if (not (<= min max)) + (error "Min not less than max:" min max)))) + (cond + ((not max) (r-grep:seq (r-grep:subexp (choose-grep expr)) + (string-append "\\{" (number->string min) ",\\}"))) + ((= max min) (r-grep:seq (r-grep:subexp (choose-grep expr)) + (string-append "\\{" (number->string min) "\\}"))) + (else (r-grep:seq (r-grep:subexp (choose-grep expr)) + (string-append "\\{" + (number->string min) + "," + (number->string max) + "\\}")))))) + +(define (r-egrep:repeat min max expr) + (let ((choose-egrep (lambda (x) + (cond + ((string? x) x) + ((list? x) ((cadr x))))))) + (if (not (exact-nonnegative-integer? min)) + (error "Min must be non-negative integer:" min)) + (if max + (begin + (if (not (exact-nonnegative-integer? max)) + (error "Max must be non-negative integer (or null):" max)) + (if (not (<= min max)) + (error "Min not less than max:" min max)))) + (cond + ((not max) (r-egrep:seq (r-egrep:subexp (choose-egrep expr)) + (string-append "{" (number->string min) ",}"))) + ((= max min) (r-egrep:seq (r-egrep:subexp (choose-egrep expr)) + (string-append "{" (number->string min) "}"))) + (else (r-egrep:seq (r-egrep:subexp (choose-egrep expr)) + (string-append "{" + (number->string min) + "," + (number->string max) + "}")))))) + +(define (r:repeat min max expr) + (cons (lambda () (r-grep:repeat min max expr)) + (cons (lambda () (r-egrep:repeat min max expr)) '()))) + +(define (r-grep:subexp . exprs) + (string-append "\\(" (apply string-append + (map (lambda (x) + (cond + ((string? x) x) + ((pair? x) ((car x))))) + exprs)) "\\)")) + +(define (r-egrep:subexp . exprs) + (string-append "(" (apply string-append + (map (lambda (x) + (cond + ((string? x) x) + ((pair? x) ((cadr x))))) + exprs)) ")")) + +(define (r:subexp . exprs) + (cons (lambda () (apply r-grep:subexp exprs)) + (cons (lambda () (apply r-egrep:subexp exprs)) '()))) + + +#| test: +(r:subexp "a" "b" (r:dot)) +;Value: (#[compound-procedure 25] #[compound-procedure 26]) +((cadr (r:subexp "a" "b" (r:dot)))) +;Value: "(ab.)" +((car (r:subexp "a" "b" (r:dot)))) +;Value: "\\(ab.\\)" +|# + +(define (r:grep expr filename) + (r:grep-like "grep" '() ((car expr)) filename)) + +(define (r:egrep expr filename) + (if (eq? microcode-id/operating-system 'nt) + (r:grep-like "grep" '("-E") ((cadr expr)) filename) + (r:grep-like "egrep" '() ((cadr expr)) filename))) + +(pp (r:grep (r:seq (r:repeat 2 3 (r:quote "cat")) (r:+ (r:quote "dog"))) + "tests.txt")) + diff --git a/ps01_grep/ps01_work.scm b/ps01_grep/ps01_work.scm new file mode 100644 index 0000000..7338f20 --- /dev/null +++ b/ps01_grep/ps01_work.scm @@ -0,0 +1,331 @@ +;;; 6.945 Problem Set #1 - workfile +;;; 02/10/2009 +;;; Bryan Newbold <bnewbold@mit.edu> + +(load "regexp.scm") + + +(pp (r:grep (r:seq " " + (r:repeat 3 5 (r:alt (r:quote "cat") (r:quote "dog"))) + (r:eol)) + "tests.txt")) + +;------------------------------------------------------------------- +; 1.1: define r:* and r:+ + +(define (r:+ pattern) + (r:repeat 1 #f pattern)) + +(define (r:* pattern) + (r:repeat 0 #f pattern)) + +(pp (r:grep (r:seq (r:quote "dog") + (r:+ (r:quote "cat"))) + "tests.txt")) + +;("[09]. catdogcat" "[11]. dogdogcatdogdog" "[13]. acatdogdogcats") + +(pp (r:grep (r:seq (r:quote "cat") + (r:+ (r:char-from (string->char-set "ogd"))) + (r:quote "cat")) + "tests.txt")) + +;("[09]. catdogcat" "[13]. acatdogdogcats") + +(pp (r:grep (r:seq " " + (r:+ (r:quote "cat")) + (r:eol)) + "tests.txt")) +; #f + +(pp (r:grep (r:* (r:quote "something that isn't there")) + "tests.txt")) +; matches all + +(pp (r:grep (r:repeat 2 3 (r:* (r:char-from (string->char-set "tca")))) + "tests.txt")) +; matches all: nothing repeated several times + +(pp (r:grep (r:seq (r:quote "dog") + (r:+ (r:quote "cat")) + (r:quote "dog")) + "tests.txt")) + +;("[11]. dogdogcatdogdog") + +(pp (r:grep (r:seq (r:quote "dog") + (r:* (r:quote "cat")) + (r:quote "dog")) + "tests.txt")) +#| +("[10]. catcatdogdog" "[11]. dogdogcatdogdog" + "[12]. catcatcatdogdogdog" + "[13]. acatdogdogcats" + "[14]. ifacatdogdogs" + "[15]. acatdogdogsme") +|# + +;------------------------------------------------------------------------- +; 1.2 + +(define (r:repeat min max expr) + (if (not (exact-nonnegative-integer? min)) + (error "Min must be non-negative integer:" min)) + (if max + (begin + (if (not (exact-nonnegative-integer? max)) + (error "Max must be non-negative integer (or null):" max)) + (if (not (<= min max)) + (error "Min not less than max:" min max)))) + (cond + ((not max) (r:seq expr (string-append "\\{" (number->string min) ",\\}"))) + ((= max min) (r:seq expr (string-append "\\{" (number->string min) "\\}"))) + (else (r:seq expr (string-append "\\{" + (number->string min) + "," + (number->string max) + "\\}"))))) +#| tests: +(pp (r:grep (r:repeat 2 3 (r:quote "cat")) + "tests.txt")) +(pp (r:grep (r:repeat 2 #f (r:quote "cat")) + "tests.txt")) +(pp (r:grep (r:repeat 1 1 (r:quote "cat")) + "tests.txt")) +(pp (r:grep (r:repeat 4 3 (r:quote "cat")) + "tests.txt")) +(r:repeat 2 3 (r:quote "cat")) +|# + +;--------------------------------------------------------------------------- +; 1.3 + +(define (r:subexp . exprs) + (string-append "\\(" (apply string-append exprs) "\\)")) + +(define (r:seq . exprs) + (apply string-append exprs)) + +(r:seq " " (r:subexp "sdf") "dddd") + +(define (r:repeat min max expr) + (if (not (exact-nonnegative-integer? min)) + (error "Min must be non-negative integer:" min)) + (if max + (begin + (if (not (exact-nonnegative-integer? max)) + (error "Max must be non-negative integer (or null):" max)) + (if (not (<= min max)) + (error "Min not less than max:" min max)))) + (cond + ((not max) (r:seq (r:subexp expr) + (string-append "\\{" (number->string min) ",\\}"))) + ((= max min) (r:seq (r:subexp expr) + (string-append "\\{" (number->string min) "\\}"))) + (else (r:seq (r:subexp expr) + (string-append "\\{" + (number->string min) + "," + (number->string max) + "\\}"))))) + +(define (r:alt . exprs) + (if (pair? exprs) + (r:subexp (apply r:seq + (cons (car exprs) + (append-map (lambda (expr) + (list "\\|" expr)) + (cdr exprs))))) + (r:seq))) + +(r:alt "first thing" "second thing") + +;---------------------------------------------------------------------------- +; 1.4 + +(define (r:backref n) + (string-append "\\" (number->string n))) + +;---------------------------------------------------------------------------- +; 1.5 + +; the following are only the expressions which had to be rewritten... + +(define (r-grep:seq . exprs) + (let ((choose-grep (lambda (x) + (cond + ((string? x) x) + ((list? x) (car x)))))) + (apply string-append (map choose-grep exprs)))) + +(r-grep:seq "a" "b" '("grep" "egrep")) +; abgrep + +(define (r-egrep:seq . exprs) + (let ((choose-egrep (lambda (x) + (cond + ((string? x) x) + ((list? x) (cadr x)))))) + (apply string-append (map choose-egrep exprs)))) + +#| test: +(r-grep:seq "a" "b" '("grep" "egrep")) +; "abgrep" +(r-egrep:seq "a" "b" '("grep" "egrep")) +; "abegrep" +|# + +(define (r:seq . exprs) + (cons (lambda () (apply r-grep:seq exprs)) + (cons (lambda () (apply r-egrep:seq exprs)) '()))) + +#| test: +(r:seq "a" "b" '("grep" "egrep")) +;Value: (#[compound-procedure 14] #[compound-procedure 15]) +((car (r:seq "a" "b" '("grep" "egrep")))) +; "abgrep" +((cadr (r:seq "a" "b" '("grep" "egrep")))) +; "abegrep" +|# + +(define (r-egrep:alt . exprs) + (let ((choose-egrep (lambda (x) + (cond + ((string? x) x) + ((list? x) ((cadr x))))))) + (if (pair? exprs) + (apply r-egrep:seq + (cons (choose-egrep (car exprs)) + (append-map (lambda (expr) + (list "|" (choose-egrep expr))) + (cdr exprs)))) + (r-egrep:seq)))) + +(define (r-grep:alt . exprs) + (let ((choose-grep (lambda (x) + (cond + ((string? x) x) + ((list? x) ((car x))))))) + (if (pair? exprs) + (apply r-grep:seq + (cons (choose-grep (car exprs)) + (append-map (lambda (expr) + (list "\\|" (choose-grep expr))) + (cdr exprs)))) + (r-grep:seq)))) + +(define (r:alt . exprs) + (cons (lambda () (apply r-grep:alt exprs)) + (cons (lambda () (apply r-egrep:alt exprs)) '()))) + +#| test: +(r-egrep:alt "a" (r:dot) "c") +;Value: "a|.|c" +(r:alt "asdf" (r:dot)) +;Value: (#[compound-procedure 16] #[compound-procedure 17]) +((car (r:alt "asdf" (r:dot)))) +;Value: "asdf\\|." +((cadr (r:alt "asdf" (r:dot)))) +;Value: "asdf|." +|# + +(define (r-grep:repeat min max expr) + (let ((choose-grep (lambda (x) + (cond + ((string? x) x) + ((list? x) ((car x))))))) + (if (not (exact-nonnegative-integer? min)) + (error "Min must be non-negative integer:" min)) + (if max + (begin + (if (not (exact-nonnegative-integer? max)) + (error "Max must be non-negative integer (or null):" max)) + (if (not (<= min max)) + (error "Min not less than max:" min max)))) + (cond + ((not max) (r-grep:seq (r-grep:subexp (choose-grep expr)) + (string-append "\\{" (number->string min) ",\\}"))) + ((= max min) (r-grep:seq (r-grep:subexp (choose-grep expr)) + (string-append "\\{" (number->string min) "\\}"))) + (else (r-grep:seq (r-grep:subexp (choose-grep expr)) + (string-append "\\{" + (number->string min) + "," + (number->string max) + "\\}")))))) + + +(define (r-egrep:repeat min max expr) + (let ((choose-egrep (lambda (x) + (cond + ((string? x) x) + ((list? x) ((cadr x))))))) + (if (not (exact-nonnegative-integer? min)) + (error "Min must be non-negative integer:" min)) + (if max + (begin + (if (not (exact-nonnegative-integer? max)) + (error "Max must be non-negative integer (or null):" max)) + (if (not (<= min max)) + (error "Min not less than max:" min max)))) + (cond + ((not max) (r-egrep:seq (r-egrep:subexp (choose-egrep expr)) + (string-append "{" (number->string min) ",}"))) + ((= max min) (r-egrep:seq (r-egrep:subexp (choose-egrep expr)) + (string-append "{" (number->string min) "}"))) + (else (r-egrep:seq (r-egrep:subexp (choose-egrep expr)) + (string-append "{" + (number->string min) + "," + (number->string max) + "}")))))) + +(define (r:repeat min max expr) + (cons (lambda () (r-grep:repeat min max expr)) + (cons (lambda () (r-egrep:repeat min max expr)) '()))) + +(define (r-grep:subexp . exprs) + (string-append "\\(" (apply string-append + (map (lambda (x) + (cond + ((string? x) x) + ((pair? x) ((car x))))) + exprs)) "\\)")) + +(define (r-egrep:subexp . exprs) + (string-append "(" (apply string-append + (map (lambda (x) + (cond + ((string? x) x) + ((pair? x) ((cadr x))))) + exprs)) ")")) + +(define (r:subexp . exprs) + (cons (lambda () (apply r-grep:subexp exprs)) + (cons (lambda () (apply r-egrep:subexp exprs)) '()))) + +#| test: +(r:subexp "a" "b" (r:dot)) +;Value: (#[compound-procedure 25] #[compound-procedure 26]) +((cadr (r:subexp "a" "b" (r:dot)))) +;Value: "(ab.)" +((car (r:subexp "a" "b" (r:dot)))) +;Value: "\\(ab.\\)" +|# + +(define (r:grep expr filename) + (r:grep-like "grep" '() ((car expr)) filename)) + +(define (r:egrep expr filename) + (if (eq? microcode-id/operating-system 'nt) + (r:grep-like "grep" '("-E") ((cadr expr)) filename) + (r:grep-like "egrep" '() ((cadr expr)) filename))) + +(pp (r:grep (r:seq (r:repeat 2 3 (r:quote "cat")) (r:+ (r:quote "dog"))) + "tests.txt")) + + + + + + |