Regular Expressions

From 2b7db5b23df55e3a1ac0639494bea750d0797c9d Mon Sep 17 00:00:00 2001 From: bnewbold Date: Wed, 11 Feb 2009 14:24:16 -0500 Subject: first pset work --- from_athena.sh | 2 +- ps01_grep/posix_standard.html | 1109 +++++++++++++++++++++++++++++++++++++++++ ps01_grep/ps01_bnewbold.txt | 345 +++++++++++++ ps01_grep/ps01_work.scm | 331 ++++++++++++ 4 files changed, 1786 insertions(+), 1 deletion(-) create mode 100644 ps01_grep/posix_standard.html create mode 100644 ps01_grep/ps01_bnewbold.txt create mode 100644 ps01_grep/ps01_work.scm diff --git a/from_athena.sh b/from_athena.sh index 213503d..776987f 100755 --- a/from_athena.sh +++ b/from_athena.sh @@ -1,3 +1,3 @@ #!/bin/sh -rsync -arv bnewbold@linux.mit.edu:4.602/ps* . +rsync -arv bnewbold@linux.mit.edu:6.945/ps* . diff --git a/ps01_grep/posix_standard.html b/ps01_grep/posix_standard.html new file mode 100644 index 0000000..6c1aeb2 --- /dev/null +++ b/ps01_grep/posix_standard.html @@ -0,0 +1,1109 @@ + + + + + + + + +Regular Expressions + + + + +The Open Group Base Specifications Issue 6
+IEEE Std 1003.1, 2004 Edition
+Copyright © 2001-2004 The IEEE and The Open Group, All Rights reserved. + + +

9. Regular Expressions

+ +

Regular Expressions (REs) provide a mechanism to select specific strings from a set of character strings.

+ +

Regular expressions are a context-independent syntax that can represent a wide variety of character sets and character set +orderings, where these character sets are interpreted according to the current locale. While many regular expressions can be +interpreted differently depending on the current locale, many features, such as character class expressions, provide for contextual +invariance across locales.

+ +

The Basic Regular Expression (BRE) notation and construction rules in Basic Regular Expressions shall +apply to most utilities supporting regular expressions. Some utilities, instead, support the Extended Regular Expressions (ERE) +described in Extended Regular Expressions; any exceptions for both cases are noted in the descriptions of +the specific utilities using regular expressions. Both BREs and EREs are supported by the Regular Expression Matching interface in +the System Interfaces volume of IEEE Std 1003.1-2001 under regcomp(), regexec(), and related functions.

+ +

9.1 Regular Expression Definitions

+ +

For the purposes of this section, the following definitions shall apply:

+ +

entire regular expression

+ +

The concatenated set of one or more BREs or EREs that make up the pattern specified for string selection.

+ +

matched

+ +

A sequence of zero or more characters shall be said to be matched by a BRE or ERE when the characters in the sequence correspond +to a sequence of characters defined by the pattern.

+ +

Matching shall be based on the bit pattern used for encoding the character, not on the graphic representation of the character. +This means that if a character set contains two or more encodings for a graphic symbol, or if the strings searched contain text +encoded in more than one codeset, no attempt is made to search for any other representation of the encoded symbol. If that is +required, the user can specify equivalence classes containing all variations of the desired graphic symbol.

+ +

The search for a matching sequence starts at the beginning of a string and stops when the first sequence matching the expression +is found, where "first" is defined to mean "begins earliest in the string". If the pattern permits a variable number of +matching characters and thus there is more than one such sequence starting at that point, the longest such sequence is matched. For +example, the BRE "bb*" matches the second to fourth characters of the string "abbbc", and the ERE +"(wee|week)(knights|night)" matches all ten characters of the string "weeknights".

+ +

Consistent with the whole match being the longest of the leftmost matches, each subpattern, from left to right, shall match the +longest possible string. For this purpose, a null string shall be considered to be longer than no match at all. For example, +matching the BRE "$.*$.*" against "abcdef", the subexpression "(\1)" is "abcdef", and +matching the BRE "$a*$*" against "bc", the subexpression "(\1)" is the null string.

+ +

When a multi-character collating element in a bracket expression (see RE Bracket Expression) is +involved, the longest sequence shall be measured in characters consumed from the string to be matched; that is, the collating +element counts not as one element, but as the number of characters it matches.

+ +

BRE (ERE) matching a single character

+ +

A BRE or ERE that shall match either a single character or a single collating element.

+ +

Only a BRE or ERE of this type that includes a bracket expression (see RE Bracket Expression) can +match a collating element.

+ +

BRE (ERE) matching multiple characters

+ +

A BRE or ERE that shall match a concatenation of single characters or collating elements.

+ +

Such a BRE or ERE is made up from a BRE (ERE) matching a single character and BRE (ERE) special characters.

+ +

invalid

+ +

This section uses the term "invalid" for certain constructs or conditions. Invalid REs shall cause the utility or function +using the RE to generate an error condition. When invalid is not used, violations of the specified syntax or semantics for REs +produce undefined results: this may entail an error, enabling an extended syntax for that RE, or using the construct in error as +literal characters to be matched. For example, the BRE construct "\{1,2,3\}" does not comply with the grammar. A +conforming application cannot rely on it producing an error nor matching the literal characters "\{1,2,3\}".

+ +

9.2 Regular Expression General Requirements

+ +

The requirements in this section shall apply to both basic and extended regular expressions.

+ +

The use of regular expressions is generally associated with text processing. REs (BREs and EREs) operate on text strings; that +is, zero or more characters followed by an end-of-string delimiter (typically NUL). Some utilities employing regular expressions +limit the processing to lines; that is, zero or more characters followed by a <newline>. In the regular expression processing +described in IEEE Std 1003.1-2001, the <newline> is regarded as an ordinary character and both a period and a +non-matching list can match one. The Shell and Utilities volume of IEEE Std 1003.1-2001 specifies within the individual +descriptions of those standard utilities employing regular expressions whether they permit matching of <newline>s; if not +stated otherwise, the use of literal <newline>s or any escape sequence equivalent produces undefined results. Those utilities +(like grep) that do not allow <newline>s to match are responsible for eliminating +any <newline> from strings before matching against the RE. The regcomp() +function in the System Interfaces volume of IEEE Std 1003.1-2001, however, can provide support for such processing +without violating the rules of this section.

+ +

The interfaces specified in IEEE Std 1003.1-2001 do not permit the inclusion of a NUL character in an RE or in the +string to be matched. If during the operation of a standard utility a NUL is included in the text designated to be matched, that +NUL may designate the end of the text string for the purposes of matching.

+ +

When a standard utility or function that uses regular expressions specifies that pattern matching shall be performed without +regard to the case (uppercase or lowercase) of either data or patterns, then when each character in the string is matched against +the pattern, not only the character, but also its case counterpart (if any), shall be matched. This definition of case-insensitive +processing is intended to allow matching of multi-character collating elements as well as characters, as each character in the +string is matched using both its cases. For example, in a locale where "Ch" is a multi-character collating element and +where a matching list expression matches such elements, the RE "[[.Ch.]]" when matched against the string "char" +is in reality matched against "ch", "Ch", "cH", and "CH".

+ +

The implementation shall support any regular expression that does not exceed 256 bytes in length.

+ +

9.3 Basic Regular Expressions

+ +

9.3.1 BREs Matching a Single Character or Collating Element

+ +

A BRE ordinary character, a special character preceded by a backslash, or a period shall match a single character. A bracket +expression shall match a single character or a single collating element.

+ +

9.3.2 BRE Ordinary Characters

+ +

An ordinary character is a BRE that matches itself: any character in the supported character set, except for the BRE special +characters listed in BRE Special Characters.

+ +

The interpretation of an ordinary character preceded by a backslash ( '\' ) is undefined, except for:

+ +

+
The characters ')', '(', '{', and '}'
+
+
The digits 1 to 9 inclusive (see BREs Matching Multiple Characters)
+
+
A character inside a bracket expression
+

+ +

9.3.3 BRE Special Characters

+ +

A BRE special character has special properties in certain contexts. Outside those contexts, or when preceded by a backslash, +such a character is a BRE that matches the special character itself. The BRE special characters and the contexts in which they have +their special meaning are as follows:

+ +

.[\

The period, left-bracket, and backslash shall be special except when used in a bracket expression (see RE Bracket Expression). An expression containing a '[' that is not preceded by a backslash and is not +part of a bracket expression produces undefined results.

*

The asterisk shall be special except when used: + +

+
In a bracket expression
+
+
As the first character of an entire BRE (after an initial '^', if any)
+
+
As the first character of a subexpression (after an initial '^', if any); see BREs Matching +Multiple Characters
+

^

The circumflex shall be special when used as: + +

+
An anchor (see BRE Expression Anchoring)
+
+
The first character of a bracket expression (see RE Bracket Expression)
+

$

The dollar sign shall be special when used as an anchor.

+ +

9.3.4 Periods in BREs

+ +

A period ( '.' ), when used outside a bracket expression, is a BRE that shall match any character in the supported +character set except NUL.

+ +

9.3.5 RE Bracket Expression

+ +

A bracket expression (an expression enclosed in square brackets, "[]" ) is an RE that shall match a single collating +element contained in the non-empty set of collating elements represented by the bracket expression.

+ +

The following rules and definitions apply to bracket expressions:

+ +

+
A bracket expression is either a matching list expression or a non-matching list expression. It consists of one or more +expressions: collating elements, collating symbols, equivalence classes, character classes, or range expressions. The right-bracket +( ']' ) shall lose its special meaning and represent itself in a bracket expression if it occurs first in the list (after +an initial circumflex ( '^' ), if any). Otherwise, it shall terminate the bracket expression, unless it appears in a +collating symbol (such as "[.].]" ) or is the ending right-bracket for a collating symbol, equivalence class, or character +class. The special characters '.', '*', '[', and '\' (period, asterisk, left-bracket, and +backslash, respectively) shall lose their special meaning within a bracket expression.
+ +
The character sequences "[.", "[=", and "[:" (left-bracket followed by a period, equals-sign, or +colon) shall be special inside a bracket expression and are used to delimit collating symbols, equivalence class expressions, and +character class expressions. These symbols shall be followed by a valid expression and the matching terminating sequence +".]", "=]", or ":]", as described in the following items.
+
+
A matching list expression specifies a list that shall match any single-character collating element in any of the expressions +represented in the list. The first character in the list shall not be the circumflex; for example, "[abc]" is an RE that +matches any of the characters 'a', 'b', or 'c'. It is unspecified whether a matching list expression +matches a multi-character collating element that is matched by one of the expressions.
+
+
A non-matching list expression begins with a circumflex ( '^' ), and specifies a list that shall match any +single-character collating element except for the expressions represented in the list after the leading circumflex. For example, +"[^abc]" is an RE that matches any character except the characters 'a', 'b', or 'c'. It is +unspecified whether a non-matching list expression matches a multi-character collating element that is not matched by any of the +expressions. The circumflex shall have this special meaning only when it occurs first in the list, immediately following the +left-bracket.
+
+
A collating symbol is a collating element enclosed within bracket-period ( "[." and ".]" ) delimiters. +Collating elements are defined as described in Collation Order. Conforming +applications shall represent multi-character collating elements as collating symbols when it is necessary to distinguish them from +a list of the individual characters that make up the multi-character collating element. For example, if the string "ch" is +a collating element defined using the line:
+ +
+
```
+collating-element <ch-digraph> from "<c><h>"
+
+
```
+
+ +
in the locale definition, the expression "[[.ch.]]" shall be treated as an RE containing the collating symbol +'ch', while "[ch]" shall be treated as an RE matching 'c' or 'h'. Collating symbols are +recognized only inside bracket expressions. If the string is not a collating element in the current locale, the expression is +invalid.
+
+
An equivalence class expression shall represent the set of collating elements belonging to an equivalence class, as described in +Collation Order. Only primary equivalence classes shall be recognized. The +class shall be expressed by enclosing any one of the collating elements in the equivalence class within bracket-equal ( +"[=" and "=]" ) delimiters. For example, if 'a', 'à', and 'â' belong to +the same equivalence class, then "[[=a=]b]", "[[=à=]b]", and "[[=â=]b]" are each +equivalent to "[aàâb]". If the collating element does not belong to an equivalence class, the equivalence +class expression shall be treated as a collating symbol.
+
+
A character class expression shall represent the union of two sets:
+ +
1. +
  The set of single-character collating elements whose characters belong to the character class, as defined in the LC_CTYPE +category in the current locale.
  +
2. +
  An unspecified set of multi-character collating elements.
  +
+ +
All character classes specified in the current locale shall be recognized. A character class expression is expressed as a +character class name enclosed within bracket-colon ( "[:" and ":]" ) delimiters.
+ +
The following character class expressions shall be supported in all locales:
+ +
+
```
+[:alnum:]   [:cntrl:]   [:lower:]   [:space:]
+[:alpha:]   [:digit:]   [:print:]   [:upper:]
+[:blank:]   [:graph:]   [:punct:]   [:xdigit:]
+
+
```
+
+ +
In addition, character class expressions of the form:
+ +
+
```
+[:name:]
+
+
```
+
+ +
are recognized in those locales where the name keyword has been given a charclass definition in the +LC_CTYPE category.
+
+
In the POSIX locale, a range expression represents the set of collating elements that fall between two elements in the collation +sequence, inclusive. In other locales, a range expression has unspecified behavior: strictly conforming applications shall not rely +on whether the range expression is valid, or on the set of collating elements matched. A range expression shall be expressed as the +starting point and the ending point separated by a hyphen ( '-' ).
+ +
In the following, all examples assume the POSIX locale.
+ +
The starting range point and the ending range point shall be a collating element or collating symbol. An equivalence class +expression used as a starting or ending point of a range expression produces unspecified results. An equivalence class can be used +portably within a bracket expression, but only outside the range. If the represented set of collating elements is empty, it is +unspecified whether the expression matches nothing, or is treated as invalid.
+ +
The interpretation of range expressions where the ending range point is also the starting range point of a subsequent range +expression (for example, "[a-m-o]" ) is undefined.
+ +
The hyphen character shall be treated as itself if it occurs first (after an initial '^', if any) or last in the list, +or as an ending range point in a range expression. As examples, the expressions "[-ac]" and "[ac-]" are +equivalent and match any of the characters 'a', 'c', or '-' ; "[^-ac]" and "[^ac-]" +are equivalent and match any characters except 'a', 'c', or '-' ; the expression "[%--]" +matches any of the characters between '%' and '-' inclusive; the expression "[--@]" matches any of the +characters between '-' and '@' inclusive; and the expression "[a--@]" is either invalid or equivalent to +'@', because the letter 'a' follows the symbol '-' in the POSIX locale. To use a hyphen as the starting +range point, it shall either come first in the bracket expression or be specified as a collating symbol; for example, +"[][.-.]-0]", which matches either a right bracket or any character or collating element that collates between hyphen and +0, inclusive.
+ +
If a bracket expression specifies both '-' and ']', the ']' shall be placed first (after the +'^', if any) and the '-' last within the bracket expression.
+

+ +

9.3.6 BREs Matching Multiple Characters

+ +

The following rules can be used to construct BREs matching multiple characters from BREs matching a single character:

+ +

+
The concatenation of BREs shall match the concatenation of the strings matched by each component of the BRE.
+
+
A subexpression can be defined within a BRE by enclosing it between the character pairs "$" and "$". Such a +subexpression shall match whatever it would have matched without the "$" and "$", except that anchoring within +subexpressions is optional behavior; see BRE Expression Anchoring. Subexpressions can be arbitrarily +nested.
+
+
The back-reference expression '\n' shall match the same (possibly empty) string of characters as was matched by a +subexpression enclosed between "$" and "$" preceding the '\n'. The character 'n' shall be a +digit from 1 through 9, specifying the nth subexpression (the one that begins with the nth "$" from the +beginning of the pattern and ends with the corresponding paired "$" ). The expression is invalid if less than n +subexpressions precede the '\n'. For example, the expression "$.*$\1$" matches a line consisting of two +adjacent appearances of the same string, and the expression "$a$*\1" fails to match 'a'. When the referenced +subexpression matched more than one string, the back-referenced expression shall refer to the last matched string. If the +subexpression referenced by the back-reference matches more than one string because of an asterisk ( '*' ) or an interval +expression (see item (5)), the back-reference shall match the last (rightmost) of these strings.
+
+
When a BRE matching a single character, a subexpression, or a back-reference is followed by the special character asterisk ( +'*' ), together with that asterisk it shall match what zero or more consecutive occurrences of the BRE would match. For +example, "[ab]*" and "[ab][ab]" are equivalent when matching the string "ab".
+
+
When a BRE matching a single character, a subexpression, or a back-reference is followed by an interval expression of the format +"\{m\}", "\{m,\}", or "\{m,n\}", together with that interval expression it shall match what repeated +consecutive occurrences of the BRE would match. The values of m and n are decimal integers in the range 0 <= +m<= n<= {RE_DUP_MAX}, where m specifies the exact or minimum number of occurrences and n +specifies the maximum number of occurrences. The expression "\{m\}" shall match exactly m occurrences of the +preceding BRE, "\{m,\}" shall match at least m occurrences, and "\{m,n\}" shall match any number of +occurrences between m and n, inclusive.
+ +
For example, in the string "abababccccccd" the BRE "c\{3\}" is matched by characters seven to nine, the BRE +"$ab$\{4,\}" is not matched at all, and the BRE "c\{1,3\}d" is matched by characters ten to thirteen.
+

+ +

The behavior of multiple adjacent duplication symbols ( '*' and intervals) produces undefined results.

+ +

A subexpression repeated by an asterisk ( '*' ) or an interval expression shall not match a null expression unless this +is the only match for the repetition or it is necessary to satisfy the exact or minimum number of occurrences for the interval +expression.

+ +

9.3.7 BRE Precedence

+ +

The order of precedence shall be as shown in the following table:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

+ BRE Precedence (from high to low) +
+ Collation-related bracket symbols +	+ [==] [::] [..] +
+ Escaped characters +	+ \<special character> +
+ Bracket expression +	+ [] +
+ Subexpressions/back-references +	+ \n +
+ Single-character-BRE duplication +	+ * \{m,n\} +
+ Concatenation +	+ +
+ Anchoring +	+ ^ $ +

+ + +

9.3.8 BRE Expression Anchoring

+ +

A BRE can be limited to matching strings that begin or end a line; this is called "anchoring". The circumflex and dollar sign +special characters shall be considered BRE anchors in the following contexts:

+ +

+
A circumflex ( '^' ) shall be an anchor when used as the first character of an entire BRE. The implementation may treat +the circumflex as an anchor when used as the first character of a subexpression. The circumflex shall anchor the expression (or +optionally subexpression) to the beginning of a string; only sequences starting at the first character of a string shall be matched +by the BRE. For example, the BRE "^ab" matches "ab" in the string "abcdef", but fails to match in the +string "cdefab". The BRE "$^ab$" may match the former string. A portable BRE shall escape a leading circumflex +in a subexpression to match a literal circumflex.
+
+
A dollar sign ( '$' ) shall be an anchor when used as the last character of an entire BRE. The implementation may treat +a dollar sign as an anchor when used as the last character of a subexpression. The dollar sign shall anchor the expression (or +optionally subexpression) to the end of the string being matched; the dollar sign can be said to match the end-of-string following +the last character.
+
+
A BRE anchored by both '^' and '$' shall match only an entire string. For example, the BRE "^abcdef$" +matches strings consisting only of "abcdef".
+

+ +

9.4 Extended Regular Expressions

+ +

The extended regular expression (ERE) notation and construction rules shall apply to utilities defined as using extended regular +expressions; any exceptions to the following rules are noted in the descriptions of the specific utilities using EREs.

+ +

9.4.1 EREs Matching a Single Character or Collating Element

+ +

An ERE ordinary character, a special character preceded by a backslash, or a period shall match a single character. A bracket +expression shall match a single character or a single collating element. An ERE matching a single character enclosed in parentheses +shall match the same as the ERE without parentheses would have matched.

+ +

9.4.2 ERE Ordinary Characters

+ +

An ordinary character is an ERE that matches itself. An ordinary character is any character in the supported character set, +except for the ERE special characters listed in ERE Special Characters. The interpretation of an +ordinary character preceded by a backslash ( '\' ) is undefined.

+ +

9.4.3 ERE Special Characters

+ +

An ERE special character has special properties in certain contexts. Outside those contexts, or when preceded by a backslash, +such a character shall be an ERE that matches the special character itself. The extended regular expression special characters and +the contexts in which they shall have their special meaning are as follows:

+ +

.[\(

The period, left-bracket, backslash, and left-parenthesis shall be special except when used in a bracket expression (see RE Bracket Expression). Outside a bracket expression, a left-parenthesis immediately followed by a +right-parenthesis produces undefined results.

)

The right-parenthesis shall be special when matched with a preceding left-parenthesis, both outside a bracket expression.

*+?{

The asterisk, plus-sign, question-mark, and left-brace shall be special except when used in a bracket expression (see RE Bracket Expression). Any of the following uses produce undefined results: + +

+
If these characters appear first in an ERE, or immediately following a vertical-line, circumflex, or left-parenthesis
+
+
If a left-brace is not part of a valid interval expression (see EREs Matching Multiple Characters +)
+

|

The vertical-line is special except when used in a bracket expression (see RE Bracket Expression). +A vertical-line appearing first or last in an ERE, or immediately following a vertical-line or a left-parenthesis, or immediately +preceding a right-parenthesis, produces undefined results.

^

The circumflex shall be special when used as: + +

+
An anchor (see ERE Expression Anchoring)
+
+
The first character of a bracket expression (see RE Bracket Expression)
+

$

The dollar sign shall be special when used as an anchor.

+ +

9.4.4 Periods in EREs

+ +

A period ( '.' ), when used outside a bracket expression, is an ERE that shall match any character in the supported +character set except NUL.

+ +

9.4.5 ERE Bracket Expression

+ +

The rules for ERE Bracket Expressions are the same as for Basic Regular Expressions; see RE Bracket +Expression.

+ +

9.4.6 EREs Matching Multiple Characters

+ +

The following rules shall be used to construct EREs matching multiple characters from EREs matching a single character:

+ +

+
A concatenation of EREs shall match the concatenation of the character sequences matched by each component of the ERE. A +concatenation of EREs enclosed in parentheses shall match whatever the concatenation without the parentheses matches. For example, +both the ERE "cd" and the ERE "(cd)" are matched by the third and fourth character of the string +"abcdefabcdef".
+
+
When an ERE matching a single character or an ERE enclosed in parentheses is followed by the special character plus-sign ( +'+' ), together with that plus-sign it shall match what one or more consecutive occurrences of the ERE would match. For +example, the ERE "b+(bc)" matches the fourth to seventh characters in the string "acabbbcde". And, +"[ab]+" and "[ab][ab]*" are equivalent.
+
+
When an ERE matching a single character or an ERE enclosed in parentheses is followed by the special character asterisk ( +'*' ), together with that asterisk it shall match what zero or more consecutive occurrences of the ERE would match. For +example, the ERE "b*c" matches the first character in the string "cabbbcde", and the ERE "b*cd" matches +the third to seventh characters in the string "cabbbcdebbbbbbcdbc". And, "[ab]*" and "[ab][ab]" are +equivalent when matching the string "ab".
+
+
When an ERE matching a single character or an ERE enclosed in parentheses is followed by the special character question-mark ( +'?' ), together with that question-mark it shall match what zero or one consecutive occurrences of the ERE would match. +For example, the ERE "b?c" matches the second character in the string "acabbbcde".
+
+
When an ERE matching a single character or an ERE enclosed in parentheses is followed by an interval expression of the format +"{m}", "{m,}", or "{m,n}", together with that interval expression it shall match what repeated +consecutive occurrences of the ERE would match. The values of m and n are decimal integers in the range 0 <= +m<= n<= {RE_DUP_MAX}, where m specifies the exact or minimum number of occurrences and n +specifies the maximum number of occurrences. The expression "{m}" matches exactly m occurrences of the preceding +ERE, "{m,}" matches at least m occurrences, and "{m,n}" matches any number of occurrences between m +and n, inclusive.
+ +
For example, in the string "abababccccccd" the ERE "c{3}" is matched by characters seven to nine and the ERE +"(ab){2,}" is matched by characters one to six.
+

+ +

The behavior of multiple adjacent duplication symbols ( '+', '*', '?', and intervals) produces +undefined results.

+ +

An ERE matching a single character repeated by an '*', '?', or an interval expression shall not match a null +expression unless this is the only match for the repetition or it is necessary to satisfy the exact or minimum number of +occurrences for the interval expression.

+ +

9.4.7 ERE Alternation

+ +

Two EREs separated by the special character vertical-line ( '|' ) shall match a string that is matched by either. For +example, the ERE "a((bc)|d)" matches the string "abc" and the string "ad". Single characters, or +expressions matching single characters, separated by the vertical bar and enclosed in parentheses, shall be treated as an ERE +matching a single character.

+ +

9.4.8 ERE Precedence

+ +

The order of precedence shall be as shown in the following table:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

+ ERE Precedence (from high to low) +
+ Collation-related bracket symbols +	+ [==] [::] [..] +
+ Escaped characters +	+ \<special character> +
+ Bracket expression +	+ [] +
+ Grouping +	+ () +
+ Single-character-ERE duplication +	+ * + ? {m,n} +
+ Concatenation +	+ +
+ Anchoring +	+ ^ $ +
+ Alternation +	+ \| +

+ + +

For example, the ERE "abba|cde" matches either the string "abba" or the string "cde" (rather than the +string "abbade" or "abbcde", because concatenation has a higher order of precedence than alternation).

+ +

9.4.9 ERE Expression Anchoring

+ +

An ERE can be limited to matching strings that begin or end a line; this is called "anchoring". The circumflex and dollar sign +special characters shall be considered ERE anchors when used anywhere outside a bracket expression. This shall have the following +effects:

+ +

+
A circumflex ( '^' ) outside a bracket expression shall anchor the expression or subexpression it begins to the +beginning of a string; such an expression or subexpression can match only a sequence starting at the first character of a string. +For example, the EREs "^ab" and "(^ab)" match "ab" in the string "abcdef", but fail to match +in the string "cdefab", and the ERE "a^b" is valid, but can never match because the 'a' prevents the +expression "^b" from matching starting at the first character.
+
+
A dollar sign ( '$' ) outside a bracket expression shall anchor the expression or subexpression it ends to the end of a +string; such an expression or subexpression can match only a sequence ending at the last character of a string. For example, the +EREs "ef$" and "(ef$)" match "ef" in the string "abcdef", but fail to match in the string +"cdefab", and the ERE "e$f" is valid, but can never match because the 'f' prevents the expression +"e$" from matching ending at the last character.
+

+ +

9.5 Regular Expression Grammar

+ +

Grammars describing the syntax of both basic and extended regular expressions are presented in this section. The grammar takes +precedence over the text. See the Shell and Utilities volume of IEEE Std 1003.1-2001, Section 1.10, Grammar Conventions.

+ +

9.5.1 BRE/ERE Grammar Lexical Conventions

+ +

The lexical conventions for regular expressions are as described in this section.

+ +

Except as noted, the longest possible token or delimiter beginning at a given point is recognized.

+ +

The following tokens are processed (in addition to those string constants shown in the grammar):

+ +

COLL_ELEM_SINGLE

+Any single-character collating element, unless it is a META_CHAR.

COLL_ELEM_MULTI

+Any multi-character collating element.

BACKREF

Applicable only to basic regular expressions. The character string consisting of '\' followed by a single-digit +numeral, '1' to '9'.

DUP_COUNT

Represents a numeric constant. It shall be an integer in the range 0 <= DUP_COUNT <= {RE_DUP_MAX}. This token is +only recognized when the context of the grammar requires it. At all other times, digits not preceded by '\' are treated as +ORD_CHAR.

META_CHAR

One of the characters: + +

^: When found first in a bracket expression
-: When found anywhere but first (after an initial '^', if any) or last in a bracket expression, or as the ending range +point in a range expression
]: When found anywhere but first (after an initial '^', if any) in a bracket expression

L_ANCHOR

Applicable only to basic regular expressions. The character '^' when it appears as the first character of a basic +regular expression and when not QUOTED_CHAR. The '^' may be recognized as an anchor elsewhere; see BRE Expression Anchoring.

ORD_CHAR

A character, other than one of the special characters in SPEC_CHAR.

QUOTED_CHAR

In a BRE, one of the character sequences: + +

+\^    \.   \*    \[    \$    \\
+
+

+ +

In an ERE, one of the character sequences:

+ +

+\^    \.    \[    \$    \(    \)    \|
+\*    \+    \?    \{    \\
+
+

R_ANCHOR

(Applicable only to basic regular expressions.) The character '$' when it appears as the last character of a basic +regular expression and when not QUOTED_CHAR. The '$' may be recognized as an anchor elsewhere; see BRE Expression Anchoring.

SPEC_CHAR

For basic regular expressions, one of the following special characters: + +

.: Anywhere outside bracket expressions
\: Anywhere outside bracket expressions
[: Anywhere outside bracket expressions
^: When used as an anchor (see BRE Expression Anchoring) or when first in a bracket expression
$: When used as an anchor
*: Anywhere except first in an entire RE, anywhere in a bracket expression, directly following "\(", directly following +an anchoring '^'

+ +

For extended regular expressions, shall be one of the following special characters found anywhere outside bracket +expressions:

+ +

+^    .    [    $    (    )    |
+*    +    ?    {    \
+
+

+ +

(The close-parenthesis shall be considered special in this context only if matched with a preceding open-parenthesis.)

+ +

9.5.2 RE and Bracket Expression Grammar

+ +

This section presents the grammar for basic regular expressions, including the bracket expression grammar that is common to both +BREs and EREs.

+ +

+%token    ORD_CHAR QUOTED_CHAR DUP_COUNT
+

+%token    BACKREF L_ANCHOR R_ANCHOR
+

+%token    Back_open_paren  Back_close_paren
+/*          '\('             '\)'        */
+

+%token    Back_open_brace  Back_close_brace
+/*          '\{'             '\}'         */
+

+/* The following tokens are for the Bracket Expression
+   grammar common to both REs and EREs. */
+

+%token    COLL_ELEM_SINGLE COLL_ELEM_MULTI META_CHAR
+

+%token    Open_equal Equal_close Open_dot Dot_close Open_colon Colon_close
+/*           '[='       '=]'        '[.'     '.]'      '[:'       ':]'  */
+

+%token    class_name
+/* class_name is a keyword to the LC_CTYPE locale category */
+/* (representing a character class) in the current locale */
+/* and is only recognized between [: and :] */
+

+%start    basic_reg_exp
+%%
+

+/* --------------------------------------------
+   Basic Regular Expression
+   --------------------------------------------
+*/
+basic_reg_exp  :          RE_expression
+               | L_ANCHOR
+               |                        R_ANCHOR
+               | L_ANCHOR               R_ANCHOR
+               | L_ANCHOR RE_expression
+               |          RE_expression R_ANCHOR
+               | L_ANCHOR RE_expression R_ANCHOR
+               ;
+RE_expression  :               simple_RE
+               | RE_expression simple_RE
+               ;
+simple_RE      : nondupl_RE
+               | nondupl_RE RE_dupl_symbol
+               ;
+nondupl_RE     : one_char_or_coll_elem_RE
+               | Back_open_paren RE_expression Back_close_paren
+               | BACKREF
+               ;
+one_char_or_coll_elem_RE : ORD_CHAR
+               | QUOTED_CHAR
+               | '.'
+               | bracket_expression
+               ;
+RE_dupl_symbol : '*'
+               | Back_open_brace DUP_COUNT               Back_close_brace
+               | Back_open_brace DUP_COUNT ','           Back_close_brace
+               | Back_open_brace DUP_COUNT ',' DUP_COUNT Back_close_brace
+               ;
+

+/* --------------------------------------------
+   Bracket Expression
+   -------------------------------------------
+*/
+bracket_expression : '[' matching_list ']'
+               | '[' nonmatching_list ']'
+               ;
+matching_list  : bracket_list
+               ;
+nonmatching_list : '^' bracket_list
+               ;
+bracket_list   : follow_list
+               | follow_list '-'
+               ;
+follow_list    :             expression_term
+               | follow_list expression_term
+               ;
+expression_term : single_expression
+               | range_expression
+               ;
+single_expression : end_range
+               | character_class
+               | equivalence_class
+               ;
+range_expression : start_range end_range
+               | start_range '-'
+               ;
+start_range    : end_range '-'
+               ;
+end_range      : COLL_ELEM_SINGLE
+               | collating_symbol
+               ;
+collating_symbol : Open_dot COLL_ELEM_SINGLE Dot_close
+               | Open_dot COLL_ELEM_MULTI Dot_close
+               | Open_dot META_CHAR Dot_close
+               ;
+equivalence_class : Open_equal COLL_ELEM_SINGLE Equal_close
+               | Open_equal COLL_ELEM_MULTI Equal_close
+               ;
+character_class : Open_colon class_name Colon_close
+               ;
+
+

+ +

The BRE grammar does not permit L_ANCHOR or R_ANCHOR inside "$" and "$" (which implies that +'^' and '$' are ordinary characters). This reflects the semantic limits on the application, as noted in BRE Expression Anchoring. Implementations are permitted to extend the language to interpret '^' and +'$' as anchors in these locations, and as such, conforming applications cannot use unescaped '^' and '$' +in positions inside "$" and "$" that might be interpreted as anchors.

+ +

9.5.3 ERE Grammar

+ +

This section presents the grammar for extended regular expressions, excluding the bracket expression grammar.

+ +

Note:: The bracket expression grammar and the associated %token lines are identical between BREs and EREs. It has been omitted +from the ERE section to avoid unnecessary editorial duplication.

+ + + +

+%token  ORD_CHAR QUOTED_CHAR DUP_COUNT
+%start  extended_reg_exp
+%%
+

+/* --------------------------------------------
+   Extended Regular Expression
+   --------------------------------------------
+*/
+extended_reg_exp   :                      ERE_branch
+                   | extended_reg_exp '|' ERE_branch
+                   ;
+ERE_branch         :            ERE_expression
+                   | ERE_branch ERE_expression
+                   ;
+ERE_expression     : one_char_or_coll_elem_ERE
+                   | '^'
+                   | '$'
+                   | '(' extended_reg_exp ')'
+                   | ERE_expression ERE_dupl_symbol
+                   ;
+one_char_or_coll_elem_ERE  : ORD_CHAR
+                   | QUOTED_CHAR
+                   | '.'
+                   | bracket_expression
+                   ;
+ERE_dupl_symbol    : '*'
+                   | '+'
+                   | '?'
+                   | '{' DUP_COUNT               '}'
+                   | '{' DUP_COUNT ','           '}'
+                   | '{' DUP_COUNT ',' DUP_COUNT '}'
+                   ;
+
+

+ +

The ERE grammar does not permit several constructs that previous sections specify as having undefined results:

+ +

+
ORD_CHAR preceded by '\'
+
+
One or more ERE_dupl_symbols appearing first in an ERE, or immediately following '|', '^', or +'('
+
+
'{' not part of a valid ERE_dupl_symbol
+
+
'|' appearing first or last in an ERE, or immediately following '|' or '(', or immediately preceding +')'
+

+ +

Implementations are permitted to extend the language to allow these. Conforming applications cannot use such constructs.

+ +

+ +UNIX ® is a registered Trademark of The Open Group.
+POSIX ® is a registered Trademark of The IEEE.
+[ Main Index | XBD | XCU | XSH | XRAT +] + + +

+ + + diff --git a/ps01_grep/ps01_bnewbold.txt b/ps01_grep/ps01_bnewbold.txt new file mode 100644 index 0000000..52979e7 --- /dev/null +++ b/ps01_grep/ps01_bnewbold.txt @@ -0,0 +1,345 @@ +;;; 6.945 Problem Set #1 +;;; 02/10/2009 +;;; Bryan Newbold + + +Problem 1.1: Warmup +--------------- + +(define (r:+ pattern) + (r:repeat 1 #f pattern)) + +(define (r:* pattern) + (r:repeat 0 #f pattern)) + +(pp (r:grep (r:seq (r:quote "dog") + (r:+ (r:quote "cat"))) + "tests.txt")) + +;("[09]. catdogcat" "[11]. dogdogcatdogdog" "[13]. acatdogdogcats") + +(pp (r:grep (r:seq (r:quote "cat") + (r:+ (r:char-from (string->char-set "ogd"))) + (r:quote "cat")) + "tests.txt")) + +;("[09]. catdogcat" "[13]. acatdogdogcats") + +(pp (r:grep (r:seq " " + (r:+ (r:quote "cat")) + (r:eol)) + "tests.txt")) +; #f + +(pp (r:grep (r:* (r:quote "something that isn't there")) + "tests.txt")) +; matches all + +(pp (r:grep (r:repeat 2 3 (r:* (r:char-from (string->char-set "tca")))) + "tests.txt")) +; matches all: nothing repeated several times + +(pp (r:grep (r:seq (r:quote "dog") + (r:+ (r:quote "cat")) + (r:quote "dog")) + "tests.txt")) + +Problem 1.2: The Proposals +--------------- + +a) This would be the same as asking Louis Reasoner to explain why his +suggestion is wrong; the problem lies within r:repeat, so re-calling r:repeat +does not sidestep the problem. + +b) Bonnie's method is computationally more efficient because checking up to the +minimum number of repetitions is only executed once; after that successive +checks are made for additional repetitions. From a straight-forward +perspective, Bonnie's approach also uses significantly fewer characters of +final regex code: this could potentially (but unlikely) become important if +large expressions were searched for with a large range of possible repetitions. +Most importantly, Bonnie's method is more readable and brings the +implementation more inline with the underlying philosophy of regular +expressions (in my opinion). + +c) Ben's solution is the one that an experienced regex-er would think of first +and is the closest conceptual analog to r:repeat in the actual specification. +The code is compact and should be the most readable and explicit, though we'll +see what kind of syntaxual issues crop up... + +d) + +(define (r:repeat min max expr) + (if (not (exact-nonnegative-integer? min)) + (error "Min must be non-negative integer:" min)) + (if max + (begin + (if (not (exact-nonnegative-integer? max)) + (error "Max must be non-negative integer (or null):" max)) + (if (not (<= min max)) + (error "Min not less than max:" min max)))) + (cond + ((not max) (r:seq expr (string-append "\\{" (number->string min) ",\\}"))) + ((= max min) (r:seq expr (string-append "\\{" (number->string min) "\\}"))) + (else (r:seq expr (string-append "\\{" + (number->string min) + "," + (number->string max) + "\\}"))))) +#| tests: +(pp (r:grep (r:repeat 2 3 (r:quote "cat")) + "tests.txt")) +(pp (r:grep (r:repeat 2 #f (r:quote "cat")) + "tests.txt")) +(pp (r:grep (r:repeat 1 1 (r:quote "cat")) + "tests.txt")) +(pp (r:grep (r:repeat 4 3 (r:quote "cat")) + "tests.txt")) +(r:repeat 2 3 (r:quote "cat")) +|# + + +Problem 1.3: Optimization +-------------- +I simply removed the default parentheses around all expressions and implemented +a new r:subexp wrapper which puts parentheses around expressions. Then I had +r:alt and r:repeat wrap their expressions in r:subexp. This results in /some/ +redundant nesting but is a huge reduction. + +(define (r:subexp . exprs) + (string-append "\$" (apply string-append exprs) "\$")) + +(define (r:seq . exprs) + (apply string-append exprs)) + +(r:seq " " (r:subexp "sdf") "dddd") + +(define (r:repeat min max expr) + (if (not (exact-nonnegative-integer? min)) + (error "Min must be non-negative integer:" min)) + (if max + (begin + (if (not (exact-nonnegative-integer? max)) + (error "Max must be non-negative integer (or null):" max)) + (if (not (<= min max)) + (error "Min not less than max:" min max)))) + (cond + ((not max) (r:seq (r:subexp expr) + (string-append "\\{" (number->string min) ",\\}"))) + ((= max min) (r:seq (r:subexp expr) + (string-append "\\{" (number->string min) "\\}"))) + (else (r:seq (r:subexp expr) + (string-append "\\{" + (number->string min) + "," + (number->string max) + "\\}"))))) + + +Problem 1.4: Back-references +-------------- +I only implemented the most crude form of back-expressions. A better +implementation would include a method of naming subexpressions and refering +back to them by name: the implementation would do the appropriate counting to +find the correct backreference number for a given named subexpression. + +(define (r:backref n) + (string-append "\\" (number->string n))) + +Problem 1.5: Ugh! +------------- +a) Differences between BREs and EREs (I referenced the table at +http://www.greenend.org.uk/rjk/2002/06/regexp.html): + +* Some characters do not need escaping backslashes: mutiple matching brackets + ('\{' becomes '{'), subexpressions ('\(' becomes '(') +* The alternation operator '|' is added +* The '+' operator is added meaning one or more matches +* The '?' operator is added meaning one or zero matches + +b) The basic method I chose is to have each r: procedure return a list +containing two lambda procedures: the first will return the regular grep +version of the expression, while the second will return the extended grep +version. The r:grep procedure selects the first lambda and r:egrep selects the +second, and this decision cascades down recursively to the atomic expressions. + +My implementation is crude; the r-grep: and r-egrep: procedures should be +internal to the primary r: procedures (using let), and the whole process could +be implemented as a funtion. + +c) (not, uh, fully tested...) + +; the following are only the expressions which had to be rewritten... + +(define (r-grep:seq . exprs) + (let ((choose-grep (lambda (x) + (cond + ((string? x) x) + ((list? x) (car x)))))) + (apply string-append (map choose-grep exprs)))) + +(r-grep:seq "a" "b" '("grep" "egrep")) +; abgrep + +(define (r-egrep:seq . exprs) + (let ((choose-egrep (lambda (x) + (cond + ((string? x) x) + ((list? x) (cadr x)))))) + (apply string-append (map choose-egrep exprs)))) + +#| test: +(r-grep:seq "a" "b" '("grep" "egrep")) +; "abgrep" +(r-egrep:seq "a" "b" '("grep" "egrep")) +; "abegrep" +|# + +(define (r:seq . exprs) + (cons (lambda () (apply r-grep:seq exprs)) + (cons (lambda () (apply r-egrep:seq exprs)) '()))) + +#| test: +(r:seq "a" "b" '("grep" "egrep")) +;Value: (#[compound-procedure 14] #[compound-procedure 15]) +((car (r:seq "a" "b" '("grep" "egrep")))) +; "abgrep" +((cadr (r:seq "a" "b" '("grep" "egrep")))) +; "abegrep" +|# + +(define (r-egrep:alt . exprs) + (let ((choose-egrep (lambda (x) + (cond + ((string? x) x) + ((list? x) ((cadr x))))))) + (if (pair? exprs) + (apply r-egrep:seq + (cons (choose-egrep (car exprs)) + (append-map (lambda (expr) + (list "|" (choose-egrep expr))) + (cdr exprs)))) + (r-egrep:seq)))) + +(define (r-grep:alt . exprs) + (let ((choose-grep (lambda (x) + (cond + ((string? x) x) + ((list? x) ((car x))))))) + (if (pair? exprs) + (apply r-grep:seq + (cons (choose-grep (car exprs)) + (append-map (lambda (expr) + (list "\\|" (choose-grep expr))) + (cdr exprs)))) + (r-grep:seq)))) + +(define (r:alt . exprs) + (cons (lambda () (apply r-grep:alt exprs)) + (cons (lambda () (apply r-egrep:alt exprs)) '()))) + +#| test: +(r-egrep:alt "a" (r:dot) "c") +;Value: "a|.|c" +(r:alt "asdf" (r:dot)) +;Value: (#[compound-procedure 16] #[compound-procedure 17]) +((car (r:alt "asdf" (r:dot)))) +;Value: "asdf\\|." +((cadr (r:alt "asdf" (r:dot)))) +;Value: "asdf|." +|# + +(define (r-grep:repeat min max expr) + (let ((choose-grep (lambda (x) + (cond + ((string? x) x) + ((list? x) ((car x))))))) + (if (not (exact-nonnegative-integer? min)) + (error "Min must be non-negative integer:" min)) + (if max + (begin + (if (not (exact-nonnegative-integer? max)) + (error "Max must be non-negative integer (or null):" max)) + (if (not (<= min max)) + (error "Min not less than max:" min max)))) + (cond + ((not max) (r-grep:seq (r-grep:subexp (choose-grep expr)) + (string-append "\\{" (number->string min) ",\\}"))) + ((= max min) (r-grep:seq (r-grep:subexp (choose-grep expr)) + (string-append "\\{" (number->string min) "\\}"))) + (else (r-grep:seq (r-grep:subexp (choose-grep expr)) + (string-append "\\{" + (number->string min) + "," + (number->string max) + "\\}")))))) + +(define (r-egrep:repeat min max expr) + (let ((choose-egrep (lambda (x) + (cond + ((string? x) x) + ((list? x) ((cadr x))))))) + (if (not (exact-nonnegative-integer? min)) + (error "Min must be non-negative integer:" min)) + (if max + (begin + (if (not (exact-nonnegative-integer? max)) + (error "Max must be non-negative integer (or null):" max)) + (if (not (<= min max)) + (error "Min not less than max:" min max)))) + (cond + ((not max) (r-egrep:seq (r-egrep:subexp (choose-egrep expr)) + (string-append "{" (number->string min) ",}"))) + ((= max min) (r-egrep:seq (r-egrep:subexp (choose-egrep expr)) + (string-append "{" (number->string min) "}"))) + (else (r-egrep:seq (r-egrep:subexp (choose-egrep expr)) + (string-append "{" + (number->string min) + "," + (number->string max) + "}")))))) + +(define (r:repeat min max expr) + (cons (lambda () (r-grep:repeat min max expr)) + (cons (lambda () (r-egrep:repeat min max expr)) '()))) + +(define (r-grep:subexp . exprs) + (string-append "\$" (apply string-append + (map (lambda (x) + (cond + ((string? x) x) + ((pair? x) ((car x))))) + exprs)) "\$")) + +(define (r-egrep:subexp . exprs) + (string-append "(" (apply string-append + (map (lambda (x) + (cond + ((string? x) x) + ((pair? x) ((cadr x))))) + exprs)) ")")) + +(define (r:subexp . exprs) + (cons (lambda () (apply r-grep:subexp exprs)) + (cons (lambda () (apply r-egrep:subexp exprs)) '()))) + + +#| test: +(r:subexp "a" "b" (r:dot)) +;Value: (#[compound-procedure 25] #[compound-procedure 26]) +((cadr (r:subexp "a" "b" (r:dot)))) +;Value: "(ab.)" +((car (r:subexp "a" "b" (r:dot)))) +;Value: "\$ab.\$" +|# + +(define (r:grep expr filename) + (r:grep-like "grep" '() ((car expr)) filename)) + +(define (r:egrep expr filename) + (if (eq? microcode-id/operating-system 'nt) + (r:grep-like "grep" '("-E") ((cadr expr)) filename) + (r:grep-like "egrep" '() ((cadr expr)) filename))) + +(pp (r:grep (r:seq (r:repeat 2 3 (r:quote "cat")) (r:+ (r:quote "dog"))) + "tests.txt")) + diff --git a/ps01_grep/ps01_work.scm b/ps01_grep/ps01_work.scm new file mode 100644 index 0000000..7338f20 --- /dev/null +++ b/ps01_grep/ps01_work.scm @@ -0,0 +1,331 @@ +;;; 6.945 Problem Set #1 - workfile +;;; 02/10/2009 +;;; Bryan Newbold + +(load "regexp.scm") + + +(pp (r:grep (r:seq " " + (r:repeat 3 5 (r:alt (r:quote "cat") (r:quote "dog"))) + (r:eol)) + "tests.txt")) + +;------------------------------------------------------------------- +; 1.1: define r:* and r:+ + +(define (r:+ pattern) + (r:repeat 1 #f pattern)) + +(define (r:* pattern) + (r:repeat 0 #f pattern)) + +(pp (r:grep (r:seq (r:quote "dog") + (r:+ (r:quote "cat"))) + "tests.txt")) + +;("[09]. catdogcat" "[11]. dogdogcatdogdog" "[13]. acatdogdogcats") + +(pp (r:grep (r:seq (r:quote "cat") + (r:+ (r:char-from (string->char-set "ogd"))) + (r:quote "cat")) + "tests.txt")) + +;("[09]. catdogcat" "[13]. acatdogdogcats") + +(pp (r:grep (r:seq " " + (r:+ (r:quote "cat")) + (r:eol)) + "tests.txt")) +; #f + +(pp (r:grep (r:* (r:quote "something that isn't there")) + "tests.txt")) +; matches all + +(pp (r:grep (r:repeat 2 3 (r:* (r:char-from (string->char-set "tca")))) + "tests.txt")) +; matches all: nothing repeated several times + +(pp (r:grep (r:seq (r:quote "dog") + (r:+ (r:quote "cat")) + (r:quote "dog")) + "tests.txt")) + +;("[11]. dogdogcatdogdog") + +(pp (r:grep (r:seq (r:quote "dog") + (r:* (r:quote "cat")) + (r:quote "dog")) + "tests.txt")) +#| +("[10]. catcatdogdog" "[11]. dogdogcatdogdog" + "[12]. catcatcatdogdogdog" + "[13]. acatdogdogcats" + "[14]. ifacatdogdogs" + "[15]. acatdogdogsme") +|# + +;------------------------------------------------------------------------- +; 1.2 + +(define (r:repeat min max expr) + (if (not (exact-nonnegative-integer? min)) + (error "Min must be non-negative integer:" min)) + (if max + (begin + (if (not (exact-nonnegative-integer? max)) + (error "Max must be non-negative integer (or null):" max)) + (if (not (<= min max)) + (error "Min not less than max:" min max)))) + (cond + ((not max) (r:seq expr (string-append "\\{" (number->string min) ",\\}"))) + ((= max min) (r:seq expr (string-append "\\{" (number->string min) "\\}"))) + (else (r:seq expr (string-append "\\{" + (number->string min) + "," + (number->string max) + "\\}"))))) +#| tests: +(pp (r:grep (r:repeat 2 3 (r:quote "cat")) + "tests.txt")) +(pp (r:grep (r:repeat 2 #f (r:quote "cat")) + "tests.txt")) +(pp (r:grep (r:repeat 1 1 (r:quote "cat")) + "tests.txt")) +(pp (r:grep (r:repeat 4 3 (r:quote "cat")) + "tests.txt")) +(r:repeat 2 3 (r:quote "cat")) +|# + +;--------------------------------------------------------------------------- +; 1.3 + +(define (r:subexp . exprs) + (string-append "\$" (apply string-append exprs) "\$")) + +(define (r:seq . exprs) + (apply string-append exprs)) + +(r:seq " " (r:subexp "sdf") "dddd") + +(define (r:repeat min max expr) + (if (not (exact-nonnegative-integer? min)) + (error "Min must be non-negative integer:" min)) + (if max + (begin + (if (not (exact-nonnegative-integer? max)) + (error "Max must be non-negative integer (or null):" max)) + (if (not (<= min max)) + (error "Min not less than max:" min max)))) + (cond + ((not max) (r:seq (r:subexp expr) + (string-append "\\{" (number->string min) ",\\}"))) + ((= max min) (r:seq (r:subexp expr) + (string-append "\\{" (number->string min) "\\}"))) + (else (r:seq (r:subexp expr) + (string-append "\\{" + (number->string min) + "," + (number->string max) + "\\}"))))) + +(define (r:alt . exprs) + (if (pair? exprs) + (r:subexp (apply r:seq + (cons (car exprs) + (append-map (lambda (expr) + (list "\\|" expr)) + (cdr exprs))))) + (r:seq))) + +(r:alt "first thing" "second thing") + +;---------------------------------------------------------------------------- +; 1.4 + +(define (r:backref n) + (string-append "\\" (number->string n))) + +;---------------------------------------------------------------------------- +; 1.5 + +; the following are only the expressions which had to be rewritten... + +(define (r-grep:seq . exprs) + (let ((choose-grep (lambda (x) + (cond + ((string? x) x) + ((list? x) (car x)))))) + (apply string-append (map choose-grep exprs)))) + +(r-grep:seq "a" "b" '("grep" "egrep")) +; abgrep + +(define (r-egrep:seq . exprs) + (let ((choose-egrep (lambda (x) + (cond + ((string? x) x) + ((list? x) (cadr x)))))) + (apply string-append (map choose-egrep exprs)))) + +#| test: +(r-grep:seq "a" "b" '("grep" "egrep")) +; "abgrep" +(r-egrep:seq "a" "b" '("grep" "egrep")) +; "abegrep" +|# + +(define (r:seq . exprs) + (cons (lambda () (apply r-grep:seq exprs)) + (cons (lambda () (apply r-egrep:seq exprs)) '()))) + +#| test: +(r:seq "a" "b" '("grep" "egrep")) +;Value: (#[compound-procedure 14] #[compound-procedure 15]) +((car (r:seq "a" "b" '("grep" "egrep")))) +; "abgrep" +((cadr (r:seq "a" "b" '("grep" "egrep")))) +; "abegrep" +|# + +(define (r-egrep:alt . exprs) + (let ((choose-egrep (lambda (x) + (cond + ((string? x) x) + ((list? x) ((cadr x))))))) + (if (pair? exprs) + (apply r-egrep:seq + (cons (choose-egrep (car exprs)) + (append-map (lambda (expr) + (list "|" (choose-egrep expr))) + (cdr exprs)))) + (r-egrep:seq)))) + +(define (r-grep:alt . exprs) + (let ((choose-grep (lambda (x) + (cond + ((string? x) x) + ((list? x) ((car x))))))) + (if (pair? exprs) + (apply r-grep:seq + (cons (choose-grep (car exprs)) + (append-map (lambda (expr) + (list "\\|" (choose-grep expr))) + (cdr exprs)))) + (r-grep:seq)))) + +(define (r:alt . exprs) + (cons (lambda () (apply r-grep:alt exprs)) + (cons (lambda () (apply r-egrep:alt exprs)) '()))) + +#| test: +(r-egrep:alt "a" (r:dot) "c") +;Value: "a|.|c" +(r:alt "asdf" (r:dot)) +;Value: (#[compound-procedure 16] #[compound-procedure 17]) +((car (r:alt "asdf" (r:dot)))) +;Value: "asdf\\|." +((cadr (r:alt "asdf" (r:dot)))) +;Value: "asdf|." +|# + +(define (r-grep:repeat min max expr) + (let ((choose-grep (lambda (x) + (cond + ((string? x) x) + ((list? x) ((car x))))))) + (if (not (exact-nonnegative-integer? min)) + (error "Min must be non-negative integer:" min)) + (if max + (begin + (if (not (exact-nonnegative-integer? max)) + (error "Max must be non-negative integer (or null):" max)) + (if (not (<= min max)) + (error "Min not less than max:" min max)))) + (cond + ((not max) (r-grep:seq (r-grep:subexp (choose-grep expr)) + (string-append "\\{" (number->string min) ",\\}"))) + ((= max min) (r-grep:seq (r-grep:subexp (choose-grep expr)) + (string-append "\\{" (number->string min) "\\}"))) + (else (r-grep:seq (r-grep:subexp (choose-grep expr)) + (string-append "\\{" + (number->string min) + "," + (number->string max) + "\\}")))))) + + +(define (r-egrep:repeat min max expr) + (let ((choose-egrep (lambda (x) + (cond + ((string? x) x) + ((list? x) ((cadr x))))))) + (if (not (exact-nonnegative-integer? min)) + (error "Min must be non-negative integer:" min)) + (if max + (begin + (if (not (exact-nonnegative-integer? max)) + (error "Max must be non-negative integer (or null):" max)) + (if (not (<= min max)) + (error "Min not less than max:" min max)))) + (cond + ((not max) (r-egrep:seq (r-egrep:subexp (choose-egrep expr)) + (string-append "{" (number->string min) ",}"))) + ((= max min) (r-egrep:seq (r-egrep:subexp (choose-egrep expr)) + (string-append "{" (number->string min) "}"))) + (else (r-egrep:seq (r-egrep:subexp (choose-egrep expr)) + (string-append "{" + (number->string min) + "," + (number->string max) + "}")))))) + +(define (r:repeat min max expr) + (cons (lambda () (r-grep:repeat min max expr)) + (cons (lambda () (r-egrep:repeat min max expr)) '()))) + +(define (r-grep:subexp . exprs) + (string-append "\$" (apply string-append + (map (lambda (x) + (cond + ((string? x) x) + ((pair? x) ((car x))))) + exprs)) "\$")) + +(define (r-egrep:subexp . exprs) + (string-append "(" (apply string-append + (map (lambda (x) + (cond + ((string? x) x) + ((pair? x) ((cadr x))))) + exprs)) ")")) + +(define (r:subexp . exprs) + (cons (lambda () (apply r-grep:subexp exprs)) + (cons (lambda () (apply r-egrep:subexp exprs)) '()))) + +#| test: +(r:subexp "a" "b" (r:dot)) +;Value: (#[compound-procedure 25] #[compound-procedure 26]) +((cadr (r:subexp "a" "b" (r:dot)))) +;Value: "(ab.)" +((car (r:subexp "a" "b" (r:dot)))) +;Value: "\$ab.\$" +|# + +(define (r:grep expr filename) + (r:grep-like "grep" '() ((car expr)) filename)) + +(define (r:egrep expr filename) + (if (eq? microcode-id/operating-system 'nt) + (r:grep-like "grep" '("-E") ((cadr expr)) filename) + (r:grep-like "egrep" '() ((cadr expr)) filename))) + +(pp (r:grep (r:seq (r:repeat 2 3 (r:quote "cat")) (r:+ (r:quote "dog"))) + "tests.txt")) + + + + + + -- cgit v1.2.3